Code point iterators for strings?

claudiameadows · January 24, 2020, 8:10am

When processing strings in performance-sensitive code, you inevitably end up writing something like one of these two:

// For state machine-based loops and similar
for (let i = 0; i < string.length; i++) {
    let code = string.charCodeAt(i)
    // ...
}

// For recursive-descent parsers
function next() {
    return index === string.length
        ? -1
        : string.charCodeAt(index++)
}

If you need full Unicode support, that invariably gets slightly more complicated:

// For state machine-based loops and similar
for (let i = 0; i < string.length;) {
    let code = string.codePointAt(i++)
    if (code >= 0x10000) i++
    // ...
}

// For recursive-descent parsers
function next() {
    if (index === string.length) return -1
    let code = string.codePointAt(index++)
    if (code >= 0x10000) index++
    return code
}

Problem is, most engines don't store their strings as simple byte sequences, so `String.prototype.charCodeAt` and `String.prototype.codePointAt` are *not* constant time. So could two new iterator methods be added to `String.prototype`?

String.prototype.codePoints() - Iterate all code points in this string
String.prototype.charCodes() - Iterate all character codes in this string

Each of these two would be relatively straightforward to define in JS:

String.prototype.charCodes = function *() {
    let s = "" + this
    for (let i = 0; i < s.length; i++) {
        yield s.charCodeAt(i)
    }
}

String.prototype.codePoints = function *() {
    let s = "" + this
    for (let i = 0; i < s.length; i++) {
        let code = s.codePointAt(i)
        if (code >= 0x10000) i++
        yield code
    }
}

An implementation might choose to optimize these into a fully linear traversal, though, much as engines already optimize the default string iterator.

Note that the code point iterator is likely to see greater use as most parsers use just that, but smaller use cases might not care about surrogates, and so they could skip the overhead.
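To make the difference between the two iterators concrete, here is a minimal sketch using standalone generator functions. The names `charCodes` and `codePoints` are hypothetical helpers written for illustration, not existing built-ins, and the sample string is an assumption chosen to include one code point outside the Basic Multilingual Plane.

```javascript
// Hypothetical standalone helpers; neither is a built-in String method.
function* charCodes(s) {
    for (let i = 0; i < s.length; i++) {
        yield s.charCodeAt(i)
    }
}

function* codePoints(s) {
    for (let i = 0; i < s.length; i++) {
        const code = s.codePointAt(i)
        // Code points at or above 0x10000 occupy two UTF-16 code units,
        // so skip the trailing surrogate.
        if (code >= 0x10000) i++
        yield code
    }
}

const sample = "a\u{1F600}" // "a" followed by U+1F600

const units = [...charCodes(sample)]   // [0x61, 0xD83D, 0xDE00]
const points = [...codePoints(sample)] // [0x61, 0x1F600]
```

Code that only looks for ASCII delimiters can work on `units` directly, since surrogate code units never collide with ASCII values; code that needs to classify whole characters would want `points`.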

claudepache · January 24, 2020, 8:38am

Note that ECMAScript already has String.prototype[Symbol.iterator](), which iterates over code points (although I think it was a blunder~~: it should have been named String.prototype.codePoints()~~).

EDIT: I realise after having written my comment that String.prototype[Symbol.iterator]() does not yield the information in the format you want, namely as an integer rather than as a string of one or two code units. (But I still think it was a blunder.)
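As a rough illustration of that format difference, the sketch below uses only existing built-ins: the default string iterator yields one short string per code point, and `codePointAt(0)` converts each of those back to an integer. The variable names and sample string are assumptions for the example.

```javascript
const text = "a\u{1F600}" // "a" followed by U+1F600

// The built-in string iterator yields one string per code point.
const pieces = [...text]

// Each piece holds exactly one code point, so codePointAt(0) recovers the integer.
// Array.from uses the string's iterator, so this also walks by code point.
const codes = Array.from(text, ch => ch.codePointAt(0)) // [0x61, 0x1F600]
```

This works today, but it goes through an intermediate string for every code point, which is the kind of overhead a dedicated integer-yielding iterator could avoid.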

jridgewell · January 24, 2020, 8:39am

claudiameadows · January 24, 2020, 5:27pm

I had not seen that before, so very interesting.

Is there any corresponding proposal for UCS-2 character codes?
