Code point iterators for strings?

When processing strings in performance-sensitive code, you inevitably end up writing something like one of these two:

// For state machine-based loops and similar
for (let i = 0; i < string.length; i++) {
    let code = string.charCodeAt(i)
    // ...
}

// For recursive-descent parsers
function next() {
    return index === string.length
        ? -1
        : string.charCodeAt(index++)
}

If you need full Unicode support, that invariably gets slightly more complicated:

// For state machine-based loops and similar
for (let i = 0; i < string.length;) {
    let code = string.codePointAt(i++)
    if (code >= 0x10000) i++
    // ...
}

// For recursive-descent parsers
function next() {
    if (index === string.length) return -1
    let code = string.codePointAt(index++)
    if (code >= 0x10000) index++
    return code
}
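
To make the second pattern concrete, here is a minimal self-contained sketch. The names `makeReader` and `countCodePoints` are hypothetical helpers invented for illustration; only `codePointAt` is a real built-in.

```javascript
// Hypothetical helper: wraps the `next()` pattern above in a closure
// so that `string` and `index` are not free variables.
function makeReader(string) {
    let index = 0
    return function next() {
        if (index === string.length) return -1
        const code = string.codePointAt(index++)
        // A supplementary code point spans two UTF-16 code units,
        // so step over the trailing surrogate as well.
        if (code >= 0x10000) index++
        return code
    }
}

// Hypothetical consumer: pulls code points until end of input.
function countCodePoints(string) {
    const next = makeReader(string)
    let count = 0
    while (next() !== -1) count++
    return count
}

const readerSample = "a\u{1F600}b"
const readerTotal = countCodePoints(readerSample) // 3, while readerSample.length is 4
```

The `-1` sentinel works because no valid code point is negative.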

Problem is, most engines do not store their strings as simple byte sequences, so `String.prototype.charCodeAt` and `String.prototype.codePointAt` are *not* guaranteed to be constant time. So could two new iterator methods be added to `String.prototype`?

  • String.prototype.codePoints() - Iterate all code points in this string
  • String.prototype.charCodes() - Iterate all character codes in this string

Each of these two would be relatively straightforward to define in JS:

String.prototype.charCodes = function *() {
    let s = "" + this
    for (let i = 0; i < s.length; i++) {
        yield s.charCodeAt(i)
    }
}

String.prototype.codePoints = function *() {
    let s = "" + this
    for (let i = 0; i < s.length; i++) {
        let code = s.codePointAt(i)
        if (code >= 0x10000) i++
        yield code
    }
}
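
As a rough sketch of the expected behavior, here are standalone equivalents that leave `String.prototype` untouched. The names `charCodesOf` and `codePointsOf` are hypothetical, chosen only for this example, and the surrogate check is written as `>= 0x10000` so that U+10000 itself is handled.

```javascript
// Hypothetical standalone version of the proposed charCodes().
function* charCodesOf(value) {
    const s = "" + value
    for (let i = 0; i < s.length; i++) {
        yield s.charCodeAt(i)
    }
}

// Hypothetical standalone version of the proposed codePoints().
function* codePointsOf(value) {
    const s = "" + value
    for (let i = 0; i < s.length; i++) {
        const code = s.codePointAt(i)
        // Supplementary code points occupy two code units; skip the second.
        if (code >= 0x10000) i++
        yield code
    }
}

const iterSample = "a\u{1F600}"
const iterUnits = [...charCodesOf(iterSample)]   // [0x61, 0xD83D, 0xDE00]
const iterPoints = [...codePointsOf(iterSample)] // [0x61, 0x1F600]
```

The first yields one integer per UTF-16 code unit, the second one integer per code point.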

An implementation could optimize these into a fully linear traversal, though, and could optimize the iterators themselves much as engines already do for the default string iterator.

Note that the code point iterator is likely to see greater use as most parsers use just that, but smaller use cases might not care about surrogates, and so they could skip the overhead.

Note that ECMAScript already has String.prototype[Symbol.iterator](), which iterates over code points (although I think that was a blunder: it should have been named String.prototype.codePoints()).

EDIT: I realise after having written my comment that String.prototype[Symbol.iterator]() does not yield the information in the format you want, namely as an integer rather than as a string of one or two UTF-16 code units. (But I still think it was a blunder.)
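
For illustration, a minimal sketch of that difference. The sample string is arbitrary; everything else is existing built-in behavior.

```javascript
// The default string iterator walks code points, but yields strings.
const defaultSample = "a\u{1F600}b" // 'a', U+1F600 (a surrogate pair), 'b'

const pieces = [...defaultSample]
// Each piece is a string of one or two UTF-16 code units.
const pieceLengths = pieces.map(p => p.length)

// Getting integers back requires an extra call per piece.
const pieceCodes = pieces.map(p => p.codePointAt(0))
```

Here `pieceLengths` is `[1, 2, 1]` and `pieceCodes` is `[0x61, 0x1F600, 0x62]`, so the integers are recoverable, but only by allocating a string per code point first.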

I had not seen that before; very interesting.

Is there any corresponding proposal for UCS-2 character codes?