When processing strings in performance-sensitive code, you inevitably end up writing something like one of these two:
// For state machine-based loops and similar
for (let i = 0; i < string.length; i++) {
let code = string.charCodeAt(i)
// ...
}
// For recursive-descent parsers
function next() {
return index === string.length
? -1
: string.charCodeAt(index++)
}
If you need full Unicode support, that invariably gets slightly more complicated:
// For state machine-based loops and similar
for (let i = 0; i < string.length;) {
let code = string.codePointAt(i++)
if (code >= 0x10000) i++
// ...
}
// For recursive-descent parsers
function next() {
if (index === string.length) return -1
let code = string.codePointAt(index++)
if (code >= 0x10000) index++
return code
}
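To make the extra index bump concrete, here is a small sketch on an illustrative input. The sample string and the helper name `collectCodePoints` are assumptions made for this example only. Without the `>= 0x10000` check, the loop would also visit the trailing surrogate as if it were its own code point.

```javascript
// Illustrative input: "a", U+1F600 (outside the BMP), "b"
const sample = "a\u{1F600}b"

// Hypothetical helper: the same loop as above, collecting into an array
function collectCodePoints(string) {
  const result = []
  for (let i = 0; i < string.length;) {
    const code = string.codePointAt(i++)
    // Code points >= 0x10000 occupy two UTF-16 code units, so skip the second
    if (code >= 0x10000) i++
    result.push(code)
  }
  return result
}

console.log(sample.length)                       // 4 (code units, not code points)
console.log(sample.charCodeAt(1).toString(16))   // "d83d" (leading surrogate)
console.log(sample.codePointAt(1).toString(16))  // "1f600" (the full code point)
console.log(sample.codePointAt(2).toString(16))  // "de00" (lone trailing surrogate)
console.log(collectCodePoints(sample))           // [ 97, 128512, 98 ]
```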
Problem is, most engines don't store their strings as simple flat sequences, so `String.prototype.charCodeAt` and `String.prototype.codePointAt` are *not* necessarily constant time. So could two new iterator methods be added to `String.prototype`?
- `String.prototype.codePoints()` - Iterate all code points in this string
- `String.prototype.charCodes()` - Iterate all character codes in this string
Each of these two would be relatively straightforward to define in JS:
String.prototype.charCodes = function *() {
let s = "" + this
for (let i = 0; i < s.length; i++) {
yield s.charCodeAt(i)
}
}
String.prototype.codePoints = function *() {
let s = "" + this
for (let i = 0; i < s.length; i++) {
let code = s.codePointAt(i)
// Code points at or above 0x10000 take two code units, so skip the second
if (code >= 0x10000) i++
yield code
}
}
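For illustration, here is roughly how the two patterns from the top of this post might look with such iterators. Since the methods are only being proposed, this sketch uses standalone generator functions `charCodes` and `codePoints` as stand-ins, and `countDigits` and `makeReader` are hypothetical names invented for the example.

```javascript
// Hypothetical stand-ins for the proposed methods, written as plain functions
function* charCodes(string) {
  const s = "" + string
  for (let i = 0; i < s.length; i++) yield s.charCodeAt(i)
}

function* codePoints(string) {
  const s = "" + string
  for (let i = 0; i < s.length; i++) {
    const code = s.codePointAt(i)
    // Skip the trailing surrogate of a two-code-unit code point
    if (code >= 0x10000) i++
    yield code
  }
}

// State machine-style loop: no manual index bookkeeping
function countDigits(string) {
  let count = 0
  for (const code of charCodes(string)) {
    if (code >= 0x30 && code <= 0x39) count++
  }
  return count
}

// Recursive-descent style: wrap the iterator in a next() that returns -1 at the end
function makeReader(string) {
  const iter = codePoints(string)
  return function next() {
    const { done, value } = iter.next()
    return done ? -1 : value
  }
}

const next = makeReader("x\u{1F600}")
```

Successive calls to `next()` here would yield the code point for `x`, then `0x1F600`, then `-1` once the input is exhausted.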
An implementation could optimize these into a single linear traversal, though, and could optimize the iterators much as engines already do for the default string iterator.
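For comparison, the existing default string iterator (`String.prototype[Symbol.iterator]`) already walks a string one code point at a time, yielding each as a short string. A minimal sketch of deriving code points from it follows; the function name `codePointsViaDefaultIterator` is made up for this example.

```javascript
// Derive numeric code points from the default string iterator,
// which already advances one code point per step
function* codePointsViaDefaultIterator(string) {
  for (const ch of "" + string) {
    // Each ch is a one- or two-code-unit string holding a single code point
    yield ch.codePointAt(0)
  }
}

const viaDefault = [...codePointsViaDefaultIterator("a\u{1F600}b")]
console.log(viaDefault) // [ 97, 128512, 98 ]
```

This still allocates a small string per step, which is part of what a dedicated numeric iterator could avoid.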
Note that the code point iterator is likely to see more use, since most parsers work with code points, but smaller use cases that don't care about surrogates could use the character code iterator and skip that overhead.