In Python, the string data type consists of Unicode scalar values (code points, for short), while in ECMAScript the string data type consists of UTF-16 code units. I first saw a code-point-based string type in the ECMAScript 4 reference interpreter, and that is where this idea comes from.
The Python runtime encodes strings with a flexible representation: if no code point is ≥ U+0100, the string is stored with 1 byte per character; if some code point is ≥ U+0100 but none is ≥ U+10000, it is stored with 2 bytes per character; otherwise it is stored with 4 bytes per character.
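The width selection above can be sketched in a few lines. This is a hypothetical helper (the name `bytesPerChar` is mine, not part of any API), using JavaScript's code-point-aware string iteration:

```js
// Pick the narrowest per-character width for a string, mirroring
// the flexible representation described above.
function bytesPerChar(str) {
  let width = 1;
  for (const ch of str) {           // for-of iterates by code point
    const cp = ch.codePointAt(0);
    if (cp >= 0x10000) return 4;    // astral code point: 4 bytes needed
    if (cp >= 0x100) width = 2;     // beyond Latin-1: at least 2 bytes
  }
  return width;
}

bytesPerChar('abc');          // 1
bytesPerChar('abc\u0101');    // 2
bytesPerChar('a\u{10ffff}');  // 4
```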
Since ECMAScript's number data type can represent values up to 0x10FFFF, a Python-style string data type is possible by providing alternate versions of the string manipulation methods. By alternate versions I mean an opt-in directive, such as 'use code point', which would make manipulations code-point-based rather than code-unit-based. So the idea is not to add code-point-specific methods, but to let the existing methods work either on code units or on code points.
So, for example, it would work like this:

```js
// actual behavior
'\u{10ffff}'.charCodeAt(0); // 0xDBFF (high surrogate)
'\u{10ffff}'.charCodeAt(1); // 0xDFFF (low surrogate)

// desired behavior
'use code point';
'\u{10ffff}'.charCodeAt(0); // 0x10FFFF
'\u{10ffff}'.charCodeAt(1); // NaN
```
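The desired behavior can already be approximated in today's ECMAScript. Below is a hypothetical polyfill-style sketch (the name `codePointCharCodeAt` is mine); it relies on `Array.from` splitting a string by code point, not by code unit:

```js
// charCodeAt, but indexed and valued by code point rather than code unit.
function codePointCharCodeAt(str, index) {
  const points = Array.from(str);          // one array element per code point
  if (index < 0 || index >= points.length) return NaN;
  return points[index].codePointAt(0);
}

codePointCharCodeAt('\u{10ffff}', 0); // 0x10FFFF
codePointCharCodeAt('\u{10ffff}', 1); // NaN
```

Note this runs in O(n) per call, which is why engine support matters: a proper implementation should keep indexing O(1).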
Some compilers could add this 'use code point' directive automatically. It could even be set at the HTML <script> level.
Implementation
Implementing this feature in V8 requires a four-byte representation of the string type and support for the 0–0x10FFFF range of character codes.