Code point string

In Python, the string data type consists of Unicode scalar values (code points, for short), while in ECMAScript the string data type consists of UTF-16 code units. In the ECMAScript 4 reference interpreter I saw the string data type consisting of code points, which is where this idea came from.

The Python runtime encodes strings with a technique like this (CPython's flexible string representation, PEP 393): if no code point is ≥ U+100, the string is encoded using 1 byte per character; if some code point is ≥ U+100 but none is ≥ U+10000, it uses 2 bytes per character; otherwise it uses 4 bytes per character.
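
As a rough illustration of that selection logic (a minimal sketch in JavaScript, not how CPython actually implements it), the per-character width could be chosen like this:

// Sketch: pick the narrowest per-character width for a list of code points,
// mirroring the 1/2/4-byte tiers described above.
function charWidth(codePoints) {
  let max = 0;
  for (const cp of codePoints) {
    max = Math.max(max, cp);
  }
  if (max < 0x100) return 1;   // everything fits in one byte (Latin-1 range)
  if (max < 0x10000) return 2; // everything fits in the BMP, two bytes
  return 4;                    // astral code points need four bytes
}

charWidth([0x61, 0x62]);     // 1
charWidth([0x61, 0x10ffff]); // 4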

Since ECMAScript character codes are just numbers, and the number data type can represent values up to 0x10FFFF, it would be possible to have a Python-like string data type by providing alternate versions of the string manipulation methods. By alternate versions I mean there could be an option, such as a 'use code point' directive, which would cause manipulations to be code-point-based rather than code-unit-based.

So the idea is not to add code-point-specific methods, but to allow existing methods to work either for code units or for code points.

So, for example, it'd work like this:

// actual behavior
'\u{10ffff}'.charCodeAt(0); // U+DBFF
'\u{10ffff}'.charCodeAt(1); // U+DFFF

// desired behavior
'use code point';
'\u{10ffff}'.charCodeAt(0); // U+10FFFF
'\u{10ffff}'.charCodeAt(1); // NaN
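
Presumably other code-unit-based methods would switch to code-point indexing the same way under the directive; for instance (hypothetical behavior, extrapolating from the examples above):

'use code point';
'\u{10ffff}a'.charAt(1);   // 'a' instead of the lone low surrogate
'\u{10ffff}a'.slice(0, 1); // '\u{10ffff}' instead of the lone high surrogate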

Some compilers could automatically add this 'use code point' directive. It could even be set at the HTML <script> level.

Implementation

Implementing this feature in V8 would require a four-byte representation of the string type and support for character codes in the 0–0x10FFFF range.
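
Independent of how V8 would lay this out internally, a code-point-level, four-byte view of a string can be sketched in userland as a Uint32Array of code points:

// Sketch: a four-byte-per-character view of a string (a userland illustration,
// not V8's internal representation).
function toCodePointArray(str) {
  // String iteration yields whole code points, so surrogate pairs stay intact.
  return Uint32Array.from(str, (ch) => ch.codePointAt(0));
}

toCodePointArray('\u{10ffff}a'); // contains [0x10FFFF, 0x61]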

I'm curious about how this would work in JSON. You can't add a raw string at the beginning of a .json file and have it still be valid. If a value of "\u{10ffff}" were imported into JS or otherwise, would it be U+10FFFF or the current value of "\u{10ffff}"?

JSON doesn't do any string manipulation, so the directive doesn't apply to JSON, nor does it apply to string literals. When JSON is parsed, \u{10ffff} correctly turns into the high-surrogate/low-surrogate form, or otherwise into UTF-8, depending on the engine's internal encoding.

Ah, and JSON doesn't even support the \u{} escape form (only \uXXXX), but that doesn't determine the encoding of the resulting string.
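
For example, the supported \uXXXX escapes round-trip to the same code units a JS string literal would produce today:

JSON.parse('"\\udbff\\udfff"') === '\u{10ffff}'; // true
JSON.parse('"\\udbff\\udfff"').length;           // 2 (two code units)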

If a value of "\u{10ffff}" were imported into JS or otherwise, would it be U+10FFFF or the current value of "\u{10ffff}"?

The idea is that '\u{10ffff}' would be a single-character string containing U+10FFFF. But, well, the internal encoding depends on the JavaScript engine.


So I did look at V8; it seems like it supports both one-byte and two-byte encodings. It would only need an additional four-byte encoding for this feature to work.

JS strings already have .codePointAt which I believe works closer to the way you expect?

The problem with .codePointAt is that it takes an index based on code units... so the following would fail:

// index '1' means 'second' character
'\u{10ffff}a'.codePointAt(1); // instead of U+61 ("a"), gives U+DFFF

Array.from(str)[1] then.

Fine, and then String.fromCodePoint(...array) to get a string back. But this would be inefficient if the string is long, especially when parsing.
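
Concretely, the round trip looks something like this (a small sketch; String.fromCodePoint wants numbers, so the array needs to hold code points rather than one-character strings):

const points = Array.from(str, (ch) => ch.codePointAt(0)); // one entry per code point
points[1];                                                 // code-point-indexed access
const back = String.fromCodePoint(...points);              // rebuild the string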

I believe the iterator-helpers proposal would help with the performance issues. If you're dealing with larger strings, you can write this:

// Lazily skip the first `index` code points, then take the next one (0-based index)
const [char] = Iterator.from(str).drop(index);

But it is, unfortunately, getting a bit more complicated now. Then again, the solution to this complexity could be more iterator helpers.
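
For instance, a codePointAt-style lookup that indexes by code points could be wrapped up like this (a sketch with a hypothetical helper name, assuming iterator helpers are available):

// Hypothetical helper: the code point at a code-point-based index (0-based).
function codePointAtIndex(str, index) {
  for (const ch of Iterator.from(str).drop(index).take(1)) {
    return ch.codePointAt(0);
  }
  return undefined; // index out of range
}

codePointAtIndex('\u{10ffff}a', 1); // 0x61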
