Code point string

In Python, the string data type consists of Unicode scalar values (code points, for short), while in ECMAScript the string data type consists of UTF-16 code units. In the ECMAScript 4 reference interpreter I saw the string data type consisting of code points, which is where this idea came from.

The Python runtime encodes strings with a technique like this (CPython's flexible string representation, PEP 393): if no code point is ≥ U+100, the string is encoded using 1 byte per character; if some code point is ≥ U+100 but none is ≥ U+10000, it uses 2 bytes per character; otherwise it uses 4 bytes per character.
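
As a rough illustration of that selection logic (a minimal sketch in JavaScript, not how CPython actually implements it), the per-character width could be chosen like this:

// Sketch: pick the narrowest per-character width for a list of code points,
// mirroring the 1/2/4-byte tiers described above.
function charWidth(codePoints) {
  let max = 0;
  for (const cp of codePoints) {
    max = Math.max(max, cp);
  }
  if (max < 0x100) return 1;   // everything fits in one byte (Latin-1 range)
  if (max < 0x10000) return 2; // everything fits in the BMP, two bytes
  return 4;                    // astral code points need four bytes
}

charWidth([0x61, 0x62]);     // 1
charWidth([0x61, 0x10ffff]); // 4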

Since ECMAScript character codes are just numbers, and the number data type can represent values up to 0x10FFFF, it would be possible to have a Python-like string data type by providing alternate versions of the string manipulation methods. By alternate versions I mean there could be an option, such as a 'use code point' directive, which would cause manipulations to be code-point-based rather than code-unit-based.

So the idea is not to add code-point-specific methods, but to allow existing methods to work either for code units or for code points.

So, for example, it'd work like this:

// actual behavior
'\u{10ffff}'.charCodeAt(0); // U+DBFF
'\u{10ffff}'.charCodeAt(1); // U+DFFF

// desired behavior
'use code point';
'\u{10ffff}'.charCodeAt(0); // U+10FFFF
'\u{10ffff}'.charCodeAt(1); // NaN
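
Presumably other code-unit-based methods would switch to code-point indexing the same way under the directive; for instance (hypothetical behavior, extrapolating from the examples above):

'use code point';
'\u{10ffff}a'.charAt(1);   // 'a' instead of the lone low surrogate
'\u{10ffff}a'.slice(0, 1); // '\u{10ffff}' instead of the lone high surrogate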

Some compilers could automatically add this 'use code point' directive. It could even be set at the HTML <script> level.

Implementation

Implementing this feature in V8 would require a four-byte representation of the string type and support for character codes in the 0–0x10FFFF range.
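
Independent of how V8 would lay this out internally, a code-point-level, four-byte view of a string can be sketched in userland as a Uint32Array of code points:

// Sketch: a four-byte-per-character view of a string (a userland illustration,
// not V8's internal representation).
function toCodePointArray(str) {
  // String iteration yields whole code points, so surrogate pairs stay intact.
  return Uint32Array.from(str, (ch) => ch.codePointAt(0));
}

toCodePointArray('\u{10ffff}a'); // contains [0x10FFFF, 0x61]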

I'm curious about how this would work in JSON. You can't add a raw string at the beginning of a .json file and have it still be valid. If a value of "\u{10ffff}" were imported into JS or otherwise, would it be U+10FFFF or the current value of "\u{10ffff}"?

JSON doesn't do any string manipulation, so the directive doesn't apply to JSON, nor does it apply to string literals. When JSON is parsed, \u{10ffff} correctly turns into the high-surrogate/low-surrogate form, or otherwise into UTF-8, depending on the engine's internal encoding.

Ah, and JSON doesn't even support the \u{} escape form (only \uXXXX), but that doesn't determine the encoding of the resulting string.
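
For example, the supported \uXXXX escapes round-trip to the same code units a JS string literal would produce today:

JSON.parse('"\\udbff\\udfff"') === '\u{10ffff}'; // true
JSON.parse('"\\udbff\\udfff"').length;           // 2 (two code units)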

If a value of "\u{10ffff}" were imported into JS or otherwise, would it be U+10FFFF or the current value of "\u{10ffff}"?

The idea is that '\u{10ffff}' would be a single-character string containing U+10FFFF. But, well, the internal encoding depends on the JavaScript engine.


So I did look at V8; it seems like it supports both one-byte and two-byte encodings. It would only need an additional four-byte encoding for this feature to work.

JS strings already have .codePointAt which I believe works closer to the way you expect?

The problem with .codePointAt is that it takes an index based on code units... so the following would fail:

// index '1' means 'second' character
'\u{10ffff}a'.codePointAt(1); // instead of U+61 ("a"), gives U+DFFF

Array.from(str)[1] then.

Fine, and then String.fromCodePoint(...array) to get a string back. But this would be inefficient if the string is long, especially when parsing.
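
Concretely, the round trip looks something like this (a small sketch; String.fromCodePoint wants numbers, so the array needs to hold code points rather than one-character strings):

const points = Array.from(str, (ch) => ch.codePointAt(0)); // one entry per code point
points[1];                                                 // code-point-indexed access
const back = String.fromCodePoint(...points);              // rebuild the string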

I believe the iterator-helpers proposal would help with the performance issues. If you're dealing with larger strings, you can write this:

// Lazily skip the first `index` code points, then take the next one (0-based index)
const [char] = Iterator.from(str).drop(index);

But it is, unfortunately, getting a bit more complicated now. Then again, the solution to this complexity could be more iterator helpers.
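
For instance, a codePointAt-style lookup that indexes by code points could be wrapped up like this (a sketch with a hypothetical helper name, assuming iterator helpers are available):

// Hypothetical helper: the code point at a code-point-based index (0-based).
function codePointAtIndex(str, index) {
  for (const ch of Iterator.from(str).drop(index).take(1)) {
    return ch.codePointAt(0);
  }
  return undefined; // index out of range
}

codePointAtIndex('\u{10ffff}a', 1); // 0x61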
