Why just UTF-16? Add UTF-8 support everywhere!

d1gital_love · January 26, 2022, 5:12am

UTF-8 is the most common encoding on the web.

How can the ECMAScript® 2022 Language Specification (January 20, 2022) just ignore UTF-8?
https://tc39.es/ecma262/multipage/ecmascript-data-types-and-values.html
https://tc39.es/ecma262/multipage/text-processing.html

I don't know how old the text mandating UTF-16 is, but it's just awful and not acceptable.

Fixing this may require many minor changes almost everywhere that sequences or characters can appear.

swhiteman · January 26, 2022, 7:55am

Hi there,

Just like Java and the .NET CLR (and countless other still-modern environments), ES internally represents a String as a sequence of UTF-16 code units. You may be confusing that internal representation with UTF-8 encoding of text output or decoding of text input. But those are different concerns: a String is not the same as "text," despite the terms being used interchangeably at times.

The ES spec itself is agnostic about text encodings used at runtime; those are governed by other standards.

For example, in the browser, you would use the standard TextEncoder/TextDecoder, which strongly emphasizes UTF-8 and is also implemented in Node.js. When you use the Fetch API, the encoding is declared in the charset parameter of the Content-Type header (which defaults to UTF-8). Ditto the JSON standard. Note that Encoding and Fetch are standardized JavaScript APIs, but they are separate from the language itself.
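
As a rough illustration of that split between a String and its encoded bytes, here is a minimal sketch using TextEncoder/TextDecoder. The sample text and variable names are made up for the example; it assumes a runtime that exposes both classes as globals (modern browsers, recent Node.js).

```javascript
// The String itself is exposed as a sequence of UTF-16 code units.
const text = "h\u00E9llo \u{1F600}"; // "héllo 😀"

// Encoding: String -> UTF-8 bytes. TextEncoder only ever produces UTF-8.
const bytes = new TextEncoder().encode(text);

// Decoding: UTF-8 bytes -> String. "utf-8" is also the default label.
const roundTripped = new TextDecoder("utf-8").decode(bytes);

console.log(text.length);           // 8 UTF-16 code units
console.log(bytes.length);          // 11 UTF-8 bytes
console.log(roundTripped === text); // true
```

The two lengths differ because they count different things: code units of the in-memory String versus bytes of its UTF-8 encoding.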

Perhaps you could explain more about your proposal to have UTF-8 be used as the internal representation of Strings? It's true that very modern languages (Go, Rust) use UTF-8 internally and there's a slow creep (enabled of course by faster processors and cheaper storage) toward UTF-8 in brand new technologies. But establishing that UTF-16 is "awful and not acceptable" is a tall order, given its continued dominance.

tabatkins · January 26, 2022, 11:16pm

There is, unfortunately, absolutely zero chance of changing the way that JS strings encode their characters. That would be a massive breaking change across the entire ecosystem.

Modern string APIs often address text as codepoints, like String.prototype.codePointAt(), which is generally what you want.

(You rarely, if ever, actually want UTF-8; encoding is a detail you rarely want to be aware of, as opposed to just getting codepoints and/or grapheme clusters. The problem with JS is that it exposes encoding details, and in particular it exposes details of a really bad encoding (UCS-2-ish, not even UTF-16).)

d1gital_love · January 27, 2022, 6:13am

I think we need UTF-8 support badly. UTF-8 can support all of Unicode. UTF-8 is just old ASCII in some sense. Many good programs use UTF-8 by default. ECMAScript and JavaScript are almost completely going against the flow.

I don't use Windows much ATM but some say they still had ANSI in 2010.

aclaymore · January 27, 2022, 10:27am

@d1gital_love do you have a particular use case?

While ECMAScript itself may not reference UTF-8, many JS platforms do have APIs that support other encodings.

https://nodejs.org/api/string_decoder.html

d1gital_love · January 27, 2022, 10:48am

I don't think that many know about TextEncoder and TextDecoder, but many know about the String type.

Again:

String.prototype.codePointAt (pos)
Returns a non-negative integral Number less than or equal to 0x10FFFF𝔽 that is the numeric value of the UTF-16 encoded code point

https://tc39.es/ecma262/multipage/text-processing.html#sec-string.prototype.codepointat

theScottyJam · January 27, 2022, 2:47pm

Alright, I think @tabatkins is agreeing with you that this is a real issue, that JavaScript's default encoding behavior isn't the greatest, but he also shared an important point - it's not like we can just change JavaScript strings from UTF-16 to UTF-8 without breaking old websites.

So, do you have a solution to propose of how we should go about adding support for UTF-8 without breaking older JavaScript? If so, please share, and we can have a discussion around it.

tabatkins · January 27, 2022, 5:54pm

Correct.

Again, UTF-8 is not "Unicode", it's an encoding of Unicode; a way of turning unicode characters into bits (and back). JS already supports all of Unicode. The default string indexing ("foo"[0]) is busted, because it indexes the string according to UCS-2 code units, rather than characters. That's unfortunate and bad, but it's impossible to change. JS has many new ways of interacting with strings that do work on characters - [..."foo"] is character-based, "foo".codePointAt(0) is character-based, String.fromCodePoint(0xfffd) is character-based. Regexes also have recently gained ways of interacting with strings properly as Unicode characters (and are continuing to evolve in that direction).
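
A small sketch of the contrast described above, using a string with one character outside the Basic Multilingual Plane. The sample string and variable names are chosen for illustration only.

```javascript
const s = "a\u{1F600}b"; // "a😀b": 3 code points, 4 UTF-16 code units

// Code-unit based: length and indexing see the surrogate pair as two items.
const unitCount = s.length;   // 4
const loneSurrogate = s[1];   // "\uD83D", only half of the emoji

// Code-point based: string iteration (spread, for...of) walks code points.
const codePoints = [...s];    // ["a", "😀", "b"]

// codePointAt reads a whole code point; String.fromCodePoint builds one.
const emojiValue = s.codePointAt(1);               // 0x1F600
const rebuilt = String.fromCodePoint(emojiValue);  // "😀"

// With the u flag, "." matches a whole code point instead of one code unit.
const withoutU = /^.$/.test("\u{1F600}");  // false
const withU = /^.$/u.test("\u{1F600}");    // true
```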

So everything you need is already present or upcoming. We're stuck with the bad parts forever.

d1gital_love · January 27, 2022, 7:23pm

If we take String.prototype.codePointAt(offset) as an example, then we can add an outputEncoding argument whose default value is the old (UTF-16) encoding, which keeps it backward-compatible.

https://tc39.es/ecma262/multipage/ecmascript-language-functions-and-classes.html#sec-function-definitions

tabatkins · January 28, 2022, 11:51pm

Again, that would not do anything like what you want. A string is not an encoding. The TextEncoder API, which outputs a TypedArray of encoded binary data, can output a string encoded as UTF-8.

theScottyJam · January 29, 2022, 12:13am

It looks like that MDN snippet you linked to was recently updated to say "unicode" instead of "UTF-16" (git history says it happened 18 days ago, but I'm not sure how often MDN updates its content from the git repo, so perhaps you were viewing the older content?). From what I understand, codePointAt() returns a number that's assigned to a specific Unicode character (like a unique id for that character), which is independent of however the engine may be encoding that specific character.
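
To make that distinction concrete, here is a hedged sketch showing one character three ways: its code point, its UTF-16 code units, and its UTF-8 bytes. The variable names are illustrative, and it assumes TextEncoder is available as a global.

```javascript
const ch = "\u{1F600}"; // 😀

// The code point is an abstract number assigned by Unicode.
const codePoint = ch.codePointAt(0); // 0x1F600

// UTF-16 code units: what the String exposes through charCodeAt.
const utf16Units = [ch.charCodeAt(0), ch.charCodeAt(1)]; // [0xD83D, 0xDE00]

// UTF-8 bytes: what TextEncoder produces for the same character.
const utf8Bytes = Array.from(new TextEncoder().encode(ch)); // [0xF0, 0x9F, 0x98, 0x80]

console.log(codePoint.toString(16));                  // "1f600"
console.log(utf16Units.map((u) => u.toString(16)));   // ["d83d", "de00"]
console.log(utf8Bytes.map((b) => b.toString(16)));    // ["f0", "9f", "98", "80"]
```

The code point stays the same whichever encoding is used, which is why an encoding argument on codePointAt() would have nothing to change.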

mhofman · January 30, 2022, 4:16am

To be fair, codePointAt is not entirely character based since the position argument is still based on the UCS-2 representation.
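
A short sketch of that caveat. The helper codePointAtIndex is hypothetical (it is not a built-in); it only shows one way to index by code point instead of by code unit, at O(n) cost.

```javascript
const str = "\u{1F600}!"; // "😀!"

// The position argument counts UTF-16 code units, not code points.
const first = str.codePointAt(0);  // 0x1F600, the whole emoji
const middle = str.codePointAt(1); // 0xDE00, the lone low surrogate
const last = str.codePointAt(2);   // 0x21, "!"

// Hypothetical helper: return the n-th code point by iterating the string.
function codePointAtIndex(s, n) {
  let i = 0;
  for (const c of s) {       // for...of walks code points
    if (i === n) return c.codePointAt(0);
    i++;
  }
  return undefined;          // n is out of range
}

const second = codePointAtIndex(str, 1); // 0x21, "!"
```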

d1gital_love · January 30, 2022, 5:49am

Yes. codePointAt was a pointless example, but something like charAt() is not...

The String object's charAt() method returns a new string consisting of the single UTF-16 code unit located at the specified offset into the string.

claudiameadows · February 2, 2022, 3:00am

If you want models of working with characters:

UTF-8 is the typical interchange format, but is as bad as UTF-16 when it comes to string manipulation and processing.
UTF-32 solves the code point problem and is what Python (as of 3.3) uses internally for strings that contain emojis and the like, but requires a lot of memory to sustain (hence why Python tries to avoid it where it can) and doesn't account for multi-code point graphemes.
Swift addresses that very well by having strings be sequences of extended grapheme clusters with views for UTF-8, UTF-16, and UTF-32, but it comes at the cost of a mildly bloated API and a number of performance caveats (like most indexed access operations being O(n), including string slicing) that make it inefficient for structured data parsing.
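
For comparison, JavaScript can approximate those different views of one string. This is a sketch only: the sample string is made up, and it assumes a runtime that ships Intl.Segmenter (part of ECMA-402, not of the core language) and TextEncoder.

```javascript
// "e" + combining acute accent (one grapheme, two code points), then 😀.
const sample = "e\u0301\u{1F600}";

// UTF-8 view: bytes produced by TextEncoder.
const utf8Length = new TextEncoder().encode(sample).length;      // 7

// UTF-16 view: what String.prototype.length counts.
const utf16Length = sample.length;                               // 4

// Code point view (same count as UTF-32 units): string iteration.
const codePointCount = [...sample].length;                       // 3

// Extended grapheme clusters: user-perceived characters.
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const graphemeCount = [...segmenter.segment(sample)].length;     // 2
```

All four counts differ for this string, which is the core of why "length" and "index" are ambiguous for text.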

Rudxain · September 14, 2022, 10:30pm
