Why just UTF-16? Add UTF-8 support everywhere!

UTF-8 is the most common encoding on the web.

How can the ECMAScript® 2022 Language Specification (January 20, 2022) just ignore UTF-8?
https://tc39.es/ecma262/multipage/ecmascript-data-types-and-values.html
https://tc39.es/ecma262/multipage/text-processing.html

I don't know how old the text mandating UTF-16 is, but it's just awful and not acceptable.

This may lead to many minor changes almost everywhere that sequences or characters can appear.

Hi there,

Just like Java and the .NET CLR (and countless other still-modern environments), ES internally represents a String as a sequence of UTF-16 code units. You may be confusing that internal representation with UTF-8 encoding of text output or decoding of text input. But those are different concerns: a String is not the same as "text," despite the terms being used interchangeably at times.
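
To make the "sequence of UTF-16 code units" point concrete, here is a small sketch (the sample string is only an illustration) showing that length and charCodeAt operate on code units rather than on characters:

```javascript
// "😀" is U+1F600, which lies outside the Basic Multilingual Plane,
// so a JS String stores it as two UTF-16 code units (a surrogate pair).
const sample = "a😀";

// length counts code units, not characters.
const unitCount = sample.length;

// charCodeAt exposes the individual code units.
const units = [];
for (let i = 0; i < sample.length; i++) {
  units.push(sample.charCodeAt(i).toString(16));
}

console.log(unitCount, units.join(" ")); // 3 61 d83d de00
```

None of this says anything about how the text is encoded when it is read from or written to a file or the network.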

The ES spec itself is agnostic about text encodings used at runtime; those are governed by other standards.

For example, in the browser you would use the standard TextEncoder/TextDecoder, which strongly emphasizes UTF-8 and is also implemented in Node.js. When you use the Fetch API, the encoding is given by the charset parameter of the Content-Type header (which defaults to UTF-8). Ditto the JSON standard. Note that Encoding and Fetch are standardized JavaScript APIs, but they are separate from the language itself.
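
As a rough sketch of what that looks like in practice (the sample text is an assumption for illustration), a String can be turned into UTF-8 bytes and back with the global TextEncoder and TextDecoder:

```javascript
// TextEncoder always produces UTF-8; TextDecoder defaults to UTF-8.
const encoder = new TextEncoder();
const decoder = new TextDecoder();

const text = "h\u00E9llo"; // "héllo", with a precomposed é (U+00E9)
const bytes = encoder.encode(text); // Uint8Array of UTF-8 bytes

// U+00E9 takes two bytes in UTF-8, so 5 characters become 6 bytes.
console.log(bytes.length); // 6

const roundTripped = decoder.decode(bytes);
console.log(roundTripped === text); // true
```

The String itself never changes representation here; only the bytes produced at the boundary are UTF-8.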

Perhaps you could explain in more detail your proposal to have UTF-8 be used as the internal representation of Strings? It's true that very modern languages (Go, Rust) use UTF-8 internally, and there's a slow creep toward UTF-8 in brand new technologies, enabled of course by faster processors and cheaper storage. But establishing UTF-16 to be "awful and not acceptable" is a tall order, given its continued dominance.

1 Like

There is, unfortunately, absolutely zero chance of changing the way that JS strings encode their characters. That would be a massive breaking change across the entire ecosystem.

Modern string APIs often address text as codepoints, like String.prototype.codePointAt(), which is generally what you want.

(You rarely, if ever, actually want UTF-8; encoding details are something you rarely want to be aware of, as opposed to just getting codepoints and/or grapheme clusters. The problem with JS is that it exposes encoding details, and in particular the details of a really bad encoding (UCS-2-ish, not even UTF-16).)
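
For the grapheme-cluster side of that, one option is Intl.Segmenter. It is not mentioned elsewhere in this thread and its availability depends on the runtime, so treat this as a sketch under that assumption:

```javascript
// "e" followed by U+0301 (combining acute accent) renders as a single "é".
const word = "e\u0301";

const codeUnits = word.length;        // UTF-16 code units
const codePoints = [...word].length;  // code points (string iteration)

// Assumes the runtime ships Intl.Segmenter (part of ECMA-402).
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const graphemes = [...segmenter.segment(word)].length; // user-perceived characters

console.log(codeUnits, codePoints, graphemes); // 2 2 1
```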

1 Like

I think we need UTF-8 support badly. UTF-8 can support all of Unicode. UTF-8 is just old ASCII in some sense. Many good programs use UTF-8 by default. ECMAScript and JavaScript are almost completely going against the flow.

I don't use Windows much ATM but some say they still had ANSI in 2010.

@d1gital_love did you have a particular use case?

While ECMAScript itself may not reference UTF-8, many JS platforms do have APIs that support other encodings.

https://nodejs.org/api/string_decoder.html

2 Likes

I don't think that many know about TextEncoder and TextDecoder, but many know about the String type.

Again:

String.prototype.codePointAt (pos)
Returns a non-negative integral Number less than or equal to 0x10FFFF𝔽 that is the numeric value of the UTF-16 encoded code point

https://tc39.es/ecma262/multipage/text-processing.html#sec-string.prototype.codepointat

Alright, I think @tabatkins is agreeing with you that this is a real issue, that JavaScript's default encoding behavior isn't the greatest, but he also shared an important point - it's not like we can just change JavaScript strings from UTF-16 to UTF-8 without breaking old websites.

So, do you have a solution to propose of how we should go about adding support for UTF-8 without breaking older JavaScript? If so, please share, and we can have a discussion around it.

2 Likes

Correct.

Again, UTF-8 is not "Unicode"; it's an encoding of Unicode, a way of turning Unicode characters into bits (and back). JS already supports all of Unicode. The default string indexing ("foo"[0]) is busted, because it indexes the string according to UCS-2 code units rather than characters. That's unfortunate and bad, but it's impossible to change. JS has many newer ways of interacting with strings that do work on characters: [..."foo"] is character-based, "foo".codePointAt(0) is character-based, String.fromCodePoint(0xfffd) is character-based. Regexes have also recently gained ways of interacting with strings properly as Unicode characters (and are continuing to evolve in that direction).
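
A quick side-by-side of those APIs, as a sketch with an assumed sample string that contains one character outside the Basic Multilingual Plane:

```javascript
const s = "😀!"; // U+1F600 followed by "!"

const byIndex = s[0];                        // lone high surrogate, not a character
const byIteration = [...s];                  // iterates by code point
const first = s.codePointAt(0);              // full code point value
const rebuilt = String.fromCodePoint(first); // back to a string

// Without the u flag, "." matches one code unit; with it, one code point.
const dotLegacy = s.match(/^./)[0].length;
const dotUnicode = s.match(/^./u)[0].length;

console.log(byIndex === "\ud83d");  // true
console.log(byIteration.length);    // 2
console.log(first.toString(16));    // 1f600
console.log(rebuilt === "😀");      // true
console.log(dotLegacy, dotUnicode); // 1 2
```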

So everything you need is already present or upcoming. We're stuck with the bad parts forever.

4 Likes

If we take String.prototype.codePointAt(offset) as an example, we could add an outputEncoding argument whose default value is the old (UTF-16) encoding, to keep it backward-compatible.

https://tc39.es/ecma262/multipage/ecmascript-language-functions-and-classes.html#sec-function-definitions

Again, that would not do anything like what you want. A string is not an encoding. The TextEncoder API, which outputs a TypedArray of encoded binary data, can output a string encoded as UTF-8.
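
If the goal is "the UTF-8 bytes for the character at this offset", that can be sketched today without any new argument. utf8BytesAt below is a hypothetical helper name made up for this example, not an existing or proposed API:

```javascript
// Hypothetical helper (not a built-in): returns the UTF-8 bytes of the
// code point that starts at UTF-16 index `pos` of `str`.
// Note: a lone surrogate is encoded as U+FFFD by TextEncoder.
function utf8BytesAt(str, pos) {
  const cp = str.codePointAt(pos);
  if (cp === undefined) return new Uint8Array(0); // pos is out of range
  return new TextEncoder().encode(String.fromCodePoint(cp));
}

const euro = utf8BytesAt("€", 0); // U+20AC
const hex = Array.from(euro, (b) => b.toString(16)).join(" ");
console.log(hex); // e2 82 ac
```

The result is binary data (a Uint8Array), which is the point being made above: the encoding lives in the bytes, not in the String.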

It looks like the MDN snippet you linked to was updated in the last few days to say "unicode" instead of "UTF-16" (git history says it happened 18 days ago, but I'm not sure how often MDN updates its content from the git repo, so perhaps you were viewing the older content?). From what I understand, codePointAt() returns a number that's assigned to a specific Unicode character (like a unique ID for that character), which is independent of however the engine may be encoding that character.

1 Like

To be fair, codePointAt is not entirely character-based, since the position argument is still based on the UCS-2 representation.
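
A small sketch of that caveat, using an assumed sample string: the position is a code-unit index, so pointing into the middle of a surrogate pair yields the lone low surrogate rather than a character.

```javascript
const str = "a😀b"; // code units: "a", high surrogate, low surrogate, "b"

const atPairStart = str.codePointAt(1); // whole code point U+1F600
const atPairMiddle = str.codePointAt(2); // just the low surrogate
const afterPair = str.codePointAt(3);    // "b"

console.log(atPairStart.toString(16));  // 1f600
console.log(atPairMiddle.toString(16)); // de00
console.log(afterPair.toString(16));    // 62
```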

Yes. codePointAt was a pointless example, but something like charAt() is not...

The String object's charAt() method returns a new string consisting of the single UTF-16 code unit located at the specified offset into the string.
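
For comparison, here is a sketch of charAt next to a code-point-indexed lookup. codePointCharAt is a hypothetical helper written for this example, and it walks the whole string, so it is O(n) per call:

```javascript
const greeting = "😀hi";

// charAt indexes UTF-16 code units, so it can return half of a surrogate pair.
const viaCharAt0 = greeting.charAt(0); // lone high surrogate
const viaCharAt1 = greeting.charAt(1); // lone low surrogate

// Hypothetical helper (not a built-in): returns the n-th code point of
// `str` as a string, or "" if n is out of range (mirroring charAt).
function codePointCharAt(str, n) {
  return Array.from(str)[n] ?? "";
}

console.log(viaCharAt0 === "\ud83d");                 // true
console.log(viaCharAt1 === "\ude00");                 // true
console.log(codePointCharAt(greeting, 0) === "😀");   // true
console.log(codePointCharAt(greeting, 1));            // h
```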

If you want models of working with characters:

1 Like