What happens if I write a parser that treats all non-ascii characters as being valid in identifiers?

You could argue that that's not a spec-compliant parser, and of course you'd be right, but my counterargument for why this isn't a bad idea is: I would bet you anything that there would never be any breakage on account of it.

Why?

Well, because for something to break, TC39 would have to add new syntax to JS which uses non-ASCII characters. It seems from the evidence of history that some force has kept any such thing from happening. Is it actually a hard and fast rule that JS syntax comes from the ASCII range? ...and if it is, why such weird rules around what can be an identifier that make it so difficult to write lightweight parsers?


Also, involving the Unicode DB in the definition of the language's identifiers means that we open ourselves up to a whole bunch of bugs with serious security implications. At the root of the security risk is the possibility that documents could have different meanings when interpreted against different versions of the Unicode database. Safe when you run threat detection, unsafe when you deploy to prod, perhaps.

If the parser just says "JS reserves the ASCII range for its own syntax and everything else is an identifier" then you have a lightweight parser and no security vulnerabilities. No matter what version of the Unicode DB is loaded it'll give the same parse tree.
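A minimal sketch of that rule, assuming a hand-rolled lexer (the helper names here are hypothetical, not from any spec or existing parser):

```javascript
// Sketch of the "ASCII is reserved for syntax, everything else is an
// identifier" rule. `isIdentChar` and `readIdentifier` are hypothetical
// helpers for illustration.
function isIdentChar(cp) {
  if (cp > 0x7f) return true;            // every non-ASCII code point: identifier
  return (cp >= 0x61 && cp <= 0x7a) ||   // a-z
         (cp >= 0x41 && cp <= 0x5a) ||   // A-Z
         (cp >= 0x30 && cp <= 0x39) ||   // 0-9
         cp === 0x5f || cp === 0x24;     // _ and $
}

function readIdentifier(source, start) {
  let i = start;
  while (i < source.length && isIdentChar(source.codePointAt(i))) {
    // advance by the full code point (surrogate pairs are 2 units long)
    i += String.fromCodePoint(source.codePointAt(i)).length;
  }
  return source.slice(start, i);
}
```

No table lookups, no Unicode data file, and the answer never changes when the Unicode database does.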

You do now have a potential for attacks that use lookalike Unicode characters to trick you into thinking you've seen one thing when really the meaning is different, but I'd rather have that problem, because we can get around it by building tools that provide context and show the reader the code formatted the way the reader wants to see it, rather than the way the attacker wants the reader to see it.

let 🚢 = 'it';
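One way such a tool could surface lookalikes is simply to flag any identifier containing non-ASCII and print its code points. This is a rough sketch with a deliberately blunt policy of my own; real tools would use Unicode confusable data instead:

```javascript
// Hypothetical sketch: flag identifiers containing any non-ASCII
// character and list their code points so a reviewer can spot
// lookalikes. The "flag all non-ASCII" policy is an assumption here.
function auditIdentifier(name) {
  const suspicious = [...name].some(ch => ch.codePointAt(0) > 0x7f);
  const codePoints = [...name].map(
    ch => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
  return { name, suspicious, codePoints };
}
```

Cyrillic "а" (U+0430) and Latin "a" (U+0061) render identically in most fonts, but the code point listing makes the difference visible.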

The non-ASCII characters U+FEFF, U+2028, U+2029, and “any code point in general category Space_Separator” already have special handling due to being white space/line terminators, so it wouldn’t be possible to create a valid superset of JS without including at least some Unicode data (namely the data for the Space_Separator category).
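The line-terminator point is not academic: U+2028 can change a parse through automatic semicolon insertion, while a Space_Separator character like U+00A0 is ordinary whitespace. A quick illustration (using `new Function` only to make the parse visible):

```javascript
// U+2028 (LINE SEPARATOR) is a LineTerminator, so automatic semicolon
// insertion turns `return<LS> 1` into `return; 1` — the function
// returns undefined instead of 1.
const withLS = new Function("return\u2028 1");
console.log(withLS()); // undefined

// U+00A0 (NO-BREAK SPACE) is in Space_Separator, so it's plain
// whitespace and the function returns 1 as expected.
const withNBSP = new Function("return\u00A01");
console.log(withNBSP()); // 1
```

A parser that treated U+2028 as an identifier character would parse both bodies differently than a JS engine does.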

This is also already possible, because identifiers are defined in terms of the Unicode properties ID_Start and ID_Continue, which include non-ASCII word-like chars, including some ASCII lookalikes:

const яблоко = 1
console.log(яблоко) // logs `1`
const аpple = 2 // lookalike, "аpple" != "apple"
console.log(apple) // Uncaught ReferenceError: apple is not defined

What happens? You'll have a very slightly non-compliant parser that could, in theory, parse a script differently than a JS engine would. In practice it probably won't misparse anything.

What’s the benefit? Presumably just very slightly simpler code in your identifier parser. Is that worth an intentional violation of the standard?

The difference in parser complexity is only small if your parser is already built on top of the JS regex engine. Mine runs on a custom regex engine because I'm parsing streams of input, so I have to be able to wait on promises during the regex engine's matching process, which of course the builtin regex engine can't do.

So now I either have to ship my own Unicode DB or I have to put the JS regex engine inside my regex engine to apply its Unicode character class matching rules character by character. While putting one regex engine inside another is definitely possible, it's likely a setback to the portability, speed, and transparency of this system, all of which I value (though it's portability that I really care about).
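For what it's worth, the reason the builtin engine can't help here can be sketched in a few lines: a matcher that consumes characters from an async iterator can suspend on a promise between characters, whereas `RegExp` needs the whole string up front. Everything below is a hypothetical illustration, not the actual engine:

```javascript
// Hypothetical sketch: match an identifier character by character over
// an async stream, pausing on each `await` until input arrives. The
// ASCII-reserved rule from upthread stands in for the real character
// classes.
async function matchIdentifier(chars) {
  let out = "";
  for await (const ch of chars) {
    const cp = ch.codePointAt(0);
    const isIdent = cp > 0x7f || /[A-Za-z0-9_$]/.test(ch);
    if (!isIdent) break; // first non-identifier char ends the match
    out += ch;
  }
  return out;
}
```

Because the decision at each step is "is this code point above 0x7F, or in a small ASCII set?", no Unicode tables ever need to be consulted mid-stream.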