Inverted regex charsets in Unicode mode

I expect inverted charsets to match full supplementary code points in Unicode mode, but the following is inconsistent across JS engines:

console.log(/^a[^a]$/u.exec("a🌍"));
console.log(/^a(?:🌍|[^🌍a])$/u.exec("a🌍"));

I expect the first example above to match, but it fails in Node (V8) and Firefox (SpiderMonkey). It works in Bun (JavaScriptCore). The second example works in all three. It seems that including an explicit supplementary code point in the charset flips V8 and SpiderMonkey into matching full code points overall.
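For context (my illustration, not from the original post, using 🌍 U+1F30D as a stand-in supplementary character): the discrepancy comes down to code units versus code points. A supplementary character occupies two UTF-16 code units, and a non-Unicode-aware `[^a]` consumes only one of them:

```javascript
// "🌍" (U+1F30D) is one code point but two UTF-16 code units.
const s = "a🌍";
console.log(s.length);       // 3: "a" plus a surrogate pair
console.log([...s].length);  // 2: string iteration yields whole code points

// Per the spec reading below, with the "u" flag [^a] should consume
// the whole surrogate pair, so this should print true
// (affected engines print false):
console.log(/^a[^a]$/u.test(s));
```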

I presume the relevant ES spec section is somewhere around here, but it's hard for me to interpret whether the V8 and SpiderMonkey behavior is compliant or not.

Perhaps this is worth a clarification in the spec and an addition to test262.


Also relevant, I think, is ¶ 22.2.5.2.2 RegExpBuiltinExec:

  1. If flags contains "u", let fullUnicode be true; else let fullUnicode be false.
  2. ⋮
  3. If fullUnicode is true, let input be StringToCodePoints(S). Otherwise, let input be a List whose elements are the code units that are the elements of S.

AFAICT, fullUnicode-ness should not depend on whether a CharSet contains a supplementary code point, only on whether the "u" flag is present.
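As a sketch (my illustration, not spec text, with 🌍 standing in for any supplementary character), the two shapes of input from step 3 can be compared directly:

```javascript
const S = "a🌍";

// fullUnicode = true: input is the List of code points of S
const codePoints = [...S];       // ["a", "🌍"]

// fullUnicode = false: input is the List of code units of S
const codeUnits = S.split("");   // ["a", "\ud83c", "\udf0d"]

console.log(codePoints.length);  // 2
console.log(codeUnits.length);   // 3
```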

This seems consistent with ¶ 22.2.2.1 Notation:

  • A CharSet is a mathematical set of characters. In the context of a Unicode pattern, “all characters” means the CharSet containing all code point values; otherwise “all characters” means the CharSet containing all code unit values.
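A quick way to observe the two meanings of “all characters” (my example, not from the thread) is `.`, whose CharSet is all characters minus line terminators:

```javascript
// Without "u", "." matches one code unit, so a supplementary character
// (two code units) cannot satisfy ^.$ on its own:
console.log(/^.$/.test("🌍"));   // false
// With "u", "." matches one code point:
console.log(/^.$/u.test("🌍"));  // true
```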

Yeah, I'm pretty sure that's a bug in Chrome and Firefox.


I'll file a bug against V8 then.
Should we also submit something to test262?
Tom's examples suggest a simple repro that could be adapted:

$ node -e 'console.log(Boolean(/^a[^a]$/u.exec("a🌍")) === Boolean(/^a(?:🌍|[^🌍a])$/u.exec("a🌍")))'
false

Filed as V8 issue 13097 (Monorail).