Inverted regex charsets in Unicode mode

I expect inverted charsets to match whole supplementary code points in Unicode mode, but the following behaves inconsistently across JS engines:

console.log(/^a[^a]$/u.exec("a🌍"));
console.log(/^a(?:🌍|[^🌍a])$/u.exec("a🌍"));

I expect the first example above to match, but it fails in Node (V8) and Firefox (SpiderMonkey). It works in Bun (JavaScriptCore). The second example works in all three. It seems that including an explicit supplementary code point in the charset flips V8 and SpiderMonkey into matching by code point overall.

I presume the relevant ECMAScript spec section is somewhere around here, but I find it hard to interpret, so I can't tell whether the V8 and SpiderMonkey behavior is compliant.

Perhaps this is worth a clarification in the spec and an addition to test262.

Also relevant, I think, is ¶ 22.2.5.2.2 RegExpBuiltinExec:

  1. If flags contains "u", let fullUnicode be true; else let fullUnicode be false.
  2. ⋮
  3. If fullUnicode is true, let input be StringToCodePoints(S). Otherwise, let input be a List whose elements are the code units that are the elements of S.

AFAICT, fullUnicode-ness should not depend on whether a CharSet contains a supplementary code point, only on whether the "u" flag is present.
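
To make the two input lists in step 3 concrete, here is a minimal sketch. `toInputList` is a hypothetical helper name, not something from the spec; it approximates StringToCodePoints with string iteration, and U+1F30D stands in for any supplementary code point.

```javascript
// Hypothetical helper approximating step 3 of RegExpBuiltinExec.
// fullUnicode = true  -> list of code points
// fullUnicode = false -> list of UTF-16 code units
function toInputList(S, fullUnicode) {
  return fullUnicode
    ? Array.from(S, (ch) => ch.codePointAt(0)) // string iteration walks code points
    : S.split("").map((unit) => unit.charCodeAt(0)); // split("") walks code units
}

const S = "a\u{1F30D}";
console.log(toInputList(S, true));  // two elements: 0x61, 0x1F30D
console.log(toInputList(S, false)); // three elements: 0x61, 0xD83C, 0xDF0D
```

Nothing in this choice looks at the pattern's contents, which is the point: the list is picked from the flag alone.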

This seems consistent with ¶ 22.2.2.1 Notation:

  • A CharSet is a mathematical set of characters. In the context of a Unicode pattern, “all characters” means the CharSet containing all code point values; otherwise “all characters” means the CharSet containing all code unit values.
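
As a toy model of that reading (not how any engine implements it), a negated class like [^a] can be seen as the complement of {a} relative to "all characters", where the universe depends only on the flag. `negatedClassMatches` and its parameters are hypothetical names for illustration.

```javascript
// Toy model of a negated class: complement relative to "all characters".
// In a Unicode pattern the universe is code points 0..0x10FFFF;
// otherwise it is code units 0..0xFFFF.
function negatedClassMatches(value, excluded, fullUnicode) {
  const max = fullUnicode ? 0x10ffff : 0xffff;
  const inUniverse = Number.isInteger(value) && value >= 0 && value <= max;
  return inUniverse && !excluded.has(value);
}

const excluded = new Set(["a".codePointAt(0)]);
console.log(negatedClassMatches(0x1f30d, excluded, true));  // true: a code point outside {a}
console.log(negatedClassMatches(0x1f30d, excluded, false)); // false: not a single code unit
console.log(negatedClassMatches(0x61, excluded, true));     // false: 'a' is excluded
```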

Yeah, I'm pretty sure that's a bug in Chrome and Firefox.

I'll file a bug against V8 then.
Should we also submit something to test262?
Tom's examples suggest a simple repro that could be adapted:

$ node -e 'console.log(/^a[^a]$/u.test("a🌍") === /^a(?:🌍|[^🌍a])$/u.test("a🌍"))'
false
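
For comparing engines before writing a real test262 test, something like the following could be used. It is a rough sketch and is not written against the test262 harness; `checkNegatedClass` is a hypothetical helper that returns booleans instead of asserting, and U+1F30D stands in for any supplementary code point.

```javascript
// Hypothetical helper: reports how the current engine treats each pattern.
function checkNegatedClass() {
  const input = "a\u{1F30D}";
  return {
    // Negated class alone; by the spec reading above this should be true.
    negatedOnly: /^a[^a]$/u.test(input),
    // Same idea with the supplementary code point spelled out in the pattern.
    withExplicit: /^a(?:\u{1F30D}|[^\u{1F30D}a])$/u.test(input),
    // Without "u" the input is three code units, so a single [^a] cannot reach the end.
    nonUnicode: /^a[^a]$/.test(input),
  };
}

console.log(checkNegatedClass());
```

An engine with the behavior described in this thread would report negatedOnly and withExplicit as different values; a real test262 test would assert on them rather than print them.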

Filed as V8 issue 13097 (Monorail).