Adding an escape hatch to circumvent the parsing restrictions introduced with the RegExp u flag

I'm currently revisiting my compose-regexp library, and I've reached a stumbling block with the backreference validation logic introduced with the u flag.

The API is inspired by parser combinators. The combinators take either strings or regexps as input and return a regexp.

When combining, most flags are stripped. The exception is the u flag, which I would like to make contagious because it changes the semantics of the underlying parser, and getting those semantics right is important in order to output clean regexps. For example, /💩+/ and /💩+/u don't mean the same thing: without the u flag, you need /(?:💩)+/ to match several astral plane characters.
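To make the difference concrete, here is a small check that can be pasted into a console. The variable names are only illustrative.

```javascript
// "💩" is U+1F4A9, stored as the surrogate pair \uD83D\uDCA9.
const twoPoos = "💩💩";

// Without u, `+` applies only to the trailing surrogate \uDCA9.
const legacy = /^💩+$/.test(twoPoos);                        // false
const legacyOddMatch = /^💩+$/.test("\uD83D\uDCA9\uDCA9");   // true

// With u, the whole code point is the quantified atom.
const unicode = /^💩+$/u.test(twoPoos);                      // true

// Without u, an explicit group restores the intended meaning.
const grouped = /^(?:💩)+$/.test(twoPoos);                   // true
```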

The trouble comes with backreferences, though.

// works fine
sequence(capture("a"), maybe(/.\1/))

// throws
sequence(capture("a"), maybe(/.\1/u))
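The failure does not depend on the library. It can be reproduced with the RegExp constructor alone, as in this minimal sketch (tryCompile is a hypothetical helper, not part of compose-regexp):

```javascript
// Hypothetical helper: return the compiled RegExp, or the error that was thrown.
function tryCompile(source, flags) {
  try {
    return new RegExp(source, flags);
  } catch (e) {
    return e;
  }
}

// Without u, an orphaned \1 is accepted (it is read as a legacy octal escape).
const legacyRef = tryCompile(".\\1", "");

// With u, a backreference to a group that does not exist is a SyntaxError.
const unicodeRef = tryCompile(".\\1", "u");

// Once the capture is part of the same source, the u flag is happy.
const complete = tryCompile("(a)(?:.\\1)?", "u");
```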

The only solution I can see in the current situation would be to create a global "unicode" mode that sets the atomicity rules, strip the u flag everywhere, and let users add it manually after the fact, but that is not the cleanest API there could be.

A possible proper solution would be to remove the back reference validation when using the RegExp constructor. Plain backreferences still wouldn't work, but maybe(/./u, ref(1)) could neatly do the trick.

Alternatively, you could just parse the regexp and replace the problematic escapes so that when you go to upgrade it, it doesn't become a problem.

There may be a caveat with surrogates (I can't remember if Unicode regexps are processed by character code or by Unicode code point), but the rest is just a matter of parsing out the ambiguities that /u otherwise throws on and/or diverges with.

With the u flag, astral plane characters (surrogate pairs) are processed atomically (so, as code points) with respect to suffix operators and character set ranges, which is why I need the info.
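A short illustration of that atomicity, for character class ranges and for the dot (variable names are only illustrative):

```javascript
// 💪 is U+1F4AA, which sits between 💩 (U+1F4A9) and 💫 (U+1F4AB).
const rangeWithU = /^[💩-💫]$/u.test("💪"); // true

// Without u, the class is read as [\uD83D \uDCA9-\uD83D \uDCAB]; the range
// \uDCA9-\uD83D is out of order, so the pattern does not even compile.
let rangeWithoutU;
try {
  rangeWithoutU = new RegExp("[💩-💫]");
} catch (e) {
  rangeWithoutU = e;
}

// The dot consumes a whole code point with u, and a single code unit without.
const dotWithU = /^.$/u.test("💩");   // true
const dotWithoutU = /^.$/.test("💩"); // false
```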

I don't mind the rejection of invalid escape sequences (nor the rest of the syntax validation, it's great), the only issue is orphaned back references (which itself makes a lot of sense for RegExp literals).

Re. replacing, do you mean using a phony sequence (say, &1 for \1, escaping actual & as &&), and replacing them in the final regexp? It would mean that RegExps have to be finalized before they are usable, which is sub-optimal from an API standpoint.

By "replacing", I mean replacing stuff like \a and \z with a and z respectively, among other things. Also, \u{2} in non-/u regexps matches strings with the substring "uu", while in /u regexps, it's equivalent to \x02. So in each case, it's just about stripping redundancy.
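For reference, the divergences mentioned here can be observed directly. This is only a demonstration of the behaviour, not of any rewriting logic:

```javascript
// Without u, \u is an identity escape for "u" and {2} is a quantifier.
const braceLegacy = /^\u{2}$/.test("uu");       // true

// With u, \u{2} is the code point U+0002.
const braceUnicode = /^\u{2}$/u.test("\x02");   // true
const braceUnicodeUU = /^\u{2}$/u.test("uu");   // false

// Without u, \a and \z are just "a" and "z".
const identityLegacy = /^\a\z$/.test("az");     // true

// With u, the same escape is rejected at compile time.
let identityUnicode;
try {
  identityUnicode = new RegExp("\\a", "u");
} catch (e) {
  identityUnicode = e;
}
```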

This article explains more.

IMHO you should just reject such /u flag mismatches instead. Keeps you out of the mess and keeps it more predictable.

Good point regarding \u{2} :-)

Strictness could be an option with a global u mode that also turns strings into Unicode regexps.

Even if I'm strict regarding flag compatibility, that still leaves me with the problem of orphaned backreferences: flags('u', ref(1)) will throw.

You'd be better off lazily generating the regexp in that case and building the backreferences internally - you can design the API to throw on execution if you have a backreference without a corresponding capture.

Yeah probably. The current API has this neat property that it is "RegExp in, RegExp out", benefitting from the validation offered by the RegExp() constructor (so I know that there are no structural loose ends). The values returned by the combinators are also necessarily valid, so I guess I could use thunks with a .toRegExp() method, even though I lose in immediacy, which is an important property for me when prototyping.
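A very rough sketch of what the lazy variant could look like. Everything here is hypothetical (the node shape, sourceOf, and the simplified capture, sequence, maybe and ref are stand-ins, not the actual compose-regexp implementation), and it ignores details such as a backreference followed by a literal digit:

```javascript
// Strings are escaped; RegExps and lazy nodes contribute their source as is.
const sourceOf = (x) =>
  typeof x === "string" ? x.replace(/[\\^$.*+?()[\]{}|]/g, "\\$&") : x.source;

// A lazy node only stores source text; validation is deferred to toRegExp().
const node = (source) => ({
  source,
  toRegExp(flags = "u") {
    return new RegExp(this.source, flags);
  },
});

const ref = (n) => node("\\" + n);
const capture = (...parts) => node("(" + parts.map(sourceOf).join("") + ")");
const sequence = (...parts) => node(parts.map(sourceOf).join(""));
const maybe = (...parts) => node("(?:" + parts.map(sourceOf).join("") + ")?");

// The orphaned reference is fine while it is being combined...
const whole = sequence(capture("a"), maybe(/./u, ref(1))).toRegExp();

// ...and only throws if it is finalized without its capture.
let orphanThrew = false;
try {
  maybe(/./u, ref(1)).toRegExp();
} catch (e) {
  orphanThrew = e instanceof SyntaxError;
}
```

The price is the one described above: the intermediate values are no longer RegExps, so nothing is validated until toRegExp() is called.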

Another drawback of the current strictness (which doesn't affect me in this case, but will bite people) is that packages like are made obsolete by the u flag, since you can't escape ] or - out of charsets, and the other special characters within a char set...

Edit: pinging @mathiasbynens who was pretty active on the unicode RegExp front.

Edit again: I just ran into the Lone quantifier bracket issue now. The new rules don't make any sense to me. You're being incredibly uptight about not escaping things that don't need to be, then you introduce a new, arbitrarily mandatory escape sequence. ... OTOH /[-[-]/u still works fine.

Edited, hopefully one last time: I know that they are intended as ergonomic improvements, but I'm afraid the new rules make handling escaping more complex in practice. The rules before were

  • you can escape whatever you like
  • but you don't have to unless it is necessary for semantic reasons, with the exception of \).

Now they are

  • you must have perfect knowledge of the JS RegExp syntax to know what to escape
  • you can only escape when strictly necessary
  • except for unmatched parentheses, brackets and braces outside character classes

So now you need expert level knowledge of the syntax of the JS dialect to start using them.
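A few of the cases behind that list, checked with the RegExp constructor. The compiles helper is hypothetical, and the set of cases is only a sample:

```javascript
// Hypothetical helper: does this source compile with these flags?
function compiles(source, flags = "") {
  try {
    new RegExp(source, flags);
    return true;
  } catch {
    return false;
  }
}

const results = {
  // Lone closing brace or bracket: tolerated without u, rejected with u.
  loneBraceLegacy: compiles("}"),                 // true
  loneBraceUnicode: compiles("}", "u"),           // false
  loneBracketLegacy: compiles("]"),               // true
  loneBracketUnicode: compiles("]", "u"),         // false

  // Escaping a syntax character is still allowed with u...
  escapedBracketUnicode: compiles("\\]", "u"),    // true

  // ...but "-" may only be escaped inside a character class.
  dashOutsideClassLegacy: compiles("\\-"),        // true
  dashOutsideClassUnicode: compiles("\\-", "u"),  // false
  dashInsideClassUnicode: compiles("[\\-]", "u"), // true
};
```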

The new rules (with the exception of the alphabetic escape sequences which are necessary for forwards compatibility) make RegExps even less approachable for occasional (i.e. most) users.