I'm currently revisiting my compose-regexp library, and I've reached a stumbling block with the backreference validation logic introduced with the u flag.
The API is inspired by parser combinators. The combinators take either strings or regexps as input and return a regexp.
// works fine
sequence(capture("a"), maybe(/.\1/))
// throws
sequence(capture("a"), maybe(/.\1/u))
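For reference, the difference can be reproduced with the RegExp constructor alone, without compose-regexp. This is a minimal sketch and the variable names are only illustrative. Without the u flag, an orphaned \1 falls back to a legacy octal escape (U+0001), which is why the fragment can exist on its own before being composed after a capture.

```javascript
// Without the u flag, \1 with no capture group is accepted
// (Annex B treats it as the legacy octal escape for U+0001).
const legacy = new RegExp(".\\1");
const legacyMatchesOctal = legacy.test("x\u0001");

// With the u flag, the same source is rejected at construction time.
let unicodeError = null;
try {
  new RegExp(".\\1", "u");
} catch (e) {
  unicodeError = e;
}

console.log(legacyMatchesOctal, unicodeError && unicodeError.name);
```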
The only solution I can see in the current situation would be to create a global "unicode" mode that sets the atomicity rules, strip the u flag everywhere, and let users add it manually after the fact, but that is not the cleanest API there could be.
A possible proper solution would be to remove the backreference validation when using the RegExp constructor. Plain backreferences still wouldn't work, but maybe(/./u, ref(1)) could neatly do the trick.
Alternatively, you could just parse the regexp and replace the problematic escapes so that when you go to upgrade it, it doesn't become a problem.
There may be a caveat with surrogates (I can't remember if Unicode regexps are processed by character code or by Unicode code point), but the rest is just a matter of parsing out the ambiguities that /u otherwise throws on and/or diverges on.
With the u flag, astral plane characters (surrogate pairs) are processed atomically (so, as code points) with respect to suffix operators and character set ranges, which is why I need the info.
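A small sketch of what "atomically" means here, using U+1F4A9 as an arbitrary astral code point (the variable names are illustrative):

```javascript
// One astral code point, stored as two UTF-16 code units (a surrogate pair).
const pile = "\u{1F4A9}";

// Without u, the + quantifier only applies to the trailing low surrogate.
const withoutU = new RegExp("^" + pile + "+$");
// With u, the + quantifier applies to the whole code point.
const withU = new RegExp("^" + pile + "+$", "u");

const repeatedWithoutU = withoutU.test(pile + pile); // false
const repeatedWithU = withU.test(pile + pile);       // true

// Character class ranges are also resolved by code point under u.
const range = new RegExp("^[" + pile + "-\u{1F4AB}]$", "u");
const inRange = range.test("\u{1F4AA}");             // true

console.log(repeatedWithoutU, repeatedWithU, inRange);
```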
I don't mind the rejection of invalid escape sequences (nor the rest of the syntax validation, it's great); the only issue is the rejection of orphaned backreferences (which itself makes a lot of sense for RegExp literals).
Re. replacing, do you mean using a phony sequence (say, &1 for \1, escaping actual & as &&), and replacing them in the final regexp? It would mean that RegExps have to be finalized before they are usable, which is sub-optimal from an API standpoint.
By "replacing", I mean replacing stuff like \a and \z with a and z respectively, among other things. Also, \u{2} in non-/u regexps matches strings with the substring "uu", while in /u regexps, it's equivalent to \x02. So in each case, it's just about stripping redundancy.
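To make the divergence concrete, here is a rough sketch. The helper name stripRedundantEscapes and its letter list are assumptions made for illustration, and the list is deliberately conservative: it only covers lowercase letters that have no escape meaning in either mode, and it leaves cases such as \u{2}, \c, \x and \k alone.

```javascript
// Outside of /u, \u{2} is the letter "u" followed by a {2} quantifier.
const legacyMatchesUU = /\u{2}/.test("uu");
// With /u, \u{2} is the code point U+0002.
const unicodeMatchesU0002 = /\u{2}/u.test("\x02");
const unicodeMatchesUU = /\u{2}/u.test("uu");

// Letters whose escaped form is a plain identity escape without /u
// and a SyntaxError with /u (hypothetical, conservative list).
const REDUNDANT = new Set("aeghijlmoqyz");

// Walk the source one escape pair at a time, so that an escaped
// backslash followed by a letter is never misread as an escape.
function stripRedundantEscapes(source) {
  return source.replace(/\\([\s\S])/g, (match, ch) =>
    REDUNDANT.has(ch) ? ch : match
  );
}

// /\a\z/ is valid without u; its cleaned-up source can be upgraded.
const upgraded = new RegExp(stripRedundantEscapes(/\a\z/.source), "u");
console.log(upgraded.source, upgraded.test("az"));
```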
You'd be better off lazily generating the regexp in that case and building the backreferences internally - you can design the API to throw on execution if you have a backreference without a corresponding capture.
Yeah, probably. The current API has this neat property that it is "RegExp in, RegExp out", benefitting from the validation offered by the RegExp() constructor (so I know that there are no structural loose ends). The values returned by the combinators are also necessarily valid. I guess I could use thunks with a .toRegExp() method, even though I would lose immediacy, which is an important property for me when prototyping.
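As a rough illustration of the thunk idea, here is a minimal sketch. It is not compose-regexp's API: the names makeNode, lazySequence, lazyCapture, lazyMaybe and lazyRef are hypothetical. For brevity it does not count capture groups inside RegExp inputs and does not wrap alternations.

```javascript
// Hypothetical lazy node: a source string plus capture bookkeeping.
// Nothing is compiled until toRegExp() is called.
function makeNode(source, captures, maxRef) {
  return {
    source,
    captures, // number of capture groups in this fragment
    maxRef,   // highest backreference index used in this fragment
    toRegExp(flags = "") {
      if (maxRef > captures) {
        throw new SyntaxError(
          `Backreference \\${maxRef} has no corresponding capture`
        );
      }
      return new RegExp(source, flags);
    },
  };
}

function toNode(part) {
  if (typeof part === "string") {
    // Escape plain strings so they match literally in any context.
    const escaped = part
      .replace(/[|\\{}()[\]^$+*?.]/g, "\\$&")
      .replace(/-/g, "\\x2d");
    return makeNode(escaped, 0, 0);
  }
  // Simplification: groups inside RegExp inputs are not counted.
  if (part instanceof RegExp) return makeNode(part.source, 0, 0);
  return part; // already a lazy node
}

function lazySequence(...parts) {
  const nodes = parts.map(toNode);
  return makeNode(
    nodes.map((n) => n.source).join(""),
    nodes.reduce((sum, n) => sum + n.captures, 0),
    nodes.reduce((max, n) => Math.max(max, n.maxRef), 0)
  );
}

function lazyCapture(...parts) {
  const inner = lazySequence(...parts);
  return makeNode(`(${inner.source})`, inner.captures + 1, inner.maxRef);
}

function lazyMaybe(...parts) {
  const inner = lazySequence(...parts);
  return makeNode(`(?:${inner.source})?`, inner.captures, inner.maxRef);
}

// Wrapped in a group so that a following digit cannot extend the index.
const lazyRef = (n) => makeNode(`(?:\\${n})`, 0, n);

const composed = lazySequence(
  lazyCapture("a"),
  lazyMaybe(/./u, lazyRef(1))
).toRegExp("u");
console.log(composed.source);
```

The orphaned fragment lazyMaybe(/./u, lazyRef(1)) can be built and passed around freely; it only throws if .toRegExp() is called on it directly, which is the loss of immediacy mentioned above.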
Another drawback of the current strictness (which doesn't affect me in this case, but will bite people) is that packages like https://github.com/sindresorhus/escape-string-regexp are made obsolete by the u flag, since you can't escape ] or - out of charsets, and the other special characters within a char set...
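One way around the context sensitivity, sketched under the assumption that the escaped text must be usable both inside and outside a character class: escape the syntax characters with a backslash and write the hyphen as a hex escape. The function name escapeForAnyContext is made up for this example.

```javascript
// With the u flag, \- is only valid inside a character class.
let dashOutsideClassThrows = false;
try {
  new RegExp("\\-", "u");
} catch (e) {
  dashOutsideClassThrows = e instanceof SyntaxError;
}
const dashInsideClassWorks = new RegExp("[\\-]", "u").test("-");

// Hypothetical escaper: backslash for syntax characters, \x2d for "-",
// since a hex escape is accepted in both contexts under the u flag.
function escapeForAnyContext(text) {
  return text
    .replace(/[|\\{}()[\]^$+*?.]/g, "\\$&")
    .replace(/-/g, "\\x2d");
}

// Same escaped text used outside and inside a character class.
const outside = new RegExp("^" + escapeForAnyContext("a-b.c") + "$", "u");
const inside = new RegExp("^[" + escapeForAnyContext("a-c]") + "]$", "u");

console.log(outside.source, inside.source);
```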
Edit: pinging @mathiasbynens who was pretty active on the unicode RegExp front.
Edit again: I just ran into the Lone quantifier bracket issue now. The new rules don't make any sense to me. You're being incredibly uptight about not escaping things that don't need to be, then you introduce a new, arbitrarily mandatory escape sequence. ... OTOH /[-[-]/u still works fine.
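The behaviour described here can be checked with a small helper (compiles is a made-up name; it only reports whether the constructor accepts a source):

```javascript
// Returns true when the RegExp constructor accepts the source.
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (e) {
    return false;
  }
}

const results = {
  loneBraceLegacy: compiles("{", ""),          // accepted without u
  loneBraceUnicode: compiles("{", "u"),        // rejected with u
  loneCloseBracketUnicode: compiles("]", "u"), // rejected with u
  escapedBraceUnicode: compiles("\\{", "u"),   // accepted once escaped
  bracketInClassUnicode: compiles("[-[-]", "u"), // accepted inside a class
};

console.log(results);
```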
Edited, hopefully one last time: I know that they are intended as ergonomic improvements, but I'm afraid the new rules make handling escaping more complex in practice. The rules before were:
- you can escape whatever you like
- but you don't have to unless it is necessary for semantic reasons, with the exception of \).
Now it is:
- you must have perfect knowledge of the JS RegExp syntax to know what to escape
- you can only escape when strictly necessary
- except for unmatched parentheses, brackets and braces outside character classes
So now you need expert-level knowledge of the syntax of the JS dialect to start using them.
The new rules (with the exception of the alphabetic escape sequences which are necessary for forwards compatibility) make RegExps even less approachable for occasional (i.e. most) users.