Adding an escape hatch to circumvent the parsing restrictions introduced with the RegExp u flag

pygy · January 24, 2020, 6:58pm

I'm currently revisiting my compose-regexp library, and I've reached a stumbling block with the backreference validation logic introduced with the u flag.

The API is inspired by parser combinators. The combinators take either strings or regexps as input and return a regexp.

When combining, most flags are stripped, except the u flag, which I would like to make contagious because it changes the semantics of the underlying parser, and getting them right is important in order to output clean regexps (e.g. /💩+/ and /💩+/u don't mean the same thing, you need /(?:💩)+/ to match several astral plane characters without the u flag).
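To make the difference concrete, here is a minimal sketch using only the built-in RegExp constructor. The variable names are illustrative and are not part of compose-regexp.

```javascript
// U+1F4A9 is stored as two UTF-16 code units: \uD83D \uDCA9.
const poo = "\u{1F4A9}";
const twice = poo + poo;

// Without u, "+" binds to the trailing surrogate \uDCA9 only.
const plain = new RegExp("^" + poo + "+$");
// With u, "+" binds to the whole code point.
const unicode = new RegExp("^" + poo + "+$", "u");
// Without u, an explicit group restores the intended meaning.
const grouped = new RegExp("^(?:" + poo + ")+$");

console.log(plain.test(twice));          // false
console.log(unicode.test(twice));        // true
console.log(grouped.test(twice));        // true
console.log(plain.test(poo + "\uDCA9")); // true: lead surrogate, then two trail surrogates
```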

The trouble comes with back references though.

// works fine
sequence(capture("a"), maybe(/.\1/))

// throws
sequence(capture("a"), maybe(/.\1/u))
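The same asymmetry can be reproduced without the library. The sketch below uses a hypothetical helper, compiles, that only reports whether the RegExp constructor accepts a given source.

```javascript
// Hypothetical helper: true when the pattern source compiles with the given flags.
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

console.log(compiles(".\\1", ""));     // true: \1 falls back to a legacy octal escape
console.log(compiles(".\\1", "u"));    // false: the pattern has no group 1
console.log(compiles("(a).\\1", "u")); // true: the reference has a target

// Without u, the orphaned \1 matches the character U+0001.
console.log(new RegExp("^.\\1$").test("x\u0001")); // true
```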

The only solution I can see in the current situation would be to create a global "unicode" mode that sets the atomicity rules, strip the u flag everywhere, and let users add it manually after the fact, but that is not the cleanest API there could be.

A possible proper solution would be to remove the back reference validation when using the RegExp constructor. Plain backreferences still wouldn't work, but maybe(/./u, ref(1)) could neatly do the trick.

claudiameadows · January 24, 2020, 7:29pm

Alternatively, you could just parse the regexp and replace the problematic escapes so that when you go to upgrade it, it doesn't become a problem.

There may be a caveat with surrogates (I can't remember if Unicode regexps are processed by character code or by Unicode code point), but the rest is simply just parsing out the ambiguities that /u otherwise throws on and/or diverges with.

pygy · January 24, 2020, 8:25pm

With the u flag, astral plane characters (surrogate pairs) are processed atomically (so, as code points) with respect to suffix operators and character set ranges, which is why I need the info.

I don't mind the rejection of invalid escape sequences (nor the rest of the syntax validation, it's great), the only issue is orphaned back references (which itself makes a lot of sense for RegExp literals).

Re. replacing, do you mean using a phony sequence (say, &1 for \1, escaping actual & as &&), and replacing them in the final regexp? It would mean that RegExps have to be finalized before they are usable, which is sub-optimal from an API standpoint.
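For reference, the placeholder idea in that question could look roughly like the sketch below. The finalize function and the &-based encoding are hypothetical, taken only from the wording of the question, and the sketch ignores edge cases such as a reference immediately followed by a digit.

```javascript
// Hypothetical scheme: "&1" stands for "\1", and a literal "&" is written "&&".
function finalize(placeholderSource, flags) {
  const source = placeholderSource.replace(/&(&|[1-9]\d*)/g, (match, body) =>
    body === "&" ? "&" : "\\" + body
  );
  return new RegExp(source, flags);
}

const re = finalize("^(a).&1$", "u");
console.log(re.source);      // ^(a).\1$
console.log(re.test("axa")); // true
console.log(finalize("^a&&b$", "u").test("a&b")); // true
```

The drawback raised in the question is visible here: nothing is a usable RegExp until finalize has run.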

claudiameadows · January 25, 2020, 8:56am

By "replacing", I mean replacing stuff like \a and \z with a and z respectively, among other things. Also, \u{2} in non-/u regexps matches strings with the substring "uu", while in /u regexps, it's equivalent to \x02. So in each case, it's just about stripping redundancy.
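A short sketch of those two divergences, using only the built-in constructor (variable names are illustrative):

```javascript
// Same source text, two different meanings.
const legacy = new RegExp("^\\u{2}$");       // "u" repeated exactly twice
const unicode = new RegExp("^\\u{2}$", "u"); // the code point U+0002

console.log(legacy.test("uu"));      // true
console.log(legacy.test("\u0002"));  // false
console.log(unicode.test("\u0002")); // true
console.log(unicode.test("uu"));     // false

// \a is a redundant identity escape without u, and a syntax error with u.
console.log(new RegExp("^\\a$").test("a")); // true
let rejected = false;
try {
  new RegExp("^\\a$", "u");
} catch (err) {
  rejected = err instanceof SyntaxError;
}
console.log(rejected); // true
```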

This article explains more.

IMHO you should just reject such /u flag mismatches instead. Keeps you out of the mess and keeps it more predictable.

pygy · January 25, 2020, 8:58pm

Good point regarding \u{2} :-)

Strictness could be an option, with a global u mode that also turns strings into Unicode regexps.

Even if I'm strict regarding flag compatibility, that still leaves me with the problem of orphaned back references: flags('u', ref(1)) will throw.

claudiameadows · January 26, 2020, 3:23am

You'd be better off lazily generating the regexp in that case and building the backreferences internally - you can design the API to throw on execution if you have a backreference without a corresponding capture.
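A minimal sketch of that lazy approach follows. Every name in it (raw, ref, capture, maybe, sequence, toRegExp) is hypothetical and only mirrors the combinators mentioned in this thread; it is not how compose-regexp is implemented.

```javascript
// Each combinator returns a plain description instead of a RegExp.
// groups: number of captures inside the part; maxRef: highest backreference used.
const raw = (source) => ({ source, groups: 0, maxRef: 0 });
// The non-capturing wrapper keeps "\1" from fusing with a following digit.
const ref = (n) => ({ source: "(?:\\" + n + ")", groups: 0, maxRef: n });
const capture = (part) => ({
  source: "(" + part.source + ")",
  groups: part.groups + 1,
  maxRef: part.maxRef,
});
const maybe = (part) => ({ ...part, source: "(?:" + part.source + ")?" });
const sequence = (...parts) => ({
  source: parts.map((p) => p.source).join(""),
  groups: parts.reduce((sum, p) => sum + p.groups, 0),
  maxRef: Math.max(0, ...parts.map((p) => p.maxRef)),
});

// Validation is deferred to the moment a real RegExp is requested.
function toRegExp(part, flags = "u") {
  if (part.maxRef > part.groups) {
    throw new SyntaxError("Backreference \\" + part.maxRef + " has no matching capture");
  }
  return new RegExp(part.source, flags);
}

const ok = toRegExp(sequence(capture(raw("a")), maybe(sequence(raw("."), ref(1)))));
console.log(ok.source);      // (a)(?:.(?:\1))?
console.log(ok.test("axa")); // true
```

Calling toRegExp(maybe(ref(1))) throws, because the description records a reference to group 1 but contains no capture.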

pygy · February 9, 2020, 9:46pm

Yeah probably. The current API has this neat property that it is "RegExp in, RegExp out", benefitting from the validation offered by the RegExp() constructor (so I know that there are no structural loose ends). The values returned by the combinators are also necessarily valid, so I guess I could use thunks with a .toRegExp() method, even though I lose in immediacy, which is an important property for me when prototyping.

Another drawback of the current strictness (which doesn't affect me in this case, but will bite people) is that packages like https://github.com/sindresorhus/escape-string-regexp are made obsolete by the u flag, since you can't escape ] or - out of charsets, and the other special characters within a char set...
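The hyphen case is easy to reproduce with the built-in constructor. The sketch below does not call the linked package; it only shows how an escaped hyphen behaves in different positions, and the \x2d alternative is one possible workaround rather than a recommendation from this thread.

```javascript
const escapedHyphen = "\\-"; // the two characters \ and -

// Accepted without u, outside a character class.
console.log(new RegExp(escapedHyphen).test("-")); // true

// Rejected with u, outside a character class.
let hyphenRejected = false;
try {
  new RegExp(escapedHyphen, "u");
} catch (err) {
  hyphenRejected = err instanceof SyntaxError;
}
console.log(hyphenRejected); // true

// Accepted with u inside a character class, or as a hex escape anywhere.
console.log(new RegExp("[" + escapedHyphen + "]", "u").test("-")); // true
console.log(new RegExp("\\x2d", "u").test("-"));                   // true
```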

Edit: pinging @mathiasbynens who was pretty active on the unicode RegExp front.

Edit again: I just ran into the Lone quantifier bracket issue now. The new rules don't make any sense to me. You're being incredibly uptight about not escaping things that don't need to be, then you introduce a new, arbitrarily mandatory escape sequence. ... OTOH /[-[-]/u still works fine.
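A sketch of the lone bracket behaviour, again with a hypothetical accepts helper around the built-in constructor:

```javascript
// Hypothetical helper: true when the constructor accepts the source.
function accepts(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

// Unmatched {, } and ] are literals without u and syntax errors with u.
for (const lone of ["{", "}", "]"]) {
  console.log(lone, accepts(lone, ""), accepts(lone, "u")); // true false
}

// Escaping them satisfies the u-mode parser.
console.log(accepts("\\{", "u")); // true

// Inside a character class, an unescaped [ is still accepted with u.
console.log(new RegExp("[-[-]", "u").test("[")); // true
```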

Edited, hopefully one last time: I know that they are intended as ergonomic improvements, but I'm afraid the new rules make handling escaping more complex in practice. The rules before were

you can escape whatever you like
but you don't have to unless it is necessary for semantics reasons, with the exception of \).

Now it is

you must have perfect knowledge of the JS RegExp syntax to know what to escape
you can only escape when strictly necessary
except for unmatched parentheses, brackets and braces outside character classes

So now you need expert level knowledge of the syntax of the JS dialect to start using them.

The new rules (with the exception of the alphabetic escape sequences which are necessary for forwards compatibility) make RegExps even less approachable for occasional (i.e. most) users.
