Besides actual implementation bugs (1), I think that backtracking in RegExp is a footgun, and that having a way to match text possessively would make it easier to write predictable parsers.
It would be nice to have either possessive quantifiers (/a*/ => /a*+/, /a?/ => /a?+/, etc...) or a new p flag that would trigger the possessive mode for all operators.
Both options are backwards compatible AFAICT, since they cause errors in current JS.
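For what it's worth, a quick sanity check (assuming current V8/SpiderMonkey behavior) confirms that the possessive syntax is rejected today, so claiming it is backwards compatible:

```javascript
// A quantifier immediately following another quantifier is a SyntaxError
// in current JS engines ("Nothing to repeat" in V8), so /a*+/ is free
// to be given possessive semantics.
let threw = false;
try {
  new RegExp('a*+');
} catch (e) {
  threw = e instanceof SyntaxError;
}
console.log(threw); // true
```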
Not sure how the polyfill story would play out...
Currently /^(?:ab|[^c])*$/ in irregexp (Chromium/Firefox) goes exponential when fed 'ab'.repeat(n) + 'c', even though it has enough info to know that it won't succeed once it reaches the 'c'.
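To make the ambiguity concrete, here's a small sketch (kept to a tiny n so it returns quickly):

```javascript
// Each 'ab' can be consumed either by the "ab" branch or by two passes
// through [^c], so a backtracking engine may explore on the order of 2^n
// ways to partition the input before concluding the trailing 'c' dooms
// them all.
const re = /^(?:ab|[^c])*$/;
const doomed = n => 'ab'.repeat(n) + 'c';
console.log(re.test('ababab'));  // true, matches immediately
console.log(re.test(doomed(5))); // false, but only after heavy backtracking
```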
But I would never expect a common JS developer to be able to come up with a solution like that, and I expect many would be confused reading it. So I think having easier control over the regex engine's backtracking would be a major improvement!
Good to know you can emulate it, but not without polluting the capture array, and the back-reference means you're limited to the first nine captures.
RegExps are confusing to read anyway; they weren't designed for their current use case. Kleene devised the notation to describe grammars mathematically, where sub-patterns were abstracted into one-letter variables, and Thompson adopted it for write-only commands typed into physical, screen-less terminals, where it was optimal for quickly unloading ideas into the keyboard. They graduated to their current status through worse-is-better...
The trick remains obscure though, even with a cleaner syntax (the one I favor to write complex regexps).
const possessive = op => (...args) =>
  sequence(
    lookAhead(capture(suffix(op, ...args))),
    ref(1)
  )
// once defined it is quite usable though...
const zeroPlusP = possessive('*')
const r = sequence(/^/, zeroPlusP(/ab|[^c]/), /$/)
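For reference, here's what the trick compiles down to as a raw regex. It works because lookaheads are atomic in JS: once the lookahead has succeeded and \1 has consumed the captured text, the engine cannot backtrack into the group.

```javascript
// Atomic-group emulation: the lookahead greedily captures the repetition,
// then \1 consumes exactly that text, with no backtracking possible.
const r = /^(?=((?:ab|[^c])*))\1$/;
console.log(r.test('ababab'));              // true
console.log(r.test('ab'.repeat(20) + 'c')); // false, and fails fast
```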
You can use named backreferences; but also, I suspect (without verifying right now) that you can have up to 99 backreferences, just like in replacements?
Yeah, that's true, I'm still thinking in ES5 on that front (AFAIK transpilers don't handle regexps created with the RegExp constructor, which my lib uses extensively under the hood, and it doesn't ship a runtime like XRegExp does).
TIL for the backreferences between 10 and 99. That's good to know (and a bug in my lib).
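A quick check bears out the two-digit backreferences:

```javascript
// With ten capture groups defined, \10 is parsed as a backreference to
// group 10; with fewer groups it would instead fall back to an octal
// escape in non-unicode mode (Annex B).
const re = new RegExp('(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)\\10');
console.log(re.test('abcdefghijj')); // true — \10 re-matches the 'j'
console.log(re.test('abcdefghija')); // false
```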
Even if they are polyfilled with named back-references, the captured strings are generated, and also end up in the array at the spot they would have occupied as an anonymous capture.
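This is observable with native named groups already: a named capture still occupies its positional slot in the match array, so a polyfill built on them can't hide its helper captures.

```javascript
// The named group is simultaneously capture #1 in the array.
const m = 'ab'.match(/(?<first>a)(b)/);
console.log(m.groups.first); // 'a'
console.log(m[1]);           // 'a' — same capture, positional slot
console.log(m[2]);           // 'b'
```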
So the polyfill situation doesn't look promising... It is in the ballpark of what XRegExp does though (see how it polyfills sticky matches by adding a disjunction with an empty capture at the end of the source, then popping the last capture).
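For the curious, here's a hedged sketch of that sticky-emulation technique (my own reconstruction; `stickyExec` is a made-up name, not XRegExp's actual API): an empty-capturing alternative is appended so the global exec can never advance past the target position, and a match through that alternative is treated as failure.

```javascript
// Sketch only: emulate /.../y-style matching at `pos` without the y flag.
function stickyExec(re, str, pos) {
  // '|()': an empty capturing alternative matches anywhere, so exec is
  // guaranteed to match exactly at lastIndex rather than scanning ahead.
  const aug = new RegExp(re.source + '|()', 'g' + re.flags.replace('g', ''));
  aug.lastIndex = pos;
  const m = aug.exec(str);
  // If the trailing empty group participated, the real pattern failed at pos.
  if (!m || m[m.length - 1] !== undefined) return null;
  return m;
}

console.log(stickyExec(/ab/, 'xab', 1)[0]); // 'ab'
console.log(stickyExec(/ab/, 'xab', 0));    // null
```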
Babel is able to wrap regexes (only the literals, not the calls to RegExp()) for named capture groups, and it could do the same here.
I already started! (Sorry, I forgot to message here).
After speaking with the Chrome and Safari regex implementors (Firefox and Edge use Chrome's implementation), they want to pursue implementation-only changes to eliminate backtracking. Everyone feels that possessive quantifiers (and atomic groups, which possessives devolve into) are an "extreme power user" feature, and it won't help the average dev. If we can make implementation only changes, then most every regex already in use will just be automatically better.
It's great that they want to improve the backtracking behavior :-)
As a power user, I still wish this was available though.
The average user may not benefit directly, but they may benefit from libs that take advantage of it.
If the aforementioned optimizations failed in some corner case, authors could still handle the problem themselves rather than waiting for upstream fixes.
Worse, imagine a corner case regression in the RegExp engine (not caught by the test suite) that causes some client code that used to work fine to undergo exponential backtracking, effectively DOSing the machine.
RegExps are not safe to use IMO without this, and it's a shame because the underlying engines are pretty neat.
Atomic groups and possessive matches are also similar in semantics to parser combinators. They would make it trivial to upgrade something built on top of compose-regexp into actual parser combinators, without changing the semantics, and that would be awesome.