Adding recursion to regex

It is probably the most important what JavaScript regex doesn't have.
The most famous task - to find closing parenthesis. No way to do this with regex now.
Also, it opens up a new horizon of possibilities and is easy to implement in regex engines.
(?R) - the whole regex pattern.
(?1) - the first group pattern (not result).
(?R) should always be used with "?" or "*" to avoid infinite loop and find something.
Many regex engines support this (if not all except JavaScript).

1 Like

Please.

imagine tons of projects actually ditched JS RegExp in favor of RE2 due possible deadlocks some RegExp can cause already ... you are kinda asking to make RegExp more dangerous than they are already when it comes to too complex logic, with performance penalties and a new "infinite recursion" potential issue there ... I am not sure this is wise, I'd rather ask browsers to expose RE2 instead as native builtin alternative.

1 Like

Infinite recursions may be blocked. (?R) should always be with ? or *.
It is quite easy to implement since it only should replace with some regex pattern.
People should test how long it is going. It is their fault if it has incredible sizes and a lot of backtracks. So test it before using and it will be ok.
Maybe without adding group patterns, only (?R)...

often it's the unknown input that might cause backtracks, not necessarily the RegExp for "desired" or tested use cases ... RE2 exists for this reason too.

1 Like

As I know, catastrophic backtracking is caused mostly or only by (.+)+ etc. It should be avoided.
If not only, I think catastrophic backtracking reasons are quite researched. They should be avoided.
Also, possessive quantifiers may be added to fix this.

I have no voice in TC39 so I am just sharing what I've worked with in the past (RE2 for ad blocks filters and investigated all the reasons for that) and today I'd rather have RE2 exposed as API than have more harakiri prone features to the current RegExp. This is also just my opinion, I hope somebody else from TC39 will actually answer you at some point.

1 Like

I see that many engines give an error "Catastrophic backtracking" (Catastrophic backtracking has been detected and the execution of your expression has been halted.). Why not add this to JavaScript? Then people will see that there is catastrophic backtracking, not think that JavaScript is too slow.
And recursion could be added without concerns.
Goland and Rust got it done without delays.

@oleedd Citation needed for recursive regexps in Rust. Their regex crate uses an algorithm very similar to RE2, just better optimized. In particular, it's also immune to catastrophic backtracking, and like RE2, it does not and cannot support recursion.

Rust does of course have full C interop, so you can easily just link against one providing recursive regexp support, but the broad community default would require a complete ground up rewrite to support it.

I meant Rust got done without delays a task with catastrophic backtracking (about 1 ms).

Why can't if it is so fast? It just has to replace (?R) and others.

If to specify the need for immunity in the standard, Google will find opportunities to make it in V8 and others will do after.

It's not about the speed, but how the regexps are represented. RE2 and Rust's regex crate both lower them into finite state machines: Finite-state machine - Wikipedia

There's no stack or anything in this construction, thus no way to recognize general recursion. Thompson's algorithm can be modified to support stacks for a limited amount of recursion, but that mostly precludes SIMD optimization (which the regex crate uses extensively).

Why should they handle with returns (stack)? Why not replace (?R) with the whole pattern extending the expression?
For example: /t(?R)*y/ => /tt(?R)yy/ and so on.

You mean ad infinitum? Or what is the example about?
/t(?R)*y/ matches "ttytyy", /tt(?R)yy/ doesn't, I don't see how the patterns are related.

It just displayed the idea, if to be precise, it should be:
/t(?R)*y/ => /t(?:t(?R)*y)*y/ and so on.
Finite-state machines seem to be ok for this.

No, a finite state machine cannot recognize /t(?R)*y/.

I mean it could replace... With additional code. It is not a recursion already. It is extending regex.

There are pushdown automata. With stack.

Yeah, but the mere presence of added state interferes with many SIMD-based optimizations.

What about adding to JavaScript?

V8's Irregexp falls back to a similar algorithm of their own after a certain number of backtracks within a given match (100 IIRC).

Also worth noting this would slow down a lot. Thompson's NFA algorithm would require each "thread" have its own stack, and this could result in significant overhead for regexps with it. You could avoid it by using one loop for regexps with an (?R) and one loop for those without, but then you're rolling two regexp implementations And that is inefficient and a regexp vulnerability surface unto itself.


In addition to all this implementation complexity, how do you feel nested results be returned from exec? The current API offers no clear answer here.