Adding recursion to regex

Throwing in rbuckton's work, which he references for his regex syntax proposals (though I don't think one for (?R) exists): Feature: Recursion | regexp-features

1 Like

I am sure that implementing this will not cause problems for Google.

It is much better to have slow than nothing.

I think it should return how other engines do (for example, PCRE and "regex" for Python) to not differ from others.

Lets ask browsers developers. Are they there?

At a certain point you're describing turning regex into its own complete system of parsing. Your results will become trees of arbitrary nesting depth -- parse trees in other words.

I mention it because I do think that JS should head in the direction of offering a built in parsing API. I do not think that this proposal comes close to capturing the complexities entailed in building full-featured parsing support worthy of language integration

1 Like

I proposed just extending, no trees and no stack. No Thompson's algorithm and other difficulties. It can work as it works.
This feature is very useful. And even working 1 ms slower won't spoil it in any way.

Surely the match result would become a tree structure made up of deeply nested arrays (which you will need a stack of some kind both to generate and to consume).

Say you had a pattern with captures 0-9 that recurses on itself, you would need some place for the recursive match's captures to be included in the result, right? Like this:

let result = /<(\w+)>(?R)<\1>/.exec('<h1><h2><h3></h3></h2></h1>');

result[2][2][1] === 'h3'
result[2][1] === 'h2'
result[1] === 'h1'

Recursion with no stack?

Not necessarily. Prior art in Perl returns captures from the top level only:

Information about capture state from the caller for things like backreferences is available to the subpattern, but capture buffers set by the subpattern are not visible to the caller.

@conartist6
Firstly, input should be "<h1><h2><h3><h3><h2><h1>" to work. / is not in \w.
Secondly, it doesn't return groups inside (?R). See how it works in PCRE. And I think it is rightful.

Yes. Because no real recursion.
/t(?R)*y/ => /t(?:t(?R)*y)*y/ => /t(?:t(?:t(?R)*y)*y)*y/
Stack is not required. It doesn't have to return to the beginning with remembering positions before each return and using counters of returns.

Sorry yes I meant for the pattern to be /<(\w+)>(?R)<[/]\1>/.

I don't see how you can say there's no real recursion when the transformation you're applying to the pattern hasn't eliminated the recursive term.

If you can expand all the recursion away so that you have a single fully compiled pattern -- one that does not contain (?R) -- then you can feed that fully compiled pattern to a stackless engine.

Just because the engine keeps that stack by converting the input pattern to state then allowing the pattern to grow larger with recursion, that doesn't mean that the engine isn't keeping a stack at runtime.

It is seeming recursion, not real. For readers, it is recursion, but not for engines if to just replace without returns and stack.

No, because the depth is unknown.

Why does it need stack then? It only has to remember the input regex to replace.

Because you need to maintain some state at each level you recurse.

@conartist6's example was intended to show this, except for it to work, the recursive term must be avoidable: /<(\w+)>(?R)*<\/\1>/

  • each level needs to capture the opening tag, in order to match the closing tag after (?R). So you basically have a stack of "capture group 1".

Another reason you need a stack is simply backtracking: /(a|ab)(?R)*c/

  • when matching "abc", you need to remember you took the "a" alternative, so that when (?R)*c fails at "bc", you can backtrack, take the "ab" alternative and try again.
1 Like

We talked about finite-state machines which don't have stack. For "capture group 1", for remembering which alternative was taken and for all other basic things.
This all occurs with non-recursive regex as well. Replacing (?R) doesn't add anything new to stack if it is present.
/(a|ab)(?:(a|ab)c)*c/

That sounds like you are suggesting to dynamically replace (expand) a part of the state machine while it is running? Well then it no longer is a finite state machine.

1 Like

How a when do you decide how many times you replace the (?R) with the whole expression, or in other words, after how many expansions you omit it? You keep suggesting replacements that don't accept the same language.
"aaazzz" =~ /a(?R)*z/g returns "aaazzz"
"aaazzz" =~ /a(?:az)*z/g returns "aazz"

At some point you need to have a machine that just goes brrrrrr over the input string. Stop bringing up finite state machines, those cannot do recursion (or backreferences, for that matter).

Anyway, it is much easier than adding stack with Thompson's algorithm or something like that.

All as in PCRE.

They can. \1 is a backreference.

Are you guys regex engine developers? If no, why do you block this without exact knowledge of possibility?
And there are stack-based finite state machines actually.

Yes, I am a regex engine developer.

I know it is definitely hard to read tone in these forums, but right now a pool of highly talented people is trying to help you understand something fundamental about how regex is built.

1 Like

I didn't mean you. I meant bergus and lightmare. They alluded that my solution is impossible.
"The engine's implementation is non-backtracking"... I meant developers of fully functional regex engines, especially browser ones. Non-backtracking is too easy if to compare.

I don't need to understand how regex engines are built. I suggested a useful feature and want it to be added. If you want to help, suggest a suitable solution.

I did not read their messages as declaring your solution to be impossible. We are not attacking you and your idea, only trying to help you find the words that have the correct meaning.

This is a formal space -- it is important to make statements that are not just close to being correct, but are actually correct

1 Like

Lets discuss how to add this. I just proposed an easy variant (extending).

How would you feel about experimenting with a library-based solution first before we propose specific changes to the language?