RE2 - Consider having it as alternative engine choice

WebReflection · July 6, 2022, 11:30am

As the repository says, RE2 is a fast, safe, thread-friendly alternative to backtracking regular expression engines like those used in PCRE, Perl, and Python. It is a C++ library.

Not only does it have a highly demanded Node.js module and unofficial WASM ports, it is also used as the official RegExp engine for the latest Web Extensions Manifest V3.

Proposal

Either bring an RE2 constructor to the global scope or allow using it behind the scenes through a special flag (e.g. a 2 flag).

// a case insensitive literal RE2 engine based RegExp
const re2 = /^something valid$/i2;

Thanks for considering helping out every developer who would like to officially use a safer, faster, and lighter RegExp.

Regards

aclaymore · July 6, 2022, 2:44pm

The 262 specification doesn’t specify which implementation of regex to use. Instead it specifies the features and invariants. It also doesn’t specify how performant an implementation must be, only the expected characteristic (e.g “constant time”).

Are there particular features that this library has? Those might fit better into what the specification could describe.

WebReflection · July 6, 2022, 5:42pm

Nope. We can't ditch the current RegExp engine, which is awesome already, but it has possible issues in terms of memory consumption, performance, and so on. Those issues are the reason the RE2 engine exists in the first place, and RE2 is being used as an alternative RegExp standard out there.

Instead it specifies the features and invariants

That's an issue because RE2 doesn't accept all the flags the current RegExp offers, so it's really about opting out of the current engine and entering RE2's scope and functionality, which is like a strict version of any JavaScript engine.

Once again, people who care about possible issues with RegExp are using this library, and Google is enforcing it on the entirety of the Web without exposing any API to check, test, use, or consume such an engine ... that standard is behind Web Extensions, but it's how every Web Extension out there will need to parse and understand ad-blocking filters.

Besides that, they used this engine because it's faster, lighter, better, and more secure: all things modern JS should be concerned about too.

Thanks for considering moving this proposal forward in one way or another.

mhofman · July 6, 2022, 7:07pm

I think I'd like to understand the use case a little better here.

Is this intended as a backstop against problematic RegEx under the code author's control, or is it intended as a way to accept external RegEx patterns and execute them?

If the author of the code is the same as the author of the regex pattern, technically they can write RegEx patterns (and avoid using flags) that are not subject to the problematic behaviors. The fact that the RegEx engine supports some features doesn't mean these feature must be used by the program.

The example of Manifest v3 is clearly in the second category of use case: parsing 3rd party patterns without explosion. The same outcome could technically be accomplished by doing a static analysis on the RegEx pattern to detect problematic patterns, but that is notoriously hard. This may however be simplified by proposals such as RegExp Atomic Operators currently at Stage 1 (draft spec, slides)
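For readers unfamiliar with what "explosion" means here, the following is a minimal sketch of catastrophic backtracking. The pattern and the `timeMatch` helper are illustrative assumptions, not something taken from Manifest V3 or RE2. Exact timings depend on the engine, so none are shown.

```javascript
// Nested quantifiers: a backtracking engine may try a very large number of
// ways to split the run of "a"s before concluding the trailing "!" cannot match.
const risky = /^(a+)+$/;

// Hypothetical helper: runs one match and reports the result and elapsed time.
function timeMatch(regex, input) {
  const start = Date.now();
  const matched = regex.test(input);
  return { matched, ms: Date.now() - start };
}

// Keep n small: in a backtracking engine the work tends to grow
// exponentially with the number of "a"s in a failing input.
for (const n of [10, 15, 20]) {
  const { matched, ms } = timeMatch(risky, 'a'.repeat(n) + '!');
  console.log(`n=${n} matched=${matched} ms=${ms}`);
}
```

An engine with linear-time guarantees, which is what RE2 aims for, would reject or handle such inputs without the blow-up, which is why it suits third-party patterns.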

Edit:
Here is the motivation statement of that proposal:

This proposal seeks to introduce new syntax to allow users fine control the performance characteristics of regular expressions through possessive quantifiers and backtracking control.

WebReflection · July 6, 2022, 8:48pm

The same outcome could technically be accomplished by doing a static analysis on the RegEx pattern to detect problematic patterns

the answer can't always be "solve this in userland" though ... right? 'cause this is what everyone is doing already, there is an npm module and a wasm version of the Google RE2 library.

but that is notoriously hard

Indeed it is, and it's not just that: the RE2 used by Google has unexposed properties such as options.set_max_mem(2 << 10); that, from what I can tell, are not part of that proposal either.

In short: Node.js developers are using an engine born as a subset so that everything is faster/better, and so does Google on the Web Extension side, which is still becoming a standard for all other engines too, because Safari and Firefox will adopt the same RegExp engine there (I suppose). As a result, except for browser internals, nobody can benefit from this different RegExp engine.

Now, if we'd give the same answer to every Google engineer who is pushing for RE2 as a standard API, then that's fine by me. But are we sure? If they chose this approach, maybe there are better reasons behind it, and "just parse all RegExp via JS so that you have your own engine to decide what to do" is not really a way forward, nor an answer.

mhofman · July 6, 2022, 9:07pm

I maintain that the use cases are different, and that for the case where the author of the code is the same as the author of the pattern, there is no need to specify a completely different engine.

Fine grained memory limits like you mention are also unlikely to ever be exposed as it'd exacerbate the chance that a program would run differently on independent platforms/implementations.

Could you clarify who is pushing for what? I have not seen anyone pushing for such new JavaScript APIs.

WebReflection · July 6, 2022, 9:30pm

Web Extensions standard API ... linked in my very first post but re-linked in here in detail for an isRegexSupported method that is based on RE2 constraints because the whole thing is based on RE2 as specified in the Manifest V3 Web Extension standard all browsers are shipping these days.

Consider that I've already spent months on this topic, to the point that I don't see any other way forward for RE2 to be meaningful and understood in client/server JavaScript as a possible, pretty common, limitation or alternative RegExp engine.

mhofman · July 6, 2022, 10:24pm

First, as you know, Web browsers are not the sole environments embedding JavaScript. You are referring to an API which is defined in a Community Group of an independent standards body (W3C), not even the main specification of HTML embedders. (Edit: it seems that this API is not even standardized by that group yet, with the only reference being a request for standardization). Furthermore, that API seems purely additive and not conflicting with JavaScript APIs in any way.

What I'm trying to understand is if there are shared motivations and use cases that would warrant a new API or other changes to JavaScript itself. Simply saying some embedders have a feature that uses a different RegExp engine is not sufficient.

Looking at the use cases that Web Extensions tries to solve for, it is basically "how do I accept 3rd party regexp patterns without exploding", and "will I accept to execute the regexp pattern you give me". To be pedantic, that isRegexSupported API seems theoretically independent from RE2 or the exact RegExp implementation actually used to execute the regexp.

If the committee decided this motivation was worth exploring, the solution could take different shapes. One of them could be very similar to the WebExtensions API, like a new API to test if a given pattern or RegExp instance might explode. Another might be closer to Atomic Operators, allowing authors to more easily write patterns that won't explode, and leave to user land the validation that certain patterns represent a problem.

To be clear, I'm sympathetic to the use case of accepting and executing regexp patterns from unsafe sources, and it would indeed be nice if the language would make it easier for programs which have that use case, and I'm hopeful we'll ultimately have a better story there, but I don't think it'll take the shape of a flag to use specific RegExp implementation.

WebReflection · July 7, 2022, 5:16am

You focused a lot on a single method, maybe sleeping on or ignoring that the whole thing is explicitly linked to RE2 syntax, because the regexFilter is all about RE2.

Even the syntax is based on RE2 and its caveats. So while I agree that even just that method to know if a RegExp might "explode" would be an extremely welcome core standard utility, if all engines need to implement RE2 behind the scenes and Node.js or bun or deno need to use a module, maybe it's worth exploring the possibility of having a smaller, faster, safer RegExp engine, even if the current RegExp engine can run those rules already.

Exactly because TC39 does not suggest any specific engine to use, my idea of a new flag or constructor is that for ECMAScript it means just a subset of the generic rules already allowed, and it's up to engine implementors to decide if the engine should be RE2, something else, or just the same engine that will sanity-check before executing anything.

This would be transparent for users but it could benefit tons of RegExp based use cases, defined by users or not, because RE2 is faster, 'cause safer, hence simpler.

theScottyJam · July 7, 2022, 2:20pm

I actually like the solution of having a specific function to detect if a regular expression might explode. It would let you give helpful errors to the end-user on why their regular expression was rejected. A new flag could work too I suppose - either way you're still having to do something extra when you are constructing the regular expression from user input - either setting an extra flag, or calling the function to check if it might explode. I certainly don't see a need for an entirely different constructor just to restrict some syntax, that seems like overkill.

If you have the ability to detect if the regular expression might explode with a function, then the "safe" part is already handled. You just check if the regular expression will explode, then give the appropriate error, before continuing.
I presume by "small", you're talking about the size of the code that evaluates the regular expression, in which case the "smaller" solution would be to use the existing regular expression engine instead of bundling a second one with all JavaScript runtimes.
And, is RE2 only faster because it doesn't allow regular expressions that could explode? If so, a new willExplode() function will allow you to be sure that the current regular expression engine is just as fast. If not, nothing is stopping the current regular expressions from internally using explosion detection today, then invisibly switching to the RE2 engine if it won't explode - we wouldn't even need a proposal to allow JavaScript runtimes to gain this performance improvement.

Are there other features that RE2 provides that an end-user might want? From what I see, the main thing is the ability to detect if a regular expression could explode, which, as @mhofman mentioned, could look like a number of different things; we don't have to verbatim copy their library into JavaScript to provide this feature. Performance and size are things that an engine can invisibly handle without us needing to update the spec.
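As a rough illustration of what a userland check could look like, here is a sketch that flags two syntax families RE2 does not support: backreferences and lookaround assertions. The name `usesUnsupportedSyntax` is hypothetical, and the scan is a conservative approximation rather than RE2's real parser or the `willExplode()` idea itself. It says nothing about RE2's other limits, such as memory budgets.

```javascript
// Hypothetical helper: reports whether a pattern source uses backreferences
// or lookarounds. This is a simple syntactic scan, not RE2's actual check.
function usesUnsupportedSyntax(source) {
  let inClass = false; // true while scanning inside [...]
  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (ch === '\\') {
      const next = source[i + 1];
      // Outside a character class, \1..\9 and \k<name> are treated as
      // backreferences (conservatively, even where an engine might not).
      if (!inClass && next !== undefined) {
        if (/[1-9]/.test(next)) return true;
        if (next === 'k' && source[i + 2] === '<') return true;
      }
      i++; // skip the escaped character
    } else if (inClass) {
      if (ch === ']') inClass = false;
    } else if (ch === '[') {
      inClass = true;
    } else if (ch === '(' && source[i + 1] === '?') {
      // (?= (?! (?<= (?<! are lookarounds; (?: and (?<name> are not.
      if (/^(=|!|<=|<!)/.test(source.slice(i + 2, i + 4))) return true;
    }
  }
  return false;
}

console.log(usesUnsupportedSyntax('([\'"]).*?\\1'));   // backreference
console.log(usesUnsupportedSyntax('^something valid$')); // plain pattern
```

A check like this only covers the "will the engine accept it" half of the question. Patterns such as nested quantifiers pass it while still being slow in a backtracking engine, which is part of why static detection is described in this thread as notoriously hard.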

WebReflection · July 7, 2022, 8:10pm

This is kinda ignoring what I've previously mentioned: it is an imposed standard for Web extensions by Google + it is already popular on the backend because it's faster at parsing and executing + people ported it to WASM to make it available no matter what ... can we please stop asking if it's really better? It is ... let's move on, shall we?

If you have the ability to detect if the regular expression might explode with a function, then the "safe" part is already handled.

sure ... we can all also write try { new RegExp(str) } catch (o_O) { console.error('you bad person') } everywhere in every code to see if a RegExp is even working at all, but that's why I've proposed a flag, so that literal regexp won't need any method.

Is x as explode a good flag to use? /(['"]).*?\1/x would throw right away, other non looking around patterns won't, how cool in terms of DX is that?

I presume by "small", you're talking about the size of the code that evaluates the regular expression

actually the compiled library size, which is smaller because the RegExp syntax it accepts is simpler

a new willExplode() function will allow you to be sure that the current regular expression engine is just as fast.
nothing is stopping the current regular expressions from internally using explosion detection today, then invisibly switching to the RE2 engine if it won't explode - we wouldn't even need a proposal to allow JavaScript runtimes to gain this performance improvement.

Now we talk ... and whatever, really, as long as the willExplode() is strictly an RE2 not-exploding checking procedure, otherwise anything slightly different will make this whole conversation pointless, because like I've said, RE2 is paving its path behind the developers' scene and in awkward and unreachable ways ... if there are more standard APIs imposing its constraints and limits, the developer experience with JS and its ecosystem will suffer for no reason whatsoever.

A flag to switch or ensure no exploding RegExp are used is cool for literal RegExp, for constructors, and at JIT time, so that the engine behind the scenes can be the fastest possible if the developer meant to use a non-exploding RegExp and the engine can "magically" check that's the case, or throw right away.

A flag is the best way to do this, and a try/catch around new RegExp(str, 'x') is a no-brainer when it's needed, replacing the need for a method altogether, so that would be cool.
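The try/catch approach described above could be sketched as follows. The `tryCompile` helper is hypothetical, and "x" is only the flag floated in this thread: engines that do not define it reject it as an invalid flag, so the second call is expected to fail there.

```javascript
// Hypothetical helper: compiles a pattern and reports failure instead of throwing.
function tryCompile(source, flags = '') {
  try {
    return { ok: true, regex: new RegExp(source, flags) };
  } catch (error) {
    // A SyntaxError here covers both an invalid pattern and an unknown flag,
    // so the caller cannot tell the two apart without inspecting the message.
    return { ok: false, error };
  }
}

const untrusted = '([\'"]).*?\\1';

// Standard engine: the backreference is accepted.
console.log(tryCompile(untrusted).ok);

// Proposed behavior: with an "explode"-checking flag this would be rejected.
// In engines without such a flag it is rejected because the flag is unknown.
console.log(tryCompile(untrusted, 'x').ok);
```

One design consequence worth noting: because a single SyntaxError stands for several failure modes, a dedicated method could give more specific feedback to the person who supplied the pattern, which is the trade-off debated in the surrounding posts.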

Thank you.

theScottyJam · July 8, 2022, 5:28am

I'm not trying to question whether or not RE2 is better than what we have today, I'm just trying to ask how it is better. The JavaScript spec doesn't just move libraries into the language verbatim; instead it describes what the library does, and engine implementors choose how to implement it. So, if we like RE2 and want something like it in the spec, we need to start by asking why we like it, and what it offers us. What features does it have that we want to see in the native language? That's what we're trying to get to the bottom of. Once we've described the desired features, we can look at how we could fit those desired features into the language - and we don't have to do it the same way RE2 did it; RE2 had a lot of limitations that the JavaScript language won't have, due to the fact that RE2 is a userland library.

Ok, so you're bringing up something a little different here. Thus far, we were talking about using untrusted strings in a regular expression, and wanting a mechanism to check if the resulting regular expression is safe to run or not. Considering this isn't something a project is going to be doing a lot, it's not that bad for it to be a little verbose. This makes a method approach a little more attractive, as then we don't have to introduce yet-another-flag for developers to memorize, instead, we can use a method with a nice descriptive name to show what's going on.

Here, you're arguing for a desire to prevent explosive regular expressions on even trusted input (as you're using a regex literal, which can only be done with trusted input). And, yes, I can see a desire for wanting to perhaps make every regular expression literal use this flag, so that you're not accidentally constructing an explosive regular expression. You could even add a lint rule that requires the flag to be used (and you have to explicitly use an eslint-ignore if you really have a need to not use the flag, forcing you to understand the dangers of what you're doing).

WebReflection · July 8, 2022, 6:05am

A flag solves both cases:

it can be an explicit literal trusted intent to signal desire for better performance and typo-free safety
it can be used for untrusted inputs via new RegExp(untrusted, 'x')
the engine can know AOT what kind of engine or optimizations it can use by inferring, in both cases, where flags are known

A method, instead, won't tell much:

the new RegExp(untrusted) still needs a try/catch around because the input is untrusted and the regexp possibly invalid
no engine can be hinted to optimize anything unless we pass for a method check we might not need
if the check is done anyway at creation time then maybe it's OK if performance is not affected ...

... but, most importantly:

If this conversation results in yet another branching from what's being pushed as the Web standard regexp engine (that is, RE2), I have no interest in this proposal anymore, because it will only create a third problem instead of "just 2".

mrjacobbloom · July 10, 2022, 8:13am

If nobody's mentioned it yet, I believe V8 is already experimenting with something like this. (Not that one engine's experiments necessarily dictate the spec, of course.)

They switch to an RE2-based (inspired?) engine for cases where there would otherwise be excessive backtracks. It even supports a non-standard l flag that forces the optimized engine and causes it to throw if the passed expression string isn't optimizable (that is, RE2-compatible). See: An additional non-backtracking RegExp engine · V8
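Since the l flag is non-standard and experimental, code that wants to use it would need to detect it at run time. The sketch below assumes only what the post above states (an "l" flag that throws for non-optimizable patterns). The helper names are hypothetical, and in most runtimes the detection is expected to return false unless the experimental engine has been enabled.

```javascript
// Returns true only if this runtime accepts the non-standard "l" flag.
function supportsLinearFlag() {
  try {
    new RegExp('a', 'l');
    return true;
  } catch {
    return false;
  }
}

// Hypothetical helper: prefer the linear engine when available, otherwise
// (or when the pattern is not linear-compatible) use the default engine.
function compilePreferLinear(source, flags = '') {
  if (supportsLinearFlag()) {
    try {
      return new RegExp(source, flags + 'l');
    } catch {
      // Pattern rejected with "l"; fall through to the default engine.
    }
  }
  // Still throws a SyntaxError if the pattern itself is invalid.
  return new RegExp(source, flags);
}

console.log(supportsLinearFlag());
console.log(compilePreferLinear('a+', 'i').test('AAA'));
```

Silently falling back, as done here, trades the safety guarantee for compatibility. A caller handling untrusted patterns would more likely want to reject the pattern instead of falling back.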

WebReflection · July 10, 2022, 8:34am

Yes, RE2 is also mentioned there, and Rust too ... and I'd be more than happy with an l flag for linear instead of x for explode.

That also demonstrates there is interest in making the distinction, and a flag is what's being used, so I can point at that post whenever I need to answer "why use a flag?".
