Make it possible to divide regular expressions into chunks

Regular expressions are often hard to read.

After reading an article about verbose regular expressions in Python, and about how something similar could be done in ECMAScript, I realised that there is one really simple way to do this:

Simply concatenate multiple chunks that are separated by whitespace back into one. So /expr1/ /expr2/ is read and executed as /expr1expr2/.

This way it becomes much more readable and we can also easily add comments:

new RegExp(
	/expr1/ // comments on expr1
	/expr2/ // comments on expr2
	, 'i' )

AFAIK this shouldn't be too hard to do.

Doing this with syntax (two regex literals directly next to each other) is actually very difficult because of the ambiguity with the infix division operator. In fact, just parsing a single regex literal is difficult because of the division operator.

There's a related regex feature called free spacing that would allow just this. It hasn't been discussed in a while, though. The last mention I can find is "Allow spaces in curly brackets in RegularExpressions".

Thanks for explaining.

I hoped it would be simple. If there is already some code that recognizes a regular expression, then it may not be too hard to repeat it until no more expressions are found.

Free spacing looks like what Verbose Regular Expressions are in Python.

It's a pity, but we can still write strings that are broken up into pieces and then converted to a regular expression.
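As a rough sketch of that workaround, the chunks can be kept as separate pieces, each with its own comment, and joined before compiling. The helper name `joinRegex` is hypothetical and not part of any proposal or library. It accepts regex literals (using their `source`) or plain strings, and it ignores any flags on the individual chunks.

```javascript
// Hypothetical helper: concatenates chunk sources into a single RegExp.
// Flags set on individual chunks are ignored; only `flags` is applied.
function joinRegex(chunks, flags = '') {
  const source = chunks
    .map((chunk) => (chunk instanceof RegExp ? chunk.source : String(chunk)))
    .join('');
  return new RegExp(source, flags);
}

const isoDate = joinRegex([
  /^(\d{4})/,  // year
  /-(\d{2})/,  // month
  /-(\d{2})$/, // day
], 'i');

const match = isoDate.exec('2024-05-17');
console.log(match[1], match[2], match[3]);
```

One limitation of this approach: when regex literals are used as chunks, each chunk has to be a valid pattern on its own, so a group cannot be opened in one chunk and closed in another. Plain strings avoid that, at the cost of double escaping.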

@jridgewell What about something like @/multiline regexp/flags or some variant thereof (using a symbol that's not a valid binary operator)?

Just out of curiosity: how does parsing of a regex work?

I assumed (but this may be naive thinking) that when a forward slash was found, it would just look for the next forward slash that wasn't escaped. And what's in between would then be regarded as a regex.

Except when the forward slash is part of a comment or a division operator. See the details in http://www.ecma-international.org/ecma-262/#sec-ecmascript-language-lexical-grammar:

There are no syntactic grammar contexts where both a leading division or division-assignment, and a leading RegularExpressionLiteral are permitted.

Notice, for example, that the expression /a/ /b/g is already permitted by the current grammar: it is parsed as the regex literal /a/ divided by the variable b and then divided by the variable g.
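This can be checked in any engine today. In the small sketch below, the variables `b` and `g` are made up purely for illustration. The regex object is coerced to a number for the division, which yields NaN.

```javascript
// Illustrative variables; any values would do.
const b = 2;
const g = 4;

// Parsed as (/a/ / b) / g, not as two adjacent regex literals.
const result = /a/ /b/g;

console.log(result); // NaN
```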

it would just look for the next forward slash that wasn't escaped

You'd also have to exclude forward slashes within character classes: /[/]/ is a legal regex. So it's a bit more complicated. But it's not that much more complicated; see the definition of RegularExpressionLiteral. (Note that there is a second, significantly more complicated grammar used to parse the literals once this simpler grammar has identified them.)
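To make that concrete, here is a minimal sketch of such a scan. The function name `findRegexBodyEnd` is hypothetical, and the sketch assumes the caller has already decided that the slash at `start` opens a regex literal rather than a division or a comment. It skips escaped characters, ignores slashes inside character classes, and gives up at a line terminator.

```javascript
const LINE_TERMINATORS = new Set(['\n', '\r', '\u2028', '\u2029']);

// Hypothetical helper: returns the index of the "/" that closes the regex
// body opened at `start`, or -1 if the body is unterminated.
function findRegexBodyEnd(text, start) {
  let inClass = false;
  for (let i = start + 1; i < text.length; i++) {
    const ch = text[i];
    if (LINE_TERMINATORS.has(ch)) return -1; // body must stay on one line
    if (ch === '\\') {
      // A backslash escapes the next character, which may not be a line terminator.
      if (i + 1 >= text.length || LINE_TERMINATORS.has(text[i + 1])) return -1;
      i++;
      continue;
    }
    if (inClass) {
      if (ch === ']') inClass = false; // "/" inside [...] does not end the body
    } else if (ch === '[') {
      inClass = true;
    } else if (ch === '/') {
      return i;
    }
  }
  return -1;
}

console.log(findRegexBodyEnd('/[/]/', 0));    // 4
console.log(findRegexBodyEnd('/a\\/b/g', 0)); // 5
```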

It's a two-step process:

  1. Consume a regexp token and save the inner source with its flags: https://www.ecma-international.org/ecma-262/#sec-literals-regular-expression-literals.
  2. Parse the inner regexp as per https://www.ecma-international.org/ecma-262/#sec-patterns.
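The two steps can be imitated in user code, as a rough sketch. The function name `parseRegexLiteral` is hypothetical, and it assumes its input is the text of exactly one regex literal. Step 1 only splits the token into body and flags. Step 2 is delegated to the `RegExp` constructor, which applies the pattern grammar and throws a `SyntaxError` when the body is not a valid pattern.

```javascript
// Hypothetical helper: takes the text of a single regex literal.
function parseRegexLiteral(literalText) {
  // Step 1: split the token into body and flags.
  // Flags cannot contain "/", so the last slash closes the body.
  const close = literalText.lastIndexOf('/');
  if (literalText[0] !== '/' || close <= 0) {
    throw new SyntaxError('not a regex literal: ' + literalText);
  }
  const body = literalText.slice(1, close);
  const flags = literalText.slice(close + 1);

  // Step 2: parse the body with the pattern grammar.
  return { body, flags, regex: new RegExp(body, flags) };
}

const ok = parseRegexLiteral('/[/]/g');
console.log(ok.body, ok.flags); // [/] g

// "/a(/" passes step 1 as a token but fails step 2 as a pattern.
let stepTwoError = null;
try {
  parseRegexLiteral('/a(/');
} catch (err) {
  stepTwoError = err;
}
console.log(stepTwoError instanceof SyntaxError); // true
```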

In theory, you could merge these two steps, but in practice, it's rather complicated to do (you'd have to do a fair bit of math to tame it to something actually efficiently implementable) and, unless you're writing an engine with a built-in regexp runtime (none of the major engines do, BTW) or a regexp transpiler like regexpu, it's almost never worth the effort.