RegExp composition

It would be useful to have a way to compose RegExps without having to resort to splicing source strings. This would let one build complex parsers out of simpler ones without having to think about non-capturing groups, and let one write tests for the pieces. Rather than full expressions as used in template strings, one could use something along the lines of /(?>id)/, where id is a JS variable.

I haven't tracked all proposals, but at least two of the current ones (pattern modifiers and set notation), which introduce quite a bit of syntax, could be subsumed into this.

  • The "modifiers" proposal becomes moot if composition takes into account the flags of the expression that is spliced in.

  • The set operations could be implemented as a plain JS API that operates on RegExps that consume exactly one code point.
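As a rough sketch of the second point, set operations over single-code-point RegExps can be emulated today with lookaheads. The helper names `intersection` and `difference` are hypothetical (they come from neither the set notation proposal nor compose-regexp), and the sketch assumes both operands consume exactly one code point and carry the same flags:

```javascript
// Hypothetical helpers, for illustration only.
// Assumption: `a` and `b` each consume exactly one code point and share flags.

// Matches one code point accepted by both a and b.
function intersection(a, b) {
  return new RegExp(`(?=${a.source})(?:${b.source})`, a.flags);
}

// Matches one code point accepted by a but not by b.
function difference(a, b) {
  return new RegExp(`(?!${b.source})(?:${a.source})`, a.flags);
}

const lowercaseConsonant = difference(/[a-z]/u, /[aeiou]/u);
const dToF = intersection(/[a-f]/u, /[d-z]/u);
```

The lookahead checks one operand without consuming input, and the group that follows consumes the code point with the other operand, so no new charset syntax is needed.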

This would save a lot on syntax.

The RegExp syntax is already unwieldy. It was invented to create write-only searchers at the command line (itself building on a math notation where sub-patterns were abstracted away as single-letter variables), and it shows. Complex RegExps are a nightmare to edit and update.

If one stumbles on a syntactic construct they don't know, they're going to have a tough time searching for it. The escaping rules introduced by the set notation proposal make readability even worse.

OTOH, the regular formalism is all about composition: regular operators are meant to operate on arbitrary expressions, but the current syntax makes that hard to take advantage of.

Here's an example of a complex RegExp-based parser and here's a port using my compose-regexp lib. The benefits in readability are hopefully obvious, and would be even better with proper syntactic support.

Edit: alternative syntax: /\e{js.expression}/ could also work.

There was a fair amount of discussion related to this sort of idea in this thread: RegExp: Comments

One of the major hurdles is: How do you compose regular expression flags? Some flags just don't compose at all. Other flags do, but it would be nice if you're able to build the same regular expression without composing as you would build with composing.

In the end, the main differences between composing regular expressions and composing strings that you pass into a regular expression constructor are this annoying flag situation and verbosity.

For example, you can do this to achieve composition:

const part1 = String.raw`\d{3}`
const part2 = String.raw`\d{4}`

const pattern1 = new RegExp(part1)
const pattern2 = new RegExp(part2)
const completePattern = new RegExp(`(${part1}) ${part1}-${part2}`)

You can throw a TypeError for flags that don't make sense.
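One possible reading of that suggestion, as a minimal sketch: a hypothetical `sequence` helper (not an existing API) that concatenates RegExp objects and throws a TypeError when their flags differ. Treating any flag mismatch as an error is an assumption made here for simplicity; a real design would need to decide flag by flag.

```javascript
// Hypothetical helper, for illustration only.
// Assumption: any difference in flags is treated as non-composable.
function sequence(...parts) {
  const flags = parts[0].flags; // `flags` is always in canonical order
  for (const part of parts) {
    if (part.flags !== flags) {
      throw new TypeError(
        `Cannot compose /${part.source}/${part.flags} with flags "${flags}"`
      );
    }
  }
  // Wrap every part so quantifiers and alternations stay self-contained.
  const source = parts.map((part) => `(?:${part.source})`).join("");
  return new RegExp(source, flags);
}

const phone = sequence(/\d{3}/, / /, /\d{4}/);

let flagError = null;
try {
  sequence(/a/i, /b/); // mismatched flags
} catch (error) {
  flagError = error;
}
```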

The string-based approach is fine for trivial patterns, but if you want to use quantifiers or disjunctions, you have to either systematically add non-capturing groups (NCG) or parse the input to be spliced in to check if a NCG is needed. ComposeRegExp does the latter to keep the resulting RegExps as short and readable as possible, but it is not trivial. If one doesn't use a dedicated library for this, one could easily end up with bugs, especially when refactoring.

const part1 = "a"
const part2 = "b"

const combined1 = `${part1}-${part2}`
const combined2 = `${part1}|${part2}`

const badStar1 = new RegExp(`${combined1}*`) // /a-b*/
const badStar2 = new RegExp(`${combined2}*`) // /a|b*/
const goodStar2 = new RegExp(`(?:${combined2})*`) // /(?:a|b)*/

const badSequence = RegExp(`${combined1}${combined2}`) // /a-ba|b/
const goodSequence = RegExp(`${combined1}(?:${combined2})`) // /a-b(?:a|b)/

Now imagine you set part1 to "a|c". Have fun debugging if you were not careful adding NCGs everywhere.
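To make that failure concrete, here is a small sketch of the "systematically add NCGs" option. The `group` helper is hypothetical, and the anchors are added only so the difference shows up in a simple test:

```javascript
// Hypothetical helper: always wrap a spliced source in a non-capturing group.
const group = (source) => `(?:${source})`;

const part1 = "a|c";
const part2 = "b";

// Without groups, the alternation leaks: this parses as /^a|c-b$/,
// i.e. "starts with a" OR "ends with c-b".
const unsafe = new RegExp(`^${part1}-${part2}$`);

// With groups, the alternation stays local to part1.
const safe = new RegExp(`^${group(part1)}-${group(part2)}$`);

console.log(unsafe.test("axyz")); // true, which is the bug
console.log(safe.test("axyz"));   // false
console.log(safe.test("c-b"));    // true
```

Wrapping everything is robust but produces noisier sources such as `(?:b)`, which is the trade-off against parsing the input to decide whether a group is needed.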

You can look at the monstrosity that is the SemVer parser used by NPM; this is what you end up with in practice when using the approach you suggest.

@theScottyJam I've skimmed the "regexp comments" proposal. new RegExp(RegExp.fromsource, flags) would be a reasonable solution.

@trusktr has written a lib that implements that idea: GitHub - trusktr/regexr: Easily compose regular expressions without the need for double-escaping inside strings..

IMO dedicated syntax or combinators as a JS API would be better though... new RegExp(RegExp.from is a lot of visual noise. Syntax highlighters would also need to be made aware of RegExp.from whereas /(?>id)/ would be usable out of the box (and would work with non-unicode RegExps).

While the idea of composing RegExps has long been on my mind, the modifiers and set operations proposals are new to me, and I just realized that composition could not entirely subsume them, because I suppose that we'll want to be able to serialize the result of a composition into something that can be parsed by the RegExp constructor.

However, if we provide composition and set operations as JS APIs, the readability of the serialization format becomes less important, and we could have it live e.g. inside a \op{} block, such that escaping rules within charsets don't have to be updated.

This comes up in discussions of the RegExp.escape proposal sometimes: some delegates would strongly prefer to have a template tag builder instead of a RegExp.escape method. (This was, in fact, one of the main use cases imagined for template tags when they were being added to the language in the first place.) No one has stepped forward to champion this yet, though.

An example of such a builder (now quite out of date with new features) is

How about adding a new RegExp Builder Syntax, like Swift, instead of basing it on the existing Template Literal?

let word = OneOrMore(.word)
let emailPattern = Regex {
    Capture {
        ZeroOrMore {
            word
            "."
        }
        word
    }
    "@"
    Capture {
        word
        OneOrMore {
            "."
            word
        }
    }
}

https://developer.apple.com/documentation/RegexBuilder

FWIW, Swift now provides a RegExp builder that is functionally equivalent to compose-regexp.

Edit: That feature was released six months ago, I learned about it today... Am I living under a rock? Yes, why do you ask :-p
