RegExp set notation

GitHub - tc39/proposal-regexp-set-notation: UTS18 set notation in regular expressions

I'm bringing the discussion here per Re. note in the README about user the friendliness of RegExp syntax 路 Issue #61 路 tc39/proposal-regexp-set-notation 路 GitHub

I wonder if it is a good idea to add more syntax to RegExps, especially if it once again complicates the already non-trivial escaping rules for yet another flag.

It is already possible to express set operations using today's RegExps features. Set union is disjunction (A U B => /A|B/), set difference is negative look ahead (A \ B => /(?!B)A/, or /A(?<!B)/ if you're feeling fancy), and you can get the other operations by composing the former two. Engines can easily detect these two patterns and optimize them under the hood.

If we add a JS API to build and compose RegExps, RegExp syntax can be considered object code for all but trivial patterns, and we'd all gain from it.

I find it weird that all considerations about coding best practices (readability, testability, composability, predictability) fly out the window when RegExps are involved.

RegExps are a DSL for certain kinds of parsing task. As with all DSLs, there's an inherent tradeoff between expressing things concisely within the DSL vs punting to the meta-language. I agree it would be nice to have a way of doing composition with RegExps, but I also think that being able to perform intersection, etc, on character classes is a sufficiently fundamental operation that you really want to be able to do it within the DSL itself. And while it's true you can achieve set operations with lookarounds, I don't think those are nearly as clear as using the standard syntax.

(Also, engines are reluctant to do the sort of optimization you propose, because it leads to surprising "performance cliffs" where a small tweak which gets you off the happy path leads to an unexpectedly sharp decline in performance.)

That said, I agree that RegExps in most languages are an unholy amalgamation of whatever features someone thought would be useful at some point. But that doesn't excuse us from the work of considering whether any particular proposal makes sense on its own merits. This one does, I think.

My main concern with the proposal is that it complicates the escaping rules which are already non-trivial (two sets depending on the flags, - is not considered a syntax character in u mode), which makes the DSL difficult to use for non-experts (and can even surprise delegates, e.g. @domenic thought that you could apply the same rules both inside and outside character classes in unicode mode, but the - must be unescaped outside char classes).

The new syntax is not purely additive, it has far reaching consequences. Specifically, it increases the cyclomatic complexity of the mental model for the whole DSL. Having set operations confined to \Op{} blocks, outside of char classes would be far less invasive. The choice of [[]] was made with the intent of being user friendly, but I don't think it is.

Are there folks with cognitive neuroscience backround on the TC, like https://twitter.com/Felienne ?

As far as composition goes, what do you think of GitHub - compose-regexp/compose-regexp.js: Build and compose maintainable regular expressions in JavaScript. ?

I am much, much less worried about the escaping rules than you are. Escaping rules are mostly annoying for parsers and code generators. In my experience users tend to just escape stuff which seems like it might need to be escaped, and that either works as they expected or gives them a syntax error which lets them know they need to not escape some specific thing, which is only a minor hiccup. I do not expect the minor additional complexity of escaping rules entailed by this proposal to be a major problem in practice.

I'm not entirely sure what you mean by "the cyclomatic complexity of the mental model" here; I know what cyclomatic complexity is, and what it means to talk about the complexity of a mental model, but I don't know how you're intending to relate the two here. In any case, I agree this proposal increases the complexity of RegExps, but again merely acknowledging this fact does not free us from the work of deciding whether it is worth the tradeoff, and again I think it is. In particular, I really do not think having new \Union{} or \Set{} or whatever blocks (assuming that's what you're proposing by \Op{}) would be sufficiently ergonomic to be worth having at all within the RegExp DSL.

There are indeed folks with a cognitive neuroscience background on the TC, including Felienne, but they like anyone else have only limited bandwidth. We cannot realistically commission a study of the empirical cognitive burden of every new feature; we have to rely on our own judgement and experience as designers, educators, and users of programming languages.

As far as composition goes, what do you think of GitHub - compose-regexp/compose-regexp.js: Build and compose maintainable regular expressions in JavaScript.?

I would be more partial to something which reuses the existing regex syntax, personally, rather than having two entirely different ways of expressing every concept. But it's good to start the discussion.

So Felienne is a TC member that's awesome. I'd love to have her input here, but I suppose that, since you didn't @mention her, it would be inadequate to do so.

I mentioned the cognitive burden and worry about increasing it in this case because the situation with RegExp is already IMO pretty dire (SNAFU comes to mind, we're just used to it). I think we should be careful not to make it worse.

There are extrinsic factors that could be fixed:

  • Better editor support would help
  • MDN has the terminology wrong and mixes up concepts
  • The terminology around RegExps is abstruse.
    • "Disjunction" and "quantifier" are needlessly academic ("choice" and "suffix" would be more approachable)
    • Atom per the spec definition clashes with the unrelated atomic groups that will hopefully be added.
    • Atom in the spec includes groups, which are not atomic, conceptually. Quantifiable would be a better name, as a superset of Units (things that match exactly one code point/unit) and Groups.

There are also many things that can't be fixed. RegExps are full of nook and crannies that make sense in the small but make the whole picture pretty complex.

  • The JS API can be confusing or surprising

    • match and matchAll live on String.prototype, whereas test and exec live on RegExp.prototype.
    • what happens with lastIndex is all over the place depending on the methods one is using.
      • test and exec read it before matching and write to it afterwards, setting it to 0 when going past the end of the string.
      • match sets it to 0 before and after matching
      • matchAll clones the RegExp, including its lastIndex value, and works with the clone, leaving the original untouched. It takes the original value as the starting point for the match.
    • results.groups is a misnomer, results.named would have been more meaningful, since it doesn't concern anonymous capturing groups or non-capturing ones.
    • when fed a RegExp, string.split inserts the values of captures in the resulting array, between the chunks that result from the split.
  • Regarding the syntax

    • It is good for two things:
      • Quickly dumping an idea into a search field or at the CLI
      • as a Math notation, where sub-patterns are abstracted as single letters (A*, A|B) as in Kleene's paper on the regular formalism.
    • It is however far too dense to read past a trivial level of complexity.
    • Even in verbose mode (x flag), the quantifiers can be dwarfed visually and are easy to miss in suffix position.
    • The escaping rules of unicode regexps were well intentioned, but they are IMO arbitrary. In non-u regexps, there are a few escape sequences that have a special meaning, and you can escape everything else. When in doubt, just escape, easy. In u mode, reserving letters for future use made a ton of sense, but I'm perplexed by the rules around sigils. Why aren't [ and - considered syntax characters within character classes, making /[-[]/u, /[--]/u and /[---]/u legal? Why is it allowed to escape * inside a character class, but not - outside?

That's a lot of paper cuts, a lot of things to juggle in one's mind. The syntax errors related to escaping rules may not be a big deal in a relaxed setting, but they can become infuriating when you're fixing a critical bug under time pressure at 2AM.

By adding a new flag, with new escaping rules, you're adding another bit of context to front load before writing or decoding any character class (I didn't check, but I suppose that the spec implements it as another grammar parameter, reflecting the mental model situation). The number of things we can hold in our heads is limited, and by adding yet another flag, you're making it harder to parse every character class, reducing the pool of people who will be able to work with RegExp (or feel comfortable doing so).

This is what I meant by increasing the cyclomatic complexity of RegExps. You're adding a boolean condition at the root of the tree.

Even if the recommendation once sets are out is not to use u the mode, there will be legacy regexps hanging around.

\Op{...} (or \Op[...]) on the other hand is more limited in scope. You only have to take special rules into account within the braces, regardless of the flags. No spooky action at a distance.

TC members have cognitive abilities that are in the top 0.1% if not higher, and may not realize the impact of their decisions for common devs.

One thing that makes me question how the committee takes cognitive burden into account is a recent change to the spec, where the U paramter of the RegExp grammar was renamed UnicodeMode. I've been scouring the spec a few weeks ago while working on compose-regexp. I was sick with flu-like symptoms, rediscovering the grammar. That change basically made it unreadable to me at the time. nicodeMode is pure visual noise. You have to alternate between attentively reading the symbol names that you want to memorize and inhibiting your Wernike area when your eyes arrive on the parameters (once you've figured out that it is noise), in order not to push out useful info. It also dwarfs the +~? operators. When you try to figure out what the N parameter is for, this isn't fun.

Today, as I'm now more familiar with the spec and in better health, it doesn't bother me. But at the time, in the state I was in, I ended up using the 2021 version on which I had initially stumbled while searching for the spec. That I could read. It would be useful to spell out the parameters in full at the top of each grammars, but in the BNFs, short params are much better (for the same reason Math papers use one letter names for abstract concerns, or why we use i and j as indices when iterating, engaging the language areas needlessly is mentally taxing).

At last, the fact that the Unicode consortium recommends [[]] doesn't mean we must follow them. JS RegExps in u mode already diverge quite a bit from the TR18.



Aside, I would love to have statistics on RegExp reading and writing proficiency, and on user sentiment about using RegExps (especially modifying existing ones).

I'm pretty sure that most dev understand only a subset of the common syntax and semantics (ignoring obscure annex B tidbits), and that the subset they can write is even smaller. Complex RegExps are rarely written, and when they are it is mostly, as you described, a brute-force, trial and error process until the thing works, and they hope they never have to touch them again, because they're a pain to debug.

I'd be curious to know how many people on the committee can describe the backtracking algorithm in detail (I won't fault anyone, I can't). I'd bet that 99+% of devs are in that situation.

The lack of RegExp knowledge among devs is actually a running joke in programming culture, from this to this...



Aside 2: I opened the VsCode issues above while writing this, and in the mean time, I learned that the RegExp tooling is sub-par because they don't have a proper RegExp tokenizer... Truth be told, writing a JS RegExp parser from scratch is not fun. One has to mentally patch the grammar with the Annex B on the fly... I'm working on a set of tokenizers (+U, ~U~N, ~U+N, char classes) based on the spec, hopefully it will prove useful.

Another thing, flags, themselves are terrible as an API.

We have d for "dot all" i for "indices", s for "sticky" and v for "verbose"... Oh, no, they mean "indices", "case folding", "dot all", and "set operations" respectively. Sticky is y, and verbose would be x if it ever came about.

Single letters are inappropriate to name things that are used infrequently.

I don't really have much to say about most of your comment, but I do want to reply to one specific thing:

Changes like this, which affect only the specification itself, are made unilaterally by the editors (a group of three which includes myself), not by the committee as a whole. I was pretty neutral on that particular change, but several other people who work with this part of the specification were in favor of it. I was happy to defer to their preferences.

Also:

Yeah, I've done that, and I agree it wasn't fun (though it wasn't nearly as bad as the main grammar, tbh). We're in process of moving more of Annex B into the main specification, which will help, but it's slow going.

1 Like

Thanks for the info.

To make my point shorter, RegExps are already full of accidental complexity, and going for a flag adds more, where none is needed.

There are other solutions that don't increase the accidental complexity and I think they should be favored.