So Felienne is a TC member, that's awesome. I'd love to have her input here, but since you didn't @mention her, I suppose it would be inappropriate for me to do so.
I mentioned the cognitive burden and worry about increasing it in this case because the situation with RegExp is already IMO pretty dire (SNAFU comes to mind, we're just used to it). I think we should be careful not to make it worse.
There are extrinsic factors that could be fixed:
- Better editor support would help.
- MDN has the terminology wrong and mixes up concepts.
- The terminology around RegExps is abstruse:
  - "Disjunction" and "quantifier" are needlessly academic ("choice" and "suffix" would be more approachable).
  - `Atom` per the spec definition clashes with the unrelated atomic groups that will hopefully be added.
  - `Atom` in the spec includes groups, which are not atomic, conceptually. `Quantifiable` would be a better name, as a superset of `Unit`s (things that match exactly one code point/unit) and `Group`s (see the example after this list).
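To make that last point concrete, a quick sketch:

```js
// Both quantified things below are "Atoms" per the spec grammar,
// yet only the first is conceptually atomic:
/a+/.test("aaa")        // true: `a` is a single unit (one code point)
/(?:ab)+/.test("abab")  // true: the group wraps a whole sub-pattern, nothing "atomic" about it
```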
There are also many things that can't be fixed. RegExps are full of nooks and crannies that make sense in the small but make the whole picture pretty complex.
That's a lot of paper cuts, a lot of things to juggle in one's mind. The syntax errors related to escaping rules may not be a big deal in a relaxed setting, but they can become infuriating when you're fixing a critical bug under time pressure at 2AM.
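Here's the kind of paper cut I have in mind (a quick sketch; exact error messages vary by engine):

```js
/\-/.test("-")            // true: `\-` is tolerated as an identity escape in legacy mode
// new RegExp("\\-", "u") // SyntaxError: outside a class, `\-` is not a valid escape in `u` mode
/[\-]/u.test("-")         // true: ...but inside a character class it is allowed again
```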
By adding a new flag, with new escaping rules, you're adding another bit of context to front-load before writing or decoding any character class (I didn't check, but I suppose the spec implements it as yet another grammar parameter, which mirrors the mental-model situation). The number of things we can hold in our heads is limited, and by adding yet another flag, you're making every character class harder to parse, reducing the pool of people who will be able to work with RegExps (or feel comfortable doing so).
This is what I meant by increasing the cyclomatic complexity of RegExps. You're adding a boolean condition at the root of the tree.
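A sketch of that extra bit of context, assuming for illustration that the new flag is spelled `v` (that spelling is my assumption here):

```js
// The same character class, with escaping rules that depend on the flag:
/[(]/.test("(")   // true: `(` is a literal inside a character class in legacy mode
/[(]/u.test("(")  // true: same in `u` mode

// Under the set-notation proposal (flag spelled `v` here for illustration):
// new RegExp("[(]", "v")    // SyntaxError: `(` is reserved inside classes and must be escaped
// new RegExp("[\\(]", "v")  // OK, matches "("
```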
Even if the recommendation once sets are out is not to use the `u` mode, there will be legacy regexps hanging around.
`\Op{...}` (or `\Op[...]`), on the other hand, is more limited in scope. You only have to take special rules into account within the braces, regardless of the flags. No spooky action at a distance.
TC members have cognitive abilities that are in the top 0.1% if not higher, and may not realize the impact of their decisions for common devs.
One thing that makes me question how the committee takes cognitive burden into account is a recent change to the spec, where the `U` parameter of the RegExp grammar was renamed `UnicodeMode`. I was scouring the spec a few weeks ago while working on `compose-regexp`. I was sick with flu-like symptoms, rediscovering the grammar. That change basically made it unreadable to me at the time. `UnicodeMode` is pure visual noise. You have to alternate between attentively reading the symbol names that you want to memorize and inhibiting your Wernicke's area when your eyes land on the parameters (once you've figured out that they are noise), so as not to push out useful info. It also dwarfs the `+~?` operators. When you're trying to figure out what the `N` parameter is for, this isn't fun.
Today, now that I'm more familiar with the spec and in better health, it doesn't bother me. But at the time, in the state I was in, I ended up using the 2021 version, which I had initially stumbled on while searching for the spec. That I could read. It would be useful to spell out the parameters in full at the top of each grammar, but in the BNFs, short params are much better (for the same reason math papers use one-letter names for abstract concerns, or why we use `i` and `j` as indices when iterating: engaging the language areas needlessly is mentally taxing).
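To give an idea of the difference, here is a paraphrased (not verbatim) sketch of the kind of production involved, before and after the rename:

```
2021:    \ AtomEscape[?U, ?N]
         ( GroupSpecifier[?U] Disjunction[?U, ?N] )

current: \ AtomEscape[?UnicodeMode, ?NamedCaptureGroups]
         ( GroupSpecifier[?UnicodeMode] Disjunction[?UnicodeMode, ?NamedCaptureGroups] )
```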
Lastly, the fact that the Unicode consortium recommends `[[]]` doesn't mean we must follow them. JS RegExps in `u` mode already diverge quite a bit from TR18.
Aside: I would love to have statistics on RegExp reading and writing proficiency, and on user sentiment about using RegExps (especially about modifying existing ones).
I'm pretty sure that most devs understand only a subset of the common syntax and semantics (ignoring obscure Annex B tidbits), and that the subset they can write is even smaller. Complex RegExps are rarely written, and when they are, it is mostly, as you described, a brute-force, trial-and-error process until the thing works; then people hope they never have to touch them again, because they're a pain to debug.
I'd be curious to know how many people on the committee can describe the backtracking algorithm in detail (I won't fault anyone, I can't). I'd bet that 99+% of devs are in that situation.
The lack of RegExp knowledge among devs is actually a running joke in programming culture, from this to this...
Aside 2: I opened the VS Code issues above while writing this, and in the meantime, I learned that the RegExp tooling is sub-par because they don't have a proper RegExp tokenizer... Truth be told, writing a JS RegExp parser from scratch is not fun. One has to mentally patch the grammar with Annex B on the fly... I'm working on a set of tokenizers (`+U`, `~U~N`, `~U+N`, char classes) based on the spec, hopefully it will prove useful.
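To give a flavor of the Annex B patching involved, a minimal sketch (the behaviors are what I'd expect from the spec; engines word the errors differently):

```js
/]/.test("]")             // true: a lone `]` is a literal pattern character per Annex B
// new RegExp("]", "u")   // SyntaxError: in `u` mode, `]` has to be escaped

/\p/.test("p")            // true: `\p` is a legacy identity escape, it just matches "p"
// new RegExp("\\p", "u") // SyntaxError: in `u` mode, `\p` must be a `\p{...}` property escape
```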