RegExp: Comments

This may be too large a departure from how RegExps are implemented and defined. For example, what would the .flags property return for a RegExp with flags applied to different sections?

Also, flags like d, g, and y do not have clear semantics if they are applied to just one part.

I imagine the most commonly desired flag for just subsections would be ignoreCase. So maybe solutions would only need to cater for this.


.flags would return the flags that were passed to the constructor, just as it does now. Those are the flags you need to reconstruct new RegExp(source, flags).
The following all match the same thing:

/yes/i.flags === 'i'
/[Yy][Ee][Ss]/.flags === ''
/(?i)yes/.flags === ''
/(?i:yes)/.flags === ''

The first two are currently valid, and you can see that re.flags.includes('i') or re.ignoreCase doesn't mean anything on its own; the flags are only meaningful when parsing the source (reconstruction or composition).
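
To make the reconstruction point concrete, here is a small sketch that sticks to syntax that is valid today. The variable names and sample inputs are my own, not from the posts above.

```javascript
// Two regexes that accept the same strings but report different flags.
const withFlag = /yes/i;
const withClasses = /[Yy][Ee][Ss]/;

console.log(withFlag.flags);    // "i"
console.log(withClasses.flags); // ""

// Both accept the same inputs, so `flags` alone says nothing about matching.
const agree = ["yes", "YES", "yEs", "no"].every(
  (input) => withFlag.test(input) === withClasses.test(input)
);
console.log(agree); // true

// `source` and `flags` together are what you need to rebuild the regex.
const rebuilt = new RegExp(withFlag.source, withFlag.flags);
console.log(rebuilt.source, rebuilt.flags); // "yes" "i"
```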


But regex-in-a-string is no closer to a raw regex than a whitespace-ignoring regex is. I'd say quite the opposite. The so-called "extended" (Perl) or "verbose" (Python) regex syntax is long established, avoids all the clutter you have around the pieces (+ and quoting each line), and doesn't aggravate leaning toothpick syndrome. Note that in Python it is much easier to write regex-in-a-string (raw strings, no doubling of backslashes), and they still deemed verbose regexes worth including. Overall, verbose regexes are easier to write, read, understand and maintain.

I took the time to parse your example, and found two parts showcasing what a write-only nightmare regex-in-a-string so often is:

The layout makes it look as if you're matching two independent alternatives here, but no, the second pipe is escaped by the backslash from the previous line. Or this strangely split group:

I would argue it's a hell of a lot easier to make a subtle mistake in this kind of dense regex packed with backslashes than it is in a properly indented and commented verbose one.


@aclaymore's comment about needing to expose a flags attribute is, I think, a point against having inline flags for flags such as "g", as these really should be an on-off switch for the whole regex. I don't have a problem with having inline flags for things such as case sensitivity though, especially since it's technically possible to construct a case-insensitive regex without using that flag.

I did a lookaround at all of the different regular expression flags. Here are the ones that change how the regular expression is interpreted:

  • i: Case insensitive
  • m: multiline
  • s: Single line (dotall)
  • u: unicode

And here are the ones that relate to how the search is performed (and really should have been function parameters):

  • d: has indices
  • g: global
  • y: sticky

It's this second group of flags that I'm starting to think we shouldn't allow as an inline flag, as they don't make sense when only applied to a portion of a regular expression, which means we must provide some other way to apply these flags.
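
As a rough illustration of why this second group is about the search rather than the pattern, the sketch below runs the same /cat/ pattern with each of the three flags. The sample text and variable names are made up for the example.

```javascript
const text = "cat bat cat";

// "g": each exec() call resumes from lastIndex, so the regex carries search state.
const globalRe = /cat/g;
const firstIndex = globalRe.exec(text).index;  // 0
const secondIndex = globalRe.exec(text).index; // 8

// "y": a match must begin exactly at lastIndex.
const stickyRe = /cat/y;
stickyRe.lastIndex = 4;
const stickyMiss = stickyRe.exec(text);        // null, because "bat" starts at 4
stickyRe.lastIndex = 8;
const stickyHit = stickyRe.exec(text).index;   // 8

// "d": the result gains an `indices` array of [start, end] pairs.
const span = /cat/d.exec(text).indices[0];     // [0, 3]

console.log({ firstIndex, secondIndex, stickyMiss, stickyHit, span });
```

In all three cases the pattern itself matches exactly the same strings. Only where the search starts, or what the result object contains, changes.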

I started looking around at some of the built-in functions that use regular expressions. Here are some that I found:

In all of these functions, it would be easy enough to add an optional options object argument to the end that accepts certain properties to configure how the search is performed, e.g. { global: true, hasIndices: false }.

The matchAll() function is particularly interesting because (1) it's a newer function, and (2) according to the documentation, it throws an error if you supply a regex that does not have the "g" flag. There seems to be no reason for this enforcement other than consistency with how the flags work. They could have alternatively implemented it to ignore the "g" flag on the regular expression and always do a global search anyway, similar to what we are proposing should happen if "global" is set to true on an options argument for one of these other functions, but they decided against this. In other words, they seem to be upholding the idea of having flags such as "d", "g", and "y" be part of the regular expression instead of function parameters, so we'll probably have a difficult time changing this.
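
The matchAll() behaviour described above is easy to check. In this sketch, tryMatchAll is just a helper name invented for the illustration. It returns the matched strings, or the name of the error that was thrown.

```javascript
// Hypothetical helper: collect all matches, or report the thrown error's name.
function tryMatchAll(text, re) {
  try {
    return [...text.matchAll(re)].map((match) => match[0]);
  } catch (err) {
    return err.name;
  }
}

console.log(tryMatchAll("a1b2", /\d/g)); // [ '1', '2' ]
console.log(tryMatchAll("a1b2", /\d/));  // TypeError
```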

I think we would have a much easier time getting this proposal through if we followed suit (at least in the initial draft - conversations about this could certainly happen afterwards). Changing all of these functions would bloat the proposal a good amount anyways.

This leaves us with needing a convenient way to apply flags such as "g" and "d" to a template-string regular expression:

  • Doing new RegExp(existingRegex, newFlags) was a good idea, but probably only works well if we don't need to set flags often. I would be in favor of just using this approach if we do alter the different search functions.
  • There was the RegExp.from(<flags>)`the regex` idea that I'm still a fan of
  • RegExp.from.d`the regex` was also brought up. This feels a little weird when combining multiple flags, though: RegExp.from.d.g.i`the regex`
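
For reference, the first option already works in current engines. Here is a short sketch with made-up sample data. Note that the flags argument replaces the original flags rather than adding to them.

```javascript
const base = /[a-f]+/i;

// Keep the existing flags and add "g" on top.
const withGlobal = new RegExp(base, base.flags + "g");
console.log(withGlobal.flags);          // "gi"
console.log("ab CD".match(withGlobal)); // [ 'ab', 'CD' ]

// Passing only "d" drops the original "i" flag.
const onlyIndices = new RegExp(base, "d");
console.log(onlyIndices.flags);         // "d"
console.log(onlyIndices.test("CD"));    // false, no longer case-insensitive
```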

I'm also just realizing that we ended up with two separate proposals mashed together. It might be good to separate out the idea of composing regular expressions and adding a verbose flag into two separate proposals, as they both can stand alone, and can both provide value, independent of each other.

The "verbose flag" proposal would just be that. It would enable people to create regular expressions like the following (no custom template tag needed):

new RegExp(String.raw`
  ...
`, 'x')
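
Since no engine accepts an 'x' flag today, the closest approximation is to strip the whitespace and comments yourself before calling the constructor. The sketch below is deliberately naive: stripVerbose is a hypothetical helper, and it would mangle an escaped space or a "#" inside a character class, which a real verbose mode would have to handle.

```javascript
// Hypothetical helper approximating a verbose flag.
// Naive: does not preserve escaped whitespace or "#" inside character classes.
function stripVerbose(source) {
  return source
    .replace(/#.*$/gm, "") // drop "#" comments up to the end of each line
    .replace(/\s+/g, "");  // drop all remaining whitespace
}

const datePattern = new RegExp(stripVerbose(String.raw`
  (\d{4})  # year
  -
  (\d{2})  # month
  -
  (\d{2})  # day
`));

console.log(datePattern.source);                       // (\d{4})-(\d{2})-(\d{2})
console.log(datePattern.exec("2021-07-04").slice(1));  // [ '2021', '07', '04' ]
```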

The composable regular expressions idea would contain other features such as

  • the RegExp.from template tag
  • Inline flags (?i)

Example:

const chunk = RegExp.from('i')`[a-f][a-z][a-z]`
const final = RegExp.from()`${chunk}-${chunk}`
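
RegExp.from does not exist, so the following is only a sketch of how such a tag might behave, written as a standalone function. The name regexFrom, the choice to wrap interpolated regexes in a non-capturing group, and the choice to insert other values as-is are all my assumptions. It also shows why inline flags come up at all: without them, the chunk's "i" flag is simply lost when it is embedded.

```javascript
// Hypothetical stand-in for the proposed RegExp.from(flags)`...` tag.
function regexFrom(flags = "") {
  return (strings, ...values) => {
    let source = strings.raw[0];
    values.forEach((value, i) => {
      // Assumption: embed a RegExp by its source, wrapped in a non-capturing group.
      const piece = value instanceof RegExp ? `(?:${value.source})` : String(value);
      source += piece + strings.raw[i + 1];
    });
    return new RegExp(source, flags);
  };
}

const chunk = regexFrom("i")`[a-f][a-z][a-z]`;
const combined = regexFrom()`${chunk}-${chunk}`;

console.log(combined.source);          // (?:[a-f][a-z][a-z])-(?:[a-f][a-z][a-z])
console.log(chunk.test("ABC"));        // true
console.log(combined.test("abc-def")); // true
console.log(combined.test("ABC-DEF")); // false, the chunk's "i" flag did not carry over
```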

I'm also realizing that composing regular expressions is possible today. You just have to compose strings, instead of RegExp instances (which might make more sense anyways - it gets around issues with flags).

const chunk = String.raw`[a-f][a-z][a-z]\d`
const final = new RegExp(String.raw`${chunk}-${chunk}\d`, 'i')
// or, starting from an existing RegExp and reusing its .source:
const chunkRe = /[a-f][a-z][a-z]\d/
const finalRe = new RegExp(String.raw`${chunkRe.source}-${chunkRe.source}\d`, 'i')

didn't know String.raw.... learned something new and thx

I think I'll go ahead and pull together a proposal repo for the verbose flag, so we have something a little more concrete to work with.

Alright, here's a proposal repo for the verbose flag. Feel free to discuss it, give feedback, attack it, etc.

I recently learned of XRegExp which might be good to look at and possibly learn from for supporting comments, verbose mode, etc. (prior art)

https://xregexp.com/


Thanks @mfulton26, that's a great example of prior art. I see they've made design choices similar to the ones we've reached (the "x" flag, which supports both whitespace and "#" as a comment character).

I've updated the proposal repo to include it.


Good news! It looks like a delegate (Ron Buckton) has independently decided to make a bunch of improvements to regular expressions, one of which is the X flag.

You can find his proposal repo here: GitHub - tc39-transfer/proposal-regexp-x-mode

It's currently at stage 1.
