RegExp: Comments

I would love to have Perl compatible regular expression comments.

Comments begin with (?# and end at the next ).

Example

text.match(/true|false|undefined(?#indeterminate)/);
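For anyone who wants to experiment with this syntax today, here is a minimal sketch of a preprocessor that removes (?#...) comments from a pattern string before it reaches the RegExp constructor. The name stripComments is hypothetical and not part of any proposal. The sketch assumes a comment ends at the first ")" and that "(?#" inside a character class or after a backslash should be left alone.

```javascript
// Hypothetical helper: removes Perl-style (?#...) comments from a pattern
// string so that today's RegExp constructor can compile the result.
function stripComments(source) {
  let out = "";
  let inClass = false; // true while inside [...], where "(?#" is literal text
  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (ch === "\\") {
      // Copy the backslash and the escaped character verbatim.
      out += source.slice(i, i + 2);
      i++;
    } else if (inClass) {
      if (ch === "]") inClass = false;
      out += ch;
    } else if (ch === "[") {
      inClass = true;
      out += ch;
    } else if (source.startsWith("(?#", i)) {
      const end = source.indexOf(")", i);
      if (end === -1) throw new SyntaxError("Unterminated (?# comment");
      i = end; // skip everything up to and including the closing paren
    } else {
      out += ch;
    }
  }
  return out;
}

const re = new RegExp(
  stripComments(String.raw`true|false|undefined(?#indeterminate)`)
);
```

A string-based helper like this cannot be used with regex literals, which is part of why native support was requested.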

You'd need to demonstrate that this feature is actually necessary.

Personally I would like to see a built in way of composing RegExps. So variable names and existing JS comment syntax can be used to describe the pattern.

Something like:

let version = 4;
let variant = /[89aAbB]/; // https://en.wikipedia.org/wiki/Universally_unique_identifier#Variants 
let hex = /[a-f0-9]/;
let firstHalf = RegExp.from`${hex}{8}-${hex}{4}-${version}${hex}{3}`;
let secondHalf = RegExp.from`${variant}${hex}{3}-${hex}{12}`;
let uuids = RegExp.from`${firstHalf}-${secondHalf}`;

assert(uuids.source === /[a-f0-9]{8}-[a-f0-9]{4}-4[a-f0-9]{3}-[89aAbB][a-f0-9]{3}-[a-f0-9]{12}/.source);
Implementation
RegExp.from = function ({ raw }, ...vars) {
  function varToString(v) {
    let isRegex = RegExp(v) === v;
    return isRegex ? v.source : String(v);
  }

  return RegExp(
      raw
        .map((v, i) => (i === raw.length - 1 ? v : v + varToString(vars[i])))
        .join("")
    );
};

Regex can already feel so cluttered sometimes - maybe it needs something more than just comments. I personally love CoffeeScript's multiline regex, which allows inline comments and ignores whitespace. All of that together could really help to make a regex more readable.

Example from their website:

NUMBER     = ///
  ^ 0b[01]+    |              # binary
  ^ 0o[0-7]+   |              # octal
  ^ 0x[\da-f]+ |              # hex
  ^ \d*\.?\d+ (?:e[+-]?\d+)?  # decimal
///i

(The # is the inline comment character in CoffeeScript).
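Something close to this can be approximated in current JavaScript with a tagged template. The following is only a sketch: the tag name verbose is hypothetical, and it takes a shortcut by not parsing character classes, so a literal "#" or space must be written with a backslash even inside [...].

```javascript
// Hypothetical tag factory: emulates a "verbose" regex by dropping
// # comments and unescaped whitespace from the raw template text.
function verbose(flags = "") {
  return ({ raw }, ...values) => {
    // Interpolated values are inserted as plain strings in this sketch.
    const text = String.raw({ raw }, ...values);
    const pattern = text
      // Keep escape pairs such as \# or \d; drop "# ..." up to end of line.
      .replace(/(\\.)|#.*$/gm, (match, escaped) => escaped ?? "")
      // Keep escape pairs such as "\ "; drop all other whitespace.
      .replace(/(\\.)|\s+/g, (match, escaped) => escaped ?? "");
    return new RegExp(pattern, flags);
  };
}

const NUMBER = verbose("i")`
  ^ 0b[01]+    |              # binary
  ^ 0o[0-7]+   |              # octal
  ^ 0x[\da-f]+ |              # hex
  ^ \d*\.?\d+ (?:e[+-]?\d+)?  # decimal
`;
```

Because the tag reads the raw template text, backslashes do not need to be doubled.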

I'd be okay with comments provided regexps spanning multiple lines is also permitted. Without it, I don't see how it's any more useful than a single-line comment above it explaining it.

I agree that being able to define a RegExp literal over multiple lines would be helpful but even without it a RegExp with multiple comments throughout it can be helpful so that you can insert logical breaks and say "the following is for X", etc.

I don't think you need to change JS grammar for this. Just finding a way to allow multi-line /regex/ without ambiguities would not be trivial. /// is a comment. Perl's qr/foo/ix is already (qr) / (foo) / (ix) in JS.

@aclaymore's suggestion above is a good start, but it doesn't respect flags. Since there's no way to pass flags to the tagged template, and also currently no way to "scope" flags to a portion of a RegExp, you'd need inline flags and/or non-capturing group flags supported in RegExp grammar.

I would even suggest putting this directly on RegExp, making it work as a template tag. Right now it's completely unaware of tagged template literals: it treats the template chunk array as a regular string, the first ${expression} as flags, and ignores the rest.

Then add a verbose flag, and one of the examples above would look like this:

NUMBER     = RegExp`(?xi)
  ^ 0b[01]+    |              # binary
  ^ 0o[0-7]+   |              # octal
  ^ 0x[\da-f]+ |              # hex
  ^ \d*\.?\d+ (?:e[+-]?\d+)?  # decimal
`

The following examples already do something actually:

const validatePhone = RegExp`\\d{3}-\\d{3}-\\d{4}`
console.log(validatePhone.test('123-456-7890')) // true

const badButValidRegex = RegExp`${'g'}\\d{3}-\\d{3}-\\d{4}`
console.log(badButValidRegex.test(',123-456-7890')) // true

These work because a template tag is just a function that accepts two arrays. RegExp happens to also be a function that accepts two parameters and stringifies them.

So, we would need to do something like RegExp.from(). I'm also seeing from the above examples that backslashes don't play well with tagged template literals, which makes it an inconvenient solution.

However, to continue with the RegExp.from() idea, one way to allow flags would be to curry the function a bit - you call it with your desired flags and it returns another function that you can use as a tag for a tagged template literal.

RegExp.from('gi')`This is my regex`
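A minimal sketch of that curried shape, written as a standalone function so it does not modify any built-in. The name regexFrom is hypothetical, and the sketch assumes the simplest flag policy: flags on interpolated regexes are dropped and the outer flags apply to the whole pattern.

```javascript
// Hypothetical stand-in for the curried tag described above; not a built-in.
function regexFrom(flags = "") {
  return ({ raw }, ...values) => {
    // Interpolated regexes contribute their source; their own flags are dropped.
    const parts = values.map((v) => (v instanceof RegExp ? v.source : String(v)));
    // Interleave the raw chunks with the stringified values.
    const pattern = raw.reduce((out, chunk, i) => out + parts[i - 1] + chunk);
    return new RegExp(pattern, flags);
  };
}

const hex = /[a-f0-9]/;
const prefix = regexFrom("gi")`${hex}{8}-${hex}{4}`;
```

Here prefix.flags is "gi" regardless of what flags hex carried.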

It was also an interesting point to bring up about combining flags. If one regex used the "case-insensitive" flag and another did not, it wouldn't really be compatible to combine those two, unless we found some way to make a chunk of the regex obey the case-insensitive flag. Other flags, like the "g" flag wouldn't make any sense when applied to just a portion of the regex. So, that's something interesting to chew on.

Minor nitpick: tag function accepts an array and varargs. But yes, I'm aware this currently does something, and I have no sympathy for code abusing that.

Not really, tag function also has access to raw template text; no need to duplicate backslashes like with new RegExp("\\d+").
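To make the last two points concrete, here is a small sketch of what a tag function actually receives. The function name inspect is made up for illustration.

```javascript
// A tag receives one array of string chunks (which also carries a .raw
// array) followed by the interpolated values as separate arguments.
function inspect(strings, ...values) {
  return { cooked: [...strings], raw: [...strings.raw], values };
}

const seen = inspect`\d{3}-${42}`;
// seen.raw[0] still contains the backslash, so it can go to RegExp as-is.

// String.raw is a built-in tag that joins the raw chunks, which avoids
// the doubled backslashes needed in an ordinary string literal.
const phone = new RegExp(String.raw`^\d{3}-\d{3}-\d{4}$`);
```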

That doesn't solve the issue of composing regexes with different flags:

let reIdentifier = RegExp.from('i')`[$a-z_][$0-9a-z_]*`
let reType = RegExp.from()`(int|string)`
let reDecl = RegExp.from()`declare +${reIdentifier} *: *${reType}`

You need scoped flags in order to compose the final regex:

reDecl == RegExp.from`declare +(?i:[$a-z_][$0-9a-z_]*) *: *(int|string)`
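As a rough sketch of how that composition could be produced mechanically, the helper below builds only the pattern string. The names composeSource and embed are hypothetical. It assumes that i, m and s are the flags worth scoping, and it wraps unflagged pieces in a non-capturing group, so the output differs slightly from the hand-written line above. Only engines that implement scoped modifier groups such as (?i:...) will compile the result.

```javascript
// Flags that make sense when applied to just part of a pattern (assumption).
const SCOPABLE = new Set(["i", "m", "s"]);

// Hypothetical helper: turns one interpolated value into pattern text.
function embed(value) {
  if (!(value instanceof RegExp)) return String(value);
  const scoped = [...value.flags].filter((f) => SCOPABLE.has(f)).join("");
  // Flags like "g" are silently dropped; the group keeps alternations contained.
  return scoped ? `(?${scoped}:${value.source})` : `(?:${value.source})`;
}

// Hypothetical tag: returns a pattern string rather than a RegExp, so it
// also runs on engines without modifier-group support.
function composeSource({ raw }, ...values) {
  return raw.reduce((out, chunk, i) => out + embed(values[i - 1]) + chunk);
}

const reIdentifier = /[$a-z_][$0-9a-z_]*/i;
const reType = /(int|string)/;
const declSource = composeSource`declare +${reIdentifier} *: *${reType}`;
```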

So you'd only need the curried variant for certain flags like 'g' or 'd', which almost sounds like it should be a distinct function, i.e. you'd have:

let re1 = RegExp.from`match this`
let re2 = RegExp.from.g`all of this`

But frankly, 'g' or 'd' should never have been attached to the RegExp object, they should've been flags passed to matching functions. So this could be an opportunity to remedy that as well.

Thanks for correcting me on how tagged-template literals work - I didn't look them up first like I should have.

My biggest worry isn't that people are currently using RegExp as a tagged-template literal (but I wouldn't be surprised if someone out there was), my biggest worry is that we're overloading a single function to do two different things, and applying some magic type-checking to distinguish the two behaviors. What would this magic be? Do we decide to go into tagged-template mode if Array.isArray() is true on the first argument? Let's just keep things simple: The RegExp constructor takes two arguments and stringifies them (no type-checking to do special-case behaviors, this behavior will always hold), and RegExp.from() is meant to be used as a template tag.

With the RegExp.from('i')`some regex` format, I'm implying that we erase all of the flags of the composed RegExp and apply the same new flags to it all. This certainly isn't ideal, but it works fine when you're in control of all of the sub-regular-expressions (which isn't always the case). But, I'm certainly open to other ideas of how to deal with these flags, and you bring up some good points.

Fair enough. I wouldn't check the type, only whether there's .raw on the first arg. I also don't consider that doing two different things: you're constructing a RegExp in either case. It's only about how you interpret a non-string argument. Somewhat similar to how String.prototype.match automagically constructs a RegExp if you give it something else.
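A sketch of what that check might look like, written as a wrapper so the behavior can be tried without touching the built-in. The name makeRegExp is hypothetical, and the real RegExp constructor does not behave this way.

```javascript
// Hypothetical wrapper: enters "template mode" only when the first argument
// carries a .raw array, as a template strings object does.
function makeRegExp(first, ...rest) {
  if (Array.isArray(first?.raw)) {
    // Called as a tag: build the pattern from the raw chunks.
    // Interpolated values are inserted as plain strings in this sketch.
    return new RegExp(String.raw(first, ...rest));
  }
  // Called normally: defer to the existing (pattern, flags) behavior.
  return new RegExp(first, ...rest);
}

const fromTag = makeRegExp`\d{3}-\d{4}`;
const fromString = makeRegExp("\\d{3}-\\d{4}", "g");
```

Both calls produce the same source; only the second carries a flag.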

Now that I think of it, RegExp constructor already is overloaded, you can give it a string, or another RegExp.

Good point. I guess it wouldn't be all that bad to do RegExp`...` - I think I would still feel more comfortable with RegExp.from(), but I can also think of a number of examples where, in JavaScript, we make one function act in two different ways given different parameters.

For completeness, a tagged template approach wouldn't necessarily need to support specifying flags, as they could be added using a subsequent constructor call.

RegExp(RegExp.from`...`, flags)

I think it shouldn't be possible to express things in a composed regular expression that are not possible to express in a non-composed one.

Currently, there's no way to represent the following composed regular expression using a regex literal:

RegExp.from`ab${/cd/i}ef${/gh/}ij`

To fix this, we either need to allow flags to be set for just parts of a regular expression (edit: as @lightmare was suggesting - sorry, I was slow to catch on), using syntax such as (?i), or, we need to discard these flags when composing them.

I can see arguments for lots of directions here. I'm currently favoring the idea of adding support for (?i) syntax, and discarding irrelevant flags when composing, such as 'g'.

I don't know much about this mode-modifier syntax (?i), but I believe it would allow us to set flags such as g inside the tagged-template as well, which would solve the issue of setting flags in the template.

I agree that g should never have been a property of regular expressions, and it could be worth looking into what it would take to add arguments to the different search functions so that we can use the g flag (and friends) behaviors without setting it as a property on the regex.
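Until something like that exists, the effect can be approximated in user code by copying the regex with the extra flag at the call site. A minimal sketch, with findAll as a hypothetical helper name:

```javascript
// Hypothetical helper: "global" is decided by the caller of the search
// function rather than being stored on the regex itself.
function findAll(re, text) {
  const flags = re.flags.includes("g") ? re.flags : re.flags + "g";
  // new RegExp(regex, flags) copies the pattern with the given flags,
  // so the caller's regex (and its lastIndex) is left untouched.
  const globalCopy = new RegExp(re, flags);
  return [...text.matchAll(globalCopy)].map((match) => match[0]);
}

const syllable = /an/;
const found = findAll(syllable, "banana");
```

After the call, syllable.global is still false.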

in the wild, jslint internally uses tagged-templates @ jslint/jslint.js at 76a5f3fa7947180dc86327371a87cfe758a5a5ad · jslint-org/jslint · GitHub

function tag_regexp(strings) {
    return new RegExp(strings.raw[0].replace(/\s/g, ""));
}
...
// carriage return, carriage return linefeed, or linefeed
const rx_crlf = tag_regexp `
      \n
    | \r \n?
`;
...
// token
const rx_token = tag_regexp ` ^ (
    (\s+)
  | (
      [ a-z A-Z _ $ ]
      [ a-z A-Z 0-9 _ $ ]*
    )
  | [
      ( ) { } \[ \] , : ; ' " ~ \`
  ]
  | \? [ ? . ]?
  | = (?:
        = =?
      | >
    )?
  | \.+
  | \* [ * \/ = ]?
  | \/ [ * \/ ]?
  | \+ [ = + ]?
  | - [ = \- ]?
  | [ \^ % ] =?
  | & [ & = ]?
  | \| [ | = ]?
  | >{1,3} =?
  | < <? =?
  | ! (?:
        !
      | = =?
    )?
  | (
        0
      | [ 1-9 ] [ 0-9 ]*
    )
) ( .* ) $ `;
...

i don't like them as they're too magical -- always have uneasy feeling when editing above code, that i made some subtle mistake. would prefer it reverted to something closer to raw-regexp (which i feel more comfortable editing, even if it's less pretty):

// token
const rx_token = new RegExp(
    "^("
    + "(\\s+)"
    + "|([a-zA-Z_$][a-zA-Z0-9_$]*)"
    + "|[(){}\\[\\],:;'\"~\\`]"
    + "|\\?[?.]?"
    + "|=(?:==?|>)?"
    + "|\\.+"
    + "|\\*[*\\/=]?"
    + "|\\/[*\\/]?"
    + "|\\+[=+]?"
    + "|-[=\\-]?"
    + "|[\\^%]=?"
    + "|&[&=]?"
    + "|\\|[|=]?"
    + "|>{1,3}=?"
    + "|<<?=?"
    + "|!(?:!|==?)?"
    + "|(0|[1-9][0-9]*)"
    + ")"
    + "(.*)$"
);

What part do you find magical about them? Is it related to the fact that it's a template (i.e. would special multi-line syntax feel more comfortable), or the fact that whitespace can be used inside the regex (so you wouldn't want the suggested verbose flag)?

both. when diagnosing parsing errors, i now have additional sources of bugs to worry about (pretty as they may look).

I'm not sure I'm seeing what you're seeing yet @kaizhu256 - what is this additional source of bugs?

  • What kinds of bugs might come from using a tagged template, that wouldn't come from using the normal RegExp() constructor?
  • How would a verbose flag be much different from a case-insensitive flag - they both change how the internal regex is interpreted?

On another note, if we go forwards with a verbose flag that's similar to python's (which adds whitespace and comment support), I think it would be nice if we kept # as the comment character. It would allow these regular expressions to be interchangeable between other languages that support this flag. Also, a multi-character comment (like //) feels too much like a potential hazard in regular expression syntax, plus, that would be awkward to escape (should you escape both slashes? Just one? Which one?). I admit that none of these are particularly strong reasons, but I think they're still strong enough to outweigh using // as a comment character.