Unicode-aware regular expressions

In ECMAScript 2015, RegExp.prototype.flags was extended with the u-flag.

Was it ever discussed to have the u-flag imply that \w, \b, and \d match the corresponding Unicode characters for all languages?

So for example, this would return true in such a scenario:

/\w/u.test('ä')

As it is now, it seems to me very English-centric. Having to do \p{L} seems like a hacky workaround.
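To make the current behaviour concrete, here is a small sketch (the variable names are only illustrative, and the characters are written as escapes so the example does not depend on how the text is normalized). With only the u-flag, \w, \d, and \b still use the ASCII definitions, while the \p{...} property escapes from ES2018 are Unicode-aware:

```javascript
// Illustrative only: compares the ASCII-based classes with Unicode property escapes.
const results = {
  // \w with the u flag does not match U+00E4 (ä)
  asciiWord: /\w/u.test('\u00E4'),
  // \p{L} (any letter) does match it
  unicodeLetter: /\p{L}/u.test('\u00E4'),
  // \d does not match U+0663 (ARABIC-INDIC DIGIT THREE)
  asciiDigit: /\d/u.test('\u0663'),
  // \p{Nd} (decimal digit) does match it
  unicodeDigit: /\p{Nd}/u.test('\u0663'),
  // \b reports a boundary between "t" and "é" inside the word "été",
  // because "é" is not a \w character
  boundaryInsideWord: /t\b\u00E9/u.test('\u00E9t\u00E9'),
};

console.log(results);
```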

The Python community even went a step further here and made regular expressions Unicode-aware by default, so \w, \b, and \d match all corresponding Unicode characters. You can opt out by passing the a-flag (re.ASCII) to make them ASCII-only. I think this is the correct international solution.

Were there ever any discussions around this?

That would be a breaking change that would break webpages, so it’s not a viable option.

I think that there is no simple or univocal “correct international solution”. Per UAX #29: Unicode Text Segmentation, finding a “word boundary” (the intention of \b) is... challenging.

The Python documentation you linked says that \w matches “Unicode word characters”, without a precise definition of what that means. For example, does it include the middle-dot (·), which is definitely part of words in Catalan (where it is used to separate two Ls of different syllables within a word)?

@claudepache You're right. Correct is not a helpful word here. Thank you for pointing this out and for sharing the article. :)

Python: a Unicode word character is the underscore-character or an alphanumeric character, which, if I understand the source code correctly, is one of these:[1][2][3]

  • Lowercase_Letter (Ll)
  • Uppercase_Letter (Lu)
  • Titlecase_Letter (Lt)
  • Other_Letter (Lo)
  • Modifier_Letter (Lm)
  • Numeric_Type Decimal (De)
  • Numeric_Type Digit (Di)
  • Numeric_Type Numeric (Nu)

The middle-dot (·) seems to be in the category Other_Punctuation (Po), so if I understand the Python source code correctly, this character is not included.
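As a rough sketch of how that definition could be approximated in JavaScript today, the hypothetical helper below (isWordChar is my own name, not an existing API) combines property escapes for letters and numbers with the underscore. It is only an approximation: \p{N} is based on General_Category (Nd, Nl, No), whereas the Python check appears to be based on Numeric_Type, so the two sets may differ at the edges.

```javascript
// Hypothetical helper: approximates Python's Unicode-aware \w
// as "any letter, any number, or underscore".
const pythonLikeWordChar = /^[\p{L}\p{N}_]$/u;

function isWordChar(ch) {
  return pythonLikeWordChar.test(ch);
}

console.log(isWordChar('\u00E4')); // ä, a Lowercase_Letter
console.log(isWordChar('\u0663')); // ARABIC-INDIC DIGIT THREE, a decimal digit
console.log(isWordChar('_'));      // underscore
console.log(isWordChar('\u00B7')); // middle dot, Other_Punctuation
```

Under this approximation the middle-dot is rejected as well, which matches my reading of the Python source.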

P.S. I had to remove most links to the unicode.org tool "Unicode Utilities: UnicodeSet", since apparently new users can only post at most 5 links.

@ljharb: your comment about breaking change was eye-opening for me. Now I more clearly understand the constraints with evolving ES.

To clarify: are you saying that the meaning of the u-flag would have had to be changed before it got released, or are you commenting on the "unicode is default in Python" comment?

I'm saying that Python was able to make the change it did because breaking changes were deemed acceptable (whether it was worth the decade-plus of ecosystem churn is a discussion that's off-topic for this discourse).

JS has no such option, so any defaults now will forever remain the default.

@ljharb thanks for clarifying.

My main curiosity was less about changing the default and more about whether it was ever discussed to have the u-flag imply that \w, \b, and \d match their Unicode counterparts. (Or to have some other new flag, not yet introduced, do this.)

I understand that the backwards-compatibility criteria won't allow current defaults to change.

tc39/proposal-regexp-v-flag (UTS18 set notation in regular expressions) is still at stage 3. Is it feasible to include this in that proposal, or do we need a new flag for this because it’s too late?

I suppose it would be possible by searching for all boundaries in the string beforehand if the regex contains any \b. I understand that this is somewhat heavy, but web engines are expected to do this kind of work quite a lot already.
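A minimal sketch of that "find the boundaries beforehand" idea, assuming an engine that ships Intl.Segmenter (whose word segmentation is, as far as I know, typically based on UAX #29). The function name wordBoundaries and the default locale are my own choices for illustration; this is not how any engine implements \b:

```javascript
// Hypothetical helper: collects the start and end offsets of every
// word-like segment, i.e. the positions a Unicode-aware \b could match.
function wordBoundaries(text, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  const boundaries = new Set();
  for (const { segment, index, isWordLike } of segmenter.segment(text)) {
    if (isWordLike) {
      boundaries.add(index);                  // start of the word
      boundaries.add(index + segment.length); // end of the word
    }
  }
  return [...boundaries].sort((a, b) => a - b);
}

// "été chaud", written with escapes to avoid normalization surprises
console.log(wordBoundaries('\u00E9t\u00E9 chaud'));
```

Unlike the ASCII-based \b, this does not report a boundary inside "été". The cost is a full pass over the input before matching, which is the overhead mentioned above.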