Unicode-aware regular expressions

mikez · September 5, 2022, 11:45am

ECMAScript 2015 added the u-flag to regular expressions (it is also reported by RegExp.prototype.flags).

Was it ever discussed to have the u-flag imply that \w, \b, and \d match the corresponding Unicode characters for all languages, not just ASCII?

So for example, this would return true in such a scenario:

/\w/u.test('ä')

As it is now, it seems very English-centric to me. Having to write \p{L} instead feels like a hacky workaround.
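For illustration, here is a minimal sketch of the current behavior as I understand it. The variable names are only illustrative, and the characters are written as escapes (U+00E4 is ä, U+0663 is the Arabic-Indic digit three) so the snippet does not depend on file encoding:

```javascript
// With the u flag (and no i flag), \w and \d still match only ASCII:
// [A-Za-z0-9_] and [0-9] respectively.
const asciiWord = /^\w$/u;
const asciiDigit = /^\d$/u;

// Unicode property escapes are the current opt-in; they require the u flag.
const anyLetter = /^\p{L}$/u;
const anyDecimalDigit = /^\p{Nd}$/u;

console.log(asciiWord.test('\u00E4'));       // false
console.log(anyLetter.test('\u00E4'));       // true
console.log(asciiDigit.test('\u0663'));      // false
console.log(anyDecimalDigit.test('\u0663')); // true
```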

The Python community even went a step further here and made regular expressions Unicode-aware by default, so \w, \b, and \d match all corresponding Unicode characters. You can opt out by passing the a-flag to make them ASCII-only. I think this is the correct international solution.

Were there ever any discussions around this?

The Python documentation you linked says that \w matches “Unicode word characters”, without a precise definition of what that means. For example, does it include the middle dot (·), which is definitely part of words in Catalan (where it is used to separate two Ls belonging to different syllables within a word)?

mikez · September 5, 2022, 9:23pm

@claudepache You're right. "Correct" is not a helpful word here. Thank you for pointing this out and for sharing the article. :)

Python: a Unicode word character is the underscore or an alphanumeric character. If I understand the source code correctly, alphanumeric means the character has one of these General_Category or Numeric_Type values:

Lowercase_Letter (Ll)
Uppercase_Letter (Lu)
Titlecase_Letter (Lt)
Other_Letter (Lo)
Modifier_Letter (Lm)
Decimal (De)
Digit (Di)
Numeric (Nu)

The middle dot (·) is in the category Other_Punctuation (Po), so if I understand the Python source code correctly, this character is not included.
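As a rough sketch, the list above can be approximated in JS with property escapes. This is an assumption-laden approximation, not Python's actual definition: JS property escapes do not expose Numeric_Type, so \p{N} stands in for the three numeric entries, and the name pyLikeWord is made up for this example.

```javascript
// Approximation of the word-character set described above:
// any letter (L*), any number (N*, standing in for Numeric_Type), or underscore.
const pyLikeWord = /^[\p{L}\p{N}_]$/u;
const pyLikeWordRun = /[\p{L}\p{N}_]+/gu;

console.log(pyLikeWord.test('\u00E4')); // true  (ä, Lowercase_Letter)
console.log(pyLikeWord.test('\u0663')); // true  (Arabic-Indic digit three)
console.log(pyLikeWord.test('_'));      // true
console.log(pyLikeWord.test('\u00B7')); // false (middle dot, Other_Punctuation)

// The Catalan word "col·legi" is therefore split at the middle dot.
console.log('col\u00B7legi'.match(pyLikeWordRun)); // [ 'col', 'legi' ]
```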

P.S. I had to remove most links to the unicode.org tool "Unicode Utilities: UnicodeSet", since apparently new users can only post at most 5 links.

mikez · September 5, 2022, 9:27pm

@ljharb: your comment about breaking changes was eye-opening for me. I now understand the constraints on evolving ES more clearly.

To clarify: are you saying the meaning of the u-flag would have had to be changed before it was released, or are you commenting on the "Unicode is the default in Python" remark?

ljharb · September 5, 2022, 10:18pm

I'm saying that Python was able to make the change it did because breaking changes were deemed acceptable (whether it was worth the decade-plus of ecosystem churn is a discussion that's off-topic for this discourse).

JS has no such option, so any defaults now will forever remain the default.

mikez · September 6, 2022, 8:04am

@ljharb thanks for clarifying.

My main curiosity was less about changing the default and more about whether it was ever discussed to have the u-flag imply that \w, \b, and \d match their corresponding Unicode characters. (Or some other new flag which hasn't been introduced yet.)

I understand that the backwards-compatibility criteria won't allow current defaults to change.

graphemecluster · September 8, 2022, 2:11pm

tc39/proposal-regexp-v-flag (UTS18 set notation in regular expressions) is still at stage 3. Is it feasible to include this in the proposal, or do we need a new flag for this because it’s too late?

graphemecluster · September 8, 2022, 2:22pm

I suppose it would be possible by searching for all boundaries in the string beforehand if the regex contains any \b. I understand that this is somewhat heavy, but web engines are expected to do this quite a lot already.
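A userland sketch of that idea, under my own assumptions: word characters are taken to be letters, numbers, and underscore (not a full UAX #29 segmentation), and the helper names are hypothetical. It collects all boundary positions up front using lookarounds:

```javascript
// Assumed definition of a word character for this sketch.
const WORD = '[\\p{L}\\p{N}_]';

// A boundary is a position where exactly one side is a word character.
const BOUNDARY = new RegExp(
  `(?<!${WORD})(?=${WORD})|(?<=${WORD})(?!${WORD})`,
  'gu'
);

// Hypothetical helper: every match is zero-width, so index is the boundary.
function unicodeWordBoundaries(str) {
  return Array.from(str.matchAll(BOUNDARY), (m) => m.index);
}

// Same collection using the built-in, ASCII-only \b for comparison.
function asciiWordBoundaries(str) {
  return Array.from(str.matchAll(/\b/g), (m) => m.index);
}

// "über gut", with ü written as an escape.
console.log(unicodeWordBoundaries('\u00FCber gut')); // [ 0, 4, 5, 8 ]
console.log(asciiWordBoundaries('\u00FCber gut'));   // [ 1, 4, 5, 8 ]
```

The built-in \b sees a boundary between ü and b because ü is not an ASCII word character, which is the behavior the original question is about.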
