Was it ever discussed to have the u-flag imply that \w, \b, and \d match the corresponding Unicode characters in all languages?
So for example, this would return true in such a scenario:
/\w/u.test('ä')
As it is now, it seems to me very English-centric. Having to do \p{L} seems like a hacky workaround.
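To make the current behavior concrete, here is a minimal sketch of what the u-flag does and does not change today. The variable names are only illustrative, and the strings use escapes so that the precomposed characters are unambiguous.

```javascript
// With the u-flag as currently specified, \w, \d, and \b remain ASCII-based.
const aUmlaut = '\u00E4';     // ä (LATIN SMALL LETTER A WITH DIAERESIS)
const arabicThree = '\u0663'; // ٣ (ARABIC-INDIC DIGIT THREE)

console.log(/\w/u.test(aUmlaut));          // false
console.log(/\p{L}/u.test(aUmlaut));       // true: property escapes need the u (or v) flag
console.log(/\d/u.test(arabicThree));      // false
console.log(/\p{Nd}/u.test(arabicThree));  // true

// \b is defined in terms of \w, so a lone 'ä' has no word boundary around it.
console.log(/\b\u00E4/u.test(aUmlaut));    // false
```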
The Python community even went a step further here and made regular expressions Unicode-aware by default, so \w, \b, and \d match all corresponding Unicode characters. You can opt out by passing the a-flag (re.ASCII) to make them ASCII-only. I think this is the correct international solution.
I think that there is no simple or univocal “correct international solution”. Per UAX #29: Unicode Text Segmentation, finding a “word boundary” (the intention of \b) is... challenging.
The Python documentation you linked says that \w matches “Unicode word characters”, without precisely defining what that means. For example, does it include the middle dot (·), which is definitely part of words in Catalan (where it separates two Ls belonging to different syllables within a word)?
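For comparison, JavaScript already exposes word segmentation outside of regular expressions through Intl.Segmenter, which, as far as I know, follows UAX #29 with locale tailoring. A minimal sketch, where wordLikeSegments is a hypothetical helper name and the exact segments depend on the engine's ICU data:

```javascript
// Collect the word-like segments of a string using the built-in Intl.Segmenter.
// wordLikeSegments is a hypothetical helper, not a standard API.
function wordLikeSegments(text, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  const words = [];
  for (const { segment, isWordLike } of segmenter.segment(text)) {
    // isWordLike is false for whitespace and punctuation segments.
    if (isWordLike) words.push(segment);
  }
  return words;
}

console.log(wordLikeSegments('Hello, world'));
// Catalan word containing a middle dot (U+00B7); whether it stays in one
// segment depends on the segmentation rules the engine ships.
console.log(wordLikeSegments('col\u00B7lega', 'ca'));
```

This does not give \b new semantics; it only shows that a boundary definition richer than "\w next to non-\w" is available in the platform.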
@claudepache You're right. Correct is not a helpful word here. Thank you for pointing this out and for sharing the article. :)
Python: a Unicode word character is the underscore-character or an alphanumeric character, which, if I understand the source code correctly, is one of these:[1][2][3]
The middle-dot (·) seems to be in the category Other_Punctuation (Po), so if I understand the Python source code correctly, this character is not included.
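The same check can be sketched in JavaScript with property escapes. The class below is only an approximation of Python's rule (letters, numbers, and the underscore), and pythonLikeWordChar is a made-up name for illustration; Python's numeric tests are not guaranteed to line up exactly with \p{N}.

```javascript
// Hypothetical approximation of Python's Unicode-aware \w:
// any letter (L*), any number (N*), or the underscore.
const pythonLikeWordChar = /^[\p{L}\p{N}_]$/u;

const middleDot = '\u00B7'; // ·

console.log(pythonLikeWordChar.test('\u00E4')); // true: ä is a lowercase letter (Ll)
console.log(pythonLikeWordChar.test('_'));      // true
console.log(pythonLikeWordChar.test(middleDot)); // false
console.log(/\p{Po}/u.test(middleDot));          // true: Other_Punctuation
```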
P.S. I had to remove most links to the unicode.org tool "Unicode Utilities: UnicodeSet", since apparently new users can only post at most 5 links.
@ljharb: your comment about breaking change was eye-opening for me. Now I more clearly understand the constraints with evolving ES.
To clarify: are you saying the meaning of the u-flag would have had to be changed before it got released, or are you commenting on the "unicode is default in Python" comment?
I'm saying that Python was able to make the change it did because breaking changes were deemed acceptable (whether it was worth the decade-plus of ecosystem churn is a discussion that's off-topic for this discourse).
JS has no such option, so any defaults now will forever remain the default.
My main curiosity was less about breaking the default and more about whether it was ever discussed to have the u-flag imply that \w, \b, and \d match their corresponding Unicode characters (or to have some other, not yet introduced flag do so).
I understand that the backwards-compatibility criteria won't allow current defaults to change.
I suppose it would be possible by searching for all boundaries in the string beforehand if the regex contains any \b. I understand that this is somewhat heavy, but web engines are expected to do this kind of work quite a lot already.
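A rough userland sketch of such a pre-scan, assuming a "word character" is any letter, number, or underscore (an assumption for illustration, not anything specified). The names isWordChar and unicodeBoundaries are hypothetical.

```javascript
// Assumed word class: letters, numbers, underscore. Not a spec definition.
const isWordChar = (ch) => ch !== undefined && /^[\p{L}\p{N}_]$/u.test(ch);

// Return the UTF-16 offsets where a word character meets a non-word
// character (or the start/end of the string), i.e. \b-style boundaries.
function unicodeBoundaries(text) {
  const chars = Array.from(text); // iterate by code point, not code unit
  const boundaries = [];
  let offset = 0;
  for (let i = 0; i <= chars.length; i++) {
    if (isWordChar(chars[i - 1]) !== isWordChar(chars[i])) boundaries.push(offset);
    if (i < chars.length) offset += chars[i].length;
  }
  return boundaries;
}

const sample = 'f\u00FCr alle'; // "für alle"
console.log(unicodeBoundaries(sample));                      // [0, 3, 4, 8]
console.log([...sample.matchAll(/\b/g)].map((m) => m.index)); // [0, 1, 2, 3, 4, 8]
```

The second line shows the built-in \b splitting "für" around the ü, which is the behavior being questioned in this thread. The pre-scan is linear in the string length, so the cost is mostly the extra pass rather than anything asymptotically worse.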