Since \q{…} is not an option, the latest idea is to use \m{…} instead. So now we have to decide between \p{…} and \m{…}.
At TC39, a decision was made to work with the Unicode Technical Committee (UTC) to resolve this upstream, since other languages are interested in this new functionality as well. As a result of this work, a proposed update to UTS18 says:
Properties of Strings (note: the TC39 proposal calls these sequence properties) are properties that can apply to, or match, sequences of two or more characters (in addition to single characters). This is in contrast to the more common case of properties of characters, which are functions of individual code points only. Those properties marked with an asterisk in the Full Properties table are properties of strings. See, for example, Basic_Emoji.
The preferred notation for properties of strings is \p{Property_Name}, the same as for the traditional properties of characters. For regular expressions, properties of strings may appear both within and outside of character class expressions.
As described in Annex E, some character class expressions are invalid when they contain properties of strings. Detection of such invalid expressions should happen early, when the regular expression is first compiled or processed.
Implementations that are constrained in that they do not support strings in character classes should use \m{Property_Name} as an alternate notation for properties of strings appearing outside of character class expressions. \m should also accept ordinary properties of characters; it can be limited in where it may appear, not in what properties it allows.
Implementations with full support for \p and properties of strings in character class expressions may also optionally support the \m syntax.
Implementations that initially adopt \m only for properties of strings, then later add support for strings in character classes, should also add support for \p as alternate syntax for properties of strings.
The problem
Sequence properties, unlike non-sequence properties, cannot be negated. That is, using them with \P{…} must throw an exception. Similarly, they can’t be used within a character class.
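To make the distinction concrete, here is a small sketch that uses only the existing u-flag syntax. It shows that a ZWJ sequence spans several code points, which is presumably why it does not fit the one-code-point-at-a-time model that \P{…} and character classes rely on. The family emoji is an illustrative choice, and the variable names are not part of any proposal.

```javascript
// Woman + ZWJ + woman + ZWJ + girl: a single emoji ZWJ sequence ("family").
const family = "\u{1F469}\u200D\u{1F469}\u200D\u{1F467}";

// Spreading a string iterates by code point, not by UTF-16 code unit.
const codePoints = [...family].length;

// The existing \p{Emoji} escape tests one code point at a time,
// so it finds the three emoji individually and skips the two joiners.
const singleEmoji = family.match(/\p{Emoji}/gu);

console.log(codePoints);         // 5
console.log(singleEmoji.length); // 3
```

The string is five code points long, and \p{Emoji} sees three separate emoji in it rather than one sequence.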
Up until now, JavaScript has only supported non-sequence properties, and so wherever you can use \p{Foo}, you can also use \P{Foo} or [\p{Foo}]. If we re-use the existing \p{…} syntax for sequence properties, that will no longer be the case.
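The current behavior can be checked in any engine that supports ES2018 property escapes. This sketch uses Emoji as the non-sequence property; the names forms and results are illustrative only.

```javascript
const grinning = "\u{1F600}"; // a single code point that has the Emoji property

// All four forms compile today for a non-sequence property.
const forms = {
  plain: /^\p{Emoji}$/u,
  negated: /^\P{Emoji}$/u,
  inClass: /^[\p{Emoji}]$/u,
  inNegatedClass: /^[^\p{Emoji}]$/u,
};

// Test each form against an emoji and against a plain letter.
const results = {};
for (const [name, re] of Object.entries(forms)) {
  results[name] = { emoji: re.test(grinning), letter: re.test("a") };
}

console.log(results);
// plain and inClass match the emoji only;
// negated and inNegatedClass match the letter only.
```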
There is disagreement about whether this difference warrants a dedicated new syntax for sequence properties, instead of re-using the existing \p{…} syntax.
Unified syntax with \p{…}
I am strongly in favor of continuing to use \p{…} syntax, even for the new sequence properties. JavaScript already uses \p{…} for what Unicode calls binary properties and enumeration properties. Unicode also defines numeric properties (also using \p{…}) and now string properties (syntax up for discussion).
The current mental model for developers then remains unchanged:
\p{Foo} refers to the Unicode property Foo, and the way that property behaves depends on Unicode's definition of Foo.
Examples, using Emoji as a non-sequence property, and RGI_Emoji_ZWJ_Sequence as a sequence property:
// Unified syntax with \p{…}
\p{Emoji} // works
\P{Emoji} // works
[\p{Emoji}] // works
[^\p{Emoji}] // works
\p{RGI_Emoji_ZWJ_Sequence} // works
\P{RGI_Emoji_ZWJ_Sequence} // throws an exception
[\p{RGI_Emoji_ZWJ_Sequence}] // throws an exception
[^\p{RGI_Emoji_ZWJ_Sequence}] // throws an exception
\p{InVaLiD} // throws an exception
\P{InVaLiD} // throws an exception
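In JavaScript, "throws an exception" already means an early SyntaxError raised when the regular expression is created, not when it is first used. The sketch below checks this with the existing u flag. The helper name compiles is hypothetical. The last call assumes an engine following the u-flag rules as specified, where a sequence property name is not an accepted \p{…} value.

```javascript
// Hypothetical helper: reports whether a pattern compiles under the `u` flag.
function compiles(source) {
  try {
    new RegExp(source, "u");
    return true;
  } catch (error) {
    // Invalid patterns are reported as SyntaxError at construction time.
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

console.log(compiles("\\p{Emoji}"));    // true
console.log(compiles("\\P{Emoji}"));    // true
console.log(compiles("[^\\p{Emoji}]")); // true
console.log(compiles("\\p{InVaLiD}"));  // false
console.log(compiles("\\p{RGI_Emoji_ZWJ_Sequence}")); // false under the `u` flag
```

With the unified syntax, the invalid uses of sequence properties listed above would presumably be reported the same way.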
Disunified syntax with \p{…} and \m{…}
The alternative is to introduce new \m{…} syntax for sequence properties alongside the existing \p{…} syntax. There would be no negated \M{…} syntax. This makes the distinction clear, but IMHO complicates the mental model for developers:
\p{Foo} refers to the Unicode non-sequence property Foo, while \m{Bar} refers to the Unicode property Bar which can be either a non-sequence or a sequence property. The way these properties behave depends on Unicode's definition of Foo and Bar.
Examples, using Emoji as a non-sequence property, and RGI_Emoji_ZWJ_Sequence as a sequence property:
// Disunified syntax with \p{…} and \m{…}
\p{Emoji} // works
\P{Emoji} // works
[\p{Emoji}] // works
[^\p{Emoji}] // works
\m{Emoji} // works
[\m{Emoji}] // throws an exception
[^\m{Emoji}] // throws an exception
\p{RGI_Emoji_ZWJ_Sequence} // throws an exception
\P{RGI_Emoji_ZWJ_Sequence} // throws an exception
\m{RGI_Emoji_ZWJ_Sequence} // works
\M{RGI_Emoji_ZWJ_Sequence} // throws an exception
[\p{RGI_Emoji_ZWJ_Sequence}] // throws an exception
[^\p{RGI_Emoji_ZWJ_Sequence}] // throws an exception
[\m{RGI_Emoji_ZWJ_Sequence}] // throws an exception
[^\m{RGI_Emoji_ZWJ_Sequence}] // throws an exception
\p{InVaLiD} // throws an exception
\P{InVaLiD} // throws an exception
\m{InVaLiD} // throws an exception
\M{InVaLiD} // throws an exception
Note that with this approach, \p{Emoji} and \m{Emoji} must both work: the \m{…} syntax is only limited in where it can appear (not within character classes), but it accepts a superset of the properties that \p{…} accepts.
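The rules behind the two example tables above can be condensed into a small model. This is only a sketch for comparing the proposals: outcome and its parameters are hypothetical names, and no engine exposes such a function.

```javascript
// Hypothetical model of the two proposals; not an engine API.
// syntax:  "unified" or "disunified"
// escape:  "p", "P", "m", or "M"
// kind:    "char" (non-sequence property), "string" (sequence property), or "invalid"
// inClass: whether the escape appears inside [...] or [^...]
function outcome(syntax, escape, kind, inClass) {
  if (kind === "invalid") return "throws";
  if (syntax === "unified") {
    if (escape !== "p" && escape !== "P") return "throws"; // \m and \M do not exist
    if (kind === "char") return "works";
    // Sequence property: only the non-negated form outside a character class.
    return escape === "p" && !inClass ? "works" : "throws";
  }
  // Disunified syntax.
  if (escape === "M") return "throws"; // there is no negated \M{…}
  if (escape === "m") return inClass ? "throws" : "works";
  // \p and \P accept only non-sequence properties.
  return kind === "char" ? "works" : "throws";
}

console.log(outcome("unified", "p", "string", false));    // "works"
console.log(outcome("disunified", "p", "string", false)); // "throws"
console.log(outcome("disunified", "m", "char", true));    // "throws"
```

In this model the unified branch depends on the property kind alone, while the disunified branch also has to consider which of the two escape letters was used.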
With disunified syntax, the list of possible scenarios grows. IMHO there’s no added value, only increased complexity, because the developer now has to distinguish the two syntactic forms, even when the difference doesn’t matter (which is the common case). Also, for each property/value or property/value alias that JavaScript already supports, there would now be two different syntactic ways to refer to them (\p{ASCII} and \m{ASCII} instead of just \p{ASCII}).
What do others think?
We're hoping to learn what the developer community prefers. Which syntax seems most intuitive?