Support sequence properties in Unicode property escapes

yulia · April 17, 2020, 8:11am

Proposal: https://github.com/tc39/proposal-regexp-unicode-sequence-properties

Scope: Identify if \p or \m works better [edited - \q is no longer an option]

Issue: https://github.com/tc39/proposal-regexp-unicode-sequence-properties/issues/10

Summary: @msaboff suggested not overloading the meaning of \p on the grounds that it behaves differently from the existing \p for non-sequence properties. This proposal breaks the invariant that \p always expands to a character class. He suggested \q for sequence. Unfortunately, \Q is a modifier in Perl regular expressions, so using \q might be confusing.

Updated summary below from Mathias

mathiasbynens · April 20, 2020, 5:27am

Since \q{…} is not an option, the latest idea is to use \m{…} instead. So now we have to decide between \p{…} vs. \m{…}.

At TC39, a decision was made to work with the Unicode Technical Committee (UTC) to resolve this upstream, since other languages are interested in this new functionality as well. As a result of this work, a proposed update to UTS18 says:

Properties of Strings (note: the TC39 proposal calls these sequence properties) are properties that can apply to, or match, sequences of two or more characters (in addition to single characters). This is in contrast to the more common case of properties of characters, which are functions of individual code points only. Those properties marked with an asterisk in the Full Properties table are properties of strings. See, for example, Basic_Emoji.

The preferred notation for properties of strings is \p{Property_Name}, the same as for the traditional properties of characters. For regular expressions, properties of strings may appear both within and outside of character class expressions.

As described in Annex E, some character class expressions are invalid when they contain properties of strings. Detection of such invalid expressions should happen early, when the regular expression is first compiled or processed.

Implementations that are constrained in that they do not support strings in character classes should use \m{Property_Name} as an alternate notation for properties of strings appearing outside of character class expressions. \m should also accept ordinary properties of characters; it can be limited in where it may appear, not in what properties it allows.

Implementations with full support for \p and properties of strings in character class expressions may also optionally support the \m syntax.

Implementations that initially adopt \m only for properties of strings, then later add support for strings in character classes, should also add support for \p as alternate syntax for properties of strings.

The problem

Sequence properties, unlike non-sequence properties, cannot be negated. That is, using them with \P{…} must throw an exception. Similarly, they can’t be used within a character class.

Up until now, JavaScript has only supported non-sequence properties, and so wherever you can use \p{Foo}, you can also use \P{Foo} or [\p{Foo}]. If we re-use the existing \p{…} syntax for sequence properties, that will no longer be the case.
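As a minimal sketch of the status quo described above, the snippet below checks that all four forms compile for a non-sequence property under the `u` flag. The helper name `compiles` is hypothetical and only exists for this illustration.

```javascript
// Hypothetical helper: returns true when `source` compiles with the given flags.
function compiles(source, flags = 'u') {
  try {
    new RegExp(source, flags);
    return true;
  } catch (err) {
    // Pattern errors surface as SyntaxError; anything else is unexpected.
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

// Plain, negated, in a character class, and in a negated character class.
const forms = ['\\p{Emoji}', '\\P{Emoji}', '[\\p{Emoji}]', '[^\\p{Emoji}]'];

for (const source of forms) {
  console.log(source, compiles(source) ? 'works' : 'throws');
}
```

Whether the same four forms would compile for a sequence property is exactly what the proposal has to decide.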

There is disagreement whether this difference warrants a dedicated new syntax for sequence properties, instead of re-using the existing \p{…} syntax.

Unified syntax with `\p{…}`

I am strongly in favor of continuing to use \p{…} syntax, even for the new sequence properties. JavaScript already uses \p{…} for what Unicode calls binary properties and enumeration properties. Unicode also defines numeric properties (also using \p{…}) and now string properties (syntax up for discussion).

The current mental model for developers then remains unchanged:

\p{Foo} refers to the Unicode property Foo, and the way that property behaves depends on Unicode's definition of Foo.

Examples, using Emoji as a non-sequence property, and RGI_Emoji_ZWJ_Sequence as a sequence property:

// Unified syntax with \p{…}
\p{Emoji}                       // works
\P{Emoji}                       // works
[\p{Emoji}]                     // works
[^\p{Emoji}]                    // works

\p{RGI_Emoji_ZWJ_Sequence}      // works
\P{RGI_Emoji_ZWJ_Sequence}      // throws an exception

[\p{RGI_Emoji_ZWJ_Sequence}]    // throws an exception
[^\p{RGI_Emoji_ZWJ_Sequence}]   // throws an exception

\p{InVaLiD}                     // throws an exception
\P{InVaLiD}                     // throws an exception
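For reference, "throws an exception" already has a concrete meaning for unknown property names today: with the `u` flag, the pattern is rejected with a `SyntaxError` when it is compiled, before any matching happens. The sketch below shows only that existing behavior; the helper name `compileError` is hypothetical, and nothing here assumes an engine that implements the proposal.

```javascript
// Hypothetical helper: returns the error thrown while compiling, or null.
function compileError(source, flags = 'u') {
  try {
    new RegExp(source, flags);
    return null;
  } catch (err) {
    return err;
  }
}

// Unknown property names are early errors in Unicode mode.
console.log(compileError('\\p{InVaLiD}') instanceof SyntaxError);
console.log(compileError('\\P{InVaLiD}') instanceof SyntaxError);

// A known non-sequence property compiles without error.
console.log(compileError('\\p{Emoji}') === null);
```

Under the unified syntax, the new error cases for sequence properties would presumably be reported the same way, at compile time.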

Disunified syntax with `\p{…}` and `\m{…}`

The alternative is to introduce new \m{…} syntax for sequence properties alongside the existing \p{…} syntax. There would be no negated \M{…} syntax. This makes the distinction clear, but IMHO complicates the mental model for developers:

\p{Foo} refers to the Unicode non-sequence property Foo, while \m{Bar} refers to the Unicode property Bar which can be either a non-sequence or a sequence property. The way these properties behave depends on Unicode's definition of Foo and Bar.

Examples, using Emoji as a non-sequence property, and RGI_Emoji_ZWJ_Sequence as a sequence property:

// Disunified syntax with \p{…} and \m{…}
\p{Emoji}						// works
\P{Emoji}						// works
[\p{Emoji}]						// works
[^\p{Emoji}]					// works
\m{Emoji}						// works
[\m{Emoji}]						// throws an exception
[^\m{Emoji}]					// throws an exception
\p{RGI_Emoji_ZWJ_Sequence}		// throws an exception
\P{RGI_Emoji_ZWJ_Sequence}		// throws an exception
\m{RGI_Emoji_ZWJ_Sequence}		// works
\M{RGI_Emoji_ZWJ_Sequence}		// throws an exception
[\p{RGI_Emoji_ZWJ_Sequence}]	// throws an exception
[^\p{RGI_Emoji_ZWJ_Sequence}]	// throws an exception
[\m{RGI_Emoji_ZWJ_Sequence}]	// throws an exception
[^\M{RGI_Emoji_ZWJ_Sequence}]	// throws an exception
\p{InVaLiD}						// throws an exception
\P{InVaLiD}						// throws an exception
\m{InVaLiD}						// throws an exception
\M{InVaLiD}						// throws an exception

Note that with this approach, \p{Emoji} and \m{Emoji} must both work: the \m{…} syntax is only limited in where it can appear (not within character classes), but it accepts a superset of the properties that \p{…} accepts.

With disunified syntax, the list of possible scenarios grows. IMHO there’s no added value, only increased complexity, because the developer now has to distinguish the two syntactic forms, even when the difference doesn’t matter (which is the common case). Also, for each property/value or property/value alias that JavaScript already supports, there would now be two different syntactic ways to refer to them (\p{ASCII} and \m{ASCII} instead of just \p{ASCII}).
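As background for the \m{…} option, here is a small sketch of how \m behaves in web-compatible engines that do not implement any such extension. Without the `u` flag it is a legacy identity escape for the letter "m"; with the `u` flag it is rejected, which is what leaves room to give it a new meaning there.

```javascript
// Without the u flag, \m is an identity escape that matches a literal "m".
console.log(/\m/.test('m'));

// With the u flag, \m is not a valid escape, so compilation fails.
let unicodeModeError = null;
try {
  new RegExp('\\m{Emoji}', 'u');
} catch (err) {
  unicodeModeError = err;
}
console.log(unicodeModeError instanceof SyntaxError);
```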

What do others think?

We're hoping to learn what the developer community prefers. Which syntax seems most intuitive?

bakkot · April 20, 2020, 6:08am

Copying this bit of discussion over from the linked github thread:

I dispute that this is "the" current mental model for \p.

My own mental model is that \p{Foo} matches a character with the Foo property. I need to know Unicode's definition of Foo to know which characters it will match, but I can be confident it will match exactly one.

The "unified" syntax violates this model, because now \p sometimes matches a character and sometimes matches a sequence of characters, and to know which I need to know whether or not Foo is a sequence property. Which means that as a reader of code I need to know whether or not Foo is a sequence property to have any hope of understanding how the regular expression containing it will behave, because I cannot reason about how much of the string \p might match without this knowledge.
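The "exactly one character" model can be checked against current behavior. In this sketch, "character" is taken to mean one code point under the `u` flag, and the sample strings are just illustrative choices.

```javascript
const grinning = '\u{1F600}'; // one code point, two UTF-16 code units
const family = '\u{1F469}\u200D\u{1F469}\u200D\u{1F467}'; // woman, ZWJ, woman, ZWJ, girl

// Anchored on both sides, \p{Emoji} accepts a single code point only.
const single = /^\p{Emoji}$/u;
console.log(single.test(grinning)); // true
console.log(single.test(family));   // false

// Matching globally splits the ZWJ sequence into its emoji code points;
// the joiners (U+200D) do not have the Emoji property.
const pieces = family.match(/\p{Emoji}/gu);
console.log(pieces.length); // 3
console.log(pieces.every((piece) => [...piece].length === 1)); // true
```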

mathiasbynens · April 20, 2020, 8:29am

The same applies to the disunified syntax: reading \m{Foo} doesn’t tell you whether it matches a single code point or a sequence of code points, since Foo could be a non-sequence property, or it could be a sequence property that matches some single-code-point sequences. All it tells you is that it potentially matches a sequence of code points. So whatever logic you apply in this scenario, you could choose to apply to \p{…} in general to correct your model w.r.t. this new proposal. It seems easier to update your idea of \p{…} to “this matches a sequence of code points” than to introduce new syntax where that same principle would then apply.

msaboff · April 20, 2020, 4:47pm

\m{Foo} is not designed to imply how many code points it matches. It is designed to match any of the sequences with the Foo property. The length of a match of a component of a RegExp is best served by capture groups.
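A rough sketch of the capture-group point: wrap the component in a group and inspect what it captured. Since \m{…} does not exist in engines today, the pattern below is a hand-written approximation of a ZWJ sequence and is not the real RGI_Emoji_ZWJ_Sequence set; the variable names are hypothetical.

```javascript
// Approximation only: two or more Emoji code points joined by U+200D.
const zwjLike = /(\p{Emoji}(?:\u200D\p{Emoji})+)/u;

const text = 'hi \u{1F469}\u200D\u{1F467}!'; // woman, ZWJ, girl
const match = zwjLike.exec(text);
const captured = match[1];

console.log([...captured].length); // 3 code points
console.log(captured.length);      // 5 UTF-16 code units
console.log(match.index);          // 3, the offset where the sequence starts
```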

Although not totally clear, Annex E does allow an implementation to use \m only for sequence properties. I advocate that we do that in JavaScript. In that case, it is clear that \m{...} matches Properties of Strings (aka Sequence Properties) and \p{...} / \P{...} are used for the more common Properties of Characters.

I'm with @bakkot on what is claimed as the current mental model. The RegExp grammar would be clear and make it easier for developers to understand how to use the feature. As I said all along, the syntax conveys usage. If we introduce \m{...} as the only and exclusive escape for sequence properties, we maintain the current mental model and extend it logically. The mental model for developers remains unchanged for \p{...} and \P{...} and the introduction of \m{...} doesn't confuse that mental model.

If however we overload \p{} for sequence properties, we confuse the current mental model of developers. Adding new syntax errors for the \p and \P escapes changes the current mental model. Restricting the use of \p{...} depending on its context (in a character class or not AND depending on the enclosed property type) along with the more restrictive constraints of \P{...}, depending on the type of property enclosed, requires developers to throw out their old mental model and form a new one.

JLHwung · April 20, 2020, 7:31pm

I agree with @msaboff to use \m only for sequence properties. However, the draft UTS #18 does not consider that a full implementation; it states:

\m should also accept ordinary properties of characters; it can be limited in where it may appear, not in what properties it allows.

While the unified approach does acknowledge the design that a property matcher can match a sequence of characters rather than a single one, it could add cognitive load for developers, especially since sequence properties currently only cover emoji. For example, it is not straightforward to tell which of the following matches more emoji (sequences) than the other:

\p{Emoji}
\p{RGI_Emoji}

While \p{RGI_Emoji} looks like a subset of \p{Emoji}, it is in fact quite the opposite. If we use the \m-prefixed alternative \m{RGI_Emoji}, developers are alerted that RGI_Emoji is a sequence property and it should cover all the single characters matched by \p{Emoji}.

I admit that if we have to implement the full disunified syntax where \m supports a feature superset of \p, the trouble it causes outweighs the benefit of preserving the single-character matching behaviour of \p, as both @msaboff and @bakkot have pointed out. We could implement linter rules for proper usage of \p and \m, but this should be resolved at the language design stage.

yulia · April 24, 2020, 10:34am

What is the concrete question we want to answer here? I see two (I might have misunderstood things, so this is not prescriptive, this is an attempt at describing):

If we introduce \m, is the use of \p clear? Or, will having two similar regex property escapes increase confusion?
If we don't introduce \m, will overloading \p as \P have negative effects on the mental model of \p?

That said, this might not be verifiable, since not many developers may be familiar with this yet. Whether it is worth testing is another question we should answer.

gibson042 · June 4, 2020, 1:54am

The primary concrete question that I would like answered is "Of the three syntax options¹, which one minimizes mistakes in comprehension and construction of regular expressions utilizing Unicode string properties?". I am most concerned about syntactically valid regular expression literals that do not mean what was intended/claimed/interpreted/etc., and less concerned about syntactically invalid regular expressions because the latter do a good job of indicating presence of a problem.

¹ syntax options:

1. \p{…} for any Unicode property. Attempted use of a string property inside a character class or with \P{…} negation throws an exception.
2. \p{…} for any single-character property and \m{…} for any string property. Attempted use of \p{…} or \P{…} with a string property throws an exception. Attempted use of \m{…} with a non-string property throws an exception. Attempted use of \m{…} inside a character class or \M{…} anywhere throws an exception.
3. \p{…} for any single-character property and \m{…} for any property. Attempted use of \p{…} or \P{…} with a string property throws an exception. Attempted use of \m{…} inside a character class or \M{…} anywhere throws an exception.
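The three options can be compared side by side with a small model. This is only a sketch of the rules as worded above, not an engine API: the function `outcome` and its parameters are hypothetical, the options are numbered in the order listed, and treating \m as an invalid escape under the first option is an assumption (that option simply does not define it).

```javascript
// Hypothetical model: returns 'works' or 'throws' for one escape under one option.
// option: 1, 2 or 3 (order of the list above)
// escape: 'p', 'P', 'm' or 'M'
function outcome(option, { escape, stringProperty, inClass = false }) {
  // No option defines a negated \M.
  if (escape === 'M') return 'throws';

  if (escape === 'p' || escape === 'P') {
    // Single-character properties keep working everywhere in every option.
    if (!stringProperty) return 'works';
    // Options 2 and 3 reserve \p and \P for single-character properties.
    if (option !== 1) return 'throws';
    // Option 1: string properties cannot be negated or used in a class.
    return escape === 'P' || inClass ? 'throws' : 'works';
  }

  // escape === 'm'
  if (option === 1) return 'throws'; // assumption: \m stays undefined here
  if (inClass) return 'throws';
  if (option === 2 && !stringProperty) return 'throws';
  return 'works';
}

// Print how each option treats a string property outside a character class.
for (const option of [1, 2, 3]) {
  for (const escape of ['p', 'P', 'm', 'M']) {
    console.log(option, '\\' + escape, outcome(option, { escape, stringProperty: true }));
  }
}
```

The only place the second and third options differ in this model is \m{…} with a non-string property outside a character class.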
