Summary
This proposal aims to introduce an efficient way to match a RegExp
against a subset of a string without the overhead of creating a substring copy with String.prototype.slice
.
The Problem
For some string processing use cases, such as parsing source code, it is useful or necessary to split a string into tokens of different types (for example numbers, identifiers, operators, etc.). Most naturally this is done by repeatedly matching the beginning of the string against a regular expression for each possible token type and, once a match is found, to produce that token and move forward in the string:
const numberPattern = /^(-?[0-9]+)/;
const identifierPattern = /^(\p{L}[\p{L}\p{N}]*)/u;
// …
function* tokenize(string) {
let rest = string;
while (string) {
let match;
if ((match = numberPattern.exec(rest))) {
yield { type: 'number', text: match[1] };
} else if ((match = identifierPattern.exec(rest))) {
yield { type: 'identifier', text: match[1] };
} // …
rest = rest.slice(match[0].length);
}
}
The issue with this approach is that each call to rest.slice(…)
potentially allocates a copy of most of the input string. This means that if the input string is large, a tokenization algorithm like the one above may be very inefficient. (While some ECMAScript implementations like V8 do not actually copy the original string’s contents on slice
, this is not guaranteed by the ECMAScript specification.)
To work around this problem, developers have two choices, both unsatisfactory:
- Remove the start-of-string anchor (
^
) from each regular expression and make sure to only process matches whoseindex
equals the position right after the previous match. The downside is that in the case of a failed match,exec
will traverse much more of the string than necessary, degrading performance. - Avoid regular expressions and match manually on the character level. This is not only much more verbose, but also often requires duplicating a lot of functionality supported by the regular expression engine, but is not otherwise exposed by the ECMAScript standard library, such as matching codepoints against Unicode general categories.
Proposed Solution
The idea of this proposal is to address the problem above by extending RegExp
and String
with support for substring matching. The following methods would be affected:
RegExp.prototype.exec
RegExp.prototype.test
String.prototype.match
Each of these methods would accept two additional optional arguments start
and end
. If passed, these define the start (inclusive) and end (exclusive) index of the substring against which the regular expression should be matched. Matching happens as if this substring were the full input string; for instance, the ^
and $
anchors match the start and end of the substring.
const str = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
const match = /^[A-Z]+$/.exec(str, 9, 12);
match // => "JKL"
start
and end
are interpreted the same as if they were passed to String.prototype.slice
. This means that negative indices are counted from the end of the string, a start
without end
extends the substring to the end of the input, and if end
is equal to or before start
, then the matching is performed against an empty substring.
Handling of Match Indices
When performing a substring match, the indices stored in the returned match (index
/ indices
) are relative to the start of the original input string, not the substring.
const str = "123 456";
const match = /[0-9]+$/.exec(str, 3);
match // => "456"
match.index // => 4
Similarly, if a substring match is performed with a RegExp
that has the g
or y
flag and a lastIndex
, then this index is also interpreted and updated relative to the full string’s start. If the lastIndex
is outside of the substring bounds, there is no match and lastIndex
is set to 0.
Example
The code below shows the same tokenizer as above, implemented with substring matching instead of String.prototype.slice
.
const numberPattern = /^(-?[0-9]+)/;
const identifierPattern = /^(\p{L}[\p{L}\p{N}]*)/u;
// …
function* tokenize(string) {
let position = 0;
while (position < string.length) {
let match;
if ((match = numberPattern.exec(rest, position))) {
yield { type: 'number', text: match[1] };
} else if ((match = identifierPattern.exec(rest, position))) {
yield { type: 'identifier', text: match[1] };
} // …
position += match[0].length;
}
}