Proposal: RegExp Substring Matching

Summary

This proposal aims to introduce an efficient way to match a RegExp against a portion of a string without the overhead of creating a substring copy with String.prototype.slice.

The Problem

For some string-processing use cases, such as parsing source code, it is useful or necessary to split a string into tokens of different types (for example numbers, identifiers, operators, etc.). Most naturally, this is done by repeatedly matching the beginning of the string against a regular expression for each possible token type and, once a match is found, producing that token and moving forward in the string:

const numberPattern = /^(-?[0-9]+)/;
const identifierPattern = /^(\p{L}[\p{L}\p{N}]*)/u;
// …

function* tokenize(string) {
  let rest = string;

  while (rest) {
    let match;

    if ((match = numberPattern.exec(rest))) {
      yield { type: 'number', text: match[1] };
    } else if ((match = identifierPattern.exec(rest))) {
      yield { type: 'identifier', text: match[1] };
    } // …

    rest = rest.slice(match[0].length);
  }
}

The issue with this approach is that each call to rest.slice(…) potentially allocates a copy of most of the input string. This means that if the input string is large, a tokenization algorithm like the one above may be very inefficient. (While some ECMAScript implementations like V8 do not actually copy the original string’s contents on slice, this is not guaranteed by the ECMAScript specification.)

To work around this problem, developers have two choices, both unsatisfactory:

  1. Remove the start-of-string anchor (^) from each regular expression and make sure to only process matches whose index equals the position right after the previous match (sketched after this list). The downside is that in the case of a failed match, exec will traverse much more of the string than necessary, degrading performance.
  2. Avoid regular expressions and match manually at the character level. This is not only much more verbose, but it also often requires duplicating functionality that the regular expression engine supports but the ECMAScript standard library does not otherwise expose, such as matching code points against Unicode general categories.
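
A rough sketch of workaround 1 with today's API (the helper name matchNumberAt is only for illustration). The g flag makes exec start searching at lastIndex, but when there is no token at position, exec still scans the rest of the string before failing or finding a later match:

// unanchored, global
const numberPattern = /-?[0-9]+/g;

function matchNumberAt(string, position) {
  numberPattern.lastIndex = position;
  const match = numberPattern.exec(string);
  // exec may have scanned far beyond `position` before finding this match
  // (or before failing), so the index check is required and a non-match at
  // `position` still costs a traversal of the remaining string.
  return match !== null && match.index === position ? match : null;
}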

Proposed Solution

The idea of this proposal is to address the problem above by extending RegExp and String with support for substring matching. The following methods would be affected:

  • RegExp.prototype.exec
  • RegExp.prototype.test
  • String.prototype.match

Each of these methods would accept two additional optional arguments start and end. If passed, these define the start (inclusive) and end (exclusive) index of the substring against which the regular expression should be matched. Matching happens as if this substring were the full input string; for instance, the ^ and $ anchors match the start and end of the substring.

const str = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
const match = /^[A-Z]+$/.exec(str, 9, 12);
match[0] // => "JKL"

start and end are interpreted the same as if they were passed to String.prototype.slice. This means that negative indices are counted from the end of the string, a start without end extends the substring to the end of the input, and if end is equal to or before start, then the matching is performed against an empty substring.
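
For illustration, this is how the proposed semantics would play out (hypothetical, since these overloads do not exist today):

const str = "ABCDEFGHIJ";

// Negative indices count from the end, as with String.prototype.slice:
/^[D-G]+$/.exec(str, -7, -3)[0]; // => "DEFG" (same substring as str.slice(-7, -3))

// A start without an end extends the substring to the end of the input:
/^[H-J]+$/.exec(str, 7)[0];      // => "HIJ"

// If end is equal to or before start, matching runs against the empty string:
/^$/.test(str, 5, 3);            // => true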

Handling of Match Indices

When performing a substring match, the indices stored in the returned match (index / indices) are relative to the start of the original input string, not the substring.

const str = "123 456";
const match = /[0-9]+$/.exec(str, 3);
match[0] // => "456"
match.index // => 4

Similarly, if a substring match is performed with a RegExp that has the g or y flag and a lastIndex, then this index is also interpreted and updated relative to the full string’s start. If the lastIndex is outside of the substring bounds, there is no match and lastIndex is set to 0.
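
Again hypothetically, under the proposed semantics this would look like:

const str = "foo bar baz";
const word = /[a-z]+/y;

// lastIndex is interpreted relative to the full string, even when
// substring bounds are given:
word.lastIndex = 4;
word.exec(str, 4, 7)[0]; // => "bar"; word.lastIndex is now 7

// A lastIndex outside the substring bounds yields no match and resets
// lastIndex to 0:
word.lastIndex = 9;
word.exec(str, 4, 7);    // => null; word.lastIndex is now 0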

Example

The code below shows the same tokenizer as above, implemented with substring matching instead of String.prototype.slice.

const numberPattern = /^(-?[0-9]+)/;
const identifierPattern = /^(\p{L}[\p{L}\p{N}]*)/u;
// …

function* tokenize(string) {
  let position = 0;

  while (position < string.length) {
    let match;

    if ((match = numberPattern.exec(string, position))) {
      yield { type: 'number', text: match[1] };
    } else if ((match = identifierPattern.exec(string, position))) {
      yield { type: 'identifier', text: match[1] };
    } // …

    position += match[0].length;
  }
}

Isn't this what the sticky flag /y and .lastIndex already solve?

// sticky, not anchored
const numberPattern = /(-?[0-9]+)/y;
const identifierPattern = /(\p{L}[\p{L}\p{N}]*)/yu;
// …

function* tokenize(string) {
  let position = 0;

  while (position < string.length) {
    let match;
    numberPattern.lastIndex = position;
    identifierPattern.lastIndex = position;
    if ((match = numberPattern.exec(string))) {
      yield { type: 'number', text: match[1] };
    } else if ((match = identifierPattern.exec(string))) {
      yield { type: 'identifier', text: match[1] };
    } // …

    position += match[0].length;
  }
}

(but I agree that passing the offset as an argument to .exec() is more elegant than assigning the .lastIndex)


Can you explain your evidence that "the overhead of creating a substring copy" is a significant obstacle for any use cases? What benchmarks have you run, and on which engines?

Isn't this what the sticky flag /y and .lastIndex already solve?

The sticky flag does not fully solve the problem because ^ is always anchored to the beginning of the string, not to the current lastIndex. (Interestingly, Firefox did once anchor ^ to lastIndex, but this was a bug.)
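
Concretely, with today's behavior:

const str = "abc 123";
const anchoredSticky = /^[0-9]+/y;
anchoredSticky.lastIndex = 4;
// ^ asserts position 0 of the whole string, not lastIndex, so this fails
// even though digits start exactly at lastIndex:
anchoredSticky.exec(str); // => null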

If one omits ^, as in your example, there is still the problem mentioned for workaround #1 in the “Problem” section: if the regular expression does not match directly at lastIndex, then RegExp.prototype.exec will attempt to match the expression against all later start indices of the string, which slows down processing and requires an additional match.index === lastIndex check.

No. That's precisely what the sticky flag prevents. It will only attempt to match at the single position indicated by .lastIndex.
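
To illustrate the difference with today's behavior:

const str = "abc 123";

const stickyDigits = /[0-9]+/y;
stickyDigits.lastIndex = 0;
stickyDigits.exec(str); // => null: sticky only tries position 0 and gives up

const globalDigits = /[0-9]+/g;
globalDigits.lastIndex = 0;
globalDigits.exec(str)[0]; // => "123": the global flag scans forward from lastIndex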

The primary use case I have in mind is regex-based parsing / lexical analysis, which is relevant for implementing tooling for JS or compile-to-JS languages, for instance.

I admit that I have not run any benchmarks yet, and will try to do so across some JS engines. However, I’m aware that at least V8 and SpiderMonkey avoid copying by reusing the original string’s buffer for slices, so the problem motivating this proposal is likely small or non-existent there. (I assume that JavaScriptCore works the same, but I couldn’t find any info on that and didn’t dig into the codebase yet.)

Why do I still think the proposal is a good idea?

  1. While the “big three” may implement slice without copying, this is not guaranteed by the ECMAScript spec. There are non-toy engines that always create copies (e.g. QuickJS, Hermes), so this is not only a theoretical issue.
  2. The non-copying behavior is somewhat controversial, as it can prevent big strings from being collected even if only a small slice is still referenced. There is no reliable way to avoid this issue, since forcing the slice to be an actual copy is not possible. Therefore, it is possible that the implementers of the major engines will go back to a copying slice some day (much as Oracle removed character-buffer sharing from OpenJDK’s String implementation), especially if the use cases requiring efficient slicing are eventually covered by other mechanisms like the one proposed here.

Sorry, yes, you are of course right. I have misunderstood how lastIndex works with the sticky flag.

So I stand corrected: there is a solution available today for the problem I explained. This weakens the case for the proposal, of course. :sweat_smile: The question now is whether there are remaining use cases not covered by lastIndex, and if not, whether the readability benefit of direct substring matching still makes this worth pursuing.

Nit: this only anchors the start. There's no way to similarly anchor the end.
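
For example (today's behavior): lastIndex with the sticky flag pins the start of the match, but $ still refers to the end of the whole string, so there is no way to require the match to end at an arbitrary index:

const str = "123 456 789";

const digits = /[0-9]+/y;
digits.lastIndex = 4;
digits.exec(str)[0]; // => "456" (start pinned at index 4)

const anchoredEnd = /[0-9]+$/y;
anchoredEnd.lastIndex = 4;
anchoredEnd.exec(str); // => null, because "456" does not reach the end of str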

In high-performance runtimes, I feel the perf argument is pointless since string slicing is extremely cheap (bump allocation out of the nursery, zero copying needed and only maybe some tree descent that the regexp engine would necessarily have to do anyways).

In embedded runtimes targeting small MCUs, I could see the use, since some of them (notably JerryScript) actually copy the sliced bytes into a new byte buffer. But on these, you rarely have a need to parse a lot of strings like that.

All told, I'm not convinced there's a serious need here.