RegExp.prototype.exec alternative that doesn't allocate a new object on every execution

A common pattern when parsing is to use a global RegExp as a lexer, and loop while the value returned by lexer.exec(source) isn't null.

This is efficient as far as parsing goes, but it allocates a fresh match array on every iteration, which isn't optimal.
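For reference, a minimal sketch of that pattern (the token grammar is made up for illustration):

const source = 'let x = 42'
const lexer = /\d+|[A-Za-z_]\w*|\s+/g
let match
while ((match = lexer.exec(source)) !== null) {
  // parse here; match is a brand-new array on every iteration
}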

Proposed alternatives (assume RegExp.success is a symbol):

const result = []
while (lexer.exec(input, result) && result[RegExp.success]) {
  // parse here
}

or:

const result = []
const into = {result}
while (lexer.exec(input, into) && result[RegExp.success]) {
  // parse here
}

In both cases, exec would blank the previous results before matching.

Another possibility:

const result = lexer.result
while (lexer.execInResult(input) && result[RegExp.success]) {
  // do the parsing here
}

...where result would be an object with non-configurable getters into the actual results.
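To make that concrete, here's a rough userland sketch (makeResult and the state shape are hypothetical; in an actual engine the getters would read the internal match offsets, so a slice is only created for a group the caller actually reads):

RegExp.success ??= Symbol('RegExp.success') // as assumed above

function makeResult(state, groupCount) {
  // state: { input, ok, starts, ends }, mutated in place on every match
  const result = {}
  Object.defineProperty(result, RegExp.success, {
    configurable: false,
    get: () => state.ok
  })
  for (let i = 0; i <= groupCount; i++) {
    Object.defineProperty(result, i, {
      configurable: false,
      get: () => state.ok && state.starts[i] !== -1
        ? state.input.slice(state.starts[i], state.ends[i])
        : undefined
    })
  }
  return result
}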

The GC cost here is fairly minimal - 90% of the time spent is just string processing, and the object is likely allocated in and reused from the nursery on subsequent iterations anyway, since it's never retained across calls. It's also not nearly large enough for most GCs to promote it immediately.

Personally, I find .matchAll(...) more useful than this, since it turns into a simple for loop. It may seem heavier, but it really isn't. And in my experience, when GC allocation actually matters, you've already stopped using regexps, and so almost 100% of allocations are already explicit.
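For comparison, that same loop written with .matchAll(), using the toy lexer from the first sketch:

for (const match of source.matchAll(/\d+|[A-Za-z_]\w*|\s+/g)) {
  // parse here
}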

.matchAll() is far less flexible, though. You can't switch lexers mid-stream (e.g. when lexing RegExp sources themselves: one lexer for character classes and another for the rest of the source).
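A minimal sketch of that kind of hand-off, using two sticky RegExps that share a position (the token grammar is deliberately simplified):

const outside = /\\.|\[|[^\\[]+/y  // escape, class opener, or a plain run
const inside = /\\.|\]|[^\\\]]+/y  // escape, class closer, or a class-body run

function* lexRegExpSource(source) {
  let lexer = outside
  let pos = 0
  while (pos < source.length) {
    lexer.lastIndex = pos
    const match = lexer.exec(source)
    if (match === null) throw new SyntaxError(`unexpected character at ${pos}`)
    pos = lexer.lastIndex
    yield match[0]
    // hand off to the other lexer at character-class boundaries
    if (match[0] === '[') lexer = inside
    else if (match[0] === ']') lexer = outside
  }
}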

Also, you can't beat native code that relies on SIMD operations with JS.

It's not just GC: re-using the same object also improves cache locality.

What SIMD operations with JS? SIMD was withdrawn, and is only intended to be available via WASM.

Let's rephrase that; it was indeed ambiguous:

"Also, with JS, you can't beat native code that relies on SIMD operations."

The speedup you'd get from vectorization is minimal compared to not allocating hundreds to thousands of small string slices you don't need. 😉

If you use character codes, you get that locality naturally, but the moment you do a string traversal, you've thrown all that effort away, because the cache is now filled with string data instead. String allocation disrupts it further. So pretty much all your hopes and dreams for that are destroyed before control returns to JS, and then it's all a wash and you're back to square one.

You have a similar issue with String.fromCharCode, but it's a lot easier to recover from that when all you have are raw integers and occasional booleans with almost no garbage.
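Concretely, "using character codes" means scanning with charCodeAt(), which yields raw integers and allocates nothing per character; a trivial example:

function skipDigits(source, pos) {
  // advance past ASCII digits without creating any intermediate strings
  while (pos < source.length) {
    const c = source.charCodeAt(pos)
    if (c < 48 /* '0' */ || c > 57 /* '9' */) break
    pos++
  }
  return pos
}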

The small string slices are not necessarily useless.

Also, by having a dedicated result object with getters, you could defer the creation of the slices (indeed, result[0] often isn't useful when what you're after is the captures in the disjunction arms, and most of the captures will be undefined). And for the lexemes you don't need, rather than capturing them, you can put an empty capture at the end of the disjunction arm.
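A minimal sketch of that empty-capture trick (the token grammar is made up): each arm ends in (), so the group is an empty string when that arm matched and undefined otherwise, and no lexeme gets sliced just to tag the arm:

const lexer = /\d+()|[A-Za-z_]\w*()|\s+()/y

function tokenType(match) {
  // the empty string marks which arm matched
  if (match[1] !== undefined) return 'number'
  if (match[2] !== undefined) return 'identifier'
  return 'whitespace'
}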

It's also a code size/runtime trade-off, and a readability trade-off. I'm trying to get the best possible perf out of the RegExp engine while keeping the parsers readable.

  1. The memory has to exist to cache the result either way.
  2. The array would in practice likely be generated as a complete step anyway, and the engine will allocate an extra backing array for it even if it's not exposed directly - that way it can index the array (which is faster to access) to read and cache individual groups.
  3. What kills performance is the memory indirection more so than the allocations themselves. GC allocators are efficient, and they can allocate in the nursery with as few as a couple of instructions in some cases. But try reading any of those generated strings and you will see performance falter quickly. Edit: To summarize: the allocations are death by a thousand papercuts. The indirections are death by a hundred thousand deep knife wounds.

Also, you normally only get about a 1.5-2x speedup using SSE4 in x86 regexp implementations - not an 8-16x (16- or 8-bit lanes) speedup. That sounds like a lot, but you can't parallelize parsing very well in the general case - it's inherently serial, and vectorization only helps so much.

You don't get what I mean.

... and the engine will allocate an extra backing array for it even if it's not exposed directly ...

It has to do it once, assuming this API:

const result = lexer.result
while (lexer.execInResult(input) && result[RegExp.success]) {}

or zero times, assuming this API

const result = []
while (lexer.exec(input, result) && result[RegExp.success]) {}

... since in that case the allocation happens once, in client code.

The current API returns a fresh array for each step of the loop. Even if that array is recycled, unless the very same array is immediately reused, both the matcher and the client code end up with a result at a different memory address from invocation to invocation. With my proposed API, the result is stable for the whole loop.

I haven't looked, but I'd be surprised if a function with the lexer.execInto(input, result) signature didn't already exist internally in every engine out there, wrapped by RegExp.prototype.exec(), which provides the fresh array when called.
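For illustration, a userland approximation of that calling convention (execInto and RegExp.success are the proposal's names, not existing APIs; this can't avoid the internal allocation today, since exec() still builds a fresh array, but it shows the intended shape):

RegExp.success ??= Symbol('RegExp.success')

function execInto(lexer, input, result) {
  const match = lexer.exec(input)
  result.length = 0 // blank the previous results
  result[RegExp.success] = match !== null
  if (match !== null) {
    for (let i = 0; i < match.length; i++) result[i] = match[i]
    result.index = match.index
  }
  return result
}

const result = []
while (execInto(lexer, input, result) && result[RegExp.success]) {
  // parse here, reading from a result that stays at one address
}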

Edit: Rather than discussing this abstractly, I'm going to try and implement said API in the various engines and see if it does bring better perf. The proof will be in the pudding...

Specifically, I'll give it a stab in SpiderMonkey and JavaScriptCore; building V8 on a Mac requires one to remove the CLI tools so that Xcode takes precedence, and I don't want to mess up my dev setup. Both SM and V8 use irregexp under the hood, so I expect similar gains, if any, from the approach I suggest.