Take 2: generator.prototype[Symbol.mixedIterator]

I'm still not really convinced that using iterators of bytes is the right approach. I would like to see the performance of a sync iterator of bytes compared to a manual loop and direct indexed access to a typed array.

Now where I believe we have a missing abstraction is in the ability to create typed-array-like views backed by multiple discrete array buffers. If we had that, then you could keep simple loops over the data and only await to update your view when you reach a boundary.
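
Very roughly, I mean something like this (a hypothetical sketch; ChunkedView doesn't exist anywhere, it's just to show the shape): indexed reads stay synchronous, and you only await when you run off the end of the data the view currently holds.

class ChunkedView {
  constructor() { this.chunks = []; this.length = 0; }
  append(chunk) { this.chunks.push(chunk); this.length += chunk.length; }
  // Synchronous indexed access; returns undefined past the current boundary.
  at(i) {
    for (const chunk of this.chunks) { // a real version would keep offsets instead of scanning
      if (i < chunk.length) return chunk[i];
      i -= chunk.length;
    }
    return undefined;
  }
}

async function scan(chunkStream) {
  const view = new ChunkedView();
  const it = chunkStream[Symbol.asyncIterator]();
  for (let i = 0; ; i++) {
    let byte = view.at(i);
    while (byte === undefined) {              // hit the boundary: await more data
      const { value, done } = await it.next();
      if (done) return;
      view.append(value);
      byte = view.at(i);
    }
    // ... tight, fully synchronous processing of `byte` ...
  }
}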

Good thing I spent all this time building perf test cases then. A sync iterator of bytes is #3.

You're talking more or less about a two-loop example, which is #2. I removed the encoding from that example so that the chunks are buffers instead of strings, did indexed access on them, and the runtime dropped from 161ms to 147ms. You can see that test case here.
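
For reference, the shape of that case is roughly this (a sketch from memory, not the exact benchmark code): an async loop over chunks with a tight sync loop doing indexed access on each buffer.

async function countDelimiters(chunks) {
  let count = 0;
  for await (const chunk of chunks) {        // await only at chunk boundaries
    for (let i = 0; i < chunk.length; i++) { // plain indexed access on the buffer
      if (chunk[i] === 0x2c /* ',' */) count++;
    }
  }
  return count;
}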

Your TypedArray-like views don't seem to solve any problem at all to me, because as you admit, you still have to await and update your view when you reach a boundary.

There are really only two solutions to that problem:

  • You forbid lookahead. Only the current character in the input is safe to access because the next character might need to be awaited and that won't be done until the next iteration of your main loop.
  • You acknowledge that any access might be async and figure out how to write code that can deal with that possibility.

Thanks for the test cases. In all of them but the memory baseline you build your output character by character by concatenating strings. That causes unnecessary memory pressure by allocating intermediary strings.

I would have written this by saving the index and generating a subarray once the beginning and end of a cell have been found. This is where dealing with multiple array buffers comes in, because your cell may overlap multiple chunks.
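
Something like this, sketched (the cells helper here is hypothetical, and the carry-over when a cell spans two chunks is only noted in a comment):

function* cells(chunk, delimiter = 0x2c /* ',' */) {
  let start = 0;
  for (let i = 0; i < chunk.length; i++) {
    if (chunk[i] === delimiter) {
      yield chunk.subarray(start, i); // a view, no per-character allocation
      start = i + 1;
    }
  }
  // Bytes from `start` to the end belong to a cell that continues in the next
  // chunk; that tail has to be carried over, which is exactly where having a
  // cell backed by multiple array buffers comes in.
}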

I also ran into perf issues while parsing large amounts of streamed data (we're talking GBs of data, with lines sometimes being multiple MB), and had to write my own stream transform to do this, as no library I found was simple and performant enough. It's a simple splitter transform, so it should be applicable to your example.

In your CSV case you actually have nested splitters, where you can technically avoid double allocations by splitting the initial stream of chunks by both newline and comma, and stacking the cells into rows without first splitting by row.

I am honestly not sure how you can abstract things like this, but in any case, reverting to character-by-character processing will always be less efficient, as you've shown. One idea might be to write a splitter that yields both the elements and the break value in case multiple break values are configured.
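
Maybe something in this direction (a hedged sketch assuming Node Buffers and single-byte break values, not real library code): a splitter that yields each element together with the break that ended it, so a single pass can feed both cells and row boundaries.

async function* splitOnAny(chunks, breaks /* e.g. [0x0a, 0x2c] */) {
  let pending = [];
  for await (const chunk of chunks) {
    let start = 0;
    for (let i = 0; i < chunk.length; i++) {
      if (breaks.includes(chunk[i])) {
        pending.push(chunk.subarray(start, i));
        yield { value: Buffer.concat(pending), break: chunk[i] };
        pending = [];
        start = i + 1;
      }
    }
    if (start < chunk.length) pending.push(chunk.subarray(start)); // carry the tail
  }
  if (pending.length) yield { value: Buffer.concat(pending), break: null };
}

Consuming it for CSV would then mean pushing each value onto the current row and emitting the row whenever the yielded break is the newline.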

Also, once you've generated your cells/elements, I don't believe dealing only with an async stream is an issue. It's mostly at the byte granularity that this matters.

Ah ok, that explains a lot about why we're not on the same page. You're talking about squeezing every last drop of perf out of something, and I'm definitely not. I'm talking about allowing people to build powerful abstractions that incur some necessary perf costs without incurring unnecessary ones.

In the memory baseline I also build my output character by character in fact (using str), and it adds 15ms to the cost (35ms -> 50ms). If you allow your output to be string slices of the input you'll never be able to release the input from memory!! (As long as any of the output is held in memory, that is.)

The reason I think I don't need to be ultra perf-conscious is that for most people the convenience of writing code that deals in streams will dictate whether they write code that deals in streams at all. Most people aren't going to write the kind of code you wrote, and if they do write it they'll write it as a one-off (as you did), and if they do that they'll make mistakes and assumptions, as I believe you did also. For example, here you assume that the next chunk will always be larger than breakLength. While this is probably always true the way you use this code, it isn't necessarily true!

Now you could fix that bug in your logic by switching over to a while loop to ensure that you advance as many chunks as necessary to avoid this edge case, but a) you're presumably a top engineer and you missed this, and b) you probably won't change the code because it's not library code and it works well enough for what you need it to do.

Finally, my code can do one thing that yours just absolutely cannot: I can safely match a regular expression anchored at any point in the stream. Your code would never be able to do this, because it needs to know some fixed number of characters (breakLength) that it might need to read ahead. A regex can't tell you how many characters it might need to see before matching or failing. This might not be very important for parsing CSVs, but there are other languages where the distinction is much more important!
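
To make that concrete with an ordinary sticky regex (just an illustration, nothing to do with my streaming API):

const re = /\d+;/y;  // how many characters does this need at a given position?
re.lastIndex = 0;
re.exec('123');      // null, but not a definite failure: more digits or a ';' could still arrive
re.lastIndex = 0;
re.exec('1234567;'); // a match, but only knowable after arbitrarily many characters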

So to sum up, I'm not talking about code for the most absolutely perf-critical work. I'm talking about making it easy to write and read code that has basically the right perf characteristics and incurs some moderate costs for the benefits provided. But currently the costs are not moderate, they are extreme.

Nope, I don't. This code is explicitly there in case the break value could span multiple chunks (i.e. it is more than one byte). It then preemptively concatenates chunks so that finding the break value is a matter of a lookup in the single pending chunk. I could have written it to avoid pre-concatenation, but that was just not worth the effort.

That's not my reading of it. Here you seem to be building it cell by cell.

Edit: sorry I had missed the str helper and didn't know what it did. I'm still very confused by its purpose.

Say what? There is no language reason this would be the case for strings. If you're dealing with subarray views of an array buffer, then yes. That's where you need to start considering the tradeoff of allocating a new buffer to make a copy of the smaller content you're interested in.
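
Concretely (this is just standard typed array behaviour, nothing specific to either of our codebases):

const buf = new ArrayBuffer(1024);
const all = new Uint8Array(buf);

const view = all.subarray(10, 20); // shares buf: keeps all 1024 bytes alive
const copy = all.slice(10, 20);    // copies 10 bytes into a fresh buffer

view.buffer === buf; // true: you can always get back to the original buffer
copy.buffer === buf; // false: the original can be collected once nothing else references it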

The built-in regex sure doesn't work on an iterable. You'd need a custom regex engine to handle this, at which point, you can make it work over any kind of input structure you want.

I think we both agree that the low-level bits are hard to get right, and are the ones that require performance. I remain doubtful that an iterator approach for the low-level will ever have satisfying performance. Once you start producing higher-level elements, I believe an async iterator is entirely acceptable.

Sorry, maybe I wasn't clear. You assume that if breakLength cannot match because the break pattern would cross a chunk boundary, then loading one additional chunk will eliminate that condition. It's usually a very safe bet in your use case, but it's not necessarily true, especially once you start to compose functions.

The str helper is actually used wrong in that code! Oops. I was turning each row into a str, instead of each value. If I stringify each value the new cost of the memory algorithm is 195ms. Wow! That's definitely why I put it in there -- to see where the costs really come from.

My understanding is that string.slice() basically works exactly like that.

Yes, I have built that!!! Indeed it works over any kind of input structure I want.

Nope, still not it. It will do this over and over for each new chunk until the break value is found in the concatenated chunks. Basically like an async accumulator.
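
In shape it's roughly this (a simplified sketch of the idea, assuming Node Buffers, not the actual transform):

async function* splitOnBreak(chunks, breakValue /* possibly multi-byte */) {
  let pending = Buffer.alloc(0);
  for await (const chunk of chunks) {
    pending = Buffer.concat([pending, chunk]); // pre-concatenate so finding the break is a single indexOf
    let idx;
    while ((idx = pending.indexOf(breakValue)) !== -1) {
      yield pending.subarray(0, idx);
      pending = pending.subarray(idx + breakValue.length);
    }
    // if the break wasn't found, we just loop and accumulate the next chunk
  }
  if (pending.length) yield pending;
}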

Ah ok.

There is no language reason it has to. Unlike typed arrays, from which you can retrieve the underlying array buffer, you can never go back to the original string. It's possible a smart implementation will try to avoid copies, but the GC should be able to collect when needed.

I see. I've updated my memory benchmark to remove str completely.

Our conversation also got me thinking about exactly how my proposed parser tooling would work when used on async streams. Unfortunately it seems to require await?: An async streaming parser with n-lookahead · GitHub (and here's the sync version again for comparison and syntax highlighting)

We agree that the costs are acceptable at the higher level, but I think my benchmarks show precisely that an iterator approach for the low-level can have satisfying performance. My optimization is so powerful that I made a transpiled loop 5x faster than a native one! A native sync loop is 10x faster than an async loop over sync inputs.

Sorry I'm still not following what the need for await? is.

Didn't you say your regex engine could work with an async stream of chunks? If it's only the internals of such specialized transforms that need to deal with that complexity, what is the motivation of await?

To put it another way, the bar for syntax is very high, and a niche use case does not usually pass the bar.

The problem is that you have to maintain a completely separate set of infrastructure for dealing with character sequences composed of chunks. For any tools for working with sequences, you need a completely separate copy of them, written in a difficult and different style, just for working efficiently with streams of chunks.

If there's anything I've learned, it's that you can gain a lot of sanity by using abstraction -- that is to say by factoring things in such a way that certain concerns become fully decoupled from others. So that's what I'm planning to sell to the community -- I guess we'll both see just how popular or "niche" such an approach might really be. I'm betting that telling people they can work in this way is going to be an easy home run, and that it will lead to a lot of powerful new code being written that was just too difficult or annoying to write before.

I guess I'm going to go ahead and build all this. You've convinced me that the language isn't ready to support it, but not that any other way would be better. My best counterargument to all this discussion about how support for mixing sync and async is so dangerous and best discouraged is: you opened the box. You don't yet believe that there's any way to use that ability in a way that's safe and effective, but that doesn't mean I won't be able to convince other people. Always build from first principles! Anyway, thanks for your time.

The 4x slowdown in the perf tests: maybe that shows that there could be room for engines to optimise the existing syntax? Iterators used to be many times slower than C-style loops but are closer these days. Maybe similar optimisations are yet to be implemented for async iterators.

It is possible, but so far my attempts to reach out to engine maintainers and find out if this is something they're considering (or if there are technical obstacles) have fallen flat. My impression is that it doesn't much matter to me right now, because right now my choice is between building an ecosystem for working with character iterators or one for working with chunk iterators. I'm choosing character iterators, and making this choice is the only way to create the popular demand that could lead to either engine optimizations or a language change.


The difficulty that's certain to arise for engine optimization is contention for the microtask queue, which the spec dictates must cause character-by-character bouncing when each microtask generates another microtask. @bergus shared this code, which causes that perf-destroying problem in order to maintain compliance with the spec:

// Two concurrent for-await loops over the same sync iterable: per spec each
// step goes through the microtask queue, so the two loops' logs interleave
// value by value instead of one loop running to completion.
const range = Array.from({length: 100}, (_, i) => i);
(async () => { for await (const i of range) console.log('a', i); })();
(async () => { for await (const i of range) console.log('b', i); })();

That gives me serious pause over the idea that the engine can completely remove this concern; however, it's worth noting that the example itself is also extremely contrived. In the real world, if you're using a for-await loop it really ought to be because at least some of your data is only available asynchronously. Any data that is really asynchronous will resolve in the task queue (not the microtask queue), which I take to mean that the microtask queue should be empty to start with, and that other tasks will be delayed until all microtask processing is finished. I am thus cautiously optimistic that it would really be possible for engines to optimize this pattern (by proceeding synchronously when the microtask queue is empty) in a majority of cases.

I don't see the need for new syntax here. With the new AsyncIterator.prototype methods, yielding Iterators from AsyncIterators becomes way easier:

function* processBits(chunk) {
  for(const char of chunk) {
    yield char;
  }
}

// readFile() here stands in for some source of chunks exposed as an AsyncIterator
// with the helper methods, and join for a reducer that combines the results.
const result = await readFile().map(processBits).reduce(join);

Also, performance-wise, combining a tight loop (sync iterator) over chunks of data (async iterator) is usually the best choice in a system with concurrent users.

@Jonas_Wilms I agree that a combination of sync and async is absolutely essential for performance! Since my use case is writing streaming parsers, though, I don't think the way you're proposing to compose operations is useful for me. For example, recognizing multi-character tokens like => will require addressing the possibility that the chunk boundary falls between = and >. Of course it's always possible to pass some state around, but the likely result is code that's harder to reason about.
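
For instance, a per-chunk scanner ends up threading state like this between chunks (a hypothetical sketch just to show the shape of the problem):

function scanChunk(chunk, state = { sawEquals: false }) {
  for (const char of chunk) {
    if (state.sawEquals) {
      state.sawEquals = false;
      if (char === '>') {
        // emit an arrow token
        continue;
      }
      // otherwise emit a plain '=' token, then handle `char` normally below
    }
    if (char === '=') {
      state.sawEquals = true; // the '>' may be in the next chunk
      continue;
    }
    // ... handle other characters ...
  }
  return state; // must be passed along to the call for the next chunk
}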