Streaming JSON

JSON.parse and JSON.stringify are eager (and blazingly fast), but they don't scale beyond a few megabytes on servers and a few hundred megabytes on clients. They're also not especially useful for resource-constrained IoT devices, for similar reasons.

I would like to see a streaming variant similar in spirit to Jackson for JVM apps. Here's a rough sketch as a starting point:

  • enumerator = JSON.tokenize(iterable) - Tokenize an iterable/async iterable of strings and/or array buffers into a token enumerator*
    • Newline-delimited JSON is accepted by default, allowing the enumerator to be restarted and reused in such an environment. If you only want a single value, just read that one value and immediately call enumerator.close(), and it'll just work.
    • No reviver is supported, as it's kinda useless in this context. (That data path is already exposed directly, so nothing is lost.)
    • enumerator.nextToken(typeMask = JSON.ALLOW_ALL, maxScale = BigInt(Number.MAX_SAFE_INTEGER)) gets the next token matching a type bit mask*, returning JSON.INVALID on mismatch
      • JSON.ALLOW_NULL - Accept null
      • JSON.ALLOW_TRUE - Accept true
      • JSON.ALLOW_FALSE - Accept false
      • JSON.ALLOW_STRING - Accept strings
      • JSON.ALLOW_NUMBER - Accept numbers
      • JSON.ALLOW_INTEGER - Accept integers and distinguish them from floats by returning them as bigints
      • JSON.ALLOW_OBJECT_START - Accept the corresponding token
      • JSON.ALLOW_OBJECT_END - Accept the corresponding token
      • JSON.ALLOW_OBJECT_SEEK - Accept object end and skip the entries leading up to it
      • JSON.ALLOW_ARRAY_START - Accept the corresponding token
      • JSON.ALLOW_ARRAY_END - Accept the corresponding token
      • JSON.ALLOW_ARRAY_SEEK - Accept array end and skip the entries leading up to it
      • JSON.ALLOW_ALL - Sugar for all of the above OR'd together, except ALLOW_INTEGER
      • A bit mask is used to reduce GC overhead, as this is a very perf-sensitive operation.
      • Max scale (applies to bigints and strings) can be passed to optimize for things like dates, Ethereum hashes, and compiled schema enum variants, to reject obviously invalid data early.
    • enumerator.ready() returns a promise that resolves once it either has data to parse or completes
  • {writer, output: iterable} = JSON.generate(opts?) - Generate an async iterable of strings or array buffers (selected by option)
    • This generates newline-delimited JSON. writer.close() can be used to terminate such a stream, and a single value can be written followed by writer.close() to ignore subsequent values.
    • writer.ready() returns a promise that resolves once it is ready to output more
    • writer.closed returns true once either a full value has been written or iterable.return() has been called.
    • writer.write(token, treatNumberAsNonInteger = false) - Write a token
      • treatNumberAsNonInteger, if true, ensures that integer-valued numbers are suffixed with .0.
      • Returns false if the iterable isn't ready for more.
  • Async iterators here can have their next method invoked with an optional chunk size in bytes, to help resolve backpressure.
  • Replacers and revivers are not supported as I don't want to require state beyond the bare minimum necessary to sustain this - it's supposed to be lightweight.**
  • Tokens are immediate values (if neither objects nor arrays), or the following symbols:
    • JSON.INVALID - can only be returned from enumerator.nextToken and is not valid as an argument for writer.write or enumerator.maybeConsume
  • Note: objects are simply alternating string + value pairs, to simplify the API. Users are expected to handle these appropriately, even though there's defined behavior if they don't.
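To make the shape concrete, here's a userspace sketch of the nextToken(typeMask) flow. All names are hypothetical stand-ins for the proposed JSON.* constants, and this toy lexes one complete string via a regex rather than tokenizing an async iterable of chunks incrementally:

```javascript
// Hypothetical stand-ins for the proposed JSON.ALLOW_* bit flags.
const ALLOW_NULL = 1 << 0, ALLOW_TRUE = 1 << 1, ALLOW_FALSE = 1 << 2,
      ALLOW_STRING = 1 << 3, ALLOW_NUMBER = 1 << 4,
      ALLOW_OBJECT_START = 1 << 5, ALLOW_OBJECT_END = 1 << 6,
      ALLOW_ARRAY_START = 1 << 7, ALLOW_ARRAY_END = 1 << 8;
const ALLOW_ALL = (1 << 9) - 1;
const INVALID = Symbol('JSON.INVALID');
const OBJECT_START = Symbol('{'), OBJECT_END = Symbol('}');
const ARRAY_START = Symbol('['), ARRAY_END = Symbol(']');

function tokenize(text) {
  // A regex lexer keeps the sketch short; commas and colons are simply skipped.
  const lexer = /[{}[\]]|null|true|false|"(?:[^"\\]|\\.)*"|-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?/g;
  const queue = [];
  for (const [lexeme] of text.matchAll(lexer)) {
    switch (lexeme) {
      case '{': queue.push([ALLOW_OBJECT_START, OBJECT_START]); break;
      case '}': queue.push([ALLOW_OBJECT_END, OBJECT_END]); break;
      case '[': queue.push([ALLOW_ARRAY_START, ARRAY_START]); break;
      case ']': queue.push([ALLOW_ARRAY_END, ARRAY_END]); break;
      case 'null': queue.push([ALLOW_NULL, null]); break;
      case 'true': queue.push([ALLOW_TRUE, true]); break;
      case 'false': queue.push([ALLOW_FALSE, false]); break;
      default: queue.push(lexeme.startsWith('"')
        ? [ALLOW_STRING, JSON.parse(lexeme)]
        : [ALLOW_NUMBER, Number(lexeme)]);
    }
  }
  let i = 0;
  return {
    // Returns INVALID (without consuming the token) on a type-mask mismatch.
    nextToken(typeMask = ALLOW_ALL) {
      if (i >= queue.length) return INVALID;
      const [bit, value] = queue[i];
      if ((typeMask & bit) === 0) return INVALID;
      i++;
      return value;
    },
  };
}

const e = tokenize('{"id": 42, "tags": ["a"]}');
const tokens = [
  e.nextToken(ALLOW_OBJECT_START), // OBJECT_START
  e.nextToken(ALLOW_STRING),       // "id"
  e.nextToken(ALLOW_NUMBER),       // 42
  e.nextToken(ALLOW_STRING),       // "tags"
  e.nextToken(ALLOW_NUMBER),       // INVALID: next token is an array start
];
```

Note how the caller drives the parse: type masks act as inline assertions, and a mismatch surfaces immediately as INVALID rather than via an exception.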

* This explicitly eschews the existing reviver/replacer idioms in favor of manual handling of values to avoid the overhead, and it also uses bit masks for the type specification to avoid garbage collection overhead. This is because the code lies on a critical performance path and needs to be decently fast to compete with JSON.stringify at all. (It's a fairly data-driven API.)

** The stack could be implemented as a bit stack, where 0 = array and 1 = object value - an engine could optimize for the common case by simply using a 31-bit integer initially and only upgrading to a heap bit array later. Everything else could be tracked via simple static states, resulting in near zero memory overhead aside from buffered values. Replacers require tracking cycles, and I want to avoid the overhead of that in this API.
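A minimal sketch of that bit stack (helper name hypothetical), using a plain integer and throwing at depth 31 where an engine would instead upgrade to a heap bit array:

```javascript
// 0 = array context, 1 = object context, packed into one integer.
function createBitStack() {
  let bits = 0;   // the packed 0/1 context bits, top of stack in the low bit
  let depth = 0;  // current nesting depth
  return {
    push(isObject) {
      if (depth === 31) throw new RangeError('upgrade to a heap bit array here');
      bits = (bits << 1) | (isObject ? 1 : 0);
      depth++;
    },
    pop() {
      const top = bits & 1;
      bits >>>= 1;
      depth--;
      return top === 1; // true = we just closed an object
    },
    peek() { return depth > 0 && (bits & 1) === 1; },
    get depth() { return depth; },
  };
}

// Walking into {"a": [{}]} pushes object, array, object, then unwinds:
const stack = createBitStack();
stack.push(true);   // {
stack.push(false);  // [
stack.push(true);   //   {
const pops = [stack.pop(), stack.pop(), stack.pop()]; // }, ], }
```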

*** This obviously pairs well with the yield.sent proposal, but neither depends on the other.

I'm working on userspace tooling for building streaming parsers at the moment using my streaming regex engine. It wouldn't have anything like the speed of what you're proposing, but maybe it's of use to someone in a resource-constrained environment anyway: A human-friendly json parser with parserate · GitHub

Note that this approach isn't actually viable yet, because 1) I haven't written the library and 2) it's 10x slower than it should be, for reasons (tons of awaits) I'm still trying to get to the bottom of.

I have always found this problem (processing large and asynchronously loading JSON boluses) and this solution (a streaming parser API) to be interesting.

However, I do not currently have the mental bandwidth to champion this in the foreseeable future. My apologies.

As an aside, I note that we would have to decide whether to go with an async-generator-based API or an API based on WHATWG Streams' ReadableStream. This would have ramifications for which specification standard JSON streaming would live in. I do think it would be appropriate for it to live in TC39's ECMA-262 rather than a WHATWG standard, given that JSON is such a widespread need independently of host environment. But that would imply an API based on async iterators, which lack ReadableStream's built-in cancellation, etc.
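For what it's worth, the two choices aren't mutually exclusive: an async-iterator-based API can be wrapped in a ReadableStream where cancellation is needed. A rough sketch (all names hypothetical; assumes a global ReadableStream, as in browsers and Node 18+):

```javascript
// Wrap an async iterator in a ReadableStream so consumers regain cancel();
// cancellation maps back onto the iterator protocol's return().
function iteratorToStream(iter) {
  return new ReadableStream({
    async pull(controller) {
      const { value, done } = await iter.next();
      if (done) controller.close();
      else controller.enqueue(value);
    },
    async cancel(reason) {
      await iter.return?.(reason);
    },
  });
}

let cleanedUp = false;
async function* chunks() {
  try {
    yield '{"a":1}\n';
    yield '{"b":2}\n';
  } finally {
    cleanedUp = true; // runs when the consumer cancels early
  }
}

const demo = (async () => {
  const reader = iteratorToStream(chunks()).getReader();
  const first = (await reader.read()).value;
  await reader.cancel('only wanted the first value');
  return { first, cleanedUp };
})();
```

The reverse direction is trivial too (ReadableStream is itself async iterable in recent engines), so standardizing on async iterators wouldn't lock streams out.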

I've been excessively busy myself, hence why I've not followed up on this.