Streaming JSON

JSON.parse and JSON.stringify are eager (and blazingly fast), but they don't scale beyond a few megabytes on servers and a few hundred megabytes on clients. For similar reasons, they're also not especially useful on resource-constrained IoT devices.

I would like to see a streaming variant similar in spirit to Jackson for JVM apps. Here's a rough sketch as a starting point:

  • enumerator = JSON.tokenize(iterable) - Tokenize an iterable/async iterable of strings and/or array buffers into a token enumerator* (see the usage sketch after this list)
    • Newline-delimited JSON is accepted by default, so the tokenizer can restart after each top-level value and remain usable in long-running streaming environments. If you only want a single value, just read that one value and immediately call enumerator.close(), and it'll just work.
    • No reviver is supported as it's...kinda useless in this context. (That data path is already exposed, so it doesn't matter.)
    • enumerator.nextToken(typeMask = JSON.ALLOW_ALL, maxScale = BigInt(Number.MAX_SAFE_INTEGER)) gets the next token matching a type bit mask*, returning JSON.INVALID on mismatch
      • JSON.ALLOW_NULL - Accept null
      • JSON.ALLOW_TRUE - Accept true
      • JSON.ALLOW_FALSE - Accept false
      • JSON.ALLOW_BOOLEAN - Sugar for ALLOW_TRUE | ALLOW_FALSE
      • JSON.ALLOW_STRING - Allow strings
      • JSON.ALLOW_NUMBER - Allow numbers
      • JSON.ALLOW_INTEGER - Allow integers, distinguishing them from floats by returning them as bigints
      • JSON.ALLOW_OBJECT_START - Allow the object start token
      • JSON.ALLOW_OBJECT_END - Allow the object end token
      • JSON.ALLOW_OBJECT_SEEK - Allow object end and ignore any entries leading up to it
      • JSON.ALLOW_ARRAY_START - Allow the array start token
      • JSON.ALLOW_ARRAY_END - Allow the array end token
      • JSON.ALLOW_ARRAY_SEEK - Allow array end and ignore any entries leading up to it
      • JSON.ALLOW_ALL - Sugar for all the above OR'd together except ALLOW_INTEGER
      • A bit mask is used to reduce GC overhead, as this is a very perf-sensitive operation.
      • Max scale (applies to bigints and strings) can be passed to reject obviously invalid data early - useful for things like dates, Ethereum hashes, and compiled schema enum variants.
    • enumerator.ready() returns a promise that resolves once it either has data to parse or completes
  • {writer, output: iterable} = JSON.generate(opts?) - Generate an async iterable of strings or array buffers (selected by option; see the producer sketch after this list)
    • This generates newline-delimited JSON. writer.close() can be used to terminate such a stream, and to emit just a single value, write it and then call writer.close() - subsequent writes are ignored.
    • writer.ready() returns a promise that resolves once it is ready to output more
    • writer.closed returns true if either a full value has been written or iterable.return() has been called.
    • writer.write(token, treatNumberAsNonInteger = false) - Write a token
      • treatNumberAsNonInteger, if true, ensures that integer-valued numbers are serialized with a .0 suffix.
      • Returns false if the iterable isn't ready for more.
  • Async iterators here can have their next method invoked with an optional chunk size in bytes, to help resolve backpressure.***
  • Replacers and revivers are not supported as I don't want to require state beyond the bare minimum necessary to sustain this - it's supposed to be lightweight.**
  • Tokens are either immediate values (for anything that isn't an object or array) or one of the following symbols:
    • JSON.INVALID - can only be returned from enumerator.nextToken and is not valid as an argument for writer.write or enumerator.maybeConsume
    • JSON.CLOSED
    • JSON.UNAVAILABLE
    • JSON.OBJECT_START
    • JSON.OBJECT_END
    • JSON.ARRAY_START
    • JSON.ARRAY_END
  • Note: objects are surfaced simply as alternating key string + value tokens, to simplify the API. Users are expected to handle these appropriately, even though there's defined behavior if they don't.
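
To make the reading side concrete, here's a hypothetical consumer of the API sketched above (nothing here exists today; the record shape and the next helper are mine): it pulls the "name" field out of each NDJSON record and skips the rest.

    // Hypothetical: collect every record's "name" field from an NDJSON
    // stream of objects shaped like {"name": <string>, ...}.
    async function collectNames(chunks) {
      const enumerator = JSON.tokenize(chunks);
      const names = [];
      // nextToken presumably returns JSON.UNAVAILABLE when it needs more
      // input; ready() resolves once there's data to parse (or it's done).
      const next = async (mask, maxScale) => {
        let token;
        while ((token = enumerator.nextToken(mask, maxScale)) === JSON.UNAVAILABLE) {
          await enumerator.ready();
        }
        return token;
      };
      for (;;) {
        const start = await next(JSON.ALLOW_OBJECT_START);
        if (start === JSON.CLOSED) break;             // input exhausted
        if (start === JSON.INVALID) throw new TypeError("expected an object");
        // Keys arrive as plain string tokens.
        if ((await next(JSON.ALLOW_STRING)) !== "name") {
          throw new TypeError('expected "name" as the first key');
        }
        names.push(await next(JSON.ALLOW_STRING));
        // SEEK skips the remaining entries and consumes the closing brace.
        await next(JSON.ALLOW_OBJECT_SEEK);
      }
      return names;
    }

And the producing side under the same assumptions (the option-less JSON.generate call, the sink callback, and the 64 KiB pull size are all illustrative):

    // Hypothetical: serialize an async sequence of names as NDJSON
    // records, respecting backpressure on both ends.
    async function writeNames(names, sink) {
      const { writer, output } = JSON.generate();
      // Drain the output concurrently; next() accepting a chunk-size hint
      // in bytes is part of the sketch above (here: ~64 KiB per pull).
      const drained = (async () => {
        const it = output[Symbol.asyncIterator]();
        for (let r = await it.next(65536); !r.done; r = await it.next(65536)) {
          await sink(r.value);
        }
      })();
      for await (const name of names) {
        writer.write(JSON.OBJECT_START);
        writer.write("name");
        writer.write(name);
        // write() returns false once the output side is backed up; waiting
        // on ready() before the next record keeps memory bounded.
        if (!writer.write(JSON.OBJECT_END)) await writer.ready();
      }
      writer.close();
      await drained;
    }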

* This explicitly eschews the existing reviver/replacer idioms in favor of manual handling of values to avoid the overhead, and it also uses bit masks for the type specification to avoid garbage collection overhead. This is because the code lies on a critical performance path, and necessarily needs to be decently fast to compete with JSON.stringify at all. (It's a fairly data-driven API.)
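
For instance (the 66-code-unit bound and the expected shape are illustrative, not part of the sketch), a parser expecting a 0x-prefixed 32-byte hash could reject oversized strings before buffering them in full:

    // At most 66 code units: "0x" plus 64 hex digits. Anything longer
    // comes back as JSON.INVALID without being materialized.
    const hash = enumerator.nextToken(JSON.ALLOW_STRING, 66n);
    if (hash === JSON.INVALID) throw new TypeError("expected a 32-byte hash string");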

** The stack could be implemented as a bit stack, where 0 = array and 1 = object value - an engine could optimize for the common case by simply using a 31-bit integer initially and only upgrading to a heap bit array later. Everything else could be tracked via simple static states, resulting in near zero memory overhead aside from buffered values. Replacers require tracking cycles, and I want to avoid the overhead of that in this API.
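
A minimal sketch of that bit stack in plain JS (the layout, with the top of the stack in the low bit, is one choice among several; the class and method names are illustrative only):

    // Nesting contexts packed into one integer: 0 = array, 1 = object.
    // Depth 31 is where an engine would transparently spill to a heap
    // bit array instead of throwing, as this sketch does.
    class ContextStack {
      #bits = 0;
      #depth = 0;
      push(isObject) {
        if (this.#depth === 31) throw new RangeError("spill to a heap bit array here");
        this.#bits = (this.#bits << 1) | (isObject ? 1 : 0);
        this.#depth++;
      }
      pop() {
        if (this.#depth === 0) throw new RangeError("stack underflow");
        const wasObject = (this.#bits & 1) === 1;
        this.#bits >>>= 1;
        this.#depth--;
        return wasObject;
      }
      get inObject() { return this.#depth > 0 && (this.#bits & 1) === 1; }
    }

Combined with a handful of static parser states (expecting key, expecting value, after value), that's the entire per-nesting footprint, which is what keeps the memory overhead near zero aside from buffered values.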

*** This obviously pairs well with the yield.sent proposal, but neither depends on the other.