Schema-based JSON parsing?

There are a few main issues with the current state of JSON in JS:

  1. It requires quite a bit of memory, often unnecessarily so when not all members are used. This can become a vulnerability if you don't limit the message size enough, but even absent that, engines don't normally optimize for structure.
  2. It's often too free-form. On clients, this can result in strange bugs when misinterpreting responses, and on servers, this can sometimes even result in security bugs. (I know the latter from experience, unfortunately.)
  3. We have fast validators and fast untyped parsers, but in that flow engines don't get the feedback needed to optimize for the data's shape, which leaves several potential optimizations on the table.
  4. The translation to strings itself is completely unnecessary for the common case on servers, and major performance wins could be gained by operating on array buffers and/or streams of them. Clients have WHATWG's request.json() where the browser does this internally, but that only addresses this one point.

Here's what I propose:

  1. Provide a mechanism to process JSON data as either array buffers of UTF-8 bytes or async iterables of them. This need not be synchronous.
  2. Bake the Core and Validation parts of https://json-schema.org/ into the above mechanism as an optional parameter, with the exception that remote resources are not fetched (for obvious reasons). (A hypothetical sketch of what this could look like follows this list.)
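
To make that concrete, here's a minimal sketch of what such an API could look like. Everything in it is hypothetical: the name JSON.parseSchema, the options bag, and the accepted input types are purely illustrative, not a concrete API proposal.

    // Hypothetical API sketch - none of these names exist or are specced anywhere.
    const schema = {
      type: "object",
      properties: {
        id: { type: "integer", minimum: 0 },
        name: { type: "string" },
      },
      required: ["id", "name"],
    };

    // 1. A single UTF-8 buffer, optionally validated against a JSON Schema:
    const bytes = new TextEncoder().encode('{"id": 1, "name": "alice"}');
    const user = await JSON.parseSchema(bytes, { schema });

    // 2. The same entry point accepting a stream of chunks, so a fetch body
    //    never has to be joined into one big string before parsing:
    const response = await fetch("/api/users/1");
    const user2 = await JSON.parseSchema(response.body, { schema });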

The idea is this:

  • By admitting a UTF-8 buffer, there is zero overhead in the engine's ability to process the data. They are literally operating on an optimized representation.
  • By admitting a stream, there's no need to join them in the middle - parsing can be done as bytes come in. This allows implementations to further reduce memory overhead if they choose to implement it this way.
  • By allowing engines to have insight into the schema, they can perform a number of optimizations:
    • They can pre-build type maps based on that schema. The resulting parsed objects could simply share those type maps internally from the start, using a lot less memory per object. They can also normalize integers to 31-bit integers where possible, including when the minimum and maximum are set such that no other integer is valid.
    • They can parse only what's desired, not all of JSON. For instance, when validating dates, they can reject any escape that doesn't start with \u00, and when validating integers, they can require every digit after the . to be 0. (An example schema illustrating these opportunities follows this list.)
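
To illustrate the kind of schema that gives an engine that insight, here's an ordinary JSON Schema document; nothing new is being invented here, and the property names and bounds are made up for the example.

    // A plain JSON Schema (2020-12 style) - the names and bounds are examples only.
    const pointSchema = {
      type: "object",
      // A closed set of properties lets an engine pre-build one shared
      // type map for every object parsed against this schema.
      properties: {
        // These bounds guarantee every valid value fits in a 31-bit integer,
        // so the parser never needs to allocate a heap number.
        x: { type: "integer", minimum: 0, maximum: 65535 },
        y: { type: "integer", minimum: 0, maximum: 65535 },
        // A date-time string lets the parser reject escapes that don't
        // start with \u00 while it scans.
        createdAt: { type: "string", format: "date-time" },
      },
      required: ["x", "y", "createdAt"],
      additionalProperties: false,
    };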

This seems like it's solving a fairly rare problem. In my experience a very large majority of JSON encountered by typical JS programs is coming over the network, where network latency is going to absolutely dominate any performance differences you might get from having more optimized parsers.

So before setting out solutions, it's probably worth spending some time establishing these are problems worth solving.

"This seems like it's solving a fairly rare problem"

I'm not convinced that's true for JSON in general, but I don't have hard data on that. For schemas, I do, and it suggests otherwise: validators like ajv, json-schema, is-my-json-valid, and jsonschema all see very large weekly download counts on npm.

I haven't checked GitHub's repos yet, but this preliminary search is rather interesting. From a cursory look, ajv does have a few large dependents (ESLint, table, and Webpack's schema-utils), but I suspect that's not the full story here.

How stable/standardised is the JSON Schema standard? I've only used it in passing and haven't actually looked into its background.

That's a good point - I wouldn't quite call it mature enough to make a web standard yet. Not when there are significant breaking changes even in the most recent drafts: see the JSON Schema 2020-12 Release Notes and the JSON Schema 2019-09 Release Notes.

Number of npm downloads is not a particularly useful metric here. Just for comparison, har-validator has 20M/week downloads (and incidentally depends on ajv, so is presumably responsible for a significant fraction of its downloads). Should we add validation for http-archive files to the standard library as well? I wouldn't think so.

I was curious about the download numbers, so:

  • ajv has a large number of downloads because it is depended upon by a small number of popular projects, including eslint, stylelint (via table), webpack (via schema-utils), and request (via har-validator). Between them they are responsible for essentially all of its downloads.
  • json-schema has a large number of downloads because it is depended upon by jsprim, which is depended upon by http-signature, which is depended upon by request. This is responsible for essentially all of its downloads.
  • I couldn't figure out what's going on with is-my-json-valid. It only has a few hundred dependents, none of which get more than low 5 figures of downloads per week. Probably it's used by some internal tool at a big company?
  • jsonschema is depended upon by @aws-cdk/cloud-assembly-schema, which is depended upon by a bunch of aws-cdk packages including the core package; this is responsible for essentially all of jsonschema's downloads.

Basically, these numbers tell us only that the need to validate JSON schemas is common to popular tools which have JSON config files (or use HAR files). That is not a very compelling reason to add support for validating JSON schemas into the language.

And I note that validating JSON config files is something which is typically done at most a small handful of times per execution of a tool, and that config files tend to be human-sized, meaning that this use case is particularly unlikely to need very high performance, which is directly in contrast to your other ask in the OP.

Okay, I did some code searching. Between the above projects, I found 34.2k matches across 4.5k repos. That's about half to two-thirds of the count for toString("hex"), which is noted as part of the basis for this proposal. If you exclude node_modules, though, the results drop to about 8.5k matches across 2.0k packages.

It still might be worth considering down the road, once the JSON Schema spec has stabilized. Not that I anticipate it'd easily get past stage 1, though, not when Express is used by close to as many packages as Math.sin, at well over 50k packages.

Personally, I don't think it's a good idea. JSON schemas are evolving fast, and standardizing them in the specification basically forbids any backwards-incompatible change.

I feel like this proposal is combining a few different things:

  • streaming JSON parsing (i.e., parsing incoming data without having to buffer it all into a JS string, and without having to care about all the objects/properties in the JSON).
  • JSON schema validation for security/correctness
  • using expected schemas to speed up parsing

There are absolutely legitimate performance use cases for streaming JSON parsing (when there's simply too much data to buffer it all into memory), and there are also abundant legitimate use cases for JSON Schema validation. More compelling than the npm package download numbers, I think, is the fact that most mainstream API gateways (e.g., Kong and AWS API Gateway) include JSON Schema validation of incoming request payloads as a key security feature.
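
For reference, this is roughly what that validation step looks like in userland today with ajv; the schema and payload below are made up, and note that the JSON has to be fully parsed into objects before it can be validated, which is exactly the double pass the OP wants engines to be able to fold together.

    // Userland JSON Schema validation with ajv (https://ajv.js.org/).
    // The schema and payload are made-up examples.
    import Ajv from "ajv";

    const ajv = new Ajv();
    const validate = ajv.compile({
      type: "object",
      properties: {
        amount: { type: "integer", minimum: 1 },
        currency: { type: "string", enum: ["USD", "EUR"] },
      },
      required: ["amount", "currency"],
      additionalProperties: false,
    });

    // Parse first, validate second - two passes over the data.
    const payload = JSON.parse('{"amount": 5, "currency": "USD"}');
    if (!validate(payload)) {
      throw new Error("Invalid payload: " + ajv.errorsText(validate.errors));
    }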

I think @bakkot's objection — that the parsing overhead today may not be a big problem (e.g., relative to the network) — really only applies to the third point about using expected schemas to speed up parsing. And I am honestly skeptical about that idea as well, for some of the reasons mentioned: the performance win may be negligible and, more importantly, JSON Schema may be too unstable to standardize.

If we're not standardizing JSON Schema, though, that still leaves room to potentially add a built-in streaming JSON parser to JS, and maybe an asynchronous replacement for JSON.parse (although an async JSON.parse is probably a harder sell, since Response.json() already exists, and can be used in Node/Deno soon even though it's technically a web API and not a TC39-specced API).
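
To show the gap concretely: Response.json() (a real API) buffers and parses the whole body before resolving, whereas a streaming parser would hand back pieces as they arrive. The second half of this sketch is entirely invented - JSON.parseStream and its path option don't exist anywhere.

    // Today: the whole body is buffered, then parsed, then returned at once.
    const report = await (await fetch("/api/report")).json();

    // Hypothetical: yield completed array elements as bytes arrive, so a huge
    // top-level array never has to live in memory all at once.
    // JSON.parseStream and its "path" option are invented names for illustration.
    const res = await fetch("/api/report");
    for await (const row of JSON.parseStream(res.body, { path: "/rows/*" })) {
      handleRow(row); // handleRow stands in for application code
    }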

I guess the real question would be: given that there are libraries today for streaming JSON parsing (e.g., Oboe) and ways of making a JSON parse less blocking (Response.json(), among others), do these need to be in the language itself?

I think there's a decently compelling argument that parsing is bug-prone and security-critical, and JSON is ubiquitous, so there is probably a place for more flexible JSON parsing options in the language itself. A streaming JSON parser would take some real work to spec the API for, though.

The biggest objection (and why I've mostly abandoned this proposal) is that there's still no truly stable schema standard, despite JSON Schema having been around for quite a while now. It's still not quite mature, so it's too early to really push for this.