Hello. All data sent to us via HTTP / WebSocket requests is untrusted. On the frontend or the backend we receive that data as ArrayBuffer(s), decode it into a JS string, usually parse it with JSON.parse, and then handle it. However:
- JSON.parse can throw a SyntaxError (and with network data that is very likely). In my view, any Error instance, with the cost of capturing its call stack, should be reserved for debugging our own code. How do we debug something that isn't even ours?
- We have to decode the ArrayBuffer into a JS string on every request. For Latin-only payloads the result is a one-byte-per-character copy of the data, but the moment the decoder hits a single non-Latin character such as an "emoji" (on the web this is routine), everything decoded so far gets copied again as a two-byte UTF-16 JS string. One character forces regeneration of the whole payload (which can be large).
Only JSON.parse can validate the payload, so if it turns out to be invalid, producing this JS string was a complete waste of CPU time, and the string stays in memory for a while until GC runs.
Even when parsing succeeds, this intermediate string lives for microseconds and then goes to GC "eventually", not immediately. Remember "optimising for use-cases"? We should eliminate this string completely.
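To make the cost concrete, here is a minimal sketch of the typical pipeline today: the intermediate string is always produced up front, and validity is only learned inside JSON.parse.

```javascript
// Today's typical pipeline: decode the whole payload into an intermediate
// JS string just to feed JSON.parse; if the payload is invalid, that
// string was wasted work and lingers until GC.
const good = new TextEncoder().encode('{"msg":"hi"}');
const bad = new TextEncoder().encode('{oops');

function handle(payloadBytes) {
  const intermediate = new TextDecoder().decode(payloadBytes); // full copy
  try {
    return JSON.parse(intermediate); // validity is known only here
  } catch (e) {
    return undefined; // the decode above was pure waste
  }
}

const ok = handle(good);   // { msg: "hi" }
const fail = handle(bad);  // undefined, but the string was still built
```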
To address these issues I have created two TC39 proposals: JSON.parseBinary and ArrayBuffer.prototype.detach. Note that .detach is intended to behave the same as ArrayBuffer.prototype.transfer, but do less work: it doesn't copy the contents into another ArrayBuffer.
Related threads:
- JSON.parseBinary has some similarities with Schema-based JSON parsing? - Ideas - TC39 (accept UTF-8 ArrayBuffer/Uint8Array)
- JSON.safeParse(): call JSON.parse() in try/catch? - Ideas - TC39 (but rather avoiding throwing errors completely)
Hey, I like the idea. It tries to solve a crucial problem, but if you're doing live parsing on the buffer, I think some TOCTOU (Time-of-Check to Time-of-Use) issues might happen. Since the syntax shows it's a sync API, that might not matter much for regular buffers.
If you are just copying and parsing, okay, great, but that's not the best for performance (because you are making a copy). Or maybe you're doing an implicit transfer? I guessed that since you have a co-proposal based on detach/transfer. But that would be a developer inconvenience: you don't want an API suddenly 'eating' your buffer without warning. Maybe you intended manual detaching, but by then RAM would have peaked at double the memory usage. Moreover, like you said, it would be making developers do memory management.
So, I have a suggestion: add an explicit transfer list, like postMessage. What I'm thinking is:
await JSON.parseBinary(buff, [buff]) (Async version)
JSON.syncParseBinary(buff, [buff]) (Sync version)
The API just receives the source (or a copy), parses it, and then deletes it (the copy or the source). The async version could be great for browsers, because you don't want an API blocking the UI just to parse some large binary.
Anyway, you don't specify what happens with a SharedArrayBuffer (SAB). We can't detach one, nor parse it atomically without huge overhead, so a copy is better. I suggest a SAB shouldn't be allowed on the transfer list; it should be an error if you try (a real error, because it's not a network-produced error), just like with postMessage.
Anything not in the transfer list (including SABs) is copied. I think that solves the issues while keeping the speed.
Also, devs may need to know what caused an error. Since you return an object saying it was not okay, why not return the buffer itself on error? Only if it was on the transfer list (transfer list + error)? I hope that can be a great add-on! Or maybe they don't care about a malformed buffer. Either way, consider specifying what happens with a SAB.
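A hedged userland sketch of this transfer-list idea (parseBinaryWithTransfer is hypothetical; detaching is emulated via structuredClone's transfer option, and a SAB on the list is rejected, postMessage-style):

```javascript
// Hypothetical transfer-list semantics: parse first, then detach every
// ArrayBuffer named in the transfer list. SharedArrayBuffers are rejected
// up front with a real TypeError, as suggested above.
function parseBinaryWithTransfer(buffer, transferList = []) {
  for (const b of transferList) {
    if (typeof SharedArrayBuffer !== "undefined" && b instanceof SharedArrayBuffer) {
      throw new TypeError("SharedArrayBuffer cannot be transferred");
    }
  }
  let out;
  try {
    out = { ok: true, value: JSON.parse(new TextDecoder().decode(buffer)) };
  } catch (e) {
    out = { ok: false, message: e.message };
  }
  // structuredClone's transfer option detaches the source buffer.
  for (const b of transferList) structuredClone(b, { transfer: [b] });
  return out;
}

const buf = new TextEncoder().encode('{"n":1}').buffer;
const res = parseBinaryWithTransfer(buf, [buf]); // buf is detached afterwards
```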
Thanks for reviewing my idea! However, the changes you'd like to add are actually better avoided. Let me explain why (and I will rephrase everything below and put it into the proposals as "why not this"):
- The initial idea for JSON.parseBinary is that it doesn't detach the buffer, is synchronous, and doesn't return the input as a separate property of the result object.
In my view, JSON.parseBinary is a specialized utility with a sole purpose: parse binary data and return the result. If it detached the buffer internally, it would become "framework-like", because it would handle many things at once and TIE THE ARCHITECTURE to specific rules. I want developers to define their own workflow, do only what they need and, if they operate at this raw level, avoid unnecessary overhead. In the end, if someone wants to detach the buffer alongside this call, they can create their own utility, a wrapper, that does several things. A proposal to make such a wrapper global can be easily rejected, like JSON.safeParse.
- The transfer list is an interesting addition, with postMessage as precedent, but here it adds even more headache than a manual .detach(). Firstly, such a list implies transferring ownership immediately, so after the call the buffer is detached. But if there is a syntax error, we have two options: don't detach the buffer and confuse developers even more (a transfer list implies detaching), or detach the user's buffer and create one more buffer as an "input" property: { ok: false, message, input }.
Secondly, it is simply more compelling to keep one buffer referenced and detach it only once we are SURE it is useless afterwards. This "input" property creates another view over the buffer, which is yet more GC overhead.
Lastly, if I do return "input" as a property of the result, who is going to detach it? We are back to square one. We could add a "reviver" callback at the end, where the buffer could be detached depending on the answer, but why? The code starts to look more like a workaround than a clean solution to the problem.
// This example shows that the transfer list limits the developer's
// freedom to use one "buffer" however they want.

// New idea: user has "buffer"
var result = JSON.parseBinary(buffer, [buffer]);
if (!result.ok) {
  console.log("bad", result.input.buffer.byteLength);
  result.input.buffer.detach(); // back to square one.
}

// OR the previous idea
var result = JSON.parseBinary(buffer);
if (!result.ok) {
  // reference the existing buffer
  console.log("bad", buffer.byteLength);
  buffer.detach(); // we are done with the data, can move forward
  return;
}
// clear the initial buffer
buffer.detach();
// proceed handling result.value
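As a concrete illustration of the "write your own wrapper" point above, here is a hedged userland sketch: parseBinary is approximated with today's TextDecoder + JSON.parse (the real proposal would skip the intermediate string), and detaching is emulated with structuredClone's transfer option.

```javascript
// Userland approximation of the proposed JSON.parseBinary result shape.
function parseBinary(view) {
  try {
    return { ok: true, value: JSON.parse(new TextDecoder().decode(view)) };
  } catch (e) {
    return { ok: false, message: e.message };
  }
}

// The wrapper the post suggests developers write themselves: parse, then
// detach, because THIS workflow knows the buffer is useless afterwards.
function parseAndDetach(buffer) {
  const result = parseBinary(buffer);
  structuredClone(buffer, { transfer: [buffer] }); // emulated detach
  return result;
}

const b = new TextEncoder().encode('{"x":2}').buffer;
const r = parseAndDetach(b); // b is detached afterwards
```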
- Regarding await JSON.parseBinary(buf) and the synchronous JSON.syncParseBinary(buf): first of all, parsing even a large JSON payload on the frontend doesn't take more than half a second, so why?
Secondly, this requires either copying the buffer or explicitly "await"-ing its execution.
Thirdly, there are two ways to achieve async execution:
3.1 Parse on the main JS thread, but chunked. This needs to save some "parser state" (I explain why this is cumbersome below) and work in chunks of a certain size. Who decides that size? A function argument? Why not just parse the buffer at once, if it is available, and not keep it in memory for too long? Parsing half of a buffer doesn't mean we can already detach it. Memory will spike to 2x+, and parsing one whole buffer in chunks makes those 1.7x+ allocations pile up, tremendously impacting memory usage. And that is not solvable via GC or detach, because, again, the buffers are referenced and in use.
- After analyzing await JSON.parseBinary() and figuring out its need for "external state" to save progress, I instinctively considered another proposal like:
// returns an object { ok (false if error), done (false if error or
//                     incomplete), message (if error), value (if done) }
var parseBinaryChunk = JSON.binaryParser();
// somewhere in a handler where chunks come in
var result = parseBinaryChunk(buffer);
// at first it seems that we can detach the buffer here, but actually no
buffer.detach();
if (!result.ok) {
  console.log(result.message);
  return;
} else if (result.done) {
  var object = result.value;
}
However, this parser has problems MAINLY due to that external state. Let's look at an example where chunks arrive in an awkward but possible way:
chunk 1: { “some string”: “its contents, that are not full
chunk 2: ; this is still that text, not full
chunk 3: ; finally end” }
These chunks demonstrate that if we want memory to avoid a 2x+ spike and stay at 1.3x+ at most, then properties have to end within chunk boundaries. Otherwise, we either have to prohibit detaching, or copy those strings into our parser state. Chunk 1 has to save the key and the partial value into the state; chunk 2 has to copy all of its contents and concatenate the value from chunk 1 with chunk 2 (or save them as an array), while re-decoding those strings from UTF-8 to UTF-16 because of emoji. In chunk 3 this journey ends, but what came before is enough to drop the idea. Parsing JSON in JavaScript in a streaming manner without sacrificing memory is impossible.
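The chunk example can be made concrete with today's streaming TextDecoder: every partial string has to be copied out of the chunk into parser state before the chunk can be released, which is exactly the extra memory being described. A sketch:

```javascript
// A streaming consumer cannot hold references into the chunks (they may
// be detached later), so every partial string must be COPIED into parser
// state. TextDecoder's stream mode performs exactly this accumulation.
const text = '{"some string":"its contents; this is still that text; finally end"}';
const bytes = new TextEncoder().encode(text);
// Split at arbitrary byte offsets, as a network would.
const chunks = [bytes.subarray(0, 20), bytes.subarray(20, 45), bytes.subarray(45)];

const decoder = new TextDecoder();
let state = ""; // accumulated copy of everything seen so far: the 2x memory
for (const chunk of chunks) {
  state += decoder.decode(chunk, { stream: true }); // copy out of the chunk
}
state += decoder.decode(); // flush any trailing partial code point
const parsed = JSON.parse(state);
```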
—
JSON.parseBinary as a synchronous function that doesn't detach the buffer internally solves the problem and doesn't become "framework-like".
Fair enough, it was a lot; keeping it low-level is better, I think.