Hello. All the data anyone sends us via HTTP / WebSocket requests is untrusted. On the frontend or the backend we receive that data as ArrayBuffer(s), decode it into a JS string, typically parse it with JSON.parse and then handle it. However:
JSON.parse can throw a SyntaxError (and with network data that is very likely). In my view, any Error instance that captures a call stack should exist for debugging. How do we debug something that isn’t even our code’s fault?
We have to decode the ArrayBuffer into a JS string on every request. For Latin-1 characters the result is a one-byte-per-character copy of the data, but the moment the decoder hits a single emoji (routine on the web), everything decoded so far gets copied again as a UTF-16 JS string. One character forces regeneration of the whole payload, which can be large.
Only JSON.parse can validate the payload, so if it turns out to be invalid, that JS string was a complete waste of CPU time and it still lingers in memory until GC reclaims it.
Even in the success case, this intermediate string lives for microseconds and then goes to GC “eventually”, not immediately. Remember “optimising for use-cases”? We should remove this string completely; the sketch below shows where it comes from.
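To make the waste concrete, here is a minimal sketch of today's typical flow, assuming a WebSocket with binaryType = "arraybuffer" (ws and handle are placeholders for application code):

ws.addEventListener('message', (event) => {
  const buffer = event.data;                      // ArrayBuffer straight off the wire
  const text = new TextDecoder().decode(buffer);  // intermediate JS string (a full copy)
  let payload;
  try {
    payload = JSON.parse(text);                   // only now do we learn whether it was valid
  } catch (err) {
    return;                                       // the string was built for nothing and now waits for GC
  }
  handle(payload);                                // the string is already garbage at this point
});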
To address these issues I have created two TC39 proposals: JSON.parseBinary and ArrayBuffer.prototype.detach. Note that “.detach” is intended to behave like “ArrayBuffer.prototype.transfer”, but do less work: it does not produce another ArrayBuffer to carry the contents.
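A rough sketch of the intended difference, where incoming and another are placeholder ArrayBuffers (the exact semantics are whatever the proposal text ends up saying; this only illustrates the intent):

// Existing behaviour: transfer() detaches the source but hands its bytes to a new ArrayBuffer.
const moved = incoming.transfer();   // `incoming` is detached; `moved` now owns the bytes

// Proposed behaviour: detach() would only release the buffer, creating no new ArrayBuffer.
another.detach();                    // nothing new to allocate, nothing extra for GC to track
console.log(another.byteLength);     // 0 — detached, just as after transfer()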
Hey, I like the idea. It tries to solve a crucial problem, but if you're doing live parsing on the buffer, I think some TOCTOU (time-of-check to time-of-use) issues might happen. Since the syntax shows it's a sync API, that might not matter much for regular buffers.
If you are just copying and parsing, okay, great, but that's not the best for performance (because you are making a copy). Or maybe you're doing an implicit transfer? I guessed that since you have a co-proposal based on detach/transfer. But that would be a dev inconvenience: you don't want an API suddenly 'eating' your buffer without warning. Maybe you intended manual detaching, but by then RAM would already have peaked at double the memory usage. Moreover, like you said, it would be making developers do memory management.
So, I have a suggestion: add an explicit transfer list, like postMessage. What I'm thinking is:
JSON.syncParseBinary(buff, [buff]) (Sync version)
The API just receives the source (or a copy), parses it, and then deletes the source/copy. The async version could be great for browsers, because you don't want an API blocking the UI just to parse some large binary; see the sketch below.
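Roughly what I have in mind (hypothetical signatures, mirroring the postMessage transfer-list convention; buff is a placeholder ArrayBuffer, and JSON.parseBinary is assumed to be the async form):

// Sync form: parse, and because `buff` is on the transfer list, detach it afterwards.
const result = JSON.syncParseBinary(buff, [buff]);

// Async form: same contract, but the heavy parsing work stays off the main thread.
const asyncResult = await JSON.parseBinary(buff, [buff]);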
Anyway, you don’t specify what happens with a SharedArrayBuffer (SAB). We can't detach it, and we can't parse it atomically without huge overhead, so a copy is better. I suggest a SAB shouldn't be allowed on the transfer list; it should be an error if you try (a real thrown error, because it's a programming mistake rather than a network-produced error), just like postMessage.
Anything not in the transfer list (including a SAB) gets copied. I think that solves the issues while keeping the speed.
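In other words (a sketch of the behaviour I'm suggesting, not spec text, and assuming the transfer list is optional):

const sab = new SharedArrayBuffer(1024);

JSON.syncParseBinary(sab);         // fine: not on the transfer list, so its bytes are copied
JSON.syncParseBinary(sab, [sab]);  // should throw, like putting a SAB in a postMessage transfer list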
Also, devs may need to know what caused an error. Since you return an object saying it was not okay, why not return the buffer itself on error, but only if it was on the transfer list (transfer list + error)? I hope that could be a nice add-on. Or maybe they don't care about a malformed buffer. Either way, please consider specifying what happens with SABs.
Thanks for reviewing my idea! However, the changes you’d like to add are better avoided. Let me explain why (and I will rephrase everything below and put it into the proposals as “why not this” sections):
The initial idea for JSON.parseBinary is that it does not detach the buffer, is synchronous, and does not return the input as a separate property of the result object.
In my view, JSON.parseBinary is a specialized utility with a sole purpose: parse binary data and return the result. If it detached the buffer internally, it would become “framework-alike”, because it would handle many things at once and TIE THE ARCHITECTURE to specific rules. I want developers to define their own workflow, do only what they need and, since they operate at this raw level, avoid unnecessary overhead. If someone wants to detach the buffer alongside this call, they can write their own wrapper that does both (see the sketch below). Proposing such a wrapper as a global can easily be rejected, like JSON.safeParse.
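For example, a user-level wrapper built on top of the proposed API could look like this (a hypothetical helper, not part of the proposal):

// Parse, then unconditionally release the buffer, for callers who never need it again.
function parseBinaryAndDetach(buffer) {
  const result = JSON.parseBinary(buffer);
  buffer.detach();
  return result;
}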
A transfer list is an interesting addition, given the postMessage precedent, but here it adds even more headache than manual .detach(). Firstly, such a list implies transferring ownership immediately, so after the call the buffer is detached. But if there is a syntax error, we have two options: don’t detach the buffer and confuse developers even more (a transfer list implies detaching), or detach the user’s buffer and allocate one more buffer as an “input” property, i.e. {ok: false, message, input}.
Secondly, it is simply more compelling to keep referencing the one buffer and detach it only WHEN SURE it is useless afterwards. This “input” property, by contrast, creates another view over a buffer, which is yet more GC overhead.
Lastly, if I do return “input” as a property of the result object, who is going to detach it? We are back to square one. We could add a “reviver”-style callback at the end, inside which the buffer could be detached depending on the outcome, but why? The code starts to look more like a workaround than a clean solution to the problem.
// This example shows how the transfer-list variant limits the developer's
// freedom to use one "buffer" however they want.

// New idea: the user has "buffer"
var result = JSON.parseBinary(buffer, [buffer]);
if (!result.ok) {
  console.log("bad", result.input.buffer.byteLength);
  result.input.buffer.detach(); // back to square one.
}

// OR the previous idea
var result = JSON.parseBinary(buffer);
if (!result.ok) {
  // reference the existing buffer
  console.log("bad", buffer.byteLength);
  buffer.detach(); // we are done with the data, can move forward
  return;
}
// clear the initial buffer
buffer.detach();
// proceed to handle result.value
On await JSON.parseBinary(buf) plus a synchronous JSON.syncParseBinary(buf): first of all, parsing even a large JSON payload on the frontend doesn’t take more than half a second, so why split the API?
Secondly, this requires either copying the buffer or explicitly “await”-ing its execution.
Thirdly, there are two ways to achieve async execution:
3.1 Parse on the main JS thread, but chunked. This needs to save some “parser state” (I explain below why that is cumbersome) and to work in some chunk size. Who decides that size? A function argument? Why not just parse the buffer at once, if it is already available, and not keep it in memory for too long? Parsing half of a buffer doesn’t mean we can already detach it. Memory will spike to 2x+ anyway, and parsing one whole buffer in chunks makes those 1.7x+ spikes pile up, severely impacting memory usage. And that is not solvable via GC or detach, because, again, the buffers are still referenced and in use.
After analyzing await JSON.parseBinary() and realizing it needs “external state” to save its progress, I naturally considered another shape of the proposal:
// Returns an object: { ok (false if error), done (false if error), message (if error), value (if done) }
var parseBinaryChunk = JSON.binaryParser();

// somewhere in the handler where chunks arrive
var result = parseBinaryChunk(buffer);
// at first it seems we can detach the buffer here, but actually no
buffer.detach();
if (!result.ok) {
  console.log(result.message);
  return;
} else if (result.done) {
  var object = result.value;
}
However, this parser has problems MAINLY because of that external state. Let’s look at an example where chunks arrive in an awkward but perfectly possible way:
chunk 1: { "some string": "its contents, that are not full
chunk 2: ; this is still that text , not full
chunk 3: ; finally end" }
These chunks demonstrate that if we want memory to avoid 2x+ and stay at 1.3x+ at most, then properties have to end within their chunks. Otherwise, we either have to prohibit detaching, or copy those partial strings into our state. Chunk 1 has to save the key and the partial value into the state; chunk 2 has to copy all of its contents and concatenate its value with chunk 1’s (or save them as an array), re-decoding those strings from UTF-8 to UTF-16 because of emoji. In chunk 3 the journey ends, but what came before is enough to drop the idea. Parsing JSON in JavaScript in a streaming manner without sacrificing memory is impossible.
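To make that concrete, here is roughly what the parser state would be forced to hold after consuming chunk 2 (a hypothetical shape, only to show the copying):

// Both partial string pieces have already been decoded and copied out of their
// buffers; without these copies the chunks could never be detached.
var state = {
  stack: ['object'],
  pendingKey: 'some string',
  pendingValuePieces: [
    'its contents, that are not full',        // copied out of chunk 1
    '; this is still that text , not full',   // copied out of chunk 2
  ],
};
// Only now can chunk 1 and chunk 2 be detached, i.e. after paying for the copies anyway.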
— JSON.parseBinary as a synchronous function that does not detach the buffer internally solves the problem without becoming “framework-alike”.
Hello! I would really appreciate it if you sketched out some usage example, so that I could get acquainted with this technology of yours in a minute. GitHub - bablr-lang/language-en-json: A BABLR language for JSON is the closest I could get to "bablr" with JSON, but it doesn't provide any direct examples. TypeScript declaration files or source files in TypeScript would also come in handy, but there are none.
So please write an example of parsing JSON in a streamed manner that can replace JSON.parse in some way. And if you ever think of moving to TypeScript with a whole bunch of testing, you might want to take a look at this repo as a baseline (just self-advertisement, haha): GitHub - Guthib-of-Dan/lib_arch: Automated architecture for building typescript libraries
import { streamParse } from 'bablr';
import language from '@bablr/language-en-json';
let input = '{ "key": 3 }';
let result = streamParse(language, language.defaultMatcher, input);
BABLR is a unique, tree-sitter-inspired technology, but it doesn't provide a solution to the problems described above in this conversation.
First of all, streamParse iterates over the data and emits a lot of information about each token. The streaming I meant is the ability to accept incomplete data chunk by chunk, without excessive copying, and with the ability to clear those chunks immediately after they are consumed.
Secondly, streamParse doesn't seem to accept raw buffers, only something symbolic, likely addressed with array notation (these are just guesses, but it did read an 'open bracket'). If it does accept raw buffers, please provide more information about that, because it really bears on the points I made before.
import { streamParse } from 'bablr';
import language from '@bablr/language-en-json';
var encoder = new TextEncoder();
let input = encoder.encode('{ "key": 3 }');
let result = streamParse(language, language.defaultMatcher, input);
for (var el of result) {
  console.log(el);
}
// output
{
  type: Symbol(OpenNodeTag),
  value: {
    flags: { token: false, hasGap: false },
    name: null,
    type: Symbol(_),
    literalValue: null,
    attributes: {},
    selfClosing: false
  }
}
file:///C:/Users/abc/projects/dir/node_modules/@bablr/regex-vm/lib/internal/literals.js:4
export const code = (str) => str.codePointAt(0);
^
TypeError: str.codePointAt is not a function
Lastly, all this parsing comes at the cost of CPU time because of all that per-token information. This instrument has other use-cases, which currently don't overlap with JSON.parseBinary.
import { streamParse } from 'bablr';
import language from '@bablr/language-en-json';
import { readFile, decodeUTF8 } from '@bablr/fs';
let input = decodeUTF8(readFile('./fixture'));
// now you're parsing a file read in chunks
streamParse(language, language.defaultMatcher, input);
There is no excessive copying, and the chunks are cleared as soon as they are consumed. The complete fixture file is never held in memory.
Everything I looked at was based on your examples, @conartist6. And this one doesn't work:
import { streamParse } from 'bablr';
import language from '@bablr/language-en-json';
import { printTag } from '@bablr/agast-helpers/print';
import { readFile, decodeUTF8 } from '@bablr/fs';
let input = decodeUTF8(readFile('./package.json'));
var result = streamParse(language, language.defaultMatcher, input);
for (var tag of result) {
  console.log(printTag(tag));
}
// output
<_>
file:///C:/Users/abc/projects/abc/node_modules/@bablr/fs/lib/index.js:120
if (result.value !== undefined) yield* result.value;
^
TypeError: Cannot read properties of undefined (reading 'value')
at __readFile (file:///C:/Users/abc/projects/abc/node_modules/@bablr/fs/lib/index.js:120:16)
Performance
In the end, this conversation might not be the best place to describe BABLR. If you want to prove something, give a working example and benchmark it for garbage collection and for time per X iterations. Provide ready results for us to be convinced; otherwise this is going to get us nowhere.
import { streamParse } from 'bablr';
import language from '@bablr/language-en-json';

let input = '{"a":"123"}';

console.time("streamParse");
for (var i = 0; i < 1_000_000; i++)
  streamParse(language, language.defaultMatcher, input);
console.timeEnd("streamParse");

console.time("native parse");
for (var i = 0; i < 1_000_000; i++)
  JSON.parse(input);
console.timeEnd("native parse");
Sorry for the broken stuff. I had to publish a new version of @bablr/fs to fix it, and the example code needed tweaks too. But here at least is runnable example code:
import { streamParse } from "bablr";
import language from "@bablr/language-en-json";
import { printTag } from "@bablr/agast-helpers/print";
import { readFile, decodeUTF8 } from "@bablr/fs";
import { getStreamIterator } from "@bablr/agast-helpers/stream";
let input = decodeUTF8(readFile("./fixture"));
var result = streamParse(language, language.defaultMatcher, input);
let iter = getStreamIterator(result);
let step = iter.next();
await (async function () {
for (;;) {
if (step instanceof Promise) {
step = await step;
}
if (step.done) break;
let tag = step.value;
console.log(printTag(tag));
step = iter.next();
}
})();
At the moment, yes, it's pretty slow. That's partly because of all the polyfilling we have to do though. With support from the language core this could be a lot faster...
Even if you don't decide to use BABLR as your parser, you could still use @bablr/fs and hand-write a parser to process an input expressed as a stream iterable. Then you'd have everything you want I believe...?
Further, if engines supported a string-decode of an Immutable ArrayBuffer as a view onto the Immutable ArrayBuffer, the performance gain you seek could even happen without any new APIs or any API changes. OTOH, engines are unlikely to provide that string-as-view optimization.