Proposal: add offset, line, and column properties to `SyntaxError` and `JSON.parse` errors

This would avoid a ton of code for a common case, and would make syntax error reporting way easier.

How would you suggest specifying this, given that stacks aren't in the language at all (yet)?

I've wanted this as well, especially for the new base64 methods. I'm reluctant to suggest enabling it by default, though, because keeping track of this information can slow down parsing.

Also, some parsing strategies will result in different errors being reported first, and for some input languages it would be annoying to standardize those. For example, `let x = 1; let x = 2; ,,,bad,,,` might error at the `,` or at the second `x` depending on how exactly the engine keeps track of things.

I think this makes the most sense for JSON.parse and other more-constrained languages (like base64), rather than trying to do it for all syntax errors.

> How would you suggest specifying this, given that stacks aren't in the language at all (yet)?

I don't think stacks really enter into it? At least not for stuff like JSON.parse.

Oh, I was just thinking of adding those three properties and accepting them in the constructor as `new SyntaxError("message", {line, column, offset})`. Then, in prose or however it ends up specified, require them to be filled accurately in the places where the spec needs syntax errors to be thrown.

They would not interact with the stack trace proposal. It's orthogonal.
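
To make the shape concrete, consuming it would look something like this (a sketch only; nothing here is standardized, and today these properties would simply be `undefined`):

```ts
try {
  JSON.parse('{"a": 1,}');
} catch (e) {
  if (e instanceof SyntaxError) {
    // Proposed: filled in by the engine at throw time.
    const { line, column, offset } = e as SyntaxError & {
      line?: number;   // 1-based
      column?: number; // 1-based
      offset?: number; // 0-based code unit offset into the source
    };
    console.error(`JSON syntax error at ${line}:${column} (offset ${offset})`);
  }
}
```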

It wouldn't slow down parsing.

  • The source offset in each chunk has to be tracked just to correctly iterate its bytes.
  • Updating line and column counters incrementally can slow things down slightly, but branches elsewhere are a much bigger factor.
  • If you can re-scan the source, you can defer the line and column counting until you actually need it to report an error. I did this recently with Marked: it doesn't offer line or column numbers at all, but the source offset can be computed, and counting from there is extremely simple and takes very little program space (see the sketch after this list).
  • Implementors generally already do at least line tracking for JS parsing anyway, for the sole purpose of making debugging easier.
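
A minimal sketch of that deferred counting, assuming JS-style line terminators (this is the shape of the approach, not the actual Marked code; `locate` is a hypothetical helper run only on the error path):

```ts
// Runs only when an error is actually reported, given the offset at which
// the parser detected it. Nothing is tracked on the hot path.
function locate(source: string, offset: number): { line: number; column: number } {
  let line = 1;
  let lineStart = 0;
  for (let i = 0; i < offset; i++) {
    const c = source.charCodeAt(i);
    if (c === 0x0a || c === 0x0d || c === 0x2028 || c === 0x2029) {
      // Treat CR LF as a single terminator.
      if (c === 0x0d && i + 1 < offset && source.charCodeAt(i + 1) === 0x0a) i++;
      line++;
      lineStart = i + 1;
    }
  }
  return { line, column: offset - lineStart + 1 };
}
```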

I'd be okay with giving implementors freedom to choose where to place the syntax error. I just want the three fields filled with something meaningful.

> The source offset in each chunk has to be tracked just to correctly iterate its bytes.

Parsing does not necessarily consist of iterating bytes one at a time because of, among other things, SIMD. It's true that for many parsing strategies the overhead would be small, but that's not the same as having no overhead.

> I'd be okay with giving implementors freedom to choose where to place the syntax error. I just want the three fields filled with something meaningful.

I don't think the committee would accept standardizing this property without requiring consistency between engines. Engines themselves probably wouldn't either: they've found that less popular engines get bug reports complaining when their behavior doesn't match the more popular ones, regardless of what the standard says, so that theoretical freedom to differ doesn't hold up in practice.

They could recover the source offset on error by re-processing the failing chunk of 16/32/64 bytes to find the error's position within it, then adding that to the offset the chunk was loaded from. The rescan could itself be SIMD-assisted and reuse some of the intermediate values already generated, saving both code and time.
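
In scalar form, that bookkeeping is roughly the following (a sketch only; `CHUNK` and `isErrorAt` are illustrative stand-ins, and a real engine would do the inner rescan with the same vector ops it used for parsing):

```ts
const CHUNK = 32; // hypothetical vector width, in code units

// The hot loop only needs to carry chunkBase, the offset it loaded the
// failing chunk from; the exact error offset is recovered afterwards.
function errorOffset(
  source: string,
  chunkBase: number,
  isErrorAt: (code: number) => boolean
): number {
  for (let lane = 0; lane < CHUNK; lane++) {
    if (isErrorAt(source.charCodeAt(chunkBase + lane))) {
      return chunkBase + lane; // chunk base + lane index within the chunk
    }
  }
  throw new Error("caller only rescans a chunk known to contain the error");
}
```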

And the line counting could itself just use a sequence of non-temporal vector loads to minimize overhead. All it needs to count is instances of the following subsequences (matching the two-unit CR LF sequence as a single unit, so it counts only once):

  • 1-byte (Latin-1) strings:
    • 0A
    • 0D not followed by 0A
    • 0D 0A
  • 2-byte (UTF-16) strings:
    • 000A
    • 000D not followed by 000A
    • 000D 000A
    • 2028
    • 2029

Or, equivalently:

  • 1-byte (Latin-1) strings:
    • 0A not preceded by 0D
    • 0D
  • 2-byte (UTF-16) strings:
    • 000A not preceded by 000D
    • 000D
    • 2028
    • 2029

Both of these parallelize extremely easily. The kernels would look something like this, and both would be fully bottlenecked by memory bandwidth:

// 1-byte strings; kernel needs 6-7 vector ops + 6-8 scalar iops.
// Scalar model of the per-lane SIMD logic: each mask is all-ones (-1) or 0.
size_t line = 1, column = 0;
uint8_t prev_cr_mask = 0;
for (uint8_t current : source) {
    uint8_t nocr_mask = -(current == 0x0A); // line breaks other than CR
    uint8_t cr_mask = -(current == 0x0D);
    // A lone LF counts here; a CR is deferred and counted at the next
    // position, so a CR LF pair counts exactly once.
    line -= (int8_t)(prev_cr_mask | nocr_mask);
    // What this really means in vector form (clz over a WIDTH-bit movemask):
    //     m = movemask(nocr | cr);
    //     column = m != 0 ? clz(m) + 1 : column + WIDTH;
    column = nocr_mask | cr_mask ? 1 : column + 1;
    prev_cr_mask = cr_mask; // requires shuffle
}
if (prev_cr_mask) { // a trailing CR was deferred past the end of input
    line++;
    column = 1;
}

// 2-byte strings; kernel needs 9-10 vector ops + 6-8 scalar iops.
size_t line = 1, column = 0;
uint16_t prev_cr_mask = 0;
for (uint16_t current : source) {
    uint16_t nocr_mask = (
        -(current == 0x000A) |
        -((current >> 1) == (0x2028 >> 1)) // matches both 2028 and 2029
    );
    uint16_t cr_mask = -(current == 0x000D);
    line -= (int16_t)(prev_cr_mask | nocr_mask); // CR deferred, as above
    // Vector form, as in the 1-byte kernel:
    //     m = movemask(nocr | cr);
    //     column = m != 0 ? clz(m) + 1 : column + WIDTH;
    column = nocr_mask | cr_mask ? 1 : column + 1;
    prev_cr_mask = cr_mask; // requires shuffle
}
if (prev_cr_mask) {
    line++;
    column = 1;
}

While this looks complicated, it could also be scheduled alongside the vectorized parsing, since there is very likely slack somewhere in the pipeline to absorb it. So in reality, this can likely be done with at most a few cycles of overhead per chunk (each vector op here typically takes 1-2 cycles).

Yes, you can probably recover most of the performance at the cost of additional complexity in the optimized parser, but that's also something engines try to avoid. It's not necessarily impossible, it just has costs that we may not want to pay.

Doing a bunch of work that is going to get thrown away 99%+ of the time is more convenient for the programmer but worse for the user and/or engine; asking the developer to opt in when they want the additional data isn't so bad. We already went through this logic when we made the RegExp `/d` flag opt-in instead of just adding indices unconditionally.
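
For reference, that precedent in action: the indices are only computed when the flag asks for them.

```ts
const withFlag = /(b+)/d.exec("aabbcc");
console.log(withFlag?.indices?.[1]); // [2, 4]: start/end of group 1

const withoutFlag = /(b+)/.exec("aabbcc");
console.log(withoutFlag?.indices);   // undefined: nobody paid for them
```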

Sure, and that's precisely why I mentioned counting lines and columns on error as an alternative.

Also, gonna be honest, I don't find your complexity argument persuasive in light of implementation reality.

  • Every ES implementor I'm aware of already computes line and column locations for JS parsing for the sake of providing accurate stack traces.
  • Every ES implementor I'm aware of provides offsets, line/column locations, or both for JSON syntax errors.
  • Many implementors I'm aware of provide offsets for regexp parse errors. Since regexp parsing isn't very vectorizable, reporting offsets should be fairly easy.

I would be okay with reducing "must provide accurate line/column/offset values" to a "should". My focus is just on providing engines a way to expose the information they already have.

I'll call out that this has more in common with AST node location generation than with error location reporting. Parser libraries commonly allow skipping AST node location generation, since the allocation of the extra location object per node in particular slows things down a lot.
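
For example, acorn makes node locations opt-in for exactly this reason; start/end offsets are always present, but the per-node `loc` object costs an extra allocation:

```ts
import { parse } from "acorn";

const cheap = parse("let x = 1;", { ecmaVersion: 2023 });                  // offsets only
const full  = parse("let x = 1;", { ecmaVersion: 2023, locations: true }); // adds node.loc
```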

The closest analogue to my proposal here is `re.lastIndex`, which every ES implementor must implement in all cases.
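
That is, an offset every engine already maintains on each global or sticky RegExp, updated on every successful match:

```ts
const re = /\d+/g;
re.exec("id 42 ok");
console.log(re.lastIndex); // 5: the offset just past the match "42"
```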