Proposal: Parser Augmentation Mechanism

Background

I recently did an analysis of every proposal that has been submitted to TC39, because I was hoping to find commonalities in which types of proposals make it into the language versus which get rejected (or simply do not advance). I categorized each proposal as semantic or syntactic; see the linked issue for my definitions, but in short: semantic proposals change the behavior of the language, and syntactic proposals change the behavior of the parser.

The vast majority of TC39 proposals are semantic in nature. Of the 59 proposals that have been accepted into ECMA262, 53 (90%) were semantic; if you include all proposals that have made it to stage 2 or later, that ratio increases to 93% semantic. Only 6 of the 101 proposals that made it to stage 2 were purely syntactic in nature, and all of them have been implemented as of 2023.[1]

There are 25 other pure-syntax proposals listed in the repository. This makes pure-syntax proposals by far the least-accepted class of proposals with a 19% acceptance rate, compared with 41% for purely-semantic proposals and 47% for semantic proposals that require syntax changes. The conclusion is as self-evident as it is sensible:

TC39 does not like to make syntax-only changes to ECMAScript.

This is obvious, on consideration. The design space for any given programming language is finite. Making any change in the language means taking up syntactic, semantic, and/or conceptual real estate that can't be used for something else in the future - even more so because of the need to maintain compatibility. Put another way: once ECMAScript runs out of SyntaxErrors, once every syntax has a meaning, that's it. You can't make any more syntax changes in the language, ever.

And yet, syntax-level changes are still useful, and there is certainly a demand for them. About 13% of all proposals brought before TC39 have been purely syntactic in nature; if you include all proposals that have a syntactic element, that number climbs to about 46%. And from a rough count, the suggestions posted on this Discourse skew even more heavily toward purely syntactic changes to ECMAScript - ones that can be represented solely at the level of the AST.

What's more, ECMAScript implementations have already shipped vendor-specific syntax extensions, fracturing the ecosystem. Deno runs a preprocessor on TypeScript code before parsing it. Console implementations treated an initial hashbang as a comment, a vendor extension until it was standardized in 2023. Many, many open-source projects written for ECMAScript are not written in ECMAScript. The ability to interoperate with these projects depends on transpilers whose operation is outside the scope of TC39; thus, ECMAScript on its own is unable to accurately describe the relationship between two pieces of software written in different "ECMAScript-compatible" languages. This occurs even within a single project; only the build scripts know that a ".js" file should be parsed differently from a ".jsx" file.

Proposal

This proposal describes a standardized mechanism to query parser capabilities, as well as to inform the parser what mode it should use to interpret a given source text. This can be done by the requester (import a given file as TypeScript, for example), by the source text itself (declare that the current file is written in TypeScript), or by out-of-band descriptors referencing ECMAScript-defined conversions (import maps, HTML tags, HTTP headers, file extensions, etc.). It also describes a syntax that can concretely and affirmatively describe a given source text transformation, which can be used to validate the functionality of such a transforming parser, or to polyfill that functionality if it is missing.
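As a rough sketch of what those three channels could look like in practice (everything here is illustrative: the attribute key, the syntax directive grammar, and a capability-query method like Parser.hasImplementation are placeholders of mine, and "json5" is just a stand-in conversion name):

// Requester-side: the importing module asks for a conversion via an
// Import Attribute, mirroring the JSON-modules shape used later in this post.
const settings = await import("./settings.json5", { type: "json5" });

// Source-side: the imported file could instead declare its own parsing mode
// as its first statement:
//
//     syntax "json5";

// Capability query: feature-detect a conversion before relying on it.
// (Parser.hasImplementation is a hypothetical name, not settled API.)
if (globalThis.Parser?.hasImplementation?.("json5")) {
  // safe to import sources that depend on the "json5" transformation
}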

There are a number of other proposals which also deal with syntax-level changes in ECMAScript, including some designed around accepting a different type of input to the parser:

with others describing partial syntax changes that could be accommodated with an AST transformation:

Each of these is already implemented, somewhere - perhaps behind an experimental flag, perhaps in a transpiler. With no standard way to describe these transformations, some engines are left unable to parse code that would otherwise be valid ECMAScript. Once the transformations are defined, however, they can be used to polyfill older implementations or to prototype new features.

Note that, depending on engine support, this mechanism could be used to inject code into locations that were previously tamper-proof, as there is no detectable difference between implementing a pipe operator and changing a URL. There are two mitigations to this:

  1. There are only two contexts that can affect the parser: the importing module (using an Import Attribute) and the source text itself (using a syntax to be defined herein). An attacker who can modify the source text to add a parser declaration can already add whatever code they need, and the importing module is already free to import any URL it likes. Furthermore:
  2. Engines are not required to support polyfilling parser implementations, and they are not required to respect cross-origin import attributes (they are allowed to pretend they can't perform a given kind of conversion). This notation, when used, is expected to be consumed by whatever build process is used for these files (for typical bundler/minifier implementations) or used to control an implementation-specific parser feature like language support.

Should I continue?


  1. Trailing commas, Lifting template literal restriction, Optional catch binding, JSON superset, Numeric Separators, and Hashbang Grammar. ↩︎

I asked about something like this a few weeks ago in the TC39 Matrix channel and was directed to the "TC39 Module Harmony" channel and the Compartments proposal. It seems like that's where the idea of "loaders" would be created with import attributes. (Someone correct me if I'm misunderstanding.) Apparently the older ShadowRealm proposal had a preprocessor that allowed rewriting modules, but that hasn't been reintroduced into Compartments. It's quite a rabbit hole, and I didn't look into how it could be added.

I'd continue investigating this, as from what I saw it's been an idea for a while. The idea of running loaders on imports in general seems like a logical feature.

Yes, the Compartments proposal is module-related, but it addresses a completely different issue - it's focused on semantic isolation of modules without affecting syntax. Parser Augmentation is, in effect, the syntactic counterpart to Compartments; it allows different parts of a single project to use different parsing modes, rather than using different bindings.

If Compartments and Parser Augmentation were both to advance through TC39, I'd suggest that PA be conceptually positioned as a very limited type of VirtualModuleSource. The full VirtualModuleSource specification has access to a much more powerful toolset than PA, as it can directly assign values and behaviors to module outputs. Compare the example implementation of JSON modules using Compartments and VirtualModuleSource:

class JsonModuleSource {
  bindings = { export: 'default' };
  #object;
  constructor(text) {
    // JSON.parse throws a SyntaxError from here if the source text is invalid.
    this.#object = JSON.parse(text);
  }
  execute(imports) {
    // Exports of multiple module instances backed by this source
    // should be referentially independent, hence the deep clone
    // (`clone` is an assumed deep-copy helper).
    imports.default = clone(this.#object);
  }
}

const source = new JsonModuleSource('{ "meaning": 42 }');
const module = new Module(source);
const { default: { meaning } } = await import(module);

to an equivalent implementation using Parser Augmentation:

async function json(parser) {
  // The parser should parse the rest of the input as text and emit it
  // as a string constant token, as though it were surrounded in double quotes
  // and all appropriate escape characters added.
  parser.setTokenizerMode(Parser.TOKENIZER_STRING_CONSTANT);
  // Without any arguments, parseInput() will consume as much as possible.
  // The return type is an AST node bound to file:line from the source.
  const stringConstantToken = await parser.parseInput();
  // The syntheticAST function runs the default ECMAScript parser on a template,
  // marking all generated AST nodes as synthetic (not in source text)
  return Parser.syntheticAST`export default JSON.parse(${stringConstantToken})`;
}

// Engines are not required to allow registration of implementations
Parser.registerImplementation("json", json, { polyfill: true });
const { default: { meaning } } = await import("life.json", { type: "json" });

The PA implementation can only create modules that could be written in standard ECMAScript. It does not have access to the internals of the module loader mechanism; for example, it cannot violate programmer expectations as the VirtualModuleSource does by cloning the object every time it is imported.

Similarly, the VirtualModuleSource mechanism can allow a WebAssembly module to automagically be loaded with a single import declaration, but Parser Augmentation cannot. (Could you use PA to reformulate a .wasm file into a JS module that creates a binary blob with the WASM's contents? yes. Could you use PA to perform the WebAssembly instantiation every time the module is imported? no.)
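To make the "yes" half of that concrete, a TD along these lines could do it - this is a sketch using the same hypothetical parser API as the JSON example above; TOKENIZER_BINARY_CONSTANT and the "wasm-bytes" name are assumptions of mine, not anything specified:

async function wasmAsBytes(parser) {
  // Consume the rest of the input as a single binary constant token.
  parser.setTokenizerMode(Parser.TOKENIZER_BINARY_CONSTANT);
  const bytesToken = await parser.parseInput();
  // The produced module is plain ECMAScript: it only exports the raw bytes.
  // Instantiation remains the importer's job - PA can't perform it on import.
  return Parser.syntheticAST`export default new Uint8Array(${bytesToken});`;
}

Parser.registerImplementation("wasm-bytes", wasmAsBytes, { polyfill: true });

const { default: wasmBytes } = await import("add.wasm", { type: "wasm-bytes" });
const { instance } = await WebAssembly.instantiate(wasmBytes); // the importer's choice, every time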

This code from my in-progress draft shows a Transformation Description (TD) function that would allow you to implement the Pipeline operator in any Parser Augmentation-compliant bundler/transpiler - even in tsc if it were to implement a PA layer. No, it (almost certainly) wouldn't work in a browser, but that hardly matters if you're already using a build process.

function pipeline(parser) {
    // Defined operators and keywords can have handlers associated with them, as the third argument.
    // The handlers are called upon emit.
    parser.defineOperator("|>", {associativity: "left", precedence: "=", context: "block"}, pipeOperator);
    parser.defineKeyword("%", {treatAs: "identifier", context: {operator: "|>", operand: 2}}, topicReference);
    // having no parse/emit calls in a TD is logically equivalent to a final line reading:
    // for await (const node of parser.parseNode()) parser.emit(node);
}
function pipeOperator(parser, expr, context) {
    if (expr.rhs.isKeyword("%")) {
        parser.emit(expr.lhs);
    } else {
        // expr.state is a container to store TD-local metadata. It is not visible to other TDs or to
        // the emitted AST.
        expr.state.topicVariable = context.newSyntheticLocal();
        // this emit will trigger the keyword handler for topic references in the rhs
        parser.emit(Parser.syntheticExpression`${expr.state.topicVariable}=${expr.lhs},${expr.rhs}`);
    }
}
function topicReference(parser, expr, context) {
    parser.emit(context.state.topicVariable);
}
Parser.registerImplementation("pipeline", pipeline);

syntax "pipeline"; // a syntax declaration takes effect after the newline following the statement, so just after here →
console.log("foo" |> one(%) |> two(%) |> three(%));

If PA were adopted, it would give TC39 the option of accepting the semantics of the Pipeline proposal without having to give it a syntax. You could declare your own syntax at the top of a file using something like:

syntax "pipeline" with { operator: "|>", topicReference: "%" };

Even the browser vendors might agree to implement something like that, I think.

IMO it's highly unlikely that browsers will implement something that changes the syntax on the fly.

Agreed - that's something I'd expect to remain transpiler-only. That's why hosts are allowed to reject any syntax directives - it means the proposal doesn't need browser buy-in :smile: (That's also why I like having an explicit syntax for it: it fails fast if your source file gets to the browser without the transpiler running first.)

A syntax directive at start-of-file, now... I could see that happening in browsers five or ten years down the line.

Come to think of it, browsers already have support for this proposal. If the syntax directive names a transformation the host doesn't support, it's supposed to throw a SyntaxError. And that's exactly what browsers do today if you type syntax "typescript"; :joy:

So far, the entirety of the language is at runtime - I doubt this feature would be the one to change that.


@dmchurch I think this is an absolutely brilliant write-up, and I'm 100% on board. This is what I have been building for the last two years!

I've been experimenting with mechanisms for defining extensible parsers over that time, and I would be extremely curious to get your feedback on this method of defining extensible parsers, which incidentally is a grammar that describes CSTML (my work-in-progress unified format for parse trees).


My sense is that nothing about this requires initial buy-in from TC39. I hope to be able to establish a de-facto standard for interoperability between tools by doing the best job I can of simultaneously meeting the needs of as many constituencies as possible. If I can do that and win over the market, my suspicion is that interest from TC39 in formalizing those "standards" would follow.

In order to make my format as useful as possible, I've created a parser VM. The goal is to be able to provide guarantees to two sets of people: language authors and tool authors. Language authors get guarantees like extensibility, readability, and debuggability. Tool authors get an entire ecosystem's worth of languages, streaming parsing, a CLI, and basically all the facilities required to build an IDE around an arbitrary syntax.


And it wouldn't! This proposal describes a mechanism that a compliant host could use to perform this entire process at runtime. The fact that I don't expect browsers to implement this soon, if at all (and I expect that they will never support things like mid-stream syntax changes[1]), doesn't change the fact that this is conceptually a runtime operation.

That's why this proposal is, in contract law terms, "severable". Hosts can implement whatever parts of it they're comfortable with, and the expected host behavior upon encountering functionality they don't support is to produce a SyntaxError on import, which is exactly what they do now.

If TC39 were to take up Parser Augmentation, browsers could adopt it at their own pace, and unless and until they do, transpilers would take up the slack by downleveling "valid, PA-mediated ECMAScript" into "valid, non-PA ECMAScript". Exactly as any other newly-introduced feature.


  1. Though Firefox might surprise me, I could imagine them hiding something like that in an obscure about:config option. ↩︎

That's why the proposal I'm in the process of writing isn't actually a TC39 proposal - it's a Babel RFC. If some delegate were to show up and say "oh hey I love this idea can I champion it" then I'd pivot, but until then, my assumption is that I'm on my own, and my plan is to convince the Babel folks that this is a good idea and then buckle down to write the initial implementation myself.

The primary reason I'm posting this on the TC39 Discourse - besides the obvious, that I'd love for this to get included in ECMA262 - is that I think TC39 is going to want to weigh in on the conversation that's going to be happening, and I want to give them the earliest possible opportunity to do so :smile:

I'm definitely excited to spend a little bit looking at what you've put together, though! No need to reinvent the wheel, right?

[ETA: I've put up a draft at:

]

You suggest that users should declare transformers, and that browsers should be allowed to replace the declared transformer with their own implementations.

Why?

If the input grammar and the output grammar are both safe then I guess it's fine, but if the input transformer was really standards-compliant then there was no need to replace it, and if it wasn't, then you've just created a nasty potential parser-mismatch bug/security hole.

In my mind language extension is so powerful for the same reasons function calling is: nobody needs to know that you've done it.

So long as you can take your extended language and define it in terms of existing language, nobody can ever stop you from developing your own vernacular and using a tool like Babel to implement it. But your vernacular only has well-defined meaning to the extent that it's translated to a more general meaning according to the rules you intended for it. You can't apply the rules of your vernacular to understand my vernacular, or at least not necessarily so.

As an example, it is possible to superficially implement Python syntax and transform it to JavaScript. You can then write a "hello world" program in Python syntax, and execute it on a native Python VM, and through a transformer on a JavaScript VM. It does the same thing, but is it the same program?

The problem is the version of the program running in JS is still really a JS program. Once you get to more complex programs the distinctions between the runtimes and effective meaning will become apparent, even as the programs you are giving each runtime remain identical.

Another way of thinking about it is that there isn't just one way to translate Python syntax onto the JS runtime either. There are likely many subtly different strategies one might use, all of which would be invisible implementation details from the python syntax side, yet all of which would be non-interoperable on the JS side.

To sum up: I do not think that it can be safe to have a mechanism that declares "this code is written in Python and here is how to transform it to JS" and replace that with "this code is written in Python, so I know how to transform it to JS"

What the code "meant" is fundamentally linked to what it did when the implementor ran it

If a syntax statement makes it as far as the engine - that is to say, if you are using an engine with runtime PA support - then the engine gets the final say on what PA transformations are allowed, end-of-story. It's not required to support your redefinition of a well-known transformation identifier.

(Mind you, the question of "what is a well-known identifier" is open for discussion/bikeshedding, but that's a question that doesn't need answering just yet.)

If you register your transformation as a polyfill, the implementation won't use it if it already has a definition for it. If you register your transformation as a validator, it doesn't have to run it if it has its own implementation, but (assuming it allowed the registration at all) it won't release a parse that conflicts with your definition from a produced-AST perspective.

This is to guard against the possibility of a known vulnerability/weakness in a given implementation of a TD, whether it's a security risk or a performance/DoS risk. Runtime PA is always "best-effort, at the pleasure of the engine" (unless a given transformation is mandated by the spec, like "json"); if your particular code depends on a given transformation running exactly the way you intend, then (a) use a non-well-known identifier, and (b) do the PA transformation as part of your build/bundling process.
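In code, the difference between the two registration modes might look like this - the { polyfill: true } spelling comes from the earlier examples, while { validator: true } and the json5TD function are assumptions of mine for illustration:

// Polyfill: only used when the host has no native "json5" transformation.
Parser.registerImplementation("json5", json5TD, { polyfill: true });

// Validator: the host may keep its own implementation, but it won't release
// a parse whose produced AST conflicts with what json5TD would have emitted.
Parser.registerImplementation("json5", json5TD, { validator: true });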

You misunderstand the point of PA - especially in the hypothetical runtime-PA environment. Runtime PA is primarily intended as a way to interact with engine-native parser functionality. If I type

syntax "python" with {version: "3.7"};

at the top of the file, it's because in some hypothetical future, some JS runtime has the native ability to parse files with Python syntax as though they were ECMAScript. It doesn't matter how it does it; it could have the full Python runtime sitting in there, have it load up the module, and use Python's own source representation to generate the JS AST. (I'd sure hate to write the TD equivalent of that transformation.) For that matter, it could use some complex, weird object-proxy sort of thing where the JS engine talks to the Python engine in order to get native-Python performance out of it.

The host is allowed to make any optimizations it likes in order to get the effect of running a given ECMAScript source text. Browsers routinely use JIT compilation to turn ES code into machine code; this is the equivalent of turning ES code into Python bytecode as an optimization. It's basically like this:

The behavior the ECMAScript+PA spec expects:
take the python source, syntax-translate it to ECMAScript, compile it to bytecode, optimize it to machine-specific code

The behavior of the engine:
listen I don't have time for that and I've got a python-linked-to-JS engine right here, I'm just gonna feed it the python script

As long as all the observable effects are respected, it doesn't matter how the engine goes about it, it's still spec-compliant.

It's the kind of thing where, when you open your browser's dev tools, everything suddenly gets slow because it stops using the optimized machine-code (or Python-bytecode) implementation and falls back to the spec-compliant interpreter version.

Chances are good that, with this hypothetical engine, it actually compiles the Python script to bytecode and emits an AST that looks like a Python VM. If you hit pause in the debugger, it checks the native Python engine's current bytecode location and quickly sets up a JS callstack that looks like "yes your honor, of course I was executing this Python code using this JS implementation, and no, of course I'm not hiding any tail calls :sweat_drops:"

PA does not enable this behavior. Someone could absolutely link up a Python engine and a JS engine in some Frankensteinian creation and hide it all behind import "filename.py" in a .js file if they wanted to. Heck, someone probably already has, but I don't care enough to go looking.

What PA does is it enables you to describe this behavior, and it provides a way for ECMAScript-compliant code to request this behavior.

And if you wanted, you could write your own Python-to-Python-bytecode compiler in ECMAScript and register it, with a little bit of boilerplate, as a validator for the system's own internal "python syntax transformer". If the bytecode being used by the Python engine doesn't match the bytecode you output, the engine will refuse to import it. You can guarantee the behavior of code running in another language, from ECMAScript, using only ECMAScript.

And now that I think about it, I think the engine's TD would be extremely simple:

async function parsePython(parser, {version}) {
  parser.setTokenizerMode(Parser.TOKENIZER_STRING_CONSTANT);
  const scriptToken = await parser.parseToken();
  const {bytecode, exports}
    = await PythonVM[version].compileScript(scriptToken.value);
  parser.emit(
    Parser.syntheticAST`PythonVM[${version}].executeBytecode(${bytecode})
      ${exports}`);
}

(This is why it's called a "transformation description", because it's just describing what the engine is already doing, in a language we all speak.)

As long as your validating compiler TD outputs an AST of PythonVM["3.7"].executeBytecode(...) with the same bytecode that the engine's (native) compiler produced, followed by the same exports, the engine will let you import the file.
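Hooking your own compiler in as that validator would then be a couple of lines - again assuming the { validator: true } spelling from above, with myPythonCompilerTD standing in for your ECMAScript implementation:

// The engine keeps using its native Python pipeline, but it compares the
// produced AST (bytecode plus exports) against what myPythonCompilerTD emits
// and refuses the import on any mismatch.
Parser.registerImplementation("python", myPythonCompilerTD, { validator: true });
const lib = await import("./analysis.py", { type: "python" });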

Huh, this sort of turns ECMAScript into a lingua franca of software interoperability. I'm kinda okay with that, really.

I've updated the proposal with additional sections describing the role of PA in the runtime environment, expanding on and clarifying the example above.

I've rewritten the section on Validators with a more realistic and concrete example. Does it answer your question?

I feel like I'm not explaining myself particularly well, but honestly that's not that unusual for me. I think that perhaps the only way I can get people to understand that they should be getting excited about this is if I show that it can be done at runtime, without cost, in a production engine, so now I'm doing a bit of V8 hacking :smile: Stay tuned.