Proposal: Parser Augmentation with CSTML

This proposal creates a new system of code preprocessing that allows a JavaScript runtime to evaluate code written using arbitrary syntax extensions, so long as those extensions are defined by a compile-time script that runs in a formalized transpiler environment and whose output is standard JavaScript.

Benefits:

  • JSX and TypeScript support becomes native, eliminating a soul-sucking build step
  • All compile-to-JS languages gain native execution support with source-mapped debugging
  • Allows rapid, effective prototyping of new language features, as many people would love to do on these forums
  • Compile-to-JS languages like TypeScript can support multiple language versions and options, as they must in order to model real-world discrepancies, such as some versions of TS using nonstandard implementations of standard language features, particularly for..of and @annotations
  • Allows developer tools to help users understand what any code they are viewing means semantically, much like developers use https://astexplorer.net when trying to understand how unfamiliar constructs are parsed

CSTML

This proposal is implemented through the creation of a new language, CSTML. CSTML is a fully text-preserving embedding language for code documents. It takes inspiration from HTML and JSON, while providing a natural structure for trees that combine abstract and concrete syntactic elements.

When I say that CSTML is a "text-preserving embedding language", I mean that it is always possible to print the program stored in the CSTML document without knowing anything about the language it was written in. This is a crucial property for tooling development. It also makes it simple to formalize the definition of a parser: a parser is a program which outputs a CSTML document which encodes its input.
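Stated as code, the property looks roughly like this (a conceptual sketch; parse and print here stand in for any CSTML-producing parser and any CSTML printer, not a specific published API):

const roundTrips = (parse, print, sourceText) =>
  print(parse(sourceText)) === sourceText; // holds for any input the parser accepts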

Let's break down a simple CSTML document, then. It starts with a doctype tag:

<!0:cstml>

Here are some things to notice:

  • <! was chosen to denote a certain spiritual kinship with HTML
  • 0 is indicative of this being a proposal. The formal version would be 1
  • Space is not allowed in <!0:cstml, making it trivial to match against
    • Putting 0 before cstml ensures there is no risk of "Windows 9 syndrome". Reportedly, Windows 9 was skipped because the developers realized that many lazy coders had written checks like version.startsWith(9). Because I cannot fix developer laziness, I have made it impossible to check for the presence of the cstml sigil without consuming the complete version number (see the sketch after this list)
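To make that point concrete, here is a minimal sketch of a well-behaved doctype check (a hypothetical helper, not part of any published API):

// Because the version precedes the word "cstml", a prefix check cannot find
// the sigil without also consuming the complete version number.
const doctypePattern = /^<!(\d+):cstml(?=[\s>])/; // attributes may follow before '>'

function readDoctypeVersion(source) {
  const match = doctypePattern.exec(source);
  if (!match) throw new Error('not a CSTML document');
  return Number(match[1]); // 0 for this proposal; 1 once formalized
}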

The complete document for the JSON input "Hello world" might look like this:

<!0:cstml>
<>
  root:
  <String>
    openToken: <*Punctuator '"' />
    content:
    <*StringContent>
      'Hello world'
    </>
    closeToken: <*Punctuator '"' />
  </>
</>

Let's break down what's going on. Much like HTML or XML (or SGML) we have matched tag pairs like <Tag></>. In CSTML these tags are used to describe how the parser saw the input.

The most readily visible difference is that in a language like HTML a node only has one kind of relationship: an array of its children. CSTML requires that each of a node's relationships with a child node have a JSON-like name, e.g. name: <Node/>. This much more closely matches the natural form of syntax trees. Any named relationship may also be array-ish.
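Here is a rough sketch of how those named relationships might land in memory (illustrative names only; the actual in-memory shape is described in the agAST section below):

const stringProperties = {
  openToken: { type: 'Punctuator' },   // name: <Node/> in the document
  content: { type: 'StringContent' },
  closeToken: { type: 'Punctuator' },
};

const arrayishProperties = {
  elements: [{ type: 'Element' }, { type: 'Element' }], // an array-ish relationship
};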

The next most visible difference is that the text content of the document must be wrapped in string quotes. This ensures that these documents can be safely formatted for viewing without risking damage to the code stored in them.

Because single-character tokens will be so common in real use, a shorthand has been devised for <*Punctuator> '"' </>, which is <*Punctuator '"' />.

Speaking of <*Punctuator>, what is that * doing there? It is a flag on the node, in this case flags.token. It tells you that this node will be a leaf of the syntax tree, and its contents will be text strings, not other nodes. CSTML forbids a node to have text strings as children if it does not set flags.token, which has the practical effect of ensuring that each piece of text in the document has a production name and a unique path associated with it.

Quoted strings use JSON escaping rules. XML escapes were rejected as they have the unfortunate effect of making it impossible to know what text an (arbitrary) XML document contains unless you have the entities dictionary it was encoded with. To print a document, concatenate all the string portions.

Two more flags are important to know about:

  • #, the trivia flag, denotes semantically meaningless content and allows a node to be inserted as a child without requiring the relationship to be named.
    • e.g. <#Comment> '/* comment */' </>
  • @, the escape flag, allows a node to be embedded in a token without requiring the relationship to be named.
    • e.g. in a string like "\"", you could have <@Escape cooked='"'> '\\"' </> (see the sketch just after this list)
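For instance, the content of the JSON string "say \"hi\"" might be encoded like this (a sketch extrapolated from the examples above, not checked against the reference grammar):

<*StringContent>
  'say '
  <@Escape cooked='"'> '\\"' </>
  'hi'
  <@Escape cooked='"'> '\\"' </>
</>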

Language Embedding

Like XML, CSTML supports the idea of multiple namespaces. A namespaced node looks like:

<Namespace:Node />

The difference with XML here is that the name of the namespace is non-negotiable. Just as production names are subjective -- they only have meaning in the context of the original parser -- so too are namespace names. The upshot of this is that in CSTML checking for Namespace:Node is a sufficiently safe check, whereas in XML it would not be (due to the presence of arbitrary XML namespace name to schema URL mappings).

It is also possible in CSTML to have dotted namespaces: <Foo.Bar:Baz />.
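Checking for a namespaced node can therefore stay simple. A minimal sketch, assuming the namespace corresponds to the node's language field in the agAST shape described below (this helper is illustrative, not part of BABLR):

function isNode(node, namespaceParts, type) {
  return (
    node.type === type &&
    node.language.length === namespaceParts.length &&
    node.language.every((part, i) => part === namespaceParts[i])
  );
}

// isNode(node, ['Foo', 'Bar'], 'Baz') would match <Foo.Bar:Baz />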

I have omitted a completely formal specification of CSTML because it is pleasantly self-specifying. Here is the CSTML parser definition: language-en-cstml/lib/grammar.macro.js at 2e47ff3da616c225184837fafdeabfbc5d842f7d · bablr-lang/language-en-cstml · GitHub

To try it out you can run:

npx -p @bablr/cli bablr -l @bablr/language-en-cstml -p Node -f <<< '<*Token />'

agAST

The in-memory representation of CSTML is currently termed agAST. It encodes nodes as:

node = { flags, language: [...'parts'], type: 'str', children: [], properties: {}, attributes: {} }

I've mentioned flags before, but here is the complete accounting of them. Each is a boolean.

flags = { 
  token, // is this a token node with only literal children
  trivia, // is it a comment
  escape, // is it an escape code which has a literal interpretation
  expression, // is it necessary to look ahead in the source to understand why this node is here
  intrinsic // if true, disallow interpolating this node: the parser needs it
}

children is the most interesting part of the node. It contains the document's tag stream. The tags should (mostly) map quite clearly to the kinds of syntax you see in a CSTML document. The types of tags are:

  • { type: 'ReferenceTag', value: { name: 'str' } }
  • { type: 'OpenNodeTag', value: { flags, language, type, attributes } }
  • { type: 'CloseNodeTag', value: { language, type } }
  • { type: 'LiteralTag', value: 'value' }
  • { type: 'GapTag', value: undefined } (indicates a location where a required or expected node is known to be missing)
  • { type: 'NullTag', value: undefined } (indicates the absence of a node, for monomorphism)
  • { type: 'ArrayTag', value: undefined } (allows many values to be bound to this name, creating an array in properties)
  • { type: 'EmbeddedNodeTag', value: embeddedNode } (for comments and escapes)
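Putting the node shape and the tag shapes together, the <*Punctuator '"' /> node from the earlier example might look roughly like this (a sketch; the language value and the treatment of the node's own open/close tags are assumptions):

punctuatorNode = {
  flags: { token: true, trivia: false, escape: false, expression: false, intrinsic: false },
  language: ['JSON'],  // assumed identifier for the JSON grammar
  type: 'Punctuator',
  children: [{ type: 'LiteralTag', value: '"' }],
  properties: {},      // token nodes hold literal text, not named children
  attributes: {},
}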

It is possible to construct a single tag stream for the entire document by visiting all the children of each node recursively. This tag stream constitutes a valid and complete streaming representation of the document's contents, and the tree can be rebuilt from the stream. A tag stream's tokens are embeddable in a tree without alteration.
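That stream also gives a direct way to realize the "concatenate all the string portions" printing rule from earlier. A minimal sketch over the tag shapes above (assuming embedded nodes are not already expanded by the flattening):

function printSource(tags) {
  let out = '';
  for (const tag of tags) {
    if (tag.type === 'LiteralTag') {
      out += tag.value; // literal text is the source of truth
    } else if (tag.type === 'EmbeddedNodeTag') {
      out += printSource(tag.value.children); // comments and escapes carry their own literals
    }
  }
  return out;
}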

BABLR

The principal drawback to this approach to the data structure is that it denormalizes information. What is the source of truth? Is it the tags and attributes inserted by the parser, or is it the text in them?

The answer is: the text in them. But to keep everything in sync we'll need to make the in-memory structure immutable, and to make sure that a document follows the rules of a particular language we'll need a good schema validator, which will more or less just be a specially written parser. BABLR is the parser system I have created for this purpose. It is a fully functioning system that can already be used to experiment with how parsers are created and the CSTML output they produce.

A CSTML document intended to be validated with BABLR will have a doctype like <!0:cstml bablr-language="https://url.for/parser.js">. Other schema validator systems can eventually be created for CSTML documents and can be invoked similarly.

BABLR defines two main operations on CSTML documents: parse and traverse. You can think of these as two different levels of validation. Traversal validates that the document is a valid tree in the parse forest. Parsing validates that you get that parse tree from the embedded text. The ability to do both operations creates the possibility of working with trees that are valid in the parse forest but whose syntax would otherwise be grammatically shadowed by some higher-precedence parse rule.

Once support for decorators lands, no build step will be required to define BABLR grammars. This will make them exceptionally easy to write and debug quickly. Here is an example grammar for JSON: language-en-json/lib/grammar.macro.js at ecf86072879f3220d4f428ecc58f70f32d58ad51 · bablr-lang/language-en-json · GitHub

Extensible parsing

BABLR parsers are naturally extensible due to the inversion of control flow. Creating a higher-order parser function always allows you to intercept the calls made by the wrapped parser, performing language extensions on the fly as needed. Alternatively you can use class extension.
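As a rough illustration of the shape such an extension might take (names are illustrative, not the actual BABLR grammar API):

const withMyExtension = (BaseGrammar) =>
  class extends BaseGrammar {
    // intercept a production: handle the extended syntax here, or defer to the base language
    Expression(...args) {
      return super.Expression(...args);
    }
  };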