Lexical grammar: Why do context-sensitive tokens require multiple goal symbols?

According to Section 12: Lexical Grammar:

There are several situations where the identification of lexical input elements is sensitive to the syntactic grammar context that is consuming the input elements. This requires multiple goal symbols for the lexical grammar.

I understand that the meaning of / depends on the context in which it appears (division, regex, or comment) and that the lexer cannot distinguish between the division and regex contexts on its own (looking back at the previous token won't suffice when that token is itself ambiguous, e.g. a / that follows } or )).
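
For instance (my own contrived examples, not from the spec), the token right before the / can be identical while the / means different things:

if (cond) /re/.exec(s);      // ) closes the if condition, so / begins a regex literal
x = (a + b) / 2;             // ) closes a parenthesized expression, so / is division
if (cond) { } /re/.exec(s);  // } closes a block, so / begins a regex literal
y = { a: 1 } / 2;            // } closes an object literal, so / is division (NaN, but syntactically fine)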

But I'm not sure I understand why this necessitates two separate goal symbols.

Why can't we put both under the same goal symbol, e.g.

InputElement::
     [...]
     DivPunctuator
     RegularExpressionLiteral
     [...]

...and let the parser tell the lexer which production to use (DivPunctuator vs RegExLiteral), rather than which goal symbol to use (InputElementDiv vs InputElementRegEx)?

Well, we could, but I don't think we should.
(1) That would make the grammar for InputElement ambiguous.

(2) If you want to treat the lexical parser as a black box with a knob that controls its behavior, then it's simpler (more 'natural') to say that the knob selects the goal symbol to use than to say that the knob reaches into the lexical grammar and temporarily modifies one of its productions. (See the sketch after this list.)

(3) If the spec used such production-modification for this purpose, then the reader would wonder if it's used elsewhere, and their certainty about the grammar would decrease.
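
To illustrate (2), here's a hand-wavy sketch of what that knob looks like with multiple goal symbols. None of these names or details come from the spec or any engine; it's just the shape of the interface:

// Toy sketch: the syntactic parser asks for the next input element and passes
// the lexical goal symbol that its own context calls for.
function nextInputElement(source, pos, goal) {
  const ch = source[pos];
  if (ch === "/") {
    if (goal === "InputElementDiv") {
      // Under the Div goal, / (or /=) is a DivPunctuator.
      const text = source[pos + 1] === "=" ? "/=" : "/";
      return { type: "DivPunctuator", value: text, end: pos + text.length };
    }
    // Under the RegExp goal, / starts a RegularExpressionLiteral:
    // scan to the closing unescaped / (character classes omitted for brevity).
    let i = pos + 1;
    while (i < source.length && source[i] !== "/") {
      if (source[i] === "\\") i++;   // skip the escaped character
      i++;
    }
    i++;                             // consume the closing /
    while (i < source.length && /[a-z]/i.test(source[i])) i++;   // flags
    return { type: "RegularExpressionLiteral", value: source.slice(pos, i), end: i };
  }
  // ...all the other (context-insensitive) input elements would go here.
}

nextInputElement("/foo/g", 0, "InputElementRegExp");  // RegularExpressionLiteral "/foo/g"
nextInputElement("/ 2", 0, "InputElementDiv");        // DivPunctuator "/"

The point being that the knob is just an argument; the lexical grammar itself never changes.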

On second thought, it wouldn't make the grammar for InputElement ambiguous, so scratch reason 1.

Thank you @jmdyck.

Just to clarify, my post is not a proposal.

Yes. Is the ambiguity you mentioned in (1) because of the maximal munch rule?

I also wondered if InputElement would mess up ASI. For example, in

a = b
/hi/g;

maximal munch would cause /hi/g to be misparsed as a regex literal and the parser would end up ASI'ing the input into a = b; /hi/g;. But this wouldn't happen if the parser signals the lexer to treat the leading / as a DivPunctuator.
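
To spell out the two readings I mean (the first is what actually happens, because after b the syntactic grammar permits a division, so the Div goal is used and no semicolon is inserted):

// Reading 1: / is a DivPunctuator, giving one statement with no ASI:
a = b / hi / g;

// Reading 2: /hi/g is a regex literal, which would force ASI to split it:
a = b;
/hi/g;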

So the reason for multiple goal symbols comes down to points 2 and 3?

No. In context-free grammars, ambiguity means that there's more than one way to derive some text from some symbol. But your suggested production doesn't cause that. There's only one way for InputElement to derive / (via DivPunctuator), and there's only one way for it to derive /foo/ (via RegularExpressionLiteral). That much is true regardless of the maximal munch rule.
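
For contrast, a textbook-style example of an ambiguous production (a toy grammar, nothing to do with ECMAScript):

Expr ::
     Expr + Expr
     Expr * Expr
     a

Here "a + a * a" can be derived as either (a + a) * a or a + (a * a), i.e. it has two parse trees, so the grammar is ambiguous. Nothing like that happens when DivPunctuator and RegularExpressionLiteral sit under the same InputElement symbol.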

ASI happens in the syntactic parser, so if you change the lexical parser in a way that doesn't affect the input elements that it delivers, then ASI would be unaffected.

Here's another reason that might be closer to whatever the original reason was:
When you're generating a parser, there are fairly simple ways to handle a grammar with multiple goal symbols, but I don't think that's true for a grammar where an external force can choose to 'disable' part of a production. (I think handling it would amount to re-expressing it as a grammar with multiple goal symbols anyway.)

Oh true, thank you for clearing that up. Maximal munch just resolves ambiguity in the lexeme formation phase, whereas the lexical grammar itself is unambiguous because, as you say, each string that can be derived from it has only one parse tree.

I think this may explain it. This is all very new to me so please correct me if I'm mistaken, but one way to deal with context-sensitive tokenisation is to split the lexer up into multiple "lexical states" (see JavaCC FAQ).

This article by Andy Chu talks about how lexical states are especially useful for tokenising instances of language constructs that accept strings from a recursive sub-language, e.g. regex or template literals in the context of ECMAScript. So the multiple goals are like sub-grammars for the regex and template literal sub-languages. Then in an implementation, each sub-grammar can be handled by a lexical state/sub-lexer.
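
Roughly, the lexical-state idea is a little state machine wrapped around the tokenisers. Here's a toy sketch (nothing in it comes from the spec, JavaCC, or any real engine) that switches between a default state and a template state:

// Toy lexer with two "lexical states": DEFAULT and TEMPLATE.
const DEFAULT = "DEFAULT", TEMPLATE = "TEMPLATE";

function* lex(source) {
  let state = DEFAULT, pos = 0;
  while (pos < source.length) {
    const ch = source[pos];
    if (state === DEFAULT) {
      if (ch === "`") { yield { type: "TemplateStart" }; state = TEMPLATE; pos++; }
      else if (/\s/.test(ch)) pos++;                              // skip whitespace
      else { yield { type: "Punctuator", value: ch }; pos++; }    // everything else, very crudely
    } else {
      // TEMPLATE state: slurp raw characters until ` or ${
      let chars = "";
      while (pos < source.length && source[pos] !== "`" &&
             !(source[pos] === "$" && source[pos + 1] === "{")) {
        chars += source[pos++];
      }
      if (chars) yield { type: "TemplateChars", value: chars };
      if (source[pos] === "`") { yield { type: "TemplateEnd" }; state = DEFAULT; pos++; }
      else if (source[pos] === "$") { yield { type: "SubstitutionStart" }; state = DEFAULT; pos += 2; }
    }
  }
}

console.log([...lex("`hi ${x}`")].map(t => t.type).join(" "));

Note that this toy never gets back into the TEMPLATE state after the } that closes a substitution; handling that correctly is exactly the kind of context-sensitivity the spec's InputElementTemplateTail goal exists for.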

Yes, but I think that's more common as an implementation technique than as a specification technique. I interpreted your original question as asking about the spec, but if you're asking about implementation, you can do anything you like as long as the observable results conform to the spec. So an implementation's parser doesn't have to follow the spec's division between syntactic and lexical grammars (and my impression is that most don't).

Re template literals, you should look at the TemplateLiteral sub-grammar. It's an interesting combination of syntactic and lexical productions, and the lexical productions aren't really recursive. (I mean, technically TemplateCharacters is, but only to achieve repetition.)

Similarly, the sub-grammar for RegularExpressionLiteral only uses recursion for repetition. But the sub-grammar for Pattern is quite recursive, so that's maybe what you're thinking of.
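
To make "only for repetition" concrete, the self-references in those two sub-grammars look roughly like this (paraphrasing the spec's productions):

TemplateCharacters ::
     TemplateCharacter TemplateCharacters_opt

RegularExpressionChars ::
     [empty]
     RegularExpressionChars RegularExpressionChar

Each one just tacks another character onto a list. Compare that with Pattern, where a Disjunction can contain a group that contains another whole Disjunction, which is recursion in the interesting sense.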

Hm, this is combining a couple of ideas. The Pattern nonterminal is certainly a separate lexical goal symbol. In fact, the spec refers to it (and the productions under it) as its own separate grammar. Similarly for the grammar under StringNumericLiteral. However, those are "multiple goals" in a fairly different way from how the various InputElement symbols are multiple goals.

Yes, my original question was about the spec. Prior to this, my only points of reference had been the textbook grammars like the ones in the Dragon Book, which only have a single goal ("start") symbol. I also couldn't find a real-world lexical specification that uses multiple goal symbols. So I wasn't sure if the ones in the ECMAScript spec were a formal requirement of the grammar or just helpers to make the different contexts more explicit. I think it's just a case of the latter.

True, I think by "recursive" I was referring to the stuff inside the braces, ${expr}: expr can be any valid JS expression, which is a recursive language. In the case of regex literals /pattern/, pattern is a string derived from a recursive language.

Sorry, I was just trying to reconcile the spec with lexical states. But I was also wary of that explanation because as you say the spec doesn't care about how it's implemented. I think the multiple goals are just there to highlight the different contexts, not an instruction for the implementer to use lexical states specifically.

Yup, but all that recursion is in the syntactic grammar.

Yes, I was just drawing a distinction that's important if you look into that part of the spec: RegularExpressionBody is distinct from Pattern, and does not derive it. (Roughly speaking, RegularExpressionBody is a cover grammar for Pattern, although the spec doesn't describe it that way.)

Yeah, that's probably a fair understanding of the situation.

Oh right, RegularExpressionBody and RegularExpressionFlags are just used to find the end of the regex literal as mentioned here. Everything inside /.../ is slurped as a flat sequence of RegularExpressionChars (with just enough structure to skip \ escapes and [...] classes) and then reparsed/refined according to Pattern, which does resemble a cover grammar.
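
For my own reference, the relevant lexical productions look roughly like this (paraphrasing the spec):

RegularExpressionLiteral ::
     / RegularExpressionBody / RegularExpressionFlags

RegularExpressionBody ::
     RegularExpressionFirstChar RegularExpressionChars

RegularExpressionChar ::
     RegularExpressionNonTerminator but not one of \ or / or [
     RegularExpressionBackslashSequence
     RegularExpressionClass

So the lexer only needs enough structure to know which / actually terminates the literal (an escape or a character class can hide a /); the real structure comes later, when the body is reparsed as a Pattern.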

Great, thank you for helping me think this through :) I was stuck on it for some time.