Towards Homoiconicity for `{C.like;}` languages

Error tolerant parsing techniques for C-like languages to enable better tool integration, partial analysis, language extensibility, and meta-programming.

Keywords: combinators, homoiconicity, IDE, meta-programming, operator-precedence, parsing, quasiquotation

Abstract

Most developers are familiar with languages whorse surface syntax is "C-like": C, C++, C#, Go, Java, JavaScript, Perl, Rust, Scala, Swift, and TypeScript. But the complexity of parsing C-like languages makes it hard to achieve what Lisps take for granted.

Lisps reuse a few syntactic constructs that transparently relate to commonly used datatypes. This makes it easy for programmer tools to deal with program fragments, for designers to extend the language with new special forms or macros, and for programs to manipulate programs as data.

Lispers often group these benefits under the rubric of “homoiconicity”. That term has some problems, so this document adopts goals attributed to homoiconicity but does not claims that any resulting language is homoiconic. These goals include:

The coarse-grained structure of a program can be inferred without a grammar rule per special form.
This lets tools easily analyzing a fragment of a larger program at the lowest appropriate level of abstraction. For example, test coverage tools and dynamic call graph generators instrument programs but can often operate on a token stream.
Common syntactic errors can be isolated to part of a parse tree. This is important when a developer is editing code.
- IDEs should be able to derive some type information to suggest appropriate completions, even when the expression containing the cursor is malformed.
- If an input to a REPL fails to parse, the REPL should be able to localize the error to a fragment, and re-present the input with the cursor at a convenient place.
It also lets tools provide degraded service when given a program that uses a new special form.
Program generating programs can reuse the small number of rules with extensions.
Non-onerous syntactic constraints allow common flow-insensitive analyses like matching uses of symbols with declarations to be performed by simple tools without converting a parse tree to an AST or taking into account other source files.

Unfortunately large grammars make it hard for tools simpler than the compiler to accurately deal with program fragments, to degrade gracefully, or to deal with code as data. This limits a diverse tool ecosystems to languages with a large community that can afford the engineering effort.

This document introduces a way for niche languages to use familar C-like syntax and hopefully still support a diverse tool ecosystem. Specifically it

Discusses developers' intuitions about C-like languages that help them reason about partial programs, broken programs, or brokens with unfamiliar special forms,
Formalizes those intuitions via a cover grammar for C-like statement constructs and catalogues exceptions to the rule,
Defines an operator-precedence parser for C-like languages that reuses a small number of rules,
Catalogues sources of non-context-freeness in some widely-used C-like languages, and how to avoid those in new languages,
Defines a context-free lexer which recognizes complicated lexical constructs from JavaScript and Perl but which, unlike those languages, does not require scannerless parsing,
Evaluates the same on partial & broken inputs to see how well it localizes damage to the parse tree,
Catalogues the ways that a parse tree derived from a partial or broken input may fail generic well-formedness checks,
Shows how parser combinators can be used to derive ASTs from parse trees, including partial or broken parse trees, for a toy language,
Extends the toy language with quasiquotation syntax,
Suggests ways language implementors can craft a grammar to take full advantage of this approach.

Covering Grammars

Grammars with many, fine-grained rules can do a good job preventing nonsensical inputs from confusing later compiler stages, but typically produce either a parse tree, or error messages, not both.

Statement	::==	for	( Stuff ) { Stuff }
	/	if	( Stuff ) { Stuff } …
	/	while	( Stuff ) { Stuff }
	/	…

A large grammar like this could adapt to account for common developer errors and to deal with partial inputs, but this takes engineering effort per rule, and makes the grammar even larger.

if ( x ) { fn(; }

   ┗━┛ ┃   ╹ ┃
         ┗━━━┛

A parser that degrades gracefully on inputs like the above, should recognize the kind of rules-of-thumb that programmers use to reason about broken inputs:

like brackets pair,
curlies group statements,
semicolons separate statements,
commas separate expressions

The grammar above has a lot of repetition. In languages with C-like grammars, most of the flow control special forms, and some other special forms (like JavaScript's function(...) {...}) follow the convention:

Statement

::==

keyword [ ( Stuff ) ] { Stuff } …

A cover grammar is a grammar that matches all strings matched by some “covered” grammars so that a later pass can disambiguate between them. For example, consider the JavaScript expression (a = null) => (a = null). In it, (a = null) appears twice, once as a formal parameter list with a default value, and once as an assignment. To avoid reparsing based on whether a => follows the close parenthesis, JavaScript engines often use a cover grammar for FormalParameterList and PrimaryExpression.

We can formalize developers' intuitions about a language by crafting a few rules that cover the myriad rules in the full grammar. This lets a few error recovery strategies tuned to programmers' intuitions carry much weight.

Each of the branches for Statement above looks like a function application followed by a block, but there are a some conventions which distinguish these from normal function calls.

They start with a reserved keyword.
Declarations inside the parentheses, when allowed, are visible within the curly bracket block ({…}).
Declarations in either are not visible outside the whole. (This was not the case in C89 and earlier but was fixed in C99 and languages since have followed suit.)
Execution of statements in {…} may be executed late, multiply, or not at all.

Additionally, certain keywords allow chaining these constructs, e.g. else.

Most C-like flow control constructs can be covered by a single rule:

FlowControl	::==	Keyword ParenPart_opt Tail;
ParenPart	::==	( TokenSoup );
ParenPart_opt	::==	ParentPart / ε;
Tail	::==	{ TokenSoup } MoreFlowControl
	/	;;
MoreFlowControl	::==	InfixKeyword FlowControl
	/	ε;
InfixKeyword	::==	else if / else / catch / finally;

Not all follow this convention but many do, and below we detail how to handle exceptions to the rule.

Stages

This document builds a parser in JavaScript so that it can run in your browser. You can see below that the parser operates in three stages:

Input: Source Text
“f(x + y);”
Lexical Phase: Source Text → Tokens
f ( x + y ) ;
Parse Phase: Tokens → Parse Tree
f ( x + y ) ;
Final Phase: Parse Tree → Abstract Syntax Tree

stmt
expr
call
expr
id f

actuals
expr
binaryop
expr
id x

+
expr
id y

Before jumping into code, it's worth defining some terms. For the purposes of this document:

Token

A substring of a program fragment with some diagnostic metadata. Tokens may be malformed, e.g. an unclosed quoted string.

Significant token

A token that is not a comment or ignorable whitespace.

Parse Tree

Concrete Syntax Tree (CST)

A tree such that leaves are tokens and a preorder enumeration of leaves yields the significant tokens from which it was parsed.

Abstract Syntax Tree (AST)

A tree such that each node is either a token, or is a collection of sub-trees and a tag that identifies a production in the language's abstract grammar.

An AST need not contain all significant tokens, e.g. an implementation might drop parentheses used merely for grouping, or tokens like “if” which are implicit in the tag.

An inner AST node may have the special error tag in which case its children are a soup of tokens to which later passes should not assign clear semantics.

Lexer

Partitions source text into a stream of tokens.

Parser

A consumer of a stream of tokens that produces a parse tree.

Combinator

A consumer of a stream that builds a structured output based on declarative rules describing patterns involving sequences of stream elements.

Precedence

Operator precedence parsers reuse a few simple rules. It's straightforward to craft such a parser to produce a parse tree for every input so they recover well from many kinds of syntax errors. Instead of having a grammar rule for each language construct, an operator precedence parser behaves like a grammar driven parser with just one rule per kind of operator:

Operand	::==	PrefixOperator Operand
	\|	Operand InfixOperator Operand
	\|	Operand TernaryOperator Operand TernaryFollower Operand
	\|	Operand OpenBracket Operand CloseBracket
	\|	Operand PostfixOperator
	\|	SimpleOperand;

Operator precedence parsers use “operators” to figure out how to turn a series of tokens into a tree structure. This lets us infer parentheses in expressions and to identify when a keyword like else or the while in do...while... groups things to its immediate left and right. So PrefixOperator defines which operators can appear before their operands like the ‘-’ in “-x”; InfixOperator defines which can appear in the middle like the ‘+’ in “x+y”.

An operator precedence parser still needs to decide how to group operands: for example, if InfixOperator::==+|×; then naïvely applying the grammar above to “a×b+c” would give two possible inerpretations: (a×b)+c and a×(b+c).

Operator precedence parsers use an order over operators to avoid ambiguity. This order gives precedence to some operators and answers the question “Can this operator contain that operator unparenthesized?” For example, if ‘+’ can contain ‘×’ unparenthesized but not vice versa then valid interpretations of “a×b+c” include (a×b)+c but not a×(b+c).

Here is an operator precedence table similar to those that appear in user-documentation for C, Java (normative), and JavaScript which gives a foundation for our precedence relation.

We need to be able to match up brackets.

Some derived tables simplify later code greatly.

These operators give arithmetic & logical expressions theit usual meanings.

Note: Herein, parse trees are shown with a rectangle around each inner node. Click the expando (▶) to see an indented view of the same. These diagrams are generated programatically by the described parser running as JavaScript in your browser.

Less-than ambiguity

There is a deep, unavoidable ambiguity in C++, Java, and probably Rust. A ‘<’ token may be either infix operator less-than or the start of an angle bracketed group of type parameters. Consider the partial token sequence … f ( a < b , c > d ) … in Java:

public <b, c> int f(a<b, c> d) {
  //   ⇧   ⇧       ⇧   ⇧
  // Angle brackets group type parameters in a method signature.

  return d.f(a < b, c > d);
  //           ⇧     ⇧
  // Using less than, greater than to compute actual parameters.
}

This is a real problem and this document only provides a partial solution to this ambiguity. We take a cue from TypeScript which allows types (and hence type parameters) in a few readily identifiable contexts:

As the right operand of a few infix operators like ‘:’, “as”, “extends”, “instanceof”, and “new”. Since ‘:’ can precede labeled statements this may not be perfect.
As the right operand of ‘=’ when it appears in a “type” declaration.
In deprecated <typecast> syntax which we ignore.

Here's how things parse when our operator precedence function only allows <…> to the right of the operators listed above.

Note: JSXElements like (image = <img src={src}>) do not require parse-time disambiguation. The code herein does not handle JSX, but later sections introduce mechanisms that should extend to that use case. Specifically, lexing explains how to use one token of lookbehind to decide whether ‘/’ starts a division operator or a regular expression literal. A similar approach could decide whether a ‘<’ starts an operator or a JSX block. Also, preparsing talks about a technique to group template literal parts which would help handle JSX.

Dead-end: “Part-of-speech” tagging

This approach was found to be brittle, and hard to build confidence in. It is included to document a dead-end.

It may be possible to resolve ambiguities, like angle bracket ambiguity, would be to provide additional context cues to the parser.

A part-of-speech tagger could run as a part of the pre-parse processing pass over tokens to add context usable by the operator precedence function. Parts of speech might include:

An expression is entered on ‘(’, ‘[’, ‘=’ (reset by ‘,’), ‘;’ in parens
A statement is entered on statement keyword or ‘;’ not in parens.
A type is entered on ‘as’ (TypeScript infix cast operator) or ‘:’ in parens or following declaration keyword, but not following ‘?’

This example shows two distinct uses of ‘<’: once to bracket type parameters, and later as a comparison operator. It also shows three distinct uses of <{>: first to start a statement block, then o bracket a record type, and finally to construct a record.

An operator precedence function that used part-of-speech cues could assign different precedences to these different uses.

Here's a tagger that seems to work well for TypeScript, but is brittle.

This tagger is a fairly complicated push-down automaton, but languages for which it's possible to specify may be able to exploit that fact to simplify their precedence functions.

Flow control joiners

Besides that, expressions in C-like languages are a good fit for precedence tables like this. It turns out that statements can be shoe-horned in with a wrinkle though. As discussed, most flow control constructs in C-like languages have the form keyword (expression) statement.

There are a few keywords that precede blocks but that always follow statements. And ‘;’ separates statements.

Statement parsing

With this, we're ready to parse some statements.

do…while

Except that in do{...}while(...); the while appears like an infix operator, but only directly inside a do which acts like a prefix operator.

Flavours of colon

Also, ‘:’ is odd. It seems to have widely distinct precedences:

Paired with a ‘?’ it acts as part of a ternary operator.
After case or default it acts as a postfix operator and terminates any expression.
It separates a label from a statement.
Otherwise it's an infix operator that separates keys from values or typed items from types.

Infix curly brackets

There's one more wrinkle. In keyword (expr) { stmt; }, the {}s act as an infix operator like the ()s. This is not what we want for the second pair of {}s in while (b) {x} {y}. In C-like languages, {y} is a block that executes separately from, and after the while construct. So we will disallow chaining of {} as an infix operator.

Expression preceding keywords

Defining prefix operators for keywords like return that precede bare expressions ensures we won't get a different parse tree depending on whether parentheses are present, and return-1 won't parse to infix ‘-’ as it should in x-1.

An operator precedence function

Knowing all this, we're ready to craft our answer to “Can this operator contain that one unparenthesized?”:

This function takes >50 lines of code to handle legacy constructs and corner cases from a slew of languages to demonstrate that handling these is feasible. An equivalent function for a language that did not need to support all these corner cases might be much simpler. For example, using […] to group type parameters, a la Scala, would obviate half of this code.

This function depends on some bookkeeping functions to juggle brackets.

Separators

This works pretty well but there are a few quirks worth mentionining.

In f(actual₀, actual₁, actual₂) there are nested infix commas (,) which will later require a bit of work in the grammar (see commaop:).

But semicolons (‘;’) are postfix operators so there's no nesting in blocks, and in for (init;cond;incr).

In for loops, parts are optional. “for (;;) {}” has a postfix semicolon operator applied to a leaf ‘;’ token which requires an extra branch in the for-loop rule.

Multiple adjacent NOPs behave the same.

Parsing

Before defining the parser, we abstract out the set of operators. (This is just a wrapper for our operators array, but the abstraction will come in handy later when talking about user-defined operators a la Scala.)

We can use the operator definitions and precedence relation to define a parser.

Tests use these unless otherwise specified.

Lexing

We've defined a parser that operates on a stream of tokens, but have not yet defined the source of those tokens. This lexer is not interesting except to demonstrate strategies for dealing with irregularities and recursive lexical constructs without entangling the parser & lexer.

Note: Readers may find it confusing when I use the term “context-free” (CF) to refer specifically to the lexical grammar instead of the language as a whole. Some C-like languages have simple lexical grammars but are not CF. E.g., in C++ the parser needs context about whether “x” is a type to decide whether x* y; is a declaration of a pointer variable or an application of operator *. This kind of complication does not mean that an input cannot be split into tokens with a CF lexical grammar, even though you may need to do some tricks when “>>” follows type parameters.) See also Trevor Jim.

C-like languages' lexical grammars need not be regular, meaning there need not exist a single regular expression that partitions the input into tokens. JavaScript and Perl give some ideas of where common non-regularities arise; neither have either regular or context-free lexical grammars; both require scannerless parsing.

It is possible to restructure the parser above to be scannerless; where the parser above asks “is the current token an infix operator?”, a scannerless parser would ask “which, if any, is the longest infix operator that is a prefix of the remaining input and can nest here?”.

We do not do that though, because non-CF lexical grammars defeat our goal of allowing simple analyzers to correctly segment fragments of code that start and end on token boundaries.

So CF grammars meet our goals, but regular lexical grammars are insufficient for commonly used, C-like languages. Complications arise when deciding whether ‘/’ starts a division operator (like / or /=) or a /regular-expression-literal/.

Below is a lexer that recognizes /regular-expression-literals/ and JavaScript-esque `template ${ literals }`. It is CF but non-regular in two ways:

When parsing a token starting with ‘/’
1. If the token starts with // or /* then the current token is a comment token.
2. Let last be the last significant (non-comment or whitespace) token.
3. If last is a prefix operator, an infix operator, or an open bracket, then parse the current token as a regular expression.
4. Else parse the current token as a division operator.
This differs from JavaScript around expressions like x = ++/regex/i.index and x++ /num/i because some operators (++, --, +, -) are both infix/prefix and postfix operators. (The author believes this is not a source of confusion in practice, but has not measured).
The lexer keeps a stack of bits so it can tell whether a ‘}’ is a regular source token or returns lexing to content of a JavaScript-esque template literal.

Our lexer works as expected on some sample inputs. These tests show the series of tokens produced, with comment and whitespace tokens filtered out.

Token preparsing

Unfortunately, our parser can't deal with complex constructs like `staticText ${ dynamicValue } etc.` that span multiple tokens.

We could build knowledge of this into the parser, but that would defeat the goal of having few, simple rules, and moving most of the understanding into the language-specific parse tree → AST transform.

Instead, we transform the token stream between the lexer and the parser by adding synthetic tokens. This keeps knowledge about complex lexical constructs in the pre-parse phases.

Before		`staticText ${		`dynamicValue`		} etc.`
After	(	`staticText ${	(	`dynamicValue`	)	} etc.`	)

The added parentheses have the effect of ensuring that the whole string is grouped as one construct, and that expressions inside holes (${...}) are self-contained. Open parenthesis is overloaded in most C-like languages to mean (grouping, function call, type cast) so we take care that the synthesized parentheses around holes are not interpreted as INFIX operations.

We can see that this parses sensibly, even when there's an errant ‘;’ in an expression that fills a hole, and that the parentheses around holes do not participate in function application.

Automatic Semicolon Insertion (ASI)

Some C-like languages automatically insert semicolons.

let x = 1    /* ; */
let y = 2    // Not here
        + x  /* ; */

This may make writing code easier, requiring fewer keystrokes and getting rid of common error messages about missing semicolons. ASI may be a net positive for authors, but in all languages that do it is a source of uncommon, but subtle bugs. Whether it is better for readers is unclear.

This section discusses different varieties of semicolon insertion and how they affect operator precedence parsing. Later, suggestions provides concrete advice. The parser introduced above does not do semicolon insertion.

Syntactic ASI

ASI can be deeply entangled with parsing:

JavaScript's ASI relies on exhaustive parsing to know when an insertion would not prevent expanding the current statement:

When, …, a token (called the offending token) is encountered that is not allowed by any production of the grammar, then a semicolon is automatically inserted before the offending token if …

This operator precedence scheme allows two non-operator tokens to appear adjacent: int x. An attempt to build ASI into an operator precedence parser would need to design the language to avoid adjacent words or explicitly handle this via some kind of adjacency operator, so that the parser could elect to insert a semicolon when the current token would require popping the stack past semicolon's precedence level, and a postfix semicolon could accept the lowest precedence stack element that has precedence greater than semicolon.

Lexical ASI

Alternatively, ASI can be based on lexical analysis before parsing starts, which would allow it to be done during token preparsing which would allow ASI without entangling the parser:

Go:

a semicolon is automatically inserted into the token stream immediately after a line's final token if that token is …
Scala:

A line ending is treated as a semicolon unless one of the following conditions is true: [predicates about the top of the bracket stack, or next/last token]

Effect of ASI on Tooling

Simple operations like moving tokens from one line to another to split overly long lines becomes more complicated when you need to ensure that the compiler won't differently insert semicolons. For example, in JavaScript ASI can cause two similar looking programs to be interpreted quite differently (see also Ecma262 ¶ 11.9.2):

for (let i = 0; i < 10; i++) {
  f(i);
}
//        ↓
for (let i = 0; i < 10; i
     ++) {
  f(i);
}
// SyntaxError.  Per JavaScript's ASI rules `++` cannot be separated
// from its operand by a line break.


longName = i++
f()
//        ↓
longName = i
++
f()
// Loud semantic change.  Second equivalent to `longName=i;++(f());`


function f(longName) {
  return longName()
}
//        ↓
function f(longName) {
  return
    longName()
}
// Silent semantic change.  Call never happens.  Returns undefined

With ASI, a developer using a generic editor function like Emacs's fill-paragraph (M-q) without a JavaScript-specific mode hook runs a risk of introducing subtle semantic changes to code. This is subideal.

Tool-based ASI

Another option (unimplemented in any language to the best of the authors' knowledge) is to do ASI but not in a way that's invisible to authors and readers.

Much of the debate about ASI can be boiled down to two claims which seem mutually contradictory but are actually reconcilable:

The language should incorporate ASI to make it easier for authors (especially novices) to produce programs, and to reduce cognitive load for casual readers.
The language's compiler/interpreter should not perform ASI since it is a source of subtle bugs and careful reading of code is difficult when readers have to predict where semicolons are inserted.

Language designers could specify how ASI should be done, commit to taking ASI into account when considering backwards compatibility, but leave the actual insertion to developer tools.

IDE on-save hooks could perform ASI, thus relieving the author of the burden of typing semicolons, but making them explicit in the source text.
Code syntax highlighters could de-emphasize (via lower-contrast coloring) semicolons that would be inserted by such hooks were they not present, reducing the cognitive burden on readers concerned with intent over semantics.
Additionally, the compiler/interpreter could do ASI but then reqort a fatal error if any semicolons were actually inserted. Developers who uses inserting editors should never encounter one of these errors, but incorrigible Notepad users would have to produce output that would not change if loaded into an inserting editor and immediately saved.

Custom Expression Operators

Scala shows one way of providing for user-defined infix expression operators.

Any method with a single parameter can be used as an infix operator.
…
When an expression uses multiple operators, the operators are evaluated based on the priority of the first character: …

It's unsurprising that operation precedence parsers can handle user defined operators with different precedences.

Now, user code can use an operator not defined above like ‘<=>’ (trinary comparison in Perl) with precedence similar to ‘<’:

This requires some changed to lexing since our stock lexer treats ‘<=>’. The testcase above uses a lexer that simply splits on word breaks and spaces.

One downside is that custom operators require developers to use more whitespace even in code that does not use custom operators. This limitation arises because there is no closed set of punctuation strings that lets us split, for example “x*-y” into “x * -y” because “*-” could be a user defined operator. C already has this problem to a small degree because “x--y” is not the same as “x - -y”, but developers typically write “x+y” instead of subtracting a negated value, so the fact that ‘++’ and ‘--’ are unsplittable by context-free lexers does not, in practice, confuse C authors.

Well-formedness

These tests show some common ways a parse tree may fail some generic well-formedness checks, and in the process catalogue some common kinds of problematic inputs. Red wiggly underlines show tokens that participate in a malformed subtree. Click the expando to see the error messages.

A partial token like
An illegal token.
A missing token.
A missing construct.
Extra tokens
Things that aren't recognized as errors before the AST pass but which could be

Deriving ASTs via Combinators

Many tools properly operate on parse trees, but some, chiefly compilers and interpreters, do better with representations that lack extraneous details, and are more directly related to language specification abstractions or elements of the output. “Compilers: Principles, Techniques, and Tools” by Aho, Sethi, Ullman contrasts parse trees with a more abstract tree representation:

Abstract syntax trees, or simply syntax trees, differ from parse trees because superficial distinctions of form, unimportant for translation, do not appear in syntax trees.

Parser combinators are a fine way to turn a source text into a parse tree, but they can also be used to derive an abstract syntax tree (AST) from a parse tree.

These tests show ASTs. Each inner AST node has a green border, and the node type in small, green print overlapping the top-left of its border. Leaf nodes are black text, have no type, and always correspond to tokens in the input.

And this doesn't prevent containing errors:

Combinators operate on a sequence of tokens, so, first, we need to transform our parse tree into a sequence.

Flattening a parse tree
Input	“		let		x		=		1		;	”
Tokens			"let"		"x"		"="		"1"		";"
Parse Tree	[	[	"let"	,	"x"	],	"="	, [	"1"	],	";"	]
Flat Tree	⇨	⇨	"let"		"x"	⇦	"="	⇨	"1"	⇦	";"	⇦

As you can see, a flattened parse tree:

is a series of tokens and pseudo-tokens.
uses a single kind of bracket (⇨…⇦ above).
is always balanced even for malformed parse trees.
is consumable by parser combinators.

The flatten function takes a parse tree and produces a sequence with indent and dedent psuedo-tokens around the content of each inner node.

Because the flattened tree contains explicit ⇨ and ⇦ it can be consumed by simple combinators that make no special provisions for either Left- or Right- recursion. Readers familiar with widely used combinator libraries will probably find the implementation uninteresting; you can skip them without missing anything.

There's nothing novel about the combinator definitions, so if you already know how combinator libraries tend to work, you can gloss over them without missing much. The main thing to note is that they don't use side-effects to build a parse tree. Combinators read an input buffer containing a flattened parse tree and build a flattened AST by appending to an output buffer which is then “lifted” into a proper tree in a reverse of the flatten function used above.

Definitions of combinators

Here is a grammar for a toy language that looks like TypeScript. This grammar has gaps (no switch); it is too long already.

Unless otherwise stated, examples hereafter use this transform from parse trees to ASTs:

The rest of the end-to-end tests are a bit duplicative.

Let statement tests

For loop tests

Function declaration tests

More expression tests

More statement tests

Quantifying Parser Resilience

To answer “How good is this parser at containing damage?” we collect statistics on how many substrings can be removed from a benchmark code sample without affecting the parsing of structures on either side.

outerBefore();
{
  innerBefore();
  f(a/2*'3');
  innerAfter();
}
outerAfter();

This benchmark code fragment was crafted to include a variety of token types, and so that some mutations would introduce new token types like comments and regular expression literals.

For each possible substring of the underlined portion, we parse a program having removed that substring from the benchmark, and test several properties of the resulting AST:

has outer before: Does the AST contain a call to outerBefore?
has inner before: Does the AST contain a call to innerBefore?
has inner after: Does the AST contain a call to innerAfter?
has outer after: Does the AST contain a call to outerAfter?
has error: Does the AST contain an error node?

has outer before	has inner before	has inner after	has outer after	has error

The parser reliably parses statements before the mutated statement and in a different block. The parser fairly reliably parses statements after the end of the block, and in the same block but before, but is not perfect; removing the substring “2” produces f(a/*'3'); and the block comment consumes everything after it. The parser has mixed success parsing statements that occur after the mutated statement in the same block.

Code as data

The parsing scheme outlined above provides for a limited kind of homoiconicity; tools less complex than the compiler can consume fragments of code, and those rules would extend to new special forms.

Homoiconicity also makes it easier for programs to operate on or produce program fragments. Quoting from “Quasiquotation in Lisp” (Bawden 1999):

Quasiquotation is the technology commonly used in Lisp to write program-generating programs.

…

The backquote character (`) introduces a quasiquotation … Inside the quasiquotation, the comma character (,) marks expressions whose values are to be substituted into the result.

…

S-expressions were at the core of McCarthy's original version of Lisp. The ability to manipulate programs as data has always been an important part of what Lisp is all about. But without quasiquotation, actually working with S-expressions can be painful.

So quasiquotation requires two affordances:

Syntax to “quote” part of a parse tree.
Syntax to embed an unquoted expression inside a quoted parse tree.

JavaScript-esque template literals provide this, but run afoul of:

The string substitution that underlies [fprintf] has no understanding of the syntactic structure of the programming language being generated.

Javascript's tagged template literals allow that, but require implementing a parser for code with holes and lose source metadata. This section shows how this operator precedence scheme can meet the goals of quasi-quotation within a C-like language.

… goals for a successful implementation of quasiquotation:

Quasiquotation should enable programmers to write down what they want the output to look like, modified only slightly in order to parameterize it.
The parameter expressions should appear inside the template, in the positions where their values will be inserted.
The underlying data structures manipulated by quasiquotation should be rich enough to represent recursively defined structures such as expressions.

Defining “\{” and “\(” as bracket operators, enables quoting statements and expressions. Defining “${” as an expression operator allows embedding unquoted expressions inside quoted code. Then we add some extra productions to the toy grammar, and augment a few existing ones.

With that “\{…}” embeds a parse sub-tree as data. qtree and other bluish AST nodes recreate the parse tree structure of the quoted portions.

If qinner behaves like array and qleaf like literal then that quasiquotation would produce a tree like data structure. The qhole breaks back out into unquoted AST nodes that could contribute to the larger data structure.

Since “\{…}” uses the cover grammar, its consumers can recognize forms not allowed by the parse-tree → AST phase. In this sense, it is similar to Common Lisp's read which recognizes some characters (‘[’, ‘]’, ‘{’, ‘}’, ‘?’, and ‘!’) which are not used by Common Lisp syntax but which are explicitly reserved for user extensions.

List structure is not quite as stark a representation [of code] as character strings, but it is still pretty low-level. Perhaps we would be happier if, instead of manipulating lists, our quasiquotation technology manipulated objects from a set of abstract data types that were designed specifically for each of the various different constructs in our language (variables, expressions, definitions, cond-clases, etc.). After abandoning character strings as too low-level it seems very natural to keep moving towards even highler-level representations.

Separately, “\(…)” quotes an AST with holes. These two constructs let program generators deal with program fragment templates at two different levels of abstraction: parse trees and ASTs.

Risk: Lock-in

The second kind of quasiquotation exposes details of language specification abstracts to user code. Some language designers have expressed concern that this may make it harder to evolve the language. On a proposal for a binary encoding for EcmaScript AST:

WH: compatibility means existing nodes strips compiled continue to work as we upgrade the language. But as ECMAScript evolves the same text source code may compile to different nodes even if they don't use new constructs. … The Hyrum's Law challenge is that folks may come to rely on source text compiling to a specific AST.

The parse tree form has fewer rules, but there is still a non-zero lock-in risk:

WH: In this committee it came up discussing changing associativity of || from left-to-right to right-to-left in order to properly support one of the variants of the ?? proposal. It's invisible from within ECMAScript but would change which AST gets generated. There was opposition related to the effect this would have on Babel's ASTs.

In Lisp, the grammar for S-expressions is very simple and stable. Without experience maintaining a language whose initial parsing is based on cover grammars & precedence table, it's hard to say how stable these are over time, or what strategies are effective in working around compatibility problems due to Hyrum's law.

Limitations & CounterExamples

This parser scheme produces a parse tree that groups tokens for some C-like languages into a tree that is unambiguous and which can be processed into an AST.

It deals poorly with some idioms and grammatical constructions.

Angle bracket ambiguity

As previously discussed, there is a deep, unavoidable ambiguity between ‘<’ as a comparison operator and as a bracket around type parameters.

Unless there is a closed set of operators that precisely distinguish type contexts from exception context, this is incorrigible. template could be a prefix operator, but operator precedence parsing cannot take into account questions like “Is vector a type name?”

Flow control without brackets

This is subideal because the low precedence “--” operator captures the apparent function call if (cond) since it can function as both a prefix and postfix operator. This can be partially fixed by defining if and friends as low-priority prefix operators, as is already done for do.

The if node is now self contained, but without some invisible adjacency operator auto-inserted between the close parenthesis and -- the parse tree is still odd. Worse, the -- acts as a postfix operator on if so this alternate nesting scheme would require extra rules to prevent postfix operators applying to if.

Worse, this is no longer homoiconic. New operators that want to syntactically mimic the structure of if need to have their own entry in the precedence table.

Else Else If

The cover grammar allows else if to follow else which might introduce some shift-reduce style problems. This may be corrigible by tweaking canNest with respect to a prefix if and infix else or by recognizing else if as a separate, two token infix operator in preparseTokens as for Python-esque not in and is not.

This is not a problem when the language requires brackets around if bodies as the toy language does.

Flow control without parentheses

Go makes parentheses optional around flow control constructs; it uses for init;cond;incr { } instead of for (init;cond;incr) { }.

This is subideal because the low priority && operator captures the higher priority infix {} operator. Again, defining infix curly brackets as low-priority can probably address this but the author has not proven this. (TODO: try this. It shouldn't break do{}while)

Scoping

It would be nice if tools could make high-quality-ish conclusions about chunks of a program. One of the most important things is to match declarations with uses, which is often a prerequisite to identifying free variables.

The author has been unable to find a generic algorithm that identifies declarations.

It is generally the case that in keyword (...) {...} any declarations immediately inside the parentheses are in scope for the code in the curly brackets.

This is not sufficient.

To recognize a declaration in C, C++, and Java, you need to have a list of symbols that are type names. For example, in C++, {x * y;} is a declaration of a pointer when x is a type name, or an invocation of the infix ‘*’ operator when it is not.

In Java, inner classes can be inherited, so local analysis is not sufficient to tell whether x.y.z is a reference to a static member of unqualified type x, a reference to a member of a (possibly inherited) field x, or a reference to a static member of a class with the absolute name x.y.

Languages that consistently set declarations apart with a keyword like let, const, val, or var could define those as prefix operators and may avoid this problem.
Code inside a class declaration in languages like Java and C++ inherit masking symbols, so you need to understand the inheritance structure to draw conslusions about variables in scope.

Languages that require this.x to access fields, or that shorten this to .x by defining ‘.’ as a prefix operator can avoid this problem.

Suggestions

If you wish your language to be parseable by schemes like this:

Divide parsing into these stages:
1. Lexing: breaking the input into tokens. (See note on scannerless parsing.)
2. Preparsing: inserting synthetic tokens to group complicated lexical constructs like template literals or JSX/E4X tags.
3. Parsing: deriving a parse tree from the token stream
4. Checking: checking the parse tree for well-formedness
5. Abstracting: converting the parse tree to an AST (abstract-syntax-tree) which is a suitable input to a compiler or interpreter.
Document each so that tools can operate at an appropriate level.
Avoid angle bracket ambiguity.
- Do not use <…> to group type parameters. For example, Scala uses […] .
- or ensure that lexical analysis (a left to right pass over tokens that may keep a bracket stack) alone can disambiguate. It is important to be able to distinguish, without a list of type names:
  - Parentheses that group formal parameters from those that group actual parameters.
  - Declarations from assignment statements.
Craft declaration and flow control constructs to fit within the cover grammar modulo explicitly handled legacy constructs like do...while... and switch(...){case...:...}.

Specifically:
- require parentheses unlike Go and Swift
- make brackets non optional like Go and Swift
Avoid layered namespaces like Java's layering of explicitly imported type names, wildcard imported type names, explicitly imported static properties and methods, wildcard imported properties and methods, local identifiers, inherited properties and methods.

Make local identifiers always mask external definitions and maybe use a lightweight convention like .identifier instead of just this.identifier so that tools can confidently distinguish local references from references that can only be resolved with knowledge from other source files.
If you must automagically insert semicolons, leave it to tools like IDEs, and do it based on lexical analysis.

Summary

Separating parsing of coarse-level structure from fine-grained structure provides benefits including: better tool support during development, easier partial program analysis, language extensibility, and metaprogramming.

Some simple guidelines make it easier to craft a new language that gets these benefits and uses a syntax familiar to developers who work with widely-used, C-like languages.

Ambiguity between less-than & angle-brackets means that not all existing languages cannot be migrated to this scheme, although, strangely, some complex languages like TypeScript may be migratable. There are clear precedents that show how to design around this ambiguity.

That ambiguity avoided, operator precedence parsers can parse C-like languages. (Operator precedence parsers have not been widely used for parsing C-like languages, except in conjunction with RD, though Perl 6's Parser Grammar Engine may be one precedent.)

This operator precedence parsers degrades gracefully given partial and broken inputs. It combines a small fixed set of rules with an extensible operator partial-order, so also meets the extensibility goal.

The same techniques that allow grammar extensibility, also allow extensible quasiquotation similar to Common Lisp's read.

Try it out

Input
Tokens
Parse Tree
Problems
Flat Tree
AST
Parts of Speech

↑ Output, ↓ Input

Thanks for reading!

Towards Homoiconicity for {C.like;} languages