{C.like;}
languagesError tolerant parsing techniques for C-like languages to enable better tool integration, partial analysis, language extensibility, and meta-programming.
Keywords: combinators, homoiconicity, IDE, meta-programming, operator-precedence, parsing, quasiquotation
Most developers are familiar with languages whorse surface syntax is "C-like": C, C++, C#, Go, Java, JavaScript, Perl, Rust, Scala, Swift, and TypeScript. But the complexity of parsing C-like languages makes it hard to achieve what Lisps take for granted.
Lisps reuse a few syntactic constructs that transparently relate to commonly used datatypes. This makes it easy for programmer tools to deal with program fragments, for designers to extend the language with new special forms or macros, and for programs to manipulate programs as data.
Lispers often group these benefits under the rubric of “homoiconicity”. That term has some problems, so this document adopts goals attributed to homoiconicity but does not claims that any resulting language is homoiconic. These goals include:
The coarse-grained structure of a program can be inferred without a grammar rule per special form.
This lets tools easily analyzing a fragment of a larger program at the lowest appropriate level of abstraction. For example, test coverage tools and dynamic call graph generators instrument programs but can often operate on a token stream.
It also lets tools provide degraded service when given a program that uses a new special form.
Unfortunately large grammars make it hard for tools simpler than the compiler to accurately deal with program fragments, to degrade gracefully, or to deal with code as data. This limits a diverse tool ecosystems to languages with a large community that can afford the engineering effort.
This document introduces a way for niche languages to use familar C-like syntax and hopefully still support a diverse tool ecosystem. Specifically it
Grammars with many, fine-grained rules can do a good job preventing nonsensical inputs from confusing later compiler stages, but typically produce either a parse tree, or error messages, not both.
Statement | ::== | for | ( Stuff ) { Stuff } |
/ | if | ( Stuff ) { Stuff } … | |
/ | while | ( Stuff ) { Stuff } | |
/ | … |
A large grammar like this could adapt to account for common developer errors and to deal with partial inputs, but this takes engineering effort per rule, and makes the grammar even larger.
if ( x ) { fn(; }
┗━┛ ┃ ╹ ┃
┗━━━┛
A parser that degrades gracefully on inputs like the above, should recognize the kind of rules-of-thumb that programmers use to reason about broken inputs:
The grammar above has a lot of repetition.
In languages with C-like grammars, most of the flow control special
forms, and some other special forms (like JavaScript's
function(...) {...}
) follow the
convention:
Statement | ::== | keyword [ ( Stuff ) ] { Stuff } … |
A cover grammar is a grammar that matches all strings matched by
some “covered” grammars so that a later pass can
disambiguate between them. For example, consider the JavaScript
expression (a = null) => (a = null)
.
In it, (a = null)
appears twice,
once as a formal parameter list with a default value, and once as an
assignment. To avoid reparsing based on whether a
=>
follows the close parenthesis, JavaScript
engines often use a cover grammar for FormalParameterList
and PrimaryExpression.
We can formalize developers' intuitions about a language by crafting a few rules that cover the myriad rules in the full grammar. This lets a few error recovery strategies tuned to programmers' intuitions carry much weight.
Each of the branches for Statement above looks like a function application followed by a block, but there are a some conventions which distinguish these from normal function calls.
{…}
).{…}
may be executed
late, multiply, or not at all.Additionally, certain keywords allow chaining these constructs, e.g. else.
Most C-like flow control constructs can be covered by a single rule:
FlowControl | ::== | Keyword ParenPartopt Tail; |
ParenPart | ::== | ( TokenSoup ); |
ParenPartopt | ::== | ParentPart / ε; |
Tail | ::== | { TokenSoup } MoreFlowControl |
/ | ;; | |
MoreFlowControl | ::== | InfixKeyword FlowControl |
/ | ε; | |
InfixKeyword | ::== | else if / else / catch / finally; |
Not all follow this convention but many do, and below we detail how to handle exceptions to the rule.
This document builds a parser in JavaScript so that it can run in your browser. You can see below that the parser operates in three stages:
Before jumping into code, it's worth defining some terms. For the purposes of this document:
An AST need not contain all significant tokens, e.g. an implementation might drop parentheses used merely for grouping, or tokens like “if” which are implicit in the tag.
An inner AST node may have the special error tag in which case its children are a soup of tokens to which later passes should not assign clear semantics.
Operator precedence parsers reuse a few simple rules. It's straightforward to craft such a parser to produce a parse tree for every input so they recover well from many kinds of syntax errors. Instead of having a grammar rule for each language construct, an operator precedence parser behaves like a grammar driven parser with just one rule per kind of operator:
Operand | ::== | PrefixOperator Operand |
| | Operand InfixOperator Operand | |
| | Operand TernaryOperator Operand TernaryFollower Operand | |
| | Operand OpenBracket Operand CloseBracket | |
| | Operand PostfixOperator | |
| | SimpleOperand; |
Operator precedence parsers use “operators” to figure out
how to turn a series of tokens into a tree structure. This lets us
infer parentheses in expressions and to identify when a keyword like
else
or the while
in do...while...
groups things to its immediate left and right.
So PrefixOperator defines which operators
can appear before their operands like the ‘-’ in
“-x
”;
InfixOperator defines which can appear
in the middle like the ‘+’ in
“x+y
”.
An operator precedence parser still needs to decide how to group
operands: for example, if
InfixOperator::==+|×
then naïvely applying the grammar above to
“a×b+c
” would give two possible
inerpretations: (a×b)+c and a×(b+c).
Operator precedence parsers use an order over operators to avoid
ambiguity. This order gives precedence to some operators
and answers the question “Can this operator contain that
operator unparenthesized?” For example, if ‘+’ can
contain ‘×’ unparenthesized but not vice versa
then valid interpretations of “a×b+c
”
include
(a×b)+c but not a×(b+c).
Here is an operator precedence table similar to those that appear in user-documentation for C, Java (normative), and JavaScript which gives a foundation for our precedence relation.
We need to be able to match up brackets.
Some derived tables simplify later code greatly.
These operators give arithmetic & logical expressions theit usual meanings.
Note: Herein, parse trees are shown with a rectangle around each inner node. Click the expando (▶) to see an indented view of the same. These diagrams are generated programatically by the described parser running as JavaScript in your browser.
There is a deep, unavoidable ambiguity in C++,
Java, and probably Rust. A ‘<’ token may be either
infix operator less-than or the start of an angle bracketed group of
type parameters. Consider the partial token sequence …
f ( a < b , c > d )
… in Java:
public <b, c> int f(a<b, c> d) { // ⇧ ⇧ ⇧ ⇧ // Angle brackets group type parameters in a method signature. return d.f(a < b, c > d); // ⇧ ⇧ // Using less than, greater than to compute actual parameters. }
This is a real problem and this document only provides a partial solution to this ambiguity. We take a cue from TypeScript which allows types (and hence type parameters) in a few readily identifiable contexts:
<typecast>
syntax which we
ignore.
Here's how things parse when our operator precedence function
only allows <…>
to the right of the
operators listed above.
Note: JSXElements
like (image = <img src={src}>
)
do not require parse-time disambiguation. The code herein does not
handle JSX, but later sections introduce mechanisms that should extend
to that use case. Specifically, lexing explains
how to use one token of lookbehind to decide whether ‘/’
starts a division operator or a regular expression literal. A similar
approach could decide whether a ‘<’ starts an operator
or a JSX block. Also, preparsing
talks about a technique to group template literal parts which would
help handle JSX.
This approach was found to be brittle, and hard to build confidence in. It is included to document a dead-end.
It may be possible to resolve ambiguities, like angle bracket ambiguity, would be to provide additional context cues to the parser.
A part-of-speech tagger could run as a part of the pre-parse processing pass over tokens to add context usable by the operator precedence function. Parts of speech might include:
This example shows two distinct uses of ‘<’: once to bracket type parameters, and later as a comparison operator. It also shows three distinct uses of <{>: first to start a statement block, then o bracket a record type, and finally to construct a record.
An operator precedence function that used part-of-speech cues could assign different precedences to these different uses.
Here's a tagger that seems to work well for TypeScript, but is brittle.
This tagger is a fairly complicated push-down automaton, but languages for which it's possible to specify may be able to exploit that fact to simplify their precedence functions.
Besides that, expressions in C-like languages are a good fit for
precedence tables like this. It turns out that statements can be
shoe-horned in with a wrinkle though.
As discussed, most flow control
constructs in C-like languages have the form keyword (expression)
statement
.
There are a few keywords that precede blocks but that always follow
statements. And ‘;
’ separates statements.
With this, we're ready to parse some statements.
Except that in do{...}while(...);
the while
appears like an infix operator, but only directly
inside a do
which acts like a prefix operator.
Also, ‘:’ is odd. It seems to have widely distinct precedences:
case
or default
it acts as a postfix
operator and terminates any expression.
There's one more wrinkle. In keyword
(expr) { stmt; }
, the {}
s act as an infix
operator like the ()
s. This is not what we want for
the second pair of {}
s in while (b) {x} {y}
.
In C-like languages, {y}
is a block that
executes separately from, and after the while
construct. So we will disallow chaining of {}
as an
infix operator.
Defining prefix operators for keywords like return
that precede bare expressions ensures we won't get a different parse tree
depending on whether parentheses are present, and return-1
won't parse to infix ‘-’ as it should in x-1
.
Knowing all this, we're ready to craft our answer to “Can this operator contain that one unparenthesized?”:
This function takes >50 lines of code to handle legacy constructs and corner cases from a slew of languages to demonstrate that handling these is feasible. An equivalent function for a language that did not need to support all these corner cases might be much simpler. For example, using […] to group type parameters, a la Scala, would obviate half of this code.
This function depends on some bookkeeping functions to juggle brackets.
This works pretty well but there are a few quirks worth mentionining.
In f(actual0, actual1,
actual2)
there are nested infix commas
(,
) which will later require a bit of work in the
grammar (see commaop:
).
But semicolons (‘;’) are postfix operators so there's
no nesting in blocks, and in for (init;cond;incr)
.
In for loops, parts are optional.
“for (;;) {}
” has a postfix semicolon
operator applied to a leaf ‘;
’ token which
requires an extra branch in the for-loop rule.
Multiple adjacent NOPs behave the same.
Before defining the parser, we abstract out the set of operators.
(This is just a wrapper for our operators
array, but
the abstraction will come in handy later when talking about
user-defined operators a la Scala.)
We can use the operator definitions and precedence relation to define a parser.
Tests use these unless otherwise specified.
We've defined a parser that operates on a stream of tokens, but have not yet defined the source of those tokens. This lexer is not interesting except to demonstrate strategies for dealing with irregularities and recursive lexical constructs without entangling the parser & lexer.
Note: Readers may find it confusing when I use the term
“context-free” (CF) to refer specifically to the lexical
grammar instead of the language as a whole. Some C-like languages
have simple lexical grammars but are not CF. E.g., in C++ the parser
needs context about whether “x” is a type to decide
whether x* y;
is a declaration of a pointer variable or an
application of operator *
. This kind of complication
does not mean that an input cannot be split into tokens with a CF
lexical grammar, even though you may need to do some tricks when
“>>
” follows type parameters.) See
also
Trevor Jim.
C-like languages' lexical grammars need not be regular, meaning there need not exist a single regular expression that partitions the input into tokens. JavaScript and Perl give some ideas of where common non-regularities arise; neither have either regular or context-free lexical grammars; both require scannerless parsing.
It is possible to restructure the parser above to be scannerless; where the parser above asks “is the current token an infix operator?”, a scannerless parser would ask “which, if any, is the longest infix operator that is a prefix of the remaining input and can nest here?”.
We do not do that though, because non-CF lexical grammars defeat our goal of allowing simple analyzers to correctly segment fragments of code that start and end on token boundaries.
So CF grammars meet our goals, but regular lexical grammars are
insufficient for commonly used, C-like languages. Complications arise
when deciding whether ‘/’ starts a division operator
(like /
or /=
) or a
/regular-expression-literal/
.
Below is a lexer that recognizes
/regular-expression-literals/
and
JavaScript-esque `template ${ literals }`
.
It is CF but non-regular in two ways:
When parsing a token starting with ‘/’
//
or /*
then the
current token is a comment token.This differs from JavaScript around expressions like x = ++/regex/i.index
and x++ /num/i
because some operators (++, --, +, -) are both infix/prefix
and postfix operators.
(The author believes this is not a source of confusion in practice,
but has not measured).
Our lexer works as expected on some sample inputs. These tests show the series of tokens produced, with comment and whitespace tokens filtered out.
Unfortunately, our parser can't deal with complex constructs like
`staticText ${ dynamicValue } etc.`
that span multiple
tokens.
We could build knowledge of this into the parser, but that would defeat the goal of having few, simple rules, and moving most of the understanding into the language-specific parse tree → AST transform.
Instead, we transform the token stream between the lexer and the parser by adding synthetic tokens. This keeps knowledge about complex lexical constructs in the pre-parse phases.
Before | `staticText ${ |
dynamicValue | } etc.` | ||||
---|---|---|---|---|---|---|---|
After | ( | `staticText ${ |
( | dynamicValue | ) | } etc.` | ) |
The added parentheses have the effect of ensuring that the whole
string is grouped as one construct, and that expressions inside holes
(${...}
) are self-contained. Open parenthesis is overloaded
in most C-like languages to mean (grouping, function call, type cast)
so we take care that the synthesized parentheses around holes
are not interpreted as INFIX operations.
We can see that this parses sensibly, even when there's an errant ‘;’ in an expression that fills a hole, and that the parentheses around holes do not participate in function application.
Some C-like languages automatically insert semicolons.
let x = 1 /* ; */ let y = 2 // Not here + x /* ; */
This may make writing code easier, requiring fewer keystrokes and getting rid of common error messages about missing semicolons. ASI may be a net positive for authors, but in all languages that do it is a source of uncommon, but subtle bugs. Whether it is better for readers is unclear.
This section discusses different varieties of semicolon insertion and how they affect operator precedence parsing. Later, suggestions provides concrete advice. The parser introduced above does not do semicolon insertion.
ASI can be deeply entangled with parsing:
When, …, a token (called the offending token) is encountered that is not allowed by any production of the grammar, then a semicolon is automatically inserted before the offending token if …
This operator precedence scheme allows two non-operator tokens to
appear adjacent: int x
. An attempt to build ASI into an
operator precedence parser would need to design the language to avoid
adjacent words or explicitly handle this via some kind of adjacency
operator, so that the parser could elect to insert a semicolon when
the current token would require popping the stack past semicolon's
precedence level, and a postfix semicolon could accept the lowest
precedence stack element that has precedence greater than semicolon.
Alternatively, ASI can be based on lexical analysis before parsing starts, which would allow it to be done during token preparsing which would allow ASI without entangling the parser:
a semicolon is automatically inserted into the token stream immediately after a line's final token if that token is …
A line ending is treated as a semicolon unless one of the following conditions is true: [predicates about the top of the bracket stack, or next/last token]
Simple operations like moving tokens from one line to another to split overly long lines becomes more complicated when you need to ensure that the compiler won't differently insert semicolons. For example, in JavaScript ASI can cause two similar looking programs to be interpreted quite differently (see also Ecma262 ¶ 11.9.2):
for (let i = 0; i < 10; i++) { f(i); } // ↓ for (let i = 0; i < 10; i ++) { f(i); } // SyntaxError. Per JavaScript's ASI rules `++` cannot be separated // from its operand by a line break. longName = i++ f() // ↓ longName = i ++ f() // Loud semantic change. Second equivalent to `longName=i;++(f());` function f(longName) { return longName() } // ↓ function f(longName) { return longName() } // Silent semantic change. Call never happens. Returns undefined
With ASI, a developer using a generic editor function like Emacs's fill-paragraph (M-q) without a JavaScript-specific mode hook runs a risk of introducing subtle semantic changes to code. This is subideal.
Another option (unimplemented in any language to the best of the authors' knowledge) is to do ASI but not in a way that's invisible to authors and readers.
Much of the debate about ASI can be boiled down to two claims which seem mutually contradictory but are actually reconcilable:
Language designers could specify how ASI should be done, commit to taking ASI into account when considering backwards compatibility, but leave the actual insertion to developer tools.
Scala shows one way of providing for user-defined infix expression operators.
Any method with a single parameter can be used as an infix operator.
…When an expression uses multiple operators, the operators are evaluated based on the priority of the first character: …
It's unsurprising that operation precedence parsers can handle user defined operators with different precedences.
Now, user code can use an operator not defined above like
‘<=>
’ (trinary comparison in Perl)
with precedence similar to ‘<
’:
This requires some changed to lexing since our stock lexer treats
‘<=>
’. The testcase above uses a
lexer that simply splits on word breaks and spaces.
One downside is that custom operators require developers to use more
whitespace even in code that does not use custom operators. This
limitation arises because there is no closed set of punctuation
strings that lets us split, for example “x*-y
”
into “x * -y
” because
“*-
” could be a user defined operator. C
already has this problem to a small degree because
“x--y
” is not the same as
“x - -y
”, but developers typically write
“x+y
” instead of subtracting a negated
value, so the fact that ‘++
’ and
‘--
’ are unsplittable by context-free
lexers does not, in practice, confuse C authors.
These tests show some common ways a parse tree may fail some generic well-formedness checks, and in the process catalogue some common kinds of problematic inputs. Red wiggly underlines show tokens that participate in a malformed subtree. Click the expando to see the error messages.
Many tools properly operate on parse trees, but some, chiefly compilers and interpreters, do better with representations that lack extraneous details, and are more directly related to language specification abstractions or elements of the output. “Compilers: Principles, Techniques, and Tools” by Aho, Sethi, Ullman contrasts parse trees with a more abstract tree representation:
Abstract syntax trees, or simply syntax trees, differ from parse trees because superficial distinctions of form, unimportant for translation, do not appear in syntax trees.
Parser combinators are a fine way to turn a source text into a parse tree, but they can also be used to derive an abstract syntax tree (AST) from a parse tree.
These tests show ASTs. Each inner AST node has a green border, and the node type in small, green print overlapping the top-left of its border. Leaf nodes are black text, have no type, and always correspond to tokens in the input.
And this doesn't prevent containing errors:
Combinators operate on a sequence of tokens, so, first, we need to transform our parse tree into a sequence.
Input | “ | let | x | = | 1 | ; | ” | |||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Tokens | "let" | "x" | "=" | "1" | ";" | |||||||
Parse Tree | [ | [ | "let" | , | "x" | ], | "=" | , [ | "1" | ], | ";" | ] |
Flat Tree | ⇨ | ⇨ | "let" | "x" | ⇦ | "=" | ⇨ | "1" | ⇦ | ";" | ⇦ |
As you can see, a flattened parse tree:
The flatten function takes a parse tree and produces a sequence with indent and dedent psuedo-tokens around the content of each inner node.
Because the flattened tree contains explicit ⇨ and ⇦ it can be consumed by simple combinators that make no special provisions for either Left- or Right- recursion. Readers familiar with widely used combinator libraries will probably find the implementation uninteresting; you can skip them without missing anything.
There's nothing novel about the combinator definitions, so if you already know how combinator libraries tend to work, you can gloss over them without missing much. The main thing to note is that they don't use side-effects to build a parse tree. Combinators read an input buffer containing a flattened parse tree and build a flattened AST by appending to an output buffer which is then “lifted” into a proper tree in a reverse of the flatten function used above.
Here is a grammar for a toy language that looks like TypeScript.
This grammar has gaps (no switch
); it is too long already.
Unless otherwise stated, examples hereafter use this transform from parse trees to ASTs:
The rest of the end-to-end tests are a bit duplicative.
To answer “How good is this parser at containing damage?” we collect statistics on how many substrings can be removed from a benchmark code sample without affecting the parsing of structures on either side.
outerBefore(); { innerBefore(); f(a/2*'3'); innerAfter(); } outerAfter();
This benchmark code fragment was crafted to include a variety of token types, and so that some mutations would introduce new token types like comments and regular expression literals.
For each possible substring of the underlined portion, we parse a program having removed that substring from the benchmark, and test several properties of the resulting AST:
outerBefore
?innerBefore
?innerAfter
?outerAfter
?node
?has outer before | has inner before | has inner after | has outer after | has error |
---|
The parser reliably parses statements before the mutated statement and in a
different block.
The parser fairly reliably parses statements after the end of the block,
and in the same block but before, but is not perfect;
removing the substring “2
” produces
f(a/*'3');
and the block comment consumes
everything after it.
The parser has mixed success parsing statements that occur after the mutated
statement in the same block.
The parsing scheme outlined above provides for a limited kind of homoiconicity; tools less complex than the compiler can consume fragments of code, and those rules would extend to new special forms.
Homoiconicity also makes it easier for programs to operate on or produce program fragments. Quoting from “Quasiquotation in Lisp” (Bawden 1999):
Quasiquotation is the technology commonly used in Lisp to write program-generating programs.
…
The backquote character (`) introduces a quasiquotation … Inside the quasiquotation, the comma character (,) marks expressions whose values are to be substituted into the result.
…
S-expressions were at the core of McCarthy's original version of Lisp. The ability to manipulate programs as data has always been an important part of what Lisp is all about. But without quasiquotation, actually working with S-expressions can be painful.
So quasiquotation requires two affordances:
JavaScript-esque template literals provide this, but run afoul of:
The string substitution that underlies [
fprintf
] has no understanding of the syntactic structure of the programming language being generated.
Javascript's tagged template literals allow that, but require implementing a parser for code with holes and lose source metadata. This section shows how this operator precedence scheme can meet the goals of quasi-quotation within a C-like language.
… goals for a successful implementation of quasiquotation:
- Quasiquotation should enable programmers to write down what they want the output to look like, modified only slightly in order to parameterize it.
- The parameter expressions should appear inside the template, in the positions where their values will be inserted.
- The underlying data structures manipulated by quasiquotation should be rich enough to represent recursively defined structures such as expressions.
Defining “\{
” and “\(
” as
bracket operators, enables quoting statements and expressions.
Defining “${
” as an expression operator allows
embedding unquoted expressions inside quoted code.
Then we add some extra productions to the toy grammar, and augment a
few existing ones.
With that “\{
…}
” embeds
a parse sub-tree as data. qtree and
other bluish AST nodes recreate the parse tree structure of the
quoted portions.
If qinner behaves like array and qleaf like literal then that quasiquotation would produce a tree like data structure. The qhole breaks back out into unquoted AST nodes that could contribute to the larger data structure.
Since “\{
…}
” uses the
cover grammar, its consumers can recognize forms not allowed by the
parse-tree → AST phase. In this sense, it is similar to Common
Lisp's read
which recognizes some characters
(‘[’, ‘]’, ‘{’, ‘}’,
‘?’, and ‘!’) which are not used by Common
Lisp syntax but which are explicitly reserved for user extensions.
List structure is not quite as stark a representation [of code] as character strings, but it is still pretty low-level. Perhaps we would be happier if, instead of manipulating lists, our quasiquotation technology manipulated objects from a set of abstract data types that were designed specifically for each of the various different constructs in our language (variables, expressions, definitions,
cond
-clases, etc.). After abandoning character strings as too low-level it seems very natural to keep moving towards even highler-level representations.
Separately, “\(
…)
”
quotes an AST with holes. These two constructs let program
generators deal with program fragment templates at two different
levels of abstraction: parse trees and ASTs.
The second kind of quasiquotation exposes details of language specification abstracts to user code. Some language designers have expressed concern that this may make it harder to evolve the language. On a proposal for a binary encoding for EcmaScript AST:
WH: compatibility means existing nodes strips compiled continue to work as we upgrade the language. But as ECMAScript evolves the same text source code may compile to different nodes even if they don't use new constructs. … The Hyrum's Law challenge is that folks may come to rely on source text compiling to a specific AST.
The parse tree form has fewer rules, but there is still a non-zero lock-in risk:
WH: In this committee it came up discussing changing associativity of
||
from left-to-right to right-to-left in order to properly support one of the variants of the??
proposal. It's invisible from within ECMAScript but would change which AST gets generated. There was opposition related to the effect this would have on Babel's ASTs.
In Lisp, the grammar for S-expressions is very simple and stable. Without experience maintaining a language whose initial parsing is based on cover grammars & precedence table, it's hard to say how stable these are over time, or what strategies are effective in working around compatibility problems due to Hyrum's law.
This parser scheme produces a parse tree that groups tokens for some C-like languages into a tree that is unambiguous and which can be processed into an AST.
It deals poorly with some idioms and grammatical constructions.
As previously discussed, there is a deep, unavoidable ambiguity between ‘<’ as a comparison operator and as a bracket around type parameters.
Unless there is a closed set of operators that precisely
distinguish type contexts from exception context, this is
incorrigible. template
could be a prefix operator, but
operator precedence parsing cannot take into account questions like
“Is vector
a type name?”
This is subideal because the low precedence “--
” operator
captures the apparent function call if (cond)
since it can function
as both a prefix and postfix operator.
This can be partially fixed by defining if
and friends as low-priority
prefix operators, as is already done for do
.
The if
node is now self contained, but without some
invisible adjacency operator auto-inserted between the close
parenthesis and --
the parse tree is still odd. Worse,
the --
acts as a postfix operator on if
so this
alternate nesting scheme would require extra rules to prevent postfix
operators applying to if
.
Worse, this is no longer homoiconic. New operators that want to syntactically mimic the
structure of if
need to have their own entry in the precedence table.
The cover grammar allows
else if
to follow else
which might
introduce some shift-reduce style problems. This may be corrigible
by tweaking canNest
with respect to a
prefix if
and infix else
or by
recognizing else if
as a separate, two token infix
operator in preparseTokens as for Python-esque
not in and is not.
This is not a problem when the language requires brackets around if bodies as the toy language does.
Go makes parentheses optional around flow control constructs; it
uses for init;cond;incr { }
instead
of for (init;cond;incr) { }
.
This is subideal because the low priority &&
operator captures the higher
priority infix {}
operator.
Again, defining infix curly brackets as low-priority can probably address this
but the author has not proven this.
(TODO: try this. It shouldn't break do{}while)
It would be nice if tools could make high-quality-ish conclusions about chunks of a program. One of the most important things is to match declarations with uses, which is often a prerequisite to identifying free variables.
The author has been unable to find a generic algorithm that identifies declarations.
It is generally the case that in keyword
(...) {...}
any declarations immediately inside the
parentheses are in scope for the code in the curly brackets.
This is not sufficient.
To recognize a declaration in C, C++, and Java, you need to have a list of symbols that are
type names. For example, in C++, {x * y;}
is a declaration of a pointer when
x
is a type name, or an invocation of the infix ‘*’ operator when
it is not.
In Java, inner classes can be inherited, so local analysis is not
sufficient to tell whether x.y.z
is a reference to a static member of
unqualified type x
, a reference to a member of a (possibly inherited)
field x
, or a reference to a static member of a class with the absolute
name x.y
.
Languages that consistently set declarations apart with a keyword
like let
, const
, val
,
or var
could define those as prefix operators and
may avoid this problem.
Code inside a class declaration in languages like Java and C++ inherit masking symbols, so you need to understand the inheritance structure to draw conslusions about variables in scope.
Languages that require this.x
to access fields, or that shorten this to
.x
by defining ‘.’ as a prefix operator can avoid this problem.
If you wish your language to be parseable by schemes like this:
<…>
to group type parameters.
For example, Scala
uses
[…]
.
Craft declaration and flow control constructs to fit within the
cover grammar modulo explicitly
handled legacy constructs like do...while...
and
switch(...){case...:...}
.
Specifically:
Avoid layered namespaces like Java's layering of explicitly imported type names, wildcard imported type names, explicitly imported static properties and methods, wildcard imported properties and methods, local identifiers, inherited properties and methods.
Make local identifiers always mask external definitions and
maybe use a lightweight convention like .identifier
instead of just this.identifier
so that tools can
confidently distinguish local references from references that can
only be resolved with knowledge from other source files.
If you must automagically insert semicolons, leave it to tools like IDEs, and do it based on lexical analysis.
Separating parsing of coarse-level structure from fine-grained structure provides benefits including: better tool support during development, easier partial program analysis, language extensibility, and metaprogramming.
Some simple guidelines make it easier to craft a new language that gets these benefits and uses a syntax familiar to developers who work with widely-used, C-like languages.
Ambiguity between less-than & angle-brackets means that not all existing languages cannot be migrated to this scheme, although, strangely, some complex languages like TypeScript may be migratable. There are clear precedents that show how to design around this ambiguity.
That ambiguity avoided, operator precedence parsers can parse C-like languages. (Operator precedence parsers have not been widely used for parsing C-like languages, except in conjunction with RD, though Perl 6's Parser Grammar Engine may be one precedent.)
This operator precedence parsers degrades gracefully given partial and broken inputs. It combines a small fixed set of rules with an extensible operator partial-order, so also meets the extensibility goal.
The same techniques that allow grammar extensibility, also allow
extensible quasiquotation similar to Common Lisp's read
.
Input |
|
---|---|
Tokens | |
Parse Tree | |
Problems | |
Flat Tree | |
AST | |
Parts of Speech |
↑ Output, ↓ Input