Indentation-Sensitive Parsing with Pyramids
March 24th, 2022
Abstract
This is a short note about combining indentation-sensitivity and pyramid grammars. We describe the syntax of indentation-sensitive pyramid grammars by one such grammar.
Introduction
A central idea of
Practal is to empower users to express their mathematical and technical ideas freely and concisely. An important facet of this is that users must be able to define their own syntax in a simple way. This means that Practal must support
user-defined collaborative syntax. The collaborative aspect is important because users must also be able to reuse each others theories. Fragments of syntax defined by different parties will then mesh with each other, and it is paramount to have a system in place which can detect and point out ambiguous and conflicting syntax.
The idea of collaborative syntax gave rise to the concept of
Local Lexing [1]. As it turns out, a major obstacle to combining different syntax fragments is that parsing is usually split into two phases: lexing and actual parsing. While the actual parsing phase is based on context-free grammars and composable, the lexing phase is not composable at all. Local Lexing remedies this by not separating those two phases but interleaving them.
Local Lexing is a very general concept, in fact it can be understood as a general semantics of parsing.
Pyramid grammars have been introduced (also in
[1]) to package Local Lexing into a familiar and easy-to-use form. A pyramid grammar looks like a context-free grammar, but both terminals and nonterminals can appear on the left-hand side of rules. They can thus be thought of as a natural extension of context-free grammars. In fact, pyramid grammars are powerful enough to also encompass
parsing expression grammars [2].
In a further step, we now add the ability to specify indentation-sensitive syntax with pyramid grammars.
Indentation-Sensitivity
Adams [3] invented a method to amend context-free grammars with layout annotations to allow the specification of indentation-sensitive syntax.
Later on [4], he applied the same method to parsing expression grammars. Given that pyramid grammars are generalisations of both context-free grammars and parsing expression grammars, his approach seemed to be ideally suited for us. This turns out to be true.
At this point, we will not dive into the technical details of how exactly the integration of the Adams method with pyramid grammars works. It is quite nice but straightforward and will be covered at a later time.
Instead, we will just quickly describe how the layout annotations of Adams are written for pyramid grammars:
- We write A>, A≥ and A= for A>, A≥ and A=, respectively.
- We write A# for A⊛.
- Instead of ∣A∣, we write @A.
- We introduced an additional layout operator §A, which signifies that on the line where A starts, there are only spaces in front of it.
Every symbol on the right hand side of a rule needs to be annotated with either >, ≥, = or #. This becomes annoying quickly, and it is therefore convenient to specify a default relation which is used in case of a missing layout annotation. This default relation can be overriden for all rules belonging to a particular left-hand side symbol.
Expression Groups
We also added the concept of
expression groups to pyramid grammars. Expression groups are just syntactic sugar, and provide a simple way to specify groups of operators of different precedence. For example, it is possible to specify this
C Operator Precedence table pretty much one-to-one via expression groups. The precedence of an expression group is simply given by its position in the syntax specification. The higher up it appears, the higher its precedence is. The associativity of an operator can be influenced by prepending expressions with a tick:
Plus ⟶ `Expr + Expr
Here the left Expr has the same precedence as Plus, while the right Expr has higher precedence than Plus. This makes + a left-associative operator.
The idea for expression groups originates from the work on syntax for
Practical Types [5].
Parse Tree Generation
We have argued in
[1] that parse tree generation should not be explicitly defined, but automatically inferred from the grammar. To aid this, both nonterminals and terminals are classified along two axes:
- Structure: atomic and nested
- Visibility: visible, syntax, auxiliary
Nested symbols contain parse trees as children themselves, while atomic symbols have no children. By default, nonterminals are nested, and terminals are atomic.
Visible symbols have their own nodes in the parse tree, while for auxiliary symbols only their children appear in the parse tree. Thus a symbol which is both atomic and auxiliary is effectively hidden. By default, both nonterminals and terminals are visible.
A syntax symbol is a symbol that is visible for the purposes of creating a concrete syntax tree (for example in order to perform syntax highlighting), but auxiliary for the purposes of creating an abstract syntax tree.
Automatically generated
Swift data structures for the abstract syntax tree of the grammar below can be inspected
here.
The Grammar
We will now simply give you the pyramid grammar that describes its own syntax. Syntax-highlighting has been added to it, but you can also view it as
pure text.
By default, symbols starting with an uppercase letter are nonterminals, and symbols starting with a lowercase letters are terminals.
default relation ≥
ExprGrammar ⟶ ws {Decl= ws}
terminal(auxiliary, #) ws sp
nested ws
syntax comment
ws ⟶ (\s / \n / comment)*
sp ⟶ \s*
comment ⟶ "//" ~\n*
comment ⟶ "/*" (!"*/" ⋅)* "*/"
auxiliary Decl
Decl ⟶ @§NonterminalDecl=
Decl ⟶ @§TerminalDecl=
Decl ⟶ @§NestedDecl=
Decl ⟶ @§AtomicDecl=
Decl ⟶ @§AuxDecl=
Decl ⟶ @§SyntaxDecl=
Decl ⟶ @§RuleDecl=
Decl ⟶ @§PriorityDecl=
Decl ⟶ @§DefaultDecl=
Decl ⟶ @§ExpressionDecl=
Decl ⟶ @§TreatAsDecl=
NonterminalDecl ⟶ keyword:"nonterminal" SymbolDecl>
TerminalDecl ⟶ keyword:"terminal" SymbolDecl>
NestedDecl ⟶ nested SymbolList>
AtomicDecl ⟶ atomic SymbolList>
AuxDecl ⟶ aux SymbolList>
SyntaxDecl ⟶ syntax SymbolList>
auxiliary SymbolDecl
SymbolDecl ⟶ Modifiers SymbolList
Modifiers ⟶ [sp "(" ws [Modifier {ws "," ws Modifier} ws] ")"] ws
Modifier ⟶ aux | atomic | nested | syntax | Relation
aux ⟶ "aux" | "auxiliary"
atomic ⟶ "atomic"
nested ⟶ "nested"
syntax ⟶ "syntax"
Relation ⟶ greaterOrEqual | greater | equal | unrelated
greaterOrEqual ⟶ ">=" | "≥"
greater ⟶ ">"
equal ⟶ "="
unrelated ⟶ "#"
SymbolList ⟶ ws [symbol {ws symbol}]
symbol ⟶ letter {letter | digit}
auxiliary letter digit hexdigit
letter ⟶ "a" ... "z" | "A" ... "Z"
digit ⟶ "0" ... "9"
hexdigit ⟶ digit | "a" ... "f" | "A" ... "F"
nested accentedSymbol
accentedSymbol ⟶ "`" symbol
RuleDecl ⟶ symbol sp arrow> ws Expr>
syntax arrow
arrow ⟶ "->" | "-->" | "→" | "⟶"
TreatAsDecl ⟶ symbol sp backarrow> ws SymbolAlts>
syntax backarrow
backarrow ⟶ "<-" | "<--" | "←" | "⟵"
SymbolAlts ⟶ [symbol {ws op:"|" ws symbol}]
PriorityDecl ⟶ keyword:"priority" Priorities>
auxiliary Priorities
Priorities ⟶ ws [LongestMatch {ws op:"<" ws LongestMatch}]
LongestMatch ⟶ ws [symbol {ws symbol}]
DefaultDecl ⟶ keyword:"default" \s sp keyword:"relation" sp Relation>
syntax dots
dots ⟶ "..."
priority any < dots
syntax assignSyntax
assignSyntax ⟶ backarrow | ":"
expression Expr
group AtomicExpr
⟶ symbol
⟶ accentedSymbol
⟶ "(" ws Expr ws ")"
empty ⟶ "ε" | """" | "''"
any ⟶ "." | "⋅" | "·"
literal ⟶ """ {+ ~(""" | \n)} """ | "'" {+ ~("'" | \n)} "'"
hexchar ⟶ "\x" {+hexdigit}
decchar ⟶ "\d" {+digit}
space ⟶ "\s"
newline ⟶ "\n"
Opt ⟶ op:"[" ws Expr ws op:"]"
Rep ⟶ op:"{" ws Expr ws op:"}"
Rep1 ⟶ op:"{+" ws Expr ws op:"}"
Range ⟶ Expr ws dots ws Expr
Complement ⟶ op:"~" ws `Expr
Diff ⟶ `Expr ws op:"-" ws Expr
SyntaxTreatAs ⟶ symbol ws assignSyntax ws Expr
Related ⟶ Expr ws Relation
group
Front ⟶ layout:"§" ws `Expr
Abs ⟶ layout:"@" ws `Expr
group
OptGreedy ⟶ Expr ws peg:"?"
RepGreedy ⟶ Expr ws peg:"*"
Rep1Greedy ⟶ Expr ws peg:"+"
group
And ⟶ peg:"&" ws `Expr
Not ⟶ peg:"!" ws `Expr
Seq ⟶ [Expr {+ ws Expr}]
OrGreedy ⟶ `Expr ws peg:"/" ws Expr
Or ⟶ `Expr ws op:"|" ws Expr
auxiliary ExpressionPriorities ExpressionPriority
ExpressionDecl ⟶ keyword:"expression" sp symbol ExpressionPriorities>
ExpressionPriorities ⟶ ws {ExpressionPriority= ws}
ExpressionPriority ⟶ §@(ExpressionGroup= | ExpressionDef=)
ExpressionDef ⟶ GroupSymbol sp arrow ws Expr>
ExpressionGroup ⟶ keyword:"group" sp GroupSymbol ExpressionDefs>
ExpressionDefs ⟶ ws {§@ExpressionDefInGroup= ws}
ExpressionDefInGroup ⟶ [symbol] sp arrow ws Expr>
permeable ⟶ "!"
GroupSymbol ⟶ symbol [sp permeable]
GroupSymbol ⟶ [permeable]
Future Work
Parsing using indentation-sensitive pyramid grammars is currently based entirely on an unoptimised generalisation of
Earley parsing [6] adapted to Local Lexing. As such, it is currently very slow, to a point where it becomes unusable, especially for user interaction purposes. For example, the above grammar is parsed in about 15 seconds. Thus, to make pyramid grammars usable for Practal, we need to achieve a dramatic speedup by finding ways to apply more efficient parsing algorithms like
GLR [7] parsing, and furthermore by optimising the implementation.
Another issue is that we need to specify every single occurrence of whitespace (ws and sp in the above grammar). This seems excessive, and hopefully we can find a simple way to avoid it.
References
[7]Masaru Tomita (ed.). (1991). Generalized LR Parsing, Kluwer Academic Publishers.