Indentation-Sensitive Parsing with Pyramids

This is a short note about combining indentation-sensitivity and pyramid grammars. We describe the syntax of indentation-sensitive pyramid grammars by one such grammar.

A central idea of Practal is to empower users to express their mathematical and technical ideas freely and concisely. An important facet of this is that users must be able to deﬁne their own syntax in a simple way. This means that Practal must support user-deﬁned collaborative syntax. The collaborative aspect is important because users must also be able to reuse each others theories. Fragments of syntax deﬁned by different parties will then mesh with each other, and it is paramount to have a system in place which can detect and point out ambiguous and conﬂicting syntax.

The idea of collaborative syntax gave rise to the concept of Local Lexing [1]. As it turns out, a major obstacle to combining different syntax fragments is that parsing is usually split into two phases: lexing and actual parsing. While the actual parsing phase is based on context-free grammars and composable, the lexing phase is not composable at all. Local Lexing remedies this by not separating those two phases but interleaving them.

Local Lexing is a very general concept, in fact it can be understood as a general semantics of parsing. Pyramid grammars have been introduced (also in [1]) to package Local Lexing into a familiar and easy-to-use form. A pyramid grammar looks like a context-free grammar, but both terminals and nonterminals can appear on the left-hand side of rules. They can thus be thought of as a natural extension of context-free grammars. In fact, pyramid grammars are powerful enough to also encompass parsing expression grammars [2].

In a further step, we now add the ability to specify indentation-sensitive syntax with pyramid grammars.

Adams [3] invented a method to amend context-free grammars with layout annotations to allow the speciﬁcation of indentation-sensitive syntax. Later on [4], he applied the same method to parsing expression grammars. Given that pyramid grammars are generalisations of both context-free grammars and parsing expression grammars, his approach seemed to be ideally suited for us. This turns out to be true.

At this point, we will not dive into the technical details of how exactly the integration of the Adams method with pyramid grammars works. It is quite nice but straightforward and will be covered at a later time.

Instead, we will just quickly describe how the layout annotations of Adams are written for pyramid grammars:

Every symbol on the right hand side of a rule needs to be annotated with either >, ≥, = or #. This becomes annoying quickly, and it is therefore convenient to specify a default relation which is used in case of a missing layout annotation. This default relation can be overriden for all rules belonging to a particular left-hand side symbol.

We also added the concept of expression groups to pyramid grammars. Expression groups are just syntactic sugar, and provide a simple way to specify groups of operators of different precedence. For example, it is possible to specify this C Operator Precedence table pretty much one-to-one via expression groups. The precedence of an expression group is simply given by its position in the syntax speciﬁcation. The higher up it appears, the higher its precedence is. The associativity of an operator can be inﬂuenced by prepending expressions with a tick:

Here the left Expr has the same precedence as Plus, while the right Expr has higher precedence than Plus. This makes + a left-associative operator.

The idea for expression groups originates from the work on syntax for Practical Types [5].

We have argued in [1] that parse tree generation should not be explicitly deﬁned, but automatically inferred from the grammar. To aid this, both nonterminals and terminals are classiﬁed along two axes:

Nested symbols contain parse trees as children themselves, while atomic symbols have no children. By default, nonterminals are nested, and terminals are atomic.

Visible symbols have their own nodes in the parse tree, while for auxiliary symbols only their children appear in the parse tree. Thus a symbol which is both atomic and auxiliary is effectively hidden. By default, both nonterminals and terminals are visible.

A syntax symbol is a symbol that is visible for the purposes of creating a concrete syntax tree (for example in order to perform syntax highlighting), but auxiliary for the purposes of creating an abstract syntax tree.

Automatically generated Swift data structures for the abstract syntax tree of the grammar below can be inspected here.

We will now simply give you the pyramid grammar that describes its own syntax. Syntax-highlighting has been added to it, but you can also view it as pure text.

By default, symbols starting with an uppercase letter are nonterminals, and symbols starting with a lowercase letters are terminals.

// This is a Pyramid Grammar which describes its own syntax

default relation ≥

ExprGrammar ⟶ ws {Decl= ws}

terminal(auxiliary, #) ws sp
nested ws
syntax comment
ws ⟶ (\s / \n / comment)*
sp ⟶ \s*
comment ⟶ "//" ~\n*
comment ⟶ "/*" (!"*/" ⋅)* "*/"

auxiliary Decl
Decl ⟶ @§NonterminalDecl=
Decl ⟶ @§TerminalDecl=
Decl ⟶ @§NestedDecl=
Decl ⟶ @§AtomicDecl=
Decl ⟶ @§AuxDecl=
Decl ⟶ @§SyntaxDecl=
Decl ⟶ @§RuleDecl=
Decl ⟶ @§PriorityDecl=
Decl ⟶ @§DefaultDecl=
Decl ⟶ @§ExpressionDecl=
Decl ⟶ @§TreatAsDecl=

NonterminalDecl ⟶ keyword:"nonterminal" SymbolDecl>
TerminalDecl ⟶ keyword:"terminal" SymbolDecl>
NestedDecl ⟶ nested SymbolList>
AtomicDecl ⟶ atomic SymbolList>
AuxDecl ⟶ aux SymbolList>
SyntaxDecl ⟶ syntax SymbolList>

auxiliary SymbolDecl
SymbolDecl ⟶ Modifiers SymbolList

Modifiers ⟶ [sp "(" ws [Modifier {ws "," ws Modifier} ws] ")"] ws
Modifier ⟶ aux | atomic | nested | syntax | Relation
aux ⟶ "aux" | "auxiliary"
atomic ⟶ "atomic"
nested ⟶ "nested"
syntax ⟶ "syntax"

Relation ⟶ greaterOrEqual | greater | equal | unrelated
greaterOrEqual ⟶ ">=" | "≥"
greater ⟶ ">"
equal ⟶ "="
unrelated ⟶ "#"

SymbolList ⟶ ws [symbol {ws symbol}]
symbol ⟶ letter {letter | digit}

auxiliary letter digit hexdigit
letter ⟶ "a" ... "z" | "A" ... "Z"
digit ⟶ "0" ... "9"
hexdigit ⟶ digit | "a" ... "f" | "A" ... "F"

nested accentedSymbol
accentedSymbol ⟶ "`" symbol

RuleDecl ⟶ symbol sp arrow> ws Expr>
syntax arrow
arrow ⟶ "->" | "-->" | "→" | "⟶"

TreatAsDecl ⟶ symbol sp backarrow> ws SymbolAlts>
syntax backarrow
backarrow ⟶ "<-" | "<--" | "←" | "⟵"
SymbolAlts ⟶ [symbol {ws op:"|" ws symbol}]

PriorityDecl ⟶ keyword:"priority" Priorities>
auxiliary Priorities
Priorities ⟶ ws [LongestMatch {ws op:"<" ws LongestMatch}]
LongestMatch ⟶ ws [symbol {ws symbol}]

DefaultDecl ⟶ keyword:"default" \s sp keyword:"relation" sp Relation>

syntax dots
dots ⟶ "..."
priority any < dots
syntax assignSyntax
assignSyntax ⟶ backarrow | ":"

expression Expr
    group AtomicExpr
        ⟶ symbol
        ⟶ accentedSymbol
        ⟶ "(" ws Expr ws ")"
        empty ⟶ "ε" | """" | "''"
        any ⟶ "." | "⋅" | "·"
        literal ⟶ """ {+ ~(""" | \n)} """ | "'" {+ ~("'" | \n)} "'"
        hexchar ⟶ "\x" {+hexdigit}
        decchar ⟶ "\d" {+digit}
        space ⟶ "\s"
        newline ⟶ "\n"
        Opt ⟶ op:"[" ws Expr ws op:"]"
        Rep ⟶ op:"{" ws Expr ws op:"}"
        Rep1 ⟶ op:"{+" ws Expr ws op:"}"
    Range ⟶ Expr ws dots ws Expr
    Complement ⟶ op:"~" ws `Expr
    Diff ⟶ `Expr ws op:"-" ws Expr
    SyntaxTreatAs ⟶ symbol ws assignSyntax ws Expr
    Related ⟶ Expr ws Relation
    group
        Front ⟶ layout:"§" ws `Expr
        Abs ⟶ layout:"@" ws `Expr
    group
        OptGreedy ⟶ Expr ws peg:"?"
        RepGreedy ⟶ Expr ws peg:"*"
        Rep1Greedy ⟶ Expr ws peg:"+"
    group
        And ⟶ peg:"&" ws `Expr
        Not ⟶ peg:"!" ws `Expr
    Seq ⟶ [Expr {+ ws Expr}]
    OrGreedy ⟶ `Expr ws peg:"/" ws Expr
    Or ⟶ `Expr ws op:"|" ws Expr

auxiliary ExpressionPriorities ExpressionPriority
ExpressionDecl ⟶ keyword:"expression" sp symbol ExpressionPriorities>
ExpressionPriorities ⟶ ws {ExpressionPriority= ws}
ExpressionPriority ⟶ §@(ExpressionGroup= | ExpressionDef=)
ExpressionDef ⟶ GroupSymbol sp arrow ws Expr>
ExpressionGroup ⟶ keyword:"group" sp GroupSymbol ExpressionDefs>
ExpressionDefs ⟶ ws {§@ExpressionDefInGroup= ws}
ExpressionDefInGroup ⟶ [symbol] sp arrow ws Expr>
permeable ⟶ "!"
GroupSymbol ⟶ symbol [sp permeable]
GroupSymbol ⟶ [permeable]

Parsing using indentation-sensitive pyramid grammars is currently based entirely on an unoptimised generalisation of Earley parsing [6] adapted to Local Lexing. As such, it is currently very slow, to a point where it becomes unusable, especially for user interaction purposes. For example, the above grammar is parsed in about 15 seconds. Thus, to make pyramid grammars usable for Practal, we need to achieve a dramatic speedup by ﬁnding ways to apply more eﬃcient parsing algorithms like GLR [7] parsing, and furthermore by optimising the implementation.

Another issue is that we need to specify every single occurrence of whitespace (ws and sp in the above grammar). This seems excessive, and hopefully we can ﬁnd a simple way to avoid it.

Indentation-Sensitive Parsing with Pyramids

Introduction

Indentation-Sensitivity

Expression Groups

Parse Tree Generation

The Grammar

Future Work

References