© 2022 Steven Obua

Indentation-Sensitive Parsing with Pyramids

by Steven Obua
Cite as: https://doi.org/10.47757/pwp.1
March 24th, 2022
Abstract
This is a short note about combining indentation-sensitivity and pyramid grammars. We describe the syntax of indentation-sensitive pyramid grammars by one such grammar.

Introduction

A central idea of Practal is to empower users to express their mathematical and technical ideas freely and concisely. An important facet of this is that users must be able to define their own syntax in a simple way. This means that Practal must support user-defined collaborative syntax. The collaborative aspect is important because users must also be able to reuse each others theories. Fragments of syntax defined by different parties will then mesh with each other, and it is paramount to have a system in place which can detect and point out ambiguous and conflicting syntax.
The idea of collaborative syntax gave rise to the concept of Local Lexing [1]. As it turns out, a major obstacle to combining different syntax fragments is that parsing is usually split into two phases: lexing and actual parsing. While the actual parsing phase is based on context-free grammars and composable, the lexing phase is not composable at all. Local Lexing remedies this by not separating those two phases but interleaving them.
Local Lexing is a very general concept, in fact it can be understood as a general semantics of parsing. Pyramid grammars have been introduced (also in  [1]) to package Local Lexing into a familiar and easy-to-use form. A pyramid grammar looks like a context-free grammar, but both terminals and nonterminals can appear on the left-hand side of rules. They can thus be thought of as a natural extension of context-free grammars. In fact, pyramid grammars are powerful enough to also encompass parsing expression grammars [2].
In a further step, we now add the ability to specify indentation-sensitive syntax with pyramid grammars.

Indentation-Sensitivity

Adams [3] invented a method to amend context-free grammars with layout annotations to allow the specification of indentation-sensitive syntax. Later on [4], he applied the same method to parsing expression grammars. Given that pyramid grammars are generalisations of both context-free grammars and parsing expression grammars, his approach seemed to be ideally suited for us. This turns out to be true.
At this point, we will not dive into the technical details of how exactly the integration of the Adams method with pyramid grammars works. It is quite nice but straightforward and will be covered at a later time.
Instead, we will just quickly describe how the layout annotations of Adams are written for pyramid grammars:
Every symbol on the right hand side of a rule needs to be annotated with either >, , = or #. This becomes annoying quickly, and it is therefore convenient to specify a default relation which is used in case of a missing layout annotation. This default relation can be overriden for all rules belonging to a particular left-hand side symbol.

Expression Groups

We also added the concept of expression groups to pyramid grammars. Expression groups are just syntactic sugar, and provide a simple way to specify groups of operators of different precedence. For example, it is possible to specify this C Operator Precedence table pretty much one-to-one via expression groups. The precedence of an expression group is simply given by its position in the syntax specification. The higher up it appears, the higher its precedence is. The associativity of an operator can be influenced by prepending expressions with a tick:
Plus ⟶ `Expr + Expr
Here the left Expr has the same precedence as Plus, while the right Expr has higher precedence than Plus. This makes + a left-associative operator.
The idea for expression groups originates from the work on syntax for Practical Types [5].

Parse Tree Generation

We have argued in  [1] that parse tree generation should not be explicitly defined, but automatically inferred from the grammar. To aid this, both nonterminals and terminals are classified along two axes:
Nested symbols contain parse trees as children themselves, while atomic symbols have no children. By default, nonterminals are nested, and terminals are atomic.
Visible symbols have their own nodes in the parse tree, while for auxiliary symbols only their children appear in the parse tree. Thus a symbol which is both atomic and auxiliary is effectively hidden. By default, both nonterminals and terminals are visible.
A syntax symbol is a symbol that is visible for the purposes of creating a concrete syntax tree (for example in order to perform syntax highlighting), but auxiliary for the purposes of creating an abstract syntax tree.
Automatically generated Swift data structures for the abstract syntax tree of the grammar below can be inspected here.

The Grammar

We will now simply give you the pyramid grammar that describes its own syntax. Syntax-highlighting has been added to it, but you can also view it as pure text.
By default, symbols starting with an uppercase letter are nonterminals, and symbols starting with a lowercase letters are terminals.

// This is a Pyramid Grammar which describes its own syntax

default relation 

ExprGrammar  ws {Decl= ws}

terminal(auxiliary#ws sp
nested ws
syntax comment
ws  (\s / \n / comment)*
sp  \s*
comment  "//" ~\n*
comment  "/*" (!"*/" )* "*/"

auxiliary Decl
Decl  @§NonterminalDecl=
Decl  @§TerminalDecl=
Decl  @§NestedDecl=
Decl  @§AtomicDecl=
Decl  @§AuxDecl=
Decl  @§SyntaxDecl=
Decl  @§RuleDecl=
Decl  @§PriorityDecl=
Decl  @§DefaultDecl=
Decl  @§ExpressionDecl=
Decl  @§TreatAsDecl=

NonterminalDecl  keyword:"nonterminal" SymbolDecl>
TerminalDecl  keyword:"terminal" SymbolDecl>
NestedDecl  nested SymbolList>
AtomicDecl  atomic SymbolList>
AuxDecl  aux SymbolList>
SyntaxDecl  syntax SymbolList>

auxiliary SymbolDecl
SymbolDecl  Modifiers SymbolList

Modifiers  [sp "(" ws [Modifier {ws "," ws Modifier} ws] ")"] ws
Modifier  aux | atomic | nested | syntax | Relation
aux  "aux" | "auxiliary"
atomic  "atomic"
nested  "nested"
syntax  "syntax"

Relation  greaterOrEqual | greater | equal | unrelated
greaterOrEqual  ">=" | ""
greater  ">"
equal  "="
unrelated  "#"

SymbolList  ws [symbol {ws symbol}]
symbol  letter {letter | digit}

auxiliary letter digit hexdigit
letter  "a" ... "z" | "A" ... "Z" 
digit  "0" ... "9"
hexdigit  digit | "a" ... "f" | "A" ... "F"

nested accentedSymbol
accentedSymbol  "`" symbol

RuleDecl  symbol sp arrow> ws Expr>
syntax arrow
arrow  "->" | "-->" | "" | ""

TreatAsDecl  symbol sp backarrow> ws SymbolAlts>
syntax backarrow
backarrow  "<-" | "<--" | "" | ""
SymbolAlts  [symbol {ws op:"|" ws symbol}]

PriorityDecl  keyword:"priority" Priorities>
auxiliary Priorities
Priorities  ws [LongestMatch {ws op:"<" ws LongestMatch}]
LongestMatch  ws [symbol {ws symbol}]

DefaultDecl  keyword:"default" \s sp keyword:"relation" sp Relation>

syntax dots
dots  "..."
priority any < dots
syntax assignSyntax
assignSyntax  backarrow | ":"

expression Expr
    group AtomicExpr
         symbol
         accentedSymbol
         "(" ws Expr ws ")"
        empty  "ε" | """" | "''"
        any  "." | "" | "·"
        literal  """ {+ ~(""" | \n)} """ | "'" {+ ~("'" | \n)} "'"
        hexchar  "\x" {+hexdigit}
        decchar  "\d" {+digit}
        space  "\s"
        newline  "\n"
        Opt  op:"[" ws Expr ws op:"]"
        Rep  op:"{" ws Expr ws op:"}"
        Rep1  op:"{+" ws Expr ws op:"}"
    Range  Expr ws dots ws Expr
    Complement  op:"~" ws `Expr
    Diff  `Expr ws op:"-" ws Expr
    SyntaxTreatAs  symbol ws assignSyntax ws Expr
    Related  Expr ws Relation
    group
        Front  layout:"§" ws `Expr
        Abs  layout:"@" ws `Expr
    group
        OptGreedy  Expr ws peg:"?"
        RepGreedy  Expr ws peg:"*"
        Rep1Greedy  Expr ws peg:"+"
    group
        And  peg:"&" ws `Expr
        Not  peg:"!" ws `Expr
    Seq  [Expr {+ ws Expr}]
    OrGreedy  `Expr ws peg:"/" ws Expr
    Or  `Expr ws op:"|" ws Expr
  
auxiliary ExpressionPriorities ExpressionPriority
ExpressionDecl  keyword:"expression" sp symbol ExpressionPriorities>
ExpressionPriorities  ws {ExpressionPriority= ws}
ExpressionPriority  §@(ExpressionGroup= | ExpressionDef=)
ExpressionDef  GroupSymbol sp arrow ws Expr>
ExpressionGroup  keyword:"group" sp GroupSymbol ExpressionDefs>
ExpressionDefs  ws {§@ExpressionDefInGroup= ws}
ExpressionDefInGroup  [symbol] sp arrow ws Expr>
permeable  "!"
GroupSymbol  symbol [sp permeable]
GroupSymbol  [permeable]

Future Work

Parsing using indentation-sensitive pyramid grammars is currently based entirely on an unoptimised generalisation of Earley parsing [6] adapted to Local Lexing. As such, it is currently very slow, to a point where it becomes unusable, especially for user interaction purposes. For example, the above grammar is parsed in about 15 seconds. Thus, to make pyramid grammars usable for Practal, we need to achieve a dramatic speedup by finding ways to apply more efficient parsing algorithms like GLR [7] parsing, and furthermore by optimising the implementation.
Another issue is that we need to specify every single occurrence of whitespace (ws and sp in the above grammar). This seems excessive, and hopefully we can find a simple way to avoid it.

References

[1]Steven Obua. (2021). A New Semantics for Local Lexing, https://doi.org/10.47757/obua.ll.1.
[2]Bryan Ford. (2004). Parsing Expression Grammars: a recognition-based syntactic foundation, https://doi.org/10.1145/964001.964011.
[3]Michael D. Adams. (2013). Principled Parsing for Indentation-Sensitive Languages: Revisiting Landin's Offside Rule, https://doi.org/10.1145/2429069.2429129.
[4]Michael D. Adams, Ömer S. Ağacan. (2014). Indentation-sensitive parsing for Parsec, https://doi.org/10.1145/2775050.2633369.
[5]Steven Obua. (2021). Practical Types, https://doi.org/10.47757/practical.types.1.
[6]Jay Earley. (1970). An efficient context-free parsing algorithm, Communications of the ACM, https://doi.org/10.1145/362007.362035.
[7]Masaru Tomita (ed.). (1991). Generalized LR Parsing, Kluwer Academic Publishers.