A revised semantics for Local Lexing is presented that entirely avoids priority conflicts between non-empty tokens. Pyramid Grammars are introduced, and their semantics is given in terms of Local Lexing. It is demonstrated that they are more expressive than Context-Free Grammars, and at least as expressive as Parsing Expression Grammars.
Foreword
In 2017, in joint work with Phil Scott and Jacques Fleuriot, I introduced the concept of Local Lexing[1]. Recently, I have applied Local Lexing to design the grammar for Recursive teXt, a language inspired by both Markdown[2] and TeX[3], and also more generally by the paradigm of functional (and declarative) programming[4]. All articles on obua.com are written in Recursive teXt. Recursive teXt is still experimental and in an early stage.
During the design process I gained (and am still gaining) new insights into applying Local Lexing. Not only that, it also became apparent that the semantics of Local Lexing needs to be revised for practical reasons. This article does exactly that, and it also introduces new concepts. The article is self-contained and can be read and understood independently of [1].
Introduction
Traditionally, the parsing of (computer) languages happens in two phases: First comes a phase of lexical analysis, which is followed by a phase of syntax analysis[5]. The lexical analysis phase partitions the input consisting of a sequence of characters into a sequence of tokens, and is usually specified via regular expressions. The syntax analysis operates purely on the resulting sequence of tokens, more specifically, on the resulting sequence of token kinds called terminals; this results in a parse tree. The syntax analysis is usually specified by a Context-Free Grammar (CFG).
The three main reasons why parsing is divided into two phases are:
Efficiency
Regular expressions can be implemented by deterministic finite state machines, which is usually much more efficient than parsing with respect to a CFG.
Expressivity
Although regular expressions are strictly less expressive than CFGs, by splitting the analysis into two phases one can insert additional rules between these two phases which increase the overall expressivity beyond that of CFGs. These additional rules are usually considered to be part of the lexical analysis, and mostly concern the priority with which tokens are selected in case of conflicts:
The longest match rule prioritizes longer tokens over shorter tokens.
Furthermore, token specifications are totally ordered. In case of conflicting tokens of the same length, the token with the higher-ranked specification is chosen.
Convenience
Issues like the treatment of whitespace and comments can often be implemented by an additional rule to simply drop certain kinds of tokens after the lexical analysis instead of passing them on to the syntax analysis.
One disadvantage of this split into two distinct phases is that two different formalisms are employed for specifying the language: one formalism uses regular expressions, the other one CFGs.
But its most significant disadvantage is that it negatively affects the composability of language and parsing specifications. The syntax analysis phase is unproblematic as it is based on CFGs, which compose. But the lexical analysis phase requires that the input can be uniquely decomposed into a single sequence of tokens. Therefore parts from different language specifications cannot be easily combined, as they have possibly different and conflicting token specifications. Usually, these conflicts cannot be resolved automatically without corrupting the semantics of the original parts.
Local Lexing avoids this problem by dropping the requirement for lexical analysis to produce a unique partition of the input into tokens. Instead, all “sensible” partitions are allowed. Whether or not a partition is “sensible” is determined by a parsing process that intertwines lexical analysis and syntax analysis. This allows for token conflict resolution based on the local parsing context. What exactly “sensible” means will be made concrete in the following sections, and differs for this new semantics of Local Lexing from the one presented in [1].
On the basis of Local Lexing, Pyramid Grammars will then be introduced, a simple language and parser specification mechanism which composes and is more expressive than Context-Free Grammars and at least as expressive as Parsing Expression Grammars[6].
Local Lexing
A local lexing is a quadruple (T,Σ,Lex,Sel):
T is the set of terminals.
Σ is a set of characters called the alphabet.
Lex is a function with certain properties, called the lexer.
Sel is a function with certain properties, called the selector.
The only important thing about T and Σ is that they are sets. Otherwise, they can be chosen arbitrarily. But before we can understand the role that lexer and selector play, we need to understand what tokens are.
Tokens
For any set U we let U∗ denote the set of all finite sequences formed from elements of U. In particular, U∗ contains the empty sequence ε. For two sequences α,β∈U∗ we write αβ for their concatenation, which is of course also in U∗. Given a sequence α∈U∗ we write ∣α∣ for the length of α, and αi for the i-th element of α, where i∈{0,1,…,∣α∣−1}.
A token τ is a pair τ=(t,α) such that t∈T and α∈Σ∗. So a token is just a sequence of characters tagged with a terminal. We let 𝕋 denote the set of all tokens: 𝕋 = {(t,α) ∣ t∈T and α∈Σ∗}
We define τ# = t and τ̄ = α. These notions are extended in the obvious way to sequences of tokens: (τ1…τn)# = τ1# … τn# ∈ T∗ and (τ1…τn)‾ = τ̄1 … τ̄n ∈ Σ∗
Given a set T ⊆ 𝕋 we denote its subset Tε of empty tokens by Tε = {τ ∈ T ∣ τ̄ = ε}
The Lexer
The task of lexical analysis is to break down the input document D∈Σ∗ into tokens. A local lexing specifies how to do this locally at a given position k of D. The task of the lexer is to recognize all possible token candidates at position k in D; the lexer is therefore a function taking a document and a position as arguments, and returning a set of tokens: Lex : Σ∗ × ℕ → P(𝕋). We write P(U) to denote the powerset of U, i.e. the set of all subsets of U.
Lex must have the property that for all documents D∈Σ∗ and all positions k∈{0,…,∣D∣} within that document, it returns only tokens τ∈Lex(D,k) such that τ̄ actually occurs in D at position k, i.e.

k + ∣τ̄∣ ≤ ∣D∣ and ∀i∈{0,…,∣τ̄∣−1}. τ̄i = Dk+i
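As an illustration, here is a minimal Python sketch (all names hypothetical) of a lexer for a single keyword terminal; it returns a token only when the keyword actually occurs at position k, so the property above holds by construction.

```python
from typing import Callable, Set, Tuple

Token = Tuple[str, str]  # (terminal, characters)

def make_keyword_lexer(terminal: str, keyword: str) -> Callable[[str, int], Set[Token]]:
    """Build Lex for a single keyword terminal (hypothetical helper)."""
    def lex(document: str, k: int) -> Set[Token]:
        # Return the token only if the keyword literally occurs at position k;
        # this guarantees k + |chars| <= |D| and chars[i] == D[k+i].
        if document.startswith(keyword, k):
            return {(terminal, keyword)}
        return set()
    return lex

lex_let = make_keyword_lexer("LET", "let")
assert lex_let("let x = 1", 0) == {("LET", "let")}
assert lex_let("let x = 1", 1) == set()
```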
The Selector
The lexer might return more than one token candidate at a given position. Sometimes we might be fine with the resulting ambiguity, but often we want to be able to impose constraints on which token candidates are actually chosen as tokens. One example is that if one token is a prefix of another, longer token, we might want to choose the longer token only and ignore the shorter one.
The selector Sel allows us to impose such constraints. The basic idea is simple: given a set of token candidates T ⊆ 𝕋, the selector chooses a subset Sel(T) ⊆ T such that only tokens in that subset are considered.
The potential presence of empty tokens complicates this simple idea. As we will see soon, if the lexer allows empty tokens at a position, then token discovery at that position turns into a greedy iterative process. Tokens discovered and chosen earlier in that process cannot be ignored later in the same process. We accommodate this by passing Sel not only an upper bound, from which we are to choose the tokens, but also a lower bound, which contains the tokens that have been chosen so far. Thus the selector has the signature Sel : P(𝕋ε) × P(𝕋) → P(𝕋), and it must have the property that for all token sets E ⊆ 𝕋ε and T ⊆ 𝕋 with E ⊆ T, the following bounds hold for the selected tokens: E ⊆ Sel(E,T) ⊆ T. Note that the lower bound E is guaranteed to consist only of empty tokens. The idea here is that we delay deciding which non-empty tokens to choose until we have discovered all empty tokens at a position. This is the difference from the old version of Local Lexing presented in [1], where there was no such guarantee for E.
If no empty tokens are in play in a given position, then the lower bound will necessarily be the empty set, turning Sel back into a function with a single upper-bound argument Sel(T)=Sel(∅,T).
Consistency
In this new version of Local Lexing we require the selector to have one more property, which we call consistency. It allows us to give a uniform treatment of both finite and infinite sets of characters and terminals. We call a family of sets (Ui)i∈ℕ a chain iff Ui ⊆ Ui+1 for all i∈ℕ. We call a selector consistent for document D at position k iff for any two chains (Ei)i∈ℕ and (Ti)i∈ℕ such that

E0 ⊆ E1 ⊆ E2 ⊆ … ⊆ Lex(D,k)ε
T0 ⊆ T1 ⊆ T2 ⊆ … ⊆ Lex(D,k)
Ei ⊆ Ti for all i∈ℕ

the following holds:

Sel(⋃i∈ℕ Ei, ⋃i∈ℕ Ti) ⊆ ⋃i∈ℕ Sel(Ei, Ti)

We call a selector consistent everywhere (or just consistent) if it is consistent for all documents D∈Σ∗ at all positions k∈{0,…,∣D∣}.
We require Sel to be consistent everywhere.
Note that if both Σ and T are finite sets, the selector is clearly consistent everywhere: in this case Lex(D,k) is also a finite set, which means that both chains (Ei)i∈ℕ and (Ti)i∈ℕ converge. Thus there is J∈ℕ such that Ei = EJ and Ti = TJ for all i ≥ J, which implies

Sel(⋃i∈ℕ Ei, ⋃i∈ℕ Ti) = Sel(EJ, TJ) ⊆ ⋃i∈ℕ Sel(Ei, Ti)
Priority Relations
A convenient way to define a selector is by providing a priority relation ⊲ on tokens: ⊲ ⊆ 𝕋 × 𝕋. For two tokens u,v∈𝕋 we write u ⊲ v to mean that u has lower priority than v, i.e. that (u,v) ∈ ⊲ holds.
For example, let SpecPos : T → ℕ be an injective function assigning to each terminal the position (e.g. the line in a grammar specification file) where it was specified. The priority relation used for traditional lexical analysis is then defined by

u ⊲ v iff ∣ū∣ < ∣v̄∣ ∨ (∣ū∣ = ∣v̄∣ ∧ SpecPos(u#) > SpecPos(v#))

In this case ⊲ is even a total order when restricted to the set of tokens appearing at the same position in the same document, but in general we do not require any properties of ⊲ beyond being a relation on tokens.
If selectors depended only on their upper bound argument T, there would be a canonical way in which ⊲ induces a selector, namely by just choosing all maximal elements of T: Sel(T) = {u∈T ∣ ∄v∈T. u⊲v}
But selectors must also respect their lower bound argument E, and therefore the above definition needs to be adjusted. There are at least two ways to do this:
FB (“First in Blocks”). This defines the selector as

Sel(E,T) = E ∪ {u∈T ∣ (∄v∈T. u⊲v) ∧ (∄v∈E. v⊲u)}

It adds those maximal elements of T that are not in conflict with any elements of E. This means that tokens in E block conflicting tokens in T from being chosen, even if those tokens in T have higher priority than the ones already in E.
PP (“Priority Prevails”). This defines the selector as

Sel(E,T) = E ∪ {u∈T ∣ ∄v∈T. u⊲v}

It adds all maximal elements of T regardless of the content of E. As a consequence, lower priority empty tokens might be chosen although higher priority tokens have been chosen as well. On the other hand, a lower priority token can never block a higher priority one.
Note that FB and PP coincide for E=∅. In practice PP seems to be a better choice than FB, as it is more in line with our intuitive concept of priority.
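The following Python sketch (hypothetical names, finite token sets assumed) implements a priority relation as a predicate and derives both the FB and the PP selector from it:

```python
from typing import Callable, Set, Tuple

Token = Tuple[str, str]                        # (terminal, characters)
Priority = Callable[[Token, Token], bool]      # lower(u, v) means u ⊲ v

def longest_match(spec_pos) -> Priority:
    """The traditional priority: longer tokens win; among tokens of equal
    length, the terminal specified earlier (smaller spec_pos) wins."""
    def lower(u: Token, v: Token) -> bool:
        return len(u[1]) < len(v[1]) or (
            len(u[1]) == len(v[1]) and spec_pos[u[0]] > spec_pos[v[0]])
    return lower

def maximal(T: Set[Token], lower: Priority) -> Set[Token]:
    return {u for u in T if not any(lower(u, v) for v in T)}

def sel_fb(E: Set[Token], T: Set[Token], lower: Priority) -> Set[Token]:
    # "First in Blocks": tokens in E block conflicting maximal tokens of T.
    return E | {u for u in maximal(T, lower)
                if not any(lower(v, u) for v in E)}

def sel_pp(E: Set[Token], T: Set[Token], lower: Priority) -> Set[Token]:
    # "Priority Prevails": all maximal tokens of T are chosen, E is kept.
    return E | maximal(T, lower)
```

Both functions satisfy the required bounds E ⊆ Sel(E,T) ⊆ E∪T by construction, and for E = ∅ they agree.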
It is easy to see that both FB and PP produce selectors that are consistent everywhere. For example, to show consistency of a PP selector, assume t ∈ Sel(E∞, T∞) where E∞ = ⋃i∈ℕ Ei and T∞ = ⋃i∈ℕ Ti, for an otherwise arbitrary token t∈𝕋. Based on the definition of PP this implies

t ∈ E∞ ∨ (t ∈ T∞ ∧ ∄v∈T∞. t⊲v)

In case t ∈ E∞, there exists J∈ℕ with t∈EJ, and therefore also t ∈ Sel(EJ, TJ). In case t ∉ E∞, there exists J∈ℕ with t∈TJ and ∄v∈T∞. t⊲v. This implies t∈TJ ∧ ∄v∈TJ. t⊲v, and we again arrive at t ∈ Sel(EJ, TJ).
Semantics of Local Lexing
So far we have defined what a local lexing is, but we have not yet talked about how to use it. In fact, on its own a local lexing is pretty useless. For it to obtain a meaning, it needs to be paired with a Context-Free Grammar, resulting in a Local Lexing Grammar.
Context-Free Grammars
Let us first remind ourselves what a Context-Free Grammar is, and introduce the relevant notation. A Context-Free Grammar G is a quadruple (N,T,R,S) where N is its set of nonterminals, T its set of terminals, R ⊆ N×(N∪T)∗ its set of rules, and S∈N its start symbol. Nonterminals and terminals are required to be disjoint sets, i.e. N∩T=∅.
Instead of (N,γ)∈R we also write N→γ. We write α⇒β for α,β∈(N∪T)∗ iff there are α1, N, α2 and γ such that α=α1Nα2, β=α1γα2, and N→γ. Furthermore we let ⇒∗ denote the reflexive and transitive closure of ⇒.
The language L of G is defined as the set of all terminal sequences derivable from the start symbol, i.e. L = {ω∈T∗ ∣ S ⇒∗ ω}. The prefix language Lprefix of G is defined as Lprefix = {ω∈T∗ ∣ ∃α∈(N∪T)∗. S ⇒∗ ωα}
Local Lexing Grammars
A Local Lexing Grammar 𝒢 is a tuple (N,T,R,S,Σ,Lex,Sel) such that G = (N,T,R,S) is a Context-Free Grammar and ℒ = (T,Σ,Lex,Sel) is a local lexing.
To denote the same Local Lexing Grammar as 𝒢, but with a different start symbol X∈N, we write 𝒢/X.
We will define the semantics ℓℓ of 𝒢 to be a set of token sequences, i.e. ℓℓ ⊆ 𝕋∗. We construct ℓℓ by separately constructing P(D) for each document D∈Σ∗, where P(D) = {p ∈ ℓℓ ∣ p̄ = D}. Once we have constructed P(D), we can define ℓℓ via

ℓℓ(𝒢) = ℓℓ = ⋃D∈Σ∗ P(D)
Paths
To construct P(D), we will assume a fixed document D∈Σ∗ from now on to simplify our notation. For example, we will just write P instead of P(D).
P is defined by recursively constructing

{ε} = P−1 ⊆ P0 ⊆ P1 ⊆ … ⊆ P∣D∣

and then setting P = {p ∈ P∣D∣ ∣ ∣p̄∣ = ∣D∣ ∧ p# ∈ L}
The elements of P are the “sensible” partitions of the input D into tokens that we alluded to earlier. Furthermore, the elements p∈Pk are “sensible” partitions of prefixes of the input D into tokens. We can say more about such p a priori (i.e. before actually having constructed Pk) than them just being sequences of tokens: we demand that p must be a path. We define the set 𝒫 of paths to consist of those p∈𝕋∗ such that:
p#∈Lprefix
Given the decomposition of p into tokens p = t1…tn, each of these tokens ti can be lexed at the appropriate position: ti ∈ Lex(D,ki) where ki = ∣(t1…ti−1)‾∣. This implies that p̄ is a prefix of D, of course.
Furthermore we define the set of paths that stop at position k in D as 𝒫k = {p ∈ 𝒫 ∣ ∣p̄∣ = k}
Paths are an auxiliary tool to aid us in defining P. But they are an interesting notion in themselves already, as they showcase how lexical analysis (Lex) and syntax analysis (Lprefix) are intertwined in Local Lexing. They nicely package up the influence of G and Lex on the Local Lexing semantics: we will not need to refer to G or Lex anymore to finish the definition of the semantics. This also means that the semantics does not depend on the particular shape of G, but just on G’s language L and its prefix language Lprefix. If G has no unproductive nonterminals, i.e. if for all N∈N there is ω∈T∗ such that N ⇒∗ ω, then Lprefix is uniquely determined by L, and therefore the semantics depends on G’s language L alone (and on the local lexing ℒ, of course).
Indeed, if Local Lexing did not provide a mechanism for constraining token selection, or equivalently, if the selector were trivial, i.e. Sel(E,A) = A for all arguments E and A, then we would already be done, as we could then just set P = {p ∈ 𝒫∣D∣ ∣ p# ∈ L}
This also shines a spotlight on the importance of the selector mechanism in Local Lexing. It is its defining feature.
Constructing Pk
The basic idea for constructing the Pk for k∈{0,…,∣D∣} is to construct, in tandem, a family of token sets Zk ⊆ Lex(D,k). The dependencies here are as follows:

Pk−1 determines Zk
Pk−1 and Zk determine Pk

If you think of the (Pk)0≤k≤∣D∣ as the result of syntax analysis, and of the (Zk)0≤k≤∣D∣ as the result of lexical analysis, then these dependencies show how Local Lexing intertwines syntax analysis and lexical analysis.
The straightforward part is how Pk is determined by Zk and Pk−1:

Pk = Appendk Zk Pk−1

where

Appendk T P = limit (Append¹k T) P
Append¹k T P = P ∪ {pt ∈ 𝒫 ∣ p ∈ P∩𝒫k, t∈T}
limit f X = ⋃n∈ℕ fⁿ(X)

Here we just append tokens in Zk to those paths in Pk−1 that stop at k, as long as the result is still a path. Because appending empty tokens in Zk potentially leads to new paths stopping at k which also need to be taken into consideration, in general Appendk needs to be specified as a limit process.
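A minimal Python sketch of this limit construction, under the assumption of finite sets (so that fixpoint iteration replaces the infinite union); is_path is a hypothetical predicate deciding membership in 𝒫:

```python
from typing import Callable, Set, Tuple

Token = Tuple[str, str]
Path = Tuple[Token, ...]

def chars(p: Path) -> str:
    """The character string of a path (the overline operation)."""
    return "".join(alpha for _, alpha in p)

def append_k(k: int, tokens: Set[Token], paths: Set[Path],
             is_path: Callable[[Path], bool]) -> Set[Path]:
    """Append_k: repeatedly extend the paths stopping at k by tokens,
    keeping only results that are still paths, until a fixpoint is
    reached (assumes the fixpoint is finite, e.g. no way to pile up
    unboundedly many empty tokens)."""
    result = set(paths)
    while True:
        new = {p + (t,) for p in result if len(chars(p)) == k
               for t in tokens} - result
        new = {q for q in new if is_path(q)}
        if not new:
            return result  # the "limit" of the iteration
        result |= new
```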
Constructing Zk
If Zk were guaranteed not to contain empty tokens, the other part would be straightforward as well:

Zk = Sel(Nextk(Pk−1))

where

Nextk(P) = {t ∈ 𝕋 ∣ ∃p ∈ P∩𝒫k. pt ∈ 𝒫} for any P ⊆ 𝕋∗

Here Nextk(P) consists of all tokens that could “sensibly” extend some path in P that stops at position k in D.
The trouble is that if Zk contains any empty token, say e, then we should really have taken e into account before computing Zk as it might create more sensible paths that stop at k. Therefore, in the presence of empty tokens, we need a more elaborate approach for constructing Zk as follows.
We construct chains (Qki)i∈ℕ, (Eki)i∈ℕ, (Tki)i∈ℕ via

Qk0 = Pk−1
Ek0 = ∅
Tki = Nextk(Qki)
Eki+1 = Sel(Eki, Tki)ε
Qki+1 = Appendk Eki+1 Qki

and can then define Zk = Sel(⋃i∈ℕ Eki, ⋃i∈ℕ Tki)
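Continuing the sketch above (same hypothetical helpers; finite sets assumed, so the chains stabilize after finitely many steps), the construction of Zk looks roughly like this:

```python
def construct_z_k(p_prev, next_k, sel, append_k_fn):
    """Compute Z_k by iterating the E/T/Q chains until they stabilize.
    Arguments (all hypothetical):
      next_k(Q)         -- Next_k(Q), candidate tokens at position k
      sel(E, T)         -- the selector, with E <= sel(E, T) <= T
      append_k_fn(E, Q) -- Append_k E Q
    """
    Q, E = set(p_prev), set()            # Q_k^0 = P_{k-1}, E_k^0 = {}
    E_union, T_union = set(), set()
    while True:
        T = next_k(Q)                    # T_k^i = Next_k(Q_k^i)
        T_union |= T
        E_next = {t for t in sel(E, T) if t[1] == ""}  # Sel(E,T)_eps
        E_union |= E_next
        Q_next = append_k_fn(E_next, Q)  # Q_k^{i+1}
        if E_next == E and Q_next == Q:  # chains have stabilized
            return sel(E_union, T_union) # Z_k
        E, Q = E_next, Q_next
```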
Was it all worth it?
But have we actually gained something by this more elaborate approach? After all, if there were an empty token e ∈ (Zk)ε such that e is not in any of the Eki, then we would again have the same problem of possibly having missed some sensible paths.
This is where the consistency of the selector comes to the rescue. Because of it we have

(Zk)ε = Sel(⋃i∈ℕ Eki, ⋃i∈ℕ Tki)ε
      ⊆ (⋃i∈ℕ Sel(Eki, Tki))ε
      = ⋃i∈ℕ Sel(Eki, Tki)ε
      = ⋃i∈ℕ Eki+1
      = ⋃i∈ℕ Eki
      ⊆ (Zk)ε

and thus (Zk)ε = ⋃i∈ℕ Eki.
This shows that when constructing Zk, before making the final selection of which non-empty tokens to select, it has already been decided which empty tokens to select. The final selection will not add any new empty tokens, and therefore we can safely proceed to the next position k+1 without missing out on sensible paths that have (intermediate) stops at k. Furthermore, all non-empty tokens at k are selected at once. This means that there will not be any priority conflicts between non-empty tokens that need to be resolved the FB or the PP way. Such conflicts can only occur between empty tokens and other empty tokens, or between lower priority empty tokens and higher priority non-empty tokens. This is the major improvement over the original Local Lexing semantics, where the presence of empty tokens could introduce such conflicts between non-empty tokens.
Pyramid Grammars
A Local Lexing Grammar is a tuple consisting of seven components and somewhat unwieldy for directly specifying parsers. Furthermore, although I have claimed that Local Lexing is a panacea for composable specifications, I have not provided direct evidence for this claim so far.
Luckily, there is a powerful specification mechanism that is more compact than Local Lexing Grammars but which can still be understood in terms of Local Lexing Grammars.
Definition of Pyramid Grammars
A Pyramid Grammar is a tuple (N,T,R,S,<,⋖) such that N is its set of nonterminals, T its set of terminals with N∩T=∅, R ⊆ (N∪T)×(N∪T)∗ its set of rules, S∈N its start symbol, and <,⋖ ⊆ T×T its priorities.
A Pyramid Grammar looks just like a Context-Free Grammar, with two extensions:
The left hand side of a rule can be a terminal, too.
There are relations < and ⋖ on terminals jointly specifying priorities.
Semantics of Pyramid Grammars
We give a Pyramid Grammar Y a meaning by turning it into a Local Lexing Grammar 𝒢, and then defining its alphabet Σ and language L by

Σ(Y) = ΣG
L(Y) = {ᾱ ∣ α ∈ ℓℓ(𝒢)} ⊆ ΣG∗
The terminals and the start symbol of 𝒢 are the same as those of Y: TG = T and SG = S.
The characters ΣG ⊆ TG of 𝒢 are those terminals which do not appear on the left-hand side of any rule in R: ΣG = {c∈T ∣ ∄α. (c,α)∈R}
For each T∈T∖ΣG we add a fresh nonterminal ST: NG=N⊎{ST∣T∈T∖ΣG}
The rules RG of 𝒢 are those rules in R with a nonterminal on the left-hand side, plus rules for the new nonterminals: RG = {(N,α)∈R ∣ N∈N} ⊎ {(ST,α) ∣ T∈T∖ΣG, (T,α)∈R}
The selector SelG of 𝒢 is defined the PP way by turning < and ⋖ into a priority relation ⊲ on tokens via

u ⊲ v iff
  u# < v#
  ∨ (u# = v# ∧ ∣ū∣ < ∣v̄∣)
  ∨ (u# ⋖ v# ∧ ∣ū∣ ≤ ∣v̄∣)
  ∨ (v# ⋖ u# ∧ ∣ū∣ < ∣v̄∣)

This definition reflects that < represents unconditional priority and that ⋖ represents longest-match priority. Furthermore, for two tokens with the same associated terminal, the longer token has higher priority. (A small Python sketch of this priority relation follows below, after the lexer definition.)
The tricky part is the definition of the lexer component. First we define a separate lexer LexT for each terminal T∈TG, and then define for any D∈ΣG∗ and position k in D:

Lex(D,k) = ⋃T∈TG {(T,α) ∣ α ∈ LexT(D,k)}
For c∈ΣG the lexer is easily defined: Lexc(D,k) = {c} if 0 ≤ k < ∣D∣ ∧ Dk = c, and Lexc(D,k) = ∅ otherwise.
We use 𝒢/ST for parsing T∈T∖ΣG:

LexT(D,k) = {ᾱ ∣ α ∈ ℓℓ(𝒢/ST) ∧ k ≥ 0 ∧ k+∣ᾱ∣ ≤ ∣D∣ ∧ ∀ 0≤i<∣ᾱ∣. ᾱi = Dk+i}

This means that we parse T with the same Local Lexing Grammar 𝒢, but use ST instead of S as the start symbol!
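As an illustration, here is a small Python sketch (hypothetical representation: < and ⋖ given as sets of terminal pairs) of the token priority relation defined above:

```python
def make_priority(unconditional, longest_match):
    """Build u ⊲ v on tokens from < (unconditional) and ⋖ (longest match),
    both given as sets of (lower_terminal, higher_terminal) pairs."""
    def lower(u, v):
        (ut, uc), (vt, vc) = u, v
        return ((ut, vt) in unconditional
                or (ut == vt and len(uc) < len(vc))
                or ((ut, vt) in longest_match and len(uc) <= len(vc))
                or ((vt, ut) in longest_match and len(uc) < len(vc)))
    return lower

# Example: ID ⋖ KEYWORD (longest match between identifiers and keywords)
lower = make_priority(unconditional=set(), longest_match={("ID", "KEYWORD")})
assert lower(("ID", "if"), ("KEYWORD", "if"))    # same length: KEYWORD wins
assert lower(("KEYWORD", "if"), ("ID", "iffy"))  # longer ID wins
```

And a sketch of how the lexer component could be assembled from per-terminal lexers; the parse-based lexers for terminals outside ΣG are stubbed out behind a hypothetical parse_all function:

```python
def make_char_lexer(c):
    """Lex_c for a character terminal c in Sigma_G."""
    def lex(document, k):
        return {c} if 0 <= k < len(document) and document[k] == c else set()
    return lex

def make_parsing_lexer(parse_all, start):
    """Lex_T for T not in Sigma_G, parsing with start symbol S_T.
    parse_all(start, document, k) is assumed to enumerate the character
    strings of all token sequences in ll(G/S_T) matching D at position k."""
    def lex(document, k):
        return {alpha for alpha in parse_all(start, document, k)
                if document.startswith(alpha, k)}
    return lex

def make_lex(lexers):
    """Assemble Lex(D, k) from the per-terminal lexers {T: Lex_T}."""
    def lex(document, k):
        return {(T, alpha)
                for T, lex_T in lexers.items()
                for alpha in lex_T(document, k)}
    return lex
```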
Wellfounded Pyramid Grammars
The above semantics for Pyramid Grammars is simple and intuitive, but unfortunately, as a definition, technically wrong. The problem is that we are defining 𝒢 in terms of 𝒢/ST, leading to a cyclic definition.
All components of 𝒢 are defined nonrecursively, except for the lexers LexT for T∈X, where X = T∖ΣG. So the problematic dependencies are those between the “calls” to LexT(D,k) for T∈X, a fixed D, and 0≤k≤∣D∣. Therefore, if we can find a wellfounded relation on {(T,k) ∣ T∈X, 0≤k≤∣D∣} that decreases along those calls, then 𝒢 is well-defined.
If there are no left-recursive rules for terminals, either directly or indirectly, and assuming that T∖ΣG is finite, then such a wellfounded relation exists.
To see why, let us first clarify what is meant by “no left-recursive rules for terminals”.
Consider the Context-Free Grammar E = (N∪(T∖ΣG), ΣG, R, S), i.e. the Pyramid Grammar read as a CFG in which the non-character terminals act as nonterminals. We say that α∈(N∪T)∗ is nullable iff α ⇒∗ ε in E.
We define a directed dependency graph DP such that the nodes of DP are the elements of T∖ΣG. There is a directed edge from node U to node V iff U ⇒∗ αVβ in E for some nullable α and some β, and in this case we write U↠V. “No left-recursive rules for terminals” then just means that DP is acyclic. The relation > is defined as

(U,k) > (V,l) iff k < l ∨ (k = l ∧ U↠V)

This relation clearly holds along calls of LexV(D,l) from LexU(D,k), because of the way paths are defined: parsing T from position k only ever inspects positions l ≥ k, and l = k is possible only after a nullable prefix, i.e. when U↠V. Furthermore > is wellfounded, because DP is finite and acyclic, and thus ↠ is wellfounded.
This leads us to call a Pyramid Grammar wellfounded iff ↠ is a wellfounded relation. The semantics of a wellfounded Pyramid Grammar is then well-defined via the Local Lexing Grammar 𝒢.
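For finite grammars, wellfoundedness amounts to checking that the dependency graph DP is acyclic. A sketch (hypothetical representation of DP as adjacency sets):

```python
def is_wellfounded(dp):
    """DP as dict: node -> set of ↠-successors. Returns True iff acyclic,
    via depth-first search with three colors."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {u: WHITE for u in dp}
    def visit(u):
        color[u] = GRAY
        for v in dp.get(u, set()):
            c = color.get(v, WHITE)
            if c == GRAY or (c == WHITE and not visit(v)):
                return False       # back edge: cycle found
        color[u] = BLACK
        return True
    return all(color.get(u, WHITE) == BLACK or visit(u) for u in dp)

# The union example below: A ↠ B and B ↠ A form a cycle.
assert not is_wellfounded({"A": {"B"}, "B": {"A"}})
assert is_wellfounded({"T": {"A"}, "A": {"B"}, "B": set()})
```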
Composability
Composability is a big strength of Pyramid Grammars. As examples, we will consider three operations supporting composition of Pyramid Grammars in the following: renaming, union and import.
Renaming
Given a bijection R:N∪(T∖Σ)→B such that B∩Σ=∅, the renaming of the Pyramid Grammar Y according to R is the Pyramid Grammar Y′ defined in the obvious way. Y′ is wellfounded iff Y is wellfounded, and L(Y′)=L(Y).
Union
Given two Pyramid Grammars

Y1 = (N1, T1, R1, S1, <1, ⋖1)
Y2 = (N2, T2, R2, S2, <2, ⋖2)

such that N1∩T2 = N2∩T1 = ∅, we can form their union Y = (N,T,R,S,<,⋖), where S is a fresh nonterminal and

N = N1 ∪ N2 ∪ {S}
T = T1 ∪ T2
R = R1 ∪ R2 ∪ {S→S1, S→S2}
< = <1 ∪ <2
⋖ = ⋖1 ∪ ⋖2
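A Python sketch of this operation, over a hypothetical dataclass representation of Pyramid Grammars:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pyramid:
    """Hypothetical representation of a Pyramid Grammar (N, T, R, S, <, ⋖)."""
    nonterminals: frozenset
    terminals: frozenset
    rules: frozenset          # pairs (lhs, rhs-tuple)
    start: str
    unconditional: frozenset  # the < relation, as terminal pairs
    longest: frozenset        # the ⋖ relation, as terminal pairs

def union(y1: Pyramid, y2: Pyramid, fresh_start: str) -> Pyramid:
    """Union of two Pyramid Grammars with a fresh start symbol."""
    assert not (y1.nonterminals & y2.terminals)
    assert not (y2.nonterminals & y1.terminals)
    assert fresh_start not in (y1.nonterminals | y2.nonterminals
                               | y1.terminals | y2.terminals)
    return Pyramid(
        nonterminals=y1.nonterminals | y2.nonterminals | {fresh_start},
        terminals=y1.terminals | y2.terminals,
        rules=y1.rules | y2.rules
              | {(fresh_start, (y1.start,)), (fresh_start, (y2.start,))},
        start=fresh_start,
        unconditional=y1.unconditional | y2.unconditional,
        longest=y1.longest | y2.longest)
```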
It is not necessarily true that Y is wellfounded if both Y1 and Y2 are wellfounded. Consider

Z1 = ({T}, {A,B}, {T→A, A→B}, T, ∅, ∅)
Z2 = ({T}, {A,B}, {T→B, B→A}, T, ∅, ∅)

and their union Z12 = ({S,T}, {A,B}, R, S, ∅, ∅) where

R = {S→T, T→A, T→B, A→B, B→A}

Z1 and Z2 are wellfounded with Σ(Z1) = L(Z1) = {B} and Σ(Z2) = L(Z2) = {A}. But Z12 is not wellfounded, as there is a cycle in the dependency graph of Z12 between A and B. Note that Σ(Z12) = ∅.
It is also not necessarily true that, even if Y is wellfounded, L(Y) = L(Y1) ∪ L(Y2). Consider the union Z13 of Z1 and Z3, where

Z3 = ({T}, {B,C}, {T→B, B→C}, T, ∅, ∅)

We have Σ(Z3) = L(Z3) = {C} and Σ(Z13) = L(Z13) = {C}, and therefore clearly L(Z13) ≠ L(Z1) ∪ L(Z3) = {B,C}.
Given wellfounded Pyramid Grammars Y1 and Y2 such that (N1∪(T1∖Σ1)) ∩ (N2∪(T2∖Σ2)) = ∅ holds, their union Y is wellfounded too, and L(Y) = L(Y1) ∪ L(Y2) does hold.
Forming the language union of two Pyramid Grammars is thus always possible: if the above condition does not hold, a renaming can be applied to Y1 or Y2 to ensure that it does.
Import
A Pyramid Grammar Fragment F = (NF, TF, RF, <F, ⋖F) is a Pyramid Grammar without a designated start symbol. The notions of alphabet and wellfoundedness are independent of the start symbol and therefore carry over to fragments.
Adding a fragment F to a Pyramid Grammar Y=(N,T,R,S,<,⋖) results in a Pyramid Grammar Y′=(N∪NF,T∪TF,R∪RF,S,<∪<F,⋖∪⋖F). Obviously, Σ(Y′)⊆Σ(Y)∪Σ(F)
We say that Y′ is the import of F to Y if the following conditions hold:
F is a wellfounded fragment
NF∩(N∪T)=∅
TF∩(N∪T)⊆Σ(Y)
In this case, when Y is wellfounded, so is Y′.
The effect of an import is to swap out some, maybe all, characters of Y for their definitions in F in terms of Σ(F). F could therefore define a library of terminals, for example numbers, identifiers, etc., and a Pyramid Grammar Y could import them by simply adding the fragment.
In the extreme case of Σ(Y)⊆TF we have Σ(Y′)=Σ(F), i.e. the entire alphabet Σ(Y) is swapped out for the alphabet Σ(F) of the fragment. This recreates within the setting of Pyramid Grammars the traditional separation of lexical analysis and syntax analysis, where Y can be considered as being responsible for syntax analysis, F for lexical analysis, and Y′ represents the parser resulting from combining the two.
Parse Trees
When parsing, one is usually not only interested in whether a given document is in the language specified by the Pyramid Grammar, but also in some sort of result derived from the discovered document structure. Traditionally, parser generators allow associating custom actions with grammar rules. These custom actions then compute a result.
It is ill-advised to pursue this approach in a parser generator for Pyramid Grammars. Two consumers of the same grammar often do not have the same use case and need different kinds of results. And given how easy it is to compose Pyramid Grammars, the results of parsing with Pyramid Grammars should compose just as nicely.
A simple solution is to just use the parse tree as the result: a concrete syntax tree. Often, though, such a concrete syntax tree follows the structure of the parse too closely, and therefore contains too much noise and unwanted structure.
To fix this, one can classify symbols in N∪T along two axes: visibility and structure. Both axes consist of two values. The value of visibility can be either visible or auxiliary. The value of structure can be either nested or atomic. For a symbol to be classified as auxiliary means that it has no corresponding node in the parse tree of its own, but is just represented by the collection of its children. For a symbol to be classified as atomic means that its children will not become part of the parse tree. A symbol that is classified as both atomic and auxiliary is effectively hidden and will have no representation in the result.
By default, terminals are atomic and visible, while nonterminals are nested and visible. These defaults can be overridden at specification time, and/or by the user of a Pyramid Grammar.
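A small Python sketch (hypothetical tree representation) of how these two classifications prune a concrete syntax tree:

```python
from dataclasses import dataclass
from typing import List, Set

@dataclass
class Node:
    symbol: str
    children: List["Node"]

def prune(node: Node, auxiliary: Set[str], atomic: Set[str]) -> List[Node]:
    """Prune a concrete syntax tree according to visibility/structure.
    Returns a list of nodes, because an auxiliary node dissolves into
    the list of its children."""
    if node.symbol in atomic:
        children = []                      # atomic: drop the subtree below
    else:
        children = [c for child in node.children
                    for c in prune(child, auxiliary, atomic)]
    if node.symbol in auxiliary:
        return children                    # auxiliary: no node of its own
    return [Node(node.symbol, children)]

# A whitespace terminal classified as both auxiliary and atomic disappears:
tree = Node("expr", [Node("ws", []), Node("num", [Node("4", []), Node("2", [])])])
assert prune(tree, auxiliary={"ws"}, atomic={"ws", "num"}) == \
       [Node("expr", [Node("num", [])])]
```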
With the help of visibility and structure the parser can generate a tree more akin to an abstract syntax tree than to a concrete syntax tree. By analysing the grammar a parser generator can even automatically generate data types for the parse result useful for static type checking.
Note that parse trees can be ambiguous. While an ambiguous grammar is usually something to be avoided, during the development of the grammar easy inspection of ambiguities is important, and parse trees, properly implemented, can support that.
Grammar Expressions
It is possible to implement a prioritized choice operator / as it is used in Parsing Expression Grammars[6]. Given two sequences α,β∈(N∪T)∗, we can add a fresh nonterminal N and two fresh terminals A and B, and add the rules

N → A
N → B
A → α
B → β

We also add the priority B < A. The effect of this is that N now stands for the prioritized choice between α and β: N = α/β. To aid in parse tree creation, we declare N, A and B as auxiliary and nested.
Of course, we can also add a standard choice operator ∣ by creating a fresh nonterminal N (auxiliary and nested) and adding the rules

N → α
N → β

The effect is that N = α∣β
An interesting operator is the isolation operator ⟨α⟩. We implement it by adding a fresh terminal T (auxiliary and nested), and adding the rule T → α, and thus T = ⟨α⟩. Isolation has the effect of spawning off a separate parsing process for its argument, which cannot be influenced by token selection issues outside of this process.
Note that we cannot have prioritized choice without isolating the arguments first, i.e. α/β has the same meaning as ⟨α⟩/⟨β⟩. On the other hand, ⟨α⟩∣⟨β⟩ has in general a different semantics than just α∣β.
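A Python sketch of such a desugaring step, over a hypothetical representation where rules are a list of (lhs, rhs-tuple) pairs and priorities a set of < pairs:

```python
import itertools

_fresh = itertools.count()

def fresh(prefix):
    """Generate a fresh symbol name (hypothetical helper)."""
    return f"{prefix}{next(_fresh)}"

def desugar_choice(alpha, beta, rules, priorities, prioritized=True):
    """Desugar alpha/beta (prioritized) or alpha|beta into plain rules.
    Returns the fresh symbol standing for the choice."""
    n = fresh("N")                       # fresh nonterminal, auxiliary+nested
    if prioritized:
        a, b = fresh("A"), fresh("B")    # fresh terminals: isolate arguments
        rules += [(n, (a,)), (n, (b,)), (a, tuple(alpha)), (b, tuple(beta))]
        priorities.add((b, a))           # B < A: first alternative wins
    else:
        rules += [(n, tuple(alpha)), (n, tuple(beta))]
    return n
```

Note how the prioritized case necessarily introduces the fresh terminals A and B, i.e. the isolation of the arguments, exactly as discussed above.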
Inspired by these examples, we extend the notion of Pyramid Grammars to allow rules of the form X→g where X∈N∪T and g is a grammar expression. So far, grammar expressions are inductively defined as follows:
Any α∈(N∪T)∗ is a grammar expression.
If g and h are grammar expressions, then so is their concatenation gh.
If g is a grammar expression, then so is ⟨g⟩.
If g and h are grammar expressions, then so are g/h and g∣h.
But we don’t have to stop here, there are more grammar expressions we can define:
If g is a grammar expression, then so are g? and [g].
Here we define g?=g/ε and [g]=g∣ε.
And we can also add repetition operators:
If g is a grammar expression, then so are g∗, {g}, g+ and {+g}.
Let us first define greedy repetition g∗. We add a fresh nonterminal N (auxiliary and nested), and add the rule N → (gN)?. This yields N = g∗.
For non-greedy repetition {g} we proceed analogously. We add a fresh nonterminal N (auxiliary and nested), and add the rule N → [Ng]. This gives us N = {g}.
The remaining operators can be defined as g+=gg∗ and {+g}=g{g}.
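Building on the desugar_choice sketch above (same hypothetical representation), greedy repetition g∗ then desugars like this:

```python
def desugar_star(g, rules, priorities):
    """g* becomes a fresh N with N -> (gN)?, i.e. N -> <gN> / epsilon."""
    n = fresh("N")                            # fresh nonterminal for g*
    opt = desugar_choice(tuple(g) + (n,), (), rules, priorities,
                         prioritized=True)    # (gN)? = <gN> / epsilon
    rules.append((n, (opt,)))
    return n
```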
Lookahead Operators
Grammar expressions as defined so far cover the complete Extended Backus-Naur form [7] and almost all constructs of Parsing Expression Grammars (PEG) [6]. What is missing to complete our set are the PEG lookahead operators &g for positive lookahead, and !g for negative lookahead.
Unlike all other grammar expression operators we have encountered previously, the lookahead operators cannot be viewed simply as syntactic sugar that can be desugared into our current framework of Pyramid Grammars.
The notion of Pyramid Grammar is itself just “semantic sugar” on top of Local Lexing, though. So all we need to do is to expand our notion of Pyramid Grammar to let some more of the power of Local Lexing back in. The simplest way to do this is to allow not only nonterminals from N and terminals from T on the right-hand side of rules in R, but also negated terminals !T.
The construction of the semantics G stays the same, with the following differences:
TG now also includes the negated terminals !T = {!T ∣ T∈T}, i.e. TG = T ∪ !T
We need to define additional lexers for the negated terminals (a one-line sketch follows after this list). For T∈T we set Lex!T(D,k) = {ε} if LexT(D,k) = ∅, and Lex!T(D,k) = ∅ otherwise.
The definition of wellfoundedness is adapted such that there are additional directed edges !T↠T in the dependency graph, and such that nullability of !T is taken into account.
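Continuing the lexer sketch from above (lexers returning sets of character strings), the negated lexer is a one-liner:

```python
def negate_lexer(lex_T):
    """Lex_!T: succeeds with the empty match exactly when Lex_T fails."""
    def lex(document, k):
        return set() if lex_T(document, k) else {""}
    return lex
```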
We can now define grammar expressions for lookahead:
If g is a grammar expression, then so are !g and &g.
For the definition of !g we create fresh terminals T and U and add the rules

T → g
U → !T

which results in U = !g.
Both T and U are declared as auxiliary and atomic, i.e. hidden.
Positive lookahead is just doubly negative lookahead: &g=!!g
This shows that Pyramid Grammars are at least as expressive as PEGs and Context-Free Grammars. Indeed, because there are languages recognized by PEGs that are not context-free, Pyramid Grammars are strictly more expressive than Context-Free Grammars.
Conclusion
So, this is the new semantics for Local Lexing. Its new and hopefully improved way of handling empty tokens and avoiding conflicts between non-empty tokens is promising, especially in how it fits together with the new notion of selector consistency. Selector consistency is guaranteed both for finite local lexings and for selectors defined via priority relations, which covers all kinds of interesting selectors I have encountered so far.
I have implemented an algorithm for parsing according to this new semantics [8]. It is adapted from the parsing algorithm in the original Local Lexing paper [1], which is based on Earley’s algorithm[9]. I have tried it out on Recursive teXt, and it seems to be working correctly. But while the original algorithm has a formal correctness proof [10], I have no proof for the new algorithm yet. Nevertheless, I would expect the new proof to be less involved and technical than the original one, as the new semantics seems more intuitive.
The most important point of it all is that the parsing algorithm itself is not that important, and it is actually not covered at all in this article. This work started out as an investigation into using Earley’s parsing algorithm for composable syntax, and was particularly inspired by the use of blackboxes in [11]. But what has been discovered through this work is not so much a particular parsing algorithm, but that Local Lexing is a general semantics for language and parser specifications.
Pyramid Grammars are how you use Local Lexing in practice. Each terminal in a Pyramid Grammar gives rise to a subgrammar. It seems obvious that each of these subgrammars should be analysed by a parser generator to find the parsing algorithm best suited for it. This invites renewed research into parsing algorithms, with a focus on how they can be adapted to Local Lexing. Improvements in parsing algorithms could then be passed on directly to the user, who would not even have to be aware that improvements have been made. This is because the UI of the parser generator is most importantly its semantics, which is Local Lexing / Pyramid Grammars, and which is stable.
Special care has been taken to include infinite grammars. These arise naturally from parameterized Local Lexing Grammars [12]. The grammar for Recursive teXt is a parameterized Pyramid Grammar, and I have found parameters to be very useful indeed. Parameterized Pyramid Grammars are basically just L-attributed[13] Pyramid Grammars. This makes sense: on one hand, L-attributed grammars are the largest class of attribute grammars supporting parser-driven attribute evaluation, and on the other hand, such support is essential for Local Lexing, as attribute evaluation may influence token selection. Parsing techniques for Local Lexing based on GLR parsing[14] and ECLR-attributed parsing[15] seem particularly promising.
[5] Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman (2007). Compilers: Principles, Techniques, & Tools.
[6] Bryan Ford (2004). Parsing Expression Grammars: a recognition-based syntactic foundation. Proceedings of the 31st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. https://doi.org/10.1145/964001.964011
[11] Trevor Jim, Yitzhak Mandelbaum, David Walker (2010). Semantics and Algorithms for Data-dependent Grammars. Proceedings of the 37th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. https://doi.org/10.1145/1706299.1706347
[14] Masaru Tomita (ed.) (1991). Generalized LR Parsing. Kluwer Academic Publishers.
[15] Masataka Sassa, Harushi Ishizuka, Ikuo Nakata (1987). ECLR-Attributed Grammars: A Practical Class of L-Attributed Grammars. Information Processing Letters 24, pp. 31–41. North-Holland.