© 2021 Steven Obua

A New Semantics for Local Lexing

by Steven Obua
Cite as: https://doi.org/10.47757/obua.ll.1
May 18th, 2021
A revised semantics for Local Lexing is presented that entirely avoids priority conflicts between non-empty tokens. Pyramid Grammars are introduced, and their semantics is given in terms of Local Lexing. It is demonstrated that they are more expressive than Context-Free Grammars, and at least as expressive as Parsing Expression Grammars.


In 2017 I introduced in joint work with Phil Scott and Jacques Fleuriot the concept of Local Lexing [1]. Recently, I have applied Local Lexing to design the grammar for Recursive teXt, a language inspired by both Markdown [2] and TeX [3], and also more generally by the paradigm of functional (and declarative) programming [4]. All articles on obua.com are written in Recursive teXt. Recursive teXt is still in an early stage and experimental.
During the design process I gained (and am still gaining) new insights into applying Local Lexing. Not only that, it also became apparent that the semantics of Local Lexing needs to be revised for practical reasons. So this article does that, and it also introduces new concepts. The article is self-contained and can be read and understood independently from [1].


Traditionally, the parsing of (computer) languages happens in two phases: First comes a phase of lexical analysis, which is followed by a phase of syntax analysis [5]. The lexical analysis phase partitions the input consisting of a sequence of characters into a sequence of tokens, and is usually specified via regular expressions. The syntax analysis operates purely on the resulting sequence of tokens, more specifically, on the resulting sequence of token kinds called terminals; this results in a parse tree. The syntax analysis is usually specified by a Context-Free Grammar (CFG).
The three main reasons why parsing is divided into two phases are:
1. Regular expressions can be implemented by deterministic finite state machines, which is usually much more efficient than parsing with respect to a CFG.
2. Although regular expressions are strictly less expressive than CFGs, splitting the analysis into two phases makes it possible to insert additional rules between the phases which increase the overall expressivity beyond that of CFGs. These additional rules are usually considered to be part of the lexical analysis, and concern mostly the priority with which tokens are selected in case of conflicts, for example by preferring the longest matching token.
3. Issues like the treatment of whitespace and comments can often be handled by an additional rule to simply drop certain kinds of tokens after the lexical analysis instead of passing them on to the syntax analysis.
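The rules just described are easy to make concrete. Below is a minimal sketch (Python; the rule names and regular expressions are invented for illustration, not taken from the article) of a traditional lexer that applies longest match, breaks ties in favor of the rule listed first, and drops whitespace tokens before syntax analysis:

```python
import re

# A toy traditional lexer: longest match wins, ties go to the rule listed
# first, and whitespace tokens are dropped instead of being passed on to
# syntax analysis. Terminal names are illustrative.
RULES = [("IF", re.compile(r"if")),
         ("ID", re.compile(r"[a-z]+")),
         ("WS", re.compile(r"\s+"))]

def tokenize(D):
    k, tokens = 0, []
    while k < len(D):
        # Collect all matching rules at position k; max() picks the longest
        # match, and on ties the first rule in RULES wins.
        t, text = max(((t, m.group()) for t, rx in RULES
                       if (m := rx.match(D, k))),
                      key=lambda cand: len(cand[1]))
        if t != "WS":  # drop whitespace instead of passing it on
            tokens.append((t, text))
        k += len(text)
    return tokens
```

For example, `tokenize("iffy")` yields a single `ID` token, because the longest-match rule beats the keyword rule here.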
One disadvantage of this split into two distinct phases is that two different formalisms are employed for specifying the language: one formalism uses regular expressions, the other one CFGs.
But its most significant disadvantage is that it negatively affects the composability of language and parsing specifications. The syntax analysis phase is unproblematic as it is based on CFGs, which compose. But the lexical analysis phase requires that the input can be uniquely decomposed into a single sequence of tokens. Therefore parts from different language specifications cannot be easily combined, as they have possibly different and conflicting token specifications. Usually, these conflicts cannot be resolved automatically without corrupting the semantics of the original parts.
Local Lexing avoids this problem by dropping the requirement for lexical analysis to produce a unique partition of the input into tokens. Instead, all “sensible” partitions are allowed. Whether or not a partition is “sensible” is determined by a parsing process that intertwines lexical analysis and syntax analysis. This allows for token conflict resolution based on the local parsing context. What exactly “sensible” means will be made concrete in the following sections, and differs for this new semantics of Local Lexing from the one presented in [1].
On the basis of Local Lexing, Pyramid Grammars will then be introduced, a simple language and parser specification mechanism which composes and is more expressive than Context-Free Grammars and at least as expressive as Parsing Expression Grammars [6].

Local Lexing

A local lexing is a quadruple $(\mathcal{T}, \Sigma, \operatorname{Lex}, \operatorname{Sel})$, consisting of a set $\mathcal{T}$ of terminals, a set $\Sigma$ of characters, a lexer $\operatorname{Lex}$, and a selector $\operatorname{Sel}$.
The only important thing about $\mathcal{T}$ and $\Sigma$ is that they are sets; otherwise, they can be chosen arbitrarily. But before we can understand the role that lexer and selector play, we need to understand what tokens are.


For any set $U$ we let $U^*$ denote the set of all finite sequences formed from elements of $U$. In particular, $U^*$ contains the empty sequence $\varepsilon$. For two sequences $\alpha, \beta \in U^*$ we write $\alpha\beta$ for their concatenation, which is of course also in $U^*$. Given a sequence $\alpha \in U^*$ we write $|\alpha|$ for the length of $\alpha$, and $\alpha_i$ for the $i$-th element of $\alpha$, where $i \in \{0, 1, \ldots, |\alpha| - 1\}$.
A token $\tau$ is a pair $\tau = (t, \alpha)$ such that $t \in \mathcal{T}$ and $\alpha \in \Sigma^*$. So a token is just a sequence of characters tagged with a terminal. We let $\mathbb{T}$ denote the set of all tokens: $$\mathbb{T} = \{(t, \alpha) \mid t \in \mathcal{T} \wedge \alpha \in \Sigma^*\}$$
We define $\widehat{\tau} = t$ and $\overline{\tau} = \alpha$. These notions are extended in the obvious way to sequences of tokens: $$\widehat{\tau_1 \ldots \tau_n} = \widehat{\tau_1} \ldots \widehat{\tau_n} \in \mathcal{T}^*, \qquad \overline{\tau_1 \ldots \tau_n} = \overline{\tau_1} \ldots \overline{\tau_n} \in \Sigma^*$$
Given a set $T \subseteq \mathbb{T}$ we denote its subset of empty tokens by $T_\varepsilon$: $$T_\varepsilon = \{\tau \in T \mid \overline{\tau} = \varepsilon\}$$
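In code these definitions amount to very little. The following sketch (Python, with a token modeled as a plain `(terminal, characters)` pair; the function names are mine, not the article's) mirrors the two projections and the empty-token subset:

```python
# A token is a pair (t, alpha): a terminal t tagged onto a character string alpha.

def terminals(p):
    """The terminal sequence of a token sequence p (written with a hat in the text)."""
    return tuple(t for (t, _) in p)

def chars(p):
    """The character sequence of a token sequence p (written with an overline)."""
    return "".join(alpha for (_, alpha) in p)

def empty_tokens(T):
    """The subset T_epsilon of empty tokens of a token set T."""
    return {tau for tau in T if tau[1] == ""}

# Example token sequence: terminals(p) == ("NUM", "WS", "ID"), chars(p) == "42 x"
p = (("NUM", "42"), ("WS", " "), ("ID", "x"))
```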

The Lexer

The task of lexical analysis is to break down the input document $D \in \Sigma^*$ into tokens. A local lexing specifies how to do this locally at a given position $k$ of $D$. The task of the lexer is to recognize all possible token candidates at position $k$ in $D$; it is therefore a function taking a document and a position as arguments and returning a set of tokens: $$\operatorname{Lex} : \Sigma^* \times \mathbb{N} \rightarrow \mathscr{P}(\mathbb{T})$$ We write $\mathscr{P}(U)$ to denote the powerset of $U$, i.e. the set of all subsets of $U$.
$\operatorname{Lex}$ must have the property that for all documents $D \in \Sigma^*$ and all positions $k \in \{0, \ldots, |D|\}$ within that document, it returns only tokens $\tau \in \operatorname{Lex}(D, k)$ such that $\overline{\tau}$ actually occurs in $D$ starting at position $k$, i.e. $$k + |\overline{\tau}| \leq |D| \ \wedge\ \forall i \in \{0, \ldots, |\overline{\tau}| - 1\}.\ \overline{\tau}_i = D_{k+i}$$
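Note that such a lexer returns all candidates at a position rather than a single token. A minimal sketch (Python; the terminals `NUM` and `ID` and their regular expressions are illustrative assumptions):

```python
import re

# A lexer in the Local Lexing sense: given document D and position k, return
# the set of ALL token candidates starting at k. Terminal names are invented.
TOKEN_RES = {"NUM": re.compile(r"[0-9]+"), "ID": re.compile(r"[a-z]+")}

def lex(D, k):
    tokens = set()
    for t, rx in TOKEN_RES.items():
        m = rx.match(D, k)  # anchored match at position k
        if m:
            tokens.add((t, m.group()))
    return tokens

# Every returned token satisfies the required property: its character
# sequence literally occurs in D starting at position k.
for (t, alpha) in lex("x12", 1):
    assert "x12"[1:1 + len(alpha)] == alpha
```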

The Selector

The lexer might return more than one token candidate at a given position. Sometimes we might be fine with the resulting ambiguity, but often we want to be able to impose constraints on which token candidates are actually chosen as tokens. One example is that if one token is a prefix of another, longer token, we might want to choose the longer token only and ignore the shorter one.
The selector $\operatorname{Sel}$ allows us to impose such constraints. The basic idea is simple: given a set of token candidates $T \subseteq \mathbb{T}$, the selector chooses a subset $\operatorname{Sel}(T) \subseteq T$ such that only tokens in that subset are considered.
The potential presence of empty tokens complicates this simple idea. As we will see soon, if the lexer allows empty tokens at a position, then token discovery at that position turns into a greedy iterative process. Tokens discovered and chosen earlier in that process cannot be ignored later in the same process. We accommodate this by passing $\operatorname{Sel}$ not only an upper bound, from which we are to choose the tokens, but also a lower bound, which contains the tokens that have been chosen so far. Thus the selector has the signature $$\operatorname{Sel} : \mathscr{P}(\mathbb{T}_\varepsilon) \times \mathscr{P}(\mathbb{T}) \rightarrow \mathscr{P}(\mathbb{T}),$$ and it must have the property that for all token sets $E$ and $T$ with $E \subseteq \mathbb{T}_\varepsilon$ and $E \subseteq T \subseteq \mathbb{T}$ the following bounds hold for the selected tokens: $$E \subseteq \operatorname{Sel}(E, T) \subseteq T.$$ Note that the lower bound $E$ is guaranteed to consist only of empty tokens. The idea here is that we delay deciding which non-empty tokens to choose until we have discovered all empty tokens at a position. This is the difference to the old version of Local Lexing presented in [1], where there was no such guarantee for $E$.
If no empty tokens are in play at a given position, then the lower bound will necessarily be the empty set, turning $\operatorname{Sel}$ back into a function with a single upper-bound argument: $\operatorname{Sel}(T) = \operatorname{Sel}(\emptyset, T)$.


In this new version of Local Lexing we require the selector to have one more property, which we call consistency. It allows us to give a uniform treatment of both finite and infinite sets of characters and terminals. We call a family of sets $(U_i)_{i \in \mathbb{N}}$ a chain iff $U_i \subseteq U_{i+1}$ for all $i \in \mathbb{N}$. We call a selector consistent for document $D$ at position $k$ iff for any two chains $(E_i)_{i \in \mathbb{N}}$ and $(T_i)_{i \in \mathbb{N}}$ such that $$E_0 \subseteq E_1 \subseteq E_2 \subseteq \ldots \subseteq \operatorname{Lex}(D, k)_\varepsilon, \qquad T_0 \subseteq T_1 \subseteq T_2 \subseteq \ldots \subseteq \operatorname{Lex}(D, k), \qquad E_i \subseteq T_i \ \text{for all}\ i \in \mathbb{N}$$ the following holds: $$\operatorname{Sel}\left(\bigcup_{i=0}^\infty E_i,\ \bigcup_{i=0}^\infty T_i\right) \subseteq \bigcup_{i=0}^\infty \operatorname{Sel}(E_i, T_i)$$ We call a selector consistent everywhere (or just consistent) if it is consistent for all documents $D \in \Sigma^*$ at all positions $k \in \{0, \ldots, |D|\}$.
We require $\operatorname{Sel}$ to be consistent everywhere.
Note that if both $\Sigma$ and $\mathcal{T}$ are finite sets, the selector is clearly consistent everywhere, because in this case $\operatorname{Lex}(D, k)$ is also a finite set, which means that both chains $(E_i)_{i \in \mathbb{N}}$ and $(T_i)_{i \in \mathbb{N}}$ converge. Thus there is $J \in \mathbb{N}$ such that $E_i = E_J$ and $T_i = T_J$ for all $i \geq J$, which implies $$\operatorname{Sel}\left(\bigcup_{i=0}^\infty E_i,\ \bigcup_{i=0}^\infty T_i\right) = \operatorname{Sel}(E_J, T_J) \subseteq \bigcup_{i=0}^\infty \operatorname{Sel}(E_i, T_i)$$

Priority Relations

A convenient way to define a selector is by providing a priority relation $\lhd$ on tokens: $$\lhd\ \subseteq\ \mathbb{T} \times \mathbb{T}$$ For two tokens $u \in \mathbb{T}$ and $v \in \mathbb{T}$ we write $u \lhd v$ to mean that $u$ has lower priority than $v$, i.e. that $(u, v) \in\ \lhd$ holds.
For example, let $\operatorname{SpecPos} : \mathcal{T} \rightarrow \mathbb{N}$ be an injective function assigning to each terminal the position (e.g. the line in a grammar specification file) where it was specified. The priority relation used for traditional lexical analysis is then defined by $$u \lhd v \ :\Longleftrightarrow\ |\overline{u}| < |\overline{v}| \ \vee\ \left(|\overline{u}| = |\overline{v}| \wedge \operatorname{SpecPos}(\widehat{u}) > \operatorname{SpecPos}(\widehat{v})\right)$$ In this case, $\lhd$ is even a total order when restricted to the set of tokens appearing at the same position in the same document, but we do not require any properties of $\lhd$ beyond being a relation on tokens.
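As a sketch, this relation can be written down directly (Python; the `spec_pos` table is a hypothetical stand-in for $\operatorname{SpecPos}$, and tokens are `(terminal, characters)` pairs):

```python
# The traditional lexical-analysis priority relation: shorter tokens lose,
# and among equally long tokens the one whose terminal was declared later
# loses. spec_pos is an invented example table, not from the article.
spec_pos = {"KEYWORD": 1, "ID": 2}

def lower_priority(u, v):
    """u <| v: u is shorter than v, or equally long but declared later."""
    (tu, au), (tv, av) = u, v
    return len(au) < len(av) or (len(au) == len(av)
                                 and spec_pos[tu] > spec_pos[tv])
```

With this table, an `ID` token reading `"if"` has lower priority than a `KEYWORD` token reading the same text, because `KEYWORD` was declared first.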
If selectors depended only on their upper-bound argument $T$, there would be a canonical way for $\lhd$ to induce a selector, by just choosing all maximal elements of $T$: $$\operatorname{Sel}(T) = \{u \in T \mid \nexists\, v \in T.\ u \lhd v\}$$
But selectors must also respect their lower-bound argument $E$, and therefore the above definition needs to be adjusted. There are at least two ways to do this:
FB (“First in Blocks”). This way defines the selector as $$\operatorname{Sel}(E, T) = E \cup \{u \in T \mid (\nexists\, v \in T.\ u \lhd v) \wedge (\nexists\, v \in E.\ v \lhd u)\}$$ It adds those maximal elements of $T$ that are not in conflict with any element of $E$. This means that tokens in $E$ block conflicting tokens in $T$ from being chosen, even if those tokens in $T$ have higher priority than the ones already in $E$.
PP (“Priority Prevails”). This way defines the selector as $$\operatorname{Sel}(E, T) = E \cup \{u \in T \mid \nexists\, v \in T.\ u \lhd v\}$$ It adds all maximal elements of $T$ regardless of the content of $E$. As a consequence, lower-priority empty tokens might be chosen although higher-priority tokens have been chosen as well. On the other hand, a lower-priority token can never block a higher-priority one.
Note that FB and PP coincide for $E = \emptyset$. In practice PP seems to be a better choice than FB, as it is more in line with our intuitive concept of priority.
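Both constructions are one-liners over a priority relation. The sketch below (Python; `lt(u, v)` plays the role of $u \lhd v$, tokens are `(terminal, characters)` pairs) implements FB and PP literally from the set comprehensions above:

```python
# Two selector constructions from a priority relation lt, where lt(u, v)
# means u has lower priority than v. E is the lower bound (already-chosen
# empty tokens), T the upper bound of candidates.

def sel_fb(E, T, lt):
    """First in Blocks: tokens in E block conflicting candidates in T."""
    return set(E) | {u for u in T
                     if not any(lt(u, v) for v in T)
                     and not any(lt(v, u) for v in E)}

def sel_pp(E, T, lt):
    """Priority Prevails: all maximal candidates are chosen regardless of E."""
    return set(E) | {u for u in T if not any(lt(u, v) for v in T)}
```

With a longest-match relation and an empty token already chosen, FB blocks the longer candidate while PP still selects it, which is exactly the difference described above.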
It is easy to see that both FB and PP produce selectors that are consistent everywhere. For example, to show consistency of a PP selector, assume $$t \in \operatorname{Sel}(E_\infty, T_\infty), \quad \text{where}\ E_\infty = \bigcup_{i=0}^\infty E_i \ \text{and}\ T_\infty = \bigcup_{i=0}^\infty T_i,$$ for an otherwise arbitrary token $t \in \mathbb{T}$. Based on the definition of PP this implies $$t \in E_\infty \vee \left(t \in T_\infty \wedge \nexists\, v \in T_\infty.\ t \lhd v\right)$$ In case $t \in E_\infty$, there exists $J \in \mathbb{N}$ with $t \in E_J$ and therefore also $t \in \operatorname{Sel}(E_J, T_J)$. In case $t \notin E_\infty$, there exists $J \in \mathbb{N}$ with $t \in T_J$ and $\nexists\, v \in T_\infty.\ t \lhd v$. This implies $t \in T_J \wedge \nexists\, v \in T_J.\ t \lhd v$, and we again arrive at $t \in \operatorname{Sel}(E_J, T_J)$.

Semantics of Local Lexing

So far we have defined what a local lexing is, but we have not yet talked about how to use it. In fact, on its own a local lexing is pretty useless. For it to obtain a meaning, it needs to be paired with a Context-Free Grammar, resulting in a Local-Lexing Grammar.

Context-Free Grammars

Let us first remind ourselves what a Context-Free Grammar is, and introduce the relevant notation. A Context-Free Grammar $G$ is a quadruple $(\mathcal{N}, \mathcal{T}, \mathcal{R}, \mathcal{S})$ where $\mathcal{N}$ is its set of nonterminals, $\mathcal{T}$ its set of terminals, $\mathcal{R} \subseteq \mathcal{N} \times (\mathcal{N} \cup \mathcal{T})^*$ its set of rules, and $\mathcal{S} \in \mathcal{N}$ its start symbol. Nonterminals and terminals are required to be disjoint sets, i.e. $\mathcal{N} \cap \mathcal{T} = \emptyset$.
Instead of $(N, \gamma) \in \mathcal{R}$ we also write $N \rightarrow \gamma$. For $\alpha, \beta \in (\mathcal{N} \cup \mathcal{T})^*$ we write $\alpha \Rightarrow \beta$ iff there are $\alpha_1$, $N$, $\alpha_2$ and $\gamma$ such that $\alpha = \alpha_1 N \alpha_2$, $\beta = \alpha_1 \gamma \alpha_2$, and $N \rightarrow \gamma$. Furthermore we let $\overset{*}{\Rightarrow}$ denote the reflexive and transitive closure of $\Rightarrow$.
The language $\mathcal{L}$ of $G$ is defined as the set of all terminal sequences derivable from the start symbol, i.e. $$\mathcal{L} = \{\omega \in \mathcal{T}^* \mid \mathcal{S} \overset{*}{\Rightarrow} \omega\}$$ The prefix language $\mathcal{L}_{\operatorname{prefix}}$ of $G$ is defined as $$\mathcal{L}_{\operatorname{prefix}} = \{\omega \in \mathcal{T}^* \mid \exists\, \alpha \in (\mathcal{N} \cup \mathcal{T})^*.\ \mathcal{S} \overset{*}{\Rightarrow} \omega\alpha\}$$

Local Lexing Grammars

A Local Lexing Grammar $\mathcal{G}$ is a tuple $(\mathcal{N}, \mathcal{T}, \mathcal{R}, \mathcal{S}, \Sigma, \operatorname{Lex}, \operatorname{Sel})$ such that $G = (\mathcal{N}, \mathcal{T}, \mathcal{R}, \mathcal{S})$ is a Context-Free Grammar and $L = (\mathcal{T}, \Sigma, \operatorname{Lex}, \operatorname{Sel})$ is a local lexing.
To denote the same Local Lexing Grammar as $\mathcal{G}$, but with a different start symbol $\mathcal{X} \in \mathcal{N}$, we write $\mathcal{G}/\mathcal{X}$.
We will define the semantics $\ell\ell$ of $\mathcal{G}$ to be a set of token sequences, i.e. $\ell\ell \subseteq \mathbb{T}^*$.
We construct $\ell\ell$ by separately constructing $\mathcal{P}(D)$ for each document $D \in \Sigma^*$, where $$\mathcal{P}(D) = \{p \in \ell\ell \mid \overline{p} = D\}$$ Once we have constructed $\mathcal{P}(D)$, we can define $\ell\ell$ via $$\ell\ell(\mathcal{G}) = \ell\ell = \bigcup_{D \in \Sigma^*} \mathcal{P}(D)$$


To construct $\mathcal{P}(D)$, we will assume a fixed document $D \in \Sigma^*$ from now on, to simplify our notation. For example, we will just write $\mathcal{P}$ instead of $\mathcal{P}(D)$.
$\mathcal{P}$ is defined by recursively constructing $$\{\varepsilon\} = \mathcal{P}_{-1} \subseteq \mathcal{P}_0 \subseteq \mathcal{P}_1 \subseteq \ldots \subseteq \mathcal{P}_{|D|}$$ and then setting $$\mathcal{P} = \{p \in \mathcal{P}_{|D|} \mid |\overline{p}| = |D| \wedge \widehat{p} \in \mathcal{L}\}$$
The elements of $\mathcal{P}$ are the “sensible” partitions of the input $D$ into tokens we alluded to earlier. Furthermore, the elements $p \in \mathcal{P}_k$ are “sensible” partitions of prefixes of the input $D$ into tokens. We can say a priori (i.e. before actually having constructed $\mathcal{P}_k$) more about such $p$ than that they are just sequences of tokens. We demand that $p$ must be a path. We define the set $\mathbb{P}$ of paths to consist of those $p \in \mathbb{T}^*$ such that:
- the terminal sequence of $p$ belongs to the prefix language of $G$, i.e. $\widehat{p} \in \mathcal{L}_{\operatorname{prefix}}$;
- every token of $p$ is produced by the lexer at the position where it starts, i.e. $p_i \in \operatorname{Lex}(D, |\overline{p_0 \ldots p_{i-1}}|)$ for all $i \in \{0, \ldots, |p| - 1\}$.
Furthermore we define the set of paths that stop at position $k$ in $D$ as $$\mathbb{P}_k = \{p \in \mathbb{P} \mid |\overline{p}| = k\}$$
Paths are an auxiliary tool to aid us in defining $\mathcal{P}$. But they are an interesting notion in themselves already, as they showcase how lexical analysis ($\operatorname{Lex}$) and syntax analysis ($\mathcal{L}_{\operatorname{prefix}}$) are intertwined in Local Lexing. They nicely package up the influence of $G$ and $\operatorname{Lex}$ on the Local Lexing semantics: we will not need to refer to $G$ or $\operatorname{Lex}$ anymore to finish the definition of the semantics. This also means that the semantics does not depend on the particular shape of $G$, but just on $G$'s language $\mathcal{L}$ and its prefix language $\mathcal{L}_{\operatorname{prefix}}$. If $G$ has no unproductive nonterminals, i.e. for all $N \in \mathcal{N}$ there is $\omega \in \mathcal{T}^*$ such that $N \overset{*}{\Rightarrow} \omega$, then $\mathcal{L}_{\operatorname{prefix}}$ is uniquely determined by $\mathcal{L}$, and therefore the semantics depends on $G$'s language $\mathcal{L}$ alone (and on the local lexing $L$, of course).
Indeed, if Local Lexing did not provide a mechanism for constraining token selection, or equivalently, if the selector were trivial, i.e. $\operatorname{Sel}(E, A) = A$ for all arguments $E$ and $A$, then we would already be done, as we could then just set $$\mathcal{P} = \{p \in \mathbb{P}_{|D|} \mid \widehat{p} \in \mathcal{L}\}$$
This also shines a spotlight on the importance of the selector mechanism in Local Lexing. It is its defining feature.

Constructing Pk\mathcal{P}_k

The basic idea for constructing the $\mathcal{P}_k$ for $k \in \{0, \ldots, |D|\}$ is to construct in tandem also a family of token sets $\mathcal{Z}_k \subseteq \operatorname{Lex}(D, k)$. The dependencies here are as follows: $\mathcal{P}_{k-1}$ determines $\mathcal{Z}_k$, and $\mathcal{P}_{k-1}$ together with $\mathcal{Z}_k$ determines $\mathcal{P}_k$. If you think of the $(\mathcal{P}_k)_{0 \leq k \leq |D|}$ as the result of syntax analysis, and the $(\mathcal{Z}_k)_{0 \leq k \leq |D|}$ as the result of lexical analysis, then these dependencies show how Local Lexing intertwines syntax analysis and lexical analysis.
The straightforward part is how $\mathcal{P}_k$ is determined by $\mathcal{Z}_k$ and $\mathcal{P}_{k-1}$: $$\mathcal{P}_k = \operatorname{Append}_k \mathcal{Z}_k\, \mathcal{P}_{k-1}$$ where $$\begin{array}{rcl} \operatorname{Append}_k T\, P & = & \operatorname{limit}\left(\operatorname{Append}^1_k T\right) P \\[0.3em] \operatorname{Append}^1_k T\, P & = & P \cup \{p\,t \in \mathbb{P} \mid p \in P \cap \mathbb{P}_k,\ t \in T\} \\[0.3em] \operatorname{limit} f\, X & = & \bigcup_{n=0}^\infty f^n(X) \end{array}$$ Here we just append tokens in $\mathcal{Z}_k$ to those paths in $\mathcal{P}_{k-1}$ that stop at $k$, as long as the result is still a path. Because appending empty tokens in $\mathcal{Z}_k$ potentially leads to new paths stopping at $k$ which also need to be taken into consideration, in general $\operatorname{Append}_k$ needs to be specified as a limit process.
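For finite situations, $\operatorname{Append}_k$ can be computed as a plain fixpoint iteration. The sketch below (Python, tokens as `(terminal, characters)` pairs) abstracts the path conditions into an `is_path` predicate, which is an assumption standing in for Lex-compatibility and prefix-language membership; it is a sketch, not a general implementation of the limit for infinite cases:

```python
def chars_len(p):
    """Character length of a token sequence, i.e. |chars of p|."""
    return sum(len(alpha) for (_, alpha) in p)

def append_k(k, T, P, is_path):
    """Append_k: repeatedly extend paths in P stopping at position k with
    tokens from T, keeping only extensions that are still paths, until
    nothing new appears (appending an empty token keeps a path at k, so it
    can be extended again in a later round)."""
    result = set(P)
    while True:
        new = {p + (t,) for p in result if chars_len(p) == k
               for t in T if is_path(p + (t,))}
        if new <= result:  # fixpoint reached
            return result
        result |= new
```

In the test below, `is_path` allows at most one empty `E` token, so the iteration discovers the path that first takes `E` at position 0 and then `A`, which a single round of appending would miss.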

Constructing Zk\mathcal{Z}_k

If $\mathcal{Z}_k$ were guaranteed not to contain empty tokens, the other part would be straightforward as well: $$\mathcal{Z}_k = \operatorname{Sel}\left(\operatorname{Next}_k(\mathcal{P}_{k-1})\right)$$ where $$\operatorname{Next}_k(P) = \{t \in \mathbb{T} \mid \exists\, p \in P \cap \mathbb{P}_k.\ p\,t \in \mathbb{P}\} \quad \text{for}\ P \subseteq \mathbb{T}^*$$ Here $\operatorname{Next}_k(P)$ consists of all tokens that could “sensibly” extend some path in $P$ that stops at position $k$ in $D$.
The trouble is that if $\mathcal{Z}_k$ contains any empty token, say $e$, then we should really have taken $e$ into account before computing $\mathcal{Z}_k$, as it might create more sensible paths that stop at $k$. Therefore, in the presence of empty tokens, we need a more elaborate approach for constructing $\mathcal{Z}_k$, as follows.
We construct chains $(Q_k^i)_{i \in \mathbb{N}}$, $(E_k^i)_{i \in \mathbb{N}}$, $(T_k^i)_{i \in \mathbb{N}}$ via $$\begin{array}{rcl} Q_k^0 & = & \mathcal{P}_{k-1} \\[0.3em] E_k^0 & = & \emptyset \\[0.3em] T_k^i & = & \operatorname{Next}_k(Q_k^i) \\[0.3em] E_k^{i+1} & = & \operatorname{Sel}(E_k^i, T_k^i)_\varepsilon \\[0.3em] Q_k^{i+1} & = & \operatorname{Append}_k E_k^{i+1}\, Q_k^i \end{array}$$ and can then define $$\mathcal{Z}_k = \operatorname{Sel}\left(\bigcup_{i=0}^\infty E_k^i,\ \bigcup_{i=0}^\infty T_k^i\right)$$
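For finite token sets, these chains stabilize and can be computed iteratively. The following sketch (Python) takes the $\operatorname{Next}_k$, $\operatorname{Sel}$ and $\operatorname{Append}_k$ operations as function arguments and mirrors the equations above; it is a sketch for the finite case under those assumptions, not a general implementation of the limit:

```python
def build_Z(k, P_prev, next_k, sel, append):
    """Compute Z_k for the finite case: iterate token discovery (next_k),
    selection of empty tokens (sel, restricted to empty tokens), and path
    extension (append) until the chains stabilize, then make one final
    selection over everything discovered."""
    Q, E = set(P_prev), set()          # Q_k^0 = P_{k-1}, E_k^0 = empty
    all_T = set()                      # union of the T_k^i
    while True:
        T = next_k(Q)                  # T_k^i = Next_k(Q_k^i)
        all_T |= T
        E_new = {t for t in sel(E, T) if t[1] == ""}   # E_k^{i+1}
        Q_new = append(k, E_new, Q)                    # Q_k^{i+1}
        if E_new == E and Q_new == Q:  # chains have converged
            break
        E, Q = E_new, Q_new
    return sel(E, all_T)               # final selection over the unions
```

The test exercises the simple case without empty tokens: the loop stabilizes immediately and the final selection is just one application of the selector.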

Was it all worth it?

But have we actually gained something by this more elaborate approach? After all, if there were an empty token $e \in (\mathcal{Z}_k)_\varepsilon$ such that $e$ is not in any of the $E_k^i$, then we would again have the same problem of possibly having missed some sensible paths.
This is where the consistency of the selector comes to the rescue. Because of it we have $$\begin{array}{rcl} (\mathcal{Z}_k)_\varepsilon & = & \operatorname{Sel}\left(\bigcup_{i=0}^\infty E_k^i,\ \bigcup_{i=0}^\infty T_k^i\right)_\varepsilon \\[0.3em] & \subseteq & \left(\bigcup_{i=0}^\infty \operatorname{Sel}(E_k^i, T_k^i)\right)_\varepsilon \\[0.3em] & = & \bigcup_{i=0}^\infty \operatorname{Sel}(E_k^i, T_k^i)_\varepsilon \\[0.3em] & = & \bigcup_{i=0}^\infty E_k^{i+1} \\[0.3em] & = & \bigcup_{i=0}^\infty E_k^i \\[0.3em] & \subseteq & (\mathcal{Z}_k)_\varepsilon \end{array}$$ and thus $(\mathcal{Z}_k)_\varepsilon = \bigcup_{i=0}^\infty E_k^i$.
This shows that when constructing $\mathcal{Z}_k$, before making the final selection of which non-empty tokens to select, it has already been decided which empty tokens to select. The final selection will not add any new empty tokens, and therefore we can safely proceed to the next position $k+1$ without missing out on sensible paths that make (intermediate) stops at $k$. Furthermore, all non-empty tokens at $k$ will be selected at once. This means that there will not be any priority conflicts between non-empty tokens that need to be resolved the FB or the PP way. Such conflicts can only occur between empty tokens and other empty tokens, or between lower-priority empty tokens and higher-priority non-empty tokens. This is the major improvement over the original Local Lexing semantics, where the presence of empty tokens could introduce such conflicts between non-empty tokens.

Pyramid Grammars

A Local Lexing Grammar is a tuple consisting of seven components and is somewhat unwieldy for directly specifying parsers. Furthermore, although I have claimed that Local Lexing is a panacea for composable specifications, I have not provided direct evidence for this claim so far.
Luckily, there is a powerful specification mechanism that is more compact than Local Lexing Grammars but which can still be understood in terms of Local Lexing Grammars.

Definition of Pyramid Grammars

A Pyramid Grammar is a tuple $(\mathcal{N}, \mathcal{T}, \mathcal{R}, \mathcal{S}, \lt, \lessdot)$ such that $\mathcal{N}$ is its set of nonterminals, $\mathcal{T}$ its set of terminals such that $\mathcal{N} \cap \mathcal{T} = \emptyset$, $\mathcal{R} \subseteq (\mathcal{N} \cup \mathcal{T}) \times (\mathcal{N} \cup \mathcal{T})^*$ its set of rules, $\mathcal{S} \in \mathcal{N}$ its start symbol, and $\lt, \lessdot\ \subseteq \mathcal{T} \times \mathcal{T}$ its priorities.
A Pyramid Grammar looks just like a Context-Free Grammar, with two extensions:
- terminals may appear on the left-hand side of rules, so that the language of a terminal can itself be specified by grammar rules;
- the priority relations $\lt$ and $\lessdot$ on terminals, where $\lt$ expresses unconditional priority and $\lessdot$ expresses longest-match priority.

Semantics of Pyramid Grammars

We give a Pyramid Grammar $Y$ a meaning by turning it into a Local Lexing Grammar $\mathcal{G}$, and then defining its alphabet $\Sigma$ and language $\mathbb{L}$ by $$\Sigma(Y) = \Sigma_\mathcal{G}, \qquad \mathbb{L}(Y) = \{\overline{\alpha} \mid \alpha \in \ell\ell(\mathcal{G})\} \subseteq \Sigma^*$$
The terminals and the start symbol of $\mathcal{G}$ are the same: $$\mathcal{T}_\mathcal{G} = \mathcal{T}, \qquad \mathcal{S}_\mathcal{G} = \mathcal{S}$$
The characters $\Sigma_\mathcal{G} \subseteq \mathcal{T}_\mathcal{G}$ of $\mathcal{G}$ are those terminals which do not appear on the left side of any rule in $\mathcal{R}$: $$\Sigma_\mathcal{G} = \{c \in \mathcal{T} \mid \nexists\, \alpha.\ (c, \alpha) \in \mathcal{R}\}$$
For each $T \in \mathcal{T} \setminus \Sigma_\mathcal{G}$ we add a fresh nonterminal $\mathcal{S}_T$: $$\mathcal{N}_\mathcal{G} = \mathcal{N} \uplus \{\mathcal{S}_T \mid T \in \mathcal{T} \setminus \Sigma_\mathcal{G}\}$$
The rules $\mathcal{R}_\mathcal{G}$ of $\mathcal{G}$ are those rules in $\mathcal{R}$ with a nonterminal on the left side, plus rules for the new nonterminals: $$\mathcal{R}_\mathcal{G} = \{(N, \alpha) \in \mathcal{R} \mid N \in \mathcal{N}\} \uplus \{(\mathcal{S}_T, \alpha) \mid T \in \mathcal{T} \setminus \Sigma_\mathcal{G},\ (T, \alpha) \in \mathcal{R}\}$$
The selector $\operatorname{Sel}_\mathcal{G}$ of $\mathcal{G}$ is defined the PP way, by turning $\lt$ and $\lessdot$ into a priority relation $\lhd$ on tokens via $$u \lhd v \ :\Longleftrightarrow\ \left(\begin{array}{ll} & \widehat{u} \lt \widehat{v} \\ \vee & \widehat{u} = \widehat{v} \wedge |\overline{u}| < |\overline{v}| \\ \vee & \widehat{u} \lessdot \widehat{v} \wedge |\overline{u}| \leq |\overline{v}| \\ \vee & \widehat{v} \lessdot \widehat{u} \wedge |\overline{u}| < |\overline{v}| \end{array}\right)$$ This definition reflects that $\lt$ represents unconditional priority and $\lessdot$ represents longest-match priority. Furthermore, for two tokens with the same associated terminal, the longer token has higher priority.
The tricky part is the definition of the lexer component. Firstly, we will define a separate lexer $\operatorname{Lex}_T$ for each terminal $T \in \mathcal{T}_\mathcal{G}$ and then define, for any $D \in \Sigma_\mathcal{G}^*$ and position $k$ in $D$: $$\operatorname{Lex}(D, k) = \bigcup_{T \in \mathcal{T}_\mathcal{G}} \{(T, \alpha) \mid \alpha \in \operatorname{Lex}_T(D, k)\}$$
For c \in \Sigma_\mathcal{G} the lexer is easily defined: \operatorname{Lex}_c(D, k) = \begin{cases} \{c\} & \text{if}\ 0 \leq k < \left|{{D}}\right| \wedge D_k = c \\ \emptyset & \text{otherwise} \end{cases}
We use G/ST{\mathcal{G}} / {\mathcal{S}_T}  for parsing TTΣGT \in \mathcal{T} \setminus \Sigma_\mathcal{G}: LexT(D,k)={αα(G/ST){α k0k+αD{α 0i<α. αi=Dk+i} \begin{array}{rl} \operatorname{Lex}_T(D, k) = & \{ \overline{{\alpha}} \mid \alpha \in \ell\ell({\mathcal{G}} / {\mathcal{S}_T})\\[0.3em] & \phantom{\{ \overline{{\alpha}} \mid}\wedge\ k \geq 0 \wedge k + \left|{{\overline{{\alpha}}}}\right| \leq \left|{{D}}\right|\\[0.3em] & \phantom{\{ \overline{{\alpha}} \mid}\wedge\ \forall\,0 \leq i < \left|{{\overline{{\alpha}}}}\right|.\ \overline{{\alpha}}_i = D_{k+i} \} \end{array} This means that we parse TT with the same Local Lexing Grammar G\mathcal{G}, but use ST\mathcal{S}_T instead of S\mathcal{S} as a start symbol!
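These lexer components can be sketched in Python; the representations are mine (documents as strings, tokens as (terminal, characters) pairs), and the per-terminal lexers for non-character terminals are passed in as a hypothetical mapping, standing in for parsing with \mathcal{G} / \mathcal{S}_T:

```python
def lex_char(c, D, k):
    """Lex_c(D, k) for a character terminal c: the one-character
    sequence c if it occurs in D at position k, otherwise nothing."""
    return {c} if 0 <= k < len(D) and D[k] == c else set()

def make_lexer(alphabet, terminal_lexers):
    """Lex(D, k): the union over all per-terminal lexers, pairing each
    recognised character sequence with its terminal. `terminal_lexers`
    maps each terminal T outside the alphabet to its Lex_T function
    (in the article, obtained by parsing with G / S_T)."""
    def lex(D, k):
        tokens = set()
        for c in alphabet:
            tokens |= {(c, a) for a in lex_char(c, D, k)}
        for T, lex_T in terminal_lexers.items():
            tokens |= {(T, a) for a in lex_T(D, k)}
        return tokens
    return lex

lex = make_lexer({"a", "b"}, {})
assert lex("ab", 0) == {("a", "a")}
```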

Wellfounded Pyramid Grammars

The above semantics for Pyramid Grammars is simple and intuitive, but unfortunately, taken as a definition, it is technically wrong. The problem is that we are defining \mathcal{G} in terms of {\mathcal{G}} / {\mathcal{S}_T}, which makes the definition cyclic.
All components of G\mathcal{G} are defined nonrecursively except for the definition of the LexT\operatorname{Lex}_T for TXT \in \mathcal{X}, where X={STTTΣG}\mathcal{X} = \{\mathcal{S}_T \mid T \in \mathcal{T} \setminus \Sigma_\mathcal{G}\}. So the problematic dependencies are those between the “calls” to LexT(D,k)\operatorname{Lex}_T(D, k) for TXT \in \mathcal{X}, a fixed DD, and 0kD0 \leq k \leq \left|{{D}}\right|. Therefore, if we can find a wellfounded relation on {(T,k)TX,0kD}\{(T, k) \mid T \in \mathcal{X}, 0 \leq k \leq \left|{{D}}\right|\} along those calls, then G\mathcal{G} is well-defined.
If there are no left-recursive rules for terminals, either directly or indirectly, and assuming that TΣG\mathcal{T} \setminus \Sigma_\mathcal{G} is finite, then such a wellfounded relation exists.
To see why, let us first clarify what is meant by “no left-recursive rules for terminals”.
Consider the Context-Free Grammar E = (\mathcal{N} \cup (\mathcal{T}_\mathcal{G} \setminus \Sigma_\mathcal{G}), \Sigma_\mathcal{G}, \mathcal{R}, \mathcal{S}). We say that \alpha \in {{(\mathcal{N} \cup \mathcal{T})}}^* is nullable iff \alpha \overset{*\ }\Rightarrow_E \varepsilon.
We define a directed dependency graph \operatorname{DP} whose nodes are the elements of \mathcal{T}_\mathcal{G} \setminus \Sigma_\mathcal{G}. There is a directed edge from node U to node V iff \mathcal{S}_U \overset{*\ }\Rightarrow_E \alpha V\beta for some nullable \alpha and some \beta; in this case we write U \twoheadrightarrow V. “No left-recursive rules for terminals” then just means that \operatorname{DP} is acyclic. The relation > is defined as (U, k) > (V, l) \quad\text{iff}\quad U \twoheadrightarrow V \vee 0 \leq k < l \leq \left|{{D}}\right| This relation clearly holds along calls of \operatorname{Lex}_V(D, l) from \operatorname{Lex}_U(D, k) because of the way \mathbb{P} is defined. Furthermore, > is wellfounded, because \operatorname{DP} is finite and acyclic, and thus \twoheadrightarrow is wellfounded.
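Acyclicity of \operatorname{DP} is easy to check mechanically. A sketch, assuming the \twoheadrightarrow edges have already been extracted from the grammar:

```python
def is_wellfounded(terminals, edges):
    """Check that the dependency graph DP is acyclic (no left-recursive
    rules for terminals), via depth-first search with colouring."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {t: WHITE for t in terminals}

    def visit(u):
        colour[u] = GREY
        for v in edges.get(u, ()):
            if colour[v] == GREY:        # back edge: cycle found
                return False
            if colour[v] == WHITE and not visit(v):
                return False
        colour[u] = BLACK
        return True

    return all(visit(t) for t in terminals if colour[t] == WHITE)

assert is_wellfounded({"A", "B"}, {"A": ["B"]})
assert not is_wellfounded({"A", "B"}, {"A": ["B"], "B": ["A"]})
```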
This leads us to call Pyramid Grammars wellfounded iff \twoheadrightarrow is a wellfounded relation. The semantics of wellfounded Pyramid Grammars is well-defined via the Local Lexing Grammar G\mathcal{G}.


Composability is a big strength of Pyramid Grammars. In the following, we consider three operations supporting the composition of Pyramid Grammars: renaming, union, and import.


Given a bijection R:N(TΣ)BR : \mathcal{N} \cup (\mathcal{T} \setminus {\Sigma}) \rightarrow B such that BΣ=B \cap \Sigma = \emptyset, the renaming of the Pyramid Grammar YY according to RR is the Pyramid Grammar YY' defined in the obvious way. YY' is wellfounded iff YY is wellfounded, and L(Y)=L(Y)\mathbb{L}(Y') = \mathbb{L}(Y).


Given two Pyramid Grammars Y_1 = (\mathcal{N}_1, \mathcal{T}_1, \mathcal{R}_1, \mathcal{S}_1, \lt_1, \lessdot_1) \quad Y_2 = (\mathcal{N}_2, \mathcal{T}_2, \mathcal{R}_2, \mathcal{S}_2, \lt_2, \lessdot_2) such that \mathcal{N}_1 \cap \mathcal{T}_2 = \mathcal{N}_2 \cap \mathcal{T}_1 = \emptyset, their union can be formed as Y = (\mathcal{N}, \mathcal{T}, \mathcal{R}, \mathcal{S}, \lt, \lessdot), where \mathcal{S} is a fresh nonterminal and \begin{array}{rcl} \mathcal{N} & = & \mathcal{N}_1 \cup \mathcal{N}_2 \cup \{\mathcal{S}\}\\ \mathcal{T} & = & \mathcal{T}_1 \cup \mathcal{T}_2 \\ \mathcal{R} & = & \mathcal{R}_1 \cup \mathcal{R}_2 \cup \{\mathcal{S} \rightarrow \mathcal{S}_1, \mathcal{S} \rightarrow \mathcal{S}_2\}\\ \lt & = & \lt_1 \cup \lt_2\\ \lessdot & = & \lessdot_1 \cup \lessdot_2 \end{array}
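A sketch of the union construction, with a Pyramid Grammar represented as a plain tuple of sets (rules as (lhs, rhs-tuple) pairs) and the fresh start symbol supplied by the caller; the representation is mine:

```python
def union_grammar(Y1, Y2, fresh_start):
    """Form the union of two Pyramid Grammars (N, T, R, S, lt, lessdot).

    Requires N1 ∩ T2 = N2 ∩ T1 = ∅; `fresh_start` must not occur in
    either grammar."""
    N1, T1, R1, S1, lt1, ld1 = Y1
    N2, T2, R2, S2, lt2, ld2 = Y2
    assert not (N1 & T2) and not (N2 & T1)
    return (N1 | N2 | {fresh_start},
            T1 | T2,
            R1 | R2 | {(fresh_start, (S1,)), (fresh_start, (S2,))},
            fresh_start,
            lt1 | lt2,
            ld1 | ld2)

Y1 = ({"S1"}, {"a"}, {("S1", ("a",))}, "S1", set(), set())
Y2 = ({"S2"}, {"b"}, {("S2", ("b",))}, "S2", set(), set())
Y = union_grammar(Y1, Y2, "S")
assert ("S", ("S1",)) in Y[2] and ("S", ("S2",)) in Y[2]
```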
It is not necessarily true that YY is wellfounded if both Y1Y_1 and Y2Y_2 are wellfounded. Consider Z1=({T},{A,B},{TA,AB},T,,)Z_1 = (\{T\}, \{A, B\}, \{T \rightarrow A, A \rightarrow B\}, T, \emptyset, \emptyset) Z2=({T},{A,B},{TB,BA},T,,)Z_2 = (\{T\}, \{A, B\}, \{T \rightarrow B, B \rightarrow A\}, T, \emptyset, \emptyset) and their union Z12=({S,T},{A,B},R,S,,)Z_{12} = (\{S, T\}, \{A, B\}, \mathcal{R}, S, \emptyset, \emptyset) where R={ST,TA,TB,AB,BA}\mathcal{R} = \{S \rightarrow T, T \rightarrow A, T \rightarrow B, A \rightarrow B, B \rightarrow A\} Z1Z_1 and Z2Z_2 are wellfounded with Σ(Z1)=L(Z1)={B}\Sigma(Z_1) = \mathbb{L}(Z_1) = \{B\} and Σ(Z2)=L(Z2)={A}\Sigma(Z_2) = \mathbb{L}(Z_2) = \{A\}. But Z12Z_{12} is not wellfounded, as there is a cycle in the dependency graph of Z12Z_{12} between AA and BB. Note that Σ(Z12)=\Sigma(Z_{12}) = \emptyset.
It is also not necessarily the case that, even if Y is wellfounded, \mathbb{L}(Y) = \mathbb{L}(Y_1) \cup \mathbb{L}(Y_2). Consider the union Z_{13} of Z_1 and Z_3, where Z_3 = (\{T\}, \{B, C\}, \{T \rightarrow B, B \rightarrow C\}, T, \emptyset, \emptyset) We have \Sigma(Z_3) = \mathbb{L}(Z_3) = \{C\} and \Sigma(Z_{13}) = \mathbb{L}(Z_{13}) = \{C\}, and therefore clearly \mathbb{L}(Z_{13}) \neq \mathbb{L}(Z_1) \cup \mathbb{L}(Z_3) = \{B, C\}.
Given wellfounded Pyramid Grammars Y1Y_1 and Y2Y_2 such that (N1(T1Σ1))(N2(T2Σ2))=(\mathcal{N}_1 \cup (\mathcal{T}_1 \setminus {\Sigma}_1)) \cap (\mathcal{N}_2 \cup (\mathcal{T}_2 \setminus {\Sigma}_2)) = \emptyset holds, then their union YY is wellfounded, too, and L(Y)=L(Y1)L(Y2)\mathbb{L}(Y) = \mathbb{L}(Y_1) \cup \mathbb{L}(Y_2) is actually true.
Forming the language union of two Pyramid Grammars is always possible, because if the above condition does not hold, a renaming can be applied to Y1Y_1 or Y2Y_2 to ensure that the condition holds.


A Pyramid Grammar Fragment F=(NF,TF,RF,<F,F)F = (\mathcal{N}_F, \mathcal{T}_F, \mathcal{R}_F, \lt_F, \lessdot_F) is a Pyramid Grammar without a designated start symbol. The notions of alphabet and wellfoundedness are independent of the start symbol and transfer therefore over to fragments.
Adding a fragment FF to a Pyramid Grammar Y=(N,T,R,S,<,)Y = (\mathcal{N}, \mathcal{T}, \mathcal{R}, \mathcal{S}, \lt, \lessdot) results in a Pyramid Grammar Y=(NNF,TTF,RRF,S,<<F,F)Y' = (\mathcal{N} \cup \mathcal{N}_F, \mathcal{T} \cup \mathcal{T}_F, \mathcal{R} \cup \mathcal{R}_F, \mathcal{S}, \lt \cup \lt_F, \lessdot \cup \lessdot_F). Obviously, Σ(Y)Σ(Y)Σ(F)\Sigma(Y') \subseteq \Sigma(Y) \cup \Sigma(F)
We say that YY' is the import of FF to YY if the following conditions hold:
In this case, when YY is wellfounded, so is YY'.
The effect of an import is to swap out some, maybe all, characters of YY by their definitions in FF in terms of Σ(F)\Sigma(F). Therefore FF could define a library of terminals, for example numbers, identifiers, etc., and a Pyramid Grammar YY could import them by simply adding the fragment.
In the extreme case of Σ(Y)TF\Sigma(Y) \subseteq \mathcal{T}_F we have Σ(Y)=Σ(F)\Sigma(Y') = \Sigma(F), i.e. the entire alphabet Σ(Y)\Sigma(Y) is swapped out for the alphabet Σ(F)\Sigma(F) of the fragment. This recreates within the setting of Pyramid Grammars the traditional separation of lexical analysis and syntax analysis, where YY can be considered as being responsible for syntax analysis, FF for lexical analysis, and YY' represents the parser resulting from combining the two.

Parse Trees

When parsing, one is usually not only interested in whether a given document is in the language specified by the Pyramid Grammar, but also in some sort of result derived from the discovered document structure. Traditionally, parser generators allow associating custom actions with grammar rules. These custom actions then compute a result.
It is ill-advised to pursue this approach in a parser generator for Pyramid Grammars. Two consumers of the same grammar often do not have the same use case and need different kinds of results. Moreover, given how easy it is to compose Pyramid Grammars, the results of parsing with Pyramid Grammars should compose just as nicely.
A simple solution is to just use the parse tree, i.e. a concrete syntax tree, as the result. Often though, such a concrete syntax tree follows the structure of the parse too closely and therefore contains too much noise and unwanted structure.
To fix this, one can classify symbols in \mathcal{N} \cup \mathcal{T} along two axes: visibility and structure. Both axes consist of two values. The value of visibility can be either visible or auxiliary. The value of structure can be either nested or atomic. Classifying a symbol as auxiliary means that it has no corresponding node of its own in the parse tree, but is instead represented by the collection of its children. Classifying a symbol as atomic means that its children will not become part of the parse tree. A symbol that is classified as both atomic and auxiliary is effectively hidden and will have no representation in the result.
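The effect of these classifications on the resulting tree can be sketched as a post-processing pass over a concrete syntax tree; the representation ((symbol, children) pairs with strings as character leaves) and the sets `aux` and `atomic` are mine:

```python
def shape(tree, aux, atomic):
    """Flatten a concrete syntax tree according to symbol classification.

    An auxiliary symbol is replaced by its children; an atomic symbol
    keeps its node but drops its children. Returns a list of trees,
    since an auxiliary root dissolves into its children."""
    if isinstance(tree, str):          # character leaf
        return [tree]
    sym, children = tree
    kids = [] if sym in atomic else [t for c in children
                                     for t in shape(c, aux, atomic)]
    if sym in aux:
        return kids                    # no node of its own
    return [(sym, kids)]

tree = ("E", [("Ws", [" "]), ("Num", ["4", "2"])])
# Ws is hidden (auxiliary + atomic), Num is atomic and visible:
assert shape(tree, {"Ws"}, {"Ws", "Num"}) == [("E", [("Num", [])])]
```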
By default, terminals are atomic and visible, while nonterminals are nested and visible. These defaults can be overridden at specification time and / or by the user of a Pyramid Grammar.
With the help of visibility and structure the parser can generate a tree more akin to an abstract syntax tree than to a concrete syntax tree. By analysing the grammar a parser generator can even automatically generate data types for the parse result useful for static type checking.
Note that parse trees can be ambiguous. While an ambiguous grammar is usually something to be avoided, during the development of the grammar easy inspection of ambiguities is important, and parse trees, properly implemented, can support that.

Grammar Expressions

It is possible to implement a prioritized choice operator // as it is used in Parsing Expression Grammars [6]. Given two sequences α,β(NT)\alpha, \beta \in {{(\mathcal{N} \cup \mathcal{T})}}^*, we can add a fresh nonterminal NN and two fresh terminals AA and BB and add the rules NANBAαBβ \begin{array}{rcl} N & \rightarrow & A \\ N & \rightarrow & B \\ A & \rightarrow & \alpha \\ B & \rightarrow & \beta \\ \end{array} We also add the priority B<A B \lt A The effect of this is that NN now stands for the prioritized choice between α\alpha and β\beta: N=α/β N = \alpha / \beta To aid in parse tree creation, we declare NN, AA and BB as auxiliary and nested.
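This desugaring is mechanical enough to sketch as a rule-generating function; the fresh-name scheme and rule representation are mine:

```python
import itertools

_fresh = itertools.count()

def fresh(prefix):
    # generate a fresh symbol name (naming scheme is ours)
    return f"{prefix}{next(_fresh)}"

def prioritized_choice(alpha, beta, rules, lt):
    """Desugar alpha / beta: add a fresh nonterminal N and fresh
    terminals A, B with N -> A, N -> B, A -> alpha, B -> beta and the
    priority B < A, so the first alternative is preferred."""
    N, A, B = fresh("N"), fresh("A"), fresh("B")
    rules += [(N, (A,)), (N, (B,)), (A, tuple(alpha)), (B, tuple(beta))]
    lt.add((B, A))
    return N

rules, lt = [], set()
N = prioritized_choice(["x"], ["y"], rules, lt)
A, B = rules[0][1][0], rules[1][1][0]
assert (A, ("x",)) in rules and (B, ("y",)) in rules and (B, A) in lt
```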
Of course, we can also add a standard choice operator | by creating a fresh nonterminal NN (auxiliary and nested) and adding the rules NαNβ \begin{array}{rcl} N & \rightarrow & \alpha \\ N & \rightarrow & \beta \end{array} The effect is that N=αβN = \alpha\, |\, \beta
An interesting operator is the isolation operator α\langle {\alpha} \rangle. We implement it by adding a fresh, auxiliary and nested terminal TT, and adding the rule TαT \rightarrow \alpha and thus T=αT = \langle {\alpha} \rangle Isolation has the effect of spawning off a separate parsing process for its argument which cannot be influenced by token selection issues outside of this process.
Note that we cannot have prioritized choice without isolating the arguments first, i.e. α/β\alpha / \beta has the same meaning as α/β\langle {\alpha} \rangle / \langle {\beta} \rangle. On the other hand, αβ\langle {\alpha} \rangle\,|\,\langle {\beta} \rangle has in general a different semantics than just αβ\alpha\,|\,\beta.
Inspired by these examples, we extend the notion of Pyramid Grammars to allow rules of the form XgX \rightarrow g where XNTX \in \mathcal{N} \cup \mathcal{T} and gg is a grammar expression. So far, grammar expressions are inductively defined as follows:
But we don’t have to stop here, there are more grammar expressions we can define:
Here we define g?=g/εg? = g / \varepsilon and [g]=gε[g] = g\, |\, \varepsilon.
And we can also add repetition operators:
Let us first define greedy repetition gg^*. We add a fresh nonterminal NN (auxiliary and nested), and add the rule N(gN)?N \rightarrow (g N)? This yields N=gN = g^*.
For non-greedy repetition {g}\{g\} we proceed analogously. We add a fresh nonterminal NN (auxiliary and nested), and add the rule N[Ng]N \rightarrow [N g] This gives us N={g}N = \{g\}.
The remaining operators can be defined as g+=ggg^+ = g g^* and {+g}=g{g}\{^+ g\} = g \{g\}.
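The repetition desugarings can be sketched the same way; `star` builds N \rightarrow (g N)? via the prioritized-choice construction, and `nongreedy` flattens N \rightarrow [N g] directly into two rules (representations and names are mine):

```python
import itertools

_c = itertools.count()

def fresh(prefix):
    return f"{prefix}{next(_c)}"

def prioritized_choice(alpha, beta, rules, lt):
    # alpha / beta, desugared as described earlier
    N, A, B = fresh("N"), fresh("A"), fresh("B")
    rules += [(N, (A,)), (N, (B,)), (A, tuple(alpha)), (B, tuple(beta))]
    lt.add((B, A))
    return N

def star(g, rules, lt):
    """Greedy repetition g*: fresh N with N -> (g N)?,
    where (g N)? is the prioritized choice (g N) / epsilon."""
    N = fresh("N")
    M = prioritized_choice((g, N), (), rules, lt)
    rules.append((N, (M,)))
    return N

def nongreedy(g, rules, lt):
    """Non-greedy repetition {g}: fresh N with N -> [N g],
    flattened here into the two rules N -> N g and N -> epsilon."""
    N = fresh("N")
    rules += [(N, (N, g)), (N, ())]
    return N
```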

Lookahead Operators

Grammar expressions as defined so far cover the complete Extended Backus-Naur form [7] and almost all constructs of Parsing Expression Grammars (PEG) [6]. What is missing to complete our set are the PEG lookahead operators &g \& g for positive lookahead, and !g!g for negative lookahead.
Unlike all other grammar expression operators we have encountered previously, the lookahead operators cannot be viewed simply as syntactic sugar that can be desugared into our current framework of Pyramid Grammars.
The notion of Pyramid Grammar is itself just “semantic sugar” on top of Local Lexing, though. So all we need to do is to expand our notion of Pyramid Grammar to let some more of the power of Local Lexing back in. The simplest way to do this is to allow that on the right hand side of rules in R\mathcal{R} not only nonterminals NN and terminals TT can appear, but also negated terminals !T!T.
The construction of the semantics G\mathcal{G} stays the same, with the following differences:
TG=T!T\mathcal{T}_\mathcal{G} = \mathcal{T}\, \cup\, !\mathcal{T}
The definition of wellfoundedness is adapted such that there are additional directed edges !TT!T \twoheadrightarrow T in the dependency graph, and such that nullability of !T!T is taken into account.
We can now define grammar expressions for lookahead:
For the definition of !g!g we create fresh terminals TT and UU and add the rules TgU!T \begin{array}{rcl} T & \rightarrow & g \\ U & \rightarrow & !T \\ \end{array} which results in U= !gU =\ !g.
Both T and U are declared as auxiliary and atomic, i.e. hidden.
Positive lookahead is just doubly negative lookahead: &g= !!g \&g =\ !!g
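The lookahead desugarings can be sketched with negated terminals represented as tagged pairs ('!', T); this representation, like the names, is chosen for the sketch and not taken from the article:

```python
import itertools

_c = itertools.count()

def fresh(prefix):
    return f"{prefix}{next(_c)}"

def neg(g, rules, hidden):
    """Negative lookahead !g: fresh terminals T and U with
    T -> g and U -> !T; both are hidden (auxiliary and atomic)."""
    T, U = fresh("T"), fresh("U")
    rules += [(T, tuple(g)), (U, (("!", T),))]
    hidden |= {T, U}
    return U

def pos(g, rules, hidden):
    """Positive lookahead &g, defined as doubly negative lookahead !!g."""
    return neg([neg(g, rules, hidden)], rules, hidden)

rules, hidden = [], set()
U = pos(["x"], rules, hidden)
# four rules in total: T -> x, U -> !T, T' -> U, U' -> !T'
assert len(rules) == 4 and len(hidden) == 4
```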
This shows that Pyramid Grammars are at least as expressive as PEGs and Context-Free Grammars. Indeed, because there are languages recognized by PEGs that are not context-free, Pyramid Grammars are strictly more expressive than Context-Free Grammars.


So, this is the new semantics for Local Lexing. Its new and hopefully improved way of handling empty tokens and avoiding conflicts between non-empty tokens is promising, especially in how it fits together with the new notion of selector consistency. Selector consistency is guaranteed both for finite local lexings and for priority relations, which covers all types of interesting selectors I have encountered so far.
I have implemented an algorithm for parsing according to this new semantics [8] →. It is adapted from the parsing algorithm from the original Local Lexing paper [1] →, which is based on Earley’s algorithm [9]. I have tried it out on Recursive teXt and it seems to be working correctly. But while the original algorithm has a formal correctness proof [10] →, I have no proof for the new algorithm yet. Nevertheless, I would expect the new proof to be less involved and technical than the original proof as the new semantics seems more intuitive.
The most important point of it all is that the parsing algorithm itself is not that important, and actually not covered at all in this article. This work started out as an investigation into using Earley’s parsing algorithm for composable syntax, and was particularly inspired by the use of blackboxes in [11] →. But what has been discovered through this work is not so much a particular parsing algorithm, but that Local Lexing is a general semantics for language and parser specifications.
Pyramid Grammars are how you use Local Lexing in practice. Each terminal in a Pyramid Grammar gives rise to a subgrammar. It seems obvious that each of these subgrammars should be analysed by a parser generator to find the parsing algorithm best suited for it. This seems to invite renewed research into parsing algorithms with a focus on how they can be adapted to Local Lexing. Improvements in parsing algorithms could be passed on directly to the user, who would not even have to be aware that improvements have been made. This is because the most important part of the parser generator's user interface is its semantics, which is Local Lexing / Pyramid Grammars, and which stays stable.
Special care has been taken to include infinite grammars. These arise naturally from parameterized Local Lexing Grammars [12] →. The grammar for Recursive teXt is a parameterized Pyramid Grammar, and I have found parameters to be very useful indeed. Parameterized Pyramid Grammars are basically just L-attributed [13] Pyramid Grammars. This makes sense, as on one hand L-attributed grammars are the largest class of attribute grammars supporting parser-driven attributes, and on the other hand such support is essential for Local Lexing as attribute evaluation may influence token selection. Parsing techniques for Local Lexing based on GLR parsing [14] and ECLR-attributed parsing [15] seem particularly promising.


[1]Steven Obua, Phil Scott, Jacques Fleuriot. (2017). Local Lexing, https://arxiv.org/abs/1702.03277.
[2]John Gruber. (2004). Markdown, https://daringfireball.net/projects/markdown.
[3]Donald E. Knuth. (1986). The TeXbook.
[4]Larry C. Paulson. (1996). ML for the Working Programmer, https://doi.org/10.1017/CBO9780511811326.
[5]Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman. (2007). Compilers: Principles, Techniques, & Tools.
[6]Bryan Ford. (2004). Parsing Expression Grammars: a recognition-based syntactic foundation, Proceedings of the 31st ACM SIGPLAN-SIGACT symposium on Principles of programming languages, https://doi.org/10.1145/964001.964011.
[7]Richard Pattis. EBNF: A Notation To Describe Syntax, https://www.ics.uci.edu/~pattis/misc/ebnf2.pdf.
[8]ParsingKit, https://github.com/phlegmaticprogrammer/ParsingKit.
[9]Jay Earley. (1970). An efficient context-free parsing algorithm, Communications of the ACM, https://doi.org/10.1145/362007.362035.
[10]Steven Obua. (2017). Local Lexing, Archive of Formal Proofs, https://www.isa-afp.org/entries/LocalLexing.html.
[11]Trevor Jim, Yitzhak Mandelbaum, David Walker. (2010). Semantics and Algorithms for Data-dependent Grammars, Proceedings of the 37th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages, https://doi.org/10.1145/1706299.1706347.
[12]Steven Obua. (2017). Parameterized Local Lexing, https://arxiv.org/abs/1704.04215.
[13]Reinhard Wilhelm, Dieter Maurer. (1995). Parser-driven attributes: L-attributed grammars, Compiler Design, Addison-Wesley, p. 447.
[14]Masaru Tomita (ed.). (1991). Generalized LR Parsing, Kluwer Academic Publishers.
[15]Masataka Sassa, Harushi Ishizuka, Ikuo Nakata. (1987). ECLR-Attributed Grammars: A Practical Class of L-Attributed Grammars, Information Processing Letters 24, pp. 31-41, North-Holland.