A New Semantics for Local Lexing

Cite as: https://doi.org/10.47757/obua.ll.1

May 18th, 2021

In 2017 I introduced in joint work with Phil Scott and Jacques Fleuriot the concept of Local Lexing [1]. Recently, I have applied Local Lexing to design the grammar for Recursive teXt, a language inspired by both Markdown [2] and TeX [3], and also more generally by the paradigm of functional (and declarative) programming [4] →. All articles on obua.com are written in Recursive teXt. Recursive teXt is still in an early stage and experimental.

During the design process I gained (and am still gaining) new insights into applying Local Lexing. Not only that, it also became apparent that the semantics of Local Lexing needs to be revised for practical reasons. So this article does that, and it also introduces new concepts. The article is self-contained and can be read and understood independently from [1] →.

Traditionally, the parsing of (computer) languages happens in two phases: First comes a phase of lexical analysis, which is followed by a phase of syntax analysis [5]. The lexical analysis phase partitions the input consisting of a sequence of characters into a sequence of tokens, and is usually speciﬁed via regular expressions. The syntax analysis operates purely on the resulting sequence of tokens, more speciﬁcally, on the resulting sequence of token kinds called terminals; this results in a parse tree. The syntax analysis is usually speciﬁed by a Context-Free Grammar (CFG).

The three main reasons why parsing is divided into two phases are:

One disadvantage of this split into two distinct phases is that two different formalisms are employed for specifying the language: one formalism uses regular expressions, the other one CFGs.

But its most signiﬁcant disadvantage is that it negatively affects the composability of language and parsing speciﬁcations. The syntax analysis phase is unproblematic as it is based on CFGs, which compose. But the lexical analysis phase requires that the input can be uniquely decomposed into a single sequence of tokens. Therefore parts from different language speciﬁcations cannot be easily combined, as they have possibly different and conﬂicting token speciﬁcations. Usually, these conﬂicts cannot be resolved automatically without corrupting the semantics of the original parts.

Local Lexing avoids this problem by dropping the requirement for lexical analysis to produce a unique partition of the input into tokens. Instead, all “sensible” partitions are allowed. Whether or not a partition is “sensible” is determined by a parsing process that intertwines lexical analysis and syntax analysis. This allows for token conﬂict resolution based on the local parsing context. What exactly “sensible” means will be made concrete in the following sections, and differs for this new semantics of Local Lexing from the one presented in [1] →.

On the basis of Local Lexing, Pyramid Grammars will then be introduced, a simple language and parser speciﬁcation mechanism which composes and is more expressive than Context-Free Grammars and at least as expressive as Parsing Expression Grammars [6].

A local lexing is a quadruple

(\mathcal{T}, \Sigma, \operatorname{Lex}, \operatorname{Sel})

:

The only important thing about

\mathcal{T}

and

\Sigma

is that they are sets. Otherwise, they can be chosen arbitrarily. But before we can understand the role that lexer and selector play, we need to understand what tokens are.

For any set

U

we let

{{U}}^*

denote the set of all ﬁnite sequences formed from elements in

U

. In particular,

{{U}}^*

contains the empty sequence

\varepsilon

. For two sequences

\alpha, \beta \in {{U}}^*

we write

\alpha\beta

for their concatenation, which is of course also in

{{U}}^*

. Given a sequence

\alpha \in {{U}}^*

we write

\left|{{\alpha}}\right|

for the length of

\alpha

, and

\alpha_i

for the

i

-th element of

\alpha

where

i \in \{0, 1, \ldots, \left|{{a}}\right| - 1\}

.

A token

\tau

is a pair

\tau = (t, \alpha)

such that

t \in \mathcal{T}

and

\alpha \in {{\Sigma}}^*

. So a token is just a sequence of characters tagged with a terminal. We let

\mathbb{T}

denote the set of all tokens:

\mathbb{T} = \left\{(t, \alpha) \mid t \in \mathcal{T} \ {\includegraphics[height=0.5em]{B68EE0FF-F769-478D-90A2-42087C347E93}}\ \alpha \in {{\Sigma}}^*\right\}

We deﬁne

{\includegraphics[height=0.8491079999999999em, totalheight=0.8491079999999999em]{8BB64BB5-CA64-463C-BDB9-F767555C30C0}} = t

and

\overline{{\tau}} = \alpha

. These notions are extended in the obvious way to sequences of tokens:

\begin{array}{rcl} {\includegraphics[height=0.9890079999999999em, totalheight=1.2390079999999999em]{B44D48AF-7EC1-4458-BD53-1D86BD673CD1}} & = & {\includegraphics[height=0.8491079999999999em, totalheight=0.9991079999999999em]{6D043DE1-7E77-4DF7-AFB4-4A977681C8B3}} \ldots {\includegraphics[height=0.8491079999999999em, totalheight=0.9991079999999999em]{AADB1BC9-C78B-4F6E-A98B-BAE19C357077}} \in {{\mathcal{T}}}^* \\[0.3em] \overline{{\tau_1 \ldots \tau_n}} & = & \overline{{\tau_1}} \ldots \overline{{\tau_n}} \in {{\Sigma}}^* \end{array}

Given a set

T \subseteq \mathbb{T}

we denote its subset of empty tokens

{{T}}_{\varepsilon}

by

{{T}}_{\varepsilon} = \{ \tau \in T \mid \overline{{\tau}} = \varepsilon\}

The task of lexical analysis is to break down the input document

D \in {{\Sigma}}^*

into tokens. A local lexing speciﬁes how to do this locally at a given position

k

of

D

. The task of the lexer is to recognize all possible token candidates at position

k

in

D

, and is therefore a function taking a document and a position as arguments, and returning a set of tokens:

\operatorname{Lex} : {{\Sigma}}^* \times \N \rightarrow \mathscr{P}({\mathbb{T}})

We write

\mathscr{P}({U})

to denote the powerset of

U

, i.e. the set of all subsets of

U

.

\operatorname{Lex}

must have the property that for all documents

D \in {{\Sigma}}^*

and all positions

k \in \{0, \ldots, \left|{{D}}\right|\}

within that document, it returns only tokens

\tau \in \operatorname{Lex}(D, k)

such that

\overline{{\tau}}

is actually a subsequence of

D

starting at

k

, i.e.

k + \left|{{\overline{{{\tau}}}}}\right| \leq \left|{{D}}\right| \ {\includegraphics[height=0.5em]{7D7E1A5B-265E-4CF1-8E08-283F6920051D}} \ \forall i \in \{0, \ldots, \left|{{\overline{{{\tau}}}}}\right| - 1\}.\ \overline{{\tau}}_i = D_{k + i}

The lexer might return more than one token candidate at a given position. Sometimes we might be ﬁne with the resulting ambiguity, but often we want to be able to impose constraints on which token candidates are actually chosen as tokens. One example is that if one token is a preﬁx of another, longer token, we might want to choose the longer token only and ignore the shorter one.

The selector

\operatorname{Sel}

allows us to impose such constraints. The basic idea is simple: Given a set of token candidates

T \subseteq \mathbb{T}

, the selector chooses a subset

\operatorname{Sel}(T) \subseteq T

such that only tokens in that subset are considered.

The potential presence of empty tokens complicates this simple idea. As we will see soon, if the lexer allows empty tokens in a position, then token discovery at that position turns into a greedy iterative process. Tokens discovered and chosen earlier in that process cannot be ignored later in the same process. We accommodate this by passing

\operatorname{Sel}

not only an upper bound, from which we are to choose the tokens, but also a lower bound, which contains the tokens that have been chosen so far. Thus the selector has the signature

\operatorname{Sel} : \mathscr{P}({\mathbb{T}_\varepsilon}) \times \mathscr{P}({\mathbb{T}}) \rightarrow \mathscr{P}({\mathbb{T}}),

and it must have the property that for all token sets

E

and

T

with

\mathbb{T}_\varepsilon \supseteq E \subseteq T \subseteq \mathbb{T}

the following bounds hold for the selected tokens:

E \subseteq \operatorname{Sel}(E, T) \subseteq T.

Note that the lower bound

E

is guaranteed to consist only of empty tokens. The idea here is that we delay deciding on which non-empty tokens to choose until we have discovered all empty tokens in a position. This is the difference to the old version of Local Lexing presented in [1] →, where there was no such guarantee for

E

.

If no empty tokens are in play in a given position, then the lower bound will necessarily be the empty set, turning

\operatorname{Sel}

back into a function with a single upper-bound argument

\operatorname{Sel}(T) = \operatorname{Sel}(\emptyset, T)

.

In this new version of Local Lexing we require the selector to have one more property, which we call consistency. It allows us to give a uniform treatment of both ﬁnite and inﬁnite sets of characters and terminals. We call a family of sets

(U_i)_{i \in \N}

a chain iff

U_i \subseteq U_{i+1}

for all

i \in \N

. We call a selector consistent for document $D$ at position $k$ iff for any two chains

(E_i)_{i \in \N}

and

(T_i)_{i \in \N}

such that

\begin{array}{ccccccccl} E_0 & \subseteq & E_1 & \subseteq & E_2 & \subseteq & \ldots & \subseteq & {{\operatorname{Lex}(D, k)}}_{\varepsilon} \\ T_0 & \subseteq & T_1 & \subseteq & T_2 & \subseteq & \ldots & \subseteq & \operatorname{Lex}(D, k) \end{array}\\[0.5em] E_i \subseteq T_i \ {\includegraphics[height=0.5em]{991B3E6A-0E5A-4698-B40D-D00B092A03BF}}\ i \in \N

the following holds:

\operatorname{Sel}\left(\bigcup_{i=0}^\infty E_i,\ \bigcup_{i=0}^\infty T_i\right) \subseteq \bigcup_{i=0}^\infty \operatorname{Sel}(E_i, T_i)

We call a selector consistent everywhere (or just consistent) if it is consistent for all documents

D \in {{\Sigma}}^*

at all positions

k \in \{0, \ldots, \left|{{D}}\right|\}

.

We require $\operatorname{Sel}$ to be consistent everywhere.

Note that if both

\Sigma

and

\mathcal{T}

are ﬁnite sets, the selector is clearly consistent everywhere. Because in this case

\operatorname{Lex}(D, k)

is also a ﬁnite set, and this means that both chains

(E_i)_{i \in \N}

and

(T_i)_{i \in \N}

converge. Thus there is

J \in \N

such that

E_i = E_J

and

T_i = T_J

for all

i \geq J

, which implies

\operatorname{Sel}\left(\bigcup_{i=0}^\infty E_i,\ \bigcup_{i=0}^\infty T_i\right) = \operatorname{Sel}(E_J, T_J) \subseteq \bigcup_{i=0}^\infty \operatorname{Sel}(E_i, T_i)

A convenient way to deﬁne a selector is by providing a priority relation

\lhd

on tokens:

\lhd \subseteq \mathbb{T} \times \mathbb{T}

For two tokens

u \in \mathbb{T}

and

v \in \mathbb{T}

we write

u \lhd v

to mean that

u

has lower priority than

v

, i.e.

(u, v) \in \lhd

holds.

For example, let

\operatorname{SpecPos} : \mathcal{T} \rightarrow \N

be an injective function assigning to each terminal the position (e.g. line in a grammar speciﬁcation ﬁle) where it was speciﬁed. The priority relation used for traditional lexical analysis is then deﬁned by

u \lhd v \ {\includegraphics[height=0.5em]{44F70450-7F66-4420-90D8-CED086C922B2}}\ \left|{{\overline{{{u}}}}}\right| < \left|{{\overline{{{v}}}}}\right| \vee \left(\left|{{\overline{{{u}}}}}\right| = \left|{{\overline{{{v}}}}}\right| \wedge \operatorname{SpecPos}({\includegraphics[height=0.8491079999999999em, totalheight=0.8491079999999999em]{0D8FE38A-E77F-4FEC-AC68-E55CB8E8799E}}) > \operatorname{SpecPos}({\includegraphics[height=0.8491079999999999em, totalheight=0.8491079999999999em]{640D8136-18B8-43E3-B311-5BDFF327D4AC}})\right)

In this case,

\lhd

is even a total order when restricted to the set of tokens appearing at the same position in the same document, but we do not require any properties of

\lhd

beyond being a relation on tokens.

If selectors depended only on their upper bound argument

T

, there would be a canonical way how

\lhd

induces a selector by just choosing all maximal elements in

T

:

\operatorname{Sel}(T) = \{ u \in T \mid \nexists\, v \in T.\ u \lhd v\}

But selectors must also respect their lower bound argument

E

, and therefore the above deﬁnition needs to be adjusted. There are at least two ways to do this:

FB (“First in Blocks”). This way deﬁnes the selector as

\operatorname{Sel}(E, T) = E \cup \{ u \in T \mid (\nexists\, v \in T.\ u \lhd v) \wedge (\nexists\, v \in E.\ v \lhd u)\}

It adds those maximal elements of

T

that are not in conﬂict with any elements in

E

. This means that tokens in

E

block conﬂicting tokens in

T

from being chosen, even if those tokens in

T

have higher priority than those already in

E

.

PP (“Priority Prevails”). This way deﬁnes the selector as

\operatorname{Sel}(E, T) = E \cup \{ u \in T \mid \nexists\, v \in T.\ u \lhd v\}

It adds all maximal elements of

T

regardless of the content of

E

. As a consequence, lower priority empty tokens might be chosen although higher priority tokens have been chosen as well. On the other hand, a lower priority token can never block a higher priority one.

Note that FB and PP coincide for

E = \emptyset

. In practice PP seems to be a better choice than FB, as it is more in line with our intuitive concept of priority.

It is easy to see that both FB and PP produce selectors that are consistent everywhere. For example, to show consistency of a PP selector, assume

t \in \operatorname{Sel}(E_\infty, T_\infty)\ {\includegraphics[height=0.5em]{7FD733F7-1DA6-4813-A81E-7F73C96F18A2}}\ E_\infty = \bigcup_{i=0}^\infty E_i\ {\includegraphics[height=0.5em]{FB6310F2-9C88-4C49-B47C-7C25AD445CBA}}\ T_\infty = \bigcup_{i=0}^\infty T_i

for an otherwise arbitrary token

t \in \mathbb{T}

. Based on the deﬁnition of PP this implies

t \in E_\infty \vee \left(t \in T_\infty \wedge \nexists\, v \in T_\infty.\ t \lhd v \right)

In case of

t \in E_\infty

, there exists

J \in \N

with

t \in E_J

and therefore also

t \in \operatorname{Sel}(E_J, T_J)

In case of

t \notin E_\infty

, there exists

J \in \N

with

t \in T_J

and

\nexists\, v \in T_\infty.\ t \lhd v

. This implies

t \in T_J \wedge \nexists\, v \in T_J.\ t \lhd v

and we again arrive at

t \in \operatorname{Sel}(E_J, T_J)

.

So far we have deﬁned what a local lexing is, but we have not yet talked about how to use it. In fact, on its own a local lexing is pretty useless. For it to obtain a meaning, it needs to be paired with a Context-Free Grammar, resulting in a Local-Lexing Grammar.

Let us ﬁrst remind ourselves what a Context-Free Grammar is, and introduce the relevant notation. A Context-Free Grammar

G

is a quadruple

(\mathcal{N}, \mathcal{T}, \mathcal{R}, \mathcal{S})

where

\mathcal{N}

is its set of nonterminals,

\mathcal{T}

its set of terminals,

\mathcal{R} \subseteq \mathcal{N} \times {{(\mathcal{N} \cup \mathcal{T})}}^*

its set of rules, and

\mathcal{S} \in \mathcal{N}

its start symbol. Nonterminals and terminals are required to be disjoint sets, i.e.

\mathcal{N} \cap \mathcal{T} = \emptyset

.

Instead of

(N, \gamma) \in \mathcal{R}

we also write

N \rightarrow \gamma

. We write

\alpha \Rightarrow \beta

for

\alpha, \beta \in {{(\mathcal{N} \cup \mathcal{T})}}^*

iff there are

\alpha_1

,

N

,

\alpha_2

and

\gamma

such that

\alpha = \alpha_1 N \alpha_2

,

\beta = \alpha_1 \gamma \alpha_2

, and

N \rightarrow \gamma

. Furthermore we let

\overset{*\ }\Rightarrow

denote the reﬂexive and transitive closure of

\Rightarrow

.

The language

\mathcal{L}

of

G

is deﬁned as the set of all terminal sequences derivable from the start symbol, i.e.

\mathcal{L} = \{ \omega \in {{\mathcal{T}}}^* \mid \mathcal{S} \overset{*\ }\Rightarrow \omega \}

The preﬁx language

\mathcal{L}_{{\includegraphics[height=0.5em]{E21B77F2-652F-47D7-B05E-3ACADDD29408}}}

of

G

is deﬁned as

\mathcal{L}_{{\includegraphics[height=0.5em]{84CD9468-A95A-4D50-B630-4D0E161A8DBB}}} = \{ \omega \in {{\mathcal{T}}}^* \mid \exists\, \alpha \in {{(\mathcal{N} \cup \mathcal{T})}}^*.\ \mathcal{S} \overset{*\ }\Rightarrow \omega \alpha \}

A Local Lexing Grammar

\mathcal{G}

is a tuple

(\mathcal{N}, \mathcal{T}, \mathcal{R}, \mathcal{S}, \Sigma, \operatorname{Lex}, \operatorname{Sel})

such that

G = (\mathcal{N}, \mathcal{T}, \mathcal{R}, \mathcal{S})

is a Context-Free Grammar and

L = (\mathcal{T}, \Sigma, \operatorname{Lex}, \operatorname{Sel})

is a local lexing.

To denote the same Local Lexing Grammar as

\mathcal{G}

, but with a different start symbol

\mathcal{X} \in \mathcal{N}

, we write

{\mathcal{G}} / {\mathcal{X}}

.

We will deﬁne the semantics

\ell\ell

of

\mathcal{G}

to be a set of token sequences, i.e.

\ell\ell \subseteq {{\mathbb{T}}}^*

We construct

\ell\ell

by separately constructing

\mathcal{P}(D)

for each document

D \in {{\Sigma}}^*

, where

\mathcal{P}(D) = \left\{ p \in \ell\ell \mid \overline{{p}} = D \right\}

Once we have constructed

\mathcal{P}(D)

, we can deﬁne

\ell\ell

via

\ell\ell(\mathcal{G}) = \ell\ell = \bigcup_{D \in {{\Sigma}}^*} \mathcal{P}(D)

To construct

\mathcal{P}(D)

, we will assume a ﬁxed document

D \in {{\Sigma}}^*

from now on to simplify our notation. For example, we will just write

\mathcal{P}

instead of

\mathcal{P}(D)

.

\mathcal{P}

is deﬁned by recursively constructing

\{ \varepsilon \} = \mathcal{P}_{-1} \subseteq \mathcal{P}_0 \subseteq \mathcal{P}_1 \subseteq \ldots \subseteq \mathcal{P}_{\left|{{D}}\right|}

and then setting

\mathcal{P} = \left\{ p \in \mathcal{P}_{\left|{{D}}\right|} \mid \left|{{\overline{{{p}}}}}\right| = \left|{{D}}\right| \wedge {\includegraphics[height=0.8491079999999999em, totalheight=1.043548em]{76632037-D6C6-4395-86E3-83D28E7B7971}} \in \mathcal{L} \right\}

The elements of

\mathcal{P}

are the “sensible” partitions of the input

D

into tokens we alluded to earlier →. Furthermore the elements

p \in \mathcal{P}_k

are “sensible” partitions of preﬁxes of the input

D

into tokens. We can say a priori (i.e. before actually having constructed

\mathcal{P}_k

) more about such

p

than them just being sequences of tokens. We demand that

p

must be a path. We deﬁne the set

\mathbb{P}

of paths to consist of those

p \in {{\mathbb{T}}}^*

such that:

Furthermore we deﬁne the set of paths that stop at position

k

in

D

as

\mathbb{P}_k = \{ p \in \mathbb{P} \mid \left|{{\overline{{{p}}}}}\right| = k \}

Paths are an auxiliary tool to aid us in deﬁning

\mathcal{P}

. But they are an interesting notion in itself already, as they showcase how lexical analysis

(\operatorname{Lex})

and syntax analysis

(\mathcal{L}_{{\includegraphics[height=0.5em]{61968841-9A26-405C-B50E-ADA6F8394013}}})

are intertwined in Local Lexing. They nicely package up the inﬂuence of

G

and

\operatorname{Lex}

on the Local Lexing semantics: We will not need to refer to

G

or

\operatorname{Lex}

anymore to ﬁnish the deﬁnition of the semantics. This also means that the semantics does not depend on the particular shape of

G

, but just on

G

’s language

\mathcal{L}

and its preﬁx language

\mathcal{L}_{{\includegraphics[height=0.5em]{521AC5EC-0289-4FCB-9C78-654EFEE77B4B}}}

. If

G

has no unproductive nonterminals, i.e. for all

N \in \mathcal{N}

there is

\omega \in {{\mathcal{T}}}^*

such that

N \overset{*\ }\Rightarrow \omega

, then

\mathcal{L}_{{\includegraphics[height=0.5em]{FECD6D99-4DBE-4916-9398-8A1649DCB42B}}}

is uniquely determined by

\mathcal{L}

and therefore the semantics just depends on

G

’s language

\mathcal{L}

alone (and on the local lexing

L

, of course).

Indeed, if Local Lexing did not provide a mechanism for constraining token selection, or equivalently, if the selector would be trivial, i.e.

\operatorname{Sel}(E, A) = A

for all arguments

E

and

A

, then we would already be done, as we could then just set

\mathcal{P} = \left\{ p \in \mathbb{P}_{\left|{{D}}\right|} \mid {\includegraphics[height=0.8491079999999999em, totalheight=1.043548em]{49B450D2-EB19-45B3-A92B-5B337223EC56}} \in \mathcal{L} \right\}

This also shines a spotlight on the importance of the selector mechanism in Local Lexing. It is its deﬁning feature.

The basic idea for constructing the

\mathcal{P}_k

for

k \in \{0, \ldots, \left|{{D}}\right|\}

is to construct in tandem also a family of token sets

\mathcal{Z}_k \subseteq \operatorname{Lex}(D, k)

. The dependencies here are as follows:

\begin{array}{ccc} \mathcal{P}_{k-1} & {\includegraphics[height=0.5em]{D50D3FAC-523F-4DB4-8A00-F2E252D90FED}} & \mathcal{Z}_k\\[0.3em] \mathcal{P}_{k-1}\ {\includegraphics[height=0.5em]{79AA3228-EE32-4412-8BEC-0C29E29DC510}}\ \mathcal{Z}_k & {\includegraphics[height=0.5em]{588221D7-955F-4256-AF4D-63FBA3977D64}} & \mathcal{P}_k \end{array}

If you think of the

(\mathcal{P}_k)_{0 \leq k \leq \left|{{D}}\right|}

as the result of syntax analysis, and the

(\mathcal{Z}_k)_{0 \leq k \leq \left|{{D}}\right|}

as the result of lexical analysis, then these dependencies show how Local Lexing intertwines syntax analysis and lexical analysis.

The straightforward part is how

\mathcal{P}_k

is determined by

\mathcal{Z}_k

and

\mathcal{P}_{k-1}

:

\mathcal{P}_k = \operatorname{Append}_k\, \mathcal{Z}_{k}\, \mathcal{P}_{k-1}

where

\begin{array}{rcl} \operatorname{Append}_k\, T\, P & = & \operatorname{limit} \left(\operatorname{Append}^1_k\, T\right) P \\[0.3em] \operatorname{Append}^1_k\, T\, P & = & P \cup \left\{\, p\,t \in \mathbb{P} \mid p \in P \cap \mathbb{P}_k,\ t \in T \right\} \\[0.3em] \operatorname{limit} f X & = & \bigcup_{n = 0}^\infty f^n(X) \end{array}

Here we just append tokens in

\mathcal{Z}_k

to those paths in

\mathcal{P}_{k-1}

that stop at

k

, as long as the result is still a path. Because appending empty tokens in

\mathcal{Z}_k

potentially leads to new paths stopping at

k

which also need to be taken into consideration, in general

\operatorname{Append}_k

needs to be speciﬁed as a limit process.

If

\mathcal{Z}_k

was guaranteed not to contain empty tokens, the other part would be straightforward as well:

\mathcal{Z}_k = \operatorname{Sel}\left(\operatorname{Next}_k\left({\mathcal{P}_{k-1}}\right)\right)

where

\operatorname{Next}_k\left({P}\right) = \left\{ t \in \mathbb{T} \mid \exists\, p \in P \cap \mathbb{P}_k.\ p\, t \in \mathbb{P} \right\}\ {\includegraphics[height=0.5em]{9C5E966E-BBE5-4D0F-B61E-21131C69BA94}}\ P \subseteq {{\mathbb{T}}}^*

Here

\operatorname{Next}_k\left({P}\right)

consists of all tokens that could “sensibly” extend some path in

P

that stops at position

k

in

D

.

The trouble is that if $\mathcal{Z}_k$ contains any empty token, say

e

, then we should really have taken

e

into account before computing

\mathcal{Z}_k

as it might create more sensible paths that stop at

k

. Therefore, in the presence of empty tokens, we need a more elaborate approach for constructing

\mathcal{Z}_k

as follows.

We construct chains

(Q_k^i)_{i \in \N}

,

(E_k^i)_{i \in \N}

,

(T_k^i)_{i \in \N}

via

\begin{array}{rcl} Q_k^0 & = & \mathcal{P}_{k-1}\\[0.3em] E_k^0 & = & \emptyset\\[0.3em] T_k^i & = & \operatorname{Next}_k\left({Q_k^i}\right)\\[0.3em] E_k^{i+1} & = & {{\operatorname{Sel}(E_k^i, T_k^i)}}_{\varepsilon}\\[0.3em] Q_k^{i+1} & = & \operatorname{Append}_k\,E_k^{i+1}\,Q_k^i \end{array}

and can then deﬁne

\mathcal{Z}_k = \operatorname{Sel}\left(\bigcup_{i=0}^\infty E_k^i,\ \bigcup_{i=0}^\infty T_k^i \right)

But have we actually gained something by this more elaborate approach? After all, if there is an empty token

e \in {{(\mathcal{Z}_k)}}_{\varepsilon}

such that

e

is not in any of the

E_k^i

then we would again have the same problem of possibly having missed some sensible paths.

This is where the consistency of the selector comes to the rescue. Because of it we have

\begin{array}{rcl} {{(\mathcal{Z}_k)}}_{\varepsilon} & = & {{ \operatorname{Sel}\left(\bigcup_{i=0}^\infty E_k^i,\ \bigcup_{i=0}^\infty T_k^i \right)}}_{\varepsilon} \\[0.3em] & \subseteq & {{\left(\bigcup_{i=0}^\infty \operatorname{Sel}(E_k^i, T_k^i)\right)}}_{\varepsilon}\\[0.3em] & = & \bigcup_{i=0}^\infty {{\operatorname{Sel}(E_k^i, T_k^i)}}_{\varepsilon}\\[0.3em] & = & \bigcup_{i=0}^\infty E_k^{i+1}\\[0.3em] & = & \bigcup_{i=0}^\infty E_k^i\\[0.3em] & \subseteq & {{(\mathcal{Z}_k)}}_{\varepsilon} \end{array}

and thus

{{(\mathcal{Z}_k)}}_{\varepsilon} = \bigcup_{i=0}^\infty E_k^i

.

This shows that when constructing

\mathcal{Z}_k

, before making the ﬁnal selection of which non-empty tokens to select, it has already been decided which empty tokens to select. The ﬁnal selection will not add any new empty tokens, and therefore we can safely proceed to the next position

k+1

without missing out on sensible paths that have (intermediate) stops at

k

. Furthermore, all non-empty tokens at

k

will be selected at once. This means that there will not be any priority conﬂicts between non-empty tokens that need to be resolved the FB or the PP way →. Such conﬂicts can only possibly occur between empty tokens and other empty tokens, or between lower priority empty tokens and higher priority non-empty tokens. This is the major improvement over the original Local Lexing semantics where the presence of empty tokens could introduce such conﬂicts between non-empty tokens.

A Local Lexing Grammar is a tuple consisting of seven components and somewhat unwieldy for directly specifying parsers. Furthermore, although I have claimed that Local Lexing is a panacea for composable speciﬁcations, I have not provided direct evidence for this claim so far.

Luckily, there is a powerful speciﬁcation mechanism that is more compact than Local Lexing Grammars but which can still be understood in terms of Local Lexing Grammars.

A Pyramid Grammar is a tuple

(\mathcal{N}, \mathcal{T}, \mathcal{R}, \mathcal{S}, \lt, \lessdot)

such that

\mathcal{N}

is its set of nonterminals,

\mathcal{T}

its set of terminals such that

\mathcal{N} \cap \mathcal{T} = \emptyset

,

\mathcal{R} \subseteq (\mathcal{N} \cup \mathcal{T}) \times {{(\mathcal{N} \cup \mathcal{T})}}^*

its set of rules,

\mathcal{S} \in \mathcal{N}

its start symbol, and

\lt, \lessdot \subset \mathcal{T} \times \mathcal{T}

its priorities.

A Pyramid Grammar looks just like a Context-Free Grammar, with two extensions:

We are giving a Pyramid Grammar

Y

a meaning by turning it into a Local Lexing Grammar

\mathcal{G}

, and then deﬁning its alphabet

\Sigma

and language

\mathbb{L}

by

\begin{array}{rcl} \Sigma(Y) & = & \Sigma_\mathcal{G}\\[0.3em] \mathbb{L}(Y) & = & \{\overline{{\alpha}} \mid \alpha \in \ell\ell(\mathcal{G})\} \subseteq {{\Sigma}}^* \end{array}

The terminals and the start symbol of

\mathcal{G}

are the same:

\mathcal{T}_\mathcal{G} = \mathcal{T}, \quad \mathcal{S}_\mathcal{G} = \mathcal{S}

The characters

\Sigma_\mathcal{G} \subseteq \mathcal{T}_\mathcal{G}

of

\mathcal{G}

are those terminals which do not appear on the left side of any rule in

\mathcal{R}

:

\Sigma_\mathcal{G} = \{ c \in \mathcal{T} \mid \nexists \alpha. (c, \alpha) \in \mathcal{R} \}

For each

T \in \mathcal{T} \setminus \Sigma_\mathcal{G}

we add a fresh nonterminal

\mathcal{S}_T

:

\mathcal{N}_\mathcal{G} = \mathcal{N} \uplus \{\mathcal{S}_T \mid T \in \mathcal{T} \setminus \Sigma_\mathcal{G}\}

The rules

\mathcal{R}_\mathcal{G}

of

\mathcal{G}

are those rules in

\mathcal{R}

with a nonterminal on the left side, plus rules for the new nonterminals:

\mathcal{R}_\mathcal{G} = \{ (N, \alpha) \in \mathcal{R} \mid N \in \mathcal{N} \} \uplus \{ (\mathcal{S}_T, \alpha) \mid T \in \mathcal{T} \setminus \Sigma_\mathcal{G},\ (T, \alpha) \in \mathcal{R} \}

The selector

\operatorname{Sel}_\mathcal{G}

of

\mathcal{G}

is deﬁned the PP way → by turning

\lt

and

\lessdot

into a priority relation

\lhd

on tokens via

u \lhd v\ {\includegraphics[height=0.5em]{079B2B9D-2A05-402D-BE27-63F04910E99E}}\ \left( \begin{array}{ll} & {\includegraphics[height=0.8491079999999999em, totalheight=0.8491079999999999em]{1FCFAFF4-CE03-4AF8-B57A-F2BE5795AAB9}} \lt {\includegraphics[height=0.8491079999999999em, totalheight=0.8491079999999999em]{E7525A81-0DF5-48AE-92D1-8F9A5859F775}} \\ \vee & {\includegraphics[height=0.8491079999999999em, totalheight=0.8491079999999999em]{3696B186-CB9A-4456-BE55-CBA5DBF9765D}} = {\includegraphics[height=0.8491079999999999em, totalheight=0.8491079999999999em]{A61876B3-E313-4765-B6DD-532A5453D03F}} \wedge \left|{{\overline{{{u}}}}}\right| < \left|{{\overline{{{v}}}}}\right| \\ \vee & {\includegraphics[height=0.8491079999999999em, totalheight=0.8491079999999999em]{22438E83-6588-40A7-8ED0-96721919F7BC}} \lessdot {\includegraphics[height=0.8491079999999999em, totalheight=0.8491079999999999em]{D3179172-D49B-47BE-BB9E-09C36E1D8B9A}} \wedge \left|{{\overline{{{u}}}}}\right| \leq \left|{{\overline{{{v}}}}}\right| \\ \vee & {\includegraphics[height=0.8491079999999999em, totalheight=0.8491079999999999em]{80AD6521-BC16-4187-926B-FF9467814AA6}} \lessdot {\includegraphics[height=0.8491079999999999em, totalheight=0.8491079999999999em]{66271B2D-C4D4-4840-A1F7-17FF6F0914E9}} \wedge \left|{{\overline{{{u}}}}}\right| \lt \left|{{\overline{{{v}}}}}\right| \end{array} \right)

This deﬁnition reﬂects that

\lt

represents unconditional priority and

\lessdot

represents longest match priority. Furthermore, for two tokens with the same associated terminal, the longer token has higher priority.

The tricky part is the deﬁnition of the lexer component. Firstly, we will deﬁne a separate lexer

\operatorname{Lex}_T

for each terminal

T \in \mathcal{T}_\mathcal{G}

and then deﬁne for any

D \in {{\Sigma_\mathcal{G}}}^*

and position

k

in

D

:

\operatorname{Lex}(D, k) = \bigcup_{T \in \mathcal{T}_\mathcal{G}} \{ (T, \alpha) \mid \alpha \in \operatorname{Lex}_T(D, k) \}

For

c \in \Sigma_\mathcal{G}

the lexer is easily deﬁned:

\operatorname{Lex}_c(D, k) = \begin{cases} \{c\} & {\includegraphics[height=0.5em]{099F3F7E-A03B-4D6F-91F6-344273B36F15}}\ 0 \leq k < \left|{{D}}\right| \wedge D_k = c \\ \emptyset & {\includegraphics[height=0.5em]{DCA89E20-44F6-430A-9556-6AE0A9DB9D01}} \end{cases}

We use

{\mathcal{G}} / {\mathcal{S}_T}

→ for parsing

T \in \mathcal{T} \setminus \Sigma_\mathcal{G}

:

\begin{array}{rl} \operatorname{Lex}_T(D, k) = & \{ \overline{{\alpha}} \mid \alpha \in \ell\ell({\mathcal{G}} / {\mathcal{S}_T})\\[0.3em] & \phantom{\{ \overline{{\alpha}} \mid}\wedge\ k \geq 0 \wedge k + \left|{{\overline{{\alpha}}}}\right| \leq \left|{{D}}\right|\\[0.3em] & \phantom{\{ \overline{{\alpha}} \mid}\wedge\ \forall\,0 \leq i < \left|{{\overline{{\alpha}}}}\right|.\ \overline{{\alpha}}_i = D_{k+i} \} \end{array}

This means that we parse

T

with the same Local Lexing Grammar

\mathcal{G}

, but use

\mathcal{S}_T

instead of

\mathcal{S}

as a start symbol!

The above semantics for Pyramid Grammars is simple and intuitive, but unfortunately as a deﬁnition technically wrong. The problem is that we are deﬁning

\mathcal{G}

in terms of

{\mathcal{G}} / {\mathcal{S}_T}

, leading to a cyclic deﬁnition.

All components of

\mathcal{G}

are deﬁned nonrecursively except for the deﬁnition of the

\operatorname{Lex}_T

for

T \in \mathcal{X}

, where

\mathcal{X} = \{\mathcal{S}_T \mid T \in \mathcal{T} \setminus \Sigma_\mathcal{G}\}

. So the problematic dependencies are those between the “calls” to

\operatorname{Lex}_T(D, k)

for

T \in \mathcal{X}

, a ﬁxed

D

, and

0 \leq k \leq \left|{{D}}\right|

. Therefore, if we can ﬁnd a wellfounded relation on

\{(T, k) \mid T \in \mathcal{X}, 0 \leq k \leq \left|{{D}}\right|\}

along those calls, then

\mathcal{G}

is well-deﬁned.

If there are no left-recursive rules for terminals, either directly or indirectly, and assuming that

\mathcal{T} \setminus \Sigma_\mathcal{G}

is ﬁnite, then such a wellfounded relation exists.

To see why, let us ﬁrst clarify what is meant by “no left-recursive rules for terminals”.

Consider the Context-Free Grammar

E = \{\mathcal{N} \cup (\mathcal{T}_\mathcal{G} \setminus \Sigma_\mathcal{G}), \Sigma_\mathcal{G}, \mathcal{R}, \mathcal{S}\}

. We say that

\alpha \in {{(\mathcal{N} \cup \mathcal{T})}}^*

is nullable iff

\alpha \overset{*\ }\Rightarrow_E \varepsilon

.

We deﬁne a directed dependency graph $\operatorname{DP}$ such that the nodes of

\operatorname{DP}

are the elements of

\mathcal{T}_\mathcal{G} \setminus \Sigma_\mathcal{G}

. There is a directed edge from node

U \in \mathcal{T}

to node

V \in \mathcal{T}_\mathcal{G}

iff

\mathcal{S}_U \overset{*\ }\Rightarrow \alpha V\beta

for some nullable

\alpha

and some

\beta

, and in this case we write

U \twoheadrightarrow V

. “No left-recursive rules for terminals” means then just that

\operatorname{DP}

is acyclic. The relation

>

is deﬁned as

(U, k) > (V, l) \quad {\includegraphics[height=0.5em]{236DA5EB-3B22-4BD4-9EE9-E257ACE6A389}}\quad U \twoheadrightarrow V \vee 0 \leq k < l \leq \left|{{D}}\right|

This relation clearly holds along calls of

\operatorname{Lex}_V(D, l)

from

\operatorname{Lex}_U(D, k)

because of the way

\mathbb{P}

is deﬁned. Furthermore

>

is wellfounded, because

\operatorname{DP}

is ﬁnite and acyclic, and thus

\twoheadrightarrow

is wellfounded.

This leads us to call Pyramid Grammars wellfounded iff

\twoheadrightarrow

is a wellfounded relation. The semantics of wellfounded Pyramid Grammars is well-deﬁned via the Local Lexing Grammar

\mathcal{G}

.

Composability is a big strength of Pyramid Grammars. As examples, we will consider three operations supporting composition of Pyramid Grammars in the following: renaming, union and import.

Given a bijection

R : \mathcal{N} \cup (\mathcal{T} \setminus {\Sigma}) \rightarrow B

such that

B \cap \Sigma = \emptyset

, the renaming of the Pyramid Grammar

Y

according to

R

is the Pyramid Grammar

Y'

deﬁned in the obvious way.

Y'

is wellfounded iff

Y

is wellfounded, and

\mathbb{L}(Y') = \mathbb{L}(Y)

.

Given two Pyramid Grammars

Y_1 = (\mathcal{N}_1, \mathcal{T}_1, \mathcal{R}_1, \mathcal{S}_1, \lessdot_1, \lt_1) \quad Y_2 = (\mathcal{N}_2, \mathcal{T}_2, \mathcal{R}_2, \mathcal{S}_2, \lt_2, \lessdot_2)

as long as

\mathcal{N}_1 \cap \mathcal{T}_2 = \mathcal{N}_2 \cap \mathcal{T}_1 = \emptyset

, you can form its union as

Y = (\mathcal{N}, \mathcal{T}, \mathcal{R}, \mathcal{S}, \lt, \lessdot)

where

\mathcal{S}

is a fresh nonterminal and

\begin{array}{rcl} \mathcal{N} & = & \mathcal{N}_1 \cup \mathcal{N}_2 \cup \{\mathcal{S}\}\\ \mathcal{T} & = & \mathcal{T}_1 \cup \mathcal{T}_2 \\ \mathcal{R} & = & \mathcal{R}_1 \cup \mathcal{R}_2 \cup \{\mathcal{S} \rightarrow \mathcal{S}_1, \mathcal{S} \rightarrow \mathcal{S}_2\}\\ \lt & = & \lt_1 \cup \lt_2\\ \lessdot & = & \lessdot_1 \cup \lessdot_2 \end{array}

It is not necessarily true that

Y

is wellfounded if both

Y_1

and

Y_2

are wellfounded. Consider

Z_1 = (\{T\}, \{A, B\}, \{T \rightarrow A, A \rightarrow B\}, T, \emptyset, \emptyset)

Z_2 = (\{T\}, \{A, B\}, \{T \rightarrow B, B \rightarrow A\}, T, \emptyset, \emptyset)

and their union

Z_{12} = (\{S, T\}, \{A, B\}, \mathcal{R}, S, \emptyset, \emptyset)

where

\mathcal{R} = \{S \rightarrow T, T \rightarrow A, T \rightarrow B, A \rightarrow B, B \rightarrow A\}

Z_1

and

Z_2

are wellfounded with

\Sigma(Z_1) = \mathbb{L}(Z_1) = \{B\}

and

\Sigma(Z_2) = \mathbb{L}(Z_2) = \{A\}

. But

Z_{12}

is not wellfounded, as there is a cycle in the dependency graph of

Z_{12}

between

A

and

B

. Note that

\Sigma(Z_{12}) = \emptyset

.

It is also not necessarily true that even if

Y

is wellfounded, that then

\mathbb{L}(Y) = \mathbb{L}(Y_1) \cup \mathbb{L}(Y_2)

. Consider the union

Z_{13}

of

Z_1

and

Z_3

, where

Z_3 = (\{T\}, \{B, C\}, \{T \rightarrow B, B \rightarrow C\}, T, \emptyset, \emptyset)

We have

\Sigma(Z_3) = \mathbb{L}(Z_3) = \{C\}

and

\Sigma(Z_{13}) = \mathbb{L}(Z_{13}) = \{C\}

and therefore clearly

\mathbb{L}(Z_{13}) \neq \mathbb{L}(Z_1) \cup \mathbb{L}(Z_3) = \{B, C\}

.

Given wellfounded Pyramid Grammars

Y_1

and

Y_2

such that

(\mathcal{N}_1 \cup (\mathcal{T}_1 \setminus {\Sigma}_1)) \cap (\mathcal{N}_2 \cup (\mathcal{T}_2 \setminus {\Sigma}_2)) = \emptyset

holds, then their union

Y

is wellfounded, too, and

\mathbb{L}(Y) = \mathbb{L}(Y_1) \cup \mathbb{L}(Y_2)

is actually true.

Forming the language union of two Pyramid Grammars is always possible, because if the above condition does not hold, a renaming can be applied to

Y_1

or

Y_2

to ensure that the condition holds.

A Pyramid Grammar Fragment

F = (\mathcal{N}_F, \mathcal{T}_F, \mathcal{R}_F, \lt_F, \lessdot_F)

is a Pyramid Grammar without a designated start symbol. The notions of alphabet and wellfoundedness are independent of the start symbol and transfer therefore over to fragments.

Adding a fragment

F

to a Pyramid Grammar

Y = (\mathcal{N}, \mathcal{T}, \mathcal{R}, \mathcal{S}, \lt, \lessdot)

results in a Pyramid Grammar

Y' = (\mathcal{N} \cup \mathcal{N}_F, \mathcal{T} \cup \mathcal{T}_F, \mathcal{R} \cup \mathcal{R}_F, \mathcal{S}, \lt \cup \lt_F, \lessdot \cup \lessdot_F)

. Obviously,

\Sigma(Y') \subseteq \Sigma(Y) \cup \Sigma(F)

We say that

Y'

is the import of

F

to

Y

if the following conditions hold:

In this case, when

Y

is wellfounded, so is

Y'

.

The effect of an import is to swap out some, maybe all, characters of

Y

by their deﬁnitions in

F

in terms of

\Sigma(F)

. Therefore

F

could deﬁne a library of terminals, for example numbers, identiﬁers, etc., and a Pyramid Grammar

Y

could import them by simply adding the fragment.

In the extreme case of

\Sigma(Y) \subseteq \mathcal{T}_F

we have

\Sigma(Y') = \Sigma(F)

, i.e. the entire alphabet

\Sigma(Y)

is swapped out for the alphabet

\Sigma(F)

of the fragment. This recreates within the setting of Pyramid Grammars the traditional separation of lexical analysis and syntax analysis, where

Y

can be considered as being responsible for syntax analysis,

F

for lexical analysis, and

Y'

represents the parser resulting from combining the two.

When parsing, one is usually not only interested in whether a given document is in the language speciﬁed by the Pyramid Grammar, but also in some sort of result derived from the discovered document structure. Traditionally, parser generators allow to associate custom actions with grammar rules. These custom actions then compute a result.

It is ill-advised to pursue this approach in a parser generator for Pyramid Grammars. Two consumers of the same grammar have often not the same use case and need different kinds of results. Given how easy it is to compose Pyramid Grammars, the results of parsing with Pyramid Grammars should also compose nicely.

A simple solution to this is to just use the parse tree as a result, a concrete syntax tree. Often though, such a concrete syntax tree follows too much the structure of the parse and contains therefore too much noise and unwanted structure.

To ﬁx this, one can classify symbols in

\mathcal{N} \cup \mathcal{T}

along two axes: visibility and structure. Both axes consist of two values. The value of visibility can be either visible or auxiliary. The value of structure can be either nested or atomic. For a symbol to be classiﬁed as auxiliary means that it has no corresponding node in the parse tree of its own, but is just represented by the collection of its children. For a symbol to be classiﬁed as atomic means that its children will not become part of the parse tree. A symbol that is classiﬁed as both atomic and auxiliary is effectively hidden and will have no representation in the result.

By default, terminals are atomic and visible, while nonterminals are nested and visible. These defaults are overridable at speciﬁcation time, and / or by the user of a Pyramid Grammar.

With the help of visibility and structure the parser can generate a tree more akin to an abstract syntax tree than to a concrete syntax tree. By analysing the grammar a parser generator can even automatically generate data types for the parse result useful for static type checking.

Note that parse trees can be ambiguous. While an ambiguous grammar is usually something to be avoided, during the development of the grammar easy inspection of ambiguities is important, and parse trees, properly implemented, can support that.

It is possible to implement a prioritized choice operator $/$ as it is used in Parsing Expression Grammars [6]. Given two sequences

\alpha, \beta \in {{(\mathcal{N} \cup \mathcal{T})}}^*

, we can add a fresh nonterminal

N

and two fresh terminals

A

and

B

and add the rules

\begin{array}{rcl} N & \rightarrow & A \\ N & \rightarrow & B \\ A & \rightarrow & \alpha \\ B & \rightarrow & \beta \\ \end{array}

We also add the priority

B \lt A

The effect of this is that

N

now stands for the prioritized choice between

\alpha

and

\beta

:

N = \alpha / \beta

To aid in parse tree creation, we declare

N

,

A

and

B

as auxiliary and nested.

Of course, we can also add a standard choice operator $|$ by creating a fresh nonterminal

N

(auxiliary and nested) and adding the rules

\begin{array}{rcl} N & \rightarrow & \alpha \\ N & \rightarrow & \beta \end{array}

The effect is that

N = \alpha\, |\, \beta

An interesting operator is the isolation operator

\langle {\alpha} \rangle

. We implement it by adding a fresh, auxiliary and nested terminal

T

, and adding the rule

T \rightarrow \alpha

and thus

T = \langle {\alpha} \rangle

Isolation has the effect of spawning off a separate parsing process for its argument which cannot be inﬂuenced by token selection issues outside of this process.

Note that we cannot have prioritized choice without isolating the arguments ﬁrst, i.e.

\alpha / \beta

has the same meaning as

\langle {\alpha} \rangle / \langle {\beta} \rangle

. On the other hand,

\langle {\alpha} \rangle\,|\,\langle {\beta} \rangle

has in general a different semantics than just

\alpha\,|\,\beta

.

Inspired by these examples, we extend the notion of Pyramid Grammars to allow rules of the form

X \rightarrow g

where

X \in \mathcal{N} \cup \mathcal{T}

and

g

is a grammar expression. So far, grammar expressions are inductively deﬁned as follows:

But we don’t have to stop here, there are more grammar expressions we can deﬁne:

Here we deﬁne

g? = g / \varepsilon

and

[g] = g\, |\, \varepsilon

.

And we can also add repetition operators:

Let us ﬁrst deﬁne greedy repetition $g^*$ . We add a fresh nonterminal

N

(auxiliary and nested), and add the rule

N \rightarrow (g N)?

This yields

N = g^*

.

For non-greedy repetition $\{g\}$ we proceed analogously. We add a fresh nonterminal

N

(auxiliary and nested), and add the rule

N \rightarrow [N g]

This gives us

N = \{g\}

.

The remaining operators can be deﬁned as

g^+ = g g^*

and

\{^+ g\} = g \{g\}

.

Grammar expressions as deﬁned so far cover the complete Extended Backus-Naur form [7] and almost all constructs of Parsing Expression Grammars (PEG) [6]. What is missing to complete our set are the PEG lookahead operators

\& g

for positive lookahead, and

!g

for negative lookahead.

Unlike all other grammar expression operators we have encountered previously, the lookahead operators cannot be viewed simply as syntactic sugar that can be desugared into our current framework of Pyramid Grammars.

The notion of Pyramid Grammar is itself just “semantic sugar” on top of Local Lexing, though. So all we need to do is to expand our notion of Pyramid Grammar to let some more of the power of Local Lexing back in. The simplest way to do this is to allow that on the right hand side of rules in

\mathcal{R}

not only nonterminals

N

and terminals

T

can appear, but also negated terminals

!T

.

The construction of the semantics

\mathcal{G}

stays the same, with the following differences:

\mathcal{T}_\mathcal{G} = \mathcal{T}\, \cup\, !\mathcal{T}

The deﬁnition of wellfoundedness is adapted such that there are additional directed edges

!T \twoheadrightarrow T

in the dependency graph, and such that nullability of

!T

is taken into account.

We can now deﬁne grammar expressions for lookahead:

For the deﬁnition of

!g

we create fresh terminals

T

and

U

and add the rules

\begin{array}{rcl} T & \rightarrow & g \\ U & \rightarrow & !T \\ \end{array}

which results in

U =\ !g

.

Both

T

und

U

are declared as auxiliary and atomic, i.e. hidden.

Positive lookahead is just doubly negative lookahead:

\&g =\ !!g

This shows that Pyramid Grammars are at least as expressive as PEGs and Context-Free Grammars. Indeed, because there are languages recognized by PEGs that are not context-free, Pyramid Grammars are strictly more expressive than Context-Free Grammars.

So, this is the new semantics for Local Lexing. Its new and hopefully improved way of handling empty tokens and avoiding conﬂicts between non-empty tokens is promising, especially in how it ﬁts together with the new notion of selector consistency. Selector consistency is guaranteed for both ﬁnite local lexings and for priority relations, which covers all types of interesting selectors I have encountered so far.

I have implemented an algorithm for parsing according to this new semantics [8] →. It is adapted from the parsing algorithm from the original Local Lexing paper [1] →, which is based on Earley’s algorithm [9]. I have tried it out on Recursive teXt and it seems to be working correctly. But while the original algorithm has a formal correctness proof [10] →, I have no proof for the new algorithm yet. Nevertheless, I would expect the new proof to be less involved and technical than the original proof as the new semantics seems more intuitive.

The most important point of it all is that the parsing algorithm itself is not that important, and actually not covered at all in this article. This work started out as an investigation into using Earley’s parsing algorithm for composable syntax, and was particularly inspired by the use of blackboxes in [11] →. But what has been discovered through this work is not so much a particular parsing algorithm, but that Local Lexing is a general semantics for language and parser speciﬁcations.

Pyramid Grammars are how you use Local Lexing in practice. Each terminal in a Pyramid Grammar gives rise to a subgrammar. It seems obvious that each of these subgrammars should be analysed by a parser generator to ﬁnd the parsing algorithm best suited for it. This seems to invite a renewed research into parsing algorithms with a focus on how they can be adapted to Local Lexing. Improvements in parsing algorithms could be directly passed on to the user who does not even have to be aware that improvements have been made. This is because the UI of the parser generator is most importantly its semantics, which is Local Lexing / Pyramid Grammars, and stable.

Special care has been taken to include inﬁnite grammars. These arise naturally from parameterized Local Lexing Grammars [12] →. The grammar for Recursive teXt is a parameterized pyramid grammar, and I have found parameters to be very useful indeed. Parameterized Pyramid Grammars are basically just L-attributed [13] Pyramid Grammars. This makes sense, as on one hand L-attributed grammars are the largest class of attribute grammars supporting parser-driven attributes, and on the other hand such support is essential for Local Lexing as attribute evaluation may inﬂuence token selection. Parsing techniques for Local Lexing based on GLR parsing [14] and ECLR-attributed parsing [15] seem particularly promising.

A New Semantics for Local Lexing

Foreword

Introduction

Local Lexing

Tokens

The Lexer

The Selector

Consistency

Priority Relations

Semantics of Local Lexing

Context-Free Grammars

Local Lexing Grammars

Paths

Constructing $\mathcal{P}_k$

Constructing $\mathcal{Z}_k$

Was it all worth it?

Pyramid Grammars

Deﬁnition of Pyramid Grammars

Semantics of Pyramid Grammars

Wellfounded Pyramid Grammars

Composability

Renaming

Union

Import

Parse Trees

Grammar Expressions

Lookahead Operators

Conclusion

References

A New Semantics for Local Lexing

Foreword

Introduction

Local Lexing

Tokens

The Lexer

The Selector

Consistency

Priority Relations

Semantics of Local Lexing

Context-Free Grammars

Local Lexing Grammars

Paths

Constructing Pk\mathcal{P}_kPk​

Constructing Zk\mathcal{Z}_kZk​

Was it all worth it?

Pyramid Grammars

Deﬁnition of Pyramid Grammars

Semantics of Pyramid Grammars

Wellfounded Pyramid Grammars

Composability

Renaming

Union

Import

Parse Trees

Grammar Expressions

Lookahead Operators

Conclusion

References

Constructing $\mathcal{P}_k$

Constructing $\mathcal{Z}_k$