A New Semantics for Local Lexing
May 18th, 2021
A revised semantics for Local Lexing is presented that entirely avoids priority conﬂicts between non-empty tokens. Pyramid Grammars are introduced, and their semantics is given in terms of Local Lexing. It is demonstrated that they are more expressive than Context-Free Grammars, and at least as expressive as Parsing Expression Grammars.
In 2017 I introduced in joint work with Phil Scott and Jacques Fleuriot the concept of Local Lexing 
. Recently, I have applied Local Lexing to design the grammar for Recursive teXt
, a language inspired by both Markdown 
and TeX 
, and also more generally by the paradigm of functional (and declarative) programming  →
. All articles on obua.com
are written in Recursive teXt. Recursive teXt is still in an early stage and experimental.
During the design process I gained (and am still gaining) new insights into applying Local Lexing. Not only that, it also became apparent that the semantics of Local Lexing needs to be revised for practical reasons. So this article does that, and it also introduces new concepts. The article is self-contained and can be read and understood independently from  →
Traditionally, the parsing of (computer) languages happens in two phases: First comes a phase of lexical analysis
, which is followed by a phase of syntax analysis 
. The lexical analysis phase partitions the input consisting of a sequence of characters
into a sequence of tokens
, and is usually speciﬁed via regular expressions
. The syntax analysis operates purely on the resulting sequence of tokens, more speciﬁcally, on the resulting sequence of token kinds called terminals
; this results in a parse tree
. The syntax analysis is usually speciﬁed by a Context-Free Grammar
The three main reasons why parsing is divided into two phases are:
- Regular expressions can be implemented by deterministic ﬁnite state machines, which is usually much more eﬃcient than parsing with respect to a CFG.
Although regular expressions are strictly less expressive than CFGs, by splitting the analysis into two phases one can insert additional rules between these two phases which increase the overall expressivity over that of CFGs. These additional rules are usually considered to be part of the lexical analysis, and concern mostly the priority with which tokens are selected in case of conﬂicts:
- The longest match rule prioritizes longer tokens over shorter tokens.
- Furthermore, token speciﬁcations are totally ordered. In case of conﬂicting tokens of the same length, the token with the higher-ranked speciﬁcation is chosen.
- Issues like the treatment of whitespace and comments can often be implemented by an additional rule to simply drop certain kinds of tokens after the lexical analysis instead of passing them on to the syntax analysis.
One disadvantage of this split into two distinct phases is that two different formalisms are employed for specifying the language: one formalism uses regular expressions, the other one CFGs.
But its most signiﬁcant disadvantage is that it negatively affects the composability of language and parsing speciﬁcations. The syntax analysis phase is unproblematic as it is based on CFGs, which compose. But the lexical analysis phase requires that the input can be uniquely decomposed into a single sequence of tokens. Therefore parts from different language speciﬁcations cannot be easily combined, as they have possibly different and conﬂicting token speciﬁcations. Usually, these conﬂicts cannot be resolved automatically without corrupting the semantics of the original parts.
avoids this problem by dropping the requirement for lexical analysis to produce a unique partition of the input into tokens. Instead, all “sensible” partitions are allowed. Whether or not a partition is “sensible” is determined by a parsing process that intertwines lexical analysis and syntax analysis. This allows for token conﬂict resolution based on the local parsing context. What exactly “sensible” means will be made concrete in the following sections, and differs for this new semantics of Local Lexing from the one presented in  →
On the basis of Local Lexing, Pyramid Grammars
will then be introduced, a simple language and parser speciﬁcation mechanism which composes and is more expressive than Context-Free Grammars and at least as expressive as Parsing Expression Grammars 
A local lexing
is a quadruple
- is the set of terminals.
- is a set of characters called the alphabet .
- is a function with certain properties, and called the lexer.
- is a function with certain properties, and called the selector.
The only important thing about
is that they are sets. Otherwise, they can be chosen arbitrarily. But before we can understand the role that lexer and selector play, we need to understand what tokens
For any set
denote the set of all ﬁnite sequences formed from elements in
. In particular,
contains the empty sequence
. For two sequences
for their concatenation, which is of course also in
. Given a sequence
for the length of
-th element of
is a pair
. So a token is just a sequence of characters tagged with a terminal. We let
denote the set of all tokens:
. These notions are extended in the obvious way to sequences of tokens:
Given a set
we denote its subset of empty tokens
The task of lexical analysis is to break down the input document
into tokens. A local lexing speciﬁes how to do this locally
at a given position
. The task of the lexer is to recognize all possible token candidates at position
, and is therefore a function taking a document and a position as arguments, and returning a set of tokens:
to denote the powerset of
, i.e. the set of all subsets of
must have the property that for all documents
and all positions
within that document, it returns only tokens
is actually a subsequence of
The lexer might return more than one token candidate at a given position. Sometimes we might be ﬁne with the resulting ambiguity, but often we want to be able to impose constraints on which token candidates are actually chosen as tokens. One example is that if one token is a preﬁx of another, longer token, we might want to choose the longer token only and ignore the shorter one.
allows us to impose such constraints. The basic idea is simple: Given a set of token candidates
, the selector chooses a subset
such that only tokens in that subset are considered.
The potential presence of empty tokens
complicates this simple idea. As we will see soon, if the lexer allows empty tokens in a position, then token discovery at that position turns into a greedy iterative process. Tokens discovered and chosen earlier in that process cannot be ignored later in the same process. We accommodate this by passing
not only an upper bound, from which we are to choose the tokens, but also a lower bound, which contains the tokens that have been chosen so far. Thus the selector has the signature
and it must have the property that for all token sets
the following bounds hold for the selected tokens:
Note that the lower bound
is guaranteed to consist only of empty tokens. The idea here is that we delay deciding on which non-empty tokens to choose until we have discovered all empty tokens in a position. This is the difference to the old version of Local Lexing presented in  →
, where there was no such guarantee for
If no empty tokens are in play in a given position, then the lower bound will necessarily be the empty set, turning
back into a function with a single upper-bound argument
In this new version of Local Lexing we require the selector to have one more property, which we call consistency
. It allows us to give a uniform treatment of both ﬁnite and inﬁnite sets of characters and terminals. We call a family of sets
. We call a selector consistent for document at position
iff for any two chains
the following holds:
We call a selector consistent everywhere
(or just consistent
) if it is consistent for all documents
at all positions
We require to be consistent everywhere.
Note that if both
are ﬁnite sets, the selector is clearly consistent everywhere. Because in this case
is also a ﬁnite set, and this means that both chains
converge. Thus there is
, which implies
A convenient way to deﬁne a selector is by providing a priority relation
For two tokens
to mean that
has lower priority than
For example, let
be an injective function assigning to each terminal the position (e.g. line in a grammar speciﬁcation ﬁle) where it was speciﬁed. The priority relation used for traditional lexical analysis is then deﬁned by
In this case,
is even a total order when restricted to the set of tokens appearing at the same position in the same document, but we do not require any properties of
beyond being a relation on tokens.
If selectors depended only on their upper bound argument
, there would be a canonical way how
induces a selector by just choosing all maximal elements in
But selectors must also respect their lower bound argument
, and therefore the above deﬁnition needs to be adjusted. There are at least two ways to do this:
irst in B
locks”). This way deﬁnes the selector as
It adds those maximal elements of
that are not in conﬂict with any elements in
. This means that tokens in
block conﬂicting tokens in
from being chosen, even if those tokens in
have higher priority than those already in
revails”). This way deﬁnes the selector as
It adds all maximal elements of
regardless of the content of
. As a consequence, lower priority empty tokens might be chosen although higher priority tokens have been chosen as well. On the other hand, a lower priority token can never block a higher priority one.
Note that FB and PP coincide for
. In practice PP seems to be a better choice than FB, as it is more in line with our intuitive concept of priority.
It is easy to see that both FB and PP produce selectors that are consistent everywhere. For example, to show consistency of a PP selector, assume
for an otherwise arbitrary token
. Based on the deﬁnition of PP this implies
In case of
, there exists
and therefore also
In case of
, there exists
. This implies
and we again arrive at
Semantics of Local Lexing
So far we have deﬁned what a local lexing is, but we have not yet talked about how to use it. In fact, on its own a local lexing is pretty useless. For it to obtain a meaning, it needs to be paired with a Context-Free Grammar, resulting in a Local-Lexing Grammar.
Let us ﬁrst remind ourselves what a Context-Free Grammar is, and introduce the relevant notation. A Context-Free Grammar
is a quadruple
is its set of nonterminals
its set of terminals
its set of rules
its start symbol
. Nonterminals and terminals are required to be disjoint sets, i.e.
we also write
. We write
iff there are
. Furthermore we let
denote the reﬂexive and transitive closure of
is deﬁned as the set of all terminal sequences derivable from the start symbol, i.e.
The preﬁx language
is deﬁned as
Local Lexing Grammars
A Local Lexing Grammar
is a tuple
is a Context-Free Grammar and
is a local lexing.
To denote the same Local Lexing Grammar as
, but with a different start symbol
, we write
We will deﬁne the semantics
to be a set of token sequences, i.e.
by separately constructing
for each document
Once we have constructed
, we can deﬁne
, we will assume a ﬁxed document
from now on to simplify our notation. For example, we will just write
is deﬁned by recursively constructing
and then setting
The elements of
are the “sensible” partitions of the input
into tokens we alluded to earlier →
. Furthermore the elements
are “sensible” partitions of preﬁxes
of the input
into tokens. We can say a priori (i.e. before actually having constructed
) more about such
than them just being sequences of tokens. We demand that
must be a path
. We deﬁne the set
of paths to consist of those
- Given a decomposition of into tokens we demand that each of these tokens can be lexed at the appropriate position: This implies that is a preﬁx of , of course.
Furthermore we deﬁne the set of paths that stop at position
Paths are an auxiliary tool to aid us in deﬁning
. But they are an interesting notion in itself already, as they showcase how lexical analysis
and syntax analysis
are intertwined in Local Lexing. They nicely package up the inﬂuence of
on the Local Lexing semantics: We will not need to refer to
anymore to ﬁnish the deﬁnition of the semantics. This also means that the semantics does not depend on the particular shape of
, but just on
and its preﬁx language
has no unproductive nonterminals, i.e. for all
is uniquely determined by
and therefore the semantics just depends on
alone (and on the local lexing
, of course).
Indeed, if Local Lexing did not provide a mechanism for constraining token selection, or equivalently, if the selector would be trivial, i.e.
for all arguments
, then we would already be done, as we could then just set
This also shines a spotlight on the importance of the selector mechanism in Local Lexing. It is its deﬁning feature.
The basic idea for constructing the
is to construct in tandem also a family of token sets
. The dependencies here are as follows:
If you think of the
as the result of syntax analysis, and the
as the result of lexical analysis, then these dependencies show how Local Lexing intertwines syntax analysis and lexical analysis.
The straightforward part is how
is determined by
Here we just append tokens in
to those paths in
that stop at
, as long as the result is still a path. Because appending empty tokens in
potentially leads to new paths stopping at
which also need to be taken into consideration, in general
needs to be speciﬁed as a limit process.
was guaranteed not to contain empty tokens, the other part would be straightforward as well:
consists of all tokens that could “sensibly” extend some path in
that stops at position
The trouble is that if contains any empty token
, then we should really have taken
into account before
as it might create more sensible paths that stop at
. Therefore, in the presence of empty tokens, we need a more elaborate approach for constructing
We construct chains
and can then deﬁne
Was it all worth it?
But have we actually gained something by this more elaborate approach? After all, if there is an empty token
is not in any of the
then we would again
have the same problem of possibly having missed some sensible paths.
This is where the consistency of the selector comes to the rescue. Because of it we have
This shows that when constructing
making the ﬁnal selection of which non-empty
tokens to select, it has already been decided which empty
tokens to select. The ﬁnal selection will not add any new empty tokens, and therefore we can safely proceed to the next position
without missing out on sensible paths that have (intermediate) stops at
. Furthermore, all non-empty tokens at
will be selected at once
. This means that there will not be any priority conﬂicts between non-empty tokens that need to be resolved the FB or the PP way →
. Such conﬂicts can only possibly occur between empty tokens and other empty tokens, or between lower priority empty tokens and higher priority non-empty tokens. This is the major improvement over the original Local Lexing semantics where the presence of empty tokens could introduce such conﬂicts between non-empty tokens.
A Local Lexing Grammar is a tuple consisting of seven components and somewhat unwieldy for directly specifying parsers. Furthermore, although I have claimed that Local Lexing is a panacea for composable speciﬁcations, I have not provided direct evidence for this claim so far.
Luckily, there is a powerful speciﬁcation mechanism that is more compact than Local Lexing Grammars but which can still be understood in terms of Local Lexing Grammars.
Deﬁnition of Pyramid Grammars
A Pyramid Grammar
is a tuple
is its set of nonterminals
its set of terminals
its set of rules
its start symbol
A Pyramid Grammar looks just like a Context-Free Grammar, with two extensions:
- The left hand side of a rule can be a terminal, too.
- There are relations and on terminals jointly specifying priorities.
Semantics of Pyramid Grammars
We are giving a Pyramid Grammar
a meaning by turning it into a Local Lexing Grammar
, and then deﬁning its alphabet
The terminals and the start symbol of
are the same:
are those terminals which do not appear on the left side of any rule in
we add a fresh nonterminal
are those rules in
with a nonterminal on the left side, plus rules for the new nonterminals:
is deﬁned the PP way →
into a priority relation
on tokens via
This deﬁnition reﬂects that
represents unconditional priority
represents longest match priority
. Furthermore, for two tokens with the same associated terminal, the longer token has higher priority.
The tricky part is the deﬁnition of the lexer component. Firstly, we will deﬁne a separate lexer
for each terminal
and then deﬁne for any
the lexer is easily deﬁned:
We use →
This means that we parse
with the same Local Lexing Grammar
, but use
as a start symbol!
Wellfounded Pyramid Grammars
The above semantics for Pyramid Grammars is simple and intuitive, but unfortunately as a deﬁnition technically wrong. The problem is that we are deﬁning
in terms of
, leading to a cyclic deﬁnition.
All components of
are deﬁned nonrecursively except for the deﬁnition of the
. So the problematic dependencies are those between the “calls” to
, a ﬁxed
. Therefore, if we can ﬁnd a wellfounded relation on
along those calls, then
If there are no left-recursive rules for terminals, either directly or indirectly, and assuming that
is ﬁnite, then such a wellfounded relation exists.
To see why, let us ﬁrst clarify what is meant by “no left-recursive rules for terminals”.
Consider the Context-Free Grammar
. We say that
We deﬁne a directed dependency graph
such that the nodes of
are the elements of
. There is a directed edge from node
for some nullable
, and in this case we write
. “No left-recursive rules for terminals” means then just that
is acyclic. The relation
is deﬁned as
This relation clearly holds along calls of
because of the way
is deﬁned. Furthermore
is wellfounded, because
is ﬁnite and acyclic, and thus
This leads us to call Pyramid Grammars wellfounded
is a wellfounded relation. The semantics of wellfounded Pyramid Grammars is well-deﬁned via the Local Lexing Grammar
Composability is a big strength of Pyramid Grammars. As examples, we will consider three operations supporting composition of Pyramid Grammars in the following: renaming, union and import.
Given a bijection
, the renaming
of the Pyramid Grammar
is the Pyramid Grammar
deﬁned in the obvious way.
is wellfounded iff
is wellfounded, and
Given two Pyramid Grammars
as long as
, you can form its union
is a fresh nonterminal and
It is not necessarily true that
is wellfounded if both
are wellfounded. Consider
and their union
are wellfounded with
is not wellfounded, as there is a cycle in the dependency graph of
. Note that
It is also not necessarily true that even if
is wellfounded, that then
. Consider the union
and therefore clearly
Given wellfounded Pyramid Grammars
holds, then their union
is wellfounded, too, and
is actually true.
Forming the language union of two Pyramid Grammars is always possible, because if the above condition does not hold, a renaming can be applied to
to ensure that the condition holds.
A Pyramid Grammar Fragment
is a Pyramid Grammar without a designated start symbol. The notions of alphabet and wellfoundedness are independent of the start symbol and transfer therefore over to fragments.
Adding a fragment
to a Pyramid Grammar
results in a Pyramid Grammar
We say that
is the import
if the following conditions hold:
- is a wellfounded fragment
In this case, when
is wellfounded, so is
The effect of an import is to swap out some, maybe all, characters of
by their deﬁnitions in
in terms of
could deﬁne a library of terminals, for example numbers, identiﬁers, etc., and a Pyramid Grammar
could import them by simply adding the fragment.
In the extreme case of
, i.e. the entire alphabet
is swapped out for the alphabet
of the fragment. This recreates within the setting of Pyramid Grammars the traditional separation of lexical analysis and syntax analysis, where
can be considered as being responsible for syntax analysis,
for lexical analysis, and
represents the parser resulting from combining the two.
When parsing, one is usually not only interested in whether a given document is in the language speciﬁed by the Pyramid Grammar, but also in some sort of result derived from the discovered document structure. Traditionally, parser generators allow to associate custom actions with grammar rules. These custom actions then compute a result.
It is ill-advised to pursue this approach in a parser generator for Pyramid Grammars. Two consumers of the same grammar have often not the same use case and need different kinds of results. Given how easy it is to compose Pyramid Grammars, the results of parsing with Pyramid Grammars should also compose nicely.
A simple solution to this is to just use the parse tree as a result, a concrete syntax tree. Often though, such a concrete syntax tree follows too much the structure of the parse and contains therefore too much noise and unwanted structure.
To ﬁx this, one can classify symbols in
along two axes: visibility
. Both axes consist of two values. The value of visibility can be either visible
. The value of structure can be either nested
. For a symbol to be classiﬁed as auxiliary means that it has no corresponding node in the parse tree of its own, but is just represented by the collection of its children. For a symbol to be classiﬁed as atomic
means that its children will not become part of the parse tree. A symbol that is classiﬁed as both atomic
is effectively hidden
and will have no representation in the result.
By default, terminals are atomic and visible, while nonterminals are nested and visible. These defaults are overridable at speciﬁcation time, and / or by the user of a Pyramid Grammar.
With the help of visibility and structure the parser can generate a tree more akin to an abstract syntax tree than to a concrete syntax tree. By analysing the grammar a parser generator can even automatically generate data types for the parse result useful for static type checking.
Note that parse trees can be ambiguous. While an ambiguous grammar is usually something to be avoided, during the development of the grammar easy inspection of ambiguities is important, and parse trees, properly implemented, can support that.
It is possible to implement a prioritized choice operator
as it is used in Parsing Expression Grammars 
. Given two sequences
, we can add a fresh nonterminal
and two fresh terminals
and add the rules
We also add the priority
The effect of this is that
now stands for the prioritized choice between
To aid in parse tree creation, we declare
Of course, we can also add a standard choice operator
by creating a fresh nonterminal
(auxiliary and nested) and adding the rules
The effect is that
An interesting operator is the isolation
. We implement it by adding a fresh, auxiliary and nested terminal
, and adding the rule
Isolation has the effect of spawning off a separate parsing process for its argument which cannot be inﬂuenced by token selection issues outside of this process.
Note that we cannot have prioritized choice without isolating the arguments ﬁrst, i.e.
has the same meaning as
. On the other hand,
has in general a different semantics than just
Inspired by these examples, we extend the notion of Pyramid Grammars to allow rules of the form
is a grammar expression
. So far, grammar expressions are inductively deﬁned as follows:
- Any is a grammar expression.
- If and are grammar expressions, then so is their concatenation .
- If is a grammar expression, then so is .
- If and are grammar expressions, then so are and .
But we don’t have to stop here, there are more grammar expressions we can deﬁne:
- If is a grammar expression, then so are and .
Here we deﬁne
And we can also add repetition operators:
- If is a parsing expression, then so are , , and .
Let us ﬁrst deﬁne greedy repetition
. We add a fresh nonterminal
(auxiliary and nested), and add the rule
For non-greedy repetition
we proceed analogously. We add a fresh nonterminal
(auxiliary and nested), and add the rule
This gives us
The remaining operators can be deﬁned as
Unlike all other grammar expression operators we have encountered previously, the lookahead operators cannot be viewed simply as syntactic sugar that can be desugared into our current framework of Pyramid Grammars.
The notion of Pyramid Grammar is itself just “semantic sugar” on top of Local Lexing, though. So all we need to do is to expand our notion of Pyramid Grammar to let some more of the power of Local Lexing back in. The simplest way to do this is to allow that on the right hand side of rules in
not only nonterminals
can appear, but also negated
The construction of the semantics
stays the same, with the following differences:
- now also includes the negated terminals , i.e.
- We need to deﬁne additional lexers for the negated terminals. For we set
The deﬁnition of wellfoundedness is adapted such that there are additional directed edges
in the dependency graph, and such that nullability of
is taken into account.
We can now deﬁne grammar expressions for lookahead:
- If is a grammar expression, then so are and .
For the deﬁnition of
we create fresh terminals
and add the rules
which results in
are declared as auxiliary and atomic, i.e. hidden.
Positive lookahead is just doubly negative lookahead:
This shows that Pyramid Grammars are at least as expressive as PEGs and Context-Free Grammars. Indeed, because there are languages recognized by PEGs that are not context-free, Pyramid Grammars are strictly more expressive than Context-Free Grammars.
So, this is the new semantics for Local Lexing. Its new and hopefully improved way of handling empty tokens and avoiding conﬂicts between non-empty tokens is promising, especially in how it ﬁts together with the new notion of selector consistency. Selector consistency is guaranteed for both ﬁnite local lexings and for priority relations, which covers all types of interesting selectors I have encountered so far.
I have implemented an algorithm for parsing according to this new semantics  →
. It is adapted from the parsing algorithm from the original Local Lexing paper  →
, which is based on Earley’s algorithm 
. I have tried it out on Recursive teXt and it seems to be working correctly. But while the original algorithm has a formal correctness proof  →
, I have no proof for the new algorithm yet. Nevertheless, I would expect the new proof to be less involved and technical than the original proof as the new semantics seems more intuitive.
The most important point of it all is that the parsing algorithm itself is not that important, and actually not covered at all in this article. This work started out as an investigation into using Earley’s parsing algorithm for composable syntax, and was particularly inspired by the use of blackboxes
in  →
. But what has been discovered through this work is not so much a particular parsing algorithm, but that Local Lexing is a general semantics for language and parser speciﬁcations
Pyramid Grammars are how you use Local Lexing in practice. Each terminal in a Pyramid Grammar gives rise to a subgrammar. It seems obvious that each of these subgrammars should be analysed by a parser generator to ﬁnd the parsing algorithm best suited for it. This seems to invite a renewed research into parsing algorithms with a focus on how they can be adapted to Local Lexing. Improvements in parsing algorithms could be directly passed on to the user who does not even have to be aware that improvements have been made. This is because the UI of the parser generator is most importantly its semantics, which is Local Lexing / Pyramid Grammars, and stable.
Special care has been taken to include inﬁnite grammars. These arise naturally from parameterized
Local Lexing Grammars  →
. The grammar for Recursive teXt is a parameterized pyramid grammar, and I have found parameters to be very useful indeed. Parameterized Pyramid Grammars are basically just L-attributed 
Pyramid Grammars. This makes sense, as on one hand L-attributed grammars are the largest class of attribute grammars supporting parser-driven attributes, and on the other hand such support is essential for Local Lexing as attribute evaluation may inﬂuence token selection. Parsing techniques for Local Lexing based on GLR parsing 
and ECLR-attributed parsing 
seem particularly promising.
(2007). Compilers: Principles, Techniques, & Tools. (2004). Parsing Expression Grammars: a recognition-based syntactic foundation, Proceedings of the 31st ACM SIGPLAN-SIGACT symposium on Principles of programming languages, https://doi.org/10.1145/964001.964011. (2010). Semantics and Algorithms for Data-dependent Grammars, Proceedings of the 37th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages, https://doi.org/10.1145/1706299.1706347. (1995). Parser-driven attributes: L-attributed grammars, Compiler Design, Addison-Wesley, p. 447. Masaru Tomita (ed.). (1991). Generalized LR Parsing, Kluwer Academic Publishers. (1987). ECLR-Attributed Grammars: A Practical Class of L-Attributed Grammars, Information Processing Letters 24, pp. 31-41, North-Holland.