I propose a simple Unicode-based lexical syntax for programming language identifiers using characters from international scripts (currently Latin, Greek, Cyrillic and Math). Such cosmopolitan identifiers are designed to achieve much of the simplicity of Fortran identifiers while acknowledging a modern international outlook. This seems particularly advantageous in contexts where such identifiers are not (only) used by professional programmers, but are exposed to normal users, for example through scriptable applications.
Introduction
Possibly the oldest programming language still in use, Fortran[1], has an especially simple lexical syntax for identifiers: They consist of the letters A to Z and a to z, the digits 0 to 9, and the underscore _. An identifier must start with a letter. Moreover, identifiers are case-insensitive: The identifier Fortran denotes the same thing as the identifier fortran, for example.
Especially for users of programming languages who are not professional programmers, such as many scientists and engineers, this syntactical simplicity seems to be a big advantage. Given that Fortran code is used, among other things, to control nuclear power plants, it certainly is a good thing to reduce the potential for confusion and misunderstanding as much as possible.
Identifiers are an important concept in computing: They make it possible to reference and abbreviate things, and to introduce a level of indirection. Computer users who are not programmers are exposed to the concept of identifiers as well, through the filesystem abstraction of their operating system. Attempts to make this abstraction unavailable to the user in Apple’s iOS had to be rolled back in more recent versions via the introduction of a dedicated Files app.
Scriptable computer applications like Blender also make some of their internals available to power users for automation purposes via built-in scripting engines. These are of course just programming languages, and therefore heavily reliant on identifiers.
Even though artificial intelligence is revolutionizing the interaction between humans and computers, it is my conviction that identifiers will nevertheless gain importance as a concept and become even more mainstream than they already are. In many situations, use of a precise identifier just beats a hand-wavy negotiation with an AI, and that is not going to change.
Given the fundamental importance of identifiers, people must be able to express them in the script of their native tongue. This is self-evident in file systems: It would be unthinkable in a modern setting if people could only use Fortran identifiers as file names. It is also much easier to teach the use of identifiers to children if identifiers can be expressed in their native tongue.
Furthermore, it can be very convenient, especially in scientific applications, to be able to use symbols like π directly as identifiers, or to use them as part of identifiers, as in approximation-of-π.
Our design goal for identifiers consists therefore of three subgoals:
They should be almost as simple as Fortran identifiers.
They should support international scripts.
They should support the use of symbols as (part of) identifiers.
Unicode Identifiers
Unicode, billing itself as the “World Standard for Text and Emoji”, recognizes the importance of identifiers and actually provides the concept of Unicode identifiers[2]. Modern programming languages like Swift have incorporated Unicode identifiers into their syntax: ℱ𝒪ℛ𝒯ℛ𝒜𝒩 is a perfectly valid identifier in Swift, and so is Φορτραν.
Nevertheless, this lexical diversity also comes at a price. Having to deal with Greek letters in an application programming interface would inconvenience me very much, at least when they are used casually instead of just designating entities like π.
There is also a certain unwieldiness that Unicode brings to the table. Properties like boldness are part of the definition of some characters, and so you can have two different identifiers Fortran and 𝐅𝐨𝐫𝐭𝐫𝐚𝐧. This is messy, and very far from the clarity and simplicity that identifiers in Fortran provide. Granted, one will rarely encounter such a misuse of Unicode in practice, but why invite it in the first place?
Intuitively, there is a difference between names and symbols. That is the whole reason that most programming languages make this distinction in the first place. In my opinion, the concept of Unicode identifiers does not respect this distinction enough to be adequate in a programming language context. For example, the characters ℱ and 𝔽 are better treated as symbols than as arbitrary letters.
Clearly, while Unicode identifiers support international scripts, and can contain symbols, they are not as simple as Fortran identifiers at all, but a rather messy affair. Therefore Unicode identifiers only fulfill two of our three design goals for identifiers.
Cosmopolitan Identifiers
I propose cosmopolitan identifiers (CIDs) as a middle ground between Fortran identifiers and Unicode identifiers. A CID is a sequence of Unicode characters consisting of letters, digits and separators. The following properties hold for a CID:
A CID must start with a letter or symbol.
A CID must not end with a separator.
A CID must not contain consecutive separators.
A separator is a hyphen - (U+002D). I favour the hyphen over the underscore for aesthetic reasons. Later we will look at how other characters, such as underscores and spaces, can be used as separators as well.
A digit is a character between (and including) 0 (U+0030) and 9 (U+0039).
To fully define what a CID is, it is necessary to describe the set of letters, which sequences of letters are allowed, and to state when two CIDs are considered to be equivalent. All of this will be done in the following sections.
The set of letters initially contains just the lowercase letters from a (U+0061) to z (U+007A). We will later extend this set with Latin, Greek and Cyrillic characters. We will also extend it with letter-like mathematical symbols.
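As a rough illustration, the structural rules above can be sketched in Python for the initial letter set only. The function name is_stage0_cid and the use of a regular expression are assumptions of this sketch, not part of the proposal, and the letters and symbols added in later stages are not covered.

```python
import re

# Stage-0 letters only (a-z), digits 0-9, and the hyphen as sole separator.
# Every hyphen must be followed by at least one letter or digit, which rules
# out both trailing and consecutive separators.
STAGE0_CID = re.compile(r"[a-z][a-z0-9]*(?:-[a-z0-9]+)*")


def is_stage0_cid(text: str) -> bool:
    """Return True if text obeys the structural CID rules for letters a-z."""
    return STAGE0_CID.fullmatch(text) is not None
```

With this sketch, xyz-12p-5 is accepted, while x-, x--y and 1x are rejected.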
Equivalence of CIDs
Every CID is a Unicode identifier, but not every Unicode identifier is a CID. Also, equivalence of CIDs is defined differently than both canonical equivalence and compatibility[3] as defined in the Unicode standard. Two cosmopolitan identifiers which are canonically equivalent as Unicode identifiers are always also equivalent in the cosmopolitan sense, but not necessarily the other way around.
To decide equivalence of two CIDs, we map each CID c to a normal CID N(c). Then two CIDs a and b are considered to be equivalent iff N(a) and N(b) are identical.
We construct $N(c)$ by dividing $c$ into maximal non-empty subsequences $s_1, \ldots, s_k$, each consisting either only of letters or only of digits. Separators themselves are ignored, but they limit how far a subsequence of letters or digits can extend. We map each $s_i$ separately to its normal form $\mathrm{NF}(s_i)$, and then form the CID $\mathrm{NF}(s_1)\text{-}\mathrm{NF}(s_2)\text{-}\ldots\text{-}\mathrm{NF}(s_k)$. Note that if the normal form of one of the $s_i$ does not exist, i.e. is undefined, then the normal form of $c$ does not exist either.
For a complete description of N we therefore only need to describe NF(l) for a sequence l of letters, and NF(d) for a sequence d of digits.
Digits are easy to handle: we just set $\mathrm{NF}(d) = d$. How we map general sequences of letters to their normal forms is described in the following sections. For a sequence $x$ consisting just of lowercase letters between a and z, we proceed just as we do for digits: $\mathrm{NF}(x) = x$. As an example, consider the CIDs xyz12p5 and xyz-12p-5. They both decompose into the subsequences xyz, 12, p and 5, and therefore have the same normal CID xyz-12-p-5.
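The construction of the normal CID can be sketched as follows for identifiers that use only the letters a to z. This is a minimal sketch: the names normal_cid_stage0 and equivalent are hypothetical, and the input is assumed to already be a valid CID.

```python
import re


def normal_cid_stage0(cid: str) -> str:
    """Split a CID into maximal runs of letters or digits and rejoin them.

    Assumes cid is a valid CID over the letters a-z only, so that NF is the
    identity on every run.
    """
    # Separators only delimit runs, so findall simply skips over them.
    runs = re.findall(r"[a-z]+|[0-9]+", cid)
    return "-".join(runs)


def equivalent(a: str, b: str) -> bool:
    """Two CIDs are equivalent iff their normal CIDs are identical."""
    return normal_cid_stage0(a) == normal_cid_stage0(b)
```

Here both xyz12p5 and xyz-12p-5 normalize to xyz-12-p-5, so they are equivalent, whereas xy-z and xyz are not.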
Words vs. Symbols
On the one hand we would like to treat Tree and tree as equivalent identifiers. On the other hand it is often convenient to treat T and t as different symbols. After all, they look entirely different. This creates a tension which we resolve by distinguishing between three different classes of sequences of letters:
words
symbols
invalid letter sequences
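Depending on the class, there are different normal forms $\mathrm{NF}_{\mathrm{word}}$ and $\mathrm{NF}_{\mathrm{symbol}}$, and we define
$$\mathrm{NF}(l) = \begin{cases} \mathrm{NF}_{\mathrm{word}}(l) & \text{when } l \text{ is a word} \\ \mathrm{NF}_{\mathrm{symbol}}(l) & \text{when } l \text{ is a symbol} \\ \text{undefined} & \text{otherwise} \end{cases}$$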
A symbol is any letter sequence s such that NFsymbol(s) is defined. Currently all symbols consist of a single letter.
A word is any letter sequence w such that NFword(w) is defined, non-empty, and not a symbol.
All letter sequences that are neither words nor symbols are invalid. For example, Ä is invalid, because it is neither a symbol nor a word: its word normal form is a (the primary character A translates to a, and the mark is dropped), and a is a symbol.
Extension Steps
We define the full set of cosmopolitan identifiers in several stages. Stage $i$ consists of a set of letters $L_i$, and normal forms $\mathrm{NF}_{\mathrm{word}}^{i}$ and $\mathrm{NF}_{\mathrm{symbol}}^{i}$ operating on sequences in $L_i^*$.
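Our initial set $L_0$ of letters consists of the lowercase letters from a (U+0061) to z (U+007A). The two normal forms $\mathrm{NF}_{\mathrm{word}}^{0}$ and $\mathrm{NF}_{\mathrm{symbol}}^{0}$ are defined for $x \in L_0^*$ by
$$\mathrm{NF}_{\mathrm{word}}^{0}(x) = x, \qquad \mathrm{NF}_{\mathrm{symbol}}^{0}(x) = \begin{cases} x & \text{if } x \text{ consists of a single letter} \\ \text{undefined} & \text{otherwise} \end{cases}$$
In short, the symbols of stage 0 are those sequences in $L_0^*$ consisting of a single letter, and the words are all sequences in $L_0^*$ that have at least two elements.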
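To make the stage 0 classification concrete, here is a small Python sketch. The function names are hypothetical, an undefined normal form is modelled as None, and the input is assumed to consist only of the letters a to z.

```python
from typing import Optional


def nf_word0(x: str) -> str:
    """Word normal form of stage 0: the identity."""
    return x


def nf_symbol0(x: str) -> Optional[str]:
    """Symbol normal form of stage 0: defined only for single letters."""
    return x if len(x) == 1 else None


def classify0(x: str) -> str:
    """Classify a sequence of stage-0 letters as symbol, word or invalid."""
    if nf_symbol0(x) is not None:
        return "symbol"
    w = nf_word0(x)
    # A word needs a defined, non-empty normal form that is not a symbol.
    if w != "" and nf_symbol0(w) is None:
        return "word"
    return "invalid"
```

In this sketch t is classified as a symbol, tree as a word, and the empty sequence as invalid.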
Starting from this initial stage 0, we proceed as follows:
Each extension step from stage i to stage i+1 follows these rules:
A step adds a set of Unicode letters, i.e. $L_i \subset L_{i+1}$. The additional letters come from a modern alphabet which is actively and widely used today.
Normal forms $\mathrm{NF}_{\mathrm{word}}^{i+1}$ and $\mathrm{NF}_{\mathrm{symbol}}^{i+1}$ are defined on $L_{i+1}^*$ such that their restrictions to $L_i^*$ equal $\mathrm{NF}_{\mathrm{word}}^{i}$ and $\mathrm{NF}_{\mathrm{symbol}}^{i}$, respectively.
The normal forms work more or less transliterally, that is they transform their input letter by letter. Sometimes multiple consecutive letters may be translated at once, and the immediate context of letters may be taken into account as well.
The normal forms are pure functions in that they do not depend on anything else other than their input. In particular, they do not depend on things like the current geographical locale.
The following sections describe these four extension steps. They take advantage of Unicode features that we will be looking at first.
Primary Characters and Allowed Marks
Unicode characters come in the form of grapheme clusters[4], which are certain sequences of Unicode codepoints. For example, the Unicode character ä is represented by the grapheme cluster U+0061 U+0308, consisting of the codepoint a (U+0061) and the codepoint U+0308, which is called a combining diacritical mark. The character ä can also be represented by the grapheme cluster consisting of only the single codepoint U+00E4, so the same character can have multiple different representations as grapheme clusters.
Each Unicode character $u$ has a canonically decomposed grapheme cluster normal form $p\,d_1 \ldots d_n$, called NFD. We call $p$ the primary character of $u$, and each $d_i$ a mark of $u$. In the following extension steps, we are only interested in those letters $u$ where all marks are combining diacritical marks.
Furthermore, I have observed that only a subset of all combining diacritical marks actually occurs in letters of modern alphabets, and therefore only the marks listed in Appendix M are allowed.
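The split into primary character and marks can be sketched with Python's standard unicodedata module. The helper names decompose and marks_allowed are assumptions of this sketch, and the argument is assumed to be a single letter with its combining marks.

```python
import unicodedata

# The allowed marks, copied from Appendix M.
ALLOWED_MARKS = {chr(cp) for cp in (
    0x0300, 0x0301, 0x0302, 0x0303, 0x0304, 0x0306, 0x0307, 0x0308,
    0x0309, 0x030A, 0x030B, 0x030C, 0x030F, 0x0311, 0x0313, 0x0314,
    0x031B, 0x0323, 0x0324, 0x0325, 0x0326, 0x0327, 0x0328, 0x032D,
    0x032E, 0x0330, 0x0331, 0x0342, 0x0345,
)}


def decompose(char: str) -> tuple[str, list[str]]:
    """Return (primary character, marks) taken from the NFD form of char."""
    nfd = unicodedata.normalize("NFD", char)
    return nfd[0], list(nfd[1:])


def marks_allowed(char: str) -> bool:
    """Check that every mark of char is listed in Appendix M."""
    _, marks = decompose(char)
    return all(mark in ALLOWED_MARKS for mark in marks)
```

Both representations of ä, the single codepoint U+00E4 and the sequence U+0061 U+0308, decompose to the primary character a with the single mark U+0308.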
Stage 1: Latin Script
The set $L_1$ consists of all Unicode characters $u$ such that all marks of $u$ are allowed, and such that the primary character $p$ of $u$ is listed in Appendix L. The normal form $\mathrm{NF}_{\mathrm{word}}^{1}(u)$ is the translation of $p$ as shown in Appendix L. Note that all marks of $u$ are dropped for translation.
For letter sequences $w = u_1 \ldots u_n \in L_1^*$, $\mathrm{NF}_{\mathrm{word}}^{1}(w)$ is defined transliterally, i.e. $\mathrm{NF}_{\mathrm{word}}^{1}(w) = \mathrm{NF}_{\mathrm{word}}^{1}(u_1) \ldots \mathrm{NF}_{\mathrm{word}}^{1}(u_n)$.
The symbols $s$ of stage 1 are listed in Appendix LS. They are all new symbols, which is another way of saying that $\mathrm{NF}_{\mathrm{symbol}}^{1}(s) = s$ holds.
For example, Lothar-Matthäus-10 is a valid cosmopolitan identifier with normal form lothar-matthaus-10. The identifier Lothar-M is also a CID, with normal form lothar-M.
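The word normal form of stage 1 can be sketched in Python as follows. This is only an illustration under assumptions: the table LATIN is a subset of Appendix L, the function name nf_word1 is hypothetical, an undefined result is modelled as None, and symbols are not handled at all.

```python
import string
import unicodedata
from typing import Optional

# The allowed marks, copied from Appendix M.
ALLOWED_MARKS = {chr(cp) for cp in (
    0x0300, 0x0301, 0x0302, 0x0303, 0x0304, 0x0306, 0x0307, 0x0308,
    0x0309, 0x030A, 0x030B, 0x030C, 0x030F, 0x0311, 0x0313, 0x0314,
    0x031B, 0x0323, 0x0324, 0x0325, 0x0326, 0x0327, 0x0328, 0x032D,
    0x032E, 0x0330, 0x0331, 0x0342, 0x0345,
)}

# Subset of Appendix L: A-Z and a-z plus a few special primary characters.
LATIN = {c: c.lower() for c in string.ascii_letters}
LATIN.update({
    "Æ": "ae", "æ": "ae", "Ø": "o", "ø": "o", "ß": "ss", "ẞ": "ss",
    "Đ": "d", "đ": "d", "Ł": "l", "ł": "l", "Œ": "oe", "œ": "oe",
})


def nf_word1(word: str) -> Optional[str]:
    """Translate a Latin word letter by letter, dropping allowed marks."""
    out = []
    for ch in unicodedata.normalize("NFD", word):
        if ch in ALLOWED_MARKS:
            if not out:
                return None  # a mark needs a preceding primary character
            continue  # allowed marks are simply dropped
        if ch not in LATIN:
            return None  # not a stage-1 primary character (in this subset)
        out.append(LATIN[ch])
    return "".join(out)
```

Applied to the words Lothar and Matthäus, this sketch yields lothar and matthaus.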
Stage 2: Greek Script
Support for Greek characters is based on ISO 843:1997 [5] transliteration.
This step adds all Unicode characters $u$ such that all of its marks are allowed, and such that its primary character is listed in Appendix G.
Certain 2-letter combinations must be considered specially when translating a word $w$ to $\mathrm{NF}_{\mathrm{word}}^{2}(w)$. Possible marks do not matter as long as they are allowed, and are simply dropped during translation:
ΑΥ (U+0391 U+03A5), Αυ (U+0391 U+03C5), αΥ (U+03B1 U+03A5) and αυ (U+03B1 U+03C5) all translate to au (U+0061 U+0075)
ΕΥ (U+0395 U+03A5), Ευ (U+0395 U+03C5), εΥ (U+03B5 U+03A5) and ευ (U+03B5 U+03C5) all translate to eu (U+0065 U+0075)
ΟΥ (U+039F U+03A5), Ου (U+039F U+03C5), οΥ (U+03BF U+03A5) and ου (U+03BF U+03C5) all translate to ou (U+006F U+0075)
Otherwise, if none of the above situations applies, we translate single Unicode characters whose primary character is listed in Appendix G as described there, dropping all marks.
Words are translated transliterally based on their division into 2-letter combinations and single characters.
The Greek symbols are listed in Appendix GS. They are all different from each other, but some of them are equivalent to Latin symbols. For example, Β (U+0392) is equivalent to B (U+0042), i.e. $\mathrm{NF}_{\mathrm{symbol}}^{2}(\text{U+0392}) = \text{U+0042}$. On the other hand, Γ (U+0393) is a new symbol, and thus $\mathrm{NF}_{\mathrm{symbol}}^{2}(\text{U+0393}) = \text{U+0393}$.
For example, Μπέχρος is a CID, and translates to the simple identifier mpechros.
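The Greek word translation, including the three 2-letter combinations, can be sketched in Python as follows. The function name nf_word2_greek is hypothetical, an undefined result is modelled as None, and for brevity the sketch accepts Greek letters only, whereas the full stage 2 also contains the Latin letters of stage 1.

```python
import unicodedata
from typing import Optional

# The allowed marks, copied from Appendix M.
ALLOWED_MARKS = {chr(cp) for cp in (
    0x0300, 0x0301, 0x0302, 0x0303, 0x0304, 0x0306, 0x0307, 0x0308,
    0x0309, 0x030A, 0x030B, 0x030C, 0x030F, 0x0311, 0x0313, 0x0314,
    0x031B, 0x0323, 0x0324, 0x0325, 0x0326, 0x0327, 0x0328, 0x032D,
    0x032E, 0x0330, 0x0331, 0x0342, 0x0345,
)}

# Appendix G: lowercase alpha (U+03B1) to omega (U+03C9), final sigma included.
GREEK = dict(zip(
    (chr(cp) for cp in range(0x03B1, 0x03CA)),
    "a v g d e z i th i k l m n x o p r s s t y f ch ps o".split(),
))
# The uppercase letters translate exactly like their lowercase counterparts.
GREEK.update({k.upper(): v for k, v in list(GREEK.items())})

UPSILON = "\u03c5"
# First letters of the special combinations: alpha, epsilon, omicron.
DIGRAPH_FIRST = {"\u03b1": "a", "\u03b5": "e", "\u03bf": "o"}


def nf_word2_greek(word: str) -> Optional[str]:
    """Translate a Greek word, dropping allowed marks."""
    primaries = []
    for ch in unicodedata.normalize("NFD", word):
        if ch in ALLOWED_MARKS:
            if not primaries:
                return None  # a mark needs a preceding primary character
        elif ch in GREEK:
            primaries.append(ch)
        else:
            return None
    out, i = [], 0
    while i < len(primaries):
        first = primaries[i].lower()
        followed_by_upsilon = (
            i + 1 < len(primaries) and primaries[i + 1].lower() == UPSILON
        )
        if first in DIGRAPH_FIRST and followed_by_upsilon:
            out.append(DIGRAPH_FIRST[first] + "u")  # au, eu or ou
            i += 2
        else:
            out.append(GREEK[primaries[i]])
            i += 1
    return "".join(out)
```

For the example above, nf_word2_greek("Μπέχρος") returns mpechros.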
Stage 3: Cyrillic Script
This step adds all Unicode characters $u$ such that all of its marks are allowed, and such that its primary character is listed in Appendix C.
There are a few special cases to consider where the translation from $u$ to $\mathrm{NF}_{\mathrm{word}}^{3}(u)$ does not rely only on the primary character of $u$, but also on its marks:
Unicode characters with primary character U+0418 or U+0438 are translated to j (U+006A) if U+0306 is among their marks.
Unicode characters with primary character U+0406 or U+0456 are translated to yi (U+0079 U+0069) if U+0308 is among their marks.
Unicode characters with primary character U+0415 or U+0435 are translated to yo (U+0079 U+006F) if U+0308 is among their marks.
All other Unicode characters whose primary character is listed in Appendix C are translated by dropping all marks and translating the primary character as described in Appendix C.
Words $w$ are then translated transliterally to $\mathrm{NF}_{\mathrm{word}}^{3}(w)$.
The Cyrillic symbols are listed in Appendix CS. Some of them are equivalent to each other because upper and lower case letters are just scaled versions of each other. Some of them are furthermore equivalent to Latin and/or Greek symbols.
For example, Андре́й-Никола́евич-Колмого́ров is a CID, and translates to the normal form andrej-nikolaevich-kolmogorov.
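The mark-dependent special cases of the Cyrillic step can be sketched in Python as follows. The sketch rests on assumptions: the table covers only the letters U+0410 to U+042F and U+0406 from Appendix C, lowercase letters are assumed to translate like their uppercase counterparts, the function name nf_word3_cyrillic is hypothetical, and an undefined result is modelled as None.

```python
import unicodedata
from typing import Optional

# The allowed marks, copied from Appendix M.
ALLOWED_MARKS = {chr(cp) for cp in (
    0x0300, 0x0301, 0x0302, 0x0303, 0x0304, 0x0306, 0x0307, 0x0308,
    0x0309, 0x030A, 0x030B, 0x030C, 0x030F, 0x0311, 0x0313, 0x0314,
    0x031B, 0x0323, 0x0324, 0x0325, 0x0326, 0x0327, 0x0328, 0x032D,
    0x032E, 0x0330, 0x0331, 0x0342, 0x0345,
)}

# U+0410 to U+042F without U+0419, whose NFD form is U+0418 U+0306.
UPPER = [chr(cp) for cp in range(0x0410, 0x0430) if cp != 0x0419]
VALUES = ["a", "b", "v", "g", "d", "e", "zh", "z", "i", "k", "l", "m", "n",
          "o", "p", "r", "s", "t", "u", "f", "x", "cz", "ch", "sh", "shh",
          "", "y", "", "e", "yu", "ya"]
CYRILLIC = dict(zip(UPPER, VALUES))
CYRILLIC["\u0406"] = "i"
# Assumption: lowercase letters translate like their uppercase counterparts.
CYRILLIC.update({k.lower(): v for k, v in list(CYRILLIC.items())})

# (lowercase primary character, mark) -> translation
SPECIAL = {
    ("\u0438", "\u0306"): "j",
    ("\u0456", "\u0308"): "yi",
    ("\u0435", "\u0308"): "yo",
}


def nf_word3_cyrillic(word: str) -> Optional[str]:
    """Translate a Cyrillic word, honouring the mark-dependent cases."""
    clusters = []  # pairs of [primary character, set of marks]
    for ch in unicodedata.normalize("NFD", word):
        if ch in ALLOWED_MARKS:
            if not clusters:
                return None  # a mark needs a preceding primary character
            clusters[-1][1].add(ch)
        elif ch in CYRILLIC:
            clusters.append([ch, set()])
        else:
            return None
    out = []
    for primary, marks in clusters:
        for (base, mark), translation in SPECIAL.items():
            if primary.lower() == base and mark in marks:
                out.append(translation)
                break
        else:
            out.append(CYRILLIC[primary])  # all marks are dropped
    return "".join(out)
```

For the three words of the example above, this sketch returns andrej, nikolaevich and kolmogorov.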
Stage 4: Mathematical Symbols
This step adds letter-like mathematical symbols as listed in Appendix MS. We have $\mathrm{NF}_{\mathrm{word}}^{4} = \mathrm{NF}_{\mathrm{word}}^{3}$, and $\mathrm{NF}_{\mathrm{symbol}}^{4}$ extends $\mathrm{NF}_{\mathrm{symbol}}^{3}$ by acting as the identity on all symbols in Appendix MS.
For example, ℕ, ℕ1 and ℕ-ℕ are CIDs, but ℕℕ is not.
Separators and External CIDs
Although we have defined the hyphen - (U+002D) as the only possible separator, in practice we may want to use a different set of separators, depending on the situation. We might, for example, want to use the underscore _ (U+005F) instead, or also allow spaces (U+0020) as separators.
We can accommodate this by using external CIDs. What exactly an external CID is depends on your situation. All you need to do is to provide a description of the syntax of an external CID, and how to translate it to an ordinary CID.
For example, we might define an external CID to be a sequence of letters, digits, hyphens, underscores and spaces such that after a cleanup it becomes a CID. Here we define a cleanup as making the following modifications in order:
Trimming spaces from the start and the end.
Replacing consecutive spaces with a single space.
Replacing spaces and underscores with hyphens.
In this example, x z_9-t would be a valid external CID, corresponding to the CID x-z-9-t. On the other hand, x - z would not be a valid external CID, as after cleanup it becomes x---z, which is not a CID.
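The cleanup in this example can be sketched in Python as follows. The names cleanup and to_cid are hypothetical, and the validity check is restricted to the letters a to z as a simplifying assumption. An invalid external CID is modelled as None.

```python
import re
from typing import Optional

# Structural CID rules, restricted to the letters a-z for this sketch.
STAGE0_CID = re.compile(r"[a-z][a-z0-9]*(?:-[a-z0-9]+)*")


def cleanup(external: str) -> str:
    """Apply the three cleanup modifications in order."""
    text = external.strip(" ")        # trim spaces at the start and the end
    text = re.sub(r" +", " ", text)   # collapse consecutive spaces
    return text.replace(" ", "-").replace("_", "-")


def to_cid(external: str) -> Optional[str]:
    """Translate an external CID to a CID, or return None if it is invalid."""
    cid = cleanup(external)
    return cid if STAGE0_CID.fullmatch(cid) else None
```

As in the text, x z_9-t is translated to x-z-9-t, while x - z is cleaned up to x---z and therefore rejected.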
Cosmopolitan identifiers retain much of the simplicity and clarity of Fortran identifiers, but allow users to use their native scripts and letter-like math symbols. This is achieved by mapping each cosmopolitan identifier to a normal form which is basically a Fortran identifier, apart from the fact that it can also include symbols.
We have proceeded in four stages to define cosmopolitan identifiers. Ideally, in your application you would use CIDs as defined here. But if that is not possible for some reason, then CIDs might be adaptable to your needs by adding more stages, for example to allow more symbols to function as identifiers.
While it is possible to use equivalent but not identical identifiers in the same context, this is not recommended. For example, dx and Δx are equivalent, but not identical. Obviously, one should not write expressions like dx * Δx, but write either Δx * Δx or dx * dx. Ideally, these are not just conventions, but part of the definition of programming languages based on cosmopolitan identifiers, which would issue warnings or even errors in such situations. On the other hand, when accessing identifiers from another context it is OK to change them. For example, when accessing a library which exposes an API based on Greek identifiers, it is OK to use equivalent Latin identifiers at the calling site.
Currently the supported scripts are Latin, Greek, Cyrillic and Math. It would be great if cosmopolitan identifiers could be extended to other widely used scripts without compromising their conceptual and technical simplicity.
References
[1] Stephen J. Chapman. (2017). Fortran for Scientists and Engineers, 4th Edition.
[4] Mark Davis, Christopher Chapman (eds.). (2020). Unicode Text Segmentation, Unicode Standard Annex #29, https://unicode.org/reports/tr29/.
[5] (1997). ISO 843:1997: Information and documentation — Conversion of Greek characters into Latin characters, International Organization for Standardization, https://www.iso.org/standard/5215.html.
[6] (2002). GOST 7.79-2000: System of standards on information, librarianship and publishing. Rules of transliteration of Cyrillic script by Latin alphabet, https://runorm.com/catalog/1004/741213/.
Appendix M: Allowed Marks
̀U+0300
́U+0301
̂U+0302
̃U+0303
̄U+0304
̆U+0306
̇U+0307
̈U+0308
̉U+0309
̊U+030A
̋U+030B
̌U+030C
̏U+030F
̑U+0311
̓U+0313
̔U+0314
̛U+031B
̣U+0323
̤U+0324
̥U+0325
̦U+0326
̧U+0327
̨U+0328
̭U+032D
̮U+032E
̰U+0330
̱U+0331
͂U+0342
ͅU+0345
Appendix L: Latin Primary Characters
AU+0041 translates to aU+0061
BU+0042 translates to bU+0062
CU+0043 translates to cU+0063
DU+0044 translates to dU+0064
EU+0045 translates to eU+0065
FU+0046 translates to fU+0066
GU+0047 translates to gU+0067
HU+0048 translates to hU+0068
IU+0049 translates to iU+0069
JU+004A translates to jU+006A
KU+004B translates to kU+006B
LU+004C translates to lU+006C
MU+004D translates to mU+006D
NU+004E translates to nU+006E
OU+004F translates to oU+006F
PU+0050 translates to pU+0070
QU+0051 translates to qU+0071
RU+0052 translates to rU+0072
SU+0053 translates to sU+0073
TU+0054 translates to tU+0074
UU+0055 translates to uU+0075
VU+0056 translates to vU+0076
WU+0057 translates to wU+0077
XU+0058 translates to xU+0078
YU+0059 translates to yU+0079
ZU+005A translates to zU+007A
aU+0061 translates to aU+0061
bU+0062 translates to bU+0062
cU+0063 translates to cU+0063
dU+0064 translates to dU+0064
eU+0065 translates to eU+0065
fU+0066 translates to fU+0066
gU+0067 translates to gU+0067
hU+0068 translates to hU+0068
iU+0069 translates to iU+0069
jU+006A translates to jU+006A
kU+006B translates to kU+006B
lU+006C translates to lU+006C
mU+006D translates to mU+006D
nU+006E translates to nU+006E
oU+006F translates to oU+006F
pU+0070 translates to pU+0070
qU+0071 translates to qU+0071
rU+0072 translates to rU+0072
sU+0073 translates to sU+0073
tU+0074 translates to tU+0074
uU+0075 translates to uU+0075
vU+0076 translates to vU+0076
wU+0077 translates to wU+0077
xU+0078 translates to xU+0078
yU+0079 translates to yU+0079
zU+007A translates to zU+007A
ÆU+00C6 translates to aeU+0061U+0065
ØU+00D8 translates to oU+006F
ßU+00DF translates to ssU+0073U+0073
æU+00E6 translates to aeU+0061U+0065
øU+00F8 translates to oU+006F
ĐU+0110 translates to dU+0064
đU+0111 translates to dU+0064
ŁU+0141 translates to lU+006C
łU+0142 translates to lU+006C
ŒU+0152 translates to oeU+006FU+0065
œU+0153 translates to oeU+006FU+0065
DŽU+01C4 translates to dzU+0064U+007A
DžU+01C5 translates to dzU+0064U+007A
džU+01C6 translates to dzU+0064U+007A
LjU+01C8 translates to ljU+006CU+006A
NjU+01CB translates to njU+006EU+006A
DZU+01F1 translates to dzU+0064U+007A
DzU+01F2 translates to dzU+0064U+007A
dzU+01F3 translates to dzU+0064U+007A
ẞU+1E9E translates to ssU+0073U+0073
ffU+FB00 translates to ffU+0066U+0066
fiU+FB01 translates to fiU+0066U+0069
flU+FB02 translates to flU+0066U+006C
ffiU+FB03 translates to ffiU+0066U+0066U+0069
fflU+FB04 translates to fflU+0066U+0066U+006C
stU+FB06 translates to stU+0073U+0074
Appendix G: Greek Primary Characters
ΑU+0391 translates to aU+0061
ΒU+0392 translates to vU+0076
ΓU+0393 translates to gU+0067
ΔU+0394 translates to dU+0064
ΕU+0395 translates to eU+0065
ΖU+0396 translates to zU+007A
ΗU+0397 translates to iU+0069
ΘU+0398 translates to thU+0074U+0068
ΙU+0399 translates to iU+0069
ΚU+039A translates to kU+006B
ΛU+039B translates to lU+006C
ΜU+039C translates to mU+006D
ΝU+039D translates to nU+006E
ΞU+039E translates to xU+0078
ΟU+039F translates to oU+006F
ΠU+03A0 translates to pU+0070
ΡU+03A1 translates to rU+0072
ΣU+03A3 translates to sU+0073
ΤU+03A4 translates to tU+0074
ΥU+03A5 translates to yU+0079
ΦU+03A6 translates to fU+0066
ΧU+03A7 translates to chU+0063U+0068
ΨU+03A8 translates to psU+0070U+0073
ΩU+03A9 translates to oU+006F
αU+03B1 translates to aU+0061
βU+03B2 translates to vU+0076
γU+03B3 translates to gU+0067
δU+03B4 translates to dU+0064
εU+03B5 translates to eU+0065
ζU+03B6 translates to zU+007A
ηU+03B7 translates to iU+0069
θU+03B8 translates to thU+0074U+0068
ιU+03B9 translates to iU+0069
κU+03BA translates to kU+006B
λU+03BB translates to lU+006C
μU+03BC translates to mU+006D
νU+03BD translates to nU+006E
ξU+03BE translates to xU+0078
οU+03BF translates to oU+006F
πU+03C0 translates to pU+0070
ρU+03C1 translates to rU+0072
ςU+03C2 translates to sU+0073
σU+03C3 translates to sU+0073
τU+03C4 translates to tU+0074
υU+03C5 translates to yU+0079
φU+03C6 translates to fU+0066
χU+03C7 translates to chU+0063U+0068
ψU+03C8 translates to psU+0070U+0073
ωU+03C9 translates to oU+006F
Appendix C: Cyrillic Primary Characters
ЄU+0404 translates to YeU+0059U+0065
ЅU+0405 translates to zU+007A
ІU+0406 translates to iU+0069
ЈU+0408 translates to jU+006A
ЉU+0409 translates to lU+006C
ЊU+040A translates to nU+006E
ЏU+040F translates to dhU+0064U+0068
АU+0410 translates to aU+0061
БU+0411 translates to bU+0062
ВU+0412 translates to vU+0076
ГU+0413 translates to gU+0067
ДU+0414 translates to dU+0064
ЕU+0415 translates to eU+0065
ЖU+0416 translates to zhU+007AU+0068
ЗU+0417 translates to zU+007A
ИU+0418 translates to iU+0069
КU+041A translates to kU+006B
ЛU+041B translates to lU+006C
МU+041C translates to mU+006D
НU+041D translates to nU+006E
ОU+041E translates to oU+006F
ПU+041F translates to pU+0070
РU+0420 translates to rU+0072
СU+0421 translates to sU+0073
ТU+0422 translates to tU+0074
УU+0423 translates to uU+0075
ФU+0424 translates to fU+0066
ХU+0425 translates to xU+0078
ЦU+0426 translates to czU+0063U+007A
ЧU+0427 translates to chU+0063U+0068
ШU+0428 translates to shU+0073U+0068
ЩU+0429 translates to shhU+0073U+0068U+0068
ЪU+042A translates to an empty sequence of characters
ЫU+042B translates to yU+0079
ЬU+042C translates to an empty sequence of characters
ЭU+042D translates to eU+0065
ЮU+042E translates to yuU+0079U+0075
ЯU+042F translates to yaU+0079U+0061
аU+0430 translates to aU+0061
бU+0431 translates to bU+0062
вU+0432 translates to vU+0076
гU+0433 translates to gU+0067
дU+0434 translates to dU+0064
еU+0435 translates to eU+0065
жU+0436 translates to zhU+007AU+0068
зU+0437 translates to zU+007A
иU+0438 translates to iU+0069
кU+043A translates to kU+006B
лU+043B translates to lU+006C
мU+043C translates to mU+006D
нU+043D translates to nU+006E
оU+043E translates to oU+006F
пU+043F translates to pU+0070
рU+0440 translates to rU+0072
сU+0441 translates to sU+0073
тU+0442 translates to tU+0074
уU+0443 translates to uU+0075
фU+0444 translates to fU+0066
хU+0445 translates to xU+0078
цU+0446 translates to czU+0063U+007A
чU+0447 translates to chU+0063U+0068
шU+0448 translates to shU+0073U+0068
щU+0449 translates to shhU+0073U+0068U+0068
ъU+044A translates to an empty sequence of characters
ыU+044B translates to yU+0079
ьU+044C translates to an empty sequence of characters