This article is obsolete. It has been superseded by https://doi.org/10.47757/obua.cosmo-id.2.

Cosmopolitan Identifiers

Cite as: https://doi.org/10.47757/obua.cosmo-id.1

April 14th, 2021

Abstract

I propose a simple Unicode-based lexical syntax for programming language identifiers using characters from international scripts (currently Latin, Greek and Cyrillic). What makes such cosmopolitan identifiers special is that each identifier is equivalent to a uniquely determined simple identifier consisting only of ASCII characters. This makes collaboration in an international setting easier, especially in contexts where such identifiers are not only used by professional programmers, but are also present in the domain of normal users, for example through scriptable applications.

Download Source (TypeScript)

Introduction

Fortran [1], possibly the oldest programming language still in use, has an especially simple lexical syntax for identifiers: they consist of the letters A to Z and a to z, the digits 0 to 9, and the underscore _. An identifier must start with a letter. Moreover, identifiers are case-insensitive: the identifier Fortran denotes the same thing as the identifier fortran, for example.

Especially to users of programming languages who are not professional programmers, such as many scientists and engineers, this syntactical simplicity is a big advantage. Given that Fortran code is used, among other things, to control nuclear power plants, it certainly is a good thing to reduce the potential for confusion and misunderstanding as much as possible.

Identifiers are an important concept in computing: they make it possible to reference and abbreviate things, and to introduce a level of indirection. Computer users who are not programmers are exposed to the concept of identifiers as well, through the filesystem abstraction of their operating system. Apple's attempt to hide this abstraction from iOS users had to be rolled back in more recent versions through the introduction of a dedicated Files app.

Scriptable computer applications like Blender also make some of their internals available to power users for automation purposes via built-in scripting engines. These are of course just programming languages, and therefore heavily reliant on identifiers.

Even though artificial intelligence is revolutionizing the interaction between humans and computers, it is my conviction that identifiers will nevertheless gain importance as a concept and become even more mainstream than they already are. In many situations, use of a precise identifier just beats a hand-wavy negotiation with an AI, and that is not going to change.

Given the fundamental importance of identifiers, people must be able to express them in their native script. This is self-evident in file systems: It would be unthinkable in a modern setting if people could only use Fortran identifiers as file names. It is also much easier to teach the use of identifiers to children if identifiers can be expressed in their native tongue.

Unicode Identifiers

Unicode, billing itself as the “World Standard for Text and Emoji”, recognizes the importance of identifiers and actually incorporates the explicitly defined concept of a Unicode identifier [2]. Modern programming languages like Swift have incorporated Unicode identifiers into their syntax: ℱ𝒪ℛ𝒯ℛ𝒜𝒩 is a perfectly valid identifier in Swift, and so is Φορτραν.

Nevertheless, this lexical diversity also comes at a price. Anecdotally, while typing the previous paragraph I experienced serious problems with the text editing software I am using (Quiver, which is otherwise excellent). Typing a character would insert it in an unexpected position, and copying and pasting the text in the paragraph was impossible.

Aside from these technological teething problems, more importantly, having to deal with Greek letters in an application programming interface would inconvenience me very much.

There is also a certain unwieldiness that Unicode brings to the table. Properties like boldness are part of the definition of some characters, and so you can have two different identifiers Fortran and 𝐅𝐨𝐫𝐭𝐫𝐚𝐧. This is messy, and very far from the clarity and simplicity that identifiers in Fortran provide.
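The difference between the two spellings can be observed programmatically. The following is only an illustrative sketch, not part of the proposal: it relies on the standard `String.prototype.normalize` method, and the names `plain` and `bold` are chosen here for the example. It shows that the bold spelling stays distinct under canonical normalization (NFC) and is only folded onto the ASCII spelling by compatibility normalization (NFKC) [3].

```typescript
// Build the mathematical bold spelling from its code points so that the
// example does not depend on how this source file is encoded.
const plain = "Fortran";
const bold = String.fromCodePoint(
  0x1d405, 0x1d428, 0x1d42b, 0x1d42d, 0x1d42b, 0x1d41a, 0x1d427
);

// As raw strings, and after canonical normalization, the two spellings differ.
const sameRaw = plain === bold;
const sameNfc = plain.normalize("NFC") === bold.normalize("NFC");

// Compatibility normalization replaces the bold letters by ASCII letters.
const sameNfkc = plain.normalize("NFKC") === bold.normalize("NFKC");

console.log(sameRaw, sameNfc, sameNfkc); // false false true
```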

Intuitively, there is a difference between letters and symbols. That is the whole reason that most programming languages make this distinction in the first place. In my opinion, the concept of a Unicode identifier does not respect this distinction enough to be adequate in a programming language context. For example, the characters ℱ and 𝔽 are better treated as symbols than as normal parts of an identifier. If the programming language wishes, it can carefully introduce symbolic identifiers that capture the use of such symbols as identifiers, similarly to how symbols like + are introduced into the programming language.

Simple and Cosmopolitan Identifiers

I propose cosmopolitan identifiers as the golden mean between Fortran identifiers and Unicode identifiers. These are defined by extending simple identifiers with characters from various scripts in a principled way. Currently, Latin, Greek and Cyrillic scripts are supported.

Every cosmopolitan identifier is a Unicode identifier, but not every Unicode identifier is a cosmopolitan identifier. Also, equivalence of cosmopolitan identifiers is defined differently from both canonical equivalence and compatibility equivalence [3] as defined in the Unicode standard. Two cosmopolitan identifiers which are canonically equivalent are always also equivalent in the cosmopolitan sense, but not necessarily the other way around.

Simple Identifiers

Let me first define simple identifiers. These are sequences consisting of

letters a U+0061 to z U+007A
digits 0 U+0030 to 9 U+0039
the hyphen - U+002D

such that the sequence starts with a letter, does not end with a hyphen, and contains no two consecutive hyphens. Two simple identifiers are equivalent if and only if they are identical.

For example, lothar-matthaus-10 is a simple identifier. These are not simple identifiers: Lothar-Matthaus-10, lothar-matthäus, 10-lothar-matthaus, lothar-matthaus-, lothar--matthaus.
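The definition above is small enough to be captured by a single regular expression. The following is a minimal sketch under that reading; the names `SIMPLE_IDENTIFIER` and `isSimpleIdentifier` are hypothetical and not taken from the accompanying source code.

```typescript
// A letter first, then letters and digits, where hyphens may only appear
// singly and must be followed by at least one letter or digit.
const SIMPLE_IDENTIFIER = /^[a-z][a-z0-9]*(?:-[a-z0-9]+)*$/;

// Hypothetical helper: true if `s` is a simple identifier as defined above.
function isSimpleIdentifier(s: string): boolean {
  return SIMPLE_IDENTIFIER.test(s);
}

// The examples from the text.
console.log(isSimpleIdentifier("lothar-matthaus-10")); // true
console.log(isSimpleIdentifier("lothar--matthaus")); // false
```

Since equivalence of simple identifiers is plain identity, comparing two of them needs nothing beyond `===`.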

Simple identifiers are basically aesthetically pleasing Fortran identifiers. As such, they inherit their clarity and simplicity. One might debate choosing the hyphen over the underscore. This is obviously a somewhat arbitrary choice, but unproblematic in my opinion. If you mean x - 1, you really should not write x-1 instead.

Extension Steps

Let us now extend the notion of simple identifiers in steps until we reach cosmopolitan identifiers.

Each step is required to follow these rules:

A step adds a set of Unicode letters for use in identifiers. These additional letters must come from a modern alphabet which is actively and widely used today.
A simple algorithm must be given that decides which identifiers are valid, and how to translate all valid identifiers into simple identifiers.
This algorithm must be compatible with the algorithms of previous steps.
The algorithm works more or less transliterally, that is, it translates the identifier character by character (sometimes multiple consecutive characters may be translated at once, and the immediate context of characters may be taken into account as well).
The algorithm is a function on identifiers only and does not depend on anything else. In particular, it does not depend on things like the current geographical locale.

Two valid identifiers are considered to be equivalent if they translate to identical simple identifiers.

Step 0: Space

This is strictly speaking not an extension step as described, because we are not adding a letter, but the space character U+0020, which translates to the hyphen - U+002D. An identifier is valid if it translates to a simple identifier after first removing all leading and trailing spaces, and then replacing each run of consecutive spaces with a single hyphen.

For example, lothar matthaus 10 is now a valid identifier and is equivalent to the simple identifier lothar-matthaus-10.
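Step 0 can be sketched in a few lines. The function names `translateSpaces` and `translateStep0` are hypothetical, and returning `null` for invalid input is a choice made for this example only.

```typescript
const SIMPLE_IDENTIFIER = /^[a-z][a-z0-9]*(?:-[a-z0-9]+)*$/;

// Remove leading and trailing U+0020, then turn each remaining run of
// spaces into a single hyphen. Only U+0020 counts as a space here.
function translateSpaces(identifier: string): string {
  return identifier.replace(/^ +| +$/g, "").replace(/ +/g, "-");
}

// Returns the equivalent simple identifier, or null if the input is invalid.
function translateStep0(identifier: string): string | null {
  const candidate = translateSpaces(identifier);
  return SIMPLE_IDENTIFIER.test(candidate) ? candidate : null;
}

console.log(translateStep0("lothar matthaus 10")); // "lothar-matthaus-10"
console.log(translateStep0("lothar -matthaus")); // null, as it yields "lothar--matthaus"
```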

Programming languages will usually forbid the direct use of spaces in identifiers, but such identifiers can still be fruitfully applied when shared outside the context of the programming language, for example when the identifier simultaneously denotes a file name, or for pretty printing the identifier.

Primary Characters and Allowed Marks

Unicode characters come in the form of grapheme clusters [4], which are certain sequences of Unicode codepoints. For example, the Unicode character ä is represented by the grapheme cluster U+0061 U+0308, consisting of the codepoint a U+0061 and the codepoint ̈ U+0308 which is called a combining diacritical mark. The character ä can also be represented by the grapheme cluster consisting of only the single codepoint U+00E4, so the same character can have multiple different representations as grapheme clusters.

Each Unicode character $u$ has a canonically decomposed grapheme cluster normal form $p\ d_1 \ldots d_n$ called NFD. We call $p$ the primary character of $u$, and each $d_i$ a mark of $u$. In the following extension steps, we are only interested in those letters $u$ where all marks are combining diacritical marks.

Furthermore, I have observed that only a subset of all combining diacritical marks actually occurs in letters of modern alphabets, and therefore only the marks listed in Appendix M are allowed.
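The split into primary character and marks can be obtained from the standard `String.prototype.normalize` method. The sketch below assumes that its argument is a single grapheme cluster; the names `decompose`, `Decomposition` and `hasOnlyAllowedMarks` are hypothetical, and `ALLOWED_MARKS` is a transcription of Appendix M.

```typescript
// Appendix M as code points.
const ALLOWED_MARKS = new Set<number>([
  0x0300, 0x0301, 0x0302, 0x0303, 0x0304, 0x0306, 0x0307, 0x0308, 0x0309, 0x030a,
  0x030b, 0x030c, 0x030f, 0x0311, 0x0313, 0x0314, 0x031b, 0x0323, 0x0324, 0x0325,
  0x0326, 0x0327, 0x0328, 0x032d, 0x032e, 0x0330, 0x0331, 0x0342, 0x0345,
]);

interface Decomposition {
  primary: string; // first code point of the NFD form
  marks: number[]; // remaining code points of the NFD form
}

// Assumption: `character` is one grapheme cluster, in any representation.
function decompose(character: string): Decomposition {
  const nfd = Array.from(character.normalize("NFD"));
  return {
    primary: nfd.length > 0 ? (nfd[0] as string) : "",
    marks: nfd.slice(1).map((c) => c.codePointAt(0) as number),
  };
}

// True if every mark of the character is listed in Appendix M.
function hasOnlyAllowedMarks(character: string): boolean {
  return decompose(character).marks.every((m) => ALLOWED_MARKS.has(m));
}

// Both representations of the character from the text decompose identically.
console.log(decompose("\u00E4"), decompose("a\u0308"));
```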

Step 1: Latin Script

This step adds all Unicode characters $u$ such that all marks of $u$ are allowed and the primary character of $u$ is listed in Appendix L.

Identifiers are translated to simple identifiers by dropping all marks from all characters and translating each primary character as described in Appendix L.

For example, Lothar Matthäus 10 is now a valid identifier, and is equivalent to the simple identifier lothar-matthaus-10.
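Steps 0 and 1 together can be sketched as follows. This is an illustration under stated assumptions rather than the reference implementation: the name `translateLatin` is hypothetical, the input is normalized to NFD as a whole instead of cluster by cluster, and a mark is only accepted when it follows a letter. `LATIN_SPECIAL` transcribes the non-ASCII rows of Appendix L.

```typescript
// Appendix M as code points.
const ALLOWED_MARKS = new Set<number>([
  0x0300, 0x0301, 0x0302, 0x0303, 0x0304, 0x0306, 0x0307, 0x0308, 0x0309, 0x030a,
  0x030b, 0x030c, 0x030f, 0x0311, 0x0313, 0x0314, 0x031b, 0x0323, 0x0324, 0x0325,
  0x0326, 0x0327, 0x0328, 0x032d, 0x032e, 0x0330, 0x0331, 0x0342, 0x0345,
]);

// The non-ASCII rows of Appendix L, keyed by code point.
const LATIN_SPECIAL = new Map<number, string>([
  [0x00c6, "ae"], [0x00d8, "o"], [0x00df, "ss"], [0x00e6, "ae"], [0x00f8, "o"],
  [0x0110, "d"], [0x0111, "d"], [0x0141, "l"], [0x0142, "l"], [0x0152, "oe"],
  [0x0153, "oe"], [0x01c4, "dz"], [0x01c5, "dz"], [0x01c6, "dz"], [0x01c8, "lj"],
  [0x01cb, "nj"], [0x01f1, "dz"], [0x01f2, "dz"], [0x01f3, "dz"], [0x1e9e, "ss"],
  [0xfb00, "ff"], [0xfb01, "fi"], [0xfb02, "fl"], [0xfb03, "ffi"], [0xfb04, "ffl"],
  [0xfb06, "st"],
]);

const SIMPLE_IDENTIFIER = /^[a-z][a-z0-9]*(?:-[a-z0-9]+)*$/;

// Returns the equivalent simple identifier, or null if the input is invalid.
function translateLatin(identifier: string): string | null {
  let out = "";
  let afterLetter = false; // assumption: marks are only valid after a letter
  for (const ch of Array.from(identifier.normalize("NFD"))) {
    const code = ch.codePointAt(0) as number;
    const special = LATIN_SPECIAL.get(code);
    if (ALLOWED_MARKS.has(code)) {
      if (!afterLetter) return null; // a mark without a letter to attach to
    } else if (/^[A-Za-z]$/.test(ch)) {
      out += ch.toLowerCase();
      afterLetter = true;
    } else if (special !== undefined) {
      out += special;
      afterLetter = true;
    } else if (/^[0-9 -]$/.test(ch)) {
      out += ch;
      afterLetter = false;
    } else {
      return null; // disallowed mark or character outside Appendix L
    }
  }
  // Step 0: trim spaces, then turn each run of spaces into one hyphen.
  const candidate = out.replace(/^ +| +$/g, "").replace(/ +/g, "-");
  return SIMPLE_IDENTIFIER.test(candidate) ? candidate : null;
}

console.log(translateLatin("Lothar Matth\u00E4us 10")); // "lothar-matthaus-10"
```

Two identifiers are then equivalent exactly when `translateLatin` returns the same non-null string for both.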

Step 2: Greek Script

Support for Greek characters is based on ISO 843:1997 [5] transliteration.

This step adds all Unicode characters $u$ such that all marks of $u$ are allowed and the primary character of $u$ is listed in Appendix G.

Certain 2-letter combinations must be considered specially for translation. Possible marks do not matter as long as they are allowed, and are simply dropped during translation:

ΑΥ U+0391 U+03A5, Αυ U+0391 U+03C5, αΥ U+03B1 U+03A5 and αυ U+03B1 U+03C5 all translate to au U+0061 U+0075
ΕΥ U+0395 U+03A5, Ευ U+0395 U+03C5, εΥ U+03B5 U+03A5 and ευ U+03B5 U+03C5 all translate to eu U+0065 U+0075
ΟΥ U+039F U+03A5, Ου U+039F U+03C5, οΥ U+03BF U+03A5 and ου U+03BF U+03C5 all translate to ou U+006F U+0075

Otherwise, if none of the above situations apply, we translate Unicode characters with their primary character listed in Appendix G as described there, and by dropping all marks.

For example, Μπέχρος is now a valid identifier, and translates to the simple identifier mpechros.
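The Greek step, including the three two-letter combinations, can be sketched in two passes: first drop the allowed marks, then translate the remaining primary characters while looking one character ahead. The sketch makes several assumptions: all names are hypothetical, Appendix G is stored by code point offset instead of as literal letters, and of the Latin step only the ASCII letters are included for brevity.

```typescript
// Appendix M as code points.
const ALLOWED_MARKS = new Set<number>([
  0x0300, 0x0301, 0x0302, 0x0303, 0x0304, 0x0306, 0x0307, 0x0308, 0x0309, 0x030a,
  0x030b, 0x030c, 0x030f, 0x0311, 0x0313, 0x0314, 0x031b, 0x0323, 0x0324, 0x0325,
  0x0326, 0x0327, 0x0328, 0x032d, 0x032e, 0x0330, 0x0331, 0x0342, 0x0345,
]);

// Appendix G by offset: index 0 is alpha (U+03B1), index 24 is omega (U+03C9).
// Indices 17 and 18 are final and non-final sigma.
const GREEK = [
  "a", "v", "g", "d", "e", "z", "i", "th", "i", "k", "l", "m", "n",
  "x", "o", "p", "r", "s", "s", "t", "y", "f", "ch", "ps", "o",
];
const ALPHA = 0, EPSILON = 4, OMICRON = 14, UPSILON = 20;
const SIMPLE_IDENTIFIER = /^[a-z][a-z0-9]*(?:-[a-z0-9]+)*$/;

// Offset into GREEK for a primary character of Appendix G, or -1.
function greekIndex(code: number): number {
  if (code >= 0x03b1 && code <= 0x03c9) return code - 0x03b1;
  if (code >= 0x0391 && code <= 0x03a9 && code !== 0x03a2) return code - 0x0391;
  return -1;
}

function translateGreek(identifier: string): string | null {
  // Pass 1: drop allowed marks, keep the primary code points.
  const primaries: number[] = [];
  let afterLetter = false; // assumption: marks are only valid after a letter
  for (const ch of Array.from(identifier.normalize("NFD"))) {
    const code = ch.codePointAt(0) as number;
    if (ALLOWED_MARKS.has(code)) {
      if (!afterLetter) return null;
      continue;
    }
    afterLetter = greekIndex(code) >= 0 || /^[A-Za-z]$/.test(ch);
    primaries.push(code);
  }
  // Pass 2: translate, treating alpha/epsilon/omicron + upsilon specially.
  let out = "";
  for (let i = 0; i < primaries.length; i++) {
    const code = primaries[i] as number;
    const ch = String.fromCodePoint(code);
    const g = greekIndex(code);
    const next = i + 1 < primaries.length ? greekIndex(primaries[i + 1] as number) : -1;
    if ((g === ALPHA || g === EPSILON || g === OMICRON) && next === UPSILON) {
      out += (GREEK[g] as string) + "u";
      i++; // the upsilon has been consumed as well
    } else if (g >= 0) {
      out += GREEK[g] as string;
    } else if (/^[A-Za-z]$/.test(ch)) {
      out += ch.toLowerCase();
    } else if (/^[0-9 -]$/.test(ch)) {
      out += ch;
    } else {
      return null;
    }
  }
  const candidate = out.replace(/^ +| +$/g, "").replace(/ +/g, "-");
  return SIMPLE_IDENTIFIER.test(candidate) ? candidate : null;
}

// The example from the text, written with escapes: prints "mpechros".
console.log(translateGreek("\u039C\u03C0\u03AD\u03C7\u03C1\u03BF\u03C2"));
```

Dropping the marks before looking for the combinations mirrors the statement above that marks do not matter for them as long as they are allowed.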

Step 3: Cyrillic Script

Support for Cyrillic characters is added by applying GOST 7.79-2000 System B [6].

This step adds all Unicode characters $u$ such that all marks of $u$ are allowed and the primary character of $u$ is listed in Appendix C.

There are a few special cases to consider where the translation does not rely only on the primary character, but also on the marks:

The Unicode characters Й U+0418 U+0306 and й U+0438 U+0306 are translated to j U+006A.
The Unicode characters Ї U+0406 U+0308 and ї U+0456 U+0308 are translated to yi U+0079 U+0069.
The Unicode characters Ё U+0415 U+0308 and ё U+0435 U+0308 are translated to yo U+0079 U+006F.

All other Unicode characters with their primary character listed in Appendix C are translated by dropping all marks and performing a translation of the primary character as described in Appendix C.

For example, Андре́й Никола́евич Колмого́ров is now a valid identifier, and translates to the simple identifier andrej-nikolaevich-kolmogorov.
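In the Cyrillic step the marks can influence the result, so the sketch below collects the marks of each character before translating it. Its assumptions: all names are hypothetical, a special case applies whenever its mark occurs anywhere among the marks of the character, Є and є are written as lowercase `ye` because simple identifiers only contain a to z, and Latin and Greek letters are left out for brevity.

```typescript
// Appendix M as code points.
const ALLOWED_MARKS = new Set<number>([
  0x0300, 0x0301, 0x0302, 0x0303, 0x0304, 0x0306, 0x0307, 0x0308, 0x0309, 0x030a,
  0x030b, 0x030c, 0x030f, 0x0311, 0x0313, 0x0314, 0x031b, 0x0323, 0x0324, 0x0325,
  0x0326, 0x0327, 0x0328, 0x032d, 0x032e, 0x0330, 0x0331, 0x0342, 0x0345,
]);

// Appendix C rows U+0430 to U+044F by offset. U+0439 never occurs as a
// primary character, because NFD decomposes it into U+0438 U+0306.
const BASIC: (string | null)[] = [
  "a", "b", "v", "g", "d", "e", "zh", "z", "i", null, "k", "l", "m", "n", "o", "p",
  "r", "s", "t", "u", "f", "x", "cz", "ch", "sh", "shh", "", "y", "", "e", "yu", "ya",
];
// Remaining rows of Appendix C, keyed by lowercase code point.
const EXTRA = new Map<number, string>([
  [0x0454, "ye"], [0x0455, "z"], [0x0456, "i"], [0x0458, "j"],
  [0x0459, "l"], [0x045a, "n"], [0x045f, "dh"], [0x0491, "g"],
]);
// Special cases: lowercase primary character -> [required mark, translation].
const SPECIAL = new Map<number, [number, string]>([
  [0x0438, [0x0306, "j"]], [0x0456, [0x0308, "yi"]], [0x0435, [0x0308, "yo"]],
]);
const SIMPLE_IDENTIFIER = /^[a-z][a-z0-9]*(?:-[a-z0-9]+)*$/;

// Maps the uppercase letters of Appendix C to their lowercase code points.
function lowerCyrillic(code: number): number {
  if (code >= 0x0400 && code <= 0x040f) return code + 0x50;
  if (code >= 0x0410 && code <= 0x042f) return code + 0x20;
  return code === 0x0490 ? 0x0491 : code;
}

function lookup(lower: number): string | undefined {
  if (lower >= 0x0430 && lower <= 0x044f) {
    const entry = BASIC[lower - 0x0430];
    return entry === null ? undefined : entry;
  }
  return EXTRA.get(lower);
}

function translateCyrillic(identifier: string): string | null {
  const codes = Array.from(identifier.normalize("NFD")).map((c) => c.codePointAt(0) as number);
  let out = "";
  let i = 0;
  while (i < codes.length) {
    const code = codes[i++] as number;
    const marks: number[] = []; // allowed marks following this code point
    while (i < codes.length && ALLOWED_MARKS.has(codes[i] as number)) {
      marks.push(codes[i++] as number);
    }
    const lower = lowerCyrillic(code);
    const special = SPECIAL.get(lower);
    const translated = lookup(lower);
    if (special !== undefined && marks.indexOf(special[0]) >= 0) {
      out += special[1];
    } else if (translated !== undefined) {
      out += translated; // may be empty, for the hard and soft signs
    } else if (marks.length === 0 && /^[0-9 -]$/.test(String.fromCodePoint(code))) {
      out += String.fromCodePoint(code);
    } else {
      return null;
    }
  }
  const candidate = out.replace(/^ +| +$/g, "").replace(/ +/g, "-");
  return SIMPLE_IDENTIFIER.test(candidate) ? candidate : null;
}

console.log(translateCyrillic("\u041A\u0438\u0457\u0432")); // "kiyiv"
```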

Conclusion

Extending simple identifiers via steps 0, 1, 2 and 3, we obtain cosmopolitan identifiers. Cosmopolitan identifiers retain the simplicity and clarity of Fortran identifiers, but allow users to use their native scripts when the situation calls for it. This is achieved by mapping each cosmopolitan identifier to a uniquely determined simple identifier, and deciding equality of identifiers based on just this mapping.

While it is possible to use equivalent but not identical identifiers in the same context, this is not recommended. For example, dx and Δx are equivalent, but not identical. Obviously, one should not write expressions like dx * Δx, but write either Δx * Δx or dx * dx. Ideally, these are not just conventions, but part of the definition of programming languages based on cosmopolitan identifiers, which would issue warnings or even errors in such situations. On the other hand, when accessing identifiers from another context it is OK to change them. For example, when accessing a library which exposes an API based on Greek identifiers, it is OK to use equivalent Latin identifiers at the calling site.
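A check of this kind could look roughly as follows. This is only a sketch of the idea: `findMixedSpellings` and the `Translate` type are hypothetical names, and the translation to simple identifiers is passed in as a parameter. The toy translation used in the example handles only Δ and ASCII letters and stands in for the full algorithm.

```typescript
// A function that maps a valid identifier to its simple identifier,
// and an invalid one to null.
type Translate = (identifier: string) => string | null;

// Groups the identifiers used in one context by their simple identifier and
// reports every group that contains more than one distinct spelling.
function findMixedSpellings(identifiers: string[], translate: Translate): string[][] {
  const groups = new Map<string, string[]>();
  for (const id of identifiers) {
    const simple = translate(id);
    if (simple === null) continue; // invalid identifiers are not our concern here
    const spellings = groups.get(simple) ?? [];
    if (spellings.indexOf(id) < 0) spellings.push(id);
    groups.set(simple, spellings);
  }
  return Array.from(groups.values()).filter((spellings) => spellings.length > 1);
}

// Toy stand-in for the real translation: maps U+0394 to d and lowercases.
const toyTranslate: Translate = (s) => s.replace(/\u0394/g, "d").toLowerCase();

// Reports one group containing the two spellings dx and Δx.
console.log(findMixedSpellings(["dx", "\u0394x", "dy", "dx"], toyTranslate));
```

A compiler or linter could turn each reported group into a warning or an error, while leaving calls across context boundaries alone.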

Currently the supported scripts are Latin, Greek and Cyrillic. It would be great if cosmopolitan identifiers could be extended to other widely used scripts without compromising their conceptual and technical simplicity.

References

[1] Stephen J. Chapman. (2017). Fortran for Scientists and Engineers, 4th Edition.

[2] Mark Davis (ed.). (2020). Unicode Identifier and Pattern Syntax, Unicode Standard Annex #31, https://unicode.org/reports/tr31/.

[3] Ken Whistler (ed.). (2020). Unicode Normalization Forms, Unicode Standard Annex #15, https://unicode.org/reports/tr15/.

[4] Mark Davis, Christopher Chapman (eds.). (2020). Unicode Text Segmentation, Unicode Standard Annex #29, https://unicode.org/reports/tr29/.

[5] (1997). ISO 843:1997: Information and documentation — Conversion of Greek characters into Latin characters, International Organization for Standardization, https://www.iso.org/standard/5215.html.

[6] (2002). GOST 7.79-2000: System of standards on information, librarianship and publishing. Rules of transliteration of Cyrillic script by Latin alphabet, https://runorm.com/catalog/1004/741213/.

Appendix M: Allowed Marks

̀ U+0300
́ U+0301
̂ U+0302
̃ U+0303
̄ U+0304
̆ U+0306
̇ U+0307
̈ U+0308
̉ U+0309
̊ U+030A
̋ U+030B
̌ U+030C
̏ U+030F
̑ U+0311
̓ U+0313
̔ U+0314
̛ U+031B
̣ U+0323
̤ U+0324
̥ U+0325
̦ U+0326
̧ U+0327
̨ U+0328
̭ U+032D
̮ U+032E
̰ U+0330
̱ U+0331
͂ U+0342
ͅ U+0345

Appendix L: Latin Primary Characters

A U+0041 translates to a U+0061
B U+0042 translates to b U+0062
C U+0043 translates to c U+0063
D U+0044 translates to d U+0064
E U+0045 translates to e U+0065
F U+0046 translates to f U+0066
G U+0047 translates to g U+0067
H U+0048 translates to h U+0068
I U+0049 translates to i U+0069
J U+004A translates to j U+006A
K U+004B translates to k U+006B
L U+004C translates to l U+006C
M U+004D translates to m U+006D
N U+004E translates to n U+006E
O U+004F translates to o U+006F
P U+0050 translates to p U+0070
Q U+0051 translates to q U+0071
R U+0052 translates to r U+0072
S U+0053 translates to s U+0073
T U+0054 translates to t U+0074
U U+0055 translates to u U+0075
V U+0056 translates to v U+0076
W U+0057 translates to w U+0077
X U+0058 translates to x U+0078
Y U+0059 translates to y U+0079
Z U+005A translates to z U+007A
a U+0061 translates to a U+0061
b U+0062 translates to b U+0062
c U+0063 translates to c U+0063
d U+0064 translates to d U+0064
e U+0065 translates to e U+0065
f U+0066 translates to f U+0066
g U+0067 translates to g U+0067
h U+0068 translates to h U+0068
i U+0069 translates to i U+0069
j U+006A translates to j U+006A
k U+006B translates to k U+006B
l U+006C translates to l U+006C
m U+006D translates to m U+006D
n U+006E translates to n U+006E
o U+006F translates to o U+006F
p U+0070 translates to p U+0070
q U+0071 translates to q U+0071
r U+0072 translates to r U+0072
s U+0073 translates to s U+0073
t U+0074 translates to t U+0074
u U+0075 translates to u U+0075
v U+0076 translates to v U+0076
w U+0077 translates to w U+0077
x U+0078 translates to x U+0078
y U+0079 translates to y U+0079
z U+007A translates to z U+007A
Æ U+00C6 translates to ae U+0061 U+0065
Ø U+00D8 translates to o U+006F
ß U+00DF translates to ss U+0073 U+0073
æ U+00E6 translates to ae U+0061 U+0065
ø U+00F8 translates to o U+006F
Đ U+0110 translates to d U+0064
đ U+0111 translates to d U+0064
Ł U+0141 translates to l U+006C
ł U+0142 translates to l U+006C
Œ U+0152 translates to oe U+006F U+0065
œ U+0153 translates to oe U+006F U+0065
Ǆ U+01C4 translates to dz U+0064 U+007A
ǅ U+01C5 translates to dz U+0064 U+007A
ǆ U+01C6 translates to dz U+0064 U+007A
ǈ U+01C8 translates to lj U+006C U+006A
ǋ U+01CB translates to nj U+006E U+006A
Ǳ U+01F1 translates to dz U+0064 U+007A
ǲ U+01F2 translates to dz U+0064 U+007A
ǳ U+01F3 translates to dz U+0064 U+007A
ẞ U+1E9E translates to ss U+0073 U+0073
ﬀ U+FB00 translates to ff U+0066 U+0066
ﬁ U+FB01 translates to fi U+0066 U+0069
ﬂ U+FB02 translates to fl U+0066 U+006C
ﬃ U+FB03 translates to ffi U+0066 U+0066 U+0069
ﬄ U+FB04 translates to ffl U+0066 U+0066 U+006C
ﬆ U+FB06 translates to st U+0073 U+0074

Appendix G: Greek Primary Characters

Α U+0391 translates to a U+0061
Β U+0392 translates to v U+0076
Γ U+0393 translates to g U+0067
Δ U+0394 translates to d U+0064
Ε U+0395 translates to e U+0065
Ζ U+0396 translates to z U+007A
Η U+0397 translates to i U+0069
Θ U+0398 translates to th U+0074 U+0068
Ι U+0399 translates to i U+0069
Κ U+039A translates to k U+006B
Λ U+039B translates to l U+006C
Μ U+039C translates to m U+006D
Ν U+039D translates to n U+006E
Ξ U+039E translates to x U+0078
Ο U+039F translates to o U+006F
Π U+03A0 translates to p U+0070
Ρ U+03A1 translates to r U+0072
Σ U+03A3 translates to s U+0073
Τ U+03A4 translates to t U+0074
Υ U+03A5 translates to y U+0079
Φ U+03A6 translates to f U+0066
Χ U+03A7 translates to ch U+0063 U+0068
Ψ U+03A8 translates to ps U+0070 U+0073
Ω U+03A9 translates to o U+006F
α U+03B1 translates to a U+0061
β U+03B2 translates to v U+0076
γ U+03B3 translates to g U+0067
δ U+03B4 translates to d U+0064
ε U+03B5 translates to e U+0065
ζ U+03B6 translates to z U+007A
η U+03B7 translates to i U+0069
θ U+03B8 translates to th U+0074 U+0068
ι U+03B9 translates to i U+0069
κ U+03BA translates to k U+006B
λ U+03BB translates to l U+006C
μ U+03BC translates to m U+006D
ν U+03BD translates to n U+006E
ξ U+03BE translates to x U+0078
ο U+03BF translates to o U+006F
π U+03C0 translates to p U+0070
ρ U+03C1 translates to r U+0072
ς U+03C2 translates to s U+0073
σ U+03C3 translates to s U+0073
τ U+03C4 translates to t U+0074
υ U+03C5 translates to y U+0079
φ U+03C6 translates to f U+0066
χ U+03C7 translates to ch U+0063 U+0068
ψ U+03C8 translates to ps U+0070 U+0073
ω U+03C9 translates to o U+006F

Appendix C: Cyrillic Primary Characters

Є U+0404 translates to ye U+0079 U+0065
Ѕ U+0405 translates to z U+007A
І U+0406 translates to i U+0069
Ј U+0408 translates to j U+006A
Љ U+0409 translates to l U+006C
Њ U+040A translates to n U+006E
Џ U+040F translates to dh U+0064 U+0068
А U+0410 translates to a U+0061
Б U+0411 translates to b U+0062
В U+0412 translates to v U+0076
Г U+0413 translates to g U+0067
Д U+0414 translates to d U+0064
Е U+0415 translates to e U+0065
Ж U+0416 translates to zh U+007A U+0068
З U+0417 translates to z U+007A
И U+0418 translates to i U+0069
К U+041A translates to k U+006B
Л U+041B translates to l U+006C
М U+041C translates to m U+006D
Н U+041D translates to n U+006E
О U+041E translates to o U+006F
П U+041F translates to p U+0070
Р U+0420 translates to r U+0072
С U+0421 translates to s U+0073
Т U+0422 translates to t U+0074
У U+0423 translates to u U+0075
Ф U+0424 translates to f U+0066
Х U+0425 translates to x U+0078
Ц U+0426 translates to cz U+0063 U+007A
Ч U+0427 translates to ch U+0063 U+0068
Ш U+0428 translates to sh U+0073 U+0068
Щ U+0429 translates to shh U+0073 U+0068 U+0068
Ъ U+042A translates to an empty sequence of characters
Ы U+042B translates to y U+0079
Ь U+042C translates to an empty sequence of characters
Э U+042D translates to e U+0065
Ю U+042E translates to yu U+0079 U+0075
Я U+042F translates to ya U+0079 U+0061
а U+0430 translates to a U+0061
б U+0431 translates to b U+0062
в U+0432 translates to v U+0076
г U+0433 translates to g U+0067
д U+0434 translates to d U+0064
е U+0435 translates to e U+0065
ж U+0436 translates to zh U+007A U+0068
з U+0437 translates to z U+007A
и U+0438 translates to i U+0069
к U+043A translates to k U+006B
л U+043B translates to l U+006C
м U+043C translates to m U+006D
н U+043D translates to n U+006E
о U+043E translates to o U+006F
п U+043F translates to p U+0070
р U+0440 translates to r U+0072
с U+0441 translates to s U+0073
т U+0442 translates to t U+0074
у U+0443 translates to u U+0075
ф U+0444 translates to f U+0066
х U+0445 translates to x U+0078
ц U+0446 translates to cz U+0063 U+007A
ч U+0447 translates to ch U+0063 U+0068
ш U+0448 translates to sh U+0073 U+0068
щ U+0449 translates to shh U+0073 U+0068 U+0068
ъ U+044A translates to an empty sequence of characters
ы U+044B translates to y U+0079
ь U+044C translates to an empty sequence of characters
э U+044D translates to e U+0065
ю U+044E translates to yu U+0079 U+0075
я U+044F translates to ya U+0079 U+0061
є U+0454 translates to ye U+0079 U+0065
ѕ U+0455 translates to z U+007A
і U+0456 translates to i U+0069
ј U+0458 translates to j U+006A
љ U+0459 translates to l U+006C
њ U+045A translates to n U+006E
џ U+045F translates to dh U+0064 U+0068
Ґ U+0490 translates to g U+0067
ґ U+0491 translates to g U+0067