Cosmopolitan Identifiers
April 14th, 2021
Abstract
I propose a simple Unicode-based lexical syntax for programming language identifiers using characters from international scripts (currently Latin, Greek and Cyrillic). What makes such cosmopolitan identifiers special is that each identifier is equivalent to a uniquely determined simple identifier consisting only of ASCII characters. This makes collaboration in an international setting easier, especially in contexts where such identifiers are not only used by professional programmers, but are also present in the domain of normal users, for example through scriptable applications.
Introduction
Possibly the oldest programming language still in use, Fortran [1], has an especially simple lexical syntax for identifiers: They consist of letters A to Z and a to z, digits 0 to 9, and the underscore _. An identifier must start with a letter. Moreover, identifiers are case-insensitive: The identifier Fortran denotes the same thing as the identifier fortran, for example.
Especially to users of programming languages who are not professional programmers, such as many scientists and engineers, this syntactical simplicity is a big advantage. Given that Fortran code is used, among other things, to control nuclear power plants, it certainly is a good thing to reduce the potential for confusion and misunderstanding as much as possible.
Identifiers are an important concept in computing: They allow us to reference and abbreviate things, and to introduce a level of indirection. Computer users who are not programmers are exposed to the concept of identifiers as well, through the filesystem abstraction of their operating system. Attempts to make this abstraction unavailable to the user in Apple’s iOS had to be rolled back in more recent versions via the introduction of a dedicated Files app.
Scriptable computer applications like Blender also make some of their internals available to power users for automation purposes via built-in scripting engines. These are of course just programming languages, and therefore heavily reliant on identifiers.
Even though artificial intelligence is revolutionizing the interaction between humans and computers, it is my conviction that identifiers will nevertheless gain importance as a concept and become even more mainstream than they already are. In many situations, use of a precise identifier just beats a hand-wavy negotiation with an AI, and that is not going to change.
Given the fundamental importance of identifiers, people must be able to express them in their native script. This is self-evident in file systems: It would be unthinkable in a modern setting if people could only use Fortran identifiers as file names. It is also much easier to teach the use of identifiers to children if identifiers can be expressed in their native tongue.
Unicode Identifiers
Unicode, billing itself as the “World Standard for Text and Emoji”, recognizes the importance of identifiers and actually incorporates the explicitly defined concept of a Unicode identifier [2]. Modern programming languages like Swift have incorporated Unicode identifiers into their syntax: ℱ𝒪ℛ𝒯ℛ𝒜𝒩 is a perfectly valid identifier in Swift, and so is Φορτραν.
Nevertheless, this lexical diversity also comes at a price. Anecdotally, while typing the previous paragraph I experienced serious problems with the text editing software I am using (Quiver, which is otherwise excellent). Typing a character would insert it in an unexpected position, and copying and pasting the text in the paragraph was impossible.
Aside from these technological teething problems, more importantly, having to deal with Greek letters in an application programming interface would inconvenience me very much.
There is also a certain unwieldiness that Unicode brings to the table. Properties like boldness are part of the definition of some characters, and so you can have two different identifiers Fortran and 𝐅𝐨𝐫𝐭𝐫𝐚𝐧. This is messy, and very far from the clarity and simplicity that identifiers in Fortran provide.
Intuitively, there is a difference between letters and symbols. That is the whole reason that most programming languages make this distinction in the first place. In my opinion, the concept of a Unicode identifier does not respect this distinction enough to be adequate in a programming language context. For example, the characters ℱ and 𝔽 are better treated as symbols than as normal parts of an identifier. If the programming language wishes, it can carefully introduce symbolic identifiers that capture the use of such symbols as identifiers, similarly to how symbols like + are introduced into the programming language.
Simple and Cosmopolitan Identifiers
I propose cosmopolitan identifiers as the golden mean between Fortran identifiers and Unicode identifiers. These are defined by extending simple identifiers with characters from various scripts in a principled way. Currently, Latin, Greek and Cyrillic scripts are supported.
Every cosmopolitan identifier is a Unicode identifier, but not every Unicode identifier is a cosmopolitan identifier. Also, equivalence of cosmopolitan identifiers is defined differently than both canonical equivalence and compatibility [3] as defined in the Unicode standard. Two cosmopolitan identifiers which are canonically equivalent are always also equivalent in the cosmopolitan sense, but not necessarily the other way around.
Simple Identifiers
Let me first define simple identifiers. These are sequences consisting of
- letters a U+0061 to z U+007A
- digits 0 U+0030 to 9 U+0039
- the hyphen - U+002D
such that the sequence starts with a letter, does not end with a hyphen, and such that there are no two consecutive hyphens appearing in the sequence. Two simple identifiers are equivalent iff they are identical.
For example, lothar-matthaus-10 is a simple identifier. These are not simple identifiers: Lothar-Matthaus-10, lothar-matthäus, 10-lothar-matthaus, lothar-matthaus-, lothar--matthaus.
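The definition above can be checked mechanically. The following is a minimal sketch in Python, assuming a regular expression is an acceptable way to state the rules; the names SIMPLE_IDENTIFIER and is_simple_identifier are chosen here for illustration and are not part of the proposal.

```python
import re

# A letter first, then letters or digits; hyphens may only appear singly
# and must be followed by at least one letter or digit.
SIMPLE_IDENTIFIER = re.compile(r"[a-z][a-z0-9]*(?:-[a-z0-9]+)*")


def is_simple_identifier(text: str) -> bool:
    """Return True if the whole text is a simple identifier."""
    return SIMPLE_IDENTIFIER.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ["lothar-matthaus-10", "Lothar-Matthaus-10",
                      "10-lothar-matthaus", "lothar-matthaus-",
                      "lothar--matthaus"]:
        print(candidate, is_simple_identifier(candidate))
```

Only the first candidate in the loop is accepted; the others fail for the reasons listed in the examples above.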
Simple identifiers are basically aesthetically pleasing Fortran identifiers. As such, they inherit their clarity and simplicity. One might debate choosing the hyphen over the underscore. This is obviously a somewhat arbitrary choice, but unproblematic in my opinion. If you mean x - 1, you really should not write x-1 instead.
Extension Steps
Let us now extend the notion of simple identifiers in steps until we reach cosmopolitan identifiers.
Each step is required to follow these rules:
- A step adds a set of Unicode letters for use in identifiers. These additional letters must come from a modern alphabet which is actively and widely used today.
- A simple algorithm must be given that decides which identifiers are valid, and how to translate all valid identifiers into simple identifiers.
- This algorithm must be compatible with the algorithms of previous steps.
- The algorithm works more or less transliterally, that is, it translates the identifier character by character (sometimes multiple consecutive characters may be translated at once, and the immediate context of characters may be taken into account as well).
- The algorithm is a function on identifiers only and does not depend on anything else. In particular, it does not depend on things like the current geographical locale.
Two valid identifiers are considered to be equivalent if they translate to identical simple identifiers.
Step 0: Space
This is strictly speaking not an extension step as described, because we are not adding a letter, but the space character U+0020, which translates to the hyphen - U+002D. An identifier is valid if it translates to a simple identifier after first removing all leading and trailing spaces, and then replacing each run of consecutive spaces with a single hyphen.
For example, lothar matthaus 10 is now a valid identifier and is equivalent to the simple identifier lothar-matthaus-10.
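Step 0 is small enough to sketch directly. The Python below is an illustration under the assumption that only U+0020 counts as a space; the function name translate_spaces is hypothetical.

```python
import re

# Simple identifiers: a letter first, single hyphens only between runs
# of letters and digits.
SIMPLE_IDENTIFIER = re.compile(r"[a-z][a-z0-9]*(?:-[a-z0-9]+)*")


def translate_spaces(identifier: str) -> str:
    """Apply step 0: trim spaces, then turn each run of spaces into a hyphen."""
    collapsed = re.sub(r" +", "-", identifier.strip(" "))
    # The identifier is valid only if the result is a simple identifier.
    if SIMPLE_IDENTIFIER.fullmatch(collapsed) is None:
        raise ValueError(f"not a valid identifier: {identifier!r}")
    return collapsed
```

An input such as lothar - matthaus is rejected by this sketch, because the spaces around the hyphen produce consecutive hyphens.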
Programming languages will usually forbid the direct use of spaces in identifiers, but such identifiers can still be fruitfully applied when shared outside the context of the programming language, for example when the identifier simultaneously denotes a file name, or for pretty printing the identifier.
Primary Characters and Allowed Marks
Unicode characters come in the form of grapheme clusters [4], which are certain sequences of Unicode codepoints. For example, the Unicode character ä is represented by the grapheme cluster U+0061 U+0308, consisting of the codepoint a U+0061 and the codepoint ̈ U+0308 which is called a combining diacritical mark. The character ä can also be represented by the grapheme cluster consisting of only the single codepoint U+00E4, so the same character can have multiple different representations as grapheme clusters.
Each Unicode character u has a canonically decomposed normal form p d1…dn as a grapheme cluster, called NFD. We call p the primary character of u, and each di a mark of u. In the following extension steps, we are only interested in those letters u where all marks are combining diacritical marks.
Furthermore, I have observed that only a subset of all combining diacritical marks actually occurs in letters of modern alphabets, and therefore only the marks listed in Appendix M are allowed.
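To make the notions of primary character and mark concrete, here is a minimal sketch using the unicodedata module from Python's standard library. It assumes the input string is a single character in the above sense (one grapheme cluster); Python has no built-in grapheme segmentation, so that is left to the caller. The function names are illustrative only.

```python
import unicodedata

# The allowed marks listed in Appendix M.
ALLOWED_MARKS = frozenset(chr(code) for code in [
    0x0300, 0x0301, 0x0302, 0x0303, 0x0304, 0x0306, 0x0307, 0x0308,
    0x0309, 0x030A, 0x030B, 0x030C, 0x030F, 0x0311, 0x0313, 0x0314,
    0x031B, 0x0323, 0x0324, 0x0325, 0x0326, 0x0327, 0x0328, 0x032D,
    0x032E, 0x0330, 0x0331, 0x0342, 0x0345,
])


def primary_and_marks(character: str) -> tuple[str, str]:
    """Split one character into its primary character and its marks via NFD."""
    decomposed = unicodedata.normalize("NFD", character)
    return decomposed[0], decomposed[1:]


def has_only_allowed_marks(character: str) -> bool:
    """Check that every mark of the character is listed in Appendix M."""
    _, marks = primary_and_marks(character)
    return all(mark in ALLOWED_MARKS for mark in marks)
```

Both representations of ä, U+00E4 and U+0061 U+0308, yield the primary character a and the single mark U+0308.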
Step 1: Latin Script
This step adds all Unicode characters u such that all of its marks are allowed, and such that its primary character is listed in Appendix L.
Identifiers are translated to simple identifiers by dropping all marks from all characters and translating each primary character as described in Appendix L.
For example, Lothar Matthäus 10 is now a valid identifier, and is equivalent to the simple identifier lothar-matthaus-10.
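Steps 0 and 1 together might be sketched as follows. This is an illustration, not a reference implementation: it normalizes the whole string to NFD instead of segmenting grapheme clusters, and it accepts a mark only directly after a Latin primary character or another mark. The names LATIN, ALLOWED_MARKS and translate_latin are assumptions of this sketch; the table contents are taken from Appendix L and Appendix M.

```python
import re
import string
import unicodedata

# Allowed marks from Appendix M, written as offsets from U+0300.
ALLOWED_MARKS = frozenset(chr(0x0300 + offset) for offset in bytes.fromhex(
    "00 01 02 03 04 06 07 08 09 0a 0b 0c 0f 11 13 14 1b "
    "23 24 25 26 27 28 2d 2e 30 31 42 45"))

# Appendix L: ASCII letters map to lowercase, plus the special primaries.
LATIN = {letter: letter.lower() for letter in string.ascii_letters}
LATIN.update({
    "\u00c6": "ae", "\u00e6": "ae",                   # Æ æ
    "\u00d8": "o", "\u00f8": "o",                     # Ø ø
    "\u00df": "ss", "\u1e9e": "ss",                   # ß ẞ
    "\u0110": "d", "\u0111": "d",                     # Đ đ
    "\u0141": "l", "\u0142": "l",                     # Ł ł
    "\u0152": "oe", "\u0153": "oe",                   # Œ œ
    "\u01c4": "dz", "\u01c5": "dz", "\u01c6": "dz",   # DŽ Dž dž
    "\u01c8": "lj", "\u01cb": "nj",                   # Lj Nj
    "\u01f1": "dz", "\u01f2": "dz", "\u01f3": "dz",   # DZ Dz dz
    "\ufb00": "ff", "\ufb01": "fi", "\ufb02": "fl",   # ligatures
    "\ufb03": "ffi", "\ufb04": "ffl", "\ufb06": "st",
})

SIMPLE_IDENTIFIER = re.compile(r"[a-z][a-z0-9]*(?:-[a-z0-9]+)*")


def translate_latin(identifier: str) -> str:
    """Translate an identifier using steps 0 and 1 into a simple identifier."""
    output = []
    after_letter = False
    for codepoint in unicodedata.normalize("NFD", identifier):
        if codepoint in LATIN:
            output.append(LATIN[codepoint])
            after_letter = True
        elif codepoint in ALLOWED_MARKS and after_letter:
            continue  # marks are dropped
        elif codepoint in "0123456789- ":
            output.append(codepoint)
            after_letter = False
        else:
            raise ValueError(f"unsupported codepoint U+{ord(codepoint):04X}")
    simple = re.sub(r" +", "-", "".join(output).strip(" "))
    if SIMPLE_IDENTIFIER.fullmatch(simple) is None:
        raise ValueError(f"not a valid identifier: {identifier!r}")
    return simple
```

With this sketch, Lothar Matthäus 10 translates to lothar-matthaus-10, and Straße translates to strasse.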
Step 2: Greek Script
This step adds all Unicode characters u such that all of its marks are allowed, and such that its primary character is listed in Appendix G.
Certain 2-letter combinations must be considered specially for translation. Possible marks do not matter as long as they are allowed, and are simply dropped during translation:
- ΑΥ U+0391 U+03A5, Αυ U+0391 U+03C5, αΥ U+03B1 U+03A5 and αυ U+03B1 U+03C5 all translate to au U+0061 U+0075
- ΕΥ U+0395 U+03A5, Ευ U+0395 U+03C5, εΥ U+03B5 U+03A5 and ευ U+03B5 U+03C5 all translate to eu U+0065 U+0075
- ΟΥ U+039F U+03A5, Ου U+039F U+03C5, οΥ U+03BF U+03A5 and ου U+03BF U+03C5 all translate to ou U+006F U+0075
Otherwise, if none of the above situations applies, we translate each Unicode character whose primary character is listed in Appendix G as described there, dropping all marks.
For example, Μπέχρος is now a valid identifier, and translates to the simple identifier mpechros.
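The Greek step can be sketched in the same spirit. For brevity the Python below handles only Greek letters and their marks, leaving out the combination with steps 0 and 1. It drops all marks first and then looks for the three 2-letter combinations on the remaining primary characters, which is one possible reading of the rule that marks do not matter. All names are hypothetical, and the table is built from code points to mirror Appendix G.

```python
import unicodedata

# Allowed marks from Appendix M, written as offsets from U+0300.
ALLOWED_MARKS = frozenset(chr(0x0300 + offset) for offset in bytes.fromhex(
    "00 01 02 03 04 06 07 08 09 0a 0b 0c 0f 11 13 14 1b "
    "23 24 25 26 27 28 2d 2e 30 31 42 45"))

# Translations of the lowercase letters U+03B1 (alpha) to U+03C9 (omega),
# in code point order; both sigma forms translate to s.
LOWER_NAMES = "a v g d e z i th i k l m n x o p r s s t y f ch ps o".split()

GREEK = {}
for offset, name in enumerate(LOWER_NAMES):
    lower = 0x03B1 + offset
    GREEK[chr(lower)] = name
    if lower != 0x03C2:  # the final sigma has no capital form
        GREEK[chr(lower - 0x20)] = name

# The special 2-letter combinations, keyed by their lowercase form.
DIGRAPHS = {"\u03b1\u03c5": "au", "\u03b5\u03c5": "eu", "\u03bf\u03c5": "ou"}


def translate_greek(identifier: str) -> str:
    """Translate a sequence of Greek letters into lowercase ASCII letters."""
    primaries = []
    for codepoint in unicodedata.normalize("NFD", identifier):
        if codepoint in GREEK:
            primaries.append(codepoint)
        elif codepoint in ALLOWED_MARKS and primaries:
            continue  # marks are dropped
        else:
            raise ValueError(f"unsupported codepoint U+{ord(codepoint):04X}")
    output = []
    index = 0
    while index < len(primaries):
        pair = "".join(primaries[index:index + 2]).lower()
        if pair in DIGRAPHS:
            output.append(DIGRAPHS[pair])
            index += 2
        else:
            output.append(GREEK[primaries[index]])
            index += 1
    return "".join(output)
```

Applied to Μπέχρος this yields mpechros, and Φορτραν yields fortran.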
Step 3: Cyrillic Script
This step adds all Unicode characters u such that all of its marks are allowed, and such that its primary character is listed in Appendix C.
There are a few special cases to consider where the translation does not rely only on the primary character, but also on the marks:
- The Unicode characters Й U+0418 U+0306 and й U+0438 U+0306 are translated to j U+006A.
- The Unicode characters Ї U+0406 U+0308 and ї U+0456 U+0308 are translated to yi U+0079 U+0069.
- The Unicode characters Ё U+0415 U+0308 and ё U+0435 U+0308 are translated to yo U+0079 U+006F.
All other Unicode characters with their primary character listed in Appendix C are translated by dropping all marks and performing a translation of the primary character as described in Appendix C.
For example, Андре́й Никола́евич Колмого́ров is now a valid identifier, and translates to the simple identifier andrej-nikolaevich-kolmogorov.
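The Cyrillic step differs from the previous ones in that three translations depend on a mark. A minimal sketch, assuming only Cyrillic letters and spaces occur in the input, could look as follows. The names are hypothetical, the table follows Appendix C, and the two-letter translation of є is written in lowercase here because simple identifiers contain only lowercase letters.

```python
import re
import unicodedata

# Allowed marks from Appendix M, written as offsets from U+0300.
ALLOWED_MARKS = frozenset(chr(0x0300 + offset) for offset in bytes.fromhex(
    "00 01 02 03 04 06 07 08 09 0a 0b 0c 0f 11 13 14 1b "
    "23 24 25 26 27 28 2d 2e 30 31 42 45"))

# Lowercase U+0430..U+044F in code point order. None marks U+0439, which
# decomposes under NFD and therefore is not a primary character.
BASIC = ["a", "b", "v", "g", "d", "e", "zh", "z", "i", None, "k", "l", "m",
         "n", "o", "p", "r", "s", "t", "u", "f", "x", "cz", "ch", "sh",
         "shh", "", "y", "", "e", "yu", "ya"]
LOWER = {chr(0x0430 + i): t for i, t in enumerate(BASIC) if t is not None}
LOWER.update({"\u0454": "ye", "\u0455": "z", "\u0456": "i", "\u0458": "j",
              "\u0459": "l", "\u045a": "n", "\u045f": "dh", "\u0491": "g"})
CYRILLIC = {**LOWER, **{key.upper(): value for key, value in LOWER.items()}}

# Lowercase primary -> (mark that triggers the special case, translation).
SPECIAL = {"\u0438": ("\u0306", "j"),    # и + breve
           "\u0456": ("\u0308", "yi"),   # і + diaeresis
           "\u0435": ("\u0308", "yo")}   # е + diaeresis

SIMPLE_IDENTIFIER = re.compile(r"[a-z][a-z0-9]*(?:-[a-z0-9]+)*")


def translate_cyrillic(identifier: str) -> str:
    """Translate Cyrillic letters and spaces into a simple identifier."""
    clusters = []  # [primary, marks] pairs; a space never carries marks
    for codepoint in unicodedata.normalize("NFD", identifier):
        if codepoint in CYRILLIC or codepoint == " ":
            clusters.append([codepoint, ""])
        elif (codepoint in ALLOWED_MARKS and clusters
              and clusters[-1][0] != " "):
            clusters[-1][1] += codepoint
        else:
            raise ValueError(f"unsupported codepoint U+{ord(codepoint):04X}")
    output = []
    for primary, marks in clusters:
        special = SPECIAL.get(primary.lower())
        if special is not None and special[0] in marks:
            output.append(special[1])
        else:
            output.append(" " if primary == " " else CYRILLIC[primary])
    simple = re.sub(r" +", "-", "".join(output).strip(" "))
    if SIMPLE_IDENTIFIER.fullmatch(simple) is None:
        raise ValueError(f"not a valid identifier: {identifier!r}")
    return simple
```

In this sketch the stress marks in the example above are dropped like any other allowed mark, while the breve on й selects the translation j.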
Conclusion
Extending simple identifiers via steps 0, 1, 2 and 3 we obtain cosmopolitan identifiers. Cosmopolitan identifiers retain the simplicity and clarity of Fortran identifiers, but allow users to use their native scripts when the situation calls for it. This is achieved by mapping each cosmopolitan identifier to a uniquely determined simple identifier, and deciding equality of identifiers based on just this mapping.
While it is possible to use equivalent but not identical identifiers in the same context, this is not recommended. For example, dx and Δx are equivalent, but not identical. Obviously, one should not write expressions like dx * Δx, but write either Δx * Δx or dx * dx. Ideally, these are not just conventions, but part of the definition of programming languages based on cosmopolitan identifiers, which would issue warnings or even errors in such situations. On the other hand, when accessing identifiers from another context it is OK to change them. For example, when accessing a library which exposes an API based on Greek identifiers, it is OK to use equivalent Latin identifiers at the calling site.
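A language implementation could detect such mixed spellings by grouping the identifiers of one context by their simple form. The sketch below assumes a translation function is supplied by the caller; toy_to_simple is a deliberately tiny stand-in that only knows the letter delta, and both function names are hypothetical.

```python
from collections import defaultdict


def find_mixed_spellings(names, to_simple):
    """Group names by simple form and report groups with several spellings."""
    groups = defaultdict(set)
    for name in names:
        groups[to_simple(name)].add(name)
    return {simple: sorted(spellings)
            for simple, spellings in groups.items()
            if len(spellings) > 1}


def toy_to_simple(name: str) -> str:
    """Stand-in translation: maps capital and small delta to d, lowercases the rest."""
    delta = {"\u0394": "d", "\u03b4": "d"}
    return "".join(delta.get(char, char.lower()) for char in name)


if __name__ == "__main__":
    # Reports that dx is spelled in two different ways in this context.
    print(find_mixed_spellings(["dx", "\u0394x", "dy"], toy_to_simple))
```

A compiler or linter could turn each reported group into a warning or an error, depending on how strict the language wishes to be.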
Currently the supported scripts are Latin, Greek and Cyrillic. It would be great if it were possible to extend cosmopolitan identifiers to other widely used scripts without compromising their conceptual and technical simplicity.
References
[1] Stephen J. Chapman. (2017). Fortran for Scientists and Engineers, 4th Edition.
[5] (1997). ISO 843:1997: Information and documentation — Conversion of Greek characters into Latin characters, International Organization for Standardization, https://www.iso.org/standard/5215.html.
[6] (2002). GOST 7.79-2000: System of standards on information, librarianship and publishing. Rules of transliteration of Cyrillic script by Latin alphabet, https://runorm.com/catalog/1004/741213/.
Appendix M: Allowed Marks
- ̀ U+0300
- ́ U+0301
- ̂ U+0302
- ̃ U+0303
- ̄ U+0304
- ̆ U+0306
- ̇ U+0307
- ̈ U+0308
- ̉ U+0309
- ̊ U+030A
- ̋ U+030B
- ̌ U+030C
- ̏ U+030F
- ̑ U+0311
- ̓ U+0313
- ̔ U+0314
- ̛ U+031B
- ̣ U+0323
- ̤ U+0324
- ̥ U+0325
- ̦ U+0326
- ̧ U+0327
- ̨ U+0328
- ̭ U+032D
- ̮ U+032E
- ̰ U+0330
- ̱ U+0331
- ͂ U+0342
- ͅ U+0345
Appendix L: Latin Primary Characters
- A U+0041 translates to a U+0061
- B U+0042 translates to b U+0062
- C U+0043 translates to c U+0063
- D U+0044 translates to d U+0064
- E U+0045 translates to e U+0065
- F U+0046 translates to f U+0066
- G U+0047 translates to g U+0067
- H U+0048 translates to h U+0068
- I U+0049 translates to i U+0069
- J U+004A translates to j U+006A
- K U+004B translates to k U+006B
- L U+004C translates to l U+006C
- M U+004D translates to m U+006D
- N U+004E translates to n U+006E
- O U+004F translates to o U+006F
- P U+0050 translates to p U+0070
- Q U+0051 translates to q U+0071
- R U+0052 translates to r U+0072
- S U+0053 translates to s U+0073
- T U+0054 translates to t U+0074
- U U+0055 translates to u U+0075
- V U+0056 translates to v U+0076
- W U+0057 translates to w U+0077
- X U+0058 translates to x U+0078
- Y U+0059 translates to y U+0079
- Z U+005A translates to z U+007A
- a U+0061 translates to a U+0061
- b U+0062 translates to b U+0062
- c U+0063 translates to c U+0063
- d U+0064 translates to d U+0064
- e U+0065 translates to e U+0065
- f U+0066 translates to f U+0066
- g U+0067 translates to g U+0067
- h U+0068 translates to h U+0068
- i U+0069 translates to i U+0069
- j U+006A translates to j U+006A
- k U+006B translates to k U+006B
- l U+006C translates to l U+006C
- m U+006D translates to m U+006D
- n U+006E translates to n U+006E
- o U+006F translates to o U+006F
- p U+0070 translates to p U+0070
- q U+0071 translates to q U+0071
- r U+0072 translates to r U+0072
- s U+0073 translates to s U+0073
- t U+0074 translates to t U+0074
- u U+0075 translates to u U+0075
- v U+0076 translates to v U+0076
- w U+0077 translates to w U+0077
- x U+0078 translates to x U+0078
- y U+0079 translates to y U+0079
- z U+007A translates to z U+007A
- Æ U+00C6 translates to ae U+0061 U+0065
- Ø U+00D8 translates to o U+006F
- ß U+00DF translates to ss U+0073 U+0073
- æ U+00E6 translates to ae U+0061 U+0065
- ø U+00F8 translates to o U+006F
- Đ U+0110 translates to d U+0064
- đ U+0111 translates to d U+0064
- Ł U+0141 translates to l U+006C
- ł U+0142 translates to l U+006C
- Œ U+0152 translates to oe U+006F U+0065
- œ U+0153 translates to oe U+006F U+0065
- DŽ U+01C4 translates to dz U+0064 U+007A
- Dž U+01C5 translates to dz U+0064 U+007A
- dž U+01C6 translates to dz U+0064 U+007A
- Lj U+01C8 translates to lj U+006C U+006A
- Nj U+01CB translates to nj U+006E U+006A
- DZ U+01F1 translates to dz U+0064 U+007A
- Dz U+01F2 translates to dz U+0064 U+007A
- dz U+01F3 translates to dz U+0064 U+007A
- ẞ U+1E9E translates to ss U+0073 U+0073
- ff U+FB00 translates to ff U+0066 U+0066
- fi U+FB01 translates to fi U+0066 U+0069
- fl U+FB02 translates to fl U+0066 U+006C
- ffi U+FB03 translates to ffi U+0066 U+0066 U+0069
- ffl U+FB04 translates to ffl U+0066 U+0066 U+006C
- st U+FB06 translates to st U+0073 U+0074
Appendix G: Greek Primary Characters
- Α U+0391 translates to a U+0061
- Β U+0392 translates to v U+0076
- Γ U+0393 translates to g U+0067
- Δ U+0394 translates to d U+0064
- Ε U+0395 translates to e U+0065
- Ζ U+0396 translates to z U+007A
- Η U+0397 translates to i U+0069
- Θ U+0398 translates to th U+0074 U+0068
- Ι U+0399 translates to i U+0069
- Κ U+039A translates to k U+006B
- Λ U+039B translates to l U+006C
- Μ U+039C translates to m U+006D
- Ν U+039D translates to n U+006E
- Ξ U+039E translates to x U+0078
- Ο U+039F translates to o U+006F
- Π U+03A0 translates to p U+0070
- Ρ U+03A1 translates to r U+0072
- Σ U+03A3 translates to s U+0073
- Τ U+03A4 translates to t U+0074
- Υ U+03A5 translates to y U+0079
- Φ U+03A6 translates to f U+0066
- Χ U+03A7 translates to ch U+0063 U+0068
- Ψ U+03A8 translates to ps U+0070 U+0073
- Ω U+03A9 translates to o U+006F
- α U+03B1 translates to a U+0061
- β U+03B2 translates to v U+0076
- γ U+03B3 translates to g U+0067
- δ U+03B4 translates to d U+0064
- ε U+03B5 translates to e U+0065
- ζ U+03B6 translates to z U+007A
- η U+03B7 translates to i U+0069
- θ U+03B8 translates to th U+0074 U+0068
- ι U+03B9 translates to i U+0069
- κ U+03BA translates to k U+006B
- λ U+03BB translates to l U+006C
- μ U+03BC translates to m U+006D
- ν U+03BD translates to n U+006E
- ξ U+03BE translates to x U+0078
- ο U+03BF translates to o U+006F
- π U+03C0 translates to p U+0070
- ρ U+03C1 translates to r U+0072
- ς U+03C2 translates to s U+0073
- σ U+03C3 translates to s U+0073
- τ U+03C4 translates to t U+0074
- υ U+03C5 translates to y U+0079
- φ U+03C6 translates to f U+0066
- χ U+03C7 translates to ch U+0063 U+0068
- ψ U+03C8 translates to ps U+0070 U+0073
- ω U+03C9 translates to o U+006F
Appendix C: Cyrillic Primary Characters
- Є U+0404 translates to ye U+0079 U+0065
- Ѕ U+0405 translates to z U+007A
- І U+0406 translates to i U+0069
- Ј U+0408 translates to j U+006A
- Љ U+0409 translates to l U+006C
- Њ U+040A translates to n U+006E
- Џ U+040F translates to dh U+0064 U+0068
- А U+0410 translates to a U+0061
- Б U+0411 translates to b U+0062
- В U+0412 translates to v U+0076
- Г U+0413 translates to g U+0067
- Д U+0414 translates to d U+0064
- Е U+0415 translates to e U+0065
- Ж U+0416 translates to zh U+007A U+0068
- З U+0417 translates to z U+007A
- И U+0418 translates to i U+0069
- К U+041A translates to k U+006B
- Л U+041B translates to l U+006C
- М U+041C translates to m U+006D
- Н U+041D translates to n U+006E
- О U+041E translates to o U+006F
- П U+041F translates to p U+0070
- Р U+0420 translates to r U+0072
- С U+0421 translates to s U+0073
- Т U+0422 translates to t U+0074
- У U+0423 translates to u U+0075
- Ф U+0424 translates to f U+0066
- Х U+0425 translates to x U+0078
- Ц U+0426 translates to cz U+0063 U+007A
- Ч U+0427 translates to ch U+0063 U+0068
- Ш U+0428 translates to sh U+0073 U+0068
- Щ U+0429 translates to shh U+0073 U+0068 U+0068
- Ъ U+042A translates to an empty sequence of characters
- Ы U+042B translates to y U+0079
- Ь U+042C translates to an empty sequence of characters
- Э U+042D translates to e U+0065
- Ю U+042E translates to yu U+0079 U+0075
- Я U+042F translates to ya U+0079 U+0061
- а U+0430 translates to a U+0061
- б U+0431 translates to b U+0062
- в U+0432 translates to v U+0076
- г U+0433 translates to g U+0067
- д U+0434 translates to d U+0064
- е U+0435 translates to e U+0065
- ж U+0436 translates to zh U+007A U+0068
- з U+0437 translates to z U+007A
- и U+0438 translates to i U+0069
- к U+043A translates to k U+006B
- л U+043B translates to l U+006C
- м U+043C translates to m U+006D
- н U+043D translates to n U+006E
- о U+043E translates to o U+006F
- п U+043F translates to p U+0070
- р U+0440 translates to r U+0072
- с U+0441 translates to s U+0073
- т U+0442 translates to t U+0074
- у U+0443 translates to u U+0075
- ф U+0444 translates to f U+0066
- х U+0445 translates to x U+0078
- ц U+0446 translates to cz U+0063 U+007A
- ч U+0447 translates to ch U+0063 U+0068
- ш U+0448 translates to sh U+0073 U+0068
- щ U+0449 translates to shh U+0073 U+0068 U+0068
- ъ U+044A translates to an empty sequence of characters
- ы U+044B translates to y U+0079
- ь U+044C translates to an empty sequence of characters
- э U+044D translates to e U+0065
- ю U+044E translates to yu U+0079 U+0075
- я U+044F translates to ya U+0079 U+0061
- є U+0454 translates to ye U+0079 U+0065
- ѕ U+0455 translates to z U+007A
- і U+0456 translates to i U+0069
- ј U+0458 translates to j U+006A
- љ U+0459 translates to l U+006C
- њ U+045A translates to n U+006E
- џ U+045F translates to dh U+0064 U+0068
- Ґ U+0490 translates to g U+0067
- ґ U+0491 translates to g U+0067