© 2021 Steven Obua

Cosmopolitan Identifiers

by Steven Obua
Cite as: https://doi.org/10.47757/obua.cosmo-id.3
July 1st, 2021
Abstract
I propose a simple Unicode-based lexical syntax for programming language identifiers using characters from international scripts (currently Latin, Greek, Cyrillic and Math). Such cosmopolitan identifiers are designed to achieve much of the simplicity of Fortran identifiers while acknowledging a modern international outlook. This seems particularly advantageous in contexts where such identifiers are not (only) used by professional programmers, but are exposed to normal users, for example through scriptable applications.

Introduction

Fortran [1], possibly the oldest programming language still in use, has an especially simple lexical syntax for identifiers: they consist of the letters A to Z and a to z, the digits 0 to 9, and the underscore _. An identifier must start with a letter. Moreover, identifiers are case-insensitive: the identifier Fortran denotes the same thing as the identifier fortran, for example.
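The rules just stated are small enough to fit in a few lines. The following Python sketch is only an illustration of them; the function names are hypothetical, and the length limits imposed by actual Fortran standards are deliberately ignored.

```python
import re

# Letters, digits and underscore; the first character must be a letter.
FORTRAN_IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_fortran_identifier(text: str) -> bool:
    """Check the lexical shape described above (length limits are ignored)."""
    return FORTRAN_IDENTIFIER.fullmatch(text) is not None


def same_fortran_identifier(a: str, b: str) -> bool:
    """Two valid identifiers denote the same thing if they differ only in case."""
    return (
        is_fortran_identifier(a)
        and is_fortran_identifier(b)
        and a.lower() == b.lower()
    )
```

With these definitions, Fortran and fortran are the same identifier, while 9lives is rejected because it does not start with a letter.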
Especially for users of programming languages who are not professional programmers, such as many scientists and engineers, this syntactical simplicity seems to be a big advantage. Given that Fortran code is used, among other things, to control nuclear power plants, it certainly is a good thing to reduce the potential for confusion and misunderstanding as much as possible.
Identifiers are an important concept in computing: They allow us to reference and abbreviate things, and to introduce a level of indirection. Computer users who are not programmers are exposed to the concept of identifiers as well, through the filesystem abstraction of their operating system. Attempts to make this abstraction unavailable to the user in Apple’s iOS had to be rolled back in more recent versions via the introduction of a dedicated Files app.
Scriptable computer applications like Blender also make some of their internals available to power users for automation purposes via built-in scripting engines. These are of course just programming languages, and therefore heavily reliant on identifiers.
Even though artificial intelligence is revolutionizing the interaction between humans and computers, it is my conviction that identifiers will nevertheless gain importance as a concept and become even more mainstream than they already are. In many situations, use of a precise identifier just beats a hand-wavy negotiation with an AI, and that is not going to change.
Given the fundamental importance of identifiers, people must be able to express them in the script of their native tongue. This is self-evident in file systems: It would be unthinkable in a modern setting if people could only use Fortran identifiers as file names. It is also much easier to teach the use of identifiers to children if identifiers can be expressed in their native tongue.
Furthermore, it can be very convenient, especially in scientific applications, to be able to use symbols like π directly as identifiers, or to use them as part of identifiers, as in approximation-of-π.
Our design goal for identifiers consists therefore of three subgoals:
  1. They should be almost as simple as Fortran identifiers.
  2. They should support international scripts.
  3. They should support the use of symbols as (part of) identifiers.

Unicode Identifiers

Unicode, billing itself as the “World Standard for Text and Emoji”, recognizes the importance of identifiers and actually provides the concept of Unicode identifiers [2]. Modern programming languages like Swift have incorporated Unicode identifiers into their syntax: ℱ𝒪ℛ𝒯ℛ𝒜𝒩 is a perfectly valid identifier in Swift, and so is Φορτραν.
Nevertheless, this lexical diversity also comes at a price. Having to deal with Greek letters in an application programming interface would inconvenience me very much, at least when they are used casually instead of just designating entities like π.
There is also a certain unwieldiness that Unicode brings to the table. Properties like boldness are part of the definition of some characters, and so you can have two different identifiers Fortran and 𝐅𝐨𝐫𝐭𝐫𝐚𝐧. This is messy, and very far from the clarity and simplicity that identifiers in Fortran provide. Granted, one will rarely encounter such a misuse of Unicode in practice, but why invite it in the first place?
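The following Python sketch illustrates the point using the standard `unicodedata` module. It shows that canonical normalization (NFC) keeps the plain and the bold spelling apart, whereas compatibility normalization (NFKC) folds the bold letters onto their plain counterparts. This is only a demonstration of Unicode behaviour, not part of the proposal made in this paper.

```python
import unicodedata

plain = "Fortran"
bold = "𝐅𝐨𝐫𝐭𝐫𝐚𝐧"  # MATHEMATICAL BOLD letters, U+1D400 onwards

# The two spellings are different code point sequences ...
different_as_written = plain != bold

# ... and canonical normalization does not change that.
different_after_nfc = unicodedata.normalize("NFC", bold) != plain

# Only compatibility normalization maps the bold letters to plain ones.
same_after_nfkc = unicodedata.normalize("NFKC", bold) == plain
```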
Intuitively, there is a difference between names and symbols. That is the whole reason that most programming languages make this distinction in the first place. In my opinion, the concept of Unicode identifiers does not respect this distinction enough to be adequate in a programming language context. For example, the characters and 𝔽 are better treated as symbols than as arbitrary letters.
Clearly, while Unicode identifiers support international scripts, and can contain symbols, they are not as simple as Fortran identifiers at all, but a rather messy affair. Therefore Unicode identifiers only fulfill two of our three design goals for identifiers.

Cosmopolitan Identifiers

I propose cosmopolitan identifiers (CIDs) as the golden middle between Fortran identifiers and Unicode identifiers. A CID is a sequence of Unicode characters consisting of letters, digits and separators. The following properties hold for a CID:
A separator is a hyphen - U+002D. I favour the hyphen over the underscore for aesthetic reasons. Later we will look at how it is possible to use other characters like underscores and spaces as separators as well.
A digit is a character between (and including) 0 U+0030 and 9 U+0039.
To fully define what a CID is, it is necessary to describe the set of letters, which sequences of letters are allowed, and to state when two CIDs are considered to be equivalent. All of this will be done in the following sections.
The set of letters contains initially just the set of lowercase letters from a U+0061 to z U+007A. We will later extend this set with Latin, Greek and Cyrillic characters. We will also extend it with letter-like mathematical symbols.

Equivalence of CIDs

Every CID is a Unicode identifier, but not every Unicode identifier is a CID. Also, equivalence of CIDs is defined differently than both canonical equivalence and compatibility [3] as defined in the Unicode standard. Two cosmopolitan identifiers which are canonically equivalent as Unicode identifiers are always also equivalent in the cosmopolitan sense, but not necessarily the other way around.
To decide equivalence of two CIDs, we map each CID $c$ to a normal CID $N(c)$. Then two CIDs $a$ and $b$ are considered to be equivalent iff $N(a)$ and $N(b)$ are identical.
We construct $N(c)$ by dividing $c$ into maximal non-empty subsequences $s_1, \ldots, s_k$ consisting either only of letters or only of digits. Separators are ignored themselves but serve to restrict how far subsequences of letters or digits can expand. We map each $s_i$ separately to its normal form $NF(s_i)$, and then form the CID
\[ NF(s_1)\text{-}NF(s_2)\text{-}\ldots\text{-}NF(s_k) \]
Note that if the normal form for one of the $s_i$ does not exist, i.e. is undefined, then the normal form for $c$ does not exist as well.
For a complete description of $N$ we therefore only need to describe $NF(l)$ for a sequence $l$ of letters, and $NF(d)$ for a sequence $d$ of digits.
Digits are easy to handle, we just set
\[ NF(d) = d \]
How we map general sequences of letters to their normal forms is described in the following sections. For a sequence $x$ consisting just of lowercase letters between a and z, we proceed just as we do for digits:
\[ NF(x) = x \]
As an example, consider the CIDs xyz12p5 and xyz-12p-5. They both decompose into the subsequences xyz, 12, p and 5, and have therefore the same normal CID xyz-12-p-5.
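The construction of the normal CID can be sketched in Python for the restricted case of lowercase letters a to z, digits and hyphens. The function name is hypothetical, and the sketch assumes that a CID is a sequence of non-empty letter/digit groups joined by single hyphens; any further restrictions on the shape of a CID are not modelled.

```python
import re
from typing import Optional

# Assumed shape of a CID over a-z, 0-9 and '-': non-empty groups joined by single hyphens.
CID_SHAPE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def normal_cid(cid: str) -> Optional[str]:
    """Return N(cid) for lowercase ASCII CIDs, or None if the input is not one."""
    if CID_SHAPE.fullmatch(cid) is None:
        return None
    # Maximal runs consisting only of letters or only of digits; hyphens end a run.
    runs = re.findall(r"[a-z]+|[0-9]+", cid)
    # For lowercase letters and for digits, NF is the identity.
    return "-".join(runs)


def equivalent(a: str, b: str) -> bool:
    """Two CIDs are equivalent iff their normal CIDs exist and are identical."""
    normal_a, normal_b = normal_cid(a), normal_cid(b)
    return normal_a is not None and normal_a == normal_b
```

Applied to the example above, both xyz12p5 and xyz-12p-5 are mapped to xyz-12-p-5 and are therefore equivalent.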

Words vs. Symbols

On one hand we would like to treat Tree and tree as equivalent identifiers. On the other hand it is often convenient to treat T and t as different symbols. After all, they look entirely different. This creates a tension which we resolve by distinguishing between three different classes of sequences of letters:
Depending on the class there are different normal forms $NF_{\text{word}}$ and $NF_{\text{symbol}}$, and we define
\[ NF(l) = \begin{cases} NF_{\text{word}}(l) & \text{if } l \text{ is a word} \\ NF_{\text{symbol}}(l) & \text{if } l \text{ is a symbol} \\ \text{undefined} & \text{otherwise} \end{cases} \]
A symbol is any letter sequence $s$ such that $NF_{\text{symbol}}(s)$ is defined. Currently all symbols consist of a single letter.
A word is any letter sequence $w$ such that $NF_{\text{word}}(w)$ is defined, non-empty, and not a symbol.
All letter sequences that are neither words nor symbols are invalid. For example, Ä is invalid, because it is neither a word (because $NF_{\text{word}}(\text{Ä}) = \text{a}$, and a is a symbol) nor a symbol.

Extension Steps

We define the full set of cosmopolitan identifiers in several stages. Stage $i$ consists of a set of letters $L_i$, and normal forms $NF^{i}_{\text{word}}$ and $NF^{i}_{\text{symbol}}$ operating on sequences in $L_i^*$.
Our initial set $L_0$ of letters consists of the lowercase letters from a U+0061 to z U+007A. The two normal forms $NF^{0}_{\text{word}}$ and $NF^{0}_{\text{symbol}}$ are defined for $x \in L_0^*$ by
\[ \begin{array}{rcl} NF^{0}_{\text{word}}(x) & = & x \\ NF^{0}_{\text{symbol}}(x) & = & \begin{cases} x & \text{if } |x| = 1 \\ \text{undefined} & \text{otherwise} \end{cases} \end{array} \]
In short, the symbols of stage 0 are those sequences in $L_0^*$ consisting of a single letter, and the words are all sequences in $L_0^*$ that have at least two elements.
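Stage 0 can be written down directly. In the following Python sketch, an undefined normal form is represented by `None`; the function names are hypothetical, and the classification follows the definitions of symbol and word given in the previous section.

```python
import re
from typing import Optional


def nf_symbol_0(x: str) -> Optional[str]:
    """Stage-0 symbol normal form: defined only for a single letter a-z."""
    return x if re.fullmatch(r"[a-z]", x) else None


def nf_word_0(x: str) -> Optional[str]:
    """Stage-0 word normal form: the identity on sequences of letters a-z."""
    return x if re.fullmatch(r"[a-z]*", x) else None


def classify_0(x: str) -> str:
    """Classify a letter sequence as 'symbol', 'word' or 'invalid'."""
    if nf_symbol_0(x) is not None:
        return "symbol"
    word = nf_word_0(x)
    # A word needs a defined, non-empty normal form that is not itself a symbol.
    if word is not None and word != "" and nf_symbol_0(word) is None:
        return "word"
    return "invalid"
```

Here t is a symbol and tree is a word, while Tree is still invalid because uppercase letters only enter the picture in stage 1.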
Starting from this initial stage 0, we proceed as follows:
  1. Stage 1 adds Latin letters.
  2. Stage 2 adds Greek letters.
  3. Stage 3 adds Cyrillic letters.
  4. Stage 4 adds letter-like mathematical symbols.
Each extension step from stage $i$ to stage $i+1$ follows these rules:
The following sections describe these four extension steps. They take advantage of Unicode features that we will be looking at first.

Primary Characters and Allowed Marks

Unicode characters come in the form of grapheme clusters [4], which are certain sequences of Unicode codepoints. For example, the Unicode character ä is represented by the grapheme cluster U+0061 U+0308, consisting of the codepoint a U+0061 and the codepoint  ̈ U+0308 which is called a combining diacritical mark. The character ä can also be represented by the grapheme cluster consisting of only the single codepoint U+00E4, so the same character can have multiple different representations as grapheme clusters.
Each Unicode character $u$ has a canonically decomposed grapheme cluster normal form $p\ d_1 \ldots d_n$ called NFD. We call $p$ the primary character of $u$, and each $d_i$ a mark of $u$. In the following extension steps, we are only interested in those letters $u$ where all marks are combining diacritical marks.
Furthermore, I have observed that only a subset of all combining diacritical marks actually occurs in letters of modern alphabets, and therefore only the marks listed in Appendix M are allowed.
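The decomposition into primary character and marks can be obtained with Python's standard `unicodedata` module, as the following sketch shows. The function names are hypothetical. As a stand-in for the list of allowed marks in Appendix M, the sketch simply accepts the whole Unicode block Combining Diacritical Marks (U+0300 to U+036F), which is an assumption and larger than the set the paper intends.

```python
import unicodedata
from typing import Tuple


def primary_and_marks(character: str) -> Tuple[str, str]:
    """Split one character (in any representation) into primary character and marks."""
    decomposed = unicodedata.normalize("NFD", character)
    return decomposed[0], decomposed[1:]


def is_combining_diacritical_mark(codepoint: str) -> bool:
    """Assumed stand-in for Appendix M: the block U+0300..U+036F."""
    return 0x0300 <= ord(codepoint) <= 0x036F


def all_marks_allowed(character: str) -> bool:
    """True if every mark of the character passes the stand-in check."""
    _, marks = primary_and_marks(character)
    return all(is_combining_diacritical_mark(mark) for mark in marks)
```

Both representations of ä, the single codepoint U+00E4 and the sequence U+0061 U+0308, yield the primary character a and the single mark U+0308.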

Stage 1: Latin Script

The set $L_1$ consists of all Unicode characters $u$ such that all marks of $u$ are allowed, and such that the primary character $p$ of $u$ is listed in Appendix L. The normal form $NF^{1}_{\text{word}}(u)$ is the translation of $p$ as shown in Appendix L. Note that all marks of $u$ are dropped for translation.
For letter sequences $w = u_1 \ldots u_n \in L_1^*$, $NF^{1}_{\text{word}}(w)$ is defined transliterally, i.e.
\[ NF^{1}_{\text{word}}(w) = NF^{1}_{\text{word}}(u_1) \ldots NF^{1}_{\text{word}}(u_n). \]
The symbols $s$ of stage 1 are listed in Appendix LS. They are all new symbols, which is another way of saying that $NF^{1}_{\text{symbol}}(s) = s$ holds.
For example, Lothar-Matthäus-10 is a valid cosmopolitan identifier with normal form lothar-matthaus-10. The identifier Lothar-M is also a CID, with normal form lothar-M.
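The following Python sketch puts the pieces of stage 1 together for a deliberately small alphabet. It rests on several assumptions that stand in for the appendices: the primary characters are just the ASCII letters, each translating to its lowercase form; the allowed marks are the block U+0300 to U+036F; and the symbols are the single unmarked ASCII letters. All function names are hypothetical.

```python
import re
import unicodedata
from typing import Optional

ASCII_LETTERS = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")


def is_allowed_mark(codepoint: str) -> bool:
    # Assumed stand-in for Appendix M.
    return 0x0300 <= ord(codepoint) <= 0x036F


def nf_symbol(run: str) -> Optional[str]:
    # Assumed stand-in for Appendix LS: single unmarked ASCII letters.
    return run if len(run) == 1 and run in ASCII_LETTERS else None


def nf_word(run: str) -> Optional[str]:
    """Translate an NFD letter run: drop marks, lowercase each primary character."""
    translated = []
    for codepoint in run:
        if is_allowed_mark(codepoint):
            if not translated:
                return None  # a mark without a primary character
        elif codepoint in ASCII_LETTERS:
            translated.append(codepoint.lower())
        else:
            return None  # not a letter of this sketch's alphabet
    return "".join(translated)


def nf_letters(run: str) -> Optional[str]:
    symbol = nf_symbol(run)
    if symbol is not None:
        return symbol
    word = nf_word(run)
    # A word needs a non-empty normal form that is not itself a symbol.
    if word and nf_symbol(word) is None:
        return word
    return None


def normal_cid(cid: str) -> Optional[str]:
    parts = []
    for group in unicodedata.normalize("NFD", cid).split("-"):
        if not group:
            return None  # empty group, e.g. from a doubled hyphen
        for run in re.findall(r"[0-9]+|[^0-9]+", group):
            normal = run if run[0] in "0123456789" else nf_letters(run)
            if normal is None:
                return None
            parts.append(normal)
    return "-".join(parts)
```

Under these assumptions, Lothar-Matthäus-10 is mapped to lothar-matthaus-10 and Lothar-M to lothar-M, while Ä on its own has no normal form.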

Stage 2: Greek Script

Support for Greek characters is based on ISO 843:1997 [5] transliteration.
This step adds all Unicode characters $u$ such that all its marks are allowed, and such that its primary character is listed in Appendix G.
Certain 2-letter combinations must be considered specially when translating a word $w$ to $NF^{2}_{\text{word}}(w)$. Possible marks do not matter as long as they are allowed, and are simply dropped during translation:
Otherwise, if none of the above situations apply, we translate single Unicode characters with their primary character listed in Appendix G as described there, and by dropping all marks.
Words are translated transliterally based on their division into 2-letter combinations and single characters.
The Greek symbols are listed in Appendix GS. They are all different from each other, but some of them are equivalent to Latin symbols. For example Β U+0392 is equivalent to B U+0042, i.e. $NF^{2}_{\text{symbol}}(\text{Β}) = \text{B}$. On the other hand Γ U+0393 is a new symbol, and thus $NF^{2}_{\text{symbol}}(\text{Γ}) = \text{Γ}$.
For example, Μπέχρος is a CID, and translates to the simple identifier mpechros.

Stage 3: Cyrillic Script

Support for Cyrillic characters is added by applying GOST 7.79-2000 System B [6].
This step adds all Unicode characters $u$ such that all its marks are allowed, and such that its primary character is listed in Appendix C.
There are a few special cases to consider where the translation from $u$ to $NF^{3}_{\text{word}}(u)$ does not rely only on the primary character of $u$, but also on its marks:
All other Unicode characters with their primary character listed in Appendix C are translated by dropping all marks and performing a translation of the primary character as described in Appendix C.
Words $w$ are then translated transliterally to $NF^{3}_{\text{word}}(w)$.
The Cyrillic symbols are listed in Appendix CS. Some of them are equivalent to each other because upper and lower case letters are just scaled versions of each other. Some of them are furthermore equivalent to Latin and/or Greek symbols.
For example, Андре́й-Никола́евич-Колмого́ров is a CID, and translates to the normal form andrej-nikolaevich-kolmogorov.
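The translation of this example can be traced with a short Python sketch. The table below is only an excerpt of GOST 7.79-2000 System B covering the letters that occur in the example, not a reproduction of Appendix C, and the function names are hypothetical. The only mark-dependent case handled is the breve on и (U+0438), which yields j; this case is taken here as an assumed instance of the special cases mentioned above. Symbols and the check for allowed marks are left out.

```python
import unicodedata

BREVE = "\u0306"

# Excerpt of GOST 7.79-2000 System B, keyed by lowercase primary character.
GOST_B_EXCERPT = {
    "\u0430": "a",   # а
    "\u0432": "v",   # в
    "\u0433": "g",   # г
    "\u0434": "d",   # д
    "\u0435": "e",   # е
    "\u0438": "i",   # и
    "\u043a": "k",   # к
    "\u043b": "l",   # л
    "\u043c": "m",   # м
    "\u043d": "n",   # н
    "\u043e": "o",   # о
    "\u0440": "r",   # р
    "\u0447": "ch",  # ч
}


def translate_word(word: str) -> str:
    """Translate one Cyrillic word; raises KeyError for letters outside the excerpt."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    result = []
    index = 0
    while index < len(decomposed):
        primary = decomposed[index]
        index += 1
        marks = ""
        # Collect the combining marks that follow the primary character.
        while index < len(decomposed) and unicodedata.combining(decomposed[index]):
            marks += decomposed[index]
            index += 1
        if primary == "\u0438" and BREVE in marks:
            result.append("j")  # и with breve (й) is not translated like plain и
        else:
            result.append(GOST_B_EXCERPT[primary])  # all other marks are dropped
    return "".join(result)


def translate_identifier(cid: str) -> str:
    """Translate each hyphen-separated word of an all-Cyrillic CID."""
    return "-".join(translate_word(word) for word in cid.split("-"))
```

Note that the stress marks (U+0301) in the example are dropped, whereas the breve in й changes the translation from i to j.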

Stage 4: Mathematical Symbols

This step adds letter-like mathematical symbols as listed in Appendix MS. We have $NF^{4}_{\text{word}} = NF^{3}_{\text{word}}$, and $NF^{4}_{\text{symbol}}$ extends $NF^{3}_{\text{symbol}}$ by acting as the identity on all symbols in Appendix MS.
For example, , ℕ1 and ℕ-ℕ are CIDs, but ℕℕ is not.

Separators and External CIDs

Although we have defined the hyphen - U+002D as the only possible separator, in practice we may want to use a different set of separators, depending on the situation. We might for example want to use the underscore _ U+005F instead, or also allow spaces U+0020 as separators.
We can accommodate this by using external CIDs. What exactly an external CID is depends on your situation. All you need to do is to provide a description of the syntax of an external CID, and how to translate it to an ordinary CID.
For example, we might define an external CID to be a sequence of letters, digits, hyphens, underscores and spaces such that after a cleanup it becomes a CID. Here we define a cleanup as making the following modifications in order:
  1. Trimming spaces from the start and the end.
  2. Replacing consecutive spaces with a single space.
  3. Replacing spaces and underscores with hyphens.
In this example, x z_9-t would be a valid external CID, corresponding to the CID x-z-9-t. On the other hand, x - z would not be a valid external CID, as after cleanup it becomes x---z, which is not a CID.
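The cleanup described in this example can be sketched in Python as follows. To keep the sketch short, letters are restricted to a to z, and a CID is assumed to be a sequence of non-empty letter/digit groups joined by single hyphens; the function names are hypothetical.

```python
import re
from typing import Optional

# Assumed shape of a CID over a-z, 0-9 and '-'.
CID_SHAPE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def cleanup(external: str) -> str:
    """Apply the three cleanup steps in order."""
    text = external.strip(" ")                   # 1. trim spaces at both ends
    text = re.sub(r" +", " ", text)              # 2. collapse consecutive spaces
    return text.replace(" ", "-").replace("_", "-")  # 3. spaces/underscores to hyphens


def external_to_cid(external: str) -> Optional[str]:
    """Return the CID for a valid external CID, or None otherwise."""
    if re.fullmatch(r"[a-z0-9 _-]*", external) is None:
        return None  # contains characters outside this sketch's alphabet
    cid = cleanup(external)
    return cid if CID_SHAPE.fullmatch(cid) else None
```

As in the text, x z_9-t is accepted and yields x-z-9-t, whereas x - z is rejected because the cleanup turns it into x---z.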

Reference Implementation

A reference implementation of cosmopolitan identifiers as described in this paper is available at https://github.com/phlegmaticprogrammer/CosmopolitanIdentifiers.

Conclusion

Cosmopolitan identifiers retain much of the simplicity and clarity of Fortran identifiers, but allow users to use their native scripts and letter-like math symbols. This is achieved by mapping each cosmopolitan identifier to a normal form which is basically a Fortran identifier, apart from the fact that it can also include symbols.
We have proceeded in four stages to define cosmopolitan identifiers. Ideally, in your application you would use CIDs as defined here. But if that is not possible for some reason, then CIDs might be adaptable to your needs by adding more stages, for example to allow more symbols to function as identifiers.
While it is possible to use equivalent but not identical identifiers in the same context, this is not recommended. For example, dx and Δx are equivalent, but not identical. Obviously, one should not write expressions like dx * Δx, but write either Δx * Δx or dx * dx. Ideally, these are not just conventions, but part of the definition of programming languages based on cosmopolitan identifiers, which would issue warnings or even errors in such situations. On the other hand, when accessing identifiers from another context it is OK to change them. For example, when accessing a library which exposes an API based on Greek identifiers, it is OK to use equivalent Latin identifiers at the calling site.
Currently the supported scripts are Latin, Greek, Cyrillic and Math. It would be great if it were possible to extend cosmopolitan identifiers to other widely used scripts without compromising their conceptual and technical simplicity.

References

[1] Stephen J. Chapman. (2017). Fortran for Scientists and Engineers, 4th Edition.
[2] Mark Davis (ed.). (2020). Unicode Identifier and Pattern Syntax, Unicode Standard Annex #31, https://unicode.org/reports/tr31/.
[3] Ken Whistler (ed.). (2020). Unicode Normalization Forms, Unicode Standard Annex #15, https://unicode.org/reports/tr15/.
[4] Mark Davis, Christopher Chapman (eds.). (2020). Unicode Text Segmentation, Unicode Standard Annex #29, https://unicode.org/reports/tr29/.
[5] (1997). ISO 843:1997: Information and documentation — Conversion of Greek characters into Latin characters, International Organization for Standardization, https://www.iso.org/standard/5215.html.
[6] (2002). GOST 7.79-2000: System of standards on information, librarianship and publishing. Rules of transliteration of Cyrillic script by Latin alphabet, https://runorm.com/catalog/1004/741213/.

Appendix M: Allowed Marks

Appendix L: Latin Primary Characters

Appendix G: Greek Primary Characters

Appendix C: Cyrillic Primary Characters

Appendix LS: Latin Symbols

Appendix GS: Greek Symbols

Appendix CS: Cyrillic Symbols

Appendix MS: Math Symbols