© 2021 Steven Obua

Cosmopolitan Identifiers

by Steven Obua
Cite as: https://doi.org/10.47757/obua.cosmo-id.2
May 13th, 2021
Abstract
I propose a simple Unicode-based lexical syntax for programming language identifiers using characters from international scripts (currently Latin, Greek and Cyrillic). What makes such cosmopolitan identifiers special is that each identifier is either equivalent to a uniquely determined simple identifier consisting only of ASCII characters, or that the identifier is a symbolic identifier. This makes collaboration in an international setting easier, especially in contexts where such identifiers are not only used by professional programmers, but are also present in the domain of normal users, for example through scriptable applications.

Introduction

Fortran [1], possibly the oldest programming language still in use, has an especially simple lexical syntax for identifiers: they consist of the letters A to Z and a to z, the digits 0 to 9, and the underscore _. An identifier must start with a letter. Moreover, identifiers are case-insensitive: the identifier Fortran denotes the same thing as the identifier fortran, for example.
Especially to users of programming languages who are not professional programmers, such as many scientists and engineers, this syntactical simplicity is a big advantage. Given that Fortran code is used, among other things, to control nuclear power plants, it certainly is a good thing to reduce the potential for confusion and misunderstanding as much as possible.
Identifiers are an important concept in computing: they make it possible to reference and abbreviate things, and to introduce a level of indirection. Computer users who are not programmers are exposed to the concept of identifiers as well, through the filesystem abstraction of their operating system. Attempts to make this abstraction unavailable to the user in Apple’s iOS had to be rolled back in more recent versions via the introduction of a dedicated Files app.
Scriptable computer applications like Blender also make some of their internals available to power users for automation purposes via built-in scripting engines. These are of course just programming languages, and therefore heavily reliant on identifiers.
Even though artificial intelligence is revolutionizing the interaction between humans and computers, it is my conviction that identifiers will nevertheless gain importance as a concept and become even more mainstream than they already are. In many situations, use of a precise identifier just beats a hand-wavy negotiation with an AI, and that is not going to change.
Given the fundamental importance of identifiers, people must be able to express them in their native script. This is self-evident in file systems: It would be unthinkable in a modern setting if people could only use Fortran identifiers as file names. It is also much easier to teach the use of identifiers to children if identifiers can be expressed in their native tongue.

Unicode Identifiers

Unicode, billing itself as the “World Standard for Text and Emoji”, recognizes the importance of identifiers and actually incorporates the explicitly defined concept of a Unicode identifier [2]. Modern programming languages like Swift have incorporated Unicode identifiers into their syntax: ℱ𝒪ℛ𝒯ℛ𝒜𝒩 is a perfectly valid identifier in Swift, and so is Φορτραν.
Nevertheless, this lexical diversity also comes at a price. Anecdotally, while typing the previous paragraph I experienced serious problems with the text editing software I am using (Quiver, which is otherwise excellent). Typing a character would insert it in an unexpected position, and copying and pasting the text in the paragraph was impossible.
Aside from these technological teething problems, more importantly, having to deal with Greek letters in an application programming interface would inconvenience me very much.
There is also a certain unwieldiness that Unicode brings to the table. Properties like boldness are part of the definition of some characters, and so you can have two different identifiers Fortran and 𝐅𝐨𝐫𝐭𝐫𝐚𝐧. This is messy, and very far from the clarity and simplicity that identifiers in Fortran provide.
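The relationship between these two spellings can be inspected with Python's standard unicodedata module. The following is only an illustrative sketch, with variable names chosen here for illustration: the two strings are different code point sequences, they stay different under canonical normalization (NFC), and they coincide only under compatibility normalization (NFKC).

```python
import unicodedata

plain = "Fortran"
# The same word spelled with MATHEMATICAL BOLD letters (written as escapes).
bold = "\U0001D405\U0001D428\U0001D42B\U0001D42D\U0001D42B\U0001D41A\U0001D427"

# Different code point sequences, hence different identifiers when compared as-is.
same_code_points = plain == bold

# Canonical normalization does not merge them ...
canonically_equal = (
    unicodedata.normalize("NFC", plain) == unicodedata.normalize("NFC", bold)
)

# ... but compatibility normalization does.
compatibility_equal = (
    unicodedata.normalize("NFKC", plain) == unicodedata.normalize("NFKC", bold)
)

print(same_code_points, canonically_equal, compatibility_equal)
```

Whether a language treats the two spellings as one identifier therefore depends on which normalization form, if any, it applies to identifiers.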
Intuitively, there is a difference between names and symbols. That is the whole reason that most programming languages make this distinction in the first place. In my opinion, the concept of Unicode identifier does not respect this distinction enough to be adequate in a programming language context. For example, the characters and 𝔽 are better treated as symbols, than as normal parts of an identifier. If the programming language wishes, it can carefully introduce additional symbolic identifiers that capture the use of such symbols as identifiers.

Simple, Symbolic and Cosmopolitan Identifiers

I propose cosmopolitan identifiers as the golden middle between Fortran identifiers and Unicode identifiers. These are defined by extending simple identifiers and symbolic identifiers with characters from various scripts in a principled way. Currently, Latin, Greek and Cyrillic scripts are supported.
Every cosmopolitan identifier is a Unicode identifier, but not every Unicode identifier is a cosmopolitan identifier. Also, equivalence of cosmopolitan identifiers is defined differently than both canonical equivalence and compatibility [3] as defined in the Unicode standard. Two cosmopolitan identifiers which are canonically equivalent are always also equivalent in the cosmopolitan sense, but not necessarily the other way around.

Simple Identifiers

Let me first define simple identifiers. These are sequences consisting of the lowercase letters a to z, the digits 0 to 9, and the hyphen -, such that the sequence starts with a letter, contains at least two characters, does not end with a hyphen, and contains no two consecutive hyphens. Two simple identifiers are equivalent iff they are identical.
For example, lothar-matthaus-10 is a simple identifier. These are not simple identifiers: Lothar-Matthaus-10, lothar-matthäus, 10-lothar-matthaus, lothar-matthaus-, lothar--matthaus.
Simple identifiers are basically aesthetically pleasing Fortran identifiers. As such, they inherit their clarity and simplicity. One might debate choosing the hyphen over the underscore. This is obviously a somewhat arbitrary choice, but unproblematic in my opinion. If you mean dx - 1, you really should not write dx-1 instead.
Note that simple identifiers must contain at least two characters! This means that dx is a simple identifier, but x is not. This is done to be able to distinguish names from symbols.
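The definition is small enough to be checked with a single regular expression. The following is a minimal Python sketch, not a reference implementation; the function name is_simple_identifier is made up here, and the character set (lowercase a to z, digits, hyphen) is an assumption read off the examples above.

```python
import re

# A letter, followed by one or more groups of "optional hyphen + letter or digit".
# This enforces: starts with a letter, at least two characters, no trailing
# hyphen, and no two consecutive hyphens.
_SIMPLE_IDENTIFIER = re.compile(r"[a-z](?:-?[a-z0-9])+")


def is_simple_identifier(text: str) -> bool:
    """Return True if `text` is a simple identifier in the sense sketched above."""
    return _SIMPLE_IDENTIFIER.fullmatch(text) is not None


for candidate in ["lothar-matthaus-10", "dx", "x", "Lothar-Matthaus-10",
                  "10-lothar-matthaus", "lothar-matthaus-", "lothar--matthaus"]:
    print(candidate, is_simple_identifier(candidate))
```

Only the first two candidates are accepted; the others each violate one of the conditions.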

Symbolic Identifiers

Single character identifiers like i or I are used in practice as symbols, not as names. Therefore it is more intuitive to treat i and I as different identifiers.
What matters for symbolic identifiers is their appearance, not their intrinsic meaning. This is reflected in which symbolic identifiers are considered equivalent.
Currently all symbolic identifiers consist of a single Latin, Greek or Cyrillic primary character.

Extension Steps

Let us now extend the notion of simple identifiers in steps until we reach cosmopolitan identifiers.
Each step is required to follow these rules:
Two valid identifiers are considered to be equivalent if they translate to identical simple identifiers.

Step 0: Space

This is, strictly speaking, not an extension step as described, because we are not adding a letter but the space character U+0020, which translates to the hyphen - U+002D. An identifier is valid if it translates to a simple identifier after first removing all leading and trailing spaces, and then replacing each run of consecutive spaces with a single hyphen.
For example, lothar matthaus 10 is now a valid identifier and is equivalent to the simple identifier lothar-matthaus-10.
Programming languages will usually forbid the direct use of spaces in identifiers, but such identifiers can still be fruitfully applied when shared outside the context of the programming language, for example when the identifier simultaneously denotes a file name, or for pretty printing the identifier.
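The space translation of Step 0 can be sketched in a few lines of Python. The function name translate_spaces is hypothetical, and the sketch only performs the translation; checking that the result is a simple identifier is a separate step.

```python
import re


def translate_spaces(identifier: str) -> str:
    """Strip leading/trailing spaces, then turn each run of spaces into one hyphen."""
    trimmed = identifier.strip(" ")   # only U+0020, not other whitespace
    return re.sub(" +", "-", trimmed)


print(translate_spaces("  lothar   matthaus 10 "))
```

Note that only the space character U+0020 is handled; tabs and other whitespace are deliberately left untouched, since the text above mentions only U+0020.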

Primary Characters and Allowed Marks

Unicode characters come in the form of grapheme clusters [4], which are certain sequences of Unicode codepoints. For example, the Unicode character ä is represented by the grapheme cluster U+0061 U+0308, consisting of the codepoint a U+0061 and the codepoint  ̈ U+0308 which is called a combining diacritical mark. The character ä can also be represented by the grapheme cluster consisting of only the single codepoint U+00E4, so the same character can have multiple different representations as grapheme clusters.
Each Unicode character $u$ has a canonically decomposed grapheme cluster normal form $p\ d_1 \ldots d_n$, called NFD. We call $p$ the primary character of $u$, and each $d_i$ a mark of $u$. In the following extension steps, we are only interested in those letters $u$ where all marks are combining diacritical marks.
Furthermore, I have observed that only a subset of all combining diacritical marks actually occurs in letters of modern alphabets, and therefore only the marks listed in Appendix M are allowed.
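The split into primary character and marks can be tried out with Python's standard unicodedata module. This is a small sketch under two assumptions: the input is already a single character (one grapheme cluster), and "combining diacritical mark" is approximated by membership in the Unicode block U+0300 to U+036F rather than by the specific list in Appendix M. The function names are made up for illustration.

```python
import unicodedata


def decompose(character: str) -> tuple[str, list[str]]:
    """Return (primary character, marks) of a single character via NFD."""
    nfd = unicodedata.normalize("NFD", character)
    return nfd[0], list(nfd[1:])


def only_combining_diacritical_marks(marks: list[str]) -> bool:
    """Assumption: a mark qualifies if it lies in the block U+0300..U+036F."""
    return all(0x0300 <= ord(mark) <= 0x036F for mark in marks)


# Both representations of the same character yield the same decomposition.
print(decompose("\u00e4") == decompose("a\u0308"))
```

Because NFD is applied first, it does not matter which of the two representations of ä is passed in.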

Step 1: Latin Script

This step adds all Unicode characters $u$ such that all of its marks are allowed and its primary character is listed in Appendix L.
Identifiers consisting of at least two characters are translated to simple identifiers by dropping all marks from all characters and translating each primary character as described in Appendix L.
For example, Lothar Matthäus 10 is now a valid identifier, and is equivalent to the simple identifier lothar-matthaus-10.
Letters A to Z and a to z are considered single-letter symbolic identifiers, all different from each other, as listed in Appendix LS.
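Combining Step 0 and Step 1, the translation of a Latin identifier might be sketched as follows. This is a simplification under clearly stated assumptions: the table of Appendix L is replaced by "ASCII letters map to their lowercase form", the allowed marks of Appendix M are replaced by the whole block U+0300 to U+036F, and the name translate_latin is hypothetical. The sketch does not check that the result is a valid simple identifier.

```python
import re
import unicodedata


def translate_latin(identifier: str) -> str:
    """Translate spaces to hyphens, drop marks, and lowercase ASCII letters."""
    text = re.sub(" +", "-", identifier.strip(" "))
    translated = []
    for code_point in unicodedata.normalize("NFD", text):
        if 0x0300 <= ord(code_point) <= 0x036F:
            continue  # assumption: every mark in this block is allowed and dropped
        if code_point.isascii() and (code_point.isalnum() or code_point == "-"):
            translated.append(code_point.lower())
        else:
            # Primary characters outside this simplified table are rejected here.
            raise ValueError(f"unsupported character: {code_point!r}")
    return "".join(translated)


print(translate_latin("Lothar Matth\u00e4us 10"))
```

Primary characters that Appendix L may translate to more than one letter are outside the scope of this simplified table and raise an error here.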

Step 2: Greek Script

Support for Greek characters is based on ISO 843:1997 [5] transliteration.
This step adds all Unicode characters $u$ such that all of its marks are allowed and its primary character is listed in Appendix G.
Certain 2-letter combinations must be considered specially for translation. Possible marks do not matter as long as they are allowed, and are simply dropped during translation:
Otherwise, if none of the above situations applies, we translate Unicode characters whose primary character is listed in Appendix G as described there, dropping all marks.
An identifier must consist of at least two characters to be translated to a simple identifier; a single-character identifier is valid only if it is a symbolic identifier.
The Greek symbolic identifiers are listed in Appendix GS. They are all different from each other, but some of them are equivalent to Latin symbolic identifiers, for example Β U+0392 is equivalent to B U+0042.
For example, Μπέχρος is now a valid identifier, and translates to the simple identifier mpechros.
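The example can be reproduced with a short Python sketch. The table below is only a hypothetical excerpt covering the letters of this one example, using their ISO 843 transliterations; the full table is the one in Appendix G, and the special two-letter combinations mentioned above are not handled. Keys are written as escapes to avoid confusion with look-alike characters.

```python
import unicodedata

# Hypothetical excerpt: only the letters needed for the example.
GREEK_EXCERPT = {
    "\u03bc": "m",   # mu
    "\u03c0": "p",   # pi
    "\u03b5": "e",   # epsilon
    "\u03c7": "ch",  # chi
    "\u03c1": "r",   # rho
    "\u03bf": "o",   # omicron
    "\u03c3": "s",   # sigma
    "\u03c2": "s",   # final sigma
}


def translate_greek(identifier: str) -> str:
    """Drop marks and translate each primary character via the excerpt table."""
    translated = []
    for code_point in unicodedata.normalize("NFD", identifier):
        if 0x0300 <= ord(code_point) <= 0x036F:
            continue  # assumption: marks in this block are allowed and dropped
        translated.append(GREEK_EXCERPT[code_point.lower()])
    return "".join(translated)


# The escapes spell the example identifier from the text.
print(translate_greek("\u039c\u03c0\u03ad\u03c7\u03c1\u03bf\u03c2"))
```

Letters outside the excerpt raise a KeyError, which makes the limits of the sketch explicit.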

Step 3: Cyrillic Script

Support for Cyrillic characters is added by applying GOST 7.79-2000 System B [6].
This step adds all Unicode characters $u$ such that all of its marks are allowed and its primary character is listed in Appendix C.
There are a few special cases to consider where the translation does not rely only on the primary character, but also on the marks:
All other Unicode characters whose primary character is listed in Appendix C are translated by dropping all marks and translating the primary character as described in Appendix C.
An identifier must consist of at least two characters to be translated to a simple identifier; a single-character identifier is valid only if it is a symbolic identifier.
The Cyrillic symbolic identifiers are listed in Appendix CS. Some of them are equivalent to each other because upper and lower case letters are just scaled versions of each other. Some of them are furthermore equivalent to Latin and/or Greek symbolic identifiers.
For example, Андре́й Никола́евич Колмого́ров is now a valid identifier, and translates to the simple identifier andrej-nikolaevich-kolmogorov.
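A Python sketch of this example shows why marks cannot always be dropped blindly. In NFD, the letter й (U+0439) decomposes into и (U+0438) followed by the combining breve U+0306, so the breve has to be taken into account to obtain j rather than i. The table is a hypothetical excerpt covering only the letters of the example, with GOST 7.79-2000 System B values; the breve rule is inferred from the example and is not a substitute for the list of special cases. All names are made up for illustration.

```python
import re
import unicodedata

# Hypothetical excerpt, keys written as escapes to avoid look-alike mix-ups.
CYRILLIC_EXCERPT = {
    "\u0430": "a", "\u0432": "v", "\u0433": "g", "\u0434": "d", "\u0435": "e",
    "\u0438": "i", "\u043a": "k", "\u043b": "l", "\u043c": "m", "\u043d": "n",
    "\u043e": "o", "\u0440": "r", "\u0447": "ch",
}
BREVE = "\u0306"


def translate_cyrillic(identifier: str) -> str:
    """Translate spaces, then primary characters; the breve on U+0438 yields 'j'."""
    text = re.sub(" +", "-", identifier.strip(" ")).lower()
    translated = []
    for code_point in unicodedata.normalize("NFD", text):
        if code_point == BREVE and translated and translated[-1] == "i":
            translated[-1] = "j"      # U+0438 + breve is the letter U+0439
        elif 0x0300 <= ord(code_point) <= 0x036F:
            continue                  # other marks, e.g. the stress accent, are dropped
        elif code_point == "-":
            translated.append("-")
        else:
            translated.append(CYRILLIC_EXCERPT[code_point])
    return "".join(translated)


# The escapes spell the first word of the example, including the stress accent.
print(translate_cyrillic("\u0410\u043d\u0434\u0440\u0435\u0301\u0439"))
```

Digits and letters outside the excerpt raise a KeyError in this sketch.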

Conclusion

Extending simple identifiers and symbolic identifiers via steps 0, 1, 2 and 3, we obtain cosmopolitan identifiers. Cosmopolitan identifiers retain much of the simplicity and clarity of Fortran identifiers, but allow users to use their native scripts when the situation calls for it. This is achieved by mapping each cosmopolitan identifier either to a uniquely determined simple identifier, or to a symbolic identifier, and deciding equality of identifiers based on just this mapping.
While it is possible to use equivalent but not identical identifiers in the same context, this is not recommended. For example, dx and Δx are equivalent, but not identical. Obviously, one should not write expressions like dx * Δx, but write either Δx * Δx or dx * dx. Ideally, these are not just conventions, but part of the definition of programming languages based on cosmopolitan identifiers, which would issue warnings or even errors in such situations. On the other hand, when accessing identifiers from another context it is OK to change them. For example, when accessing a library which exposes an API based on Greek identifiers, it is OK to use equivalent Latin identifiers at the calling site.
Currently the supported scripts are Latin, Greek and Cyrillic. It would be great if cosmopolitan identifiers could be extended to other widely used scripts without compromising their conceptual and technical simplicity.

References

[1] Stephen J. Chapman. (2017). Fortran for Scientists and Engineers, 4th Edition.
[2] Mark Davis (ed.). (2020). Unicode Identifier and Pattern Syntax, Unicode Standard Annex #31, https://unicode.org/reports/tr31/.
[3] Ken Whistler (ed.). (2020). Unicode Normalization Forms, Unicode Standard Annex #15, https://unicode.org/reports/tr15/.
[4] Mark Davis, Christopher Chapman (eds.). (2020). Unicode Text Segmentation, Unicode Standard Annex #29, https://unicode.org/reports/tr29/.
[5] (1997). ISO 843:1997: Information and documentation — Conversion of Greek characters into Latin characters, International Organization for Standardization, https://www.iso.org/standard/5215.html.
[6] (2002). GOST 7.79-2000: System of standards on information, librarianship and publishing. Rules of transliteration of Cyrillic script by Latin alphabet, https://runorm.com/catalog/1004/741213/.

Appendix M: Allowed Marks

Appendix L: Latin Primary Characters

Appendix G: Greek Primary Characters

Appendix C: Cyrillic Primary Characters

Appendix LS: Latin Symbolic Identifiers

Appendix GS: Greek Symbolic Identifiers

Appendix CS: Cyrillic Symbolic Identifiers