Cosmopolitan Identifiers

Cite as: https://doi.org/10.47757/obua.cosmo-id.3

July 1st, 2021

Abstract

I propose a simple Unicode-based lexical syntax for programming language identiﬁers using characters from international scripts (currently Latin, Greek, Cyrillic and Math). Such cosmopolitan identiﬁers are designed to achieve much of the simplicity of Fortran identiﬁers while acknowledging a modern international outlook. This seems particularly advantageous in contexts where such identiﬁers are not (only) used by professional programmers, but are exposed to normal users, for example through scriptable applications.

Introduction

Possibly the oldest programming language still in use, Fortran [1], has an especially simple lexical syntax for identifiers: they consist of the letters A to Z and a to z, the digits 0 to 9, and the underscore _. An identifier must start with a letter. Moreover, identifiers are case-insensitive: the identifier Fortran denotes the same thing as the identifier fortran, for example.

Especially for users of programming languages who are not professional programmers, such as many scientists and engineers, this syntactical simplicity seems to be a big advantage. Given that Fortran code is used, among other things, to control nuclear power plants, it certainly is a good thing to reduce the potential for confusion and misunderstanding as much as possible.

Identifiers are an important concept in computing: they allow us to reference and abbreviate things, and to introduce a level of indirection. Computer users who are not programmers are exposed to the concept of identifiers as well, through the filesystem abstraction of their operating system. Attempts to make this abstraction unavailable to the user in Apple’s iOS had to be rolled back in more recent versions via the introduction of a dedicated Files app.

Scriptable computer applications like Blender also make some of their internals available to power users for automation purposes via built-in scripting engines. These are of course just programming languages, and therefore heavily reliant on identiﬁers.

Even though artiﬁcial intelligence is revolutionizing the interaction between humans and computers, it is my conviction that identiﬁers will nevertheless gain importance as a concept and become even more mainstream than they already are. In many situations, use of a precise identiﬁer just beats a hand-wavy negotiation with an AI, and that is not going to change.

Given the fundamental importance of identiﬁers, people must be able to express them in the script of their native tongue. This is self-evident in ﬁle systems: It would be unthinkable in a modern setting if people could only use Fortran identiﬁers as ﬁle names. It is also much easier to teach the use of identiﬁers to children if identiﬁers can be expressed in their native tongue.

Furthermore, it can be very convenient, especially in scientiﬁc applications, to be able to use symbols like π directly as identiﬁers, or to use them as part of identiﬁers, as in approximation-of-π.

Our design goal for identiﬁers consists therefore of three subgoals:

They should be almost as simple as Fortran identiﬁers.
They should support international scripts.
They should support the use of symbols as (part of) identiﬁers.

Unicode Identiﬁers

Unicode, billing itself as the “World Standard for Text and Emoji”, recognizes the importance of identiﬁers and actually provides the concept of Unicode identiﬁers [2]. Modern programming languages like Swift have incorporated Unicode identiﬁers into their syntax: ℱ𝒪ℛ𝒯ℛ𝒜𝒩 is a perfectly valid identiﬁer in Swift, and so is Φορτραν.

Nevertheless, this lexical diversity also comes at a price. Having to deal with Greek letters in an application programming interface would inconvenience me very much, at least when they are used casually instead of just designating entities like π.

There is also a certain unwieldiness that Unicode brings to the table. Properties like boldness are part of the deﬁnition of some characters, and so you can have two different identiﬁers Fortran and 𝐅𝐨𝐫𝐭𝐫𝐚𝐧. This is messy, and very far from the clarity and simplicity that identiﬁers in Fortran provide. Granted, one will rarely encounter such a misuse of Unicode in practice, but why invite it in the ﬁrst place?

Intuitively, there is a difference between names and symbols. That is the whole reason that most programming languages make this distinction in the ﬁrst place. In my opinion, the concept of Unicode identiﬁers does not respect this distinction enough to be adequate in a programming language context. For example, the characters ℱ and 𝔽 are better treated as symbols than as arbitrary letters.

Clearly, while Unicode identiﬁers support international scripts, and can contain symbols, they are not as simple as Fortran identiﬁers at all, but a rather messy affair. Therefore Unicode identiﬁers only fulﬁll two of our three design goals for identiﬁers.

Cosmopolitan Identiﬁers

I propose cosmopolitan identifiers (CIDs) as a happy medium between Fortran identifiers and Unicode identifiers. A CID is a sequence of Unicode characters consisting of letters, digits and separators. The following properties hold for a CID:

A CID must start with a letter or symbol.
A CID must not end with a separator.
A CID must not contain consecutive separators.

A separator is a hyphen - U+002D. I favour the hyphen over the underscore for aesthetic reasons. In the section Separators and External CIDs we will look at how it is possible to use other characters like underscores and spaces as separators as well.

A digit is a character between (and including) 0 U+0030 and 9 U+0039.

To fully deﬁne what a CID is, it is necessary to describe the set of letters, which sequences of letters are allowed, and to state when two CIDs are considered to be equivalent. All of this will be done in the following sections.

The set of letters contains initially just the set of lowercase letters from a U+0061 to z U+007A. We will later extend this set with Latin, Greek and Cyrillic characters. We will also extend it with letter-like mathematical symbols.
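The three structural rules can be checked mechanically. The following Python sketch does so only for the initial letter set a to z; the helper name `is_stage0_cid` is hypothetical and not taken from the reference implementation.

```python
import re

# Assumption: letters are a-z only, digits are 0-9, the separator is '-'.
# The pattern enforces: starts with a letter, does not end with a separator,
# and never contains two separators in a row.
_STAGE0_CID = re.compile(r"[a-z][a-z0-9]*(?:-[a-z0-9]+)*")


def is_stage0_cid(text: str) -> bool:
    """Return True if text satisfies the three structural CID rules."""
    return _STAGE0_CID.fullmatch(text) is not None


print(is_stage0_cid("xyz-12p-5"))  # a well-formed CID
print(is_stage0_cid("x--y"))       # consecutive separators
print(is_stage0_cid("x-"))         # trailing separator
```

Once the letter set is extended, a single regular expression is no longer enough, because validity then also depends on which letter sequences count as words or symbols.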

Equivalence of CIDs

Every CID is a Unicode identiﬁer, but not every Unicode identiﬁer is a CID. Also, equivalence of CIDs is deﬁned differently than both canonical equivalence and compatibility [3] as deﬁned in the Unicode standard. Two cosmopolitan identiﬁers which are canonically equivalent as Unicode identiﬁers are always also equivalent in the cosmopolitan sense, but not necessarily the other way around.

To decide equivalence of two CIDs, we map each CID $c$ to a normal CID $N(c)$. Then two CIDs $a$ and $b$ are considered to be equivalent iff $N(a)$ and $N(b)$ are identical.

We construct $N(c)$ by dividing $c$ into maximal non-empty subsequences $s_1, \ldots, s_k$ consisting either only of letters or only of digits. Separators are ignored themselves but serve to restrict how far subsequences of letters or digits can expand. We map each $s_i$ separately to its normal form $NF(s_i)$, and then form the CID

$$NF(s_1)\text{-}NF(s_2)\text{-}\ldots\text{-}NF(s_k)$$

Note that if the normal form for one of the $s_i$ does not exist, i.e. is undefined, then the normal form for $c$ does not exist either.

For a complete description of $N$ we therefore only need to describe $NF(l)$ for a sequence $l$ of letters, and $NF(d)$ for a sequence $d$ of digits.

Digits are easy to handle; we just set

$$NF(d) = d$$

How we map general sequences of letters to their normal forms is described in the following sections. For a sequence $x$ consisting just of lowercase letters between a and z, we proceed just as we do for digits:

$$NF(x) = x$$

As an example, consider the CIDs xyz12p5 and xyz-12p-5. They both decompose into the subsequences xyz, 12, p and 5, and have therefore the same normal CID xyz-12-p-5.
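The construction of the normal CID can be sketched in a few lines of Python for the case where all letters lie between a and z, so that the normal form of every subsequence is the subsequence itself. The function names are hypothetical, and the input is assumed to be a valid CID already.

```python
import re


def normalize_stage0(cid: str) -> str:
    """Sketch of N(c) when the only letters are a-z."""
    # Maximal runs of letters or of digits; hyphens only delimit the runs.
    runs = re.findall(r"[a-z]+|[0-9]+", cid)
    # NF is the identity on each run here, so we only re-join with hyphens.
    return "-".join(runs)


def equivalent_stage0(a: str, b: str) -> bool:
    """Two CIDs are equivalent iff their normal CIDs are identical."""
    return normalize_stage0(a) == normalize_stage0(b)


print(normalize_stage0("xyz12p5"))    # xyz-12-p-5
print(normalize_stage0("xyz-12p-5"))  # xyz-12-p-5
```

Note that a separator between two digit runs is significant: x12-5 and x125 decompose differently and are therefore not equivalent.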

Words vs. Symbols

On one hand we would like to treat Tree and tree as equivalent identiﬁers. On the other hand it is often convenient to treat T and t as different symbols. After all, they look entirely different. This creates a tension which we resolve by distinguishing between three different classes of sequences of letters:

words
symbols
invalid letter sequences

Depending on the class, there are different normal forms ${NF}_{\text{word}}$ and ${NF}_{\text{symbol}}$, and we define

$$NF(l) = \begin{cases} {NF}_{\text{word}}(l) & \text{if } l \text{ is a word} \\ {NF}_{\text{symbol}}(l) & \text{if } l \text{ is a symbol} \\ \text{undefined} & \text{otherwise} \end{cases}$$

A symbol is any letter sequence $s$ such that ${NF}_{\text{symbol}}(s)$ is defined. Currently all symbols consist of a single letter.

A word is any letter sequence $w$ such that ${NF}_{\text{word}}(w)$ is defined, non-empty, and not a symbol.

All letter sequences that are neither words nor symbols are invalid. For example, Ä is invalid, because it is neither a word (because ${NF}_{\text{word}}(\text{Ä}) = \text{a}$, and a is a symbol) nor a symbol.
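The three-way classification can be illustrated with toy versions of the two normal forms. In the sketch below, `nf_symbol` and `nf_word` are simplified stand-ins that only know unaccented ASCII letters as primary characters and drop every combining mark without checking an allow-list; they are assumptions made for illustration, not the full definitions.

```python
import unicodedata
from typing import Optional

ASCII_LETTERS = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")


def nf_symbol(seq: str) -> Optional[str]:
    """Toy NF_symbol: defined only for a single unaccented ASCII letter."""
    return seq if seq in ASCII_LETTERS else None


def nf_word(seq: str) -> Optional[str]:
    """Toy NF_word: drop combining marks, lowercase ASCII primary characters."""
    out = []
    for ch in unicodedata.normalize("NFD", seq):
        if unicodedata.combining(ch):
            continue  # marks are dropped (no allow-list check in this toy)
        if ch not in ASCII_LETTERS:
            return None  # undefined outside the toy alphabet
        out.append(ch.lower())
    return "".join(out)


def classify(seq: str) -> str:
    """Return 'symbol', 'word' or 'invalid' for a letter sequence."""
    if nf_symbol(seq) is not None:
        return "symbol"
    w = nf_word(seq)
    # A word needs a defined, non-empty normal form that is not a symbol.
    if w is not None and w != "" and nf_symbol(w) is None:
        return "word"
    return "invalid"


for s in ["T", "Tree", "\u00c4"]:  # the last one is A with diaeresis
    print(s, classify(s))
```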

Extension Steps

We define the full set of cosmopolitan identifiers in several stages. Stage $i$ consists of a set of letters $L_i$, and normal forms ${NF}^{i}_{\text{word}}$ and ${NF}^{i}_{\text{symbol}}$ operating on sequences in $L_i^*$.

Our initial set $L_0$ of letters consists of the lowercase letters from a U+0061 to z U+007A. The two normal forms ${NF}^{0}_{\text{word}}$ and ${NF}^{0}_{\text{symbol}}$ are defined for $x \in L_0^*$ by

$$\begin{array}{rcl} {NF}^{0}_{\text{word}}(x) & = & x \\ {NF}^{0}_{\text{symbol}}(x) & = & \begin{cases} x & \text{if } x \text{ consists of a single letter} \\ \text{undefined} & \text{otherwise} \end{cases} \end{array}$$

In short, the symbols of stage 0 are those sequences in $L_0^*$ consisting of a single letter, and the words are all sequences in $L_0^*$ that have at least two elements.

Starting from this initial stage 0, we proceed as follows:

Stage 1 adds Latin letters.
Stage 2 adds Greek letters.
Stage 3 adds Cyrillic letters.
Stage 4 adds letter-like mathematical symbols.

Each extension step from stage $i$ to stage $i+1$ follows these rules:

A step adds a set of Unicode letters, i.e. $L_i \subset L_{i+1}$. The additional letters come from a modern alphabet which is actively and widely used today.
Normal forms ${NF}^{i+1}_{\text{word}}$ and ${NF}^{i+1}_{\text{symbol}}$ are defined on $L_{i+1}^*$ such that their restrictions to $L_i^*$ equal ${NF}^{i}_{\text{word}}$ and ${NF}^{i}_{\text{symbol}}$, respectively.
The normal forms work more or less transliterally, that is, they transform their input letter by letter. Sometimes multiple consecutive letters may be translated at once, and the immediate context of letters may be taken into account as well.
The normal forms are pure functions in that they do not depend on anything other than their input. In particular, they do not depend on things like the current geographical locale.

The following sections describe these four extension steps. They take advantage of Unicode features that we will be looking at ﬁrst.

Primary Characters and Allowed Marks

Unicode characters come in the form of grapheme clusters [4], which are certain sequences of Unicode codepoints. For example, the Unicode character ä is represented by the grapheme cluster U+0061 U+0308, consisting of the codepoint a U+0061 and the codepoint ̈ U+0308 which is called a combining diacritical mark. The character ä can also be represented by the grapheme cluster consisting of only the single codepoint U+00E4, so the same character can have multiple different representations as grapheme clusters.

Each Unicode character $u$ has a canonically decomposed grapheme cluster normal form $p\ d_1 \ldots d_n$ called NFD. We call $p$ the primary character of $u$, and each $d_i$ a mark of $u$. In the following extension steps, we are only interested in those letters $u$ where all marks are combining diacritical marks.

Furthermore, I have observed that only a subset of all combining diacritical marks actually occurs in letters of modern alphabets, and therefore only the marks listed in Appendix M are allowed.
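Python's standard `unicodedata` module can compute the NFD decomposition, which makes it easy to sketch the split into primary character and marks. The function names below are hypothetical, the allow-list contains only a handful of the entries of Appendix M, and the input is assumed to be a single non-empty grapheme cluster.

```python
import unicodedata
from typing import Tuple

# Assumption: a small excerpt of Appendix M; the full allow-list is longer.
ALLOWED_MARKS = {
    "\u0300", "\u0301", "\u0302", "\u0303", "\u0306",
    "\u0308", "\u030A", "\u030C", "\u0327",
}


def primary_and_marks(character: str) -> Tuple[str, str]:
    """Split one character into its primary character and its marks via NFD."""
    nfd = unicodedata.normalize("NFD", character)
    return nfd[0], nfd[1:]


def marks_allowed(character: str) -> bool:
    """True if every mark of the character is on the (excerpted) allow-list."""
    _, marks = primary_and_marks(character)
    return all(mark in ALLOWED_MARKS for mark in marks)


# Both representations of "a with diaeresis" decompose identically.
print(primary_and_marks("\u00e4") == primary_and_marks("a\u0308"))
```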

Stage 1: Latin Script

The set $L_1$ consists of all Unicode characters $u$ such that all marks of $u$ are allowed, and such that the primary character $p$ of $u$ is listed in Appendix L. The normal form ${NF}^{1}_{\text{word}}(u)$ is the translation of $p$ as shown in Appendix L. Note that all marks of $u$ are dropped for translation.

For letter sequences $w = u_1 \ldots u_n \in L_1^*$, ${NF}^{1}_{\text{word}}(w)$ is defined transliterally, i.e.

$${NF}^{1}_{\text{word}}(w) = {NF}^{1}_{\text{word}}(u_1) \ldots {NF}^{1}_{\text{word}}(u_n).$$

The symbols $s$ of stage 1 are listed in Appendix LS. They are all new symbols, which is another way of saying that ${NF}^{1}_{\text{symbol}}(s) = s$ holds.

For example, Lothar-Matthäus-10 is a valid cosmopolitan identifier with normal form lothar-matthaus-10. The identifier Lothar-M is also a CID, with normal form lothar-M.
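Putting the pieces together for stage 1, the following Python sketch normalizes a whole CID over the Latin script. It is a simplification under stated assumptions: only a few non-ASCII entries of Appendix L are included, marks are dropped without consulting Appendix M, the input is assumed to be structurally well-formed, and all names are hypothetical.

```python
import re
import unicodedata
from typing import Optional

# Assumption: a small excerpt of the non-ASCII entries of Appendix L.
LATIN_EXTRA = {"Æ": "ae", "æ": "ae", "Ø": "o", "ø": "o", "ß": "ss",
               "Ł": "l", "ł": "l", "Œ": "oe", "œ": "oe"}


def nf_word_latin(run: str) -> Optional[str]:
    """Translate a letter run character by character, dropping all marks."""
    out = []
    for ch in unicodedata.normalize("NFD", run):
        if unicodedata.combining(ch):
            continue  # marks are dropped; allow-list check omitted here
        if ch.isascii() and ch.isalpha():
            out.append(ch.lower())
        elif ch in LATIN_EXTRA:
            out.append(LATIN_EXTRA[ch])
        else:
            return None  # not a (known) Latin primary character
    return "".join(out)


def normalize_latin(cid: str) -> Optional[str]:
    """Sketch of N(c) for stage 1; returns None if some run is invalid."""
    parts = []
    for run in re.findall(r"[0-9]+|[^0-9-]+", cid):
        if "0" <= run[0] <= "9":
            nf = run  # digit runs are their own normal form
        elif len(run) == 1 and run.isascii() and run.isalpha():
            nf = run  # a single unaccented letter is a symbol
        else:
            nf = nf_word_latin(run)
            # A word's normal form must be non-empty and must not be a symbol;
            # here every one-letter result is a symbol.
            if nf is None or len(nf) < 2:
                return None
        parts.append(nf)
    return "-".join(parts)


print(normalize_latin("Lothar-Matth\u00e4us-10"))  # lothar-matthaus-10
print(normalize_latin("Lothar-M"))                 # lothar-M
```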

Stage 2: Greek Script

Support for Greek characters is based on ISO 843:1997 [5] transliteration.

This step adds all Unicode characters $u$ such that all its marks are allowed, and such that its primary character is listed in Appendix G.

Certain 2-letter combinations must be considered specially when translating a word $w$ to ${NF}^{2}_{\text{word}}(w)$. Possible marks do not matter as long as they are allowed, and are simply dropped during translation:

ΑΥ U+0391 U+03A5, Αυ U+0391 U+03C5, αΥ U+03B1 U+03A5 and αυ U+03B1 U+03C5 all translate to au U+0061 U+0075
ΕΥ U+0395 U+03A5, Ευ U+0395 U+03C5, εΥ U+03B5 U+03A5 and ευ U+03B5 U+03C5 all translate to eu U+0065 U+0075
ΟΥ U+039F U+03A5, Ου U+039F U+03C5, οΥ U+03BF U+03A5 and ου U+03BF U+03C5 all translate to ou U+006F U+0075

Otherwise, if none of the above situations applies, we translate single Unicode characters with their primary character listed in Appendix G as described there, dropping all marks.

Words are translated transliterally based on their division into 2-letter combinations and single characters.

The Greek symbols are listed in Appendix GS. They are all different from each other, but some of them are equivalent to Latin symbols. For example, Β U+0392 is equivalent to B U+0042, i.e. ${NF}^{2}_{\text{symbol}}(\text{Β}) = \text{B}$. On the other hand, Γ U+0393 is a new symbol, and thus ${NF}^{2}_{\text{symbol}}(\text{Γ}) = \text{Γ}$.

For example, Μπέχρος is a CID, and translates to the simple identiﬁer mpechros.
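The Greek word translation, including the three 2-letter combinations, can be sketched as follows. The table is built from the lowercase entries of Appendix G in code point order, and uppercase letters are lowercased first, which is equivalent because Appendix G translates both cases identically. The function name is hypothetical, and marks are dropped without checking Appendix M.

```python
import unicodedata
from typing import Optional

# Translations for U+03B1 (alpha) .. U+03C9 (omega) in code point order,
# taken from Appendix G; U+03C2 and U+03C3 are the two forms of sigma.
_ROW = ["a", "v", "g", "d", "e", "z", "i", "th", "i", "k", "l", "m", "n",
        "x", "o", "p", "r", "s", "s", "t", "y", "f", "ch", "ps", "o"]
GREEK = dict(zip((chr(c) for c in range(0x03B1, 0x03CA)), _ROW))

# alpha, epsilon or omicron followed by upsilon form a 2-letter combination.
DIGRAPH_FIRST = {"\u03b1": "au", "\u03b5": "eu", "\u03bf": "ou"}
UPSILON = "\u03c5"


def nf_word_greek(word: str) -> Optional[str]:
    """Translate a Greek letter run; returns None for unknown characters."""
    primaries = [ch.lower() for ch in unicodedata.normalize("NFD", word)
                 if not unicodedata.combining(ch)]
    out = []
    i = 0
    while i < len(primaries):
        ch = primaries[i]
        if (ch in DIGRAPH_FIRST and i + 1 < len(primaries)
                and primaries[i + 1] == UPSILON):
            out.append(DIGRAPH_FIRST[ch])
            i += 2
        elif ch in GREEK:
            out.append(GREEK[ch])
            i += 1
        else:
            return None
    return "".join(out)


print(nf_word_greek("Μπέχρος"))  # mpechros
```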

Stage 3: Cyrillic Script

Support for Cyrillic characters is added by applying GOST 7.79-2000 System B [6].

This step adds all Unicode characters $u$ such that all its marks are allowed, and such that its primary character is listed in Appendix C.

There are a few special cases to consider where the translation from $u$ to ${NF}^{3}_{\text{word}}(u)$ does not rely only on the primary character of $u$, but also on its marks:

Unicode characters with primary character U+0418 or U+0438 are translated to j U+006A if U+0306 is among their marks.
Unicode characters with primary characters U+0406 or U+0456 are translated to yi U+0079 U+0069 if U+0308 is among their marks.
Unicode characters with primary characters U+0415 or U+0435 are translated to yo U+0079 U+006F if U+0308 is among their marks.

All other Unicode characters with their primary character listed in Appendix C are translated by dropping all marks and performing a translation of the primary character as described in Appendix C.

Words $w$ are then translated transliterally to ${NF}^{3}_{\text{word}}(w)$.

The Cyrillic symbols are listed in Appendix CS. Some of them are equivalent to each other because upper and lower case letters are just scaled versions of each other. Some of them are furthermore equivalent to Latin and/or Greek symbols.

For example, Андре́й-Никола́евич-Колмого́ров is a CID, and translates to the normal form andrej-nikolaevich-kolmogorov.
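The mark-sensitive special cases are the main difference from the Latin and Greek stages. The Python sketch below groups each primary character with its marks before translating. It covers only the basic Russian letters of Appendix C plus U+0456, lowercases primaries first (Appendix C translates both cases of these letters identically), and uses hypothetical names throughout.

```python
import unicodedata
from typing import Optional

# Translations for U+0430 .. U+044F in code point order, from Appendix C.
# U+0439 has no entry because it decomposes into U+0438 plus mark U+0306.
_ROW = ["a", "b", "v", "g", "d", "e", "zh", "z", "i", None, "k", "l", "m",
        "n", "o", "p", "r", "s", "t", "u", "f", "x", "cz", "ch", "sh", "shh",
        "", "y", "", "e", "yu", "ya"]
CYRILLIC = {chr(0x0430 + k): t for k, t in enumerate(_ROW) if t is not None}
CYRILLIC["\u0456"] = "i"  # needed for the yi special case

# (primary character, mark) pairs whose translation depends on the mark.
SPECIAL = {("\u0438", "\u0306"): "j",
           ("\u0456", "\u0308"): "yi",
           ("\u0435", "\u0308"): "yo"}


def nf_word_cyrillic(word: str) -> Optional[str]:
    """Translate a Cyrillic letter run; returns None for unknown characters."""
    clusters = []  # list of (primary, set of marks)
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and clusters:
            clusters[-1][1].add(ch)
        else:
            clusters.append((ch.lower(), set()))
    out = []
    for primary, marks in clusters:
        for (special_primary, mark), translation in SPECIAL.items():
            if primary == special_primary and mark in marks:
                out.append(translation)
                break
        else:
            if primary not in CYRILLIC:
                return None
            out.append(CYRILLIC[primary])  # all marks are dropped
    return "".join(out)


print(nf_word_cyrillic("Андре́й"))  # andrej
```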

Stage 4: Mathematical Symbols

This step adds letter-like mathematical symbols as listed in Appendix MS. We have ${NF}^{4}_{\text{word}} = {NF}^{3}_{\text{word}}$, and ${NF}^{4}_{\text{symbol}}$ extends ${NF}^{3}_{\text{symbol}}$ by acting as the identity on all symbols in Appendix MS.

For example, ℕ, ℕ1 and ℕ-ℕ are CIDs, but ℕℕ is not.

Separators and External CIDs

Although we have deﬁned the hyphen - U+002D as the only possible separator, in practice we may want to use a different set of separators, depending on the situation. We might for example want to use the underscore _ U+005F instead, or also allow spaces U+0020 as separators.

We can accommodate this by using external CIDs. What exactly an external CID is depends on your situation. All you need to do is to provide a description of the syntax of an external CID, and how to translate it to an ordinary CID.

For example, we might deﬁne an external CID to be a sequence of letters, digits, hyphens, underscores and spaces such that after a cleanup it becomes a CID. Here we deﬁne a cleanup as making the following modiﬁcations in order:

Trimming spaces from the start and the end.
Replacing consecutive spaces with a single space.
Replacing spaces and underscores with hyphens.

In this example, x z_9-t would be a valid external CID, corresponding to the CID x-z-9-t. On the other hand, x - z would not be a valid external CID, as after cleanup it becomes x---z, which is not a CID.
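The cleanup in this example is straightforward to express in code. The Python sketch below applies the three modifications in order and then validates the result; for brevity, the validity check only accepts the letters a to z, which is an assumption of the sketch, and the function names are hypothetical.

```python
import re
from typing import Optional

# Assumption: validity is checked against the a-z letter set only.
_CID = re.compile(r"[a-z][a-z0-9]*(?:-[a-z0-9]+)*")


def cleanup(external: str) -> str:
    """Apply the three cleanup steps in the order given in the text."""
    text = external.strip(" ")                       # 1. trim spaces
    text = re.sub(r" +", " ", text)                  # 2. collapse space runs
    return text.replace(" ", "-").replace("_", "-")  # 3. unify separators


def external_to_cid(external: str) -> Optional[str]:
    """Return the corresponding CID, or None if cleanup yields no CID."""
    candidate = cleanup(external)
    return candidate if _CID.fullmatch(candidate) else None


print(external_to_cid("x z_9-t"))  # x-z-9-t
print(external_to_cid("x - z"))    # None, because cleanup yields x---z
```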

Reference Implementation

A reference implementation of cosmopolitan identifiers as described in this paper is available at https://github.com/phlegmaticprogrammer/CosmopolitanIdentifiers.

Conclusion

Cosmopolitan identiﬁers retain much of the simplicity and clarity of Fortran identiﬁers, but allow users to use their native scripts and letter-like math symbols. This is achieved by mapping each cosmopolitan identiﬁer to a normal form which is basically a Fortran identiﬁer, apart from the fact that it can also include symbols.

We have proceeded in four stages to deﬁne cosmopolitan identiﬁers. Ideally, in your application you would use CIDs as deﬁned here. But if that is not possible for some reason, then CIDs might be adaptable to your needs by adding more stages, for example to allow more symbols to function as identiﬁers.

While it is possible to use equivalent but not identical identiﬁers in the same context, this is not recommended. For example, dx and Δx are equivalent, but not identical. Obviously, one should not write expressions like dx * Δx, but write either Δx * Δx or dx * dx. Ideally, these are not just conventions, but part of the deﬁnition of programming languages based on cosmopolitan identiﬁers, which would issue warnings or even errors in such situations. On the other hand, when accessing identiﬁers from another context it is OK to change them. For example, when accessing a library which exposes an API based on Greek identiﬁers, it is OK to use equivalent Latin identiﬁers at the calling site.

Currently the supported scripts are Latin, Greek, Cyrillic and Math. It would be great if cosmopolitan identifiers could be extended to other widely used scripts without compromising their conceptual and technical simplicity.

References

[1] Stephen J. Chapman. (2017). Fortran for Scientists and Engineers, 4th Edition.

[2] Mark Davis (ed.). (2020). Unicode Identifier and Pattern Syntax, Unicode Standard Annex #31, https://unicode.org/reports/tr31/.

[3] Ken Whistler (ed.). (2020). Unicode Normalization Forms, Unicode Standard Annex #15, https://unicode.org/reports/tr15/.

[4] Mark Davis, Christopher Chapman (eds.). (2020). Unicode Text Segmentation, Unicode Standard Annex #29, https://unicode.org/reports/tr29/.

[5] (1997). ISO 843:1997: Information and documentation — Conversion of Greek characters into Latin characters, International Organization for Standardization, https://www.iso.org/standard/5215.html.

[6] (2002). GOST 7.79-2000: System of standards on information, librarianship and publishing. Rules of transliteration of Cyrillic script by Latin alphabet, https://runorm.com/catalog/1004/741213/.

Appendix M: Allowed Marks

̀ U+0300
́ U+0301
̂ U+0302
̃ U+0303
̄ U+0304
̆ U+0306
̇ U+0307
̈ U+0308
̉ U+0309
̊ U+030A
̋ U+030B
̌ U+030C
̏ U+030F
̑ U+0311
̓ U+0313
̔ U+0314
̛ U+031B
̣ U+0323
̤ U+0324
̥ U+0325
̦ U+0326
̧ U+0327
̨ U+0328
̭ U+032D
̮ U+032E
̰ U+0330
̱ U+0331
͂ U+0342
ͅ U+0345

Appendix L: Latin Primary Characters

A U+0041 translates to a U+0061
B U+0042 translates to b U+0062
C U+0043 translates to c U+0063
D U+0044 translates to d U+0064
E U+0045 translates to e U+0065
F U+0046 translates to f U+0066
G U+0047 translates to g U+0067
H U+0048 translates to h U+0068
I U+0049 translates to i U+0069
J U+004A translates to j U+006A
K U+004B translates to k U+006B
L U+004C translates to l U+006C
M U+004D translates to m U+006D
N U+004E translates to n U+006E
O U+004F translates to o U+006F
P U+0050 translates to p U+0070
Q U+0051 translates to q U+0071
R U+0052 translates to r U+0072
S U+0053 translates to s U+0073
T U+0054 translates to t U+0074
U U+0055 translates to u U+0075
V U+0056 translates to v U+0076
W U+0057 translates to w U+0077
X U+0058 translates to x U+0078
Y U+0059 translates to y U+0079
Z U+005A translates to z U+007A
a U+0061 translates to a U+0061
b U+0062 translates to b U+0062
c U+0063 translates to c U+0063
d U+0064 translates to d U+0064
e U+0065 translates to e U+0065
f U+0066 translates to f U+0066
g U+0067 translates to g U+0067
h U+0068 translates to h U+0068
i U+0069 translates to i U+0069
j U+006A translates to j U+006A
k U+006B translates to k U+006B
l U+006C translates to l U+006C
m U+006D translates to m U+006D
n U+006E translates to n U+006E
o U+006F translates to o U+006F
p U+0070 translates to p U+0070
q U+0071 translates to q U+0071
r U+0072 translates to r U+0072
s U+0073 translates to s U+0073
t U+0074 translates to t U+0074
u U+0075 translates to u U+0075
v U+0076 translates to v U+0076
w U+0077 translates to w U+0077
x U+0078 translates to x U+0078
y U+0079 translates to y U+0079
z U+007A translates to z U+007A
Æ U+00C6 translates to ae U+0061 U+0065
Ø U+00D8 translates to o U+006F
ß U+00DF translates to ss U+0073 U+0073
æ U+00E6 translates to ae U+0061 U+0065
ø U+00F8 translates to o U+006F
Đ U+0110 translates to d U+0064
đ U+0111 translates to d U+0064
Ł U+0141 translates to l U+006C
ł U+0142 translates to l U+006C
Œ U+0152 translates to oe U+006F U+0065
œ U+0153 translates to oe U+006F U+0065
Ǆ U+01C4 translates to dz U+0064 U+007A
ǅ U+01C5 translates to dz U+0064 U+007A
ǆ U+01C6 translates to dz U+0064 U+007A
ǈ U+01C8 translates to lj U+006C U+006A
ǋ U+01CB translates to nj U+006E U+006A
Ǳ U+01F1 translates to dz U+0064 U+007A
ǲ U+01F2 translates to dz U+0064 U+007A
ǳ U+01F3 translates to dz U+0064 U+007A
ẞ U+1E9E translates to ss U+0073 U+0073
ﬀ U+FB00 translates to ff U+0066 U+0066
ﬁ U+FB01 translates to fi U+0066 U+0069
ﬂ U+FB02 translates to fl U+0066 U+006C
ﬃ U+FB03 translates to ffi U+0066 U+0066 U+0069
ﬄ U+FB04 translates to ffl U+0066 U+0066 U+006C
ﬆ U+FB06 translates to st U+0073 U+0074

Appendix G: Greek Primary Characters

Α U+0391 translates to a U+0061
Β U+0392 translates to v U+0076
Γ U+0393 translates to g U+0067
Δ U+0394 translates to d U+0064
Ε U+0395 translates to e U+0065
Ζ U+0396 translates to z U+007A
Η U+0397 translates to i U+0069
Θ U+0398 translates to th U+0074 U+0068
Ι U+0399 translates to i U+0069
Κ U+039A translates to k U+006B
Λ U+039B translates to l U+006C
Μ U+039C translates to m U+006D
Ν U+039D translates to n U+006E
Ξ U+039E translates to x U+0078
Ο U+039F translates to o U+006F
Π U+03A0 translates to p U+0070
Ρ U+03A1 translates to r U+0072
Σ U+03A3 translates to s U+0073
Τ U+03A4 translates to t U+0074
Υ U+03A5 translates to y U+0079
Φ U+03A6 translates to f U+0066
Χ U+03A7 translates to ch U+0063 U+0068
Ψ U+03A8 translates to ps U+0070 U+0073
Ω U+03A9 translates to o U+006F
α U+03B1 translates to a U+0061
β U+03B2 translates to v U+0076
γ U+03B3 translates to g U+0067
δ U+03B4 translates to d U+0064
ε U+03B5 translates to e U+0065
ζ U+03B6 translates to z U+007A
η U+03B7 translates to i U+0069
θ U+03B8 translates to th U+0074 U+0068
ι U+03B9 translates to i U+0069
κ U+03BA translates to k U+006B
λ U+03BB translates to l U+006C
μ U+03BC translates to m U+006D
ν U+03BD translates to n U+006E
ξ U+03BE translates to x U+0078
ο U+03BF translates to o U+006F
π U+03C0 translates to p U+0070
ρ U+03C1 translates to r U+0072
ς U+03C2 translates to s U+0073
σ U+03C3 translates to s U+0073
τ U+03C4 translates to t U+0074
υ U+03C5 translates to y U+0079
φ U+03C6 translates to f U+0066
χ U+03C7 translates to ch U+0063 U+0068
ψ U+03C8 translates to ps U+0070 U+0073
ω U+03C9 translates to o U+006F

Appendix C: Cyrillic Primary Characters

Є U+0404 translates to ye U+0079 U+0065
Ѕ U+0405 translates to z U+007A
І U+0406 translates to i U+0069
Ј U+0408 translates to j U+006A
Љ U+0409 translates to l U+006C
Њ U+040A translates to n U+006E
Џ U+040F translates to dh U+0064 U+0068
А U+0410 translates to a U+0061
Б U+0411 translates to b U+0062
В U+0412 translates to v U+0076
Г U+0413 translates to g U+0067
Д U+0414 translates to d U+0064
Е U+0415 translates to e U+0065
Ж U+0416 translates to zh U+007A U+0068
З U+0417 translates to z U+007A
И U+0418 translates to i U+0069
К U+041A translates to k U+006B
Л U+041B translates to l U+006C
М U+041C translates to m U+006D
Н U+041D translates to n U+006E
О U+041E translates to o U+006F
П U+041F translates to p U+0070
Р U+0420 translates to r U+0072
С U+0421 translates to s U+0073
Т U+0422 translates to t U+0074
У U+0423 translates to u U+0075
Ф U+0424 translates to f U+0066
Х U+0425 translates to x U+0078
Ц U+0426 translates to cz U+0063 U+007A
Ч U+0427 translates to ch U+0063 U+0068
Ш U+0428 translates to sh U+0073 U+0068
Щ U+0429 translates to shh U+0073 U+0068 U+0068
Ъ U+042A translates to an empty sequence of characters
Ы U+042B translates to y U+0079
Ь U+042C translates to an empty sequence of characters
Э U+042D translates to e U+0065
Ю U+042E translates to yu U+0079 U+0075
Я U+042F translates to ya U+0079 U+0061
а U+0430 translates to a U+0061
б U+0431 translates to b U+0062
в U+0432 translates to v U+0076
г U+0433 translates to g U+0067
д U+0434 translates to d U+0064
е U+0435 translates to e U+0065
ж U+0436 translates to zh U+007A U+0068
з U+0437 translates to z U+007A
и U+0438 translates to i U+0069
к U+043A translates to k U+006B
л U+043B translates to l U+006C
м U+043C translates to m U+006D
н U+043D translates to n U+006E
о U+043E translates to o U+006F
п U+043F translates to p U+0070
р U+0440 translates to r U+0072
с U+0441 translates to s U+0073
т U+0442 translates to t U+0074
у U+0443 translates to u U+0075
ф U+0444 translates to f U+0066
х U+0445 translates to x U+0078
ц U+0446 translates to cz U+0063 U+007A
ч U+0447 translates to ch U+0063 U+0068
ш U+0448 translates to sh U+0073 U+0068
щ U+0449 translates to shh U+0073 U+0068 U+0068
ъ U+044A translates to an empty sequence of characters
ы U+044B translates to y U+0079
ь U+044C translates to an empty sequence of characters
э U+044D translates to e U+0065
ю U+044E translates to yu U+0079 U+0075
я U+044F translates to ya U+0079 U+0061
є U+0454 translates to ye U+0079 U+0065
ѕ U+0455 translates to z U+007A
і U+0456 translates to i U+0069
ј U+0458 translates to j U+006A
љ U+0459 translates to l U+006C
њ U+045A translates to n U+006E
џ U+045F translates to dh U+0064 U+0068
Ґ U+0490 translates to g U+0067
ґ U+0491 translates to g U+0067

Appendix LS: Latin Symbols

A U+0041 is a new symbol
B U+0042 is a new symbol
C U+0043 is a new symbol
D U+0044 is a new symbol
E U+0045 is a new symbol
F U+0046 is a new symbol
G U+0047 is a new symbol
H U+0048 is a new symbol
I U+0049 is a new symbol
J U+004A is a new symbol
K U+004B is a new symbol
L U+004C is a new symbol
M U+004D is a new symbol
N U+004E is a new symbol
O U+004F is a new symbol
P U+0050 is a new symbol
Q U+0051 is a new symbol
R U+0052 is a new symbol
S U+0053 is a new symbol
T U+0054 is a new symbol
U U+0055 is a new symbol
V U+0056 is a new symbol
W U+0057 is a new symbol
X U+0058 is a new symbol
Y U+0059 is a new symbol
Z U+005A is a new symbol
a U+0061 is a new symbol
b U+0062 is a new symbol
c U+0063 is a new symbol
d U+0064 is a new symbol
e U+0065 is a new symbol
f U+0066 is a new symbol
g U+0067 is a new symbol
h U+0068 is a new symbol
i U+0069 is a new symbol
j U+006A is a new symbol
k U+006B is a new symbol
l U+006C is a new symbol
m U+006D is a new symbol
n U+006E is a new symbol
o U+006F is a new symbol
p U+0070 is a new symbol
q U+0071 is a new symbol
r U+0072 is a new symbol
s U+0073 is a new symbol
t U+0074 is a new symbol
u U+0075 is a new symbol
v U+0076 is a new symbol
w U+0077 is a new symbol
x U+0078 is a new symbol
y U+0079 is a new symbol
z U+007A is a new symbol

Appendix GS: Greek Symbols

Α U+0391 is equivalent to A U+0041
Β U+0392 is equivalent to B U+0042
Γ U+0393 is a new symbol
Δ U+0394 is a new symbol
Ε U+0395 is equivalent to E U+0045
Ζ U+0396 is equivalent to Z U+005A
Η U+0397 is equivalent to H U+0048
Θ U+0398 is a new symbol
Ι U+0399 is equivalent to I U+0049
Κ U+039A is equivalent to K U+004B
Λ U+039B is a new symbol
Μ U+039C is equivalent to M U+004D
Ν U+039D is equivalent to N U+004E
Ξ U+039E is a new symbol
Ο U+039F is equivalent to O U+004F
Π U+03A0 is a new symbol
Ρ U+03A1 is equivalent to P U+0050
Σ U+03A3 is a new symbol
Τ U+03A4 is equivalent to T U+0054
Υ U+03A5 is equivalent to Y U+0059
Φ U+03A6 is a new symbol
Χ U+03A7 is equivalent to X U+0058
Ψ U+03A8 is a new symbol
Ω U+03A9 is a new symbol
α U+03B1 is a new symbol
β U+03B2 is a new symbol
γ U+03B3 is a new symbol
δ U+03B4 is a new symbol
ε U+03B5 is a new symbol
ζ U+03B6 is a new symbol
η U+03B7 is a new symbol
θ U+03B8 is a new symbol
ι U+03B9 is a new symbol
κ U+03BA is a new symbol
λ U+03BB is a new symbol
μ U+03BC is a new symbol
ν U+03BD is a new symbol
ξ U+03BE is a new symbol
ο U+03BF is a new symbol
π U+03C0 is a new symbol
ρ U+03C1 is a new symbol
ς U+03C2 is a new symbol
σ U+03C3 is a new symbol
τ U+03C4 is a new symbol
υ U+03C5 is a new symbol
φ U+03C6 is a new symbol
χ U+03C7 is a new symbol
ψ U+03C8 is a new symbol
ω U+03C9 is a new symbol
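
The entries in the symbol appendices follow two fixed patterns, "is a new symbol" and "is equivalent to", so they can be read mechanically. The following Python sketch parses one entry and checks that the printed glyph matches its code point. The name parse_entry and the regular expression are illustrative assumptions and not part of the proposal. Escapes are used in the example calls because, for instance, Greek Rho and Latin P look the same in most fonts.

```python
import re

# Hypothetical helper for the entry format used in the symbol appendices:
#   "<glyph> U+<hex> is a new symbol"
#   "<glyph> U+<hex> is equivalent to <glyph> U+<hex>"
ENTRY = re.compile(
    r"^(?P<glyph>\S+) U\+(?P<code>[0-9A-F]{4,6}) "
    r"(?:is a new symbol"
    r"|is equivalent to (?P<target>\S+) U\+(?P<tcode>[0-9A-F]{4,6}))$"
)


def parse_entry(line):
    """Return (symbol, target); target is None for a new symbol."""
    m = ENTRY.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized entry: {line!r}")
    symbol = chr(int(m["code"], 16))
    if symbol != m["glyph"]:
        # The glyph and the code point disagree, e.g. a Latin letter
        # was typed where its Greek lookalike was meant.
        raise ValueError(f"glyph does not match U+{m['code']}")
    if m["target"] is None:
        return symbol, None
    target = chr(int(m["tcode"], 16))
    if target != m["target"]:
        raise ValueError(f"glyph does not match U+{m['tcode']}")
    return symbol, target


# Greek capital Rho is declared equivalent to Latin capital P.
rho = parse_entry("\u03a1 U+03A1 is equivalent to P U+0050")
# Greek capital Gamma is a new symbol.
gamma = parse_entry("\u0393 U+0393 is a new symbol")
```

Checking the glyph against the code point is the main point of the sketch: a table of lookalike characters cannot be proofread by eye alone.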

Appendix CS: Cyrillic Symbols

Є U+0404 is a new symbol
Ѕ U+0405 is equivalent to S U+0053
І U+0406 is equivalent to I U+0049
Ј U+0408 is equivalent to J U+004A
Љ U+0409 is a new symbol
Њ U+040A is a new symbol
Џ U+040F is a new symbol
А U+0410 is equivalent to A U+0041
Б U+0411 is a new symbol
В U+0412 is equivalent to B U+0042
Г U+0413 is equivalent to Γ U+0393
Д U+0414 is a new symbol
Е U+0415 is equivalent to E U+0045
Ж U+0416 is a new symbol
З U+0417 is a new symbol
И U+0418 is a new symbol
К U+041A is equivalent to K U+004B
Л U+041B is a new symbol
М U+041C is equivalent to M U+004D
Н U+041D is equivalent to H U+0048
О U+041E is equivalent to O U+004F
П U+041F is equivalent to Π U+03A0
Р U+0420 is equivalent to P U+0050
С U+0421 is equivalent to C U+0043
Т U+0422 is equivalent to T U+0054
У U+0423 is equivalent to y U+0079
Ф U+0424 is a new symbol
Х U+0425 is equivalent to X U+0058
Ц U+0426 is a new symbol
Ч U+0427 is a new symbol
Ш U+0428 is a new symbol
Э U+042D is a new symbol
Ю U+042E is a new symbol
Я U+042F is a new symbol
а U+0430 is equivalent to a U+0061
б U+0431 is a new symbol
в U+0432 is equivalent to B U+0042
г U+0433 is equivalent to Γ U+0393
д U+0434 is equivalent to Д U+0414
е U+0435 is equivalent to e U+0065
ж U+0436 is equivalent to Ж U+0416
з U+0437 is equivalent to З U+0417
и U+0438 is equivalent to И U+0418
к U+043A is equivalent to K U+004B
л U+043B is equivalent to Л U+041B
м U+043C is equivalent to M U+004D
н U+043D is equivalent to H U+0048
о U+043E is equivalent to o U+006F
п U+043F is equivalent to Π U+03A0
р U+0440 is equivalent to p U+0070
с U+0441 is equivalent to c U+0063
т U+0442 is equivalent to T U+0054
у U+0443 is equivalent to y U+0079
ф U+0444 is equivalent to Ф U+0424
х U+0445 is equivalent to x U+0078
ц U+0446 is equivalent to Ц U+0426
ч U+0447 is equivalent to Ч U+0427
ш U+0448 is equivalent to Ш U+0428
э U+044D is equivalent to Э U+042D
ю U+044E is equivalent to Ю U+042E
я U+044F is equivalent to Я U+042F
є U+0454 is equivalent to Є U+0404
ѕ U+0455 is equivalent to s U+0073
і U+0456 is equivalent to i U+0069
ј U+0458 is equivalent to j U+006A
љ U+0459 is equivalent to Љ U+0409
њ U+045A is equivalent to Њ U+040A
џ U+045F is equivalent to Џ U+040F
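
An "is equivalent to" entry may point at a symbol in another table: г points at Greek Γ, а points at Latin a, and д points at Д, which is itself a new symbol. The following Python sketch resolves a symbol to its representative, using a small excerpt of the table above. The names EQUIVALENT_TO, representative and same_symbols are assumptions made for illustration. The sketch models only the symbol equivalences listed in this appendix and none of the other rules of the proposal.

```python
# Excerpt of the Cyrillic symbol table, written with escapes to avoid
# lookalike characters in the source code.
EQUIVALENT_TO = {
    "\u0434": "\u0414",  # д is equivalent to Д
    "\u0433": "\u0393",  # г is equivalent to Greek Γ
    "\u0413": "\u0393",  # Г is equivalent to Greek Γ
    "\u0443": "y",       # у is equivalent to Latin y
    "\u0423": "y",       # У is equivalent to Latin y
    "\u0430": "a",       # а is equivalent to Latin a
}


def representative(symbol, equivalent_to):
    """Follow 'is equivalent to' links until a new symbol is reached."""
    seen = {symbol}
    while symbol in equivalent_to:
        symbol = equivalent_to[symbol]
        if symbol in seen:
            raise ValueError("cyclic equivalence in table")
        seen.add(symbol)
    return symbol


def same_symbols(left, right, equivalent_to=EQUIVALENT_TO):
    """Compare two strings symbol by symbol via their representatives."""
    return len(left) == len(right) and all(
        representative(a, equivalent_to) == representative(b, equivalent_to)
        for a, b in zip(left, right)
    )
```

With this excerpt, the Cyrillic string "\u0443\u0430" and the Latin string "ya" consist of the same symbols, while Д and Γ remain distinct.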

Appendix MS: Math Symbols

ℵ U+2135 is a new symbol
ℶ U+2136 is a new symbol
ℷ U+2137 is a new symbol
ℸ U+2138 is a new symbol
𝒜 U+1D49C is a new symbol
ℬ U+212C is a new symbol
𝒞 U+1D49E is a new symbol
𝒟 U+1D49F is a new symbol
ℰ U+2130 is a new symbol
ℱ U+2131 is a new symbol
𝒢 U+1D4A2 is a new symbol
ℋ U+210B is a new symbol
ℐ U+2110 is a new symbol
𝒥 U+1D4A5 is a new symbol
𝒦 U+1D4A6 is a new symbol
ℒ U+2112 is a new symbol
ℳ U+2133 is a new symbol
𝒩 U+1D4A9 is a new symbol
𝒪 U+1D4AA is a new symbol
𝒫 U+1D4AB is a new symbol
𝒬 U+1D4AC is a new symbol
ℛ U+211B is a new symbol
𝒮 U+1D4AE is a new symbol
𝒯 U+1D4AF is a new symbol
𝒰 U+1D4B0 is a new symbol
𝒱 U+1D4B1 is a new symbol
𝒲 U+1D4B2 is a new symbol
𝒳 U+1D4B3 is a new symbol
𝒴 U+1D4B4 is a new symbol
𝒵 U+1D4B5 is a new symbol
𝒶 U+1D4B6 is a new symbol
𝒷 U+1D4B7 is a new symbol
𝒸 U+1D4B8 is a new symbol
𝒹 U+1D4B9 is a new symbol
ℯ U+212F is a new symbol
𝒻 U+1D4BB is a new symbol
ℊ U+210A is a new symbol
𝒽 U+1D4BD is a new symbol
𝒾 U+1D4BE is a new symbol
𝒿 U+1D4BF is a new symbol
𝓀 U+1D4C0 is a new symbol
𝓁 U+1D4C1 is a new symbol
𝓂 U+1D4C2 is a new symbol
𝓃 U+1D4C3 is a new symbol
ℴ U+2134 is a new symbol
𝓅 U+1D4C5 is a new symbol
𝓆 U+1D4C6 is a new symbol
𝓇 U+1D4C7 is a new symbol
𝓈 U+1D4C8 is a new symbol
𝓉 U+1D4C9 is a new symbol
𝓊 U+1D4CA is a new symbol
𝓋 U+1D4CB is a new symbol
𝓌 U+1D4CC is a new symbol
𝓍 U+1D4CD is a new symbol
𝓎 U+1D4CE is a new symbol
𝓏 U+1D4CF is a new symbol
𝔄 U+1D504 is a new symbol
𝔅 U+1D505 is a new symbol
ℭ U+212D is a new symbol
𝔇 U+1D507 is a new symbol
𝔈 U+1D508 is a new symbol
𝔉 U+1D509 is a new symbol
𝔊 U+1D50A is a new symbol
ℌ U+210C is a new symbol
ℑ U+2111 is a new symbol
𝔍 U+1D50D is a new symbol
𝔎 U+1D50E is a new symbol
𝔏 U+1D50F is a new symbol
𝔐 U+1D510 is a new symbol
𝔑 U+1D511 is a new symbol
𝔒 U+1D512 is a new symbol
𝔓 U+1D513 is a new symbol
𝔔 U+1D514 is a new symbol
ℜ U+211C is a new symbol
𝔖 U+1D516 is a new symbol
𝔗 U+1D517 is a new symbol
𝔘 U+1D518 is a new symbol
𝔙 U+1D519 is a new symbol
𝔚 U+1D51A is a new symbol
𝔛 U+1D51B is a new symbol
𝔜 U+1D51C is a new symbol
ℨ U+2128 is a new symbol
𝔞 U+1D51E is a new symbol
𝔟 U+1D51F is a new symbol
𝔠 U+1D520 is a new symbol
𝔡 U+1D521 is a new symbol
𝔢 U+1D522 is a new symbol
𝔣 U+1D523 is a new symbol
𝔤 U+1D524 is a new symbol
𝔥 U+1D525 is a new symbol
𝔦 U+1D526 is a new symbol
𝔧 U+1D527 is a new symbol
𝔨 U+1D528 is a new symbol
𝔩 U+1D529 is a new symbol
𝔪 U+1D52A is a new symbol
𝔫 U+1D52B is a new symbol
𝔬 U+1D52C is a new symbol
𝔭 U+1D52D is a new symbol
𝔮 U+1D52E is a new symbol
𝔯 U+1D52F is a new symbol
𝔰 U+1D530 is a new symbol
𝔱 U+1D531 is a new symbol
𝔲 U+1D532 is a new symbol
𝔳 U+1D533 is a new symbol
𝔴 U+1D534 is a new symbol
𝔵 U+1D535 is a new symbol
𝔶 U+1D536 is a new symbol
𝔷 U+1D537 is a new symbol
𝔸 U+1D538 is a new symbol
𝔹 U+1D539 is a new symbol
ℂ U+2102 is a new symbol
𝔻 U+1D53B is a new symbol
𝔼 U+1D53C is a new symbol
𝔽 U+1D53D is a new symbol
𝔾 U+1D53E is a new symbol
ℍ U+210D is a new symbol
𝕀 U+1D540 is a new symbol
𝕁 U+1D541 is a new symbol
𝕂 U+1D542 is a new symbol
𝕃 U+1D543 is a new symbol
𝕄 U+1D544 is a new symbol
ℕ U+2115 is a new symbol
𝕆 U+1D546 is a new symbol
ℙ U+2119 is a new symbol
ℚ U+211A is a new symbol
ℝ U+211D is a new symbol
𝕊 U+1D54A is a new symbol
𝕋 U+1D54B is a new symbol
𝕌 U+1D54C is a new symbol
𝕍 U+1D54D is a new symbol
𝕎 U+1D54E is a new symbol
𝕏 U+1D54F is a new symbol
𝕐 U+1D550 is a new symbol
ℤ U+2124 is a new symbol
𝕒 U+1D552 is a new symbol
𝕓 U+1D553 is a new symbol
𝕔 U+1D554 is a new symbol
𝕕 U+1D555 is a new symbol
𝕖 U+1D556 is a new symbol
𝕗 U+1D557 is a new symbol
𝕘 U+1D558 is a new symbol
𝕙 U+1D559 is a new symbol
𝕚 U+1D55A is a new symbol
𝕛 U+1D55B is a new symbol
𝕜 U+1D55C is a new symbol
𝕝 U+1D55D is a new symbol
𝕞 U+1D55E is a new symbol
𝕟 U+1D55F is a new symbol
𝕠 U+1D560 is a new symbol
𝕡 U+1D561 is a new symbol
𝕢 U+1D562 is a new symbol
𝕣 U+1D563 is a new symbol
𝕤 U+1D564 is a new symbol
𝕥 U+1D565 is a new symbol
𝕦 U+1D566 is a new symbol
𝕧 U+1D567 is a new symbol
𝕨 U+1D568 is a new symbol
𝕩 U+1D569 is a new symbol
𝕪 U+1D56A is a new symbol
𝕫 U+1D56B is a new symbol
𝟘 U+1D7D8 is a new symbol
𝟙 U+1D7D9 is a new symbol
𝟚 U+1D7DA is a new symbol
𝟛 U+1D7DB is a new symbol
𝟜 U+1D7DC is a new symbol
𝟝 U+1D7DD is a new symbol
𝟞 U+1D7DE is a new symbol
𝟟 U+1D7DF is a new symbol
𝟠 U+1D7E0 is a new symbol
𝟡 U+1D7E1 is a new symbol
ℼ U+213C is a new symbol
ℽ U+213D is a new symbol
ℾ U+213E is a new symbol
ℿ U+213F is a new symbol
⅀ U+2140 is a new symbol
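
The code points in the table above are not contiguous. Several script, Fraktur and double-struck letters (for example ℬ U+212C, ℭ U+212D and ℂ U+2102) were encoded in Unicode before the remaining letters, and the corresponding positions in the later range are left unassigned. A tool that generates this table therefore cannot simply add an offset to A. The following Python sketch illustrates this for the script capitals, using the standard unicodedata module. The function name script_capital is an assumption made for illustration.

```python
import unicodedata

SCRIPT_CAPITAL_A = 0x1D49C  # code point of the script capital A in the table


def script_capital(letter):
    """Return the script capital for an ASCII capital letter A to Z."""
    candidate = chr(SCRIPT_CAPITAL_A + ord(letter) - ord("A"))
    if unicodedata.name(candidate, None) is not None:
        return candidate
    # The position is unassigned: the letter was encoded earlier under
    # the name "SCRIPT CAPITAL <letter>", e.g. U+212C for B.
    return unicodedata.lookup(f"SCRIPT CAPITAL {letter}")


# Code points for A, B, C, E, H and R as listed in the table above.
code_points = [f"U+{ord(script_capital(c)):04X}" for c in "ABCEHR"]
```

The assigned positions are tried first, so that P resolves to U+1D4AB as in the table rather than to another character that also carries the name SCRIPT CAPITAL P.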