Some people think that Unicode is the ultimate solution for anything and everything that has to do with text encoding. Unfortunately, this is not true.
For some text encoding problems Unicode is a good solution. An example of this category is text intended for presentation to humans (e.g. the body of an email in which multiple languages and/or scripts are mixed).
This page is about another category: the encoding of identifiers, such as file names in a filesystem or function names in a programming language. The reason why Unicode is a bad solution here is its fundamental design decision of ambiguous encoding, described in the next section.
Do not misunderstand this statement to mean that Unicode is flawed; it is simply not well suited for this purpose.
If you nevertheless want to use Unicode for such purposes (sometimes the alternatives are even worse), you should think twice, and you should know exactly what you are doing. Otherwise you can shoot yourself in the foot in some very subtle ways, and you may not feel the pain immediately.
Many people look at Unicode thinking that it is “yet another encoding system”. They have the old codepage system in mind and want to use Unicode as a drop-in replacement.
But Unicode lacks one property that is very important for identifiers: unambiguous encoding.
If you encode the name of an identifier, the result should be the only valid encoding for that name. This is what you want, but Unicode works differently by design: with Unicode there can be multiple encodings for the same identifier name, all of them defined as canonically equivalent [1][2] (compatibility equivalence can be ignored for the examples below).
To make things worse, there is no single “preferred” encoding. Unicode does define normalization forms, but nothing forces every producer of encoded text to apply one of them; you are free to use whatever matches your local requirements best.
Let’s assume the identifier name is “abc”. Because this name uses only the US-ASCII subset of Unicode, no ambiguity can occur. The encoding will always be this codepoint sequence:
<U+0061,U+0062,U+0063>
If we extend the scope to the ISO 8859-1 subset of Unicode, this is no longer the case. Let’s replace the “a” with the German umlaut “ä”: the identifier name “äbc” can be represented by multiple codepoint sequences in Unicode:

<U+00E4,U+0062,U+0063>
<U+0061,U+0308,U+0062,U+0063>
Both are valid and canonically equivalent. This means that, if used as an identifier, both point to the same entity (when canonical equivalence is respected in the sense of Unicode [3]).
Note that if we replaced a second character with an umlaut, the number of valid, canonically equivalent encodings would increase to four (see the filename example below). Longer names can have hundreds of valid (but different) encodings.
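The practical consequence: nearly all software compares identifiers byte-wise, and a byte-wise comparison treats canonically equivalent names as different. A minimal demonstration in C (the names are written as UTF-8 byte strings here, merely to make them expressible in source code):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "äbc" as <U+00E4,U+0062,U+0063> in UTF-8 */
    const char* precomposed = "\xC3\xA4\x62\x63";
    /* "äbc" as <U+0061,U+0308,U+0062,U+0063> in UTF-8 */
    const char* decomposed = "\x61\xCC\x88\x62\x63";

    /* Both names render identically, but byte-wise they differ */
    printf("precomposed: %s (%zu bytes)\n", precomposed, strlen(precomposed));
    printf("decomposed:  %s (%zu bytes)\n", decomposed, strlen(decomposed));
    printf("strcmp(): %d\n", strcmp(precomposed, decomposed));
    return 0;
}

Compiled and run, strcmp() returns a non-zero value although both strings represent the identifier name “äbc”.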
Important:
This problem has nothing to do with the different Unicode Transformation Formats (UTFs [4]). It is inherent to the design of Unicode as a whole and cannot be avoided (except for trivial cases within the US-ASCII subset, as above).
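To illustrate: the ambiguity exists at the codepoint level, before any transformation format is applied. Each UTF merely serializes whichever codepoint sequence was chosen, e.g. for the two encodings of “ä”:

<U+00E4>         UTF-8: C3 A4      UTF-16: 00E4
<U+0061,U+0308>  UTF-8: 61 CC 88   UTF-16: 0061 0308

Both rows are valid in every UTF; choosing a particular UTF does not remove the ambiguity.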
Nearly all modern filesystems allow you to store filenames encoded in Unicode (e.g. BSD FFS and Linux ext2/3/4 as UTF-8; Apple HFS+, Microsoft VFAT12/16/32 and NTFS as UTF-16).
Some of them require Unicode encoding (e.g. NTFS), but many filesystems and their corresponding OS drivers accept nearly arbitrary byte streams as names; Unicode compatibility is more or less a coincidence in this case.
Very few filesystems and OS drivers require and enforce real Unicode semantics for filenames (Apple macOS with HFS+ does so; HFS+ even stores filenames in a canonically decomposed form).
If neither the filesystem nor the OS enforces Unicode semantics, what
happens if multiple users access the files with arbitrary Unicode names
that are canonically equivalent?
The result can be a real mess.
The following C program “uc_test.c” for POSIX-compatible operating systems tries to create four files named “äbä” in the current directory, each with a different encoding (all four canonically equivalent):
#define _POSIX_C_SOURCE 200112L

#include <sys/stat.h>
#include <fcntl.h>
#include <stdlib.h>

int main(void)
{
    int rv;

    /* "äbä" in four canonically equivalent UTF-8 encodings */
    const char* file1 = "\xC3\xA4\x62\xC3\xA4";         /* <U+00E4,b,U+00E4> */
    const char* file2 = "\x61\xCC\x88\x62\xC3\xA4";     /* <a,U+0308,b,U+00E4> */
    const char* file3 = "\xC3\xA4\x62\x61\xCC\x88";     /* <U+00E4,b,a,U+0308> */
    const char* file4 = "\x61\xCC\x88\x62\x61\xCC\x88"; /* <a,U+0308,b,a,U+0308> */

    rv = creat(file1, S_IRWXU);
    if (-1 == rv) { exit(EXIT_FAILURE); }
    rv = creat(file2, S_IRWXU);
    if (-1 == rv) { exit(EXIT_FAILURE); }
    rv = creat(file3, S_IRWXU);
    if (-1 == rv) { exit(EXIT_FAILURE); }
    rv = creat(file4, S_IRWXU);
    if (-1 == rv) { exit(EXIT_FAILURE); }

    exit(EXIT_SUCCESS);
}
The names are encoded in UTF-8, therefore the Unicode codepoints are not directly visible in the source. The corresponding codepoint sequences are:

file1: <U+00E4,U+0062,U+00E4>
file2: <U+0061,U+0308,U+0062,U+00E4>
file3: <U+00E4,U+0062,U+0061,U+0308>
file4: <U+0061,U+0308,U+0062,U+0061,U+0308>
On a system with real Unicode semantics, all four names should refer to the same file: the first call to creat() should create the file, and the following calls should notice that a file with the name “äbä” already exists (and create no additional entries in the current directory).
As described above, most filesystems and operating systems do not implement Unicode semantics. They allow you to use Unicode encodings for filenames, but they do not interpret them and do not handle canonical equivalence. Let’s see what happens on a GNU/Linux system with an ext3 filesystem:
$ locale | grep LC_ALL
LC_ALL=de_DE.utf8
$ cc -o uc_test uc_test.c
$ ./uc_test
$ ls -li
insgesamt 16
131106 -rwx------ 1 baeuerle users    0 Jan 18 14:56 äbä
131104 -rwx------ 1 baeuerle users    0 Jan 18 14:56 äbä
131105 -rwx------ 1 baeuerle users    0 Jan 18 14:56 äbä
131103 -rwx------ 1 baeuerle users    0 Jan 18 14:56 äbä
131101 -rwxr-xr-x 1 baeuerle users 9559 Jan 18 14:55 uc_test
131107 -rw-rw---- 1 baeuerle users  551 Nov 21  2014 uc_test.c
There are now four files with the same name (in the sense of Unicode)
in the current directory.
Important:
There are really four files (four inodes); the directory entries are not four different hardlinks to a single file!
This means the four files (created empty above) can be filled with different content, and they can have different permissions. Example:
$ ls -li
insgesamt 32
131106 -rw-rw-rw- 1 baeuerle users  289 Jan 18 15:09 äbä
131104 -rwx---rwx 1 baeuerle users 6151 Jan 18 15:09 äbä
131105 -r-xr--r-- 1 baeuerle users   15 Jan 18 15:08 äbä
131103 -rwx------ 1 baeuerle users    0 Jan 18 14:56 äbä
131101 -rwxr-xr-x 1 baeuerle users 9559 Jan 18 14:55 uc_test
131107 -rw-rw---- 1 baeuerle users  551 Nov 21  2014 uc_test.c
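To make the difference between the directory entries visible, you can dump the raw bytes of each name. A small POSIX sketch for this purpose (not part of the original test program) could look like this:

#include <dirent.h>
#include <stdio.h>

int main(void)
{
    DIR* dir = opendir(".");
    struct dirent* entry;

    if (NULL == dir) {
        return 1;
    }
    /* Print every directory entry name followed by its raw bytes in hex */
    while (NULL != (entry = readdir(dir))) {
        const unsigned char* p = (const unsigned char*)entry->d_name;

        printf("%s:", entry->d_name);
        while ('\0' != *p) {
            printf(" %02X", *p);
            p++;
        }
        printf("\n");
    }
    closedir(dir);
    return 0;
}

For the four “äbä” files it prints four different byte sequences: the kernel hands userland four distinct names, even though they render identically on screen.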
Now consider that the directory above is exported via NFS to machines that use different Unicode encoding conventions ...
Remember that NFS does not enforce any Unicode semantics on filenames.
You can now ask yourself some questions. For example: what happens if you copy the directory above to a USB stick for data exchange between different machines?
Similar things can happen with identifiers in programs, e.g. function names. Consider the following C program, which uses a function “äbc” with a Unicode-encoded name (some compilers, like clang, will accept this):
#include <stdio.h>
#include <stdlib.h>

void äbc(void)
{
    printf("Foo\n");
    return;
}

int main(void)
{
    äbc();
    exit(EXIT_SUCCESS);
}
Now assume that such a program uses additional libraries, and that somebody has modified/prepared a library used by this program so that it also contains a (still dormant) function “äbc”, but with a different Unicode encoding for the name:
void äbc(void)
{
    printf("Foo\n");
    /* Do something evil silently */
    return;
}
Now assume an attacker later modifies your program to call the evil library function. The program source code still looks exactly the same (e.g. in the editor of a reviewer), and it still produces the same output (at first glance, the behaviour seems unchanged too).
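A reviewer can only catch this with tools that show the raw bytes. In a hex dump of the source files, the two encodings of the function name “äbc” differ clearly:

äbc (precomposed): C3 A4 62 63
äbc (decomposed):  61 CC 88 62 63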
The same applies to variable names, type names, and so on.
For all things that are identifiers, i.e. names of something, you normally do not want to use an ambiguous encoding system like the one defined by Unicode.
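If you nevertheless have to accept such names, the only robust approach is to bring them into one normalization form at every boundary before storing or comparing them. A sketch of this approach, assuming ICU4C is available (unorm2_getNFCInstance(), u_strFromUTF8() and u_strCompare() are ICU APIs; link with -licuuc):

#include <stdio.h>
#include <unicode/ustring.h>
#include <unicode/unorm2.h>

int main(void)
{
    UErrorCode status = U_ZERO_ERROR;
    const UNormalizer2* nfc = unorm2_getNFCInstance(&status);
    UChar raw1[16], raw2[16], nrm1[16], nrm2[16];
    int32_t len1, len2, nlen1, nlen2;

    /* "äbä" precomposed vs. fully decomposed (UTF-8 byte strings) */
    u_strFromUTF8(raw1, 16, &len1, "\xC3\xA4\x62\xC3\xA4", -1, &status);
    u_strFromUTF8(raw2, 16, &len2, "\x61\xCC\x88\x62\x61\xCC\x88", -1, &status);

    /* Normalize both names to NFC before comparing them */
    nlen1 = unorm2_normalize(nfc, raw1, len1, nrm1, 16, &status);
    nlen2 = unorm2_normalize(nfc, raw2, len2, nrm2, 16, &status);
    if (U_FAILURE(status)) {
        return 1;
    }

    printf("raw equal: %s\n",
           0 == u_strCompare(raw1, len1, raw2, len2, 0) ? "yes" : "no");
    printf("NFC equal: %s\n",
           0 == u_strCompare(nrm1, nlen1, nrm2, nlen2, 0) ? "yes" : "no");
    return 0;
}

The raw comparison reports “no”, the normalized comparison reports “yes”. Note that this only helps if every component that touches the names applies the same normalization form.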
[1] The Unicode Standard, Chapter 3 (D70, page 118)
[2] Canonical Equivalence
[3] Canonical Equivalence
[4] Unicode Transformation Formats