Some people think that Unicode is the ultimate solution for anything and everything that has to do with text encoding. Unfortunately, this is not true.
For some text encoding problems Unicode is a good solution. An example of this category is text intended for presentation to humans (e.g. the body of an email in which multiple languages and/or scripts are mixed).
This page is about another category: the encoding of identifiers, such as file names in a filesystem or function names in a programming language. The reason why Unicode is a bad solution here is its fundamental design decision of ambiguous encoding, described in the next section.
Do not misunderstand this statement to mean that Unicode is flawed; it is simply not well suited for this purpose.
If you nevertheless want to use Unicode for such purposes (sometimes the alternatives are even worse), you should think twice, and you should know exactly what you are doing. Otherwise you can shoot yourself in the foot in some very subtle ways, and you may not feel the pain immediately.
Many people look at Unicode thinking that it is “yet another encoding system”. They have the old codepage system in mind and want to use Unicode as a drop-in replacement.
But Unicode lacks one property that is very important for identifiers: unambiguous encoding.
If you encode the name of an identifier, the result should be the only valid encoding for that name. This is what you want, but Unicode works differently by design: with Unicode there can be multiple encodings for the same identifier name, all of them defined as canonically equivalent [1][2] (compatibility equivalence can be ignored for the examples below).
To make things worse, there is no single “preferred” encoding. Unicode does define normalization forms, but nothing forces every producer of encoded text to apply one of them; you are free to use whatever matches your local requirements best.
Let’s assume the identifier name is “abc”. Because this name uses only the US-ASCII subset of Unicode, no ambiguity can occur. The encoding will always be this codepoint sequence:
<U+0061,U+0062,U+0063>
If we extend the scope to the ISO 8859-1 subset of Unicode, this is no longer the case. Let’s replace the “a” with the German umlaut “ä”: the identifier name “äbc” can be represented by multiple codepoint sequences in Unicode:

<U+00E4,U+0062,U+0063>
<U+0061,U+0308,U+0062,U+0063>
Both are valid and canonically equivalent. This means that, if used as an identifier, both point to the same entity (when canonical equivalence is respected in the sense of Unicode [3]).
Note that if we replaced a second character with an umlaut, the number of valid, canonically equivalent encodings would increase to four (see the filename example below). Longer names can have hundreds of valid (but different) encodings.
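The practical consequence: nearly all software compares identifiers byte-wise, and a byte-wise comparison treats canonically equivalent names as different. A minimal demonstration in C (the names are written as UTF-8 byte strings here, merely to make them expressible in source code):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "äbc" as <U+00E4,U+0062,U+0063> in UTF-8 */
    const char* precomposed = "\xC3\xA4\x62\x63";
    /* "äbc" as <U+0061,U+0308,U+0062,U+0063> in UTF-8 */
    const char* decomposed = "\x61\xCC\x88\x62\x63";

    /* Both names render identically, but byte-wise they differ */
    printf("precomposed: %s (%zu bytes)\n", precomposed, strlen(precomposed));
    printf("decomposed:  %s (%zu bytes)\n", decomposed, strlen(decomposed));
    printf("strcmp(): %d\n", strcmp(precomposed, decomposed));
    return 0;
}

Compiled and run, strcmp() returns a non-zero value although both strings represent the identifier name “äbc”.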
Important:
This problem has nothing to do with the different Unicode Transformation Formats (UTFs [4]). It is inherent to the design of Unicode as a whole and cannot be avoided (except for trivial cases within the US-ASCII subset, as above).
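To illustrate: the ambiguity exists at the codepoint level, before any transformation format is applied. Each UTF merely serializes whichever codepoint sequence was chosen, e.g. for the two encodings of “ä”:

<U+00E4>         UTF-8: C3 A4      UTF-16: 00E4
<U+0061,U+0308>  UTF-8: 61 CC 88   UTF-16: 0061 0308

Both rows are valid in every UTF; choosing a particular UTF does not remove the ambiguity.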
Nearly all modern filesystems allow you to store filenames encoded in Unicode (e.g. BSD FFS and Linux ext2/3/4 as UTF-8; Apple HFS+, Microsoft VFAT12/16/32 and NTFS as UTF-16).
Some of them require Unicode encoding (e.g. NTFS), but many filesystems and their corresponding OS drivers accept nearly arbitrary byte streams as names; Unicode compatibility is more or less a coincidence in this case.
Very few filesystems and OS drivers require and enforce real Unicode semantics for filenames (Apple macOS with HFS+ does so; HFS+ even stores filenames in a canonically decomposed form).
If neither the filesystem nor the OS enforces Unicode semantics, what
happens if multiple users access the files with arbitrary Unicode names
that are canonically equivalent?
The result can be a real mess.
The following C program “uc_test.c” for POSIX-compatible operating systems tries to create four files named “äbä” in the current directory, each with a different encoding (all four canonically equivalent):
#define _POSIX_C_SOURCE 200112L

#include <sys/stat.h>
#include <fcntl.h>
#include <stdlib.h>

int main(void)
{
    int rv;

    /* "äbä" in four canonically equivalent UTF-8 encodings */
    const char* file1 = "\xC3\xA4\x62\xC3\xA4";         /* <U+00E4,b,U+00E4> */
    const char* file2 = "\x61\xCC\x88\x62\xC3\xA4";     /* <a,U+0308,b,U+00E4> */
    const char* file3 = "\xC3\xA4\x62\x61\xCC\x88";     /* <U+00E4,b,a,U+0308> */
    const char* file4 = "\x61\xCC\x88\x62\x61\xCC\x88"; /* <a,U+0308,b,a,U+0308> */

    rv = creat(file1, S_IRWXU);
    if (-1 == rv) { exit(EXIT_FAILURE); }
    rv = creat(file2, S_IRWXU);
    if (-1 == rv) { exit(EXIT_FAILURE); }
    rv = creat(file3, S_IRWXU);
    if (-1 == rv) { exit(EXIT_FAILURE); }
    rv = creat(file4, S_IRWXU);
    if (-1 == rv) { exit(EXIT_FAILURE); }

    exit(EXIT_SUCCESS);
}
The names are encoded in UTF-8, therefore the Unicode codepoints are not directly visible in the source. The corresponding codepoint sequences are:

file1: <U+00E4,U+0062,U+00E4>
file2: <U+0061,U+0308,U+0062,U+00E4>
file3: <U+00E4,U+0062,U+0061,U+0308>
file4: <U+0061,U+0308,U+0062,U+0061,U+0308>
On a system with real Unicode semantics, all four names should refer to the same file: the first call to creat() should create the file, and the following calls should notice that a file with the name “äbä” already exists (and create no additional entries in the current directory).
As described above, most filesystems and operating systems do not implement Unicode semantics. They allow you to use Unicode encodings for filenames, but they do not interpret them and do not handle canonical equivalence. Let’s see what happens on a GNU/Linux system with an ext3 filesystem:
$ locale | grep LC_ALL
LC_ALL=de_DE.utf8
$ cc -o uc_test uc_test.c
$ ./uc_test
$ ls -li
insgesamt 16
131106 -rwx------ 1 baeuerle users    0 Jan 18 14:56 äbä
131104 -rwx------ 1 baeuerle users    0 Jan 18 14:56 äbä
131105 -rwx------ 1 baeuerle users    0 Jan 18 14:56 äbä
131103 -rwx------ 1 baeuerle users    0 Jan 18 14:56 äbä
131101 -rwxr-xr-x 1 baeuerle users 9559 Jan 18 14:55 uc_test
131107 -rw-rw---- 1 baeuerle users  551 Nov 21  2014 uc_test.c
There are now four files with the same name (in the sense of Unicode)
in the current directory.
Important:
There are really four files (four inodes); the directory entries are not four different hardlinks to a single file!
This means the four files (created empty above) can be filled with different content, and they can have different permissions. Example:
$ ls -li
insgesamt 32
131106 -rw-rw-rw- 1 baeuerle users  289 Jan 18 15:09 äbä
131104 -rwx---rwx 1 baeuerle users 6151 Jan 18 15:09 äbä
131105 -r-xr--r-- 1 baeuerle users   15 Jan 18 15:08 äbä
131103 -rwx------ 1 baeuerle users    0 Jan 18 14:56 äbä
131101 -rwxr-xr-x 1 baeuerle users 9559 Jan 18 14:55 uc_test
131107 -rw-rw---- 1 baeuerle users  551 Nov 21  2014 uc_test.c
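To make the difference between the directory entries visible, you can dump the raw bytes of each name. A small POSIX sketch for this purpose (not part of the original test program) could look like this:

#include <dirent.h>
#include <stdio.h>

int main(void)
{
    DIR* dir = opendir(".");
    struct dirent* entry;

    if (NULL == dir) {
        return 1;
    }
    /* Print every directory entry name followed by its raw bytes in hex */
    while (NULL != (entry = readdir(dir))) {
        const unsigned char* p = (const unsigned char*)entry->d_name;

        printf("%s:", entry->d_name);
        while ('\0' != *p) {
            printf(" %02X", *p);
            p++;
        }
        printf("\n");
    }
    closedir(dir);
    return 0;
}

For the four “äbä” files it prints four different byte sequences: the kernel hands userland four distinct names, even though they render identically on screen.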
Now consider that the directory above is exported via NFS to machines that use different Unicode encoding conventions ...
Remember that NFS does not enforce any Unicode semantics on filenames.
You can now ask yourself some questions. For example: what happens if you copy the directory above to a USB stick for data exchange between different machines?
Similar things can happen with identifiers in programs, e.g. function names. Consider the following C program, which uses a function “äbc” with a Unicode-encoded name (some compilers, like clang, will accept this):
#include <stdio.h>
#include <stdlib.h>

void äbc(void)
{
    printf("Foo\n");
    return;
}

int main(void)
{
    äbc();
    exit(EXIT_SUCCESS);
}
Now assume that such a program uses additional libraries, and that somebody has modified/prepared a library used by this program so that it also contains a (still dormant) function “äbc”, but with a different Unicode encoding for the name:
void äbc(void)
{
    printf("Foo\n");
    /* Do something evil silently */
    return;
}
Now assume an attacker later modifies your program to call the evil library function. The program source code still looks exactly the same (e.g. in the editor of a reviewer), and it still produces the same output (at first glance, the behaviour seems unchanged too).
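A reviewer can only catch this with tools that show the raw bytes. In a hex dump of the source files, the two encodings of the function name “äbc” differ clearly:

äbc (precomposed): C3 A4 62 63
äbc (decomposed):  61 CC 88 62 63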
The same applies to variable names, type names, and so on.
For all things that are identifiers, i.e. names of something, you normally do not want to use an ambiguous encoding system like the one defined by Unicode.
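If you nevertheless have to accept such names, the only robust approach is to bring them into one normalization form at every boundary before storing or comparing them. A sketch of this approach, assuming ICU4C is available (unorm2_getNFCInstance(), u_strFromUTF8() and u_strCompare() are ICU APIs; link with -licuuc):

#include <stdio.h>
#include <unicode/ustring.h>
#include <unicode/unorm2.h>

int main(void)
{
    UErrorCode status = U_ZERO_ERROR;
    const UNormalizer2* nfc = unorm2_getNFCInstance(&status);
    UChar raw1[16], raw2[16], nrm1[16], nrm2[16];
    int32_t len1, len2, nlen1, nlen2;

    /* "äbä" precomposed vs. fully decomposed (UTF-8 byte strings) */
    u_strFromUTF8(raw1, 16, &len1, "\xC3\xA4\x62\xC3\xA4", -1, &status);
    u_strFromUTF8(raw2, 16, &len2, "\x61\xCC\x88\x62\x61\xCC\x88", -1, &status);

    /* Normalize both names to NFC before comparing them */
    nlen1 = unorm2_normalize(nfc, raw1, len1, nrm1, 16, &status);
    nlen2 = unorm2_normalize(nfc, raw2, len2, nrm2, 16, &status);
    if (U_FAILURE(status)) {
        return 1;
    }

    printf("raw equal: %s\n",
           0 == u_strCompare(raw1, len1, raw2, len2, 0) ? "yes" : "no");
    printf("NFC equal: %s\n",
           0 == u_strCompare(nrm1, nlen1, nrm2, nlen2, 0) ? "yes" : "no");
    return 0;
}

The raw comparison reports “no”, the normalized comparison reports “yes”. Note that this only helps if every component that touches the names applies the same normalization form.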
[1] The Unicode Standard, Chapter 3 (D70, page 118)
[2] Canonical Equivalence
[3] Canonical Equivalence
[4] Unicode Transformation Formats