Character encodings and the beauty of UTF-8

我愛 UTF-8 أحِبّ

In my previous blog post, I discussed what was needed to ensure that a web site uses the UTF-8 character encoding consistently. I thought I should write a post on why I think UTF-8 is superior to any other encoding, and why outdated and limited schemes such as ISO-8859-1 should be ditched once and for all. In my experience, a lot of developers don’t pay much attention to character encoding issues, and there aren’t that many good introductions on the web (see the references below for a couple of good ones). In this post, I will write a primer on the main concepts behind character encoding and the advantages of using UTF-8. Even if it will appear very basic to well-seasoned programmers, I hope this post may be useful for those who feel their understanding of encodings is on shaky ground.

1. The history of character encodings

A long time ago, when text started to be stored in an electronic format, the problem of coding the letters of the alphabet as a bunch of bits was addressed by assigning certain numeric values to the different letters in the Latin alphabet. For example, an uppercase ‘A’ was, and still is, represented by the number 65, whereas a lowercase ‘a’ was represented by the number 97. This numeric representation turns a word like ‘cat’ into the sequence of numbers ‘99, 97, 116’, something that can be stored as binary data in a straightforward way. As it can be hard to keep track of how long a string of text is, i.e. how many numbers we should interpret as letters before we come to the end of the piece of text, there is a very useful convention, popularised by the C language, which consists in adding a zero value to the end of the string, so that ‘cat’ actually becomes the numeric sequence ‘99, 97, 116, 0’, with a final zero value that signals the end of the text. Such null-terminated strings are often referred to as ‘C strings’. The numeric values for the letters of the Latin alphabet and some special codes (like the null character with value zero that C uses to signal the end of the string) have been standardised for a long time in the form of the ASCII scheme, which was officially endorsed in the US by President Lyndon Johnson in 1968 (see the Wikipedia article).

The ASCII scheme codifies the letters of the basic Latin alphabet (without accented letters) and the basic punctuation signs (like ‘.’ and ‘,’) by using the numbers between 0 and 127, a range of values that can be encoded in just seven bits. In this way, the digital representation of ‘cat’, including the final zero, would occupy 28 bits in memory. A system based on just 128 values is clearly insufficient to represent variations of the Latin alphabet that include accented letters, let alone other writing systems like the Greek alphabet or Chinese characters. In order to support more characters, the ASCII scheme needs to be supplemented with additional codes, and this is what was done in the 80s and 90s by the software companies that were struggling to make their products work with a variety of languages. During that time, a lot of incompatible extensions to support non-ASCII characters were devised. Most of these extensions are based on the idea of using eight bits rather than seven for each individual character. Since bits in memory are typically grouped in octets, and computer programs written in languages such as C access memory by indexing these octets or bytes, characters are usually stored as full octets that can store the numeric values between 0 and 255. So, rather than encoding ‘cat’ as 28 bits of memory, the usual situation since the 80s is that a full octet is used for each letter and for the terminating null character, and the memory occupied by ‘cat’ is four octets or 32 bits. The additional eighth bit provides a range of values from 128 to 255 that goes beyond the scope of the original ASCII scheme.
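The ‘99, 97, 116, 0’ layout is easy to inspect from C itself. The following is only a small sketch: the helper name string_codes is made up for this post, and it assumes a compiler whose execution character set is ASCII-compatible, which is the usual case.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the numeric value of every octet of a C string, terminating
   zero included, into 'codes'. Returns the number of octets copied.
   Hypothetical helper, written only to illustrate the layout. */
static size_t string_codes(const char *s, int *codes, size_t max)
{
    size_t n = strlen(s) + 1;           /* +1 for the terminating zero */
    assert(n <= max);
    for (size_t i = 0; i < n; i++)
        codes[i] = (unsigned char)s[i]; /* one octet per character */
    return n;
}
```

Calling string_codes("cat", codes, 8) fills the array with 99, 97, 116, 0 and returns 4, i.e. the four octets or 32 bits mentioned above.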

The natural way to extend the ASCII system to support other languages consists in using the extended range between 128 and 255 for additional letters, and this is what software companies did in the 80s and 90s. So, for example some companies including Microsoft decided to assign the value 225 to the character ‘á’ so that a Spanish word like ‘árbol’ would be represented by the sequence ‘225, 114, 98, 111, 108, 0’. The problem was that each software vendor made their own uncoordinated decisions about which numeric values to use. For example, Apple developed its own ASCII Extended Character Set for Mac, where ‘á’ was represented by the value 135, so the first letter in the word ‘árbol’ would not be interpreted correctly by a Macintosh program if it had been originally stored using the Windows character set. The accented letters used in Western European languages like Spanish, Portuguese, German and the Nordic languages are not too many, so it was possible to put all those characters in one extension. But it is of course impossible to support all languages simultaneously in this way. Because of that, a Russian-language version of an operating system like Windows or Mac would use the extended range 128-255 to codify the Cyrillic alphabet, whereas a Greek-language operating system would use those codes for the Greek letters. These alternative uses of the extended range are referred to as ‘code pages’. In order to determine how to interpret some binary data as text, a multilanguage program such as a web browser would need to know under which code page the text has been encoded, so that Russian text can appear as Russian even if the operating system uses a western European code page.

Languages like Chinese and Japanese, which use thousands of characters, pose a more difficult problem since it is obviously impossible to codify thousands of characters in a range of just 128 values. The nifty solution for this consists in using the eighth bit as a hint that the following octet must also be taken into account in order to determine the represented character. By using the eighth bit in combination with the full following octet we get up to 32,768 (128 x 256) available codes. In practice the number is a bit lower, since values with a special meaning, like the null character, cannot be used for the second byte, and some of these schemes, such as the EUC family, completely avoid the 0-127 range in the second byte. Even then, we would still have 16,384 (128 x 128) codes. These kinds of encoding schemes used for Chinese, Japanese, and Korean are called multibyte character sets (MBCS), and they are the first historical appearance of variable-width encoding, where some characters are longer than others, which is the idea behind the current universal standard UTF-8.

This profusion and lack of coordination of encoding schemes explains the frequency of such common problems as ‘why does this email in Chinese appear as rubbish?’ or ‘I registered as Andrés Fernández and my name is now displayed as AndrÃ©s FernÃ¡ndez’. Furthermore, the existence of so many encodings made it nearly impossible to display different languages in one document. For example, an article in Spanish about the Chinese language would require having both accented letters and Chinese characters within the same document. The only way to achieve this in the code page jungle of the 90s would have been to write some clever software that could tag segments of text with different code pages, so that the document would be rendered by parsing the segments of text differently depending on their stated code page. Another possibility was to use special fonts that would use bitmaps for, say, Russian or Greek letters regardless of the code page. Such techniques were complicated and a bit hackish. A simpler idea that would eventually win out consists in having a universal character set. Just like an ‘a’ is always 97, would it not be possible to assign unambiguous codes to all the letters and characters in common use in all the written languages? By doing that, if the value 1604 corresponds to the Arabic letter ل, there will never be any risk that it gets mistaken for a Russian letter or a Chinese character. 1604 will unequivocally be ل just like 97 is ‘a’. This is the idea of universal encoding, and the first serious proposal was published by Joe Becker in 1988 with the name Unicode.
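The ‘AndrÃ©s’ effect mentioned above is what you get when bytes that were written as UTF-8 are read as if they were ISO-8859-1 and then re-encoded. The sketch below reproduces it. The function name latin1_to_utf8 is my own, and it relies on the fact that every ISO-8859-1 byte value is equal to its Unicode code point.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Re-encode bytes that are assumed to be ISO-8859-1 as UTF-8.
   Values below 128 are copied as they are; values from 128 to 255
   become two UTF-8 bytes. 'out' must have room for
   2 * strlen(in) + 1 bytes. Hypothetical helper for illustration. */
static void latin1_to_utf8(const char *in, char *out)
{
    assert(in != NULL && out != NULL);
    size_t n = strlen(in);
    for (size_t i = 0; i < n; i++) {
        unsigned char b = (unsigned char)in[i];
        if (b < 0x80) {
            *out++ = (char)b;
        } else {
            *out++ = (char)(0xC0 | (b >> 6));   /* 110000xx */
            *out++ = (char)(0x80 | (b & 0x3F)); /* 10xxxxxx */
        }
    }
    *out = '\0';
}
```

If the input is "Andr\xC3\xA9s", which is ‘Andrés’ already encoded as UTF-8, the two bytes of ‘é’ are treated as two separate ISO-8859-1 characters, ‘Ã’ and ‘©’, and the output is "Andr\xC3\x83\xC2\xA9s", which a UTF-8 aware program displays as ‘AndrÃ©s’.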

An encoding scheme consists not only of a mapping of characters to numeric values, but also of a specification of the binary layout that those values should assume. Note that ‘Unicode’ is simply an assignment of numeric values, referred to as ‘code points’, and not a full encoding scheme like UTF-8 or UCS-2. The Unicode specification differentiates between these two aspects of encoding through the terms ‘coded character set’ (CCS) for the correspondence between characters and numeric values, and ‘character encoding form’ (CEF) for the particular binary layout that the numeric values or code points should assume. We have mentioned that the value 97 for ‘a’ can be stored in just seven bits, but a value like 1604 will require at least 11 bits. At the time when the first Unicode draft was written, it was taken for granted that 16 bits would be enough to express all characters in common use in the world. At that time, it was not expected that a universal encoding should take into account ancient scripts or rare variants of Chinese and Japanese characters, so the range between 0 and 65,535 that can be represented by two octets seemed to be more than enough for a universal encoding. Building on this idea, the natural way to move from the ASCII system to Unicode would require the use of 16 rather than eight bits for characters. In this way, a character like ‘a’ would still be represented by the numeric value 97, just as in ASCII, but it would be encoded as 16 bits. The idea of making characters larger than single bytes in order to support all languages was built into the Java programming language, designed after C and C++, so that in that language byte and char are distinct built-in types and a char has twice the size of a byte. The situation in the C language, in which operating systems like Windows and Unix are programmed, was more complex.

Because of the ASCII mindset that identified characters and bytes, the C type char was used in a double role as a type for both raw bytes, the smallest built-in type (in fact, the expression sizeof(char) must always be 1 by definition in both C and C++), and for alphanumeric characters. Because the char type could not be widened to two bytes without a fundamental change in the built-in types, a new type wchar_t was added. The name follows the pattern of other types like size_t and ptrdiff_t, where the ‘_t’ suffix denotes a type that is not built-in, but defined through a typedef declaration in a standard header file. The C++ standard in 1998 made wchar_t a built-in type, so the naming style is a bit of a relic in C++ (not so in C, where it is still defined in stddef.h). With the addition of wchar_t, it is possible to use the types char and wchar_t in much the same way as the byte and char of Java, so an important semantic distinction between raw bytes and text is established in this way.

The use of wide characters to support the Unicode character mappings seemed such a good idea in the 90s that a language like Java and an operating system like Windows NT adopted this practice throughout. This gave Java and Windows an edge in internationalisation support at the time. However, this approach has turned out to be problematic in the end. Basically, there are three problems with the use of 16-bit wide characters:

1. They break all the legacy code that treats characters as single bytes, so any program or library written in languages like C or C++ that works fine with ASCII characters needs to be modified at the source code level in order to support wide characters.
2. The use of 16-bit wide characters doubles the size of text files in English. So, a file of 20 kB that uses plain ASCII characters will become 40 kB when converted to wide characters. If the text does not include any accented letters or characters from non-Latin scripts, every other byte will just be a zero value. Even if hard drives have huge capacities these days, this redundancy is not good for documents that are transferred over a network, such as websites, so people are reluctant to use wide characters for English-language documents.
3. The Unicode list has been expanded to include ancient scripts and variant forms of East Asian characters, including thousands of obsolete Chinese and Vietnamese characters. This has taken the number of supported characters beyond the 65,536 mark, so the full Unicode list can no longer be represented with 16 bits. Since Unicode 2.0 (July 1996), the numeric range used by Unicode is from 0 to 1114111 (or 10ffff in hexadecimal notation), which requires at least 21 bits.
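The bit counts quoted in this post (seven bits for 97, eleven for 1604, sixteen for 65,535 and twenty-one for 10ffff) can be checked with a tiny helper. This is just an illustrative sketch, and the name bits_needed is made up for the occasion.

```c
#include <assert.h>

/* Number of bits needed to write 'value' in binary (at least 1).
   Hypothetical helper, only used to check the figures in the text. */
static unsigned bits_needed(unsigned long value)
{
    unsigned bits = 1;
    while (value >>= 1)   /* drop one bit per iteration until nothing is left */
        bits++;
    return bits;
}
```

With this, bits_needed(0xFFFF) is 16 and bits_needed(0x10FFFF) is 21, which is why a single 16-bit unit can no longer hold every code point.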

Number 3 is a killer. If 16 bits are not enough to store all the characters that you may find, for example, in a Wikipedia article about the history of the Vietnamese language, we are back at the same situation where we were with ASCII, and we will need to extend the system with some nifty multibyte solution to support, well, three-byte characters. Of course, this limitation can be overcome by making the range larger again. If characters are represented by 32 bits, which is a typical size for integer values, then we have a maximum range of 4,294,967,296 values. It seems reasonable that all the characters in all the scripts in human history are well under four billion in number, so using 32-bit values looks like a completely safe bet to represent all possible characters. This encoding that uses 32-bit values has been given the official names ‘UCS-4’ or ‘UTF-32’, whereas the original use of 16-bit values as in Windows NT or Java is called ‘UCS-2’, which cannot accommodate the high-code-point values that require more than 16 bits. A variable-width version of UCS-2 is UTF-16, which is like UCS-2 except that the high-code-point characters beyond the 16-bit mark are encoded as 32 bits in pairs of 16-bit units that are referred to as ‘surrogate pairs’. The use of wide characters in Java and Windows is now officially UTF-16, so that the surrogate pairs are in theory also supported, but this means that the one-to-one mapping between each char or wchar_t and a Unicode code point has been lost, and Java and Windows are back in variable-width encoding territory again. The official Unicode website has a very interesting FAQ page regarding the different UTF schemes.
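To make the surrogate pair idea more concrete, here is a minimal sketch of how a code point is turned into UTF-16 units. The function name utf16_encode is my own and the code skips validation of the reserved surrogate range, so take it as an illustration of the arithmetic rather than a complete encoder.

```c
#include <assert.h>
#include <stdint.h>

/* Encode one code point as UTF-16. Writes one or two 16-bit units to
   'out' and returns how many were written. Hypothetical helper; it
   does not reject code points in the reserved range D800-DFFF. */
static int utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                  /* fits in a single unit */
        return 1;
    }
    cp -= 0x10000;                              /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate: bottom 10 bits */
    return 2;
}
```

The Arabic letter with code point 1604 (0644 in hexadecimal) still fits in one unit, whereas the first letter of the Phoenician block, U+10900, comes out as the pair D802 DD00.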

UCS-4 has not been popular, however, because it only exacerbates issues 1 and 2 above. If 16 bits are not enough, then not only traditional ASCII code, but all the code written for 16-bit wide characters like Windows NT and the Java language itself would have to be re-written to support a larger character type. And then the size of files would now be multiplied by 4 and most of the data would be zeroes, only there just in case an ancient Phoenician character pops up in the middle of the text.

2. The advantages of UTF-8

The outcome of the wide-character fiasco is that a much better solution has been found in UTF-8. This encoding uses the multibyte idea that was first applied to Chinese and Japanese encodings. Basically, the ASCII characters are stored in eight bits with the most significant bit set to zero. If the most significant bit of a byte is set, then the represented character is non-ASCII, and one or more further bytes are required to determine its numeric value. The layout of the bits in the first byte actually determines whether one, two, three or four bytes make up the character. The encoding rules of UTF-8 are complicated, and it has the main problem of multibyte encodings, which is that it is difficult for the programmer to determine the boundaries between full characters or to know how many characters rather than bytes make up a string. This problem, however, is not as important as it may appear since programmers usually handle text strings just to compare for equality or to append one string to another. How the bytes make up the real characters that get displayed by fonts is very often irrelevant for most applications. And apart from the fact that the variable width is not as terrible as is often thought, UTF-8 has two extraordinary advantages:
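The bit layouts described above can be sketched in a few lines of C. The names utf8_encode and utf8_sequence_length are made up for this post, and the encoder leaves out some checks that a production implementation would need (for example, rejecting the surrogate range), so treat it as a sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Encode one code point as UTF-8; returns the number of bytes written
   (1 to 4). Hypothetical helper without full validation. */
static int utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {                    /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                   /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                 /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Length of a sequence as announced by its first byte, or 0 if the
   byte is a continuation byte (10xxxxxx) or otherwise not a lead byte. */
static int utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80)           return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}
```

For instance, ‘a’ (97) stays as the single byte 61 in hexadecimal, the Arabic letter ل (1604) becomes D9 84, the Chinese character 我 (U+6211) becomes E6 88 91, and the Phoenician letter U+10900 becomes F0 90 A4 80.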

1. It is a superset of ASCII.

It is difficult to emphasise strongly enough the importance of this. UTF-8 has been designed so that ASCII stored in 8-bit bytes is a particular case of it, and so a file that contains ASCII text in 8-bit characters is a valid UTF-8 file too. This backward compatibility with ASCII means that legacy code can be made to work with non-ASCII characters with very little effort (if any at all). Most operations on strings in the C language, like determining where a string ends by looking for a terminating null byte, or comparing for equality using strcmp, work perfectly with text encoded in UTF-8. Add to that the fact that plain English-language text doesn’t pay any price in storage or performance because of the existence of thousands of East Asian characters, and it is not surprising that UTF-8 has become the most popular text encoding, the ASCII of the 21st century.
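As a small illustration of this compatibility, the sketch below uses only the byte-oriented functions of the standard C library on UTF-8 text. The helper names are hypothetical, the counting function assumes that its input is valid UTF-8, and the equality check only says whether two strings are byte-for-byte identical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count code points in a UTF-8 string: every byte that is not a
   continuation byte (10xxxxxx) starts a new character.
   Hypothetical helper; assumes valid UTF-8 input. */
static size_t utf8_count_code_points(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Equality needs no decoding at all: plain strcmp is enough when the
   question is whether two strings hold exactly the same bytes. */
static int utf8_equal(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

With the Spanish word ‘árbol’ written as "\xC3\xA1rbol", strlen still finds the terminating null and reports 6 bytes, utf8_count_code_points reports 5 characters, and utf8_equal works without knowing anything about accents.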

2. The binary layout of one character cannot be part of the binary layout of another character

This is a great improvement over the multibyte schemes traditionally used for Chinese and Japanese. In those East Asian encodings, a byte with the same value as ‘a’ did not necessarily correspond to a letter ‘a’ embedded in a string, but it could just be the second byte (the ‘trailing byte’) of an East Asian character that happened to be equal in binary layout to the ASCII ‘a’. As a manifestation of this problem, let’s have a look at the following lines of C code:

    const char *text = NULL;
    const char *c = NULL;
    const char *text_following_underscore = NULL;

    text = function_that_reads_some_text();
    if (text)
    {
        /* Scan byte by byte until the first underscore or the end of the string. */
        for (c = text; text_following_underscore == NULL && *c != 0; c++)
        {
            if (*c == '_')
                text_following_underscore = c + 1;
        }
        (…)
    }

Now can you spot what’s wrong with the above code? This would make a good interview question, and I wonder how many programmers would be able to come up with an answer. As you will have guessed, the code above is broken under some East Asian encodings. In particular, both the Shift JIS Japanese encoding and the Big5 encoding used for traditional Chinese accept the ASCII code for the underscore ‘_’ as the numeric value for the trailing byte in many East Asian characters, and so the above code may potentially yield false positives when searching for an underscore. This can lead to very obscure bugs, where some East Asian customer would complain about crashes that no-one at the company can reproduce. This situation used to make searching for tokens and substrings particularly difficult under such East Asian encodings. But the good news is that this sort of problem never happens if the text returned by the function_that_reads_some_text above is encoded as UTF-8. Because UTF-8 guarantees that the binary layout of a character cannot reappear in the binary layout of another character using a higher number of octets, an underscore will really be an underscore and an ‘a’ will really be an ‘a’, and never a binary fragment of a high-code-point character. Thanks to this awesome property, searching for particular characters or substrings is completely safe in UTF-8.
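The false positive can be reproduced with a single Japanese character. In the sketch below, after_underscore is a hypothetical stand-in for the loop above, and the byte values for the katakana ダ (U+30C0) in Shift JIS are quoted from the code table as I remember it, so treat them as an assumption to verify against a reference.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte-wise search for an underscore, equivalent to the loop above.
   Returns the text following the first '_' byte, or NULL if there is
   none. Hypothetical helper written for this example. */
static const char *after_underscore(const char *text)
{
    const char *p;
    assert(text != NULL);
    p = strchr(text, '_');
    return p != NULL ? p + 1 : NULL;
}

/* The katakana DA in two encodings. Assumption: its Shift JIS form is
   the byte pair 83 5F, whose trailing byte equals the ASCII '_'. */
static const char da_shift_jis[] = "\x83\x5F";
static const char da_utf8[]      = "\xE3\x83\x80";
```

Searching da_shift_jis finds an ‘underscore’ that is not there, because the second byte of the character is 5F. Searching da_utf8 correctly finds nothing, since all three bytes of the character have the most significant bit set and can never be confused with an ASCII value.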

With the benefit of hindsight, I would say that the designers of Windows NT and Java got it wrong by jumping on the wide character bandwagon. Lots of resources have been devoted by companies to internationalisation plans with managers telling programmers to rewrite all the code turning chars into wchar_ts, "literals" into L"literals", strcmp into wcscmp, and so on. I have made this mistake in the past too. In comparison, using UTF-8 is much simpler. Operating systems like Unix and Linux that stuck to strings made up of octets eventually got much more elegant internationalisation support thanks to the success of the simple yet beautiful UTF-8 encoding.

In the next post, which I expect to publish next week, I will address some of the issues that arise when using UTF-8 as the internal character representation in C and C++ source code.

3. References

Unicode, article in Wikipedia. There are other interesting related articles, which make for very good reading on the subject, like UTF-8, UTF-16, Character encoding and Variable-width encoding.
A tutorial on character code issues, a superb and very comprehensive introduction to how encodings work by Jukka “Yucca” Korpela.
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), an excellent introduction to Unicode and character encodings by Joel Spolsky.
Web Application Component Toolkit – PHP – Character sets / character encoding issues, a very good discussion of encoding issues in PHP.
Basic Questions, the basic Frequently Asked Questions by the Unicode Consortium.
UTF-8, UTF-16, UTF-32 & BOM, the Frequently Asked Questions on the various UTF encodings by the Unicode Consortium.
