Thanks for signing up, Mr. González – Welcome back, Mr. GonzÃ¡lez!

In my previous post on this blog, I mentioned my frustration with programs that unexpectedly fail on Unicode support and character encoding issues. I thought I should write a post about this because it never ceases to amaze me how, more than ten years into the 21st century, there is still a lot of software around that can't cope with accented letters or non-Latin characters. And this happens quite often on the web too. I experience it constantly because my name contains accented letters, and it is displayed incorrectly in many of the e-mails I receive. The situation exemplified by the title of this post will be very familiar to anyone whose name contains any of the accents and other diacritics common in most European languages.

These problems with accented letters typically happen because an inconsistency in character encodings creeps in as the information flows through different systems. Just think about the process involved in signing up for a website. In order to register as a user you typically fill in an HTML form, which may use some JavaScript, and the information you enter is sent to a server, where it may be processed by some scripting code and stored in a database. When your user name is retrieved, for example in order to send you a notification by e-mail, it has to be read from a table in the database and handled by scripting code again, which generates the text of the e-mail that finally appears in your web-based inbox as an HTML document. Problems with accented letters and non-Latin characters arise when there is a mismatch between the encoding used to read the text in one place and the encoding used to write it somewhere else. For example, the conversion from ‘González’ to ‘GonzÃ¡lez’ in the title of this post is a typical case, which happens when text that was originally encoded as UTF-8 is misinterpreted as ISO-8859-1: the two bytes that encode ‘á’ in UTF-8 (0xC3 0xA1) are read as the two separate Latin-1 characters ‘Ã’ and ‘¡’. This is quite sloppy. In my view, just as scientists in a laboratory should never mix up metric and imperial units, people who write software should devote some time to thinking about these issues and ensure that they stick to one character encoding consistently throughout the application they’re developing. Unfortunately, a lot of programs and websites still run into such problems, and this is a shame, especially because nearly 20 years have elapsed since the universal encoding UTF-8, one of the various flavours of Unicode, was first proposed, and all these problems simply go away if you use UTF-8 consistently and avoid region-specific encodings.
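
As a quick illustration, here is a minimal PHP sketch that reproduces the mangling in the title (it assumes the mbstring extension is available and that the source file itself is saved as UTF-8):

<?php
// The file is saved as UTF-8, so 'á' is stored as the two bytes 0xC3 0xA1.
$name = 'González';

// Pretend those UTF-8 bytes are ISO-8859-1 and convert them to UTF-8 for
// display: each byte becomes a separate character (0xC3 -> 'Ã', 0xA1 -> '¡').
$garbled = mb_convert_encoding($name, 'UTF-8', 'ISO-8859-1');

echo $garbled; // prints 'GonzÃ¡lez'
?>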

I find it particularly annoying how a lot of websites here in Spain, and in Western Europe in general, insist on using the dreadful ISO-8859-1 encoding (also known as Latin-1). There are plenty of exceptions, of course, but, in general, a lot of web developers still stick to ISO-8859-1. And why, may I ask? If anyone can name a reason that would make ISO-8859-1 preferable to UTF-8, I would love to know what it is. The only advantage of local encodings over UTF-8 I can think of is the size of the resulting document, since an accented letter occupies two bytes in UTF-8 as opposed to just one byte in ISO-8859-1. However, in most European languages, letters with diacritics make up such a small percentage of the text and markup in web documents that this size advantage is not worth the trouble. I admit that the argument carries more weight for non-Latin scripts, where the size of the plain text roughly doubles for languages like Russian and grows by a factor of about 1.5 for Chinese, Japanese and Korean. Even in those cases, I think the difference in size doesn’t justify the extra trouble of dealing with restricted character sets. The real reason why so many developers still use ISO-8859-1, which only covers some basic characters used in a handful of Western European languages, is that the HTTP 1.1 specification defines it as the default encoding for serving documents, so web servers use it by default. It is also, in effect, the default encoding of text editors on Western-language versions of Windows like English and Spanish (strictly speaking, Windows uses its superset Windows-1252). So a lot of websites use local encodings like ISO-8859-1 simply because most people don’t feel confident about character encoding issues and just use whatever is there by default. But in this case it really is worth making the effort to change the default settings.
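
To put numbers on that size overhead, here is a small PHP sketch (again assuming mbstring and a source file saved as UTF-8; strlen in PHP counts bytes, not characters, which is exactly what we want here):

<?php
// strlen() counts bytes, so it exposes the storage cost directly.
echo strlen('á'), "\n";   // 2: an accented Latin letter takes two bytes in UTF-8
echo strlen(mb_convert_encoding('á', 'ISO-8859-1', 'UTF-8')), "\n"; // 1: one byte in Latin-1
echo strlen('中'), "\n";  // 3: a CJK character takes three bytes in UTF-8
?>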

In order to prevent character encoding issues while fully supporting Unicode, I am a firm advocate of using UTF-8 as much as possible. Only when we need to interact with an external database or programming library over which we have no control should we handle other encodings, converting the text to our internal UTF-8 representation immediately after interacting with the third-party system. I intend to write a few posts about character encodings. In this first post, I’m going to discuss how we can ensure that UTF-8 is used consistently in web development projects that use HTML, CSS, JavaScript and optionally PHP and MySQL. Things should be similar with other server-side technologies such as Active Server Pages and with other database systems. In future posts, I will discuss some general issues about encoding and how to address Unicode support in programs written in C or C++.
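
As a sketch of this ‘convert at the boundary’ idea in PHP (get_legacy_text here is a hypothetical stand-in for any third-party call outside our control; it is stubbed out so the example runs):

<?php
// Hypothetical stand-in for a library or database call outside our control
// that returns ISO-8859-1 text (here just hard-coded Latin-1 bytes).
function get_legacy_text() {
    return "Gonz\xE1lez"; // 'González' in ISO-8859-1: 'á' is the single byte 0xE1
}

// Convert to UTF-8 immediately at the boundary...
$text = mb_convert_encoding(get_legacy_text(), 'UTF-8', 'ISO-8859-1');

// ...so that the rest of the application only ever handles UTF-8.
echo $text; // prints 'González' on a UTF-8 page
?>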

1. Simple steps to remember in order to use the UTF-8 encoding in web development

In keeping with this policy, all the HTML and PHP documents we write at Retibus Software are encoded as UTF-8. That means we declare the character set to be UTF-8 within the document with a meta declaration:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

In HTML5, the meta tag admits a simpler form that just specifies the character set of the document:

<meta charset="UTF-8">

Similarly, all our XML files declare the UTF-8 encoding:

<?xml version="1.0" encoding="UTF-8"?>

And then the files must, of course, be saved with UTF-8 encoding, which any modern text editor allows. The great thing about saving all files as UTF-8 text (including .css and .js files) is that one never needs to use escape sequences for characters outside the local encoding. For example, I have written multilingual sites with a link to the Spanish version labelled ‘Español’ and a link to the Chinese version labelled 中文. By using UTF-8, we can have these strings just like that within the source files. By contrast, if we saved an HTML file as ISO-8859-1, we would have to type 中文 as ‘&#x4E2D;&#x6587;’. And if we saved it with a Chinese encoding, the Chinese text would be fine, but ‘Español’ would have to be typed as ‘Espa&ntilde;ol’. By using UTF-8 (or another Unicode encoding like UTF-16), we can type or paste any characters directly into the source files, and we never need to fall back on cumbersome escape sequences.
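
For instance, with the file saved as UTF-8 both labels can be written literally, whereas a Latin-1 file would need numeric escapes for the Chinese (the file names are hypothetical placeholders):

<!-- File saved as UTF-8: both labels typed directly -->
<a href="index_es.html">Español</a>
<a href="index_zh.html">中文</a>

<!-- The same Chinese link in an ISO-8859-1 file needs escapes -->
<a href="index_zh.html">&#x4E2D;&#x6587;</a>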

Because browsers inspect the head of the document and make their final decision about which encoding to use once they come across the meta tag that specifies it, it is advisable never to use any non-ASCII characters before the character set declaration in the HTML markup. That’s why I always put the meta tags before the title tag. This prevents possible parsing errors when the HTTP headers that preceded the HTML document didn’t specify the UTF-8 character set. Such inconsistencies may be benign most of the time, but for consistency we should ensure that the HTTP headers themselves indicate that the HTML document that follows uses UTF-8. This requires configuring the web server so that files are served with the right encoding. On an Apache server this can be done by adding the following line to the .htaccess file:

AddDefaultCharset UTF-8

For alternative ways of handling this that don’t affect the behaviour of the whole server, check articles 4 and 5 in the references below.
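
One such narrower option, as a sketch, is mod_mime’s AddCharset directive, which applies the charset only to files with particular extensions:

AddCharset UTF-8 .html .css .js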

When it is not possible to modify the settings of the web server, you can use PHP to send an HTTP header specifying the content type and the character set immediately before the HTML document is sent. This is done by using the header function:

header('Content-Type: text/html; charset=utf-8');
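
One caveat worth noting: header() only works if it is called before any output has been sent. A minimal sketch of a complete page (the Spanish strings are just sample content):

<?php
// header() must run before any output, so no whitespace or BOM
// may precede this opening PHP tag.
header('Content-Type: text/html; charset=utf-8');
?>
<!DOCTYPE html>
<html lang="es">
<head>
<meta charset="UTF-8">
<title>Bienvenido, Sr. González</title>
</head>
<body>
<p>Gracias por registrarse, Sr. González.</p>
</body>
</html>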

Finally, if we use a database, we should make sure that it is created with the UTF-8 encoding too. This varies across database systems, but in general it should be possible to specify a default encoding for new databases and their tables. Many systems, like MySQL, still use ISO-8859-1 by default (referred to as ‘latin1’ in the MySQL settings), so we need to be careful to override that default as a global setting or, alternatively, to specify UTF-8 as the character encoding every time we create a new database. References 6 and 7 below provide more detailed information about using UTF-8 with MySQL.
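
As a sketch of the per-database approach using PHP’s mysqli extension (the host, credentials and database name are placeholders), the character set can be specified both when creating the database and on the connection itself:

<?php
// Placeholder credentials; adapt to your own server.
$db = new mysqli('localhost', 'user', 'password');

// Create the database with UTF-8 as its default character set,
// so new tables inherit it unless they specify otherwise.
$db->query("CREATE DATABASE example_db CHARACTER SET utf8 COLLATE utf8_general_ci");

// Make the client connection itself use UTF-8, so text travels to and
// from the server without any implicit latin1 conversion.
$db->set_charset('utf8');
?>

Note that MySQL’s ‘utf8’ character set stores at most three bytes per character; more recent MySQL versions also offer ‘utf8mb4’, which covers the full Unicode range.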

2. References

  1. Handling character encodings in HTML and CSS, an article by the World Wide Web Consortium.
  2. Declaring character encodings in HTML, an article by the World Wide Web Consortium.
  3. Setting the HTTP charset parameter, an article by the World Wide Web Consortium.
  4. Setting charset information in .htaccess, an article by the World Wide Web Consortium.
  5. Setting charset in htaccess, an article on askapache.com.
  6. Configure Rails and MySQL to Support UTF-8, a blog post that specifies the options that need to be set in the my.cnf file in order to ensure that UTF-8 is treated as the default character set in MySQL.
  7. Getting out of MySQL Character Set Hell, an excellent and detailed article about the difficulties of converting an ISO-8859-1-encoded database to UTF-8 in MySQL.