In my previous post on this blog, I mentioned my frustration with programs that unexpectedly fail on Unicode support and encoding issues. I thought I should write a post about this because it never ceases to amaze me how, more than ten years into the 21st century, there’s still a lot of software around that can’t cope with accented letters or non-Latin characters. And this happens quite often on the web too. I experience this constantly because my name has some accented letters and it is displayed incorrectly in a lot of the e-mails I receive. The situation exemplified by the title of this post will be very familiar to anyone whose name contains any of the accents and other diacritics common in most European languages.
I find it particularly annoying how many websites here in Spain, and in Western Europe in general, insist on using the dreadful ISO-8859-1 encoding (also known as Latin-1). There are plenty of exceptions, of course, but, in general, a lot of web developers still stick to ISO-8859-1. And why, may I ask? If anyone can name a reason that would make ISO-8859-1 preferable to UTF-8, I would love to know what it is.
The only advantage of local encodings over UTF-8 I can think of is the size of the whole document, since an accented letter occupies two bytes in UTF-8 as opposed to just one byte in ISO-8859-1. However, in most European languages, letters with diacritics make up such a small percentage of the text and markup in web documents that the saving is not worth the trouble. I admit that it may be more justifiable in the case of non-Latin scripts, where the size of plain text may grow by a factor of 2 for languages like Russian, and by a factor of 1.5 for Chinese, Japanese and Korean text. Even in these cases, I think the difference in size doesn’t justify the extra trouble of dealing with restricted character sets.
The real reason why so many developers still use the ISO-8859-1 encoding, which only covers the basic characters used in a handful of Western European languages, is that the HTTP 1.1 specification defines it as the default encoding for serving documents. Consequently, web servers use this encoding by default, and it is also the default encoding used by text editors in versions of Windows for Western languages like English and Spanish. So, a lot of websites use local encodings like ISO-8859-1 simply because most people don’t feel confident about character encoding issues and will just use whatever is there by default. But in this case, it really is worth making the effort to change the default settings.
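The size figures mentioned above are easy to check. The following is a minimal sketch in Python, not part of the original workflow: the sample strings, the choice of KOI8-R and GB2312 as the "local" encodings, and the helper name `size_ratio` are all assumptions made for illustration.

```python
# Compare how many bytes the same text needs in UTF-8 and in a legacy
# local encoding. The samples and legacy encodings are illustrative choices.
samples = {
    "Español": "iso-8859-1",  # Western European
    "Привет": "koi8-r",       # Russian (Cyrillic)
    "中文": "gb2312",          # Simplified Chinese
}


def size_ratio(text: str, legacy: str) -> float:
    """Return len(UTF-8 bytes) / len(legacy bytes) for the given text."""
    return len(text.encode("utf-8")) / len(text.encode(legacy))


for text, legacy in samples.items():
    print(text, legacy, round(size_ratio(text, legacy), 2))
# Español iso-8859-1 1.14
# Привет koi8-r 2.0
# 中文 gb2312 1.5
```

For the Spanish word only one letter out of seven grows, and in a real page the surrounding ASCII markup dilutes the difference much further.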
1. Simple steps to remember in order to use the UTF-8 encoding in web development
As a result of this convention, all the HTML and PHP documents we write at Retibus Software are encoded as UTF-8. This means that we declare the character set to be UTF-8 within the document with a meta declaration:
<meta http-equiv="Content-type" content="text/html;charset=UTF-8">
In HTML5 the meta tag can be used simply to specify the character set used by the document:
<meta charset="UTF-8">
Similarly, all our XML files declare the UTF-8 encoding:
<?xml version="1.0" encoding="UTF-8"?>
And then the files must, of course, be saved with UTF-8 encoding, which any modern text editor will allow you to do. The great thing about saving all files as UTF-8 text (including .css and .js files) is that one never needs to use escape sequences for characters outside the local encoding. For example, I have written multilingual sites where there would be a link to the Spanish version labelled as ‘Español’ and a link to the Chinese version labelled as ‘中文’. By using UTF-8, we can have these strings just like that within the source files. By contrast, if we saved an HTML file as ISO-8859-1 we would have to type ‘中文’ as ‘&#20013;&#25991;’. And if we saved it with a Chinese encoding, the Chinese text would be fine, but ‘Español’ would have to be typed as ‘Espa&ntilde;ol’. By using UTF-8 (or another Unicode encoding like UTF-16), we can type or paste any characters directly in the source files, and we don’t need to fall back on cumbersome escape sequences at all.
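To make the escape sequences concrete, here is a small Python sketch of what a restricted encoding forces on us. The helper name `to_ascii_with_references` is hypothetical, and the example produces numeric character references, which are an equivalent alternative to named entities such as the one for ‘ñ’.

```python
import html


def to_ascii_with_references(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_ascii_with_references("中文"))      # &#20013;&#25991;
print(to_ascii_with_references("Español"))  # Espa&#241;ol

# Going the other way: a browser (or html.unescape) turns both named and
# numeric references back into the real characters.
print(html.unescape("Espa&ntilde;ol &#20013;&#25991;"))  # Español 中文
```

With UTF-8 source files none of this round trip is needed, because the characters are stored as themselves.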
Because browsers inspect the head section and make their final decision about which encoding to use once they come across the meta tag that specifies the encoding, it is advisable never to use any non-ASCII characters before the character set declaration in the HTML markup. That’s why I always put the meta tags before the title tag. This prevents possible parsing errors if the HTTP header that preceded the HTML document didn’t specify the UTF-8 character set. Such mismatches may be benign most of the time, but to be safe we should try to ensure that the HTTP headers also indicate that the HTML document that follows uses UTF-8 encoding. This requires configuring the web server so that files are served with the right encoding. In an Apache server this can be done by adding the following line to the .htaccess file:
AddDefaultCharset UTF-8
For alternative ways of handling this that don’t affect the behaviour of the whole server, see articles 4 and 5 in the references below.
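To see why a mismatch between the declared encoding and the real one matters, here is a short Python sketch of what happens when UTF-8 bytes are read as ISO-8859-1. The variable names are illustrative only, and the sample word is just an example.

```python
name = "Español"

# The file is saved as UTF-8, so 'ñ' is stored as two bytes: 0xC3 0xB1.
utf8_bytes = name.encode("utf-8")

# A client that believes the text is ISO-8859-1 turns each byte into
# one character, so the two bytes become 'Ã' and '±'.
garbled = utf8_bytes.decode("iso-8859-1")
print(garbled)  # EspaÃ±ol

# As long as nothing was lost on the way, the damage can be reversed.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired)  # Español
```

This is the familiar pattern of an accented letter showing up as two odd characters, and it disappears once the server, the document and the client all agree on UTF-8.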
When it is not possible to modify the settings of the web server, you can use PHP to send an HTTP header specifying the content type and the character set immediately before the HTML document is sent. This is done by using the header function:
header('Content-type: text/html; charset=utf-8');
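On the receiving side, a client reads the charset parameter from that header. The following Python sketch shows one way to extract it using only the standard library. The function name `charset_from_content_type` is hypothetical, and falling back to ISO-8859-1 is an assumption that mirrors the HTTP 1.1 default discussed earlier, not a description of what any particular browser does.

```python
from email.message import Message


def charset_from_content_type(header_value: str,
                              default: str = "iso-8859-1") -> str:
    """Return the charset parameter of a Content-Type value, lowercased.

    Falls back to `default` when the header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default


print(charset_from_content_type("text/html; charset=utf-8"))  # utf-8
print(charset_from_content_type("text/html"))                 # iso-8859-1
```

The second call illustrates the situation the article warns about: with no charset in the header, the client is left to guess or to rely on the meta declaration inside the document.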
Finally, if we use a database, we should make sure that it is created with the UTF-8 encoding too. This varies across database systems, but in general it should be possible to specify a default encoding for new databases and their tables. Many systems, like MySQL, still use ISO-8859-1 by default (which is referred to as ‘latin1’ in the MySQL settings), so we need to take care to override that default value as a global setting or, alternatively, specify UTF-8 as the character encoding every time we create a new database. References 6 and 7 below provide more detailed information about using UTF-8 with MySQL.
1. Handling character encodings in HTML and CSS, an article by the World Wide Web Consortium.
2. Declaring character encodings in HTML, an article by the World Wide Web Consortium.
3. Setting the HTTP charset parameter, an article by the World Wide Web Consortium.
4. Setting charset information in .htaccess, an article by the World Wide Web Consortium.
5. Setting charset in htaccess, an article on askapache.com.
6. Configure Rails and MySQL to Support UTF-8, a blog post that specifies the options that need to be set in the my.cnf file in order to ensure that UTF-8 is treated as the default character set in MySQL.
7. Getting out of MySQL Character Set Hell, an excellent and detailed article about the difficulties of converting an ISO-8859-1-encoded database into UTF-8 using MySQL.