I spent the best part of today researching how to properly handle Unicode - specifically UTF-8 - with PHP and MySQL for Web applications. Since this is something that has stumped me for a while, I decided to make this post about it. This will be as much for my own benefit as anyone else's: hopefully I'll remember it for future use if I write about it, and I can always refer to this post at a later date if necessary.
A quick and very basic introduction to UTF-8: whereas the ASCII character encoding that we're all taught about first at school stores characters very simply with one byte per character, UTF-8 uses anything from one to six bytes to store each character, depending on how many bytes are actually needed to store it. This means it can store many more characters than ASCII or any other one-byte-per-character encoding which are limited to a maximum of 256 "characters" (the first 32 ASCII codes are actually unprintable control codes, leaving a mere 224 values for actual characters), which is enough for English (mostly) but not enough for the vast majority of other languages which have accents and all kinds of fancy letters. And that's before we get onto Japanese, Chinese, Korean and the like with thousands of characters each. UTF-16 and UTF-32 also allow many more than 256 characters to be stored, but they take 16 bits and 32 bits, respectively, per character to store data, which is fair enough if you're using a lot of high-numbered characters, but a huge waste of storage space, bandwidth and/or memory if you're primarily using the basic ASCII characters which otherwise fit in a single byte each. UTF-8 gets around this by storing the lower ASCII values (i.e. just 0-127 - the original ASCII standard, rather than any of the extended ASCII sets which filled in the remaining 128-255 range) in a single byte, and using the most significant bit to indicate if another byte is needed to store the character, allowing the number of bytes used by a character to grow if necessary, but only if it actually is necessary. This also has the pleasant side-effect of keeping the majority of the English language characters in the same position as they are in ASCII: only the odd characters like £ are moved around.
On to the details!
The first and, perhaps, most obvious thing that needs to be done is to set the "charset" to UTF-8 on the XHTML/HTML/XML pages in question, both in the Content-Type HTTP header and in a Content-Type <meta> tag (the <?xml version="1.0" encoding="UTF-8"?> tag in an XML document). The meta tag should be the first child of the <head> element, as a browser reaching that element will have to stop processing the page and start again from the top, reading it with the new character set. You need to set both because some browsers read one, while other browsers read the other. There's no harm in setting them both, just remember to keep them the same to avoid confusing yourself later on.
<? header('Content-Type: text/html; charset=utf-8'); ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
The second most important thing to do is tell MySQL that you're using UTF-8. This is done by sending an SQL statement as soon after connecting to the MySQL server as possible:
mysql_connect('localhost', 'username', 'password');
mysql_query('SET NAMES utf8');
Note how MySQL uses a non-standard name for UTF-8, with no hyphen. By sending this statement, MySQL knows that it should convert whatever it's using internally to UTF-8 before sending it to the client (PHP in this case), and that it should convert everything it receives from the client from UTF-8 into its internal charater encoding before doing anything with it. In reality, this may mean that it does no conversion at all - which is good, as then we don't waste CPU time converting character sets all over the place. It should even be possible to use UTF-8 in database, table and column names, though it isn't recommended as UTF-8 support in MySQL is still fairly new.
Now, in the paragraph above I mentioned that ideally MySQL will be storing everything in UTF-8 internally to avoid having to do any conversion. In order to achieve this, the columns that the data is stored in must be set to use UTF-8 (character sets can be set per-column in MySQL, remember). The easiest way to do this is by setting the default character set and the collation (the method used to sort result sets) when you create the database:
CREATE DATABASE mydatabase DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
Unfortunately this isn't always possible (some web hosts give you a single MySQL database and expect you to store all your tables in that one database). As an alternative, therefore, you can set the character set and collation on a table as you create it:
CREATE TABLE mytable (
mydata VARCHAR(128) NOT NULL
) ENGINE=MyISAM DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
Of course if you've already created the table you can change the default character set and collation for it with an ALTER TABLE statement (which will only affect newly-added columns to the table), or you can change the character set and collation for each individual column:
ALTER TABLE mytable DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
ALTER TABLE mytable CHANGE mydata mydata VARCHAR(128) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL;
Setting MySQL's character sets to UTF-8 will also ensure that it handles string lengths properly, as it will understand that UTF-8 isn't a simple "1 byte = 1 character" encoding. There are a couple of things to note about MySQL's implementation of UTF-8, however. Firstly, MySQL will only handle UTF-8 characters up to three bytes long. This is fine for the vast majority of situations, but can theoretically cause problems. Secondly, because it supports up to three bytes for each character, it will reserve three bytes for every character in a CHAR column if its character set is UTF-8! So you can forget about using CHARs with UTF-8. If you're sure that the data going into a column fits another, non-UTF-8 character set, you can use a CHAR and set the column's character set to that set; otherwise, just use VARCHARs and stop worrying.
Another change to make, ideally, is to tell PHP's htmlspecialchars() function that you're now using UTF-8, so that it doesn't needlessly encode characters that could be sent as UTF-8 without trouble. Just pass 'UTF-8' as the third parameter to that function.
You'll need to use the 'u' (UTF-8) modifier on preg_* calls to make those behave properly, and you should also note that PHP's string-handling functions aren't Unicode-aware in PHP 5. They will be in PHP 6, but in the meantime you should remember that strlen(), for example, will count the bytes used in the string, and not the actual number of characters, so if you have a string containing a number of high-value (anything above #127) UTF-8 characters you will get strlen() reporting that strings are larger than they really are. Use alternatives such as the mb_ functions (e.g. mb_strlen(), or if you're really stuck, pass the string to MySQL to measure (assuming you've already executed the "SET NAMES utf8" statement above):
SELECT LENGTH('My UTF-8 String');
I suspect that more than that may be required, but so far that's all I've found I've needed. I would have thought that you would need to convert user input (the $_POST array, specifically) to UTF-8 before sending it to MySQL, but that doesn't seem necessary. Feel free to leave a comment below if I've missed anything, though; and I'll update this post if I come up with new things that need doing.
I guess that means I now have no excuse not to sort out this site (and Web2Messenger...) to use UTF-8 sometime.
Pages I referenced today, and thus influenced this post (and will no doubt contain more useful information than I've given above):
- Fun with UTF-8, PHP and MySQL | Byteflex
- MySQL AB :: Unicode and Other Funny Characters
- mysql utf8 php - PHPBuilder.com
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) - Joel on Software
- Turning MySQL data in latin1 to utf8 utf-8 - O'Reilly ONLamp Blog
- Unicode Data with PHP 5 and MySQL 4.1
- Unicode Macrons, mainly for Te Reo Maori