Andrew's Stuff

UTF-8 in PHP and MySQL

Posted at Sat, 22 Sep 2007, 22:44:59

I spent the best part of today researching how to properly handle Unicode - specifically UTF-8 - with PHP and MySQL for Web applications. Since this is something that has stumped me for a while, I decided to make this post about it. This will be as much for my own benefit as anyone else's: hopefully I'll remember it for future use if I write about it, and I can always refer to this post at a later date if necessary.

A quick and very basic introduction to UTF-8: whereas the ASCII character encoding that we're all taught about first at school stores characters very simply with one byte per character, UTF-8 uses anything from one to six bytes to store each character, depending on how many bytes are actually needed to store it. This means it can store many more characters than ASCII or any other one-byte-per-character encoding which are limited to a maximum of 256 "characters" (the first 32 ASCII codes are actually unprintable control codes, leaving a mere 224 values for actual characters), which is enough for English (mostly) but not enough for the vast majority of other languages which have accents and all kinds of fancy letters. And that's before we get onto Japanese, Chinese, Korean and the like with thousands of characters each. UTF-16 and UTF-32 also allow many more than 256 characters to be stored, but they take 16 bits and 32 bits, respectively, per character to store data, which is fair enough if you're using a lot of high-numbered characters, but a huge waste of storage space, bandwidth and/or memory if you're primarily using the basic ASCII characters which otherwise fit in a single byte each. UTF-8 gets around this by storing the lower ASCII values (i.e. just 0-127 - the original ASCII standard, rather than any of the extended ASCII sets which filled in the remaining 128-255 range) in a single byte, and using the most significant bit to indicate if another byte is needed to store the character, allowing the number of bytes used by a character to grow if necessary, but only if it actually is necessary. This also has the pleasant side-effect of keeping the majority of the English language characters in the same position as they are in ASCII: only the odd characters like £ are moved around.

On to the details!

The first and, perhaps, most obvious thing that needs to be done is to set the "charset" to UTF-8 on the XHTML/HTML/XML pages in question, both in the Content-Type HTTP header and in a Content-Type <meta> tag (the <?xml version="1.0" encoding="UTF-8"?> tag in an XML document). The meta tag should be the first child of the <head> element, as a browser reaching that element will have to stop processing the page and start again from the top, reading it with the new character set. You need to set both because some browsers read one, while other browsers read the other. There's no harm in setting them both, just remember to keep them the same to avoid confusing yourself later on.

<? header('Content-Type: text/html; charset=utf-8'); ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body></body>
</html>



The second most important thing to do is tell MySQL that you're using UTF-8. This is done by sending an SQL statement as soon after connecting to the MySQL server as possible:
<?
mysql_connect('localhost', 'username', 'password');
mysql_query('SET NAMES utf8');
?>


Note how MySQL uses a non-standard name for UTF-8, with no hyphen. By sending this statement, MySQL knows that it should convert whatever it's using internally to UTF-8 before sending it to the client (PHP in this case), and that it should convert everything it receives from the client from UTF-8 into its internal charater encoding before doing anything with it. In reality, this may mean that it does no conversion at all - which is good, as then we don't waste CPU time converting character sets all over the place. It should even be possible to use UTF-8 in database, table and column names, though it isn't recommended as UTF-8 support in MySQL is still fairly new.


Now, in the paragraph above I mentioned that ideally MySQL will be storing everything in UTF-8 internally to avoid having to do any conversion. In order to achieve this, the columns that the data is stored in must be set to use UTF-8 (character sets can be set per-column in MySQL, remember). The easiest way to do this is by setting the default character set and the collation (the method used to sort result sets) when you create the database:
CREATE DATABASE mydatabase DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;


Unfortunately this isn't always possible (some web hosts give you a single MySQL database and expect you to store all your tables in that one database). As an alternative, therefore, you can set the character set and collation on a table as you create it:
CREATE TABLE mytable (
mydata VARCHAR(128) NOT NULL
) ENGINE=MyISAM DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;


Of course if you've already created the table you can change the default character set and collation for it with an ALTER TABLE statement (which will only affect newly-added columns to the table), or you can change the character set and collation for each individual column:
ALTER TABLE mytable DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;

ALTER TABLE mytable CHANGE mydata mydata VARCHAR(128) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL;


Setting MySQL's character sets to UTF-8 will also ensure that it handles string lengths properly, as it will understand that UTF-8 isn't a simple "1 byte = 1 character" encoding. There are a couple of things to note about MySQL's implementation of UTF-8, however. Firstly, MySQL will only handle UTF-8 characters up to three bytes long. This is fine for the vast majority of situations, but can theoretically cause problems. Secondly, because it supports up to three bytes for each character, it will reserve three bytes for every character in a CHAR column if its character set is UTF-8! So you can forget about using CHARs with UTF-8. If you're sure that the data going into a column fits another, non-UTF-8 character set, you can use a CHAR and set the column's character set to that set; otherwise, just use VARCHARs and stop worrying.


Another change to make, ideally, is to tell PHP's htmlspecialchars() function that you're now using UTF-8, so that it doesn't needlessly encode characters that could be sent as UTF-8 without trouble. Just pass 'UTF-8' as the third parameter to that function.

You'll need to use the 'u' (UTF-8) modifier on preg_* calls to make those behave properly, and you should also note that PHP's string-handling functions aren't Unicode-aware in PHP 5. They will be in PHP 6, but in the meantime you should remember that strlen(), for example, will count the bytes used in the string, and not the actual number of characters, so if you have a string containing a number of high-value (anything above #127) UTF-8 characters you will get strlen() reporting that strings are larger than they really are. Use alternatives such as the mb_ functions (e.g. mb_strlen(), or if you're really stuck, pass the string to MySQL to measure (assuming you've already executed the "SET NAMES utf8" statement above):
SELECT LENGTH('My UTF-8 String');


I suspect that more than that may be required, but so far that's all I've found I've needed. I would have thought that you would need to convert user input (the $_POST array, specifically) to UTF-8 before sending it to MySQL, but that doesn't seem necessary. Feel free to leave a comment below if I've missed anything, though; and I'll update this post if I come up with new things that need doing.

I should also mention, here, that you need at least MySQL 4.1 to do what I've stated above. MySQL 4.0 does not have Unicode support, and so you're apparently left doing all the conversion in your application - which is something I've yet to figure out how to do (heavily Unicode string appear to be completely useless by the time they get to PHP). I'm also using PHP 5.2.1 for my tests, and can't make any promises about earlier versions (especially PHP 4) working as described above. There is also the possibility of Internet Explorer 6 not liking some of this stuff. My brief testing earlier indicated some kind of JavaScript problem: while IE6 loaded and rendered the page "fine" (for IE6's usual value of 'fine' when we're talking about page-rendering...), it seemed to choke on having a UTF-8 string inside a block of JavaScript. That said, I hadn't tested that script in IE6 for a long time, so something else could have broken it for all I know. But yes, that means that UTF-8 strings inside JavaScript work fine, at least in Firefox and IE7!

I guess that means I now have no excuse not to sort out this site (and Web2Messenger...) to use UTF-8 sometime.

Pages I referenced today, and thus influenced this post (and will no doubt contain more useful information than I've given above):

blog comments powered by Disqus

Archived Comments

UTF-8, PHP, SMF by Daniel15 at Apr 10, 2007 4:21:23 PM

Very informative post... I should play around with UTF-8 one day. :)
It definitely seems like complex stuff, though. I wish that things like this were easier...

"I should also mention, here, that you need at least MySQL 4.1 to do what I've stated above. "
That shouldn't really be a problem any more, as people really should be using MySQL 5.0. Yeah, there's some lazy web hosts that haven't upgraded yet, but the majority should have.

"MySQL 4.0 does not have Unicode support, and so you're apparently left doing all the conversion in your application - which is something I've yet to figure out how to do"
I believe that SMF (Simple Machines Forum) *may* do this, but I'm not certain.

What I am sure of is that it has a lot of its own UTF-8 friendly functions, because (like you mentioned), PHP's functions aren't really UTF-8 aware. Take a look the reloadSettings() function in Sources/Load.php, and the Sources/Subs-Charset.php file.

Re: UTF-8, PHP, SMF by Lord d'Eath at May 10, 2007 2:09:36 PM

Things will be slightly easier once we get everyone using PHP 6, since PHP's string handling functions will be natively Unicode-aware then (finally..). It will still require a small amount of effort, but nothing compared to what you have to do at the moment.

Yes, people should be using MySQL 5.0 already, but in reality no one actually is, unfortunately. I've got 5.0 on my local server, but the server hosting this site is still only on 4.1. I can't convince TB to upgrade MySQL on the Web2Messenger/StufPlug server, either, so we're only on 4.1 there, too. Luckily, there aren't any particularly "killer" features in 5.0 that aren't in 4.1. 4.1 introduced subqueries and UTF-8 support, while 5.0 made the inclusion of InnoDB default. That's really about it, I think.

SMF most likely does do its own charset conversion. I believe IPB does, and without mb_string or iconv support you have little choice. My personal opinion is that it's more effort than it's worth. If you want Unicode support, enable the damned PHP extensions and stop using ridiculously outdated software.

Any Help With This UTF-8 Stuff by jigen at Nov 9, 2008 9:20:38 AM

Hi,
First, I already tried everything in your post but seems not working on my end. Here are some details:
I am using XAMPP on windows XP i.e mysql 5.x, PHP 5.2.4 and APache 2.
I used accept-charset="utf-8" inside my form so I can post or write UTF-8 (japanese characters).

Now, my problem is with regards to posting the values from the form, for example: I have these characters inside my form 当社は、グローバルに展開するズー but when I display the $_POST value before inserting it to my database I get this Insert into menu (name) values ('鎤Ꭿ夲㎭㎼妭Ꭻ㎼').
How come the the UTF-8 characters were somehow manipulated along the line?

Please help.

thanks.

Re: Any Help With This UTF-8 Stuff by Lord d'Eath at Dec 9, 2008 12:19:55 PM

I'm afraid that I've really posted everything I know about UTF-8 in PHP/etc.

That said, I can submit Japanese kanji quite happily to sites I've made using the methods detailed in this blog post, so I'm not sure what you're doing differently. I've only actually tested in Firefox, though, so maybe that has something to do with it.

Example: 茜粟娃飴芦案斡逢 (I hope that doesn't mean anything horrible in Japanese... I just copied random characters out of Charmap :P)