= Going UTF-8 (utf8) with PHP & MySQL =

UTF-8 is a character encoding standard that supports characters for (nearly) all the languages in the world. Older standards, such as US-ASCII and ISO-8859-1, contain characters only for English (US-ASCII) and Western European languages (ISO-8859-1). There are a lot of good reasons to use UTF-8, especially if your app will (or may eventually need to) support international users.

If you're developing in PHP, there are also a few good reasons not to use UTF-8 right now... at least not until PHP6 comes out. PHP5 doesn't natively support multibyte characters (which UTF-8 uses), so you may have to do some special handling.

If you want to use UTF-8, here is a quick guide to upgrading your LAMP application.

First & foremost, you want to ensure that your entire application stack is using UTF-8. That means you're serving pages to the browser in UTF-8, the browser is sending data back in UTF-8, and you're storing data in your database in UTF-8. If some portion of the app stack //isn't// passing data in UTF-8, then characters in the data stream will be mangled or lost.

== Getting the Browser to use UTF-8 ==

You need to tell the browser that you're sending data as UTF-8, and that it should send data back as UTF-8. To do that, send an HTTP header on every page:

  header('Content-Type: text/html; charset=utf-8');

and also include a ''Content-Type'' meta tag in your actual HTML document:

  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Note that some folks recommend you also add an ''accept-charset'' attribute to your ''<form>'' tags:

  <form accept-charset="utf-8">

but [[http://stackoverflow.com/questions/1317152/am-i-correctly-supporting-utf-8-in-my-php-apps/1317301#1317301|bobince at StackOverflow]] says that IE doesn't support that, so it's not a good idea to use it.

== PHP - Using the mb_* functions ==

PHP4/5 treats every string as a sequence of bytes, rather than a sequence of characters. If one char = one byte, that's fine. A function like ''strlen()'' simply counts the number of bytes in the string and returns that as the number of characters. So ''strlen('ab')'' is **2**, as you'd expect.

The trouble comes with UTF-8; it uses //anywhere between one and four bytes// to represent a single character. Now, one char != one byte... so ''strlen('汉语')'' is **6**, even though there are only two characters there. Each of those Chinese characters takes up 3 bytes.

To combat this, PHP introduced the [[http://us.php.net/manual/en/book.mbstring.php|mbstring extension]], which contains functions to process multibyte strings. For example, ''mb_strlen('汉语', 'utf-8')'' returns the expected length of **2**.

Therefore, the general recommendation is to go through all of your code and replace every standard string function with its ''mb_*'' equivalent. ''strlen()'' becomes ''mb_strlen()'', ''split()'' becomes ''mb_split()'' and so on. Note that you don't have to pass the encoding with each call; you can call ''mb_internal_encoding('UTF-8');'' before calling any string functions to set the encoding that they should all use.

==== Other Options in PHP ====

Replacing all of your string functions is a **ton** of work, so //do you have to//? Yes and no:

PHP includes [[http://us.php.net/manual/en/mbstring.overload.php|a function overloading feature]], which will force PHP to use the ''mb_*'' functions whenever you call their non-mb equivalents. Set ''mbstring.func_overload=7'' in ''php.ini'' to use it.
The problem is that, as of this writing in September 2009, this feature is not well tested and may lead to 'undefined behavior'. The general consensus seems to be not to use it.

You could also just not do anything, and let PHP treat multibyte strings as a sequence of individual bytes. Depending on what you're doing, this may not be as dumb as it sounds: if you only expect to be passing multibyte data back and forth between a webpage and a database, you'll probably be OK. The PHP WACT site has a great summary of what can break if you [[http://www.phpwact.org/php/i18n/utf-8|use standard PHP string functions with multibyte UTF-8 data]].

In practice, the biggest issues will likely be around data validation: for example, if you require a string to be no more than 20 characters and you check that with ''strlen()'', then a string of 7 Chinese characters will come back as invalid (7 chars * 3 bytes = 21 bytes).

==== Regular Expressions & UTF-8 ====

PHP has two families of regular expression functions: the POSIX-compliant ''ereg_*'' functions and the Perl-compatible ''preg_*'' functions. The ''ereg_*'' functions all have multibyte equivalents, named ''mb_ereg_*''. The ''preg_*'' functions //don't//, but you can pass the ''/u'' modifier to force them to treat the pattern and subject as UTF-8. See the [[http://www.regular-expressions.info/php.html|Regular Expressions in PHP]] page for more info.

==== PHP 6 ====

In theory, PHP6 will natively support multibyte strings. It is introducing a new string type: you'll be able to have binary strings (like in PHP4/5), and you'll also be able to have multibyte character-based strings, which will let all of the standard string functions work properly. As of this writing in September 2009, PHP6 is still under development with no firm timeframe for release.

== MySQL ==

There are two things that you need to worry about when dealing with MySQL & UTF-8.
==== Setting the connection charset ====

First, you need to make sure that you set the character set of your connection to ''utf8''. The exact mechanics depend on the connection method you're using; here are a few of the common ones:

  mysql_set_charset('utf8');                               // mysql extension
  mysqli_set_charset($link, 'utf8');                       // mysqli extension; $link is your own mysqli connection
  $dbAdapterMySQLi->getConnection()->set_charset("utf8");  // Zend DB MySQLi

**NOTE:** it is **VERY** important that you use the built-in ''set_charset()'' call to change the character set. Many sites recommend simply sending the query "''SET NAMES utf8''" to the database. The problem with that is that the client library is never told that the connection charset changed, which means ''mysql_real_escape_string()'' will keep escaping data using the default ''latin1'' character set. That could open your app up to weird behavior and possibly an SQL-injection vulnerability; see this [[http://stackoverflow.com/questions/1317152/am-i-correctly-supporting-utf-8-in-my-php-apps/1317239#1317239|StackOverflow post for more details]].

==== Creating utf8 tables ====

Second, you need to make sure your database and database tables are using the ''utf8'' charset. The easiest way to do this is to specify the charset when you create the database and tables:

  CREATE DATABASE `my_db` CHARACTER SET = utf8 COLLATE = utf8_general_ci;
  CREATE TABLE `my_table` ([table spec]) CHARACTER SET = utf8 COLLATE = utf8_general_ci;

==== Converting from latin1 to utf8 ====

What if you already have an existing application, and it uses a database with a ''latin1'' charset (ISO 8859-1)? You'll have to convert your database to use utf8. The only way I've found to //reliably// do this is outlined below; unfortunately it requires taking the DB offline during the conversion. There may be better ways, but nothing I've tried has worked as well as this.
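To make it concrete why the dump itself has to be re-encoded, and not just relabeled, here is a minimal sketch of what ''iconv'' does to a single character. The scratch file names are made up for illustration, and it assumes a POSIX shell with ''iconv'', ''wc'' and ''od'' available:

```shell
# Write the single byte 0xE9 (octal 351), which is "é" in ISO-8859-1.
# latin1-sample.txt and utf8-sample.txt are hypothetical scratch files.
printf '\351' > latin1-sample.txt

# Re-encode it as UTF-8, the same kind of conversion applied to the dump file.
iconv -f iso-8859-1 -t utf-8 latin1-sample.txt > utf8-sample.txt

# Compare the sizes in bytes: the same character now takes two bytes.
wc -c < latin1-sample.txt
wc -c < utf8-sample.txt

# Show the UTF-8 bytes in hex (0xC3 0xA9).
od -An -tx1 utf8-sample.txt
```

Only characters outside plain ASCII change like this, which is why an English-only database can appear to survive a charset switch untouched while accented text can come out mangled.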
**As always, make yourself about six backup copies first, just in case!**

# Dump the database:
  mysqldump --default-character-set=latin1 -u root -p my_db > my_db.sql
# Delete (drop) the database from the DB server.
# Use ''iconv'' to convert any ''latin1'' (''iso-8859-1'') characters to ''utf8'' characters:
  iconv -f iso-8859-1 -t utf8 my_db.sql > my_db-utf8.sql
# Use ''sed'' to replace every mention of the ''latin1'' character set with the ''utf8'' character set (the ''g'' flag catches lines that mention it more than once):
  sed 's/latin1/utf8/g' < my_db-utf8.sql > my_db-utf8-final.sql
# Create a new database with the proper UTF-8 character set and collation:
  CREATE DATABASE `my_db` CHARACTER SET = utf8 COLLATE = utf8_general_ci;
# Reload your data into the new database:
  mysql -u root -p my_db < my_db-utf8-final.sql

**UPDATE: 6 April 2010** This technique does not appear to properly translate MySQL BLOB (binary) data. ''mysqldump'' dumps the blob as a binary string, which is represented in the dumpfile as a series of characters with a latin1 encoding. When ''iconv'' comes through, it //changes// the bytes of those characters into their UTF-8 encoding, which corrupts the underlying binary data. The only solution I've found so far is to manually copy the binary data afterwards from the old DB to the new DB, using a multi-database ''UPDATE'' query like this:

  UPDATE new_db_utf8.mytable
  SET new_db_utf8.mytable.blob_col =
    (SELECT old_db_latin1.mytable.blob_col
     FROM old_db_latin1.mytable
     WHERE old_db_latin1.mytable.id = new_db_utf8.mytable.id)

Obviously you'll have to change that query for your specific database schema. If you have a better way to do this, please post in the comments.

That should be it. Your database is now fully utf8, after a (hopefully) short bit of downtime.

== Recommended Reading ==

There are many good articles about using UTF-8 in the LAMP stack. Here are a few:

* [[http://ferdychristant.com/blog/articles/DOMM-7LDBXK|Building Unicode LAMP applications]]
* [[http://climbtothestars.org/archives/2004/07/18/converting-mysql-database-contents-to-utf-8/|Converting MySQL Database Contents to UTF-8]]
* [[http://akrabat.com/2009/03/18/utf8-php-and-mysql/|UTF8, PHP & MySQL]]