Last Updated: 14 Oct 2023
backend-tech:going-utf-8-utf8-with-php-and-mysql [Apr 6, 2010 07:50 PM] dordal |
backend-tech:going-utf-8-utf8-with-php-and-mysql [Oct 13, 2023 09:47 AM] 111.225.148.109 removed |
||
---|---|---|---|
Line 1: | Line 1: | ||
+ | = Going UTF-8 (utf8) with PHP & MySQL = | ||
+ | |||
+ | UTF-8 is a character encoding standard that supports characters for (nearly) all the languages in the world. Older standards, such as US-ASCII and ISO-8859-1, contain only characters for English (US-ASCII) and Western European languages (ISO-8859-1). | ||
+ | |||
+ | There are a lot of good reasons to use UTF-8, especially if your app will (or may eventually need to) support international users. If you're developing in PHP, there' | ||
+ | |||
+ | If you want to use UTF-8, here is a quick guide to upgrading your LAMP application. First & foremost, you want to ensure that your entire application stack is using UTF-8. That means you're serving pages to the browser in UTF-8, the browser is sending data back in UTF-8, and you're storing data in your database in UTF-8. If some portion of the app stack // | ||
+ | |||
+ | == Getting the Browser to use UTF-8 == | ||
+ | |||
+ | You'll want to make sure to tell the browser that you're sending data as UTF-8, and that it should send data back as UTF-8. To do that, you should put in an HTTP header on every page: | ||
+ | <code php> | ||
+ | header(' | ||
+ | </ | ||
+ | and also include a '' | ||
+ | <code html> | ||
+ | <meta http-equiv=" | ||
+ | </ | ||
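As a rough sketch of the header step described above, the snippet below builds and sends a `Content-Type` header that declares UTF-8. The helper name `utf8ContentType()` and the `text/html` media type are assumptions made for illustration; they are not taken from the original page.

```php
<?php
// Hypothetical helper (not from the original article): builds the
// Content-Type header line that declares UTF-8 for a page.
function utf8ContentType(string $mediaType = 'text/html'): string
{
    return 'Content-Type: ' . $mediaType . '; charset=utf-8';
}

// Send the header before any output. headers_sent() guards against the
// "headers already sent" warning if output has already started.
if (!headers_sent()) {
    header(utf8ContentType());
}
```

Keeping the header string in one place makes it easier to send the same charset declaration from every page.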
+ | |||
+ | Note that some folks recommend that you also add an ' | ||
+ | <code html> | ||
+ | <form action=" | ||
+ | </ | ||
+ | but [[http:// | ||
+ | |||
+ | == PHP - Using the mb_* functions == | ||
+ | |||
+ | PHP4/5 treats every string as a sequence of bytes, rather than a sequence of characters. If one char = one byte, that's fine. A function like '' | ||
+ | |||
+ | The trouble comes with UTF-8; it uses //anywhere between one and four bytes// to represent a single character. Now, one char != one byte... so '' | ||
+ | |||
+ | To combat this, PHP introduced the [[http:// | ||
+ | |||
+ | Therefore, the general recommendation is to go through all of your code and replace any standard string function with the '' | ||
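To make the byte-versus-character difference concrete, here is a minimal sketch comparing a few standard string functions with their `mb_*` counterparts. It assumes the mbstring extension is loaded; the sample word and variable names are illustrative only.

```php
<?php
// "café" written with explicit byte escapes, so the example does not
// depend on the encoding of this source file: é is 0xC3 0xA9 in UTF-8.
$word = "caf\xC3\xA9";

$byteLength = strlen($word);              // counts bytes
$charLength = mb_strlen($word, 'UTF-8');  // counts characters

// substr() cuts through the middle of the two-byte é; mb_substr() does not.
$byteCut = substr($word, 0, 4);
$charCut = mb_substr($word, 0, 4, 'UTF-8');

// mb_strtoupper() knows that é uppercases to É (0xC3 0x89).
$upper = mb_strtoupper($word, 'UTF-8');

printf("bytes=%d chars=%d\n", $byteLength, $charLength);
```

Passing the encoding explicitly to each `mb_*` call avoids relying on whatever internal encoding the server happens to be configured with.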
+ | |||
+ | ==== Other Options in PHP ==== | ||
+ | |||
+ | Replacing all of your string functions is a **ton** of work, so //do you have to//? Yes and no: | ||
+ | |||
+ | PHP includes [[http:// | ||
+ | |||
+ | You could also just not do anything, and let PHP treat multibyte strings as a sequence of individual bytes. Depending on what you're doing, this may not be as dumb as it sounds: if you only expect to be passing multibyte data back and forth between a webpage and a database, you'll probably be OK. The PHP WACT site has a great summary on what can break if you [[http:// | ||
+ | |||
+ | ==== Regular Expressions & UTF-8 ==== | ||
+ | |||
+ | PHP has two types of regular expressions: | ||
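For the PCRE (`preg_*`) functions, the `u` pattern modifier is what makes a pattern treat its subject as UTF-8. The sketch below is only an illustration under that assumption; the sample strings and variable names are made up.

```php
<?php
$char = "\xC3\xA9"; // é encoded as UTF-8 (two bytes)

// Without /u the pattern sees two separate bytes, so "exactly one
// character" does not match.
$matchesAsBytes = preg_match('/^.$/', $char);

// With /u the pattern and subject are treated as UTF-8.
$matchesAsUtf8 = preg_match('/^.$/u', $char);

// Splitting on the empty pattern with /u yields whole characters.
// "naïve": ï is 0xC3 0xAF in UTF-8.
$chars = preg_split('//u', "na\xC3\xAFve", -1, PREG_SPLIT_NO_EMPTY);
```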
+ | |||
+ | ==== PHP 6 ==== | ||
+ | In theory, PHP6 will natively support multibyte strings. It is introducing a new string type: you'll be able to have binary strings (like in PHP4/5), and you'll also be able to have multibyte character based strings, which will let all of the standard string functions work properly. As of this writing in September 2009, PHP6 is still under development with no firm timeframe for release. | ||
+ | |||
+ | == MySQL == | ||
+ | |||
+ | There are two things that you need to worry about when dealing with MySQL & UTF-8. | ||
+ | |||
+ | ==== Setting the connection charset ==== | ||
+ | |||
+ | First, you need to make sure that you set the character set of your connection to be '' | ||
+ | <code php> | ||
+ | mysql_set_charset(' | ||
+ | mysqli_set_charset(' | ||
+ | $dbAdapterMySQLi-> | ||
+ | </ | ||
+ | **NOTE:** it is **VERY** important that you use the built-in '' | ||
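A minimal sketch of setting the connection charset through the mysqli object API, with an explicit failure check. The helper name `useUtf8Connection()` is hypothetical, and the credentials in the usage comment are placeholders.

```php
<?php
// Hypothetical helper: $db is expected to be a mysqli connection. The
// parameter is left untyped only so the sketch can be tried without a
// live MySQL server.
function useUtf8Connection($db, string $charset = 'utf8'): void
{
    // mysqli::set_charset() returns false on failure (or throws,
    // depending on the mysqli error-reporting mode).
    if (!$db->set_charset($charset)) {
        throw new RuntimeException("Could not set connection charset to {$charset}");
    }
}

// Example usage (placeholder credentials):
// $db = new mysqli('localhost', 'user', 'password', 'my_db');
// useUtf8Connection($db);
```

Calling this once, right after connecting and before running any queries, keeps the charset setup in a single place.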
+ | |||
+ | ==== Creating utf8 tables ==== | ||
+ | |||
+ | Second, you need to make sure your database and database tables are using the '' | ||
+ | <code sql> | ||
+ | CREATE DATABASE `my_db` CHARACTER SET = utf8 COLLATE = utf8_general_ci; | ||
+ | CREATE TABLE `my_table` ([table spec]) CHARACTER SET = utf8 COLLATE = utf8_general_ci; | ||
+ | </ | ||
+ | |||
+ | ==== Converting from latin1 to utf8 ==== | ||
+ | |||
+ | What if you already have an existing application, | ||
+ | |||
+ | |||
+ | # Dump the database:< | ||
+ | mysqldump --default_character_set=latin1 -u root -p my_db > my_db.sql | ||
+ | </ | ||
+ | # Delete (drop) the database from the DB server. | ||
+ | # Use '' | ||
+ | iconv -f iso-8859-1 -t utf8 my_db.sql > my_db-utf8.sql | ||
+ | </ | ||
+ | # Use '' | ||
+ | sed s/ | ||
+ | </ | ||
+ | # Create a new database with the proper UTF-8 character set and collation:< | ||
+ | CREATE DATABASE `my_db` CHARACTER SET = utf8 COLLATE = utf8_general_ci; | ||
+ | </ | ||
+ | # Reload your data into the new database:< | ||
+ | mysql -u root -p my_db < my_db-utf8-final.sql | ||
+ | </ | ||
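To see what the `iconv` step in the list above does to the data, here is a small sketch using PHP's own `iconv()` function on a single string. It assumes the iconv and mbstring extensions are available; the sample word is illustrative.

```php
<?php
// "café" in ISO-8859-1: é is the single byte 0xE9.
$latin1 = "caf\xE9";

// Same conversion that the command-line iconv step applies to the dump.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

// The latin1 bytes are not valid UTF-8; the converted string is.
$wasValidUtf8 = mb_check_encoding($latin1, 'UTF-8');
$isValidUtf8  = mb_check_encoding($utf8, 'UTF-8');

printf("%d bytes before, %d bytes after\n", strlen($latin1), strlen($utf8));
```

The accented character grows from one byte to two, which is why the dump file is usually slightly larger after conversion.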
+ | |||
+ | **UPDATE: 6 April 2010** This technique does not appear to properly translate MySQL BLOB (binary) data. '' | ||
+ | |||
+ | The only solution I've found so far is to manually copy the binary data afterwards from the old DB to the new DB, using a multi-database '' | ||
+ | |||
+ | <code sql> | ||
+ | UPDATE new_db_utf8.mytable SET | ||
+ | new_db_utf8.mytable.blob_col = (SELECT old_db_latin1.mytable.blob_col FROM old_db_latin1.mytable WHERE old_db_latin1.mytable.id = new_db_utf8.mytable.id) | ||
+ | </ | ||
+ | |||
+ | Obviously you'll have to change that query for your specific database schema. If you have a better way to do this, please post in the comments. | ||
+ | |||
+ | That should be it. Your database is now fully utf8, after a (hopefully) short bit of downtime. | ||
+ | |||
+ | == Recommended Reading == | ||
+ | |||
+ | There are many good articles about using UTF-8 in the LAMP stack. Here are a few: | ||
+ | * [[http:// | ||
+ | * [[http:// | ||
+ | * [[http:// | ||
+ | |||
+ | |||
+ | |||