Fixing mangled characters in MySQL

Some 30,000 UTF-8 encoded posts in a popular Spanish forum had been stored in a column defined as latin1, with the usual consequences: accented characters like á é ó ñ got mangled into Ã¡ Ã© Ã³ Ã±, also known as mojibake. To complicate matters, the more recent content was properly encoded in latin1, so a bulk conversion wouldn’t work. I thought we might as well get this issue fixed, since we’re planning to upgrade from phpBB2 to phpBB3 shortly anyway.

It turns out that the least painful solution has to do with the convert(… using …)-function, available since MySQL 4.1. You just have to know what you’re doing.

The MySQL documentation has a page on column character set conversions, but it’s missing practical examples, in particular on what to do when you have a mix of encodings in one column. People suggest different techniques to solve this problem, but sadly I find the writing lacking in clarity. One of them actually suggests that the storage encoding of a text column is determined by the collation! (Collations only define how strings are ordered, which depends on the language you speak. For example, in Finnish and Swedish Ä is a letter in its own right and is ordered after Z, while in English it is equivalent to A.)
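
To see that a collation only affects comparison and ordering, not what gets stored, you can compare the same utf8 string under two collations. A quick check, assuming a UTF-8 client session:

-- same bytes either way, but the ordering changes with the collation
select hex("Ä"), "Ä" > "Z" collate utf8_swedish_ci, "Ä" > "Z" collate utf8_general_ci;
-- C384, 1 (Ä sorts after Z in Swedish), 0 (Ä treated like A by the accent-folding general collation)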

My guess is that what people find most confusing are the implicit conversions. For instance, when you insert data into a text column, it is silently converted to the column’s encoding. So far the clearest guide to MySQL charsets and collations I’ve seen is by Shlomi Noach.
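
To make that concrete, here is the implicit conversion at work. A minimal sketch, assuming a UTF-8 client session; the charset_demo table is made up for illustration (charset shows a string’s encoding and hex its bytes, more on those below):

create table charset_demo (s varchar(10)) character set latin1;
insert into charset_demo values ("ñ");  -- the literal arrives from the client as utf8 bytes C3 B1
select charset(s), hex(s) from charset_demo;  -- latin1, F1: silently converted on insert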

Many people recommend migrating everything to UTF-8, but I decided to keep latin1. The site is Spanish-language only, so latin1 is enough. Remember that what MySQL calls “latin1” is not ISO-8859-1 but actually windows-1252, so we’re not losing the euro character, curly quotes etc.
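
You can check this directly: converting the euro sign to MySQL’s latin1 yields the single windows-1252 byte 0x80 instead of failing. Again assuming a UTF-8 client session:

-- windows-1252 puts the euro sign at 0x80; real ISO-8859-1 has no euro at all
select hex(convert("€" using latin1));  -- 80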

How Charsets Work in MySQL

Non-binary MySQL strings are basically octet strings with an encoding marker. The encoding specifies how bytes are decoded into characters. Each string can have an encoding of its own; there is no single “working encoding” like in Java for instance. You can check the encoding used for a string with the CHARSET-function, and see the byte values with the HEX-function:

mysql> select "ñ", charset("ñ") as cs, hex("ñ") as hex;
+----+------+------+
| ñ  | cs   | hex  |
+----+------+------+
| ñ  | utf8 | C3B1 |
+----+------+------+

The convert(… using …)-function changes the encoding marker and the underlying bytes so that the new string represents the same characters as the old one:

mysql> set @c = convert("ñ" using latin1);
mysql> select @c, charset(@c) as cs, hex(@c) as hex;
+------+--------+------+
| @c   | cs     | hex  |
+------+--------+------+
| ñ    | latin1 | F1   |
+------+--------+------+

(Here you see one of those implicit conversions: the latin1 string is converted back to UTF-8 on its way to the client so that my UTF-8 terminal can show it correctly.)

What about binary strings? They are effectively octet sequences without a charset marker, so the convert(… using …) function simply slaps a charset marker on the byte string. The same thing happens when you store a binary string in a non-binary column. Therefore, if you start with a mangled latin1 string like “niÃ±o”, you can easily correct it with convert(binary … using utf8):

mysql> set @c = convert("niÃ±o" using latin1);
mysql> select convert(binary @c using utf8);
+-------------------------------+
| convert(binary @c using utf8) |
+-------------------------------+
| niño                          |
+-------------------------------+

(Here I first convert it to latin1 because my client encoding is UTF-8. With a latin1 text column I wouldn’t have to do this extra step.)
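
You can also verify that going through binary really does leave the bytes alone and only swaps the marker (same UTF-8 client session):

-- "ñ" arrives from the client as utf8 bytes C3 B1; binary drops the marker, convert puts a new one on
select hex(binary "ñ"), charset(binary "ñ"),
       hex(convert(binary "ñ" using utf8)), charset(convert(binary "ñ" using utf8));
-- C3B1, binary, C3B1, utf8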

You could omit the explicit call to convert if you were updating or inserting into a UTF-8 column, but this is not the case here.
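
For example, if the mangled data were being copied into a UTF-8 table instead, selecting the binary value would be enough, because the implicit conversion marks the bytes as utf8 without touching them. A sketch, with a hypothetical posts_utf8 table:

-- posts_utf8 is a made-up table whose post_text column is declared utf8
insert into posts_utf8 (post_text)
select binary post_text from posts;
-- no explicit convert(): storing the binary string just slaps the utf8 marker on it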

The Solution

If all posts suffered from this problem, it would be easy to do a bulk update to recode all the text with SQL like this:

-- convert ALL posts from utf8 back to latin1
update posts set post_text = convert(binary post_text using utf8);

But remember that only some posts are encoded in UTF-8 while others are in latin1, so this doesn’t work: the latin1 posts contain byte sequences that are not valid UTF-8, and for those posts MySQL gives you only a piece of the converted data plus the warning “1300 Invalid utf8 character string.” I could try detecting the posts that have mangled text by comparing the original and converted lengths:

-- convert ONLY utf8 posts back to latin1
update posts set post_text = @txt where length(post_text) =
    length(@txt := convert(binary post_text using utf8));
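
The trick relies on the fact that converting valid UTF-8 bytes leaves the byte length unchanged, while an invalid byte cuts the conversion short. A quick way to see both cases, simulating the two kinds of posts in a UTF-8 client session:

-- @mangled simulates utf8 bytes stored as latin1, @clean a properly encoded latin1 post
set @mangled = convert("niÃ±o" using latin1);  -- bytes 6E 69 C3 B1 6F
set @clean = convert("niño" using latin1);     -- bytes 6E 69 F1 6F
select length(@mangled), length(convert(binary @mangled using utf8));  -- equal: valid utf8
select length(@clean), length(convert(binary @clean using utf8));      -- shorter: truncated at the invalid byte, warning 1300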

But no. In fact, there were forum posts that had mangled UTF-8 characters mixed with regular latin1. If you try to run “convert” on those you again get just a piece of the converted data and the warning “1300 Invalid utf8 character string,” so the length check skips them and the mangled parts stay mangled. So in the end I was reduced to running the following replace, once for each of the 128 byte values from 128 to 255:

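-- replace the utf8 encoding of latin1 byte 0xF1 (the two bytes C3 B1) with the single byte F1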
update posts set post_text = replace(binary post_text,
    convert(char(0xF1 using latin1) using utf8), char(0xF1));

Wait, what? Why so complicated? Well, first you have to strip post_text of its encoding marker so that replace works on bytes, not on characters. Second, char produces binary strings, which can be converted to UTF-8 by going through latin1:

mysql> select hex(convert(char(0xF1 using latin1) using utf8));
+--------------------------------------------------+
| hex(convert(char(0xF1 using latin1) using utf8)) |
+--------------------------------------------------+
| C3B1                                             |
+--------------------------------------------------+

Unfortunately you can’t use char(0xF1 using utf8) because char interprets its argument as a list of bytes to be encoded as a string, and F1 is not valid UTF-8.
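
By the way, rather than typing that update 128 times by hand, you can let MySQL generate the statements and paste the output back into the client. A sketch, where numbers is a made-up helper table holding the integers 128 to 255:

-- produce one update statement per high byte value
select concat("update posts set post_text = replace(binary post_text, ",
              "convert(char(", n, " using latin1) using utf8), char(", n, "));")
from numbers where n between 128 and 255;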

So there. I hope this helps somebody.
