Fixing Doubly UTF-8 Encoded Text in MySQL

Earlier I wrote about fixing UTF-8 encoded text stored in a latin1 column by converting it back to latin1. What if you want to convert the column to UTF-8 instead? Changing the storage encoding is easy to do, it takes just one ALTER TABLE statement and MySQL converts the column to UTF-8. But afterwards you notice that the text that was encoded in UTF-8 now looks very weird: á has been turned into á, ä into ä… and “you’re” into “you’re”.

This happens because everything in the column was encoded in latin1 as far as MySQL knows: instead of an UTF-8 encoded á it sees the bytes C3 A1, and interprets them as the latin1 characters á. I call this doubly UTF-8 encoded text because the bytes that are stored are now the result of converting á from latin1 to UTF-8 twice: E1→ C3 A1 → C3 83 C2 A1.

Doubly UTF-8 encoded text can be fixed with two calls to the convert(.. using ..)-function: first we convert the bytes from utf8 to latin1, and then reset the encoding marker back to utf8:

mysql> select convert(binary convert('you’re' using latin1) using utf8);

Now that we know the basic mechanism of recovering doubly encoded text we could write a single UPDATE statement to fix the columns that have values with errors:

update mytable set mycolumn = 
    convert(binary convert(mycolumn using latin1) using utf8);

A more difficult case

What if the column has both singly and doubly encoded text? The single update statement above cannot be used because it will cut the text at the first non-ASCII character of singly encoded text. This happens because after the first call to convert the underlaying bytes are not valid as UTF-8 encoded text, and MySQL stops trying to convert when it finds out. It also gives the warning “1300 Invalid utf8 character string.”

Solution 1: We could try to detect the doubly encoded column values by comparing the character-length of the old value of the column with the byte-length of the new value. For doubly encoded text they should be equal. But now we have to do the conversion in the where-clause, which justifies the use of MySQL variables:

update mytable set mycolumn = @txt where char_length(mycolumn) = 
    length(@txt := convert(binary convert(mycolumn using latin1) using utf8));

Solution 2: Another approach is defining a new function that returns the recovered text if the conversion is successful and the original text otherwise. Basically the function would be a wrapper for the expression we use above, but with a check for warnings produced by the second call to convert:

Using this new function we could write the following statement to fix the column values.

update mytable set mycolumn = maybe_utf8_decode(mycolumn);

One Comment

  1. Sasha
    Posted 2013/06/05 at 11:32 | Permalink

    Thank you for this solution!