Java and File Names With Invalid UTF-8

On Unix systems, Linux and OS X included, file names can be arbitrary binary data with very few limitations. To make sense of a name, some character encoding has to be assumed. UTF-8 has become the default encoding on many systems, but sometimes you have to deal with files that originate from older systems and have names in other encodings. These files are a problem for Java programs because java.io treats file names as strings of Unicode characters rather than bytes, and is unable to open files with incorrectly encoded names.

Example

This Java program lists the files in the current directory and prints whether each one exists. It demonstrates that when Java encounters a file with a problematic name, listFiles does return it, but any further operation on the file fails.

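import java.io.File;

class Ls {
    public static void main(String[] args) {
        File d = new File(".");
        File[] files = d.listFiles();
        if (files == null) {
            // listFiles returns null if the directory cannot be read.
            System.err.println("Cannot list the current directory");
            return;
        }
        // Print each name and whether Java can find the file again by that name.
        for (File f : files) {
            System.out.printf("%s: %b%n", f.getName(), f.exists());
        }
    }
}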

For example, when it encounters a file with a name encoded in latin1, this is what happens:

$ ls -b
ni\361o
$ java Ls
ni�: false

You can download Ls.java with an example file here.
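A likely explanation for this behaviour is that the file name does not survive a round trip through a String. The sketch below does not touch the file system at all; it only decodes and re-encodes the four bytes from the listing above. The class and method names are made up for this illustration, and the assumption that java.io re-encodes the name in this way on every operation is mine, not something the JDK documentation spells out.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RoundTrip {
    // Raw bytes of the example file name: 'n', 'i', 0xF1 (octal 361), 'o'.
    static final byte[] RAW_NAME = {'n', 'i', (byte) 0xF1, 'o'};

    // Decode as UTF-8, then encode back. The invalid byte is replaced by
    // U+FFFD while decoding, so the original byte is lost.
    static byte[] roundTripUtf8(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    // Same round trip through ISO-8859-1 (latin1), where every byte value
    // maps to exactly one character and back.
    static byte[] roundTripLatin1(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.ISO_8859_1);
        return decoded.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println("UTF-8 keeps the bytes:  "
                + Arrays.equals(RAW_NAME, roundTripUtf8(RAW_NAME)));
        System.out.println("latin1 keeps the bytes: "
                + Arrays.equals(RAW_NAME, roundTripLatin1(RAW_NAME)));
    }
}
```

If the bytes that come back differ from the bytes on disk, the operating system is asked about a file that does not exist, which would match the "false" printed by Ls.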

Setting the default character encoding

You probably know that Java uses a “default character encoding” to convert binary data to Strings. To read or write text using another encoding you can use an InputStreamReader or OutputStreamWriter. But for data-to-text conversions deep in the API you have no choice but to change the default encoding.
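For comparison, here is a minimal sketch of the explicit route for text data, where an InputStreamReader is given the encoding directly. The class and method names are invented for the example, and the input is an in-memory byte array rather than a real file.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharset {
    // Read the first line of the data using the given encoding,
    // ignoring the JVM's default character encoding.
    static String readFirstLine(byte[] data, Charset charset) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), charset))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // "niño" followed by a newline, encoded in latin1.
        byte[] latin1Text = {'n', 'i', (byte) 0xF1, 'o', '\n'};
        String line = readFirstLine(latin1Text, StandardCharsets.ISO_8859_1);
        // Compare against the expected string instead of printing it, so the
        // result does not depend on the terminal's encoding.
        System.out.println(line.equals("ni\u00f1o"));
    }
}
```

No such parameter exists for the file name conversions inside java.io.File, which is why the rest of this post resorts to the default encoding.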

Java reads the default character encoding from the system language settings. On Unix this means the LANG and LC_CTYPE environment variables; changing one of these is sufficient. For example, to make Java use latin1 you could start the JVM with the following command:

$ LANG=en_US.iso88591 java Ls
ni�o: true

Or, if you want all programs you start from the terminal to use this locale:

export LANG=en_US.iso88591

The locale en_US.iso88591 has to be installed on the system for these to work, though. You can use the following command to list locales that are available on your system.

locale -a

Defining and installing a new locale

If you don’t have a locale with the appropriate encoding installed, you can define and install a new one with the localedef program. For example, to create a locale with the Windows Western character encoding you could use the following command.

sudo localedef -f CP1252 -i en_US en_US.cp1252

Under this locale Java would correctly process files whose names are encoded in CP1252, including names that contain curly quotes or the euro character €.
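To see why a latin1 locale is not enough for such names, the sketch below decodes the same bytes with both encodings. It assumes the JVM ships the "windows-1252" charset, which is not among the charsets every Java platform must support but is present in common OpenJDK builds. The class name and the sample bytes are made up for the illustration.

```java
import java.nio.charset.Charset;

class Cp1252Demo {
    // In CP1252, 0x93 and 0x94 are curly double quotes and 0x80 is the euro sign.
    static final byte[] NAME = {(byte) 0x93, 'a', (byte) 0x94, ' ', (byte) 0x80};

    // Decode the raw bytes with the named charset.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String cp1252 = decode(NAME, "windows-1252");
        String latin1 = decode(NAME, "ISO-8859-1");
        // CP1252 yields the curly quotes and the euro sign.
        System.out.println(cp1252.equals("\u201ca\u201d \u20ac"));
        // latin1 maps the same bytes to control characters instead,
        // so the two decoded strings differ.
        System.out.println(cp1252.equals(latin1));
    }
}
```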

What about file.encoding?

The file.encoding system property can also be used to set the default character encoding that Java uses for I/O. Unfortunately it seems to have no effect on how file names are decoded into Strings.
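To check what a particular JVM has picked up, a small diagnostic along these lines can help. Besides file.encoding it prints sun.jnu.encoding, a JDK-internal property found in OpenJDK-derived JVMs that is commonly reported to control how file names are converted; treat that as an assumption, since the property is undocumented and may be missing elsewhere. The class name is made up for the example.

```java
import java.nio.charset.Charset;

class ShowEncodings {
    // Collect the encoding-related settings of the running JVM.
    static String describe() {
        return String.format(
                "default charset:  %s%nfile.encoding:    %s%nsun.jnu.encoding: %s",
                Charset.defaultCharset().name(),
                System.getProperty("file.encoding", "(not set)"),
                // JDK-internal property; may be absent on some JVMs.
                System.getProperty("sun.jnu.encoding", "(not set)"));
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it once with a different LANG and once with -Dfile.encoding on the command line shows which of the values each setting actually changes.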
