Somebody Set Us Up the BOM

By Chris Rittersdorf on 07 08 2011

Last week I was working on localizations for a Ruby on Rails project.
Rails provides a simple way to add internationalization to your
project. But what should have been as simple as adding a couple of locale files to
the project turned into a wild goose chase fraught with misery and heartbreak.

The Danish Translations

I received the Danish translations from a project manager. These were
given to him by a native Danish speaker who had translated the English copy to Danish.
I placed the files in the appropriate directory and then fired
up my Rails server.

When I switched the locale to Danish, the Rails application loaded up
the English version of the application, but with several of the words
highlighted in yellow. The Danish translation file wasn't getting loaded.

Upon inspection I saw that the YAML file still contained the English
translations as well as the Danish ones. For example:

welcome: "Welcome" - "Velkommen"

That's not a valid translation entry, which would explain why things weren't
working. For each entry, the English text and the "-" would have to be
removed so that only the Danish value remained.

This problem was nothing that a few keystrokes in Vim couldn't fix. So I fixed
it. Reloading the Rails application had no effect. The application was still
defaulting to English because it couldn't read the YAML.
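
The cleanup itself was done by hand in Vim, but it could also be scripted. Here is a minimal Ruby sketch, assuming every doubled entry has the shape key: "English" - "Danish" and that the Danish value is the one to keep. The names DOUBLED_ENTRY and keep_danish are made up for illustration.

```ruby
# Matches lines shaped like:  key: "English" - "Danish"
# (assumed shape; real translation files may need a looser pattern)
DOUBLED_ENTRY = /^(\s*[\w.-]+:\s*)"[^"]*"\s+-\s+("[^"]*")[ \t]*$/

# Keep the key and the second (Danish) value; leave other lines untouched.
def keep_danish(line)
  line.sub(DOUBLED_ENTRY, '\1\2')
end

puts keep_danish('welcome: "Welcome" - "Velkommen"')
# prints: welcome: "Velkommen"
```

Lines that don't match the pattern, such as comments or the root node, pass through unchanged.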

Digging Deeper

This YAML was clean and I knew it! I triple-checked it. So I
opened up the 'i18n' gem and started throwing 'debugger' statements all
over the place.

The debug statements let me peek at the internal structure of the locales that
got loaded. The English and the Pirate locales were getting loaded
correctly, but the Danish translation was incorrect.

At the beginning of the Danish YAML file, there were two lines of
comments, followed by a line that contained the root node of the YAML
document. For example:

I would have expected the YAML to get de-serialized like the following:
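
As a rough sketch of that expectation, assuming a hypothetical stand-in for the file (the two comment lines and the welcome entry below are invented for illustration), a healthy parse with Ruby's standard YAML library yields a hash whose only top-level key is the locale code:

```ruby
require "yaml"

# Hypothetical stand-in for da.yml: two comment lines, then the root node.
source = <<~YML
  # Danish translations (made-up comment)
  # Another made-up comment line
  da:
    welcome: "Velkommen"
YML

translations = YAML.load(source)

# The comments are ignored and the locale code is the only top-level key.
puts translations.keys.inspect          # prints: ["da"]
puts translations["da"]["welcome"]      # prints: Velkommen
```

Plain YAML.load returns string keys; the symbol :da mentioned in the text is presumably what the i18n gem shows after it converts keys to symbols.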

But instead of seeing a hash with the first key :da, I saw that everything in the file,
including the first two lines of comments were getting set as the key
for the hash.

That's not correct. I expressed my anguish to my colleagues. Matt Lehman
suggested that I check for a byte order mark. I had no clue what that was.

The Byte Order Mark

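The byte order mark (BOM) is a Unicode character (U+FEFF) that can sit at
the very beginning of a text file. In UTF-16 and UTF-32 it specifies the
endianness of the file. In UTF-8, where byte order is not an issue, it is
only an optional signature, encoded as the three bytes EF BB BF.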
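
To make that concrete, here is a small Ruby sketch that prints the bytes of the BOM character in a few encodings. The helper name hex_bytes is made up for illustration.

```ruby
# Render a string's bytes as space-separated uppercase hex.
def hex_bytes(string)
  string.bytes.map { |byte| format("%02X", byte) }.join(" ")
end

bom = "\uFEFF" # the BOM character, U+FEFF

puts hex_bytes(bom.encode("UTF-8"))    # prints: EF BB BF
puts hex_bytes(bom.encode("UTF-16BE")) # prints: FE FF
puts hex_bytes(bom.encode("UTF-16LE")) # prints: FF FE
```

The two UTF-16 results are mirror images of each other, which is how a reader can tell the byte order. The UTF-8 result is always the same three bytes.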

At a glance the Danish locale file appeared to be clean. There was no
visible character at the beginning, and the YAML was well formed. But I needed
to take a closer look.

I downloaded Hex Fiend to inspect the binary makeup of the file. After
loading the file I noticed a few odd bytes before the file's content:

[Screenshot: da.yml opened in Hex Fiend]

What is that!? That couldn't be right. According to Wikipedia, this odd
string of characters (the bytes EF BB BF) is the BOM for UTF-8 files.
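
The same check can be made without a hex editor. Here is a minimal Ruby sketch that reads the first three bytes of a file and compares them with the UTF-8 BOM. The names UTF8_BOM and utf8_bom? are made up for illustration, and the path in the usage comment is only an assumption about where the locale file lives.

```ruby
# The UTF-8 encoding of U+FEFF, as a binary string.
UTF8_BOM = "\xEF\xBB\xBF".b

# True if the file at +path+ starts with the UTF-8 BOM.
# Empty or very short files simply return false.
def utf8_bom?(path)
  File.binread(path, UTF8_BOM.bytesize) == UTF8_BOM
end

# Example usage (path is an assumption):
#   utf8_bom?("config/locales/da.yml")
```

Reading in binary mode matters here: it compares raw bytes and avoids any encoding handling getting in the way.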

The rest of the YAML files didn't have the BOM. So I removed it from the
Danish translation file. And after that, everything worked great!
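
Removing the BOM can also be scripted. Here is a minimal Ruby sketch that rewrites a file in place without its leading BOM. The method name remove_utf8_bom! is made up for illustration, and since it overwrites the file, it is best tried on a copy first.

```ruby
# Strip a leading UTF-8 BOM from the file at +path+, in place.
# Returns true if a BOM was removed, false if the file had none.
def remove_utf8_bom!(path)
  bom = "\xEF\xBB\xBF".b
  bytes = File.binread(path)
  return false unless bytes.start_with?(bom)

  # Write back everything after the three BOM bytes, untouched.
  File.binwrite(path, bytes.byteslice(bom.bytesize..-1))
  true
end
```

If the file only needs to be read rather than fixed, Ruby can also skip the BOM while reading by opening the file with the mode "r:BOM|UTF-8".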

The Moral of the Story

If your code is trying to load a YAML file and it doesn't parse
correctly, check for the BOM, especially if the file comes from an unknown
source. Many text editors that aren't aimed at programmers add the BOM by
default. MS Notepad is guilty of this. The BOM is optional in UTF-8, and
unless it is required for other reasons, it can safely be removed.
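
One way to act on that advice is to scan a whole directory of YAML files whenever new translations arrive. Here is a minimal Ruby sketch. The method name files_with_bom is made up for illustration, and the directory in the usage comment is only an assumption about the project layout.

```ruby
# Return the YAML files under +dir+ (recursively) that start with a UTF-8 BOM.
def files_with_bom(dir)
  bom = "\xEF\xBB\xBF".b
  Dir.glob(File.join(dir, "**", "*.{yml,yaml}")).sort.select do |path|
    File.binread(path, bom.bytesize) == bom
  end
end

# Example usage (directory is an assumption):
#   files_with_bom("config/locales").each { |path| puts "BOM found: #{path}" }
```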