Monday, June 30, 2008

Bim Bam BOM

We hit a strange bug recently: while trying to parse a perfectly healthy XML file, we got an exception saying:

org.xml.sax.SAXParseException: Document root element is missing.

We could, however, open the XML file in the browser without an error, and our XML editor accepted it without a problem too. Yet in our code, which runs an XSL transformation using Xalan, we got the exception above. The problem was not with the transformation itself, at least according to the exception. The complaint was that the XML document has no root element, even though the first character we could see in the file was an opening angle bracket...

Well, there are things beyond what you see.

To find out the exact stream of bytes that the transformer receives, and doesn't like, we added the simplest debug line, dumping the bytes from the file to the screen as numeric values rather than as chars. Two bytes turned up before the opening angle bracket of the XML: FF FE.
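
For the curious, a byte dump of this kind can be sketched in a few lines of Java. This is not our original debug line, just a minimal illustration; the class name HexDump and the path "input.xml" are made up for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not the debug code from the post.
final class HexDump {

    // Formats up to `limit` leading bytes as two-digit uppercase hex values.
    static String toHex(byte[] bytes, int limit) {
        StringBuilder sb = new StringBuilder();
        int n = Math.min(limit, bytes.length);
        for (int i = 0; i < n; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            // Mask with 0xFF: Java bytes are signed, so 0xFF would otherwise be -1.
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // "input.xml" is a placeholder; pass the real path as the first argument.
        String path = args.length > 0 ? args[0] : "input.xml";
        // Reading the whole file is fine for a small document like ours.
        byte[] bytes = Files.readAllBytes(Paths.get(path));
        System.out.println(toHex(bytes, 16));
    }
}
```

Reading raw bytes (rather than going through a Reader) matters here: a Reader decodes the bytes into chars first, which may hide exactly the bytes we are looking for.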

At this point, my friend and colleague Effie Nadiv (famous for his Hebrew site, and a UI authority and legend), shouted out: it's the BOM! Xalan doesn't recognize the BOM correctly!!

And without further ado he presented the file (same file) in text mode and in binary mode:





See the FF FE at the beginning? This is the BOM.

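BOM stands for Byte Order Mark. It is added at the start of UTF-16 documents to denote the order of the two bytes that make up each 16-bit code unit: FF FE, as in our file, marks little-endian order, while FE FF marks big-endian. UTF-8 documents may also have a BOM (the bytes EF BB BF), but since UTF-8 has no byte-order ambiguity, it is redundant there and says nothing about byte order.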
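
To make the byte-order point concrete, here is a small Java sketch that checks the first bytes of a buffer against the three common BOM signatures. The class and method names (BomSniffer, detect) are invented for the example, and UTF-32 BOMs are deliberately left out to keep it short.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; ignores UTF-32 BOMs on purpose.
final class BomSniffer {

    // Returns the encoding implied by a leading BOM, or "none" if there is no BOM.
    static String detect(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return "none";
    }

    public static void main(String[] args) {
        // The first four bytes of a file like ours: BOM, then '<' as a 16-bit unit.
        byte[] head = {(byte) 0xFF, (byte) 0xFE, 0x3C, 0x00};
        System.out.println(detect(head));

        // The same character '<' in both byte orders: 3C 00 versus 00 3C.
        byte[] le = "<".getBytes(StandardCharsets.UTF_16LE);
        byte[] be = "<".getBytes(StandardCharsets.UTF_16BE);
        System.out.printf("LE: %02X %02X, BE: %02X %02X%n", le[0], le[1], be[0], be[1]);
    }
}
```

Without the BOM (or some other hint), a reader of the raw bytes 3C 00 cannot tell whether they mean '<' in little-endian order or a different character in big-endian order.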

To read more about Unicode, UTF-8, UTF-16 and BOM, you may want to go to:
http://unicode.org/faq/utf_bom.html
http://en.wikipedia.org/wiki/Byte_Order_Mark

But, before you rush to the above, here is a GREAT piece of reading material for understanding, once and for all, the entire encoding and charset thing:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky.
This is a must read!

Specifically, to solve the above problem we changed the doc to UTF-8 without a BOM. Most text editors support conversion between the different Unicode transformation formats, and let the user decide whether to add a BOM to UTF-8 or not (it's probably better not to add one; just turn that option off).
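
We did the conversion in a text editor, but the same thing can be sketched in Java if you need to do it in code. The class name BomStripper and the command-line arguments are made up for the example. It relies on two documented behaviors of the JDK charsets: the "UTF-16" decoder uses a leading BOM to pick the byte order and does not pass it on as a character, and the UTF-8 encoder never writes a BOM.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration; our actual fix was done in a text editor.
final class BomStripper {

    // Decodes UTF-16 bytes (byte order taken from the BOM) and re-encodes as UTF-8.
    static byte[] utf16ToUtf8NoBom(byte[] utf16) {
        String text = new String(utf16, StandardCharsets.UTF_16);
        // Defensive: drop a stray U+FEFF if one survived decoding.
        if (text.startsWith("\uFEFF")) {
            text = text.substring(1);
        }
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: BomStripper <utf16-input> <utf8-output>");
            return;
        }
        byte[] converted = utf16ToUtf8NoBom(Files.readAllBytes(Paths.get(args[0])));
        Files.write(Paths.get(args[1]), converted);
    }
}
```

One caveat: if the XML declaration inside the file says encoding="UTF-16", it has to be updated (or removed) as well, otherwise the declaration will no longer match the actual bytes.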



=========================================
Added 21/7/08:
------------------------------
Just found this old newsgroup entry on the subject...
http://biglist.com/lists/xsl-list/archives/200208/msg01302.html
=========================================

2 comments:

royshil said...

Very, very informative.
In fact this goddarn <feff> (as you see it in 'vi') has been the cause of many hours of pointless debugging.

I also liked the wonderful article by Joel Spolsky.

10x,
Roy.

Amir Kirsh said...

Roy, I expect a relevant cartoon in your blog on the subject!