Monday, June 30, 2008

Bim Bam BOM

We had this strange bug recently: while trying to parse a perfectly healthy XML file, we got an exception saying:

org.xml.sax.SAXParseException: Document root element is missing.

We could, however, open the XML file in the browser without an error, and our XML editor accepted it without a problem. Yet our code, which ran an XSL transformation using Xalan, threw the exception above. According to the exception, the problem was not with the transformation itself but with the XML document, which supposedly started without a root element, even though the first character visible in the file was an opening angle bracket...

Well, there are things beyond what you see.

To find out the exact stream of bytes that the transformer received, and didn't like, we added the simplest debug line: dumping the bytes of the file to the screen as numeric values rather than as characters. Two bytes turned up before the opening angle bracket of the XML: FF FE.
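The debug line itself is not shown here. A minimal sketch of such a byte dump, assuming Java 9 or later (for `InputStream.readNBytes`), might look like the following. The class name `HexPeek` and its method names are hypothetical, chosen only for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper, not the original debug code from this post.
class HexPeek {
    /** Formats bytes as upper-case hex pairs separated by spaces. */
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) sb.append(' ');
            // Mask with 0xFF so negative byte values print as two hex digits.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Reads up to `count` bytes from the start of the file and returns them as hex. */
    static String firstBytesAsHex(Path file, int count) throws IOException {
        byte[] head = new byte[count];
        try (InputStream in = Files.newInputStream(file)) {
            int read = in.readNBytes(head, 0, count); // may be fewer than count for short files
            return toHex(Arrays.copyOf(head, read));
        }
    }
}
```

For a UTF-16 little-endian file that starts with a BOM, calling `HexPeek.firstBytesAsHex(Paths.get("input.xml"), 4)` (the file name is a placeholder) would be expected to give `FF FE 3C 00`: the two BOM bytes followed by `<` encoded in two bytes.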

At this point, my friend and colleague Effie Nadiv (famous for his Hebrew site, and a UI authority and legend), shouted out: it's the BOM! Xalan doesn't recognize the BOM correctly!!

And without further ado he presented the file (same file) in text mode and in binary mode:





See the FF FE at the beginning? This is the BOM.

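BOM stands for Byte Order Mark. It is added at the start of UTF-16 documents to denote the order of the two bytes that make up each 16-bit code unit: FF FE marks little-endian order and FE FF marks big-endian order. UTF-8 documents may also carry a BOM (the bytes EF BB BF), but there it is redundant and carries no byte-order meaning.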
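To make the byte patterns concrete, here is a small sketch that checks the first bytes of a file against the BOM signatures listed in the Unicode FAQ. The class `BomSniffer` and its methods are hypothetical names for illustration; this is not how Xalan or any particular parser does its detection.

```java
// Hypothetical helper for illustration; not taken from Xalan or any XML parser.
class BomSniffer {
    /**
     * Returns the encoding implied by a leading BOM,
     * or null if the data does not start with one of the BOMs checked here.
     */
    static String detect(byte[] head) {
        // The UTF-32 patterns are tested first because the UTF-32LE BOM
        // (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            // Compare as unsigned values, since Java bytes are signed.
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

Given the bytes from our file (FF FE followed by `<` as 3C 00), `detect` returns "UTF-16LE", which matches what the binary view showed.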

To read more about Unicode, UTF-8, UTF-16 and BOM, you may want to go to:
http://unicode.org/faq/utf_bom.html
http://en.wikipedia.org/wiki/Byte_Order_Mark

But before you rush to the above, here is a GREAT piece of reading that explains the entire encoding and charset thing once and for all:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky.
This is a must read!

To solve this specific problem, we changed the document to UTF-8 without a BOM. Most text editors support conversion between the different Unicode transformation formats and let the user decide whether to add a BOM to UTF-8 (it's probably better not to add one, so just turn that option off).
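We did the conversion in a text editor, but the same thing can be done in code. The sketch below is an alternative under stated assumptions, not what we actually ran: it assumes the input is UTF-16 with a BOM, and `BomStripper` is a hypothetical name. It relies on Java's standard "UTF-16" charset, which uses a leading BOM to pick the byte order when decoding and does not pass the BOM on as a character, and on the fact that Java's UTF-8 encoder does not write a BOM.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; assumes the input really is UTF-16 text that starts with a BOM.
class BomStripper {
    /** Re-encodes UTF-16 bytes (with BOM) as UTF-8 bytes without a BOM. */
    static byte[] utf16ToUtf8NoBom(byte[] utf16WithBom) {
        // The "UTF-16" charset reads the BOM to choose the byte order
        // and falls back to big-endian when no BOM is present.
        String text = new String(utf16WithBom, StandardCharsets.UTF_16);
        // Java's UTF-8 encoder emits no BOM.
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```

One caveat: if the XML declaration itself says encoding="UTF-16", it has to be changed to UTF-8 (or dropped) as well, otherwise the declaration will no longer match the bytes.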



=========================================
Added 21/7/08:
------------------------------
Just found this old newsgroup entry on the subject...
http://biglist.com/lists/xsl-list/archives/200208/msg01302.html
=========================================

Thursday, June 26, 2008

Tag, Log, Debug!

Yesterday (25/6), my friend Alex Romanov and I presented our work on "Automated Log Generation and Analysis using Collaborative Tagging" at the IBM Programming Languages and Environments seminar 2008.


This is Alex with the poster describing our work:






More info on this work can be found here: http://ed.finnerty.googlepages.com/taglogdebug.
Here you can find the initial paper.
And finally, a blog that will follow our work was opened here: http://taglogdebug.blogspot.com

Tuesday, June 24, 2008

Eclipse Plugins Tutorial (OOPSLA '07)



I was asked where people can get the materials from the tutorial on "Creating Plug-ins and Applications on Eclipse Platform" that Alex Romanov and I gave in Montreal at OOPSLA '07. It was a while ago, and the materials are already on the web. But here are the links, as it seems Google is not doing a good job of referring people to the Google Groups site we created...
The tutorial deals with all the important things for getting started with Eclipse plug-ins: creating a plug-in project, the components of a plug-in project, GEF, MVC in the Eclipse Plug-in architecture, SWT, Actions, RCP and more.

So here is the companion booklet.
And here are the slides.
Both can be printed or redistributed, as is, for any purpose, referring back to the source.

Hope you find it useful.