This is something that I’ve been battling with for a while and tonight, whilst not able to sleep, I finally found a way to crack it! The answer: use an XmlInputFormat from a related project. Now for the story.
We process lots of XML documents every day and some of them are pretty large: over 8 gigabytes uncompressed. Each. Yikes!
We’d made significant performance gains by switching from the old REXML SAX parser to libxml. But we’ve suffered from random segfaults on our production servers (seemingly caused by garbage collection of bad objects), and even then the largest reports still took nearly an hour to process.
Hadoop did seem to offer XML processing: the general advice was to use Hadoop’s StreamXmlRecordReader, which can be accessed through the StreamInputFormat.
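For context, that setup usually amounts to pointing the job at StreamInputFormat and handing StreamXmlRecordReader its begin and end markers. Here’s a rough sketch from a Java driver: the <record> element is just a stand-in for whatever wraps your records, and the stream.recordreader.* property names are what I believe the streaming contrib reads, so double-check them against your Hadoop version.

```java
// Sketch only: configuring Hadoop Streaming's XML record reading from a Java
// driver (old mapred API). Assumes the hadoop-streaming contrib jar is on the
// classpath and that each record is wrapped in a hypothetical <record> element.
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.streaming.StreamInputFormat;

public class StreamXmlJobConfig {
    public static JobConf configure(JobConf conf) {
        // StreamInputFormat hands each split to the record reader named below.
        conf.setInputFormat(StreamInputFormat.class);
        conf.set("stream.recordreader.class",
                 "org.apache.hadoop.streaming.StreamXmlRecordReader");
        // The reader treats everything between these two markers as one record.
        conf.set("stream.recordreader.begin", "<record>");
        conf.set("stream.recordreader.end", "</record>");
        return conf;
    }
}
```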
This approach seemed to behave oddly with our reports (which often don’t have line endings): lines would be duplicated and progress would jump past 100% complete. All a little fishy.
Hadoop Input Formats and Record Readers
The InputFormat in Hadoop does a couple of things. Most significantly, it provides the Splits: the chunks of data that are sent to individual Mappers.
Splits form the rough boundaries of the data to be processed by an individual Mapper. The FileInputFormat (and its subclasses) generates splits based on overall file size. Of course, it’s unlikely that all individual Records (the Key and Value passed to each Map invocation) lie neatly within these splits: records will often cross split boundaries. The RecordReader, therefore, must handle this; it’s the component
… on whom lies the responsibility to respect record-boundaries and present a record-oriented view of the logical InputSplit to the individual task.
In short, a RecordReader can scan from the start of its split for the start of its first record, and it may then continue reading past the end of its split to find that record’s end. The InputSplit only contains offsets into the underlying file: the data itself is still accessed through streams.
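To make that concrete, here’s a stripped-down sketch of the idea. It isn’t the StreamXmlRecordReader or Mahout’s code, and the class and tag names are made up: the reader starts scanning at its split’s start offset, only begins records whose opening tag starts inside the split, and is allowed to read past the split’s end to finish a record it has already started. Anything that starts beyond the boundary is left for the next split’s reader.

```java
// Illustrative only: a record scanner that begins records inside its split but
// may read past the split's end offset to close the record it has started.
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class SplitAwareXmlScanner {
    private final InputStream in;   // assumed already seeked to the split's start offset
    private final long splitEnd;    // byte offset where this split ends
    private final byte[] startTag;  // e.g. "<record>".getBytes("UTF-8")
    private final byte[] endTag;    // e.g. "</record>".getBytes("UTF-8")
    private long pos;               // current offset within the underlying file

    public SplitAwareXmlScanner(InputStream in, long splitStart, long splitEnd,
                                byte[] startTag, byte[] endTag) {
        this.in = in;
        this.pos = splitStart;
        this.splitEnd = splitEnd;
        this.startTag = startTag;
        this.endTag = endTag;
    }

    /** Returns the next complete record, or null when this split has no more records. */
    public String nextRecord() throws IOException {
        // Only *begin* records inside our split; the matching end tag may lie beyond it.
        if (pos >= splitEnd || !scanTo(startTag, null, true)) {
            return null;
        }
        ByteArrayOutputStream record = new ByteArrayOutputStream();
        record.write(startTag, 0, startTag.length);
        if (scanTo(endTag, record, false)) {
            return record.toString("UTF-8");
        }
        return null; // hit end of file before the record closed
    }

    /** Scans forward until marker is matched, optionally buffering what was read. */
    private boolean scanTo(byte[] marker, ByteArrayOutputStream buffer, boolean stopAtSplitEnd)
            throws IOException {
        int matched = 0;
        int b;
        while ((b = in.read()) != -1) {
            pos++;
            if (buffer != null) {
                buffer.write(b);
            }
            if (b == marker[matched]) {
                if (++matched == marker.length) {
                    return true;
                }
            } else {
                matched = (b == marker[0]) ? 1 : 0;
            }
            // While hunting for a start tag, give up at the split boundary:
            // whatever starts beyond it belongs to the next split's reader.
            if (stopAtSplitEnd && matched == 0 && pos >= splitEnd) {
                return false;
            }
        }
        return false;
    }
}
```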
It seemed like the StreamXmlRecordReader was skipping around the underlying InputStream too much, reading records it wasn’t entitled to read. I tried my best to understand the code, but it was written a long while ago and is pretty cryptic to my limited brain.
I started trying to rewrite the code from scratch but it became pretty hairy very quickly. Take a look at the implementation of next() in LineRecordReader to see what I mean.
Mahout to the Rescue
After a little searching around on GitHub I found another XmlInputFormat courtesy of the Lucene sub-project: Mahout.
I’m happy to say it appears to work. I’ve just run a quick test on our 30-node cluster (via my VPN) and it processed the 8 gig file in about 10 minutes. Not bad.
For anyone trying to process XML with Hadoop: try Mahout’s XmlInputFormat.
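If it helps, here’s roughly what wiring it up looks like. The xmlinput.start and xmlinput.end keys are, as far as I can tell, what Mahout’s XmlInputFormat reads its tags from; the <record> element is a stand-in for your own, and the class’s package has moved between Mahout releases, so adjust the import to whichever version you’re using.

```java
// Sketch of a driver using Mahout's XmlInputFormat (new mapreduce API).
// The import below matches where the class has lived in later Mahout releases;
// earlier releases kept it under org.apache.mahout.classifier.bayes.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.mahout.text.wikipedia.XmlInputFormat;

public class XmlReportDriver {

    // Each map input value is one complete <record>…</record> block as Text.
    public static class RecordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable offset, Text xmlRecord, Context context)
                throws IOException, InterruptedException {
            // Parse the record here; this stub just counts how many we saw.
            context.write(new Text("records"), new IntWritable(1));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Tell the input format which element delimits a record (hypothetical tag).
        conf.set("xmlinput.start", "<record>");
        conf.set("xmlinput.end", "</record>");

        Job job = new Job(conf, "xml-report-processing");
        job.setJarByClass(XmlReportDriver.class);
        job.setInputFormatClass(XmlInputFormat.class);
        job.setMapperClass(RecordMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```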