Processing XML in Hadoop
This is something that I’ve been battling with for a while and tonight, whilst not able to sleep, I finally found a way to crack it! The answer: use an XmlInputFormat from a related project. Now for the story.
We process lots of XML documents every day and some of them are pretty large: over 8 gigabytes uncompressed. Each. Yikes!
We’d made significant performance gains by switching from the old REXML SAX parser to libxml, but we’ve suffered from random segfaults on our production servers (seemingly caused by garbage collecting bad objects). Besides, the largest reports still took nearly an hour to process.
Hadoop did seem to offer XML processing: the general advice was to use Hadoop’s StreamXmlRecordReader, which can be accessed through the StreamInputFormat.
This seemed to have weird behaviour with our reports (which often don’t have line-endings): lines would be duplicated and processing would jump to 100%+ complete. All a little fishy.
Hadoop Input Formats and Record Readers
The InputFormat in Hadoop does a couple of things. Most significantly, it provides the Splits that form the chunks that are sent to discrete Mappers.
Splits form the rough boundary of the data to be processed by an individual Mapper. The FileInputFormat (and its subclasses) generate splits based on overall file size. Of course, it’s unlikely that all individual Records (a Key and Value that are passed to each Map invocation) lie neatly within these splits: records will often cross splits. The RecordReader, therefore, must handle this:
… on whom lies the responsibility to respect record-boundaries and present a record-oriented view of the logical InputSplit to the individual task.
In short, a RecordReader could scan from the start of a split for the start of its record. It may then continue to read past the end of its split to find the end. The InputSplit only contains details about the offsets into the underlying file: data is still accessed through the streams.
It seemed like the StreamXmlRecordReader was skipping around the underlying InputStream too much, reading records it wasn’t entitled to read. I did my best to understand the code but it was written a long while ago and is pretty cryptic to my limited brain.
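To make that concrete, here is a minimal sketch of the idea in plain Java, with no Hadoop dependency. It is not the code of StreamXmlRecordReader or of any Hadoop class: the class name SplitRecordScanner, its methods and the in-memory byte array standing in for the file are all made up for illustration. It cuts a "file" into fixed-size splits, and for each split returns only the records whose start tag begins inside that split, reading past the end of the split where needed to reach the end tag.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: mimics how a record reader treats a byte-range split. */
final class SplitRecordScanner {

    /** Cuts a file of the given length into [start, end) ranges of at most splitSize bytes. */
    static List<long[]> computeSplits(long fileLength, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            splits.add(new long[] {start, Math.min(start + splitSize, fileLength)});
        }
        return splits;
    }

    /**
     * Returns every record whose start tag begins inside [splitStart, splitEnd).
     * A record may finish beyond splitEnd; the scan keeps going to find its end tag.
     */
    static List<String> readRecords(byte[] data, int splitStart, int splitEnd,
                                    String startTag, String endTag) {
        byte[] open = startTag.getBytes(StandardCharsets.UTF_8);
        byte[] close = endTag.getBytes(StandardCharsets.UTF_8);
        List<String> records = new ArrayList<>();
        int pos = splitStart;
        while (true) {
            int recordStart = indexOf(data, open, pos);
            // Stop when no record starts here, or the next one belongs to a later split.
            if (recordStart < 0 || recordStart >= splitEnd) {
                break;
            }
            int closeAt = indexOf(data, close, recordStart + open.length);
            if (closeAt < 0) {
                break; // truncated record: no end tag in the rest of the file
            }
            int recordEnd = closeAt + close.length;
            records.add(new String(data, recordStart, recordEnd - recordStart,
                    StandardCharsets.UTF_8));
            pos = recordEnd;
        }
        return records;
    }

    /** Naive byte search: index of the first occurrence of needle at or after from, or -1. */
    private static int indexOf(byte[] haystack, byte[] needle, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= haystack.length - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }
}
```

The rule that keeps records from being duplicated or lost is the ownership test: a record belongs to the split in which its start tag begins, and to no other. A reader that breaks that rule would produce the kind of duplicates described above.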
I started trying to rewrite the code from scratch but it became pretty hairy very quickly. Take a look at the implementation of next() in LineRecordReader to see what I mean.
Mahout to the Rescue
After a little searching around on GitHub I found another XmlInputFormat courtesy of the Lucene sub-project: Mahout.
I’m happy to say it appears to work. I’ve just run a quick test on our 30-node cluster (via my VPN) and it processed the 8 gigabyte file in about 10 minutes. Not bad.
For anyone trying to process XML with Hadoop: try Mahout’s XmlInputFormat.


5 Comments
I'm looking to do the same thing and have just started looking into Mahout's XmlInputFormat. Any chance you can share an example of what the mapper code looks like when using this?
Cheers,
Steve
Great that you found it useful. It's not a very good way of parsing XML. That is, we get only the tag contents, not the properties/attributes of the tag. So, in case you have something more done at your end, do come back and file patches to Mahout :)
It's been OK for our purposes: we can extract the full contents of the element we're interested in, which still allows us to parse attributes and any other nested elements.
We started without needing to do much else (we didn't need a full-blown Document to navigate, or to use a reader, etc.) as we were just parsing out some attribute values.
It's probably not the best (I guess you could provide some kind of lazy XML document record type to pass through to the mapper), but the standard Mahout format has been absolutely fine for now.
Can you please post an example of what the mapper code looks like when using this?
regards
You would need to do something like:
jobConf.setInputFormat(org.apache.mahout.classifier.bayes.XmlInputFormat.class);
The format won't do anything more, so you'll still need to write the code to parse the XML. You may need to use a full XML parser, or you may be OK extracting the contents you need yourself.
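As a rough sketch of the parsing side, assuming each record reaches the mapper as the full text of one element (with Mahout's format I believe that is a byte-offset key and a Text value, but check the version you are using). The class name RecordXml and the sample element in the note below are made up for illustration; only the JDK's built-in DOM parser is used.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

/** Hypothetical helper: turns the text of one record into a DOM element. */
final class RecordXml {

    /**
     * Parses a single, complete element (for example the string a mapper
     * receives as its value) and returns the root element, from which
     * attributes and nested elements can be read.
     */
    static Element parse(String recordXml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(recordXml)))
                    .getDocumentElement();
        } catch (Exception e) {
            // Surface malformed records as an unchecked exception so the caller
            // can decide whether to skip the record or fail the task.
            throw new IllegalArgumentException("Could not parse record", e);
        }
    }
}
```

Inside map() you would then call something like RecordXml.parse(value.toString()) and read what you need, e.g. getAttribute("id") on the returned element for a record such as <row id="42">. Building a DOM per record is simple but not free; if the records are large or you only need one or two attributes, a streaming parser or plain string extraction may be enough.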