Processing XML in Hadoop

This is something that I’ve been battling with for a while and tonight, whilst not able to sleep, I finally found a way to crack it! The answer: use an XmlInputFormat from a related project. Now for the story.

We process lots of XML documents every day and some of them are pretty large: over 8 gigabytes uncompressed. Each. Yikes!

We’d made significant performance gains by switching from the old REXML SAX parser to libxml, but we suffered random segfaults on our production servers (seemingly caused by garbage collection of bad objects). Even then, the largest reports were still taking nearly an hour to run.

Hadoop did seem to offer XML processing: the general advice was to use Hadoop’s StreamXmlRecordReader, accessed through the StreamInputFormat.

This seemed to have weird behaviour with our reports (which often don’t have line-endings): lines would be duplicated and processing would jump to 100%+ complete. All a little fishy.

Hadoop Input Formats and Record Readers

The InputFormat in Hadoop does a couple of things. Most significantly, it provides the Splits that form the chunks that are sent to discrete Mappers.

Splits form the rough boundary of the data to be processed by an individual Mapper. The FileInputFormat (and its subclasses) generates splits based on overall file size. Of course, it’s unlikely that all individual Records (the Key and Value passed to each Map invocation) lie neatly within these splits: records will often cross split boundaries. The RecordReader, therefore, must handle this

… on whom lies the responsibility to respect record-boundaries and present a record-oriented view of the logical InputSplit to the individual task.

In short, a RecordReader could scan from the start of a split for the start of its record. It may then continue to read past the end of its split to find the end. The InputSplit only contains details of the offsets into the underlying file: data is still accessed through the streams.
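To make that concrete, here's a rough sketch of the contract (using hypothetical helper types, not the real Hadoop classes): seek to the split's start offset, skip the partial record that belongs to the previous split, then keep reading whole records until the read position moves past the split's end.

import java.io.IOException;

/** Hypothetical byte-offset-aware view over the whole file. */
interface RecordStream {
    void seek(long offset) throws IOException;
    void skipToNextRecordStart() throws IOException;
    String readRecord() throws IOException; // reads one whole record, or null at end of file
    long position();
}

/** Sketch of how a record reader respects split boundaries. */
class SplitScanner {
    private final RecordStream in;
    private final long end; // first byte *after* this split
    private long pos;

    SplitScanner(RecordStream in, long start, long length) throws IOException {
        this.in = in;
        this.end = start + length;
        in.seek(start);
        if (start != 0) {
            // A record straddling the boundary belongs to the *previous* split,
            // whose reader will have read past its own end to finish it.
            in.skipToNextRecordStart();
        }
        this.pos = in.position();
    }

    /** Returns the next record, or null once this split is exhausted. */
    String next() throws IOException {
        if (pos >= end) {
            return null; // records starting at or after `end` belong to the next split
        }
        String record = in.readRecord(); // may legitimately read bytes beyond `end`
        pos = in.position();
        return record;
    }
}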

It seemed like the StreamXmlRecordReader was skipping around the underlying InputStream too much, reading records it wasn’t entitled to read. I did my best to understand the code, but it was written a long while ago and is pretty cryptic to my limited brain.

I started trying to rewrite the code from scratch but it became pretty hairy very quickly. Take a look at the implementation of next() in LineRecordReader to see what I mean.

Mahout to the Rescue

After a little searching around on GitHub I found another XmlInputFormat courtesy of the Lucene sub-project: Mahout.

I’m happy to say it appears to work. I’ve just run a quick test on our 30 node cluster (via my VPN) and it processed the 8 gig file in about 10 minutes. Not bad.

For anyone trying to process XML with Hadoop: try Mahout’s XmlInputFormat.
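To save anyone a little digging, the wiring ends up looking roughly like the sketch below. It assumes the old mapred (JobConf) API, and the <record> tag is a placeholder for whatever element wraps a single record in your documents.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class XmlJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(XmlJob.class);
        conf.setJobName("xml-report-extract");

        // Tell the format which element delimits a single record; <record> is a
        // placeholder for whatever wraps one record in your documents.
        conf.set("xmlinput.start", "<record>");
        conf.set("xmlinput.end", "</record>");
        conf.setInputFormat(org.apache.mahout.classifier.bayes.XmlInputFormat.class);

        // IdentityMapper passes each matched snippet straight through;
        // swap in your own mapper to actually parse the XML.
        conf.setMapperClass(IdentityMapper.class);
        conf.setNumReduceTasks(0);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}

The format hands each matched <record>…</record> snippet to your mapper as a Text value; the IdentityMapper above just passes those snippets straight through, so you'd swap in a real mapper to do the parsing.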

43 responses

I'm looking to do the same thing and have just started looking into Mahout's XmlInputFormat. Any chance you can share an example of what the mapper code looks like when using this?


Cheers,
Steve

Great that you found it useful. It's not a very good way of parsing XML, though: we get only the tag contents, not the properties/attributes of the tag. So, in case you have something more done at your end, do come back and file patches to Mahout :)

It's been ok for our purposes: we can extract the full contents of the element we're interested in, which still allows us to parse attributes and any other nested elements.


We started without needing to do much else (we didn't need a full-blown Document to navigate, or use a reader, etc.) as we were just parsing out some attribute values.


It's probably not the best (I guess you could provide some kind of lazy XML document record type to pass through to the mapper), but the standard Mahout format has been absolutely fine for now.

I'm looking to process an XML document with MapReduce using the Mahout XmlInputFormat.
Can you please post an example of what the mapper code looks like when using this?

regards

I don't have anything to hand for Java (we're using this in combination with Clojure), but it works similarly to the text format: the key that's provided to your mapper will be a byte offset, and the value will be a string containing the contents of the matching part of the XML document (provided as a Text writable).

You would need to do something like:

jobConf.setInputFormat(org.apache.mahout.classifier.bayes.XmlInputFormat.class);

The format won't do anything more, so you'll still need to write the code to parse the XML. You may need a full XML parser, or you may be ok extracting the contents you need yourself.
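If it helps, a bare-bones mapper against the old mapred API might look something like this sketch. The <record> element, its id attribute and the nested <name> element are invented for the example, and the DOM parsing is just one option: the point is that the value is plain XML text you can parse however you like.

import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class RecordMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    // The key is the byte offset of the matched snippet; the value is the raw
    // "<record>...</record>" text that XmlInputFormat pulled out of the file.
    public void map(LongWritable offset, Text xml,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        try {
            Element record = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml.toString())))
                    .getDocumentElement();

            // "id" and "name" are invented for the example; pull out whatever
            // attributes or nested elements you actually care about.
            String id = record.getAttribute("id");
            String name = record.getElementsByTagName("name").item(0).getTextContent();
            output.collect(new Text(id), new Text(name));
        } catch (Exception e) {
            reporter.getCounter("xml", "unparseable").increment(1);
        }
    }
}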

Hi, I'm trying to use Mahout's XmlInputFormat, but no luck so far. My entire project generates and consumes a lot of XML files. To implement Hadoop, I need to show how we can process XML.
I'm receiving nothing in map as input.

Aniruddha, it sounds like you're not matching the right tags (set through the xmlinput.start and xmlinput.end properties in the job config). Also note that the XmlInputFormat provides the contents of the matched tag to the mapper as a Text writable in the value (so you still need to parse the text).

Hi Paul,
Thanks so much for sharing this. Are you using Mahout's XmlInputFormat in conjunction with Hadoop's streaming interface? If so, could you share how you got that up and running?
Best,
Diederik

Hi Diederik,

We're not using it with the streaming interface currently (it's used by a Clojure job for some ETL). However, it ought to be pretty easy to use it from the streaming API.

You'll need to make sure you add the mahout jar to the job's class path (or put it onto Hadoop's class path on each node), then set the InputFormat with the -inputformat parameter.

Once you've set that, you just need to pass the two parameters (xmlinput.start and xmlinput.end) that set the start and end tags for the format to match.

I found a way around this by checking for the start and end tags within my Python code as it reads standard in, thus avoiding having to use the Mahout jar.

http://davidvhill.com/article/processing-xml-with-hadoop-streaming

Hi David,

Thanks for the link.

We had some issues with the bundled streaming XML reader dropping records or dying halfway through parts of documents that were missing line endings.

But, if it works, definitely much easier than using a 3rd party jar!

Thank you for sharing this. Mahout’s XmlInputFormat seems to be my lifesaver.
You mention Clojure; did you use Cascalog by chance? If so, can you share a brief code snippet showing how it's set up?

Hi, I have tried the XmlInputFormat and it works well with small XML files. But when I process a huge XML file (around 750MB), I get a heap memory exception. Could anyone suggest how to deal with large XML files? Thanks in advance. Sagar

@Sagar I think you'd need to either allocate the JVM more space or shrink the files. Because Hadoop's Map/Reduce functions are record-oriented, they need to know what bounds a record. If your XML file is genuinely 750MB for an individual record, then that is probably the problem. If you have 750MB of repeated, smaller records, then you're probably not splitting them correctly: check that the element name is specified properly to ensure it's splitting up the content correctly.

@Kevin We did use Cascalog, but not for this job: it was a straight Java job. It's been a long while since I've looked, but it should be possible to wire it together with Cascading. You'll probably need to write a custom Scheme that uses the XmlInputFormat.

HTH
Paul Ingles
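P.S. As a rough sketch of those two checks (assuming the old JobConf API; the <item> tags and heap size are placeholders), the configuration would look something like:

import org.apache.hadoop.mapred.JobConf;

public class LargeXmlTuning {
    public static JobConf configure() {
        JobConf conf = new JobConf(LargeXmlTuning.class);

        // Give each task's child JVM more headroom than the default.
        conf.set("mapred.child.java.opts", "-Xmx1024m");

        // Make sure the tags match the element wrapping ONE record, not the
        // document root; otherwise the whole 750MB file becomes a single record.
        // <item> is a placeholder here.
        conf.set("xmlinput.start", "<item>");
        conf.set("xmlinput.end", "</item>");
        return conf;
    }
}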