Sunday, April 20, 2008

A comprehensive XML processing benchmark

by Eduardo Rodrigues


I think I've already mentioned it here but, anyway, I'm currently leading a very interesting and challenging project for a big telecom company here in Brazil. The project is basically a complete rebuild of the data loading system currently used to process, validate and load all cellphone statements, which are stored as XML files, into an Oracle CMSDK repository. For those who aren't familiar with it, Oracle CMSDK is an old content management product that succeeded the older Oracle iFS (Internet File System). Because it's not an open repository, we are obliged to use its Java API whenever we want to programmatically load data into or retrieve data from the repository. That, obviously, prevents us from taking advantage of some of the newest tools available, like Oracle XML DB or even the recent Oracle Data Integrator.


One of our biggest concerns in this project is the performance the new system must deliver. The SLA is really aggressive. So, we decided to do some research into the newest XML processing technologies available, then try and compare them to determine which ones would really help us most efficiently. The only constraints are: we must not consider any non-industry-standard solution or any non-production (or non-stable) release.

Test Scenarios

That said, based on research and also on previous experience, these were the technologies I've chosen to test and compare:

I've initially discarded DOM parsers based on the large average size of the XML files we'll be dealing with. We most certainly can't afford the excessive memory consumption involved. I've also discarded Oracle StAX Pull Parser, because it was still a preview release, and J2SE 5.0 built-in XML parsers, since I know they're a proprietary implementation of Apache Xerces based on a version certainly older than 2.9.1.

The test scenario I designed was very simple and was intended only to measure and compare performance and memory consumption. The test job was just to parse a real-world XML file containing one phone statement, retrieving and counting a predefined set of elements and attributes. In summary, the rules were (for privacy's sake, the real XML structure won't be revealed):
  1. Parse all occurrences of "/root/child1/StatementPage" element
  2. For each <StatementPage> do:
    1. Store and print out value of attribute "/root/child1/StatementPage/PageInfo/@pageNumber"
    2. Store and print out value of attribute "/root/child1/StatementPage/PageInfo/@customerCode"
    3. Store any occurrence of element <ValueRecord>, along with all its attributes, within page's subtree
    4. Print out the number of <ValueRecord> elements stored
  3. Print out the total number of <StatementPage> elements parsed
  4. Print out the total number of <ValueRecord> elements parsed
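
The rules above can be sketched with the standard StAX cursor API (XMLStreamReader). This is only a minimal illustration, not the benchmark code used for the measurements: the class name StatementScan and its scan method are hypothetical, elements are matched by local name at any depth instead of by the full "/root/child1/..." path, and the page attributes are only printed, not stored. It also assumes javax.xml.stream is available, which is true on Java 6 and later; on Java 5 a StAX API jar and an implementation such as Woodstox would have to be on the classpath.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration only.
final class StatementScan {

    /** Returns {total StatementPage elements, total ValueRecord elements}. */
    static int[] scan(String xml) {
        int pages = 0;
        int totalRecords = 0;
        boolean inPage = false;
        // ValueRecord attributes collected for the page currently being read.
        List<Map<String, String>> records = new ArrayList<Map<String, String>>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("StatementPage".equals(name)) {
                        // Rules 1 and 2: a new page starts.
                        inPage = true;
                        pages++;
                        records.clear();
                    } else if (inPage && "PageInfo".equals(name)) {
                        // Rules 2.1 and 2.2: print the page attributes.
                        System.out.println("pageNumber="
                                + reader.getAttributeValue(null, "pageNumber"));
                        System.out.println("customerCode="
                                + reader.getAttributeValue(null, "customerCode"));
                    } else if (inPage && "ValueRecord".equals(name)) {
                        // Rule 2.3: keep the record with all its attributes.
                        Map<String, String> attrs = new LinkedHashMap<String, String>();
                        for (int i = 0; i < reader.getAttributeCount(); i++) {
                            attrs.put(reader.getAttributeLocalName(i),
                                    reader.getAttributeValue(i));
                        }
                        records.add(attrs);
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "StatementPage".equals(reader.getLocalName())) {
                    // Rule 2.4: the page is complete.
                    System.out.println("ValueRecords in page: " + records.size());
                    totalRecords += records.size();
                    inPage = false;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        // Rules 3 and 4: print the totals.
        System.out.println("Total StatementPage elements: " + pages);
        System.out.println("Total ValueRecord elements: " + totalRecords);
        return new int[] { pages, totalRecords };
    }
}
```

Because only one page's records are held at a time, memory use in a sketch like this depends on the largest page rather than on the size of the whole file, which is the main reason a streaming parser is attractive for files of this size.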

Also, every test was performed on two different XML files: a small file (6.5MB), containing a total of 420 statement pages and 19,133 value records, and a large one (143MB), with 7,104 pages and 464,357 value records.

Based on the rules above, I then tested and compared the following technology sets:
  1. Apache Digester using Apache Xerces2 SAX2 parser
  2. Apache Digester using Oracle SAX2 parser
  3. Sun JAXB2 using Xerces2 SAX2 parser
  4. Sun JAXB2 using Oracle SAX2 parser
  5. Sun JAXB2 using Woodstox StAX1 parser
  6. Pure Xerces2 SAX2 parser
  7. Pure Oracle SAX2 parser
  8. Pure Woodstox StAX1 parser

Based on this tutorial fragment from Sun, and considering that performance is our primary goal, I chose StAX's cursor API (XMLStreamReader) over the iterator API (XMLEventReader). Still aiming for performance, all tested parsers were configured as non-validating.
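
For the SAX-based sets, a non-validating configuration through the standard JAXP API could look roughly like the sketch below. The class name NonValidatingSaxCount and its countElements method are hypothetical and exist only for illustration; the actual factory settings used in the tests are not shown in this post, so treat this as one plausible setup and not as the benchmark configuration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only.
final class NonValidatingSaxCount {

    /** Counts the elements whose local name equals the given name. */
    static int countElements(String xml, final String wantedName) {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(false);     // no DTD validation
        factory.setNamespaceAware(true);  // so that local names are reported
        final int[] count = { 0 };
        try {
            SAXParser parser = factory.newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attributes) {
                    if (wantedName.equals(localName)) {
                        count[0]++;
                    }
                }
            });
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Parser could not be configured", e);
        } catch (SAXException e) {
            throw new IllegalStateException("XML parsing failed", e);
        } catch (IOException e) {
            throw new IllegalStateException("XML could not be read", e);
        }
        return count[0];
    }
}
```

Which SAX implementation SAXParserFactory.newInstance() returns depends on the classpath and on the javax.xml.parsers.SAXParserFactory system property, so the same code can be pointed at different parsers without changes.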

For the record, all tests were executed on a Dell Latitude D620 notebook with an Intel Centrino Duo T2400 CPU @ 1.83GHz, running Windows XP Professional SP2 and Sun's Java VM 1.5.0_15 in client mode.


These were the performance results obtained after parsing the small XML file (for obvious reasons, I decided to measure heap usage only when the large file was parsed):

Performance results for small XML file
As you can see, Apache Digester's performance was surprisingly poor despite all my efforts to improve it. So, I had no choice but to exclude it from the next tests with the large XML file, whose results are presented below:

Performance results for large XML file
Notice that the tendency toward better performance when the <!DOCTYPE> declaration is removed from the XML document is clearly confirmed here.
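
When editing the source files to remove the <!DOCTYPE> declaration is not an option, a StAX parser can instead be asked to ignore DTDs through the standard XMLInputFactory.SUPPORT_DTD property. The sketch below is an assumption on my side and was not part of the measurements above, so it makes no claim about how close it gets to the "DOCTYPE removed" numbers; the class name SkipDtdExample and the statement.dtd system identifier are hypothetical.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration only.
final class SkipDtdExample {

    /** Counts start elements while asking the parser not to process DTDs. */
    static int countStartElements(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Standard StAX property: with DTD support off, the parser is not
        // expected to load or process the DTD named by the DOCTYPE.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        int count = 0;
        try {
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    count++;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return count;
    }
}
```

The trade-off is that documents relying on entities or default attribute values declared in the DTD will no longer parse as intended, so this only fits files that use the DTD purely for validation.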

As for memory allocation, I once again narrowed the tests to the worst case from the performance tests above: the large XML file including the <!DOCTYPE> declaration. The results obtained from JDev's memory profiler were:

Memory allocation for large XML file
Another interesting piece of information we can extract from these tests is how much overhead XML binding adds compared to a straight parser:

Overhead charts


After a careful and thorough review and confirmation of all results obtained from the tests described here, I tend to recommend a mixed solution. Considering the nearly 12MB/s throughput it achieved here, I'd certainly choose the pure Woodstox StAX parser whenever I have to deal with medium to large XML sources but, for convenience, I'd also choose JAXB 2 whenever there's an XML schema available to compile its classes from and the size of the source XML is not a concern.

As for complexity, I really can't say that any of the tested technologies was considerably more complex to implement than the others. In fact, I don't think this would be an issue for anybody with average experience in XML processing.

Important Note

Just out of curiosity, I've also tested Codehaus StaxMate 1.1 along with the Woodstox StAX parser. It's a helper library built on top of StAX that provides an easier-to-use abstraction layer over the StAX cursor API. I can confirm the implementor's claim that StaxMate shouldn't add any significant performance overhead. In fact, its results were identical to those of pure Woodstox StAX when parsing the large XML file. I can also say that it really made my job much easier. The only reason I won't consider StaxMate is that it depends on a non-standard extension of the StAX 1.0 API, which the guys at Codehaus are calling "StAX2".

That's all for now.

Enjoy and... keep reading!


Anonymous said...

Hey Eduardo,

Cool post, and it looks like you are doing some pretty solid research on XML performance. I have some experience with XML performance issues - I work on Intel's XML software team. I hate when posters on blogs use it for a sales pitch so I won't try to sell you on it, but if you are interested in high performance, check it out. I can promise you'll be pleasantly surprised. See,

Unknown said...

Well Matt,

It certainly seems very impressive. However, our intention here was to compare only open-source, industry-standard compliant solutions. An exception was made for Oracle XDK because this blog is about Oracle technologies too - although you can clearly see that Oracle XDK didn't show the best results.

Anyway, your post is published, so readers are free to try Intel's XML kit if they want to.

Unknown said...

Hi Eduardo! Thank you for the interesting article. I was happy to also see a reference to StaxMate. One minor note wrt StaxMate: although it does use Stax2 extension, it also implements wrappers, so theoretically any other Stax (1.0) implementation should work as well. I have not extensively tested this, but I would expect Sun's sjsxp implementation (part of JDK 1.6) to work as well as Woodstox.

Also, another brand new Stax implementation, Aalto, might be interesting to check out. While it is a work in progress, it does implement Stax well enough to work with JAXB 2, and it is very, very fast (50% higher throughput than Woodstox), at least in my test cases.

Unknown said...

Thank you Tatu. That's very interesting info indeed.

anon_anon said...

try vtd-xml
you won't regret