by Eduardo Rodrigues
I think I've already mentioned it here but, anyway, I'm currently leading a very interesting and challenging project for a big telecom company here in Brazil. The project is basically a complete rebuild of the current data loading system used to process, validate and load all cellphone statements, which are stored as XML files, into an Oracle CMSDK 22.214.171.124.2 repository. For those who aren't familiar, Oracle CMSDK is an old content management product that succeeded the older Oracle iFS (Internet File System). Because it's not an open repository, we are obliged to use its Java API if we want to programmatically load data into the repository or retrieve data from it. That, obviously, prevents us from taking advantage of some of the newest tools available, like Oracle's XML DB or even the recent Oracle Data Integrator.
One of our biggest concerns in this project is the performance the new system must deliver. The SLA is really aggressive. So we decided to do some research into the newest XML processing technologies available, then try and compare them to find out which ones would help us most efficiently. The only constraints are that we must not consider any non-industry-standard solution or any non-production (or non-stable) release.
That said, based on research and also on previous experience, these were the technologies I've chosen to test and compare:
- JAXP SAX 2 compliant parsers:
  - Oracle XDK parsers shipped with JDeveloper 10.1.3.3
  - Apache Xerces 2.9.1
- StAX 1 compliant pull parsers (Streaming API for XML):
  - StAX 1.0 (JSR-173) API
  - Codehaus Woodstox 3.2.4
- XML binding:
  - Sun's JAXB 2.1.6 Reference Implementation
  - Apache Commons Digester 1.8
I discarded DOM parsers from the start because of the large average size of the XML files we'll be dealing with. We most certainly can't afford the excessive memory consumption involved. I also discarded the Oracle StAX Pull Parser, because it was still a preview release, and the J2SE 5.0 built-in XML parsers, since I know they're a proprietary implementation of Apache Xerces based on a version certainly older than 2.9.1.
The test scenario was very simple and was intended only to measure and compare performance and memory consumption. The test job was just to parse a real-world XML file containing 1 phone statement, retrieving and counting a predefined set of elements and attributes. In summary, the rules were (for privacy's sake, the real XML structure won't be revealed):
- Parse all occurrences of "/root/child1/StatementPage" element
- For each <StatementPage> do:
  - Store and print out value of attribute "/root/child1/StatementPage/PageInfo/@pageNumber"
  - Store and print out value of attribute "/root/child1/StatementPage/PageInfo/@customerCode"
  - Store any occurrence of element <ValueRecord>, along with all its attributes, within page's subtree
  - Print out the number of <ValueRecord> elements stored
- Print out the total number of <StatementPage> elements parsed
- Print out the total number of <ValueRecord> elements parsed
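To make the rules above concrete, here is a minimal sketch of how they could be applied with the StAX cursor API. It is not the code used in the tests: the class name StatementPageCounter is made up for illustration, and for brevity it matches elements by name only instead of checking the full "/root/child1/StatementPage" path.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical class: a sketch of the test rules, not the original test code.
final class StatementPageCounter {

    /** Parses the XML text and returns {total pages, total value records}. */
    static int[] count(String xml) {
        int pages = 0;
        int totalRecords = 0;
        boolean inPage = false;
        String pageNumber = null;
        String customerCode = null;
        // Attributes of every <ValueRecord> found in the current page.
        List<Map<String, String>> records = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("StatementPage".equals(name)) {
                        inPage = true;
                        pages++;
                        pageNumber = null;
                        customerCode = null;
                        records.clear();
                    } else if (inPage && "PageInfo".equals(name)) {
                        pageNumber = reader.getAttributeValue(null, "pageNumber");
                        customerCode = reader.getAttributeValue(null, "customerCode");
                    } else if (inPage && "ValueRecord".equals(name)) {
                        Map<String, String> attrs = new LinkedHashMap<>();
                        for (int i = 0; i < reader.getAttributeCount(); i++) {
                            attrs.put(reader.getAttributeLocalName(i), reader.getAttributeValue(i));
                        }
                        records.add(attrs);
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "StatementPage".equals(reader.getLocalName())) {
                    // End of a page: report it and add its records to the grand total.
                    System.out.println("Page " + pageNumber + ", customer " + customerCode
                            + ", value records: " + records.size());
                    totalRecords += records.size();
                    inPage = false;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return new int[] {pages, totalRecords};
    }

    public static void main(String[] args) {
        // Tiny made-up document; the real statement structure is not public.
        String xml = "<root><child1><StatementPage>"
                + "<PageInfo pageNumber=\"1\" customerCode=\"C1\"/>"
                + "<Section><ValueRecord amount=\"10\"/></Section>"
                + "</StatementPage></child1></root>";
        int[] totals = count(xml);
        System.out.println("Total pages: " + totals[0]);
        System.out.println("Total value records: " + totals[1]);
    }
}
```

The per-page list is cleared at every <StatementPage>, so memory use stays bounded by the largest page rather than by the whole file, which is the main reason to prefer a streaming parser for documents of this size.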
Also, every test was to be run against 2 different XML files: a small one (6.5MB), containing a total of 420 statement pages and 19,133 value records, and a large one (143MB) with 7,104 pages and 464,357 value records.
Based on the rules above, I then tested and compared the following technology sets:
- Apache Digester using Apache Xerces2 SAX2 parser
- Apache Digester using Oracle SAX2 parser
- Sun JAXB2 using Xerces2 SAX2 parser
- Sun JAXB2 using Oracle SAX2 parser
- Sun JAXB2 using Woodstox StAX1 parser
- Pure Xerces2 SAX2 parser
- Pure Oracle SAX2 parser
- Pure Woodstox StAX1 parser
Based on this tutorial fragment from Sun: http://java.sun.com/webservices/docs/1.6/tutorial/doc/SJSXP3.html and considering that performance is our primary goal, I chose StAX's cursor API (XMLStreamReader) over the iterator API (XMLEventReader). Still aiming for performance, all tested parsers were configured as non-validating.
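For reference, this is roughly what a non-validating setup looks like through the standard JAXP and StAX factories, together with a SAX handler that counts the same elements. It's only a sketch under my own assumptions: the names NonValidatingParsers and CountingHandler are hypothetical, and which parser implementation the factories return (Xerces, Oracle XDK, Woodstox or the JDK default) depends on the classpath and system properties.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLInputFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
final class NonValidatingParsers {

    /** SAX parser with validation switched off (the JAXP default, made explicit). */
    static SAXParser newSaxParser() throws ParserConfigurationException, SAXException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(false);
        factory.setNamespaceAware(false); // element names then arrive as qName
        return factory.newSAXParser();
    }

    /** StAX factory with validation switched off through the standard property. */
    static XMLInputFactory newStaxFactory() {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_VALIDATING, Boolean.FALSE);
        return factory;
    }

    /** Counts pages and the value records nested inside them. */
    static final class CountingHandler extends DefaultHandler {
        int pages;
        int valueRecords;
        private boolean inPage;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            if ("StatementPage".equals(qName)) {
                inPage = true;
                pages++;
            } else if (inPage && "ValueRecord".equals(qName)) {
                valueRecords++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("StatementPage".equals(qName)) {
                inPage = false;
            }
        }
    }

    /** Runs the counting handler over the given XML text. */
    static CountingHandler countWithSax(String xml) {
        try {
            CountingHandler handler = new CountingHandler();
            newSaxParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```

The SAX version has to track state in handler fields because the parser pushes events at it, whereas the cursor API lets the application pull events in an ordinary loop.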
By the way, all tests were executed on a Dell Latitude D620 notebook, with an Intel Centrino DUO T2400 CPU @ 1.83GHz, running Windows XP Professional SP2 and Sun's Java VM 1.5.0_15 in client mode.
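If you want to repeat this kind of measurement yourself, a very rough harness could look like the sketch below. It is not the harness used for these tests; the class name ParseTimer and its methods are made up, and the heap figure taken from Runtime is only a crude approximation, not a substitute for a real memory profiler.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical harness: times one full pass of a StAX cursor over a document.
final class ParseTimer {

    /** Throughput in MB/s, taking 1 MB as 1024 * 1024 bytes. */
    static double megabytesPerSecond(long bytes, long elapsedNanos) {
        double megabytes = bytes / (1024.0 * 1024.0);
        double seconds = elapsedNanos / 1e9;
        return megabytes / seconds;
    }

    /** Rough heap-in-use figure; it fluctuates with garbage collection. */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Walks every event in the document and returns the elapsed time in nanoseconds. */
    static long timeParseNanos(byte[] xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        long start = System.nanoTime();
        try {
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new ByteArrayInputStream(xml));
            while (reader.hasNext()) {
                reader.next();
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // Small made-up document; a real run would read the statement file instead.
        byte[] xml = "<root><child1><StatementPage/></child1></root>"
                .getBytes(StandardCharsets.UTF_8);
        long nanos = timeParseNanos(xml);
        System.out.println("Elapsed: " + nanos + " ns");
        System.out.println("Throughput: " + megabytesPerSecond(xml.length, nanos) + " MB/s");
        System.out.println("Heap in use: " + usedHeapBytes() + " bytes");
    }
}
```

A single timed pass like this is noisy on a JIT-compiled VM, so it's wise to run the parse a few times first and average several measurements before comparing parsers.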
These were the performance results obtained after parsing the small XML file (for obvious reasons, I decided to measure heap usage only when the large file was parsed):
As you can see, Apache Digester's performance was extremely and surprisingly poor despite all my efforts to improve it. So I had no choice but to drop it from the next tests with the large XML file, whose results are presented below:
Notice that the tendency toward better performance when the <!DOCTYPE> declaration is removed from the XML document is clearly confirmed here.
As for the memory allocation comparison, I once again narrowed the tests to the worst case from the performance tests above: the large XML file including the <!DOCTYPE> declaration. The results obtained from JDev's memory profiler were:
Another interesting piece of information we can extract from these tests is how much overhead XML binding adds when compared to a straight parser:
After a careful and thorough review and confirmation of all results obtained from the tests described here, I tend to recommend a mixed solution. Considering the nearly 12MB/s throughput verified here, I'd certainly choose the pure Woodstox StAX parser every time I have to deal with medium to large XML sources but, for convenience, I'd also choose JAXB 2 whenever there's an XML schema available to compile its classes from and the size of the source XML is not a concern.
As for complexity, I really can't say that any one of the tested technologies was considerably more complex to implement than the others. In fact, I don't think this would be an issue for anybody with average experience in XML processing.
Just out of curiosity, I also tested Codehaus StaxMate 1.1 along with the Woodstox StAX parser. It's a helper library built on top of StAX that provides an easier-to-use abstraction layer over the StAX cursor API. I can confirm the implementor's claim that StaxMate shouldn't add any significant performance overhead. In fact, its results were identical to those of pure Woodstox StAX when parsing the large XML file. I can also say that it really made my job much easier. The only reason I won't consider StaxMate is that it depends on a non-standard extension of the StAX 1.0 API, which the folks at Codehaus call "StAX2".
That's all for now.
Enjoy and... keep reading!