Java XML Processing using SAX (boring)

**Feirefiz** · 28 Jan 2010 03:45 PM

Does anyone know a good resource on using SAX? I don't mean an API reference or something describing just the very first steps, but something that explains how to write good code using SAX.

I can use it basically. The problem is that the code I write in the process is consistently the ugliest I produce. There has to be a better way.

Because I have to deal with large documents regularly, DOM or similar things aren't always an option, and I would like to avoid exotic parsers. I want to have a look at StAX but so far I never got around to it.

**CatInASuit** · 29 Jan 2010 04:50 AM

Originally posted by Feirefiz

Does anyone know a good resource on using SAX? I don't mean an API reference or something describing just the very first steps, but something that explains how to write good code using SAX.

I can use it basically. The problem is that the code I write in the process is consistently the ugliest I produce. There has to be a better way.

Because I have to deal with large documents regularly, DOM or similar things aren't always an option, and I would like to avoid exotic parsers. I want to have a look at StAX but so far I never got around to it.

I have found that most Java XML code is pretty ugly. What does your code look like and why is it a problem?

Also, how large are the docs you are using? DOM can handle a fair amount.

**Feirefiz** · 29 Jan 2010 09:46 AM

Originally posted by CatInASuit

I have found that most Java XML code is pretty ugly. What does your code look like and why is it a problem?

My training in XML processing has been pretty minimal. Because of that my code is basically an extrapolation of toy examples.
My startElement and endElement handler methods are huge messes of nested if-then-else cascades. All the while I have numerous objects for keeping track of the information extracted from the document. The organization does not reflect any logical organization apart from the type of the SAX event. Usually I like my methods short and my scopes small. It works, but it doesn't feel like the right way to do it. It is also far from readable.

Also, how large are the docs you are using? DOM can handle a fair amount.

Up to hundreds of megabytes on disk for single documents. I don't doubt that DOM can handle that in general, but on the university computers and my poor old notebook, where it is supposed to run in practice, it has been a pain in the ass.

**CatInASuit** · 29 Jan 2010 10:40 AM

Originally posted by Feirefiz

My training in XML processing has been pretty minimal. Because of that my code is basically an extrapolation of toy examples.
My startElement and endElement handler methods are huge messes of nested if-then-else cascades. All the while I have numerous objects for keeping track of the information extracted from the document. The organization does not reflect any logical organization apart from the type of the SAX event. Usually I like my methods short and my scopes small. It works, but it doesn't feel like the right way to do it. It is also far from readable.

SAX is event driven, so you are probably stuck with a lot of if-then-else to catch the elements you want followed by adding the same catch into the characters block to get out the data.

If possible, have you considered running an XSLT transform up front to catch the data you need in a new XML file which can be DOM parsed?

Originally posted by Feirefiz

Up to hundreds of megabytes on disk for single documents. I don't doubt that DOM can handle that in general, but on the university computers and my poor old notebook, where it is supposed to run in practice, it has been a pain in the ass.

Ouch. DOM will handle it, but there goes all your memory.

**Feirefiz** · 29 Jan 2010 12:15 PM

Originally posted by CatInASuit

SAX is event driven, so you are probably stuck with a lot of if-then-else to catch the elements you want followed by adding the same catch into the characters block to get out the data.

I was just thinking that perhaps I am missing some way to organize it in a more elegant way or break it down into smaller units.
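One way this could be broken into smaller units, as a rough sketch only: keep a single generic handler that buffers text and forwards each element to a small per-element object registered by name. `ElementHandler`, `DispatchingHandler`, `SaxDispatchDemo` and the `<token pos="...">` markup are made-up names for illustration, and the text buffering assumes leaf elements without mixed content.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical callback for one element type; both methods are optional.
interface ElementHandler {
    default void start(Attributes attrs) {}
    default void end(String text) {}
}

// Generic SAX handler: no element-specific logic, only dispatch by name.
class DispatchingHandler extends DefaultHandler {
    private final Map<String, ElementHandler> handlers = new HashMap<>();
    private final StringBuilder text = new StringBuilder();

    void on(String elementName, ElementHandler handler) {
        handlers.put(elementName, handler);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        text.setLength(0); // assumption: only leaf element text is of interest
        ElementHandler h = handlers.get(qName);
        if (h != null) h.start(attrs);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one text node in several chunks, so accumulate.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        ElementHandler h = handlers.get(qName);
        if (h != null) h.end(text.toString().trim());
        text.setLength(0);
    }
}

class SaxDispatchDemo {
    // Collects "word/pos" strings from hypothetical <token pos="..."> elements.
    static List<String> extractTokens(String xml) {
        List<String> tokens = new ArrayList<>();
        DispatchingHandler handler = new DispatchingHandler();
        handler.on("token", new ElementHandler() {
            private String pos; // state lives with the element it belongs to
            @Override public void start(Attributes attrs) { pos = attrs.getValue("pos"); }
            @Override public void end(String text) { tokens.add(text + "/" + pos); }
        });
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return tokens;
    }
}
```

The if-then-else cascade does not disappear, it turns into a map lookup, but each element's state and logic end up in their own small scope instead of in shared fields of one big handler.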

If possible, have you considered running an XSLT transform up front to catch the data you need in a new XML file which can be DOM parsed?

Yes, I have. My problem is that all the free XSLT processors I have found will also read the whole thing into memory. It's the most advantageous thing in the general case and most processors don't bother with optimizations for the special cases where it isn't. (e.g. the Enterprise version of Saxon does but that's not an option for me.)

In general I can do preprocessing to save time, but everything has to be reasonably reliable and easy to replicate from the "official" third-party files. That means that ideally I should keep the requirements for all steps within similar limits.

Ouch. DOM will handle it, but there goes all your memory.

It is computational linguistics stuff and there you often work with data in that size range and bigger. Unfortunately people have very different ideas of reasonable ways to package such a resource. Right now I am working with data from someone who chose the opposite direction: ca. 200MB and over 20 000 XML files.

**CatInASuit** · 01 Feb 2010 05:27 AM

For coding, sometimes there is no nice way to present it.

Looks like Xalan can be set to do incremental transforms of XML docs so it doesn't have to read everything into memory.

**Feirefiz** · 01 Feb 2010 05:35 PM

Over the weekend I had a look at StAX and so far I really like it. I don't know who else has tried it, but it addresses some of my reservations about SAX. The most important point is that it is a pull parser, i.e. your code requests the next parser event instead of being called back, so you are not forced to work with a single entry point and you can choose your own modular structure.
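To illustrate the point about modular structure, here is a minimal sketch, assuming a hypothetical corpus format with `<sentence>` and `<token>` elements. `StaxSketch`, `readCorpus` and `readSentence` are made-up names; only the `javax.xml.stream` calls are the standard API. Each method pulls events for its own part of the document and hands the reader on.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxSketch {
    // Top level: scan for <sentence> and delegate each one to readSentence.
    static List<List<String>> readCorpus(String xml) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            List<List<String>> sentences = new ArrayList<>();
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT
                        && "sentence".equals(r.getLocalName())) {
                    sentences.add(readSentence(r));
                }
            }
            r.close();
            return sentences;
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    // Entered with the reader on <sentence>; returns with it on </sentence>.
    private static List<String> readSentence(XMLStreamReader r) throws XMLStreamException {
        List<String> tokens = new ArrayList<>();
        while (r.hasNext()) {
            int event = r.next();
            if (event == XMLStreamConstants.START_ELEMENT
                    && "token".equals(r.getLocalName())) {
                // getElementText reads text-only content and stops on </token>.
                tokens.add(r.getElementText());
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && "sentence".equals(r.getLocalName())) {
                break;
            }
        }
        return tokens;
    }
}
```

Because the parser only advances when asked, the state for a sentence lives in local variables of `readSentence` rather than in handler fields, and memory use stays tied to one sentence at a time rather than to the whole document.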
