Thread: Java XML Processing using SAX (boring)

  1. #1
    Feirefiz (Elephant) | Registered: Feb 2009 | Location: Germany | Posts: 802

    Does anyone know a good resource on using SAX? I don't mean an API reference or something describing just the very first steps, but something that explains how to write good code using SAX.

    I can use it basically. The problem is that the code I write in the process is consistently the ugliest I produce. There has to be a better way.

    Because I have to deal with large documents regularly, DOM or similar things aren't always an option, and I would like to avoid exotic parsers. I want to have a look at StAX but so far I never got around to it.

  2. #2
    CatInASuit (Administrator) | Registered: Feb 2009 | Location: Coulsdon Cat Basket | Posts: 10,342

    Quote Originally posted by Feirefiz
    Does anyone know a good resource on using SAX? I don't mean an API reference or something describing just the very first steps, but something that explains how to write good code using SAX.

    I can use it basically. The problem is that the code I write in the process is consistently the ugliest I produce. There has to be a better way.

    Because I have to deal with large documents regularly, DOM or similar things aren't always an option, and I would like to avoid exotic parsers. I want to have a look at StAX but so far I never got around to it.

    I have found that most Java XML code is pretty ugly. What does your code look like, and why is it a problem?

    Also, how large are the docs you are using? DOM can handle a fair amount.
    In the land of the blind, the one-arm man is king.

  3. #3
    Feirefiz (Elephant) | Registered: Feb 2009 | Location: Germany | Posts: 802

    Quote Originally posted by CatInASuit
    I have found that most Java XML code is pretty ugly. What does your code look like, and why is it a problem?

    My training in XML processing has been pretty minimal. Because of that, my code is basically an extrapolation of toy examples.
    My startElement and endElement handler methods are huge messes of nested if-then-else cascades. All the while I have numerous objects for keeping track of the information extracted from the document. The structure does not reflect any logical organization apart from the type of the SAX event. Usually I like my methods short and my scopes small. It works, but it doesn't feel like the right way to do it. It is also far from readable.

    Quote Originally posted by CatInASuit
    Also, how large are the docs you are using? DOM can handle a fair amount.

    Up to hundreds of megabytes on disk for single documents. I don't doubt that DOM can handle that in general, but on the university computers and my poor old notebook, where it is supposed to run in practice, it has been a pain in the ass.

  4. #4
    CatInASuit (Administrator) | Registered: Feb 2009 | Location: Coulsdon Cat Basket | Posts: 10,342

    Quote Originally posted by Feirefiz
    My training in XML processing has been pretty minimal. Because of that, my code is basically an extrapolation of toy examples.
    My startElement and endElement handler methods are huge messes of nested if-then-else cascades. All the while I have numerous objects for keeping track of the information extracted from the document. The structure does not reflect any logical organization apart from the type of the SAX event. Usually I like my methods short and my scopes small. It works, but it doesn't feel like the right way to do it. It is also far from readable.

    SAX is event-driven, so you are probably stuck with a lot of if-then-else to catch the elements you want, followed by adding the same checks to the characters block to get the data out.
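
    One way to tame the if-then-else cascades is to let a small generic handler do the bookkeeping and route each element's text to a callback. The following is only a minimal sketch under stated assumptions, not a pattern prescribed by the SAX API: the classes DispatchingHandler and SaxDispatchDemo, the onText method, and the element name title are made up for illustration. Only DefaultHandler, SAXParserFactory and InputSource come from the JDK.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: routes the text of each element to a small callback,
// so startElement/endElement contain no per-element branching.
class DispatchingHandler extends DefaultHandler {
    private final Map<String, Consumer<String>> textHandlers = new HashMap<>();
    // One text buffer per currently open element.
    private final Deque<StringBuilder> buffers = new ArrayDeque<>();

    // Register a callback for the text content of elements named qName.
    DispatchingHandler onText(String qName, Consumer<String> handler) {
        textHandlers.put(qName, handler);
        return this;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        buffers.push(new StringBuilder());
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // characters() may be called several times for one text node, so accumulate.
        if (!buffers.isEmpty()) {
            buffers.peek().append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String text = buffers.pop().toString().trim();
        Consumer<String> handler = textHandlers.get(qName);
        if (handler != null) {
            handler.accept(text);
        }
    }
}

// Hypothetical usage: collect the text of every <title> element.
class SaxDispatchDemo {
    static List<String> collectTitles(String xml) {
        List<String> titles = new ArrayList<>();
        DispatchingHandler handler = new DispatchingHandler().onText("title", titles::add);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            // Wrapped to keep the sketch short; real code would handle this properly.
            throw new IllegalStateException(e);
        }
        return titles;
    }
}
```

    Each callback stays a few lines long, and adding another element of interest means registering one more callback instead of growing startElement and endElement.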

    If possible, have you considered running an XSLT transform up front to catch the data you need in a new XML file which can be DOM parsed?
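
    For reference, a rough sketch of the XSLT-then-DOM idea using the JDK's javax.xml.transform API. The stylesheet, the element names title and titles, and the class XsltPrefilter are hypothetical, and the result goes straight into a DOMResult rather than an intermediate file to keep the example short. A typical XSLT 1.0 processor still builds an in-memory tree of the whole input, so this by itself does not lower peak memory.

```java
import java.io.StringReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;

class XsltPrefilter {
    // Hypothetical stylesheet: copies every <title> element under a new <titles> root
    // and drops everything else.
    static final String STYLESHEET =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
          + "<xsl:template match='/'>"
          + "<titles><xsl:copy-of select='//title'/></titles>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";

    // Runs the stylesheet over the input and returns the reduced document as a DOM tree.
    static Document extract(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            DOMResult result = new DOMResult(); // the transformer creates the Document node
            transformer.transform(new StreamSource(new StringReader(xml)), result);
            return (Document) result.getNode();
        } catch (Exception e) {
            // Wrapped to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }
}
```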

    Quote Originally posted by Feirefiz
    Up to hundreds of megabytes on disk for single documents. I don't doubt that DOM can handle that in general, but on the university computers and my poor old notebook, where it is supposed to run in practice, it has been a pain in the ass.

    Ouch. DOM will handle it, but there goes all your memory.

  5. #5
    Feirefiz (Elephant) | Registered: Feb 2009 | Location: Germany | Posts: 802

    Quote Originally posted by CatInASuit
    SAX is event-driven, so you are probably stuck with a lot of if-then-else to catch the elements you want, followed by adding the same checks to the characters block to get the data out.

    I was just thinking that perhaps I am missing some way to organize it more elegantly or break it down into smaller units.

    Quote Originally posted by CatInASuit
    If possible, have you considered running an XSLT transform up front to catch the data you need in a new XML file which can be DOM parsed?

    Yes, I have. My problem is that all the free XSLT processors I have found will also read the whole thing into memory. That is the most advantageous approach in the general case, and most processors don't bother with optimizations for the special cases where it isn't. (The Enterprise version of Saxon does, for example, but that's not an option for me.)

    In general I can do preprocessing to save time, but everything has to be reasonably reliable and easy to replicate from the "official" third-party files. That means that ideally I should keep the requirements for all steps within similar limits.

    Quote Originally posted by CatInASuit
    Ouch. DOM will handle it, but there goes all your memory.

    It is computational linguistics stuff, and there you often work with data in that size range and bigger. Unfortunately people have very different ideas of reasonable ways to package such a resource. Right now I am working with data from someone who chose the opposite direction: ca. 200 MB and over 20 000 XML files.

  6. #6
    CatInASuit (Administrator) | Registered: Feb 2009 | Location: Coulsdon Cat Basket | Posts: 10,342

    For coding, sometimes there is no nice way to present it.

    Looks like Xalan can be set to do incremental transforms of XML docs so it doesn't have to read everything into memory.

  7. #7
    Feirefiz (Elephant) | Registered: Feb 2009 | Location: Germany | Posts: 802

    Over the weekend I had a look at StAX, and so far I really like it. I don't know who else has tried it, but it addresses some of my reservations about SAX. The most important point is that because it is a pull parser, i.e. you initiate the parser events yourself, you are not forced to work with a single entry point and can choose your own modular structure.
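
    To illustrate the modular structure a pull parser allows, here is a minimal sketch using the JDK's javax.xml.stream API. The document layout (book elements with title and author children) and the names StaxBookReader, Book, readBooks and readBook are assumptions made up for the example. The point is that each element type can get its own small method that pulls exactly the events it needs.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxBookReader {
    // Hypothetical value type for the illustration.
    static final class Book {
        final String title;
        final String author;

        Book(String title, String author) {
            this.title = title;
            this.author = author;
        }
    }

    // Top-level loop: only looks for <book> and delegates the rest.
    static List<Book> readBooks(String xml) {
        List<Book> books = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals("book")) {
                    books.add(readBook(reader));
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            // Wrapped to keep the sketch short.
            throw new IllegalStateException(e);
        }
        return books;
    }

    // Called with the reader positioned on <book>; returns once </book> has been consumed.
    private static Book readBook(XMLStreamReader reader) throws XMLStreamException {
        String title = null;
        String author = null;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                switch (reader.getLocalName()) {
                    case "title":
                        title = reader.getElementText();
                        break;
                    case "author":
                        author = reader.getElementText();
                        break;
                    default:
                        break; // other children are simply walked through
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && reader.getLocalName().equals("book")) {
                break; // leaves the while loop
            }
        }
        return new Book(title, author);
    }
}
```

    The parsing state lives in local variables of the method that needs it, instead of in fields shared across all handler callbacks.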
