How do I process a large xml file?

Hi,

I am a hobbyist. I have written a style sheet (XSLT) which I want to use to process an XML file. I have run some tests, using xsltproc, on a sample XML and it all works well.

The XML file I am wanting to process is 650 GB in size. When I run the xsltproc command the response is "Killed". I had hoped that xsltproc would step through the XML file, creating output as it goes. This does not seem to be the case.

I am running Mageia x64. I have added a 1 TB disk to the system and set it as a swap partition.

Does anyone know of an XML parser that will work with "limited" memory, or a method by which I can apply the style sheet? I have experience with C, and if all else fails will write a C program to do what I want the style sheet to do, but I would rather get the style sheet working.

Any advice appreciated.

Hugh

On Wed, Jan 27, 2016 at 05:43:50PM +1100, h via luv-main wrote:
[...] I have run some tests, using xsltproc, on a sample xml and it all works well.
So I'll assume your style sheet is correct.
The xml file I am wanting to process is 650 GB in size.
I'd rather ask why you are processing such large XML documents. Is there any way you can chunk them into smaller pieces?
When I run the xsltproc command the response is "Killed". I had hoped that xsltproc would step through the xml file, creating output as it goes. This does not seem to be the case.
Nope, it'll parse the entire XML and build the DOM tree in memory, so it would be getting OOM killed...
I am running Mageia x64. I have added a 1T disk to the system and set it as a swap partition.
...or for some reason, crashing while getting paged out to swap.
Does anyone know of an xml parser that will work with "limited" memory? - or a method by which I can apply the style sheet.
You want a SAX (Simple API for XML) parser, which allows you to process the data in a stream by registering callbacks, although that's relatively complicated. I'm assuming that by using xsltproc, all you really need is a CLI; you could try using ``saxon''[1], but it depends on a JVM, and if that's a problem...
I have experience with C, and if all else fails will write a C program to do what I want the style sheet to do, but I would rather get the style sheet working.
...you could probably write something that uses libxml2's xmlReader[2]; this should allow you to walk the tree iteratively and process each node in sequence (presumably calling xsltProcessOneNode from libxslt). There are also Python bindings, if C wasn't already your thing.

[1] http://www.saxonica.com/html/documentation/using-xsl/commandline.html
[2] http://www.xmlsoft.org/xmlreader.html
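To give a feel for the "process each node in sequence" idea, here is a minimal sketch in Python. Note that it does not use libxml2's xmlReader or its Python bindings; it swaps in the standard library's xml.etree.ElementTree.iterparse, which offers a similar pull-style walk. The element names (items, item, name), the sample data, and the helper names stream_records and collect_names are all made up for illustration, and the sketch assumes the records are direct children of the root element.

```python
import io
import xml.etree.ElementTree as ET


def stream_records(source, record_tag):
    """Yield each <record_tag> element in turn, then discard it.

    Assumes (for this sketch) that records are direct children of the root.
    """
    # Ask for "start" events too, so the very first event hands us the root.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            yield elem
            # Detach everything parsed so far from the root, so the
            # in-memory tree does not grow with the size of the file.
            root.clear()


def collect_names(source):
    """Hypothetical per-record work: pull out an id and a name."""
    return [
        (item.get("id"), item.findtext("name"))
        for item in stream_records(source, "item")
    ]


# Hypothetical sample data; a real run would pass a filename instead.
SAMPLE = b"""<items>
  <item id="1"><name>alpha</name></item>
  <item id="2"><name>beta</name></item>
  <item id="3"><name>gamma</name></item>
</items>"""

if __name__ == "__main__":
    for ident, name in collect_names(io.BytesIO(SAMPLE)):
        print(ident, name)
```

The point of the root.clear() call is that only one record needs to be held at a time. This does not apply an XSLT style sheet; whatever the style sheet does per record would have to be rewritten as ordinary code, or each record handed to a transform separately.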

Joel W. Shea via luv-main writes:
You want a SAX (Simple API for XML) parser, which allows you to process the data in a stream by registering callbacks, although that's relatively complicated. ...you could probably write something that uses libxml2's xmlReader[2], [...] There are also Python bindings, if C wasn't already your thing.
So *that's* why Python's XML parser UI was so flipping complicated!
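For reference, the callback style being discussed looks roughly like this with the xml.sax module from Python's standard library. This is only a sketch: the element name (name), the sample data, and the NameCollector and extract_names helpers are invented for illustration. The parser never builds a tree; it calls the handler methods as it reads, so the handler has to keep track of where it is in the document by itself, which is where the complexity comes from.

```python
import io
import xml.sax


class NameCollector(xml.sax.ContentHandler):
    """Collect the text of every <name> element as the parser streams past."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._buffer = None  # None means "not currently inside <name>"

    def startElement(self, tag, attrs):
        if tag == "name":
            self._buffer = []

    def characters(self, content):
        # Text can arrive in several pieces, so accumulate it.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, tag):
        if tag == "name" and self._buffer is not None:
            self.names.append("".join(self._buffer))
            self._buffer = None


def extract_names(source):
    """Run the SAX parser over a filename or file object."""
    handler = NameCollector()
    xml.sax.parse(source, handler)
    return handler.names


# Hypothetical sample data; a real run would pass a filename instead.
SAMPLE = b"""<items>
  <item><name>alpha</name></item>
  <item><name>beta</name></item>
</items>"""

if __name__ == "__main__":
    print(extract_names(io.BytesIO(SAMPLE)))
```

Memory use here depends on what the handler chooses to keep, not on the size of the input; this sketch keeps a list of names, whereas a handler for a very large file would more likely write its output as it goes.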
participants (3): h, Joel W. Shea, trentbuck@gmail.com