Originally Published: 2015-05-06

If you try processing a large XML file with Saxon-HE on the command line, you may run into the following error:

If you increase Saxon’s memory allocation, e.g. by setting the environment variable JAVA_TOOL_OPTIONS="-Xmx16G", you may still get this error:

The error java.lang.ArrayIndexOutOfBoundsException: -32768 doesn’t seem to be fixable by giving Saxon more memory. This is a known bug in older versions of Saxon. Unfortunately, the “fix” appears to simply acknowledge that Saxon won’t handle large source documents, by explicitly throwing the error java.lang.IllegalStateException: Source document too large: more than 1G characters in text nodes.
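To get a rough idea of whether a given file is likely to hit this limit before starting a long transformation, you can total the text-node characters with a streaming parser. The following is a minimal sketch using the JDK's StAX API, not anything Saxon provides. The class name TextNodeCounter and the constant ASSUMED_LIMIT are made up for illustration, and reading "1G" as 2^30 characters is an assumption. The count is in Java chars (UTF-16 code units), which may not match exactly how Saxon counts.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class TextNodeCounter {
    // Assumption: "1G" in Saxon's error message means 2^30 characters.
    static final long ASSUMED_LIMIT = 1L << 30;

    // Streams through the document and sums the length of every text event,
    // so memory use stays flat regardless of file size.
    static long countTextChars(InputStream in) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        long total = 0;
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA
                        || event == XMLStreamConstants.SPACE) {
                    // A single text node may arrive as several events; the sum is the same.
                    total += reader.getTextLength();
                }
            }
        } finally {
            reader.close();
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            long total = countTextChars(in);
            System.out.println(total + " characters in text nodes");
            System.out.println(total > ASSUMED_LIMIT
                    ? "over the assumed 1G limit"
                    : "under the assumed 1G limit");
        }
    }
}
```

Compile it and pass the path of the XML file as the only argument. Depending on the JDK, the built-in parser's own processing limits (for example the jdk.xml.totalEntitySizeLimit system property) may need to be relaxed for very large inputs; treat that as something to check rather than a given.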

To get around this, I resorted to using Xalan-J. Download the binary distribution, unzip it, and then you can run it with e.g.:

JAVA_TOOL_OPTIONS="-Xmx16G" java -classpath ~/source/xalan-j_2_7_2/xalan.jar org.apache.xalan.xslt.Process -INCREMENTAL -IN enwiktionary-20150413-pages-meta-current.xml -XSL filterlatin.xsl -OUT latin.xml

Unfortunately, this limits you to XSLT 1.0 stylesheets.

Note that if you don’t give Xalan-J enough memory, you can still run into errors, which show up as an org.apache.xml.utils.WrappedRuntimeException reported against the stylesheet.

Thanks to Yakov Shafranovich for pointing out Xalan for this process.