If you try processing a large XML file with Saxon-HE on the command line, you may run into an out-of-memory error.
If you increase Saxon’s memory allocation, e.g. by setting the environment variable JAVA_TOOL_OPTIONS="-Xmx16G", you may still get this error:
java.lang.ArrayIndexOutOfBoundsException: -32768
This one doesn’t seem to be fixable by giving Saxon more memory. It is a known bug in versions of Saxon prior to 18.104.22.168. Unfortunately, the “fix” appears to simply acknowledge that Saxon won’t handle large source documents, by explicitly throwing the error
java.lang.IllegalStateException: Source document too large: more than 1G characters in text nodes.
To get around this, I resorted to using Xalan-J. Download the binary distribution, unzip it, and then you can run it with e.g.:
JAVA_TOOL_OPTIONS="-Xmx16G" java -classpath ~/source/xalan-j_2_7_2/xalan.jar org.apache.xalan.xslt.Process -INCREMENTAL -IN enwiktionary-20150413-pages-meta-current.xml -XSL filterlatin.xsl -OUT latin.xml
Unfortunately, this limits you to XSLT 1.0 stylesheets.
Note that if you don’t give Xalan-J enough memory, you can still run into errors, which show up as an
org.apache.xml.utils.WrappedRuntimeException error in the stylesheet.
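If you are unsure whether the -Xmx setting in JAVA_TOOL_OPTIONS is actually reaching the JVM, a quick check is to ask the JVM for its maximum heap size. The following is a minimal sketch, not part of Saxon or Xalan-J; the class name HeapCheck and the method maxHeapMiB are made up for illustration.

```java
public class HeapCheck {
    // Maximum amount of memory the JVM will attempt to use for the heap, in MiB.
    // Runtime.maxMemory() reflects -Xmx, although the exact figure can be
    // slightly lower than the requested value depending on the garbage collector.
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

Assuming Java 11 or later, which can launch a single source file directly, you can run it with JAVA_TOOL_OPTIONS="-Xmx16G" java HeapCheck.java. The reported value should be in the neighbourhood of 16384 MiB; if it is much smaller, the option is probably not being picked up.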
Thanks to Yakov Shafranovich for pointing out Xalan for this process.