/etc

tags: ocr, docker, mac
Originally Published: 2015-03-18

Update (2015-09-08):

A pull request I submitted to Homebrew to add a --with-opencl option to the tesseract formula has now been accepted, so you should be able to just do brew install --HEAD --with-opencl tesseract. For issues with OpenCL-enabled Tesseract on OS X, please see this issue.


After coming across these instructions for building Tesseract with OpenCL support, I wanted to experiment with this feature to see if it would enable faster OCR processing. I also came across this blog post experimenting with the feature under Linux and Windows, but I wanted to try it on Mac OS X and AWS EC2 GPU instances.

Using Mac OS X with Homebrew

Here I built off my existing work modifying the Tesseract Homebrew formula to install the Tesseract training tools.

The only gotcha (as I serendipitously found out) is that there appears to be a bug in the OpenCL build under OS X that causes it to fail if there is no /opt/local directory for it to include. I didn’t feel like fixing this, but you can work around it by running sudo mkdir -p /opt/local before installing with the command:

brew install --training-tools --all-languages --opencl --HEAD https://github.com/ryanfb/homebrew/raw/tesseract_training/Library/Formula/tesseract.rb
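As a rough sketch, the workaround and the install can be wrapped in one small script that only creates the directory when it is missing. The brew options and formula URL are copied from the command above; the names run, install_opencl_tesseract, and DRY_RUN are made up for illustration and are not part of Homebrew.

```shell
#!/bin/sh
# Sketch only: create the directory the OpenCL build expects, then install.
# FORMULA_URL and the brew options are taken from the command in the post.
# DRY_RUN, run, and install_opencl_tesseract are hypothetical helper names.
FORMULA_URL="https://github.com/ryanfb/homebrew/raw/tesseract_training/Library/Formula/tesseract.rb"

run() {
  # In dry-run mode, print the command instead of executing it.
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

install_opencl_tesseract() {
  prefix_dir="${1:-/opt/local}"
  # Only touch the filesystem if the directory is actually missing.
  if [ ! -d "$prefix_dir" ]; then
    run sudo mkdir -p "$prefix_dir"
  fi
  run brew install --training-tools --all-languages --opencl --HEAD "$FORMULA_URL"
}
```

Setting DRY_RUN=1 before calling install_opencl_tesseract prints the commands without running them, which is a cheap way to check what would happen before handing anything to sudo.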

If all went well, you should now have an OpenCL-enabled build of Tesseract.

Using an AWS GPU-Enabled Docker Host

For this I built off my existing work using Docker for VisualSFM under AWS. I’ve published the Docker build for this on Docker Hub as ryanfb/tesseract-opencl. For clarity, I’ll repeat the instructions for using this on EC2 here:

Results

With OpenCL support enabled, the first run of tesseract performs some automatic device detection and profiling and saves the results to various .bin files and a tesseract_opencl_profile_devices.dat file in the current working directory, which it re-uses on subsequent runs.
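Because these files land in whatever directory you run from, it can be handy to see what was written, or to clear it out. Here is a minimal sketch: the .dat file name comes from the description above, while matching every *.bin file is an assumption (the individual .bin names aren't listed here), and the function names are hypothetical. It is also an assumption that removing the files makes the next run profile again.

```shell
# Sketch: list or clear the OpenCL profiling cache in a directory.
# list_opencl_cache and clear_opencl_cache are illustrative names.
list_opencl_cache() {
  dir="${1:-.}"
  # Assumption: any *.bin file directly in this directory belongs to the cache.
  find "$dir" -maxdepth 1 -type f \
    \( -name '*.bin' -o -name 'tesseract_opencl_profile_devices.dat' \) | sort
}

clear_opencl_cache() {
  dir="${1:-.}"
  # Assumed effect: with the cache gone, the next tesseract run profiles again.
  list_opencl_cache "$dir" | while IFS= read -r f; do
    rm -f -- "$f"
  done
}
```

Run list_opencl_cache first and check the output before clearing anything, since the *.bin pattern will also match unrelated .bin files that happen to live in the same directory.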

Here’s the diagnostic information for the three machines I tested with:

Here, (null) is the non-OpenCL Tesseract implementation (i.e. what you get if you build without OpenCL). You can see that on OS X, the OpenCL implementation also detects/reports the CPU as an available device for OpenCL. “Score” is the result of the timing profile, so higher values are worse. I’m not sure if the profiling/timing is correct on OS X or if the OpenCL implementation is simply always outperformed by the general implementation, but we can see on both sets of hardware here that the general implementation is what gets selected.

The AWS EC2 g2.2xlarge results appeared promising, but in practice (testing my OCR process against this 527-page volume) I didn’t notice a giant speed improvement over running it on my iMac (about 20 minutes on EC2 vs. 30 on the iMac).
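To put those rough timings in perspective, here is a quick back-of-the-envelope calculation. It reads the parenthetical as about 20 minutes on EC2 and 30 on the iMac, which is an interpretation of approximate figures rather than a measured benchmark; the function names are just for illustration.

```shell
# Sketch: throughput and relative speedup from the approximate timings.
# Assumption: 20 min is the EC2 run and 30 min is the iMac run.
throughput() {
  # pages per minute, given a page count and minutes
  awk -v p="$1" -v m="$2" 'BEGIN { printf "%.2f\n", p / m }'
}

speedup() {
  # how many times faster the quicker run is, given slow and fast minutes
  awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.2f\n", slow / fast }'
}

throughput 527 20   # prints 26.35 (pages per minute, EC2)
throughput 527 30   # prints 17.57 (pages per minute, iMac)
speedup 30 20       # prints 1.50
```

So the GPU instance comes out around 1.5x faster on this volume, which is a real gain but not a dramatic one.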

So I think I’ll be sticking to building Tesseract without OpenCL for now. There are still great parallelization improvements that could be made in Tesseract, especially in the training process, but the current OpenCL implementation doesn’t appear to have completely solved that problem.