This is a short writeup of the working process I came up with for command-line OCR of a non-OCR’d PDF with searchable PDF output on OS X, after running into a thousand little gotchas. 1
Software Installation
- Install Homebrew (if you haven’t already).
- Install ImageMagick (needs TIFF and Ghostscript support):
  brew install imagemagick
- Install Tesseract with all languages:
  brew install tesseract tesseract-lang
- Install pdftk server from the package installer.
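Before moving on, it can help to confirm that everything actually landed on your PATH. The following is a minimal sketch rather than part of the original workflow: `missing_tools` is a hypothetical helper name, and the command names passed to it are simply the ones used in the steps of this writeup.

```shell
#!/usr/bin/env bash
# Print the name of every command in the argument list that is not on PATH.
# Returns 0 if all commands were found, 1 otherwise.
# (missing_tools is a hypothetical helper, not something these packages ship.)
missing_tools() {
  local tool rc
  rc=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      rc=1
    fi
  done
  return "$rc"
}
```

Calling `missing_tools convert tesseract pdftk` should print nothing if all three installs succeeded, and otherwise list whichever commands the shell cannot find.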
Processing Workflow
I’m going to assume you have a non-OCR’d PDF you want to convert into a searchable PDF.
- Split and convert the PDF with ImageMagick convert:
  convert -density 300 input.pdf -type Grayscale -compress lzw -background white +matte -depth 32 page_%05d.tif
- OCR the pages with Tesseract: 2 3
  for i in page_*.tif; do echo "$i"; tesseract "$i" "$(basename "$i" .tif)" pdf; done
- Join your individual PDF files into a single, searchable PDF with pdftk: 4
  pdftk page_*.pdf cat output merged.pdf
Now merged.pdf should contain your searchable, OCR’d PDF. I’ve wrapped this workflow up into a script; alternatively, you may want to see if the robust OCRmyPDF script works for your needs.
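For reference, the three steps can be strung together roughly as follows. This is a sketch under stated assumptions, not the script mentioned above: `ocr_pdf` and the `CONVERT`, `TESSERACT`, and `PDFTK` override variables are hypothetical names introduced here, and the intermediate `page_*` files are written to the current directory, so it is best run in an empty working directory.

```shell
#!/usr/bin/env bash
# Sketch only: ocr_pdf and the CONVERT/TESSERACT/PDFTK override variables are
# hypothetical names, not part of ImageMagick, Tesseract, or pdftk.
# Intermediate page_* files are written to the current directory.
ocr_pdf() {
  local input="$1"
  local output="${2:-merged.pdf}"
  local convert="${CONVERT:-convert}"
  local tesseract="${TESSERACT:-tesseract}"
  local pdftk="${PDFTK:-pdftk}"
  local page

  # Step 1: split the PDF into one TIFF per page (flags copied from above).
  "$convert" -density 300 "$input" -type Grayscale -compress lzw \
    -background white +matte -depth 32 page_%05d.tif || return 1

  # Step 2: OCR each page. The output base name is the file name minus .tif,
  # and tesseract appends .pdf to it.
  for page in page_*.tif; do
    "$tesseract" "$page" "${page%.tif}" pdf || return 1
  done

  # Step 3: join the per-page PDFs; the glob expands in sorted order.
  "$pdftk" page_*.pdf cat output "$output"
}
```

With this defined, `ocr_pdf input.pdf merged.pdf` would run the same commands as the list above, stopping at the first step that fails.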
Footnotes
- A sampling of the various ways in which Tesseract/Leptonica is picky in its TIFF handling:
  Error in pixConvertRGBToGray: pixs not 32 bpp
  Error in pixReadFromTiffStream: spp not in set
  Error in pixReadStreamTiff: pix not read
  Error in pixReadTiff: pix not read
  Error in pixRead: pix not read
  Error in findTiffCompression: function not present
  Error in pixReadStream: Unknown format: no pix returned
  Error in pixReadStream: tiff: no pix returned
  Unsupported image type.
- If your document isn’t in English, pass the -l tla flag as the first argument to tesseract. See the LANGUAGES section of man tesseract. You can also install and use your own training data, for example, for Ancient Greek or Latin. On OS X, you’ll want to copy the lang.traineddata file to /usr/local/share/tessdata.
- If you have GNU Parallel installed (brew install parallel), you can parallelize this process:
  parallel --bar "tesseract {} {.} pdf 2>/dev/null" ::: page_*.tif
- I initially tried to use the join.py Preview Automator script that comes bundled with OS X (at /System/Library/Automator/Combine\ PDF\ Pages.action/Contents/Resources/join.py), but for me this mangles the actual OCR text into unsearchable whitespace. Confusingly, it preserves selectable line/character bounding boxes, so it looks like there’s OCR’d text there when there isn’t. I originally suggested using Ghostscript to combine the PDF files with the command:
  gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=merged.pdf page_*.pdf
  However, this mangles non-Latin scripts. If you would still like to use Ghostscript instead of pdftk, the following command may give you good, relatively compressed results (from explicitly setting a more modern PDF compatibility level) while preserving non-Latin scripts:
  gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dProvideUnicode -sOutputFile=merged.pdf page_*.pdf
  I realized at the end of writing this guide that you can also use convert to create a multipage TIFF (omit the _%05d format specifier in your output filename) and process/output that directly with Tesseract, but I like being able to parallelize the OCR (see footnote 3), and recombining with pdftk gives me better compression in my testing.
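The multipage TIFF variant from the last footnote might look something like the sketch below. It has not been compared against the per-page workflow for compression or speed. `ocr_multipage` and the `CONVERT` and `TESSERACT` override variables are hypothetical names introduced for this example, and the optional third argument is the language code passed to -l as described in footnote 2.

```shell
#!/usr/bin/env bash
# Sketch only: ocr_multipage and the CONVERT/TESSERACT override variables are
# hypothetical names for this example.
ocr_multipage() {
  local input="$1"
  local base="${2:-merged}"
  local ocr_lang="${3:-}"
  local convert="${CONVERT:-convert}"
  local tesseract="${TESSERACT:-tesseract}"

  # No _%05d in the output name, so convert writes one multipage TIFF.
  "$convert" -density 300 "$input" -type Grayscale -compress lzw \
    -background white +matte -depth 32 "$base.tif" || return 1

  # Tesseract writes "$base.pdf"; -l goes first, as footnote 2 describes.
  if [ -n "$ocr_lang" ]; then
    "$tesseract" -l "$ocr_lang" "$base.tif" "$base" pdf
  else
    "$tesseract" "$base.tif" "$base" pdf
  fi
}
```

For example, `ocr_multipage input.pdf merged deu` would aim to produce merged.pdf from a German document, assuming the German training data is installed. This skips the pdftk step entirely, at the cost of not being able to parallelize the OCR across pages.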