Latin OCR for Tesseract
ryanfb.github.io

Latin OCR training data and tools for Tesseract, based on Nick White's Ancient Greek OCR for Tesseract.

Development Resources

Downloads

v0.3.0 - move training process into Tesseract's new tesstrain.sh system.
v0.2.2 - add training on more ligatured forms & glyphs, tweak dictionaries.
v0.2.1 - add training on various punctuation marks and new fonts.
v0.2.0 - fix use of Tesseract character blacklist, vastly improving accuracy.
v0.1.0 - rebuild training under a stable environment.
v0.1.0-alpha2 - add training on bold and italic font variants.
v0.1.0-alpha1 - initial training file prerelease.

Instructions: OS X / Linux / Windows
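
Once lat.traineddata is installed, Tesseract can be told to use it through its -l option. The following is a minimal sketch rather than the project's own instructions: it only assembles the command line, the helper name build_tesseract_command and the file names page.png and page are hypothetical, and it assumes a Tesseract build that accepts --tessdata-dir (3.04 or later).

```python
import shlex
from typing import List, Optional


def build_tesseract_command(
    image: str,
    output_base: str,
    lang: str = "lat",
    tessdata_dir: Optional[str] = None,
) -> List[str]:
    """Return the argv list for a Tesseract run with the given language.

    Hypothetical helper: it builds the command but does not execute it.
    """
    # Basic form: tesseract <image> <output base> -l <language code>
    cmd = ["tesseract", image, output_base, "-l", lang]
    if tessdata_dir is not None:
        # Assumption: lat.traineddata lives in a non-default tessdata directory.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


if __name__ == "__main__":
    # Print the command so it can be inspected or pasted into a shell.
    print(shlex.join(build_tesseract_command("page.png", "page")))
```

To actually run the OCR, the returned list can be passed to subprocess.run, provided the tesseract binary is on the PATH.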

Code

latinocr-lat - The final training process for lat.traineddata. Includes results from the latinocr-lattraining process.
latinocr-lattraining - Rules and tools to deterministically generate all prerequisites for the final training process.
latinocr-lattestfodder - Latin page scans and ground truth text for testing OCR accuracy.
tesseract_latinocr_docker - A Dockerfile for building lat.traineddata from scratch.

Training Repo Diagram

Contact

e-mail / twitter