Latin OCR Development Resources
Corpora & wordlists:
- Perseus
  - 240,000 attested words, 49,000 lemmata, manually-entered editions
- Bruce Robertson’s Rigaudon dictionary
- Open Greek and Latin Project
  - Machine-corrected versions of Lace OCR output
- Lace
  - Raw OCR output of Rigaudon (Gamera-based process)
- David Bamman: 11K Latin Texts
  - 3.9 GB compressed, 11,261 texts, 1.38 billion tokens, OCR-derived
- CAMENA (Corpus Automatum Multiplex Electorum Neolatinitatis Auctorum)
- Words program of the late William Whitaker
  - 1M word forms derived from 39,000 lemmata
- Spell checking lexicon for OpenOffice/LibreOffice constructed by Karl Zeiler
- Latin words in the English Wiktionary
  - 655,434 word forms for 32,860 roots
- Springmann et al. canonical lexicon (forthcoming?)
  - 2M word forms, 70,000 lemmata
- Bibliotheca Latina IntraText
- CLTK corpora
- Pleiades - 22,140 Latin place names
- Johann Ramminger’s Neulateinische Wortliste (Neolatin wordlist)
  - “The NLW now contains 19225 lemmas with 1928 Variants […] The Sigelliste of NLW contains 2675 authors and 7402 works abbreviations.”
- Bibliotheca Augustana
- Musisque Deoque and Poeti d’Italia
- Corpus Grammaticorum Latinorum (CGL)
- Corpus Thomisticum
- Croatiae auctores Latini (CroALa)
- Index Thomisticus Treebank
- The PROIEL Treebank
- The Ancient Greek and Latin Dependency Treebanks
- Integrating Digital Papyrology (IDP)
- See also: Digital Critical Editions of Texts in Greek and Latin
Publications & related projects:
- OCR of Historical Printings of Latin Texts: Problems, Prospects, Progress, Springmann et al., 2014
- Improving OCR Accuracy for Classical Critical Editions, Boschetti et al., 2009
- Himeros.eu Latin OCR Trainings for Tesseract, Boschetti, 2008
- A Document Recognition System for Early Modern Latin, Reddy & Crane, 2006
  - Based on Gamera and Perseus
- McGillivray, Barbara. Methods in Latin Computational Linguistics. Leiden: Brill, 2014.
- eMOP: Early Modern OCR Project
- Towards a Network Model of the Coreness of Texts: An Experiment in Classifying Latin Texts Using the TTLab Latin Tagger, Mehler et al., 2014
- On OCR ground truths and OCR post-correction gold standards, tools and formats, Reynaert, 2014
- PoCoTo - an open source system for efficient interactive postcorrection of OCRed historical texts, Vobl et al., 2014
- A Fast Alignment Scheme for Automatic OCR Evaluation of Books, Yalniz & Manmatha, 2011
- OCR and the transformation of the Humanities, Crane, 2011
- Digitizing Latin Incunabula: Challenges, Methods, and Possibilities, Rydberg-Cox, 2009
- Automatic disambiguation of Latin abbreviations in early modern texts for humanities digital libraries, Rydberg-Cox, 2003
Non-Free Corpora
- Library of Latin Texts (LLT), Brepols
- Thesaurus linguae Latinae (TLL), De Gruyter
- PHI Latin Texts
- Bibliotheca Teubneriana Latina (BTL)
- LatinWordNet