User:Inductiveload/Scripts/DJVU OCR

This script performs Tesseract-based OCR on a DjVu file. It requires Tesseract and DjVuLibre; converting pages to bitonal images additionally requires ImageMagick.

Inspired by this Perl script. This script is designed for Linux, but it can be modified to run on Windows by changing the file paths as needed.
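
The script's source is not reproduced on this page, but the per-page workflow implied by the required tools can be sketched as follows. This is a minimal sketch under stated assumptions, not the script's actual code: the function names and file names are hypothetical. It only builds the command lines (`ddjvu` from DjVuLibre renders a page to TIFF, and `tesseract` writes its result to `<out_base>.txt`), so nothing runs until a list is handed to something like `subprocess.run`.

```python
def render_page_cmd(djvu_path, page, tiff_path):
    """Build the ddjvu command that renders one page (1-based) to a TIFF file."""
    return ["ddjvu", "-format=tiff", "-page=%d" % page, djvu_path, tiff_path]


def ocr_page_cmd(image_path, out_base, lang="eng"):
    """Build the tesseract command; Tesseract writes the text to out_base + '.txt'."""
    return ["tesseract", image_path, out_base, "-l", lang]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`. Keeping the commands as lists rather than shell strings avoids quoting problems with unusual file names.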

python ocr-djvu-tesseract.py -i ~/INFILE.djvu -u
 * OCR a DjVu file and update (-u) its text layer with the new text. Without -u, the DjVu file is left unaltered.
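
DjVuLibre's `djvused` is the tool that can write a hidden text layer, so an update step of the kind -u performs might resemble the sketch below. Whether the script does it this way is an assumption, and the function name and file names are hypothetical. Note that `set-txt` expects a file in `djvused`'s own S-expression text format, not plain text.

```python
def set_text_cmd(djvu_path, page, txt_path):
    """Build the djvused command that replaces one page's text layer and saves.

    Assumes txt_path contains no spaces, since it is embedded unquoted
    in the djvused command string.
    """
    script = "select %d; set-txt %s" % (page, txt_path)
    return ["djvused", djvu_path, "-e", script, "-s"]
```

The `-s` option is what makes `djvused` save the modified file; leaving it out matches the "unaltered" behaviour described above.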

python ocr-djvu-tesseract.py -i ~/INFILE.djvu -d
 * Enable debugging mode, which reports progress (-d)

python ocr-djvu-tesseract.py -i ~/INFILE.djvu -t
 * Enable Tesseract output (-t)

python ocr-djvu-tesseract.py -i ~/INFILE.djvu -b 50%
 * Use ImageMagick to convert to bitonal with the given threshold (-b)
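
As an illustration of what a threshold conversion looks like in ImageMagick, here is a small sketch that builds such a command. The function name is hypothetical, and the exact options the script passes are not stated on this page; the `-colorspace Gray` step is an addition so that colour scans are thresholded on brightness rather than per channel.

```python
def bitonal_cmd(src, dst, threshold="50%"):
    """Build an ImageMagick command that converts src to black-and-white.

    Pixels brighter than the threshold become white, the rest black.
    ImageMagick 7 installs the program as 'magick'; 'convert' is the
    legacy name used here.
    """
    return ["convert", src, "-colorspace", "Gray", "-threshold", threshold, dst]
```

A lower percentage turns more of the page white, which can help with dark scans; a higher one preserves faint strokes at the cost of more background noise.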

python ocr-djvu-tesseract.py -i ~/INFILE.djvu -o ~/TRANSCRIPT.txt
 * Write a human-readable transcript of the DjVu file to the given location (-o)

python ocr-djvu-tesseract.py -i ~/INFILE.djvu -l LANG
 * Use a Tesseract language other than English (-l), where LANG is a Tesseract language code
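
The options listed above could be declared with Python's `argparse` roughly as follows. This is a sketch only: the real script may parse its arguments differently, and the `dest` names and the `build_parser` function are hypothetical. The default of `eng` is Tesseract's code for English.

```python
import argparse


def build_parser():
    """Declare the command-line options described on this page."""
    parser = argparse.ArgumentParser(prog="ocr-djvu-tesseract.py")
    parser.add_argument("-i", dest="infile", required=True,
                        help="input DjVu file")
    parser.add_argument("-u", dest="update", action="store_true",
                        help="update the text layer of the DjVu file")
    parser.add_argument("-d", dest="debug", action="store_true",
                        help="debugging mode (report progress)")
    parser.add_argument("-t", dest="tesseract_output", action="store_true",
                        help="enable Tesseract output")
    parser.add_argument("-b", dest="bitonal", metavar="THRESHOLD",
                        help="convert to bitonal with this threshold")
    parser.add_argument("-o", dest="transcript", metavar="FILE",
                        help="write a human-readable transcript here")
    parser.add_argument("-l", dest="lang", default="eng",
                        help="Tesseract language code")
    return parser
```

For example, `build_parser().parse_args(["-i", "INFILE.djvu", "-u", "-l", "fra"])` yields a namespace with `update` set to `True` and `lang` set to `"fra"`. The options are independent, so they can be combined on one command line.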