User:Inductiveload/Tesseract

Image pre-processing
Removing small specks can have a major effect on the OCR quality:


 * F34558171 →
 * F34558172 →

An image pre-processor that removes these specks could substantially improve OCR performance.

Removing islands
Unconnected islands below a certain area can be removed with a simple OpenCV function. It expects a white-on-black binary image.

The island size needs to be carefully chosen to avoid deleting things like colons and dots of i's.

By inverting the image, you can also delete small white specks in letters, though these do not seem to be as lethal to the OCR as black specks.

Fonts
18th-century text is often printed using either Caslon (the original) or something very like it. It usually has more ligatures than modern fonts.

A derivative of Adobe Caslon Pro may be possible.

Notable changes:


 * Much tighter kerning after a long-s (ſ) in the regular font (italic already kerned well)
 * Bar on t reduced in length (modern fonts have made the bar more obvious, which causes t's to be easily mistaken for i's or l's)
 * Less prominent serifs on r
 * More space before :;!?
 * Heavier top serifs on u, to reduce how often it is mistaken for o


 * Variants: (in PUA at U+E100)
 * Higher bar on 'e' - option, since this is not always true - otherwise e → c errors
 * i,j with a missing dot - option
 * t with truncated bar:

To try:


 * Variant chars:
 * Add more damaged versions of glyphs to the font to prevent overfitting of the model (the model becomes too "fixated" on the perfect form of the 't'). Probably put them as a font feature.
 * E.g. t with a truncated top is mistaken for i, r or c
 * e with a light centre line → c
 * i with a heavy dot → r

Generate the ground-truth data
Construct "clean" text for the fonts, variants, styles, etc. that you want:

model: eng_oldcaslon_longs
text:
  dir: corpus/eng_longs
fonts:
  - face: Old Caslon
    sizes:
      - 25
    variants:
      regular: {}
      italic:
        italic: true
      smallcaps:
        smallcaps: true
        ratio: 0.1
    features:
      - features:
          - ss01
        rate: 0.05
      - features:
          - ss02
        rate: 0.005
      - features:
          - ss03
        rate: 0.005
process:
  - noise: 0.2
    erode: 3
  - noise: 0.3
    erode: 2
include_clean: false

 * Generate the images and output them to the tesstrain data directory:

./generate.py -c configs/eng_oldcaslon_longs.yml -o ~/src/tesstrain/data -m eng_oldcaslon_longs

Once you have ground truth data
 * Set your model name in the shell (match the model name used above):

export MODEL_NAME=eng_oldcaslon_longs


 * Train the model
 * This will take a long time (hours, if you set a high MAX_ITERATIONS), so go and proofread something
 * 20000 iterations seems to work OK; after that, overfitting seems more likely than improvement (0.2% error seems to be around the lower limit for now)

make training MODEL_NAME=$MODEL_NAME START_MODEL=eng TESSDATA=~/src/tessdata_best

First it will read the training files and set up the lstmf and box files, and you will see thousands of lines like this:

Tesseract Open Source OCR Engine v5.0.0-alpha-20210401-158-ge1761 with Leptonica
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/eng_oldprint-ground-truth/agrippa-occult.00420.png" -t "data/eng_oldprint-ground-truth/agrippa-occult.00420.gt.txt" > "data/eng_oldprint-ground-truth/agrippa-occult.00420.box"
+ tesseract data/eng_oldprint-ground-truth/agrippa-occult.00420.png data/eng_oldprint-ground-truth/agrippa-occult.00420 --psm 13 lstm.train

Then, it will start generating training output, and you will see the errors start to decrease:

At iteration 2132/30400/30400, Mean rms=0.148000%, delta=0.023000%, char train=0.071000%, word train=0.109000%, skip ratio=0.000000%, New worst char error = 0.071000 wrote checkpoint.
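To watch the error decrease without reading every line, the progress lines can be parsed. A minimal sketch, assuming the log format matches the sample line above; the names LOG_LINE and parse_progress are made up for illustration:

```python
import re

# Matches the progress lines shown in the sample training output.
LOG_LINE = re.compile(
    r"At iteration (?P<iteration>\d+)/\d+/\d+, "
    r"Mean rms=(?P<rms>[\d.]+)%, delta=(?P<delta>[\d.]+)%, "
    r"char train=(?P<char>[\d.]+)%, word train=(?P<word>[\d.]+)%"
)


def parse_progress(line: str):
    """Return the first iteration counter and the error percentages, or None."""
    match = LOG_LINE.search(line)
    if match is None:
        return None  # not a progress line
    return {
        "iteration": int(match["iteration"]),
        "rms": float(match["rms"]),
        "delta": float(match["delta"]),
        "char_train": float(match["char"]),
        "word_train": float(match["word"]),
    }
```

Feeding it the saved output of the training run line by line and keeping the non-None results gives a series of char_train values that can be plotted or checked for a plateau.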

At this point you can take any recent checkpoint file (one is generated every time the result gets 2% "better") for testing:

 * Create a traineddata file from the most recent checkpoint and copy it into place for use:

make traineddata CHECKPOINT_FILES="$(ls -t data/$MODEL_NAME/checkpoints/*.checkpoint | head -1)" MODEL_NAME=$MODEL_NAME TESSDATA=~/src/tessdata_best
cp $(ls -t data/$MODEL_NAME/tessdata_best/*.traineddata | head -1) ~/.local/share/tessdata/$MODEL_NAME.traineddata

 * Use it!

tesseract --tessdata-dir ~/.local/share/tessdata -l $MODEL_NAME image.jpg -

 * When it's done, the traineddata file is ready:

cp data/$MODEL_NAME.traineddata ~/.local/share/tessdata/$MODEL_NAME.traineddata


 * Continue training from that point (may need to increase MAX_ITERATIONS).
 * Beware that too much training on too little source data leads to overfitting - while the model may get better at the GT images, it gets less able to handle real life images that are not quite the same.

make training MODEL_NAME=$MODEL_NAME START_MODEL=$MODEL_NAME TESSDATA=data MAX_ITERATIONS=50000

Generate evaluation text
This can also be used to generate training data (but you will need a lot of it).

 * Generate an hOCR file of the image; using the model in question (hopefully!) gets you pretty close:

tesseract /tmp/theimage.jpg /tmp/hocr --tessdata-dir ~/.local/share/tessdata -l eng_oldcaslon_longs hocr


 * Extract the hOCR file to image/text pairs (the -b path is where the image is):

hocr-extract-images -b /tmp /tmp/hocr.hocr theimage-%03d.png
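For reference, hOCR marks each text line as an element with class ocr_line whose title attribute carries "bbox x0 y0 x1 y1". A minimal sketch of reading those boxes and the line text with only the Python standard library; it illustrates the data that the extraction step works from, not how hocr-extract-images is implemented, and the names LineCollector and hocr_lines are made up:

```python
import re
from html.parser import HTMLParser

BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class LineCollector(HTMLParser):
    """Collect (bbox, text) for every element with class ocr_line."""

    def __init__(self):
        super().__init__()
        self.lines = []    # finished (bbox, text) pairs
        self._bbox = None  # bbox of the line currently being read
        self._depth = 0    # open tags inside (and including) that line
        self._words = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._bbox is not None:
            self._depth += 1
        elif "ocr_line" in (attrs.get("class") or "").split():
            match = BBOX.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._depth = 1
                self._words = []

    def handle_endtag(self, tag):
        if self._bbox is None:
            return
        self._depth -= 1
        if self._depth == 0:  # the ocr_line element itself has closed
            self.lines.append((self._bbox, " ".join(self._words)))
            self._bbox = None

    def handle_data(self, data):
        if self._bbox is not None and data.strip():
            self._words.append(data.strip())


def hocr_lines(hocr: str):
    """Return a list of ((x0, y0, x1, y1), text) for each ocr_line."""
    parser = LineCollector()
    parser.feed(hocr)
    parser.close()
    return parser.lines
```

Each bbox can then be used to crop the page image, and the text becomes the first draft of the transcription to correct.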


 * Correct the text lines as needed.
 * This is a pain and really needs some kind of a tool to help


 * Copy/emplace the evaluation ground truths:
 * Remember the text files need to end in .gt.txt, not just .txt

cp -r your_images ~/src/tesstrain/data/eval_$MODEL_NAME
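Before generating the evaluation files it is worth checking that every line image has a transcription named like the .gt.txt files seen in the training output. A minimal sketch using only the standard library; the function name and the set of image suffixes are assumptions:

```python
from pathlib import Path

# Assumed image types; adjust to whatever the extraction step produced.
IMAGE_SUFFIXES = {".png", ".tif", ".tiff", ".jpg"}


def missing_transcriptions(directory):
    """Return the names of images that have no matching NAME.gt.txt file."""
    missing = []
    for image in sorted(Path(directory).iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        # line-001.png is expected to pair with line-001.gt.txt
        if not image.with_suffix(".gt.txt").exists():
            missing.append(image.name)
    return missing
```

An empty list means every image is paired. Anything listed either has no text yet or has a text file with the wrong ending.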


 * Generate lstmf files
 * This also generates an all-lstmf file, which lists all the lstmf files.

make lists MODEL_NAME=eval_$MODEL_NAME

 * Evaluate the model against those files:

lstmeval --model "data/${MODEL_NAME}.traineddata" --eval_listfile "data/eval_${MODEL_NAME}/all-lstmf"
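To spot-check individual lines alongside the overall evaluation, a character error rate can be computed from the edit distance between the corrected text and the OCR output. A minimal sketch in plain Python; this is a generic Levenshtein-based measure and is not claimed to match lstmeval's own calculation exactly:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def char_error_rate(truth: str, ocr: str) -> float:
    """Edit distance divided by the length of the ground-truth text."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(truth, ocr) / len(truth)
```

For example, "learned" read as "leamed" is two edits over seven characters, so char_error_rate("learned", "leamed") is 2/7.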

Progress so far

 * Long-s usually recognised
 * Some confusion between italic h and b
 * Occasional mistaken t → c/r



58 Terræ-F1lius. n" x1,

founded upon this politick ſuppoſition, that when they had got a new Frmcng houſe, they could ne- ver want new books; but by what means ſocver it was bu't, my lord Clarendon has the honour,

and we, his happy poſlcrity, the invaluable beneſic of it,

I ſhould think it an undertaking well worthy the laborious Mr. Hearne, to give the world an ac- count, from year to year, of the many incompa- rable tomes, which iſſue from that illuſtrious preſs. This, I apprehend, would do great honour to the univerſity, and to its leamed authors, ſince the cata- logue would not be crouded with any of thoſe he- retical, pernicious, and free-thinking tracts, which are the noiſom ſpawn of other modern preſſes: we ſhould ſind there no ill.meaning Eſſays upon human Underſtandmg, no Oceana's, no Hypotheſes of Liber- ty, no deſcants upon Original Contracls, nor en- quiries into the Stare of Nature, no Appeals to the Laity and common Senſe in matters of religion, no vindications of Conſcience and privare Judgment, no defences of Reſiſtance in any poſſible caſes, no apologies for the Revolution, and the preſent Go- vernment, &c. to ſully the Academical Types, and reproach the ſclemn Imprimatur of the univerſity ——New, accurate Editions of primitive Fathers, and antient Chronicles, or modern ſermons, and long ſyſturas of Logick, Metaphyſicks, and School-divinity are the ſolid productions of this auguſt Typographa- um————Such are the effects, and ſuch the advan- tages of reſtraining the lrcence of the preſs! How would letters flouriſh? how would arts revive? bow would religiou lift up her awful front? and how wculd the church rejoyce, if ſuch a whole- ſome check were put upon the preſs throughout the world ? l

But Printixg is not the only, not the principal uſe, tar which theſe ſupendous ſtone-walls weie

erected 3

Links

 * GT4HistOCR: ground truth of Fraktur https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR