Help:Gadget-ocr



There are two OCR (optical character recognition) gadgets that can be enabled to produce new text for scan pages if the existing file does not have an acceptable text layer.

Usually, you will not need these, as you can now use the "Transcribe text" button at the top right of the editor. There are instructions for that tool at mw:Help:Extension:Wikisource/Wikimedia OCR.

Both gadgets place a button in the editing toolbar in the Page namespace.

Because the gadgets use different OCR engines, one of them may perform better than the other on certain pages.

Enable the gadgets in your gadget preferences.

The OCR gadget


The "basic" OCR gadget uses Tesseract to generate new OCR text. Generally, this gadget is better than the Google OCR gadget at recognising text columns, but it produces more character errors.

This gadget uses an older version of Tesseract than the built-in OCR tool, so the built-in tool may give better results.

Google OCR


Clicking the Google OCR icon submits the page image to Google for processing.

Generally the accuracy is excellent, but text set in columns is sometimes not recognised as such. When that happens, each row is read straight across the page and lines from adjacent columns are interleaved in the output.
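
To make the interleaving problem concrete, here is a small Python sketch. It is only an illustration: the two-column page, the variable names and both functions are invented for this example, and it does not call Google's service or reproduce how either gadget works internally.

```python
from itertools import zip_longest

# Hypothetical two-column page: each column is a list of text lines.
left_column = ["The quick brown", "fox jumps over", "the lazy dog."]
right_column = ["Pack my box", "with five dozen", "liquor jugs."]


def read_by_column(columns):
    """Column-aware order: finish one column before starting the next."""
    return [line for column in columns for line in column]


def read_across_rows(columns):
    """Column-unaware order: each physical row is read straight across,
    so lines from neighbouring columns alternate (are interleaved)."""
    rows = zip_longest(*columns)  # pads shorter columns with None
    return [line for row in rows for line in row if line is not None]


if __name__ == "__main__":
    print("Columns recognised:")
    print("\n".join(read_by_column([left_column, right_column])))
    print()
    print("Columns not recognised (interleaved):")
    print("\n".join(read_across_rows([left_column, right_column])))
```

If a page comes out looking like the second listing, the text itself is usually still accurate but needs reordering, or you can try the other gadget on that page.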