Wikisource:WikiProject OCR

Instructions
The participants listed below are users who have access to some kind of OCR software and are willing to extract text from scanned documents.

Users who would like a text to be OCRed should place their request under the Requests section in the following format:



Note: "year published" should be when it was published in the U.S. as this will make determining the copyright status easier.

While these are the general instructions for requesting that a project be scanned, other users may have more specific instructions if they are to take on a project.

Requested uploads to Internet Archive
Uploading a scan from an external website to the Internet Archive saves the trouble of extracting the OCR text and converting to DjVu. Please follow the instructions at Help:Internet Archive/Requested uploads to request an upload to IA.

Instructions
Preference given to:
 * 1) Smaller requests
 * 2) Requests where obtaining the scans is easier (such as downloading a ZIP file instead of having to access each scan and download them all individually)
 * 3) Works that are hard to find in text form elsewhere on the Internet
 * 4) Works that I do not proofread

I will only work on two large projects at a time (first come, first served) and will fit smaller projects in between as I make time for them.

Instructions
Preference given to:
 * 1) Smaller requests
 * 2) Requests where obtaining the scans is easier (such as downloading a ZIP file instead of having to access each scan and download them all individually)
 * 3) Works that are hard to find in text form elsewhere on the Internet
 * 4) Works that I have not proofread

Current projects
World Revolution

Instructions
Preference given to:
 * 1) Larger or non-standard requests, or where image batch-processing or DjVu conversion is needed
 * 2) English requests
 * 3) Requests where obtaining the scans is hard (batch-downloading is my favourite bot activity)
 * 4) Works that are hard to find in text form elsewhere on the Internet
 * 5) Works that are likely to be proofread soon
 * 6) Large reference works which, even if not proofread soon, provide a valuable reference resource.

Requests

 * Artabanzanus (1896) William M. Ferrar. pages 162/314 novel in pdf format with doubled pages, scan seems fairly good otherwise. Thx in advance Misarxist (talk) 14:10, 20 March 2012 (UTC)
 * Done, via image splitting and the Internet Archive. Index at Index:Artabanzanus (Ferrar, 1896).djvu. Inductiveload— talk/contribs  22:25, 3 April 2012 (UTC)

Done

 * Single European Act (on Wikipedia), a European Union treaty of 1986. It's quite short (29 pages) and available in scanned PDF form. I've been looking for a text version for a while, but have never managed to find one. Blue-Haired Lawyer (talk) 18:07, 21 December 2008 (UTC)


 * Vlas Mikhaĭlovich Doroshevich (Дорошевич, Влас Михайлович) "The Way of the Cross" (translation by Stephen Graham, probably w:Stephen Graham (author)). Original Russian text in public domain (Doroshevich died in 1922). Book is public domain in USA (printed in 1916). --EugeneZelenko (talk) 03:41, 23 July 2009 (UTC)
 * Index:The Way of the Cross, Doroshevich, tr. Graham, 1916.djvu. Inductiveload—talk/contribs  17:41, 8 June 2011 (UTC)


 * Cyclopaedia, or Universal Dictionary of Arts and Sciences (on Wikipedia) (1728) - Ephraim Chambers. Seems to be about 1430, according to the TOC. --Rory096 02:59, 23 November 2006 (UTC)
 * Done via the Internet Archive. Index:Cyclopaedia, Chambers - Volume 1.djvu, Index:Cyclopaedia, Chambers - Volume 2.djvu, Index:Cyclopaedia, Chambers - Supplement, Volume 1.djvu, Index:Cyclopaedia, Chambers - Supplement, Volume 2.djvu - over 4000 pages in all! Inductiveload— talk/contribs  17:12, 19 November 2011 (UTC)


 * Letters of Junius (1772) - Junius. pages 358-394. Index:Letters of Junius, volume 2 (Woodfall, 1772).djvu
 * Done, but the noisy text with long-s has not OCR'd very cleanly. Is it sufficient? Inductiveload— talk/contribs  09:23, 22 February 2012 (UTC)
 * :( I don't think so. Thanks for trying. Moondyne (talk) 13:33, 22 February 2012 (UTC)
 * But I just found this. I might OK after all.  Moondyne (talk) 13:41, 22 February 2012 (UTC)
 * OK, sorry that didn't turn out so well. The OCR generated by Tesseract from that kind of scan is generally only really useful for match and split, since the noise and old-fashioned font work against clean OCR. Google has a much more powerful and well-tuned software for the job, but I don't know exactly what it is. Inductiveload— talk/contribs  18:40, 22 February 2012 (UTC)

OCR bot
There is an automatic tool for OCRing a single page at a time, which is useful for repairing text on pages where it is missing or incomplete. It is available through the editing toolbar in the Page: namespace and is accessed by clicking the OCR button. The edit box goes grey while the server processes the image, and the OCR text appears in the edit box within a few seconds (larger pages with more text take longer). You can check the status at http://tools.wmflabs.org/phetools/ocr.php. The tool also OCRs the next page automatically whenever a page is retrieved, so that page's text should be ready by the time you edit it.
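
The "OCR the next page while you work on this one" behaviour can be sketched roughly as follows. This is only an illustration of the idea, not the tool's actual code: the class name `PrefetchingOcr`, the `ocr_page` callback and the `last_page` parameter are all hypothetical, and the real service may work quite differently.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict


class PrefetchingOcr:
    """Return OCR text for a page and start OCR of the following page in the background."""

    def __init__(self, ocr_page: Callable[[int], str], last_page: int) -> None:
        # ocr_page is an assumed stand-in for whatever actually performs the OCR:
        # it takes a page number and returns that page's text.
        self._ocr_page = ocr_page
        self._last_page = last_page
        self._pool = ThreadPoolExecutor(max_workers=2)
        self._jobs: Dict[int, Future] = {}

    def _submit(self, page: int) -> Future:
        # Reuse an existing job so that each page is OCR'd at most once.
        if page not in self._jobs:
            self._jobs[page] = self._pool.submit(self._ocr_page, page)
        return self._jobs[page]

    def get(self, page: int) -> str:
        job = self._submit(page)
        if page < self._last_page:
            # Queue the next page now so its text is ready when the editor moves on.
            self._submit(page + 1)
        return job.result()  # blocks until the requested page is finished

    def close(self) -> None:
        # Wait for any background OCR jobs to finish and release the worker threads.
        self._pool.shutdown(wait=True)


if __name__ == "__main__":
    def fake_ocr(page: int) -> str:
        # Placeholder for a real OCR call.
        return f"text of page {page}"

    ocr = PrefetchingOcr(fake_ocr, last_page=3)
    print(ocr.get(1))  # page 2 is queued in the background at the same time
    print(ocr.get(2))  # served from the job queued by the previous call
    ocr.close()
```

Keeping one job per page means a request for a page that has already been prefetched simply waits for, or reuses, the existing result instead of running the OCR a second time.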