User:TarmstroBot

About the Bot
This is a bot run by User:Tarmstro99.

TarmstroBot’s primary job is to crawl the more than 220,000 scanned pages from the United States Statutes at Large that are online here and correct obvious OCR errors, to ease the proofreading job of Wikisource’s human editors. If the bot can correct obvious OCR errors (such as a misrecognized rendering of “of”), the remaining errors (that is, the ones that really do require human analysis and judgment to correct) should become that much easier to spot.

To avoid clobbering the servers, TarmstroBot generally waits 10–12 seconds between page edits. At that rate, crawling over the entire collection of Statutes at Large scans currently available here can require a maximum of approximately thirty days to complete, although in practice the time required will frequently be much less because many pages will remain unaltered by any particular bot run.
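
The thirty-day figure follows from the page count and the delay between edits. Here is a rough check, assuming every one of the 220,000 pages is edited and ignoring the time each edit itself takes; the function name is just for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def crawl_days(pages, seconds_per_edit):
    """Days needed if every page is edited, with a fixed pause between edits."""
    return pages * seconds_per_edit / SECONDS_PER_DAY


# Lower and upper bounds for the 10-12 second delay mentioned above.
low = crawl_days(220_000, 10)
high = crawl_days(220_000, 12)
print(f"{low:.1f} to {high:.1f} days")  # prints "25.5 to 30.6 days"
```

Pages the bot leaves unaltered incur no edit delay, which is why a real run usually finishes sooner.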

The script that controls the bot’s behavior is user-fixes.py, which contains a set of instructions for the Pywikipedia replace.py routine. You can see what changes TarmstroBot is processing any time it is running by reading user-fixes.py.
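
For readers who have not seen one, an entry in user-fixes.py looks roughly like the sketch below. This is not copied from the bot's actual file: the fix name, the edit summary, and both replacement pairs are made up for illustration, and the exact keys accepted may vary between Pywikipedia versions. The small `apply_fix` helper is also hypothetical; it only imitates what replace.py does with the list so the sketch can be run on its own.

```python
import re

# In a real user-fixes.py the framework supplies the `fixes` dictionary;
# it is created here only so this sketch runs standalone.
fixes = {}

fixes['statutes-ocr'] = {  # hypothetical fix name
    'regex': True,
    'msg': {'_default': 'Bot: correcting common OCR errors'},  # hypothetical summary
    'replacements': [
        # (pattern, replacement) pairs; both are invented examples.
        (r'\btbe\b', 'the'),
        (r'\bUnited Statea\b', 'United States'),
    ],
}


def apply_fix(text, fix):
    """Apply each (pattern, replacement) pair in order (illustrative helper)."""
    for pattern, replacement in fix['replacements']:
        text = re.sub(pattern, replacement, text)
    return text


print(apply_fix('tbe Senate of tbe United Statea', fixes['statutes-ocr']))
# prints "the Senate of the United States"
```

The `\b` word boundaries keep a pattern from matching inside longer words, which matters when the bot is run unattended over many pages.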

The process I use for determining which edits the bot should be making looks something like this. This process should be repeatable many times; as the most common errors are fixed, the remaining errors should be easier to spot in the next iteration of the concordance file.
 * 1) download the text of one or more volumes of the Statutes at Large as they exist on this site. (Most recently I have been grabbing volumes 10 at a time to get a better aggregate picture of the contents.)
 * 2) create concordance files of the words contained in the downloaded text (not hard to do using free software tools), sorted by frequency of occurrence.
 * 3) browse the resulting concordance of words from highest to lowest frequency, looking for obvious misspellings that occur most frequently.
 * 4) add the necessary replacement text to user-fixes.py.
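
Step 2 can be done with many tools. As one possibility, here is a minimal Python sketch of a frequency-sorted concordance. It is not the tool I actually use, and the sample sentence is invented; it treats any run of letters as a word and ignores case, which is a simplifying assumption.

```python
import re
from collections import Counter


def concordance(text):
    """Return (word, count) pairs sorted from most to least frequent."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    return Counter(words).most_common()


# Invented sample text containing a typical OCR misreading ("tbe").
sample = "The act of tbe Congress and the act of tbe Senate"
for word, count in concordance(sample):
    print(f"{count:5d}  {word}")
```

In a listing built from ten volumes at once, a misspelling such as "tbe" that recurs thousands of times rises toward the top, which is what makes it easy to pick out in step 3.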

Suggestions for other projects for the bot are welcome on the talk page.