User:Inductiveload/cleanup

Installation
The basic script can be installed as follows:



Once installed, you will have access to the default cleanup tool configuration. You may wish to add more (work-specific) corrections, disable some, or add other configuration such as possible languages or long-s corrections.

Concept
The tool performs a list of common actions:


 * Collapsing of lines and hyphens where appropriate
 * Adding paragraphs where likely
 * Typographic fixes like spaces before/after commas
 * Removing obvious running headers and scanning watermarks
 * Fixing OCR errors:
 * This uses a large (hundreds of entries) list of replacements, mostly tested against an English wordlist to avoid false positives. For example not many words end in, so   is likely to be  . However,   and   are not changed.
 * A separate (partially complete) list of long-s scannos is also included, but this is more likely to produce false positives, so it is not on by default.
 * Extra user-defined functions
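
To make the first action in the list above concrete, here is a minimal sketch of line and hyphen collapsing. It is not the script's actual code: the function name `collapseLines` and both regexes are assumptions made for illustration, and a real implementation would need extra checks (for example a word list) to decide when joining is "appropriate".

```javascript
// Illustrative only: not the cleanup script's actual implementation.
function collapseLines(text) {
  return text
    // Join a word hyphenated across a line break: "exam-\nple" -> "example"
    .replace(/([a-z])-\n([a-z])/g, '$1$2')
    // Turn a single newline into a space, leaving blank-line paragraph breaks alone
    .replace(/([^\n])\n(?!\n)/g, '$1 ');
}

const collapsed = collapseLines('an exam-\nple of\nwrapped text.\n\nNext paragraph');
// collapsed === 'an example of wrapped text.\n\nNext paragraph'
```

Note that this naive version would also join genuinely hyphenated words that happen to break at a line end, which is one reason such fixes are only applied "where appropriate".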

Configuration
Configuration is via a "standard" (for me) hook, which is called with a configuration object for you to update:

This is the default config object, which is what you will get if you do not add a config hook handler:


 * The logging level of the Cleanup functions. Set to 0 for, 1 for   and 2 for


 * Enable the cleanup script (false prevents any of it from being added to the UI)


 * For now, internal


 * The portlet category to add the tool link to


 * Namespaces to load in (does nothing in other namespaces)


 * The name in the sidebar


 * A list of additional OCR fixes (see below)


 * A list of disabled replacements (see below)


 * Additional functions to run at the end of the process (see below)


 * Convert “smart quotes” to "straight quotes".


 * Perform a set of fixes for badly-OCR'd texts using long-s. For example:  →


 * Collapse paragraphs together if they look "suspect". For example, if one ends without punctuation, and the next starts with a lowercase letter.


 * Dirty hack to re-insert paragraphs lost in the DjVu round trip. Adds a paragraph break to lines shorter than this that appear to be a sentence end and the next line looks like a sentence start. The spiritual inverse of


 * Set to  if the work might contain these languages: this disables some replacements that would be invalid in those languages. For example, in German   is valid, but in English, it's more likely to be a scanno for.


 * Italicise foreign words. This is a very short list at present.


 * Abbreviations to put inside an asc template. Note, and  are already templates.


 * The template to use for small abbreviations.


 * The edit summary to automatically add, if any


 * Also mark the page as . Note: you still have to actually proofread the page yourself, this isn't magic!


 * The access key to use: e.g.  → Ctrl+Alt+c
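
As an illustration of the "suspect paragraph" option described in the list above, the following sketch joins two paragraphs when the first ends without punctuation and the second starts with a lowercase letter. The function name `collapseSuspectParagraphs` and the exact regex are assumptions; the real script may use different rules.

```javascript
// Illustrative only: a guess at the kind of rule described, not the script's code.
function collapseSuspectParagraphs(text) {
  // A paragraph break that follows a letter, digit or comma and precedes a
  // lowercase letter is treated as suspect and replaced with a single space.
  return text.replace(/([A-Za-z0-9,])\n\n+(?=[a-z])/g, '$1 ');
}

const joined = collapseSuspectParagraphs(
  'The quick brown\n\nfox jumped.\n\nThen it slept.'
);
// joined === 'The quick brown fox jumped.\n\nThen it slept.'
```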

Additional replacements
This is a list of replacements to make. It is a list of tuples of  entries.

Often, you will make special replacements only in certain works:

The global flag (g) is always applied to these regexes. If you need a non-global regex, use a. Other flags are kept (e.g. ). Replacement references like   work.
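
The flag handling described above can be sketched as follows. This is a guess at the behaviour, not the script's source: the helper `applyReplacements`, the `[pattern, replacement]` entry shape and the two sample entries are all assumptions made for illustration.

```javascript
// Hypothetical helper; the [pattern, replacement] entry shape is an assumption.
function applyReplacements(text, replacements) {
  for (const [pattern, replacement] of replacements) {
    // Keep the caller's flags (e.g. "i") and add "g" if it is missing
    const flags = pattern.flags.includes('g') ? pattern.flags : pattern.flags + 'g';
    const globalPattern = new RegExp(pattern.source, flags);
    text = text.replace(globalPattern, replacement);
  }
  return text;
}

// Made-up, work-specific entries
const workSpecific = [
  [/\btbe\b/, 'the'],   // an invented OCR misreading
  [/(\w) ,/, '$1,'],    // "$1"-style references behave as in String.prototype.replace
];

const cleaned = applyReplacements('tbe cat , tbe dog , and more', workSpecific);
// cleaned === 'the cat, the dog, and more'
```

Neither sample regex carries the global flag, yet every occurrence is replaced, which is the behaviour the paragraph above describes.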

Disabled replacements
Disabled replacements are a list of replacements that should not be applied, even though they are part of the normal script:

If only one element is given in an item, all replacements with that regex are disabled. If two are given, both the regex and the replacement must match for the disabling to happen. Only the "text" of the regex is compared; flags are ignored.
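
The matching rule above can be sketched like this. The helper name `isDisabled` and the array shapes are assumptions for illustration only; the script's own data structures may differ.

```javascript
// Hypothetical helper: entry shapes are assumptions for illustration.
function isDisabled(entry, disabledList) {
  const [pattern, replacement] = entry;
  return disabledList.some(([disabledPattern, disabledReplacement]) => {
    // Compare only the regex text (its source), never the flags
    if (disabledPattern.source !== pattern.source) return false;
    // One-element item: disable every replacement that uses this regex
    if (disabledReplacement === undefined) return true;
    // Two-element item: the replacement must match as well
    return disabledReplacement === replacement;
  });
}

// Made-up entry standing in for one of the script's built-in replacements
const builtIn = [/\btbe\b/g, 'the'];

const byRegexOnly = isDisabled(builtIn, [[/\btbe\b/i]]);         // flags differ, still matches
const wrongTarget = isDisabled(builtIn, [[/\btbe\b/, 'tube']]);  // replacement differs
const exactPair = isDisabled(builtIn, [[/\btbe\b/, 'the']]);     // regex and replacement match
```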

Cleanup functions
Final functions to run. This is a list of functions that are given the  editors (as in TemplateScript) as parameters. They run in order.
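
A minimal sketch of the "run in order" behaviour follows. The editor here is a stand-in with assumed `get` and `set` methods, not the real TemplateScript object, and `runCleanupFunctions` is a hypothetical name; the sketch also passes a single editor for simplicity.

```javascript
// Stand-in editor for illustration; not the real TemplateScript object.
function makeFakeEditor(initialText) {
  let text = initialText;
  return {
    get: () => text,
    set: (value) => { text = value; },
  };
}

// Hypothetical runner: call each function in list order with the editor
function runCleanupFunctions(editor, cleanupFunctions) {
  for (const fn of cleanupFunctions) {
    fn(editor);
  }
}

const fakeEditor = makeFakeEditor('  Some page text  ');
runCleanupFunctions(fakeEditor, [
  (ed) => ed.set(ed.get().trim()),       // runs first
  (ed) => ed.set(ed.get() + ' [end]'),   // runs second, on the trimmed text
]);
// fakeEditor.get() === 'Some page text [end]'
```

Because the functions run in list order, swapping the two entries would give a different result, so order the list with that in mind.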

cleanup_functions