User:SnowyCinema/QuickTranscribe

QuickTranscribe is a Wikisource transcription tool, started by PseudoSkull, that allows entire works to be transcribed on a single page, using a shorthand markup language called "QT markup". A bot then does all the tedious work semi-automatically. The software is public domain.

The QuickTranscribe project is currently a work in progress. I will make it generally available to the Wikisource community once it is more user-friendly and applicable to a wider range of transcription project types.

Here's an example of what QuickTranscribe is capable of. Keep in mind that most of this was generated automatically from a single page of QT markup!
 * Aladdin O'Brien by Gouverneur Morris IV (originally proofread from this page)
 * Q121502566 - work Wikidata item, automatically populated by the software
 * Q121502607 - version Wikidata item, automatically populated by the software
 * c:Category:Aladdin O'Brien - Commons category, with drop initial and other images that were autouploaded by the software
 * User:PseudoSkull/P/Aladdin O'Brien - how the software parsed every bit of the shorthand QT markup into the correct formatting and templates
 * Aladdin O'Brien/TOC - the auxTOC (entirely autogenerated)
 * Index:Aladdin O'Brien (1902).pdf - the index page (entirely autogenerated)
 * Index:Aladdin O'Brien (1902).pdf/styles.css - the style sheet (almost entirely autogenerated)
 * Aladdin O'Brien/Chapter 1 - an example of a chapter that was autotranscluded

Milestones

 * Jalna (1927) by Mazo de la Roche — first work ever transcribed with QT
 * Master Frisky (1902) by Clarence Hawkes — second work transcribed, which introduced Wikimedia Commons automatic file uploading

Tutorial

To be added. Planned sections:

 * Installation
 * Transcription preparation
 * Transcribing a work
 * QT markup documentation
 * Post-transcription maintenance

Types of works supported

 * Basic chaptered novels
 * Front-matter-only prose works (such as children's books)
 * Collections (of short stories, poems, essays). For these collections, quite an exhaustive effort is required to maintain the data for each individual subwork. For example:
   * Wisdom of the Wilderness/The Little Homeless One — short story
   * The Little Homeless One, Little Homeless One — redirects to the short story (these may one day become versions pages, since they currently represent the work itself)
   * Q123437864 — version item for "The Little Homeless One"
   * Q123437858 — work item for "The Little Homeless One"
   * Q123436596 — Wisdom of the Wilderness version item, which lists all versions of each short story as parts
   * Author:Charles_George_Douglas_Roberts — the collection is automatically listed here, sorted alongside the short stories that were already listed there

IA/Hathi/Books

 * Can download all necessary image/scan files from HathiTrust and the Internet Archive
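
For the Internet Archive side, fetching scan files mostly comes down to building the right download URL. Here is a minimal sketch of that step; archive.org's public `/download/` path scheme is real, but the function name and the example identifier are illustrative, not QuickTranscribe's actual code:

```python
from urllib.parse import quote

def ia_download_url(identifier, filename):
    """Build an Internet Archive download URL for one file in an item.

    Follows archive.org's public /download/{identifier}/{filename}
    path scheme; quote() protects spaces and other unsafe characters.
    """
    return f"https://archive.org/download/{quote(identifier)}/{quote(filename)}"
```

HathiTrust's page-image endpoints work differently (per-page requests rather than a single item file), so that half of the downloader is not sketched here.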

OCR
Can automatically extract OCR text from every page of a PDF file and return it for the proofreader to process.

Transcription cleanup

 * Finds hyphenation inconsistencies (such as "foot-ball" and "football" appearing in the same transcription, which is almost certainly an error)
 * Finds many probable scannos, such as:
   * weird symbols within words ("roUed", "<0uld")
   * stray single symbols (such as " f ")
   * paragraphs not ending with punctuation
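
Checks like the two above are simple enough to sketch in a few lines of Python. This is an illustrative approximation of the idea, not QuickTranscribe's actual heuristics or function names:

```python
import re
from collections import Counter

def hyphenation_inconsistencies(text):
    """Report words that appear both hyphenated and unhyphenated.

    Returns pairs like ("foot-ball", "football"); a hit means one of
    the two spellings is almost certainly a transcription error.
    """
    words = Counter(re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text))
    hits = []
    for word in words:
        if "-" in word and word.replace("-", "") in words:
            hits.append((word, word.replace("-", "")))
    return hits

def probable_scannos(text):
    """Flag words with a capital or digit stuck mid-word ("roUed", "c0ld").

    A rough heuristic: real scanno detection needs more rules (and an
    allowlist for legitimate words like "McCoy").
    """
    flagged = []
    for word in re.findall(r"\S+", text):
        core = word.strip(".,;:!?\"'()")
        if re.search(r"[a-z][A-Z]", core) or re.search(r"[A-Za-z]\d|\d[A-Za-z]", core):
            flagged.append(core)
    return flagged
```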

Wikidata

 * Can create Wikidata items for both the base work and the version, and can add all necessary data to those items when necessary
 * Very cool features include:
   * main work image (either cover or frontispiece), automatically detected by the software
   * automatic parsing of a "dedications" page to populate the "dedicated to" property
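
Creating an item boils down to assembling the entity JSON that Wikibase's `wbeditentity` API expects. A minimal sketch of a version-item payload follows; the function is illustrative (not QT's code), and while P31 ("instance of"), P50 ("author"), and Q3331189 ("version, edition or translation") follow standard Wikidata usage, any property ID — including P825 for "dedicated to", an assumption here — should be double-checked before a real edit:

```python
def version_item_skeleton(title, author_qid, pub_year, dedicatee_qid=None):
    """Build a minimal wbeditentity-style payload for a version item."""
    def item_claim(prop, qid):
        # One statement whose value is another Wikidata item.
        return {"mainsnak": {"snaktype": "value", "property": prop,
                             "datavalue": {"type": "wikibase-entityid",
                                           "value": {"entity-type": "item",
                                                     "id": qid}}},
                "type": "statement", "rank": "normal"}

    claims = {
        "P31": [item_claim("P31", "Q3331189")],  # instance of: version/edition
        "P50": [item_claim("P50", author_qid)],  # author
    }
    if dedicatee_qid:
        claims["P825"] = [item_claim("P825", dedicatee_qid)]  # dedicated to (assumed PID)
    return {"labels": {"en": {"language": "en", "value": title}},
            "descriptions": {"en": {"language": "en",
                                    "value": f"{pub_year} edition"}},
            "claims": claims}
```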

Wikimedia Commons

 * Can get all work image data based on (1) placement in the transcription and (2) iterative file name ("1.png", "2.png", etc.)
 * Can create a Commons category for the work, with the necessary parent categories
 * Can create a Creator page for an author if it doesn't exist
 * Can upload both the scan file and all the work images, with a good titling scheme and with valid file descriptions and categories
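
The iterative-name step above can be sketched as a small mapping pass: scan the markup for "1.png"-style names in order of appearance and assign each a descriptive Commons title. The naming scheme below is an illustrative choice, not QT's exact one:

```python
import re

def commons_titles(markup, work_title, year):
    """Map iterative image names ("1.png", "2.png", ...) found in the
    markup, in order of first appearance, to descriptive Commons file
    titles like "Work Title (1902) 1.png"."""
    mapping = {}
    for n in re.findall(r"\b(\d+)\.png\b", markup):
        src = f"{n}.png"
        if src not in mapping:  # keep first-appearance order, skip repeats
            mapping[src] = f"{work_title} ({year}) {n}.png"
    return mapping
```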

Transcription parsing

 * Can automatically format the opening text of a chapter in small caps (sc), or place a drop initial (di) at the beginning
 * Can automatically generate tables of contents, based on a provided format and the text entered in the chapter headers
 * Can automatically place images into the work after they are uploaded
 * Supports formatting that continues across pages (fine block/s to fine block/e in headers/footers, poems continued across pages with ppoem, etc.)
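
The sc/di conversion can be sketched like this: wrap the first few words in the small-caps template, or split off the first letter as a drop initial and small-cap the remainder. The three-word window and template usage are illustrative choices, not QT's exact behavior:

```python
def open_chapter(text, mode="sc"):
    """Format a chapter opening with {{sc|...}} small caps, or a
    {{di|...}} drop initial followed by small caps."""
    words = text.split(" ", 3)
    head = " ".join(words[:3])                      # first three words
    rest = " " + words[3] if len(words) > 3 else ""  # everything after
    if mode == "di":
        return "{{di|" + head[0] + "}}{{sc|" + head[1:] + "}}" + rest
    return "{{sc|" + head + "}}" + rest
```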

Transclusion

 * Can automatically create Index pages
 * Automatically creates a default style sheet in the Index page based on templates used
 * Can input all pages properly into the Page namespace (assuming they're either of status "Proofread" (3) or "not needing to be proofread" (0))
 * Can transclude an entire chaptered novel accurately
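
Transcluding a chapter ultimately means writing mainspace wikitext built around ProofreadPage's `<pages>` tag. A minimal sketch, assuming a bare-bones {{header}} (the real output fills in author, previous/next links, and so on):

```python
def chapter_wikitext(index, title, from_page, to_page):
    """Build mainspace wikitext that transcludes one chapter from an
    Index page via ProofreadPage's <pages> tag."""
    return (
        "{{header\n"
        f" | title = {title}\n"
        "}}\n"
        f'<pages index="{index}" from={from_page} to={to_page} />'
    )
```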

To be done

 * Disambiguation page creation/handling
 * Version page creation/handling
 * Semi-automated author page creation, updating, and disambiguation
 * Support for poetry collections
 * Support for film transcription (an improvement of the WikiProject Film draft system)
 * Support for periodicals
 * Support for newspapers
 * Support for dictionaries/directories/catalogs
 * Support for encyclopedias
 * Support for local (enwikisource) uploading of files, and categorization of those files