User:Xover/DjVu

General DjVu process

 * Grab the jp2's from IA
   * The jp2's are typically higher resolution than the jpg's
 * Adjust the image files to match the book pages
   * In particular, delete any scan-artifact pages (first and last) before or after the actual book covers
 * Use GraphicsMagick to convert the jp2's to jpg's
   * Since DjVuLibre can't read JPEG2000
 * Use DjVuLibre (c44) to generate single-page .djvu files from the page images
 * Use Tesseract to OCR each page, spitting out .hocr files
 * Write some custom code to:
   * Merge the individual .djvu's into a multi-page .djvu for the whole book
   * Parse the hOCR data from Tesseract and generate DjVuLibre s-expressions
 * Use djvused to add a hidden text layer to the book .djvu
 * Upload to Commons
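The steps above, sketched as shell commands for a single page. File names, page numbers, and the hocr2sexpr step are placeholders; flags are typical invocations, so check against your local tool versions:

```shell
# Convert a JPEG2000 page image to JPEG with GraphicsMagick
# (DjVuLibre's c44 can't read JPEG2000, but does accept JPEG)
gm convert page_0001.jp2 page_0001.jpg

# Encode the page image as a single-page DjVu (c44 is DjVuLibre's IW44 encoder)
c44 page_0001.jpg page_0001.djvu

# OCR the page with Tesseract; "hocr" makes it write page_0001.hocr
tesseract page_0001.jpg page_0001 hocr

# Bundle the single-page files into one book-level DjVu
djvm -c book.djvu page_0001.djvu page_0002.djvu

# Add the hidden text layer for page 1 from a per-page sexpr file
# (producing page_0001.sexpr is the custom-code step this page is sketching)
djvused book.djvu -e 'select 1; set-txt page_0001.sexpr' -s
```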

Rough outline algorithm notes

 * Relevant libraries:
   * HTML::Parser (hOCR is an HTML-based microformat) (link to spec here)
   * Use LWP or something to do the download and upload steps?
   * Look for something to help with the parser logic or state machine?
   * Tesseract
     * Is there a decent library for this so we won't have to wrap the command line?
   * GraphicsMagick
     * What happened to PerlMagick? Where are the bindings for this?
 * Use a simple pseudo-state machine for each level of hOCR data:
   * There's some overall OCR data that can probably be ignored (it'll be per-page in this case)
   * hOCR supports columns, but ignore these for now (too complicated)
   * First state will be HOCR_PAGE
     * Maybe ignore this for OCR purposes and just use it to determine the right DjVu page to add the hidden text layer to?
   * Second state will be HOCR_PARA
     * Is it worth mapping this to DjVuLibre's equivalent concept? Maybe just ignore it.
   * Third state will be HOCR_LINE
   * Fourth state will be HOCR_WORD
   * Fifth possible state would be HOCR_CHAR; DjVuLibre supports it, but I don't think it's worth dealing with
   * Each parsing state is a constant
   * Need a global var or lightweight object to keep track of the current state
 * HTML::Parser is event-driven
   * Need to catch start-tag and end-tag events
   * Need to check for valid events in each given state (not too many: hOCR is strictly nested and general HTML can be ignored; no tagsoup)
 * Build a tree in memory, or implement this as a streaming algorithm spitting out the sexprs as we go along?
   * Is it worthwhile to spend time on a generic data structure for this that can be serialized to many formats?
   * Maybe it makes more sense to write it as a straight hocr2sexpr converter and spit out per-page .sexpr files?
     * This would make the overall algorithm dumber-but-simpler
     * And given we'll be wrapping command-line utilities in any case, we can't avoid the "dumb" part. Maybe try to get the "simple" part too then?
   * Then again, a fully streaming implementation is probably not that much more complicated, all things considered, provided we can rely on djvused not crapping out on us too much
     * That's a big if: for book-length djvu's, the DjVuLibre tools have crapped out rather a lot
     * Maybe it's better if we can operate on a page-at-a-time level?
     * Then again, we need to keep track of at least HOCR_PAGE and HOCR_LINE to generate working sexprs
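For reference, the sexpr nesting mirrors the hOCR states above (page → line → word, with the column and para levels skipped). A minimal sketch of the DjVuLibre hidden-text format and how djvused loads it; the page number, dimensions, and coordinates here are made up:

```shell
# A minimal hidden-text sexpr: (page xmin ymin xmax ymax (line ... (word ...)))
# DjVu coordinates have their origin at the bottom left of the page.
cat > page_0001.sexpr <<'EOF'
(page 0 0 2400 3600
 (line 100 3400 900 3450
  (word 100 3400 400 3450 "Hello")
  (word 450 3400 900 3450 "world")))
EOF

# Install it as the hidden text layer of page 1 of the book file
djvused book.djvu -e 'select 1; set-txt page_0001.sexpr' -s
```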

Most manual workflow

 * At a minimum need custom code to convert .hocr to .sexpr
   * NB! hOCR (origin at top left) and sexpr (origin at bottom left) use different coordinate systems!
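The conversion itself is just a flip against the page height: an hOCR bbox `x0 y0 x1 y1` (origin top left, y grows downward) on a page of height H becomes the sexpr coordinates `x0 (H - y1) x1 (H - y0)` (origin bottom left, y grows upward). A quick sanity check with awk, using made-up numbers:

```shell
# hOCR bbox 100 200 400 250 on a page 3000px tall becomes
# sexpr coords: xmin=100 ymin=3000-250=2750 xmax=400 ymax=3000-200=2800
echo "100 200 400 250" | awk -v H=3000 '{ print $1, H-$4, $3, H-$2 }'
# prints: 100 2750 400 2800
```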