User:Inductiveload/dp reformat

This is a script to attempt to convert a raw text file holding page-wise proofread text from Distributed Proofreaders to a format that can be inserted into the Page: namespace by the Help:Match and Split bot.

Prerequisites
 * Install User:Inductiveload/dp_reformat.js by adding the following line to your common.js page:
mw.loader.load("//en.wikisource.org/w/index.php?title=User:Inductiveload/dp_reformat.js&action=raw&ctype=text/javascript");
 * Enable the Match and Split gadget in your gadget preferences.
 * Download a "concatenated page text file" from Distributed Proofreaders. These files are not available for "archived" projects, and DP will not provide access to them.
 * Upload the matching scan (usually DP will mention if they use an IA scan) and create an Index page.

Process

 * Open a new page (this can be in mainspace if you will proofread it soon, in your user space, or even in the Sandbox). You do not need to save.
 * Paste in the contents of the text file from Distributed Proofreaders
 * Click "Reformat DP text" in the side bar
 * Fill in the target Index file—this is the index page you created above
 * Fill in the target offset. This is needed because "page 1" according to DP is usually not the first page in the DjVu or PDF scan. If they have removed blank pages in the body of the work, you may need to work in sections.
 * It is important to check the offset carefully, as incorrect splits need a bot and an admin (because of the redirects) to fix.
 * Click "Done".
 * The text should now be transformed into a split-ready format that looks something like this:

Page:The ways of war - Kettle - 1917.pdf/7
THE WAYS OF WAR

Page:The ways of war - Kettle - 1917.pdf/8
Page content

More page content

Page:The ways of war - Kettle - 1917.pdf/9
....
 * Make any other adjustments to the text now if you want. The content under each Page: line will be placed into the matching page, so it's easier to do bulk edits and replacements now, while it is still one file, rather than later.
 * Save the page now.
 * There should be a "split" tab at the top of the page, next to "Discussion".
 * Click this, and the Match and Split bot will start to move the text to the Page namespace. This can take a short while, as the bot does not create pages very fast.
 * You can check the split has started here.
 * When it is done, the page will be updated with transcluded text from the Page namespace.
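
To make the offset step concrete, here is a minimal sketch of the arithmetic involved. It is not taken from dp_reformat.js: the function names are hypothetical, the header line simply mirrors the example output as it appears on this page (the real markers the bot expects may differ), and the offset of 6 is only inferred from that example, assuming DP's page 1 lands on scan page 7.

```javascript
// Hypothetical helpers for illustration only; not part of dp_reformat.js.

// Build the target page title for one DP page.
// scan page number = DP page number + offset
function pageHeader(indexFile, dpPage, offset) {
  return "Page:" + indexFile + "/" + (dpPage + offset);
}

// Join an array of page texts (pages[0] is DP "page 1") into one text,
// each page preceded by its header line, pages separated by a blank line.
function toSplitReady(pages, indexFile, offset) {
  return pages
    .map(function (text, i) {
      return pageHeader(indexFile, i + 1, offset) + "\n" + text.trim();
    })
    .join("\n\n");
}

// Assumed offset of 6: DP page 1 becomes scan page 7.
var sample = toSplitReady(
  ["THE WAYS OF WAR", "Page content"],
  "The ways of war - Kettle - 1917.pdf",
  6
);
console.log(sample);
```

If the first header in your output does not point at the scan page that really holds the first page of DP text, the offset is wrong and should be fixed before you save and split.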

Limitations

 * PGDP uses a very general syntax like  to mark "special formatting", where we'd normally apply the formatting ourselves using, say, fine block. So these occurrences need to be dealt with on a case-by-case basis.
 * Line breaks are retained inside  blocks using  . This might not be right in all cases, but PGDP isn't more specific, so, again, it's a case-by-case thing.
 * Some diacritics might be missing. Let me know if you spot a mapping from something like  to a character like ā and I'll add it.
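
For anyone wondering what such a diacritic mapping looks like, here is a rough sketch of a table-driven replacement. It assumes the bracketed notation described in DP's proofreading guidelines (for example `[=a]` for a with a macron); the table entries and the function name are illustrative assumptions, not the script's actual table.

```javascript
// Illustrative mapping only; the real script's table is not reproduced here.
// Assumes DP-style bracket notation: [=x] macron, [)x] breve.
var DIACRITIC_MAP = {
  "[=a]": "ā",
  "[=e]": "ē",
  "[=o]": "ō",
  "[)a]": "ă"
};

// Replace every known bracket code; leave unknown codes untouched so they
// can be spotted and reported later.
function replaceDiacritics(text) {
  return text.replace(/\[[^\[\]\s]{2}\]/g, function (code) {
    return Object.prototype.hasOwnProperty.call(DIACRITIC_MAP, code)
      ? DIACRITIC_MAP[code]
      : code;
  });
}

console.log(replaceDiacritics("R[=a]ma and [=z]"));
```

Leaving unknown codes in place, rather than dropping them, means a missing mapping stays visible in the text during proofreading.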

Remember, this tool is not a substitute for proofreading, it's just an aid.

Other subdomains
Other language subdomains can be supported. I need to know the correct values for the  dictionary, which defines how a domain handles certain formatting.

Configuration
Some configuration options are exposed.

Set them by adding a handler for the  hook and setting the values in the provided config object. This is optional: if you do not, defaults will be used.


 * : convert  to Wiki-style  . Note: DP claims ownership of comments, so they are stripped by default.
 * : default, strip comments.
 * : default, strip comments.
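
As a rough sketch of what such a handler might look like in your common.js: `mw.hook(...).add(...)` is the standard MediaWiki way to register a hook handler, but the hook name and the option name below are placeholders, since the real names are not shown on this page. Check the script source for the actual ones before using this.

```javascript
// PLACEHOLDER names: neither the hook name nor the option name below is
// confirmed by this page. Replace both with the real ones from the script.
var HOOK_NAME = "dp_reformat.config";

// Handler that receives the config object and overrides a default.
function applyMyConfig(cfg) {
  cfg.keep_comments = true; // hypothetical option: keep DP comments
}

// In common.js the global `mw` exists; the guard keeps this harmless elsewhere.
if (typeof mw !== "undefined") {
  mw.hook(HOOK_NAME).add(applyMyConfig);
}
```

Anything you do not set in the handler keeps its default value.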