User:DeirdreAnne/DjVu Files


 * Specific trials
 * OCR with Tesseract
 * See also Wikimedia Commons:Commons:DjVu.

Image extraction
It is tempting to take the images you need directly out of the DjVu file, but these images are heavily compressed and optimised for text. Images extracted from a DjVu file are badly and irreparably damaged by this compression. If there is no other source, extract from the DjVu and tag the file accordingly at Commons. Otherwise, please use a better source, such as JPG/PNG/TIFF scans of the text.

If the DjVu came from Archive.org, there are often high-quality JPG files that are viewable online: go to the Archive.org details page, choose "read online", increase the size of the image, then right-click and save it. This is in fact easier than ripping from the DjVu, as you don't have to mess around with a screenshot and trimming the image, and the resulting quality is hugely better.

If the DJVU was made from a Google books scan, the Google books PDF can be used to good effect. Compare the following two examples:

Images → virtual printer → DjVu
If the page scans are made available as a PDF file, e.g. Google Books scans, then this can be directly converted into a DjVu file using one of the following:


 * The free Any2DjVu online service; this can also OCR the text and embed it in the .djvu file.
 * The freeware Pdf To Djvu GUI. Note that this requires the installation of the Cygwin environment as a prerequisite to its own installation.
 * The free software command-line pdf2djvu (available in repositories, also for Linux), which usually needs only a single command. There's also a GUI available.
 * If you need to crop the pdf, you can use pdfcrop.pl (see below) for black margins or freeware Govert's PDF Cropper for white margins (it requires Ghostscript and .Net 2.0).
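
As a hedged illustration of the pdf2djvu option above, the following Python sketch builds a typical command line. It assumes pdf2djvu's `-o` (output file) and `-d` (resolution in dpi) options; the file names and the helper name `pdf2djvu_command` are hypothetical, and nothing is executed until the list is passed to `subprocess.run`.

```python
import shlex


def pdf2djvu_command(pdf_path, djvu_path, dpi=None):
    """Build the argument list for one pdf2djvu run (nothing is executed here)."""
    cmd = ["pdf2djvu", "-o", djvu_path]
    if dpi is not None:
        # -d sets the rendering resolution in dots per inch
        cmd += ["-d", str(dpi)]
    cmd.append(pdf_path)
    return cmd


# Show the command as it would be typed in a shell.
print(shlex.join(pdf2djvu_command("scan.pdf", "scan.djvu", dpi=300)))
```

To actually run the conversion, pass the list to `subprocess.run(cmd, check=True)` on a machine where pdf2djvu is installed.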

If the scanned images are made available as individual images, then the easiest option is to print them to a PDF via one of the many "virtual printer" tools, such as the free PDFCreator; then convert the PDF to DjVu as described above.

Note that there are many other options for converting pages to .djvu. One could convert using PostScript or multipage TIFF as the intermediate format, rather than PDF, but this would of course require different conversion tools. It is also possible to convert from .pdf or .ps to .djvu with the DjVuLibre software and its GSDjVu plug-in, but due to licensing restrictions, installing the plug-in is a fairly intricate process that involves compiling a patched version of Ghostscript.

Another free Windows tool that can come in handy for the images-to-pdf-to-djvu process is ConcatPDF, a GUI tool that permits easy splitting and merging of PDF files. An example of how ConcatPDF might be used is: if a 100-page document has previously been scanned and converted to .djvu and the single page #42 needs to be re-scanned, ConcatPDF would allow that one page to be inserted into the intermediate .pdf file without tracking down the other page images and re-composing the entire document. Installing ConcatPDF version 1.1 requires as prerequisites that the free Microsoft program libraries Microsoft .NET Framework Version 1 and the corresponding Visual J# .NET Redistributable Package be installed beforehand.

Images directly to DjVu
A far higher quality document can be achieved using the DjVuLibre software library. JPEG images can be directly encoded into individual DjVu pages using the c44 encoder. Images in lossless formats such as PNG should be converted to PPM (for colour scans) or PGM (for greyscale scans), then encoded using c44. For bitonal (i.e. black-and-white) scans, such as most page text images, a smaller DjVu file can be obtained by converting the page images to the monochrome PBM format, then encoding to DjVu using the cjb2 encoder. All of these image format conversions can be performed by the free ImageMagick suite (in batch, with mogrify). Individual DjVu pages can be aggregated into a multi-page DjVu using the djvm program; this program can also be used to insert or delete pages from a DjVu file.
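
The choice of intermediate format and encoder described above can be summarised in a small Python sketch. The function name `encoding_plan` and the `scan_type` labels ("bitonal", "grey", "colour") are hypothetical; only the format and encoder names come from the text.

```python
from pathlib import Path


def encoding_plan(image_path, scan_type):
    """Return (intermediate_format, encoder) for one page image.

    scan_type is a hypothetical label: 'bitonal', 'grey' or 'colour'.
    An intermediate format of None means the file can be encoded as it is.
    """
    suffix = Path(image_path).suffix.lower()
    if scan_type == "bitonal":
        # Black-and-white text pages: PBM, then the cjb2 encoder.
        return ("pbm", "cjb2")
    if suffix in (".jpg", ".jpeg"):
        # c44 accepts JPEG input directly.
        return (None, "c44")
    if scan_type == "grey":
        return ("pgm", "c44")
    # Anything else is treated as a colour scan.
    return ("ppm", "c44")
```

For example, a greyscale PNG scan would be converted to PGM and passed to c44, while a bitonal scan in any format would go through PBM and cjb2.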

An important caveat of this process is that high-quality scans come at the cost of larger files, and there is currently a 100 MB limit on uploads to Commons.

Scripting DjVuLibre
This script allows you to take a whole directory of image files (JPG, PNG, GIF, TIFF, and any file that ImageMagick can convert to PPM) and convert and collate them automatically into a DjVu file. Currently this script is for Windows, but it can be easily converted for Linux. To use it, you will need Python, ImageMagick and DjVuLibre.
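
A minimal sketch of such a script is given here; it is an illustration under stated assumptions, not the script referred to above. It assumes that ImageMagick is callable as `convert` and that the DjVuLibre tools `c44` and `djvm` are on the PATH. The names `build_commands`, `run_all` and `IMAGE_SUFFIXES` are hypothetical.

```python
import subprocess
from pathlib import Path

# File types treated as page images (an assumption; extend as needed).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".tif", ".tiff"}


def build_commands(image_paths, output_djvu):
    """Return the ordered commands that turn page images into one DjVu file."""
    commands, pages = [], []
    # Pages are collated in sorted file-name order.
    for image in sorted(Path(p) for p in image_paths):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip anything that is not a page image
        ppm = image.with_suffix(".ppm")
        page = image.with_suffix(".djvu")
        commands.append(["convert", str(image), str(ppm)])  # image -> PPM
        commands.append(["c44", str(ppm), str(page)])       # PPM -> DjVu page
        pages.append(str(page))
    if pages:
        # Bundle the single pages into one multi-page document.
        commands.append(["djvm", "-c", output_djvu] + pages)
    return commands


def run_all(directory, output_djvu):
    """Execute the commands for every image found in a directory."""
    for command in build_commands(Path(directory).iterdir(), output_djvu):
        subprocess.run(command, check=True)
```

Because pages are ordered by file name, zero-padded names such as `page-001.png` keep the pages in reading order. For bitonal text scans, swapping PPM and c44 for PBM and cjb2 should give smaller files.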

Linux

 * See also: User:GrafZahl/How to digitalise works for Wikisource

Method 1 - page at a time with DjVuLibre
You need the DjVuLibre software, which includes a viewer and some tools for creating and handling DjVu files. You will probably also need the ImageMagick software for converting scans from one format to another. The tool cjb2 is used to create a DjVu file from a PBM or TIFF file. Therefore you need to convert your scans if they are not already in one of these formats.

convert rig_veda-000.png rig_veda-000.pbm
 * Conversion from PNG format to PBM format with the tool convert from Imagemagick


 * Depending on the quality of the original scans, you may find it useful to process them with the unpaper utility, which deletes black borders around the pages and aligns the scanned text squarely on the page. Unpaper is also capable of extracting two separate page images where facing pages of a book have been scanned into a single image. Other utilities include mkbitmap and pdfcrop.pl (Perl-based free software that uses the BoundingBox; on Ubuntu it requires Ghostscript and texlive-extra-utils; it can crop a whole multi-page PDF in a single pass). PDFCrop, a different tool with a similar name, deletes white margins.

cjb2 -clean rig_veda-000.pbm rig_veda-000.djvu
 * Creation of a DJVU file from a PBM file

djvm -i rig_veda.djvu rig_veda-000.djvu
 * Adding the DJVU file to the final document

You need to repeat these steps with a script for each page of the book. Example:
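
One possible way to script the repetition is sketched below in Python. It only builds the command lines; the helper name `page_commands` is hypothetical, and the `rig_veda-000` naming pattern is taken from the commands above. The first page is bundled with `djvm -c` because `djvm -i` needs an existing document to insert into.

```python
def page_commands(stem, page_count, suffix="png"):
    """Build the convert / cjb2 / djvm commands for pages 000 .. page_count-1."""
    document = f"{stem}.djvu"
    commands = []
    for number in range(page_count):
        base = f"{stem}-{number:03d}"
        # Scan -> bitonal PBM (ImageMagick)
        commands.append(["convert", f"{base}.{suffix}", f"{base}.pbm"])
        # PBM -> single-page DjVu
        commands.append(["cjb2", "-clean", f"{base}.pbm", f"{base}.djvu"])
        # Create the document with the first page, then append the rest.
        action = "-c" if number == 0 else "-i"
        commands.append(["djvm", action, document, f"{base}.djvu"])
    return commands
```

Each returned list can be run with `subprocess.run(command, check=True)`, in order.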

There is also another way to add all the *.djvu parts into one:

djvm -c rig_veda.djvu rig_veda-000.djvu rig_veda-001.djvu rig_veda-002.djvu

See the following section for an automated process for multiple pages.

Method 2 - PDF to DjVu bash script
Use this script, which converts a PDF (multiple or single page) into images, automatically crops them with ImageMagick, converts them to DjVu and bundles them. This is very slow (a huge PDF can require days) but a little more efficient than the following method.

The resulting file is quite big and low-quality, probably because of poor font recognition, which may be fixed by newer versions of poppler (the library used): the version available in repositories is usually several months old.

You can also remove the pdftoppm part and use the script to convert multiple images directly into a multi-page DjVu. If the images are not in PBM format, you can convert them with a single command using mogrify from ImageMagick.
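
The two preparation steps mentioned here can be sketched in Python as follows. This is an illustration under assumptions, not the bash script itself: it assumes poppler's `pdftoppm` with its `-mono` and `-r` options and ImageMagick's `mogrify -format`. The helper names and the default resolution are hypothetical.

```python
def rasterise_command(pdf_path, prefix, dpi=300):
    """pdftoppm -mono writes one bitonal PBM per PDF page, named from prefix."""
    return ["pdftoppm", "-mono", "-r", str(dpi), pdf_path, prefix]


def to_pbm_command(image_paths):
    """mogrify -format pbm writes a .pbm file next to each input image."""
    return ["mogrify", "-format", "pbm"] + list(image_paths)
```

When these lists are run with `subprocess.run` there is no shell to expand wildcards, so the image names have to be listed explicitly, for example with `sorted(str(p) for p in Path(".").glob("*.png"))`.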

Method 3 - pdf2djvu
Simply install the pdf2djvu tool from your repository to convert a PDF (single or multiple pages) directly into DjVu. This is slow (several hours for a PDF of about 100 MB, depending on your hardware), but requires little memory and CPU. The resulting DjVu file is quite large and low-quality, and has no OCR text.

Moreover, you need to crop the PDF itself before the conversion. On Linux this is quite difficult. You could use ImageMagick, but beware: with a big multi-page PDF, this can take several GB of memory (the limit is 16 TB!) and kill your computer if you don't use the   option directly after. This makes the conversion very slow.

The resulting PDF is increased in size and reduced in quality because of rasterisation.

See other crop tools above.

Method 4 - DjVuDigital
Use djvudigital, which like pdf2djvu converts a PDF directly into DjVu. There are licensing problems: the GSDjVu library has a different license, so you'll need to compile it yourself. The included utilities make this step quite easy, but it is still long (about 1 hour) and a bit annoying.

Once that is done, you can convert a PDF into DjVu with a single command (see the previous section for cropping). The conversion is slow (I find it will complete a 300-page PDF in about 30-40 minutes). The resulting DjVu is of higher quality and smaller file size than with either of the previous two methods. Additionally, DjVuDigital can handle JPEG2000 (aka JPX) files embedded in PDFs, which is a feature of many Google books. pdf2djvu, Any2Djvu and Internet Archive conversions all fail to convert these files, leaving blank pages in the output.

DjVuDigital has many advanced options to improve results, but they can be difficult to master. In general, altering the --dpi option can give you a quick reduction in file size without too much fiddling.
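
As a hedged example, the Python sketch below builds a djvudigital command line with an optional `--dpi` setting. It assumes the usual `djvudigital [options] inputfile outputfile` form; the helper name and the file names are hypothetical.

```python
def djvudigital_command(pdf_path, djvu_path, dpi=None):
    """Build the argument list for one djvudigital run (nothing is executed)."""
    cmd = ["djvudigital"]
    if dpi is not None:
        # A lower resolution gives a smaller file at the cost of detail.
        cmd.append(f"--dpi={dpi}")
    cmd += [pdf_path, djvu_path]
    return cmd
```

Comparing the output size at two or three resolutions on a short extract is a quick way to pick a value before converting a whole book.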

Any2Djvu
Another method to convert the images to DjVu is to zip them and use the Any2Djvu site to create the DjVu file. Any2Djvu will extract the images from the zip and create an OCRed DjVu. OCR works well only with English text.

Any2Djvu cannot handle huge files. Big files are best dealt with if you upload them by URL (e.g. by entering a link like ftp://ftp.bnf.fr/005/N0051165_PDF_1_-1DM.pdf). Conversion can take several hours. Any2Djvu will sometimes run out of memory on large or highly detailed files and fail. It will also not convert "JPX" images embedded into PDFs, which are common in Google Books scans.

Internet Archive
Another method is to upload the PDF to the Internet Archive. You need to log in (don't use OpenID, it won't work). Click "Upload" at the top-right corner. The JavaScript upload ("Share" button) won't work with Firefox (use Opera or Internet Explorer instead) or on Linux. You can use the FTP upload instead, but this is slower and seems prone to crashing.

When the upload has been completed, archive.org will start the "derive" work: OCR to create a PDF with text, then conversion to DjVu with text, text only, etc. This is very slow and can take several days, but you don't need to do anything. The length of time depends on the size and complexity of your file, as well as the current Internet Archive backlog of conversion tasks. You can check your progress in the queue here and more detailed information about jobs you submitted here (must be logged in).

The Internet Archive uses professional, proprietary ABBYY software, which gives quite good image and OCR output in many languages and fonts, together with aggressive compression that maintains a high quality in the final DjVu file. However, the Internet Archive sometimes produces over-compressed DjVu files with poor quality. If this happens, you can often download a PDF and convert manually.

OCR via Any2DjVu
The OCR option available at the free conversion service Any2DjVu does OCR the scanned image, but the resulting text is embedded within the .djvu file itself and must be extracted so it can be used on Wikisource.

One way to do this is to use the DjVuLibre software to extract the text, via a command like

or
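
As a hedged illustration, the following Python sketch builds two DjVuLibre command lines that dump the hidden text layer: one with `djvutxt` and one with the `print-pure-txt` command of `djvused`. The helper names and the file names are hypothetical.

```python
def djvutxt_command(djvu_path, text_path, page=None):
    """djvutxt writes the text layer of a document (or one page) to a file."""
    cmd = ["djvutxt"]
    if page is not None:
        cmd.append(f"-page={page}")
    cmd += [djvu_path, text_path]
    return cmd


def djvused_text_command(djvu_path):
    """djvused prints the plain text layer to standard output."""
    return ["djvused", djvu_path, "-e", "print-pure-txt"]
```

The djvused variant writes to standard output, so its result has to be captured, for example with `subprocess.run(cmd, capture_output=True, text=True).stdout`.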

JVbot can automatically upload the text layer of a DJVU to the pages on Wikisource. For example, Robert the Bruce and the struggle for Scottish independence - 1909

OCR via Internet Archive
See above: if you upload a DjVu file, the derive process will OCR it.

OCR with Tesseract
OCR can be done with Tesseract, a free OCR engine, and a script:


 * OCR with Tesseract. Perl script
 * OCR with Tesseract (Python), slightly more user-friendly Python script. Based on the Perl script

Linux
To extract images from a DjVu file, you can use ddjvu:

ddjvu -page=8 -format=tiff myfile.djvu myfile.tif

If you have done all the pages (without ), you can split the multi-page TIFF into single-page PNG files (or any other format):

convert -limit area 1 myfile.tif myfile.png
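
An alternative that avoids the multi-page TIFF is to call ddjvu once per page. The Python sketch below builds those commands using the same `-page` and `-format` options as above; the helper name and the output naming scheme are hypothetical.

```python
def extract_page_commands(djvu_path, first, last, stem="page"):
    """Build one ddjvu command per page, each writing its own TIFF file."""
    return [
        ["ddjvu", "-format=tiff", f"-page={number}", djvu_path,
         f"{stem}-{number:04d}.tif"]
        for number in range(first, last + 1)
    ]
```

Each command produces a single-page image, which keeps memory use low on large books.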

Splitting DjVu files
Large works cannot be uploaded onto Wikimedia servers, which have a 100 MB upload limit. To split the DjVu, use DjVuLibre "Save as", and specify a page range which will produce a file small enough to be uploaded. Some trial and error may be necessary.

djvused can do this too:

djvused myfile.djvu -e 'select 10; save-page-with p10.djvu'

This can be done for every page.

To find the number of pages in the file:

djvused myfile.djvu -e 'n'
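
To reduce the trial and error, page ranges can be estimated from the sizes of the individually saved pages. The Python sketch below groups consecutive pages greedily so that each group stays within a size limit. The function names are hypothetical, and the sum of page sizes is only an approximation of the size of the re-bundled file.

```python
def split_ranges(page_sizes, limit_bytes):
    """Group consecutive pages so each group's total size stays within the limit.

    page_sizes[0] is the size of page 1. Returns (first, last) page numbers.
    A single page larger than the limit still gets a range of its own.
    """
    ranges, start, total = [], 1, 0
    for number, size in enumerate(page_sizes, start=1):
        if number > start and total + size > limit_bytes:
            ranges.append((start, number - 1))  # close the current group
            start, total = number, 0
        total += size
    if page_sizes:
        ranges.append((start, len(page_sizes)))
    return ranges


def save_page_command(djvu_path, number):
    """Build the djvused command that saves one page as p<number>.djvu."""
    return ["djvused", djvu_path, "-e",
            f"select {number}; save-page-with p{number}.djvu"]
```

The pages of each range can then be bundled into a separate file with `djvm -c`.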

Displaying a particular page
The link tag accepts a named parameter "page" so that, for example, this wiki code displays the image of page 164 of the file Emily Dickinson Poems (1890).djvu on the right, 150 pixels wide (the rear cover of the book, containing no text):



The page image can be displayed in the DjVu in place of text as in Page:Personal Recollections of Joan of Arc.djvu/9 using:

The page image can be displayed in the book's Wikisource main space as with Personal Recollections of Joan of Arc/Book I/Chapter 2 using: