Help:DjVu files


 * Specific trials (mostly obsoleted by new tools)
 * OCR with Tesseract (obsoleted by ocrodjvu, see below)
 * See also Wikimedia Commons:Commons:DjVu.

Image extraction
DjVu files generally have very heavy image compression that is optimised for text. This results in severe damage to image quality for illustrations and photographs. In general, it is better not to extract images from DjVu files and instead use more original files, for example, the page JP2s at the Internet Archive. Help:Image extraction contains more guidance.

Windows
DjvuToy is a Windows program that can:
 * make a DjVu file
 * merge DjVu files
 * split DjVu files
 * edit DjVu files
 * generate a bundled file
 * export from DjVu to another file format
 * extract text from DjVu
 * download DjVu file structure info (e.g. OCR coordinates)

Images → virtual printer → DjVu
If the page scans are made available as a PDF file, e.g. Google Books scans, then this can be directly converted into a DjVu file using one of the following:


 * The free Any2DjVu online service; this can also OCR the text and embed it in the .djvu file.
 * The freeware Pdf To Djvu GUI. Note that this requires the installation of the Cygwin environment as a prerequisite to its own installation.
 * The freeware Djvu-Spec Pdf 2 Djvu Converter for Windows, a command-line tool with a GUI, from the djvu-spec.narod.ru software page. This tool offers many settings to change the quality and size of the resulting DjVu file.
 * The free-software command-line tool pdf2djvu (available in repositories, also for Linux). Basic usage is a single command; an example is given under Method 3 below. There is also a GUI available.
 * If you need to crop the PDF document, you can use pdfcrop.pl (see below) for black margins or the freeware Govert's PDF Cropper for white margins (it requires Ghostscript and .NET 2.0).

If the scanned images are made available as individual images, then the easiest option is to print them to a PDF document via one of the many "virtual printer" tools, such as the free PDFCreator; then convert the PDF document to DjVu as described above.

Note that there are many other options for converting pages to .djvu. One could convert using PostScript or multipage TIFF as the intermediate format, rather than PDF, but this would of course require different conversion tools. It is also possible to convert from .pdf or .ps to .djvu with the DjVuLibre software and its GSDjVu plug-in but due to licensing restrictions installing the plug-in is a fairly intricate process that involves compiling a patched version of Ghostscript.

Another free Windows tool that can come in handy for the images-to-pdf-to-djvu process is ConcatPDF, a GUI tool that permits easy splitting and merging of PDF files. This tool can also be used online. An example of how ConcatPDF might be used is: if a 100-page document has previously been scanned and converted to .djvu and the single page #42 needs to be re-scanned, ConcatPDF would allow that one page to be inserted into the intermediate .pdf file without tracking down the other page images and re-composing the entire document. Installing ConcatPDF version 1.1 requires as prerequisites that the free Microsoft program libraries Microsoft .NET Framework Version 1 and the corresponding Visual J# .NET Redistributable Package be installed beforehand.

Images directly to DjVu
A far higher quality document can be achieved using the DjVuLibre software library. JPEG images can be directly encoded into individual DjVu pages using the c44 encoder. Images in lossless formats such as PNG should be converted to PPM (for colour scans) or PGM (for greyscale scans), then encoded using c44. For bitonal (i.e. black-and-white) scans, such as most page text images, a smaller DjVu file can be obtained by converting the page images to the monochrome PBM format, then encoding to DjVu using the cjb2 encoder. All of these image format conversions can be performed by the free ImageMagick library (in batch, with mogrify). Individual DjVu pages can be aggregated into a multi-page DjVu using the djvm program; this program can also be used to insert or delete pages from a DjVu file.
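The choice of encoder described above can be sketched in Python. This is only an illustration: the function name encode_command and the file names are hypothetical, and the function merely builds the command line without running it.

```python
from pathlib import Path

# Encoder per input format, as described in the text:
# c44 for JPEG, PPM and PGM; cjb2 for bitonal PBM.
ENCODERS = {
    ".jpg": "c44",
    ".jpeg": "c44",
    ".ppm": "c44",
    ".pgm": "c44",
    ".pbm": "cjb2",
}


def encode_command(image: str) -> list[str]:
    """Return the DjVuLibre command that encodes one page image to DjVu."""
    path = Path(image)
    suffix = path.suffix.lower()
    if suffix not in ENCODERS:
        # PNG and other formats need converting first (e.g. with ImageMagick).
        raise ValueError(f"convert {image} to PPM, PGM or PBM first")
    return [ENCODERS[suffix], str(path), str(path.with_suffix(".djvu"))]


print(encode_command("page-001.jpg"))
print(encode_command("page-002.pbm"))
```

The returned list can be passed to subprocess.run once DjVuLibre is installed.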

An important caveat of this process is that high-quality scans come at the cost of larger files, and there is currently a 100 MB limit on uploads to Commons. The size can be substantially reduced by applying foreground/background separation with didjvu and/or minidjvu.

Scripting DjVuLibre
This script allows you to take a whole directory of image files (JPG, PNG, GIF, TIFF, and any other format that ImageMagick can convert to PPM) and convert and collate them automatically into a DjVu file. Currently this script is for Windows, but it can easily be adapted for Linux. To use it, you will need Python, ImageMagick and DjVuLibre.
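One detail any such collating script has to handle is page order: a plain alphabetical sort puts page10 before page2. The following Python fragment is a minimal sketch of a natural sort for the image files in a directory listing. It is not the script referred to above, and the names natural_key and ordered_images are hypothetical.

```python
import re
from pathlib import Path

# Input formats named in the text; extend as needed.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".tif", ".tiff"}


def natural_key(name: str) -> list:
    """Split a name into text and number chunks so page2 sorts before page10."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def ordered_images(names: list[str]) -> list[str]:
    """Keep only image files and return them in page order."""
    images = [n for n in names if Path(n).suffix.lower() in IMAGE_SUFFIXES]
    return sorted(images, key=natural_key)


print(ordered_images(["page10.png", "page2.png", "notes.txt", "page1.jpg"]))
```

The list of names can come from os.listdir on the scan directory; each file in the result would then be converted and encoded in turn.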

Linux

 * See also: User:GrafZahl/How to digitalise works for Wikisource

Method 0 - converting graphic files with foreground/background separation
Just use didjvu.

You may consider preprocessing the scans with Scan Tailor.

Method 1 - page at a time with DjVuLibre
You need the DjVuLibre software, which includes a viewer and some tools for creating and handling DjVu files. You will probably also need ImageMagick for converting scans from one format to another, because the encoders below accept only a few input formats; convert your scans if they are not already in one of these formats.
 * The tool cjb2 is used to create a DjVu file from a (bitonal) PBM or TIFF file.
 * The tool c44 is used to create a DjVu file from a PNM or JPEG file. This handles colour images, but the compression is lower.

Conversion to intermediate format
The DjVu encoders cannot take JP2 or PNG as input, so next you need to convert your scans to a format that they accept. Options include PBM (turns all pixels black or white, no shades of grey); PGM (greyscale, lossless); or JPEG (lossy compression optimized for photographs).
 * Conversion from PNG format to PBM format with the convert tool from ImageMagick:
convert filename-000.png filename-000.pbm


 * Depending on the quality of the original scans, you may find it useful to process them with the unpaper utility, which deletes black borders around the pages and aligns the scanned text squarely on the page. Unpaper is also capable of extracting two separate page images where facing pages of a book have been scanned into a single image. Other utilities are mkbitmap and pdfcrop.pl (Perl-based free software; on Ubuntu it requires Ghostscript and texlive-extra-utils; it uses the BoundingBox and can crop a whole multipage PDF document in a single pass). PDFCrop (a different tool with the same name) deletes white margins.

Conversion to DJVU page file
 * Creation of a DjVu file from a PBM file (this command will not work for PGM or JPG):
cjb2 -clean filename-000.pbm filename-000.djvu

 * Creation of a DjVu file from a PGM or JPEG file:
c44 -dpi 300 p100.jpg p100.djvu
In this example, the image is specified to use a resolution of 300 dpi. The -dpi argument may be left out; the default value is 100.

Creating final DJVU document
 * Adding the DjVu page file to the final document:
djvm -i filename.djvu filename-000.djvu

You need to repeat these steps for each page of the book, which is best done with a script.

There is also another way to add all the *.djvu parts into one:

djvm -c filename.djvu filename-000.djvu filename-001.djvu filename-002.djvu
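
The per-page steps of this method can be generated by a short Python sketch like the one below. It assumes bitonal PNG scans named in the filename-000.png pattern used above; the function name method1_commands is hypothetical, and the commands are only printed, not executed.

```python
from pathlib import Path


def method1_commands(png_pages: list[str], output: str) -> list[list[str]]:
    """Build the convert, cjb2 and djvm commands for a list of page scans."""
    commands = []
    djvu_pages = []
    for png in png_pages:
        pbm = str(Path(png).with_suffix(".pbm"))
        djvu = str(Path(png).with_suffix(".djvu"))
        commands.append(["convert", png, pbm])           # PNG to bitonal PBM
        commands.append(["cjb2", "-clean", pbm, djvu])   # PBM to one DjVu page
        djvu_pages.append(djvu)
    # Bundle all page files into one document in a single step.
    commands.append(["djvm", "-c", output, *djvu_pages])
    return commands


pages = [f"filename-{n:03d}.png" for n in range(3)]
for command in method1_commands(pages, "filename.djvu"):
    print(" ".join(command))
```

To run the steps for real, each command list can be passed to subprocess.run(command, check=True), which stops at the first failing step.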

See the following section for an automated process for multiple pages.

Method 2 - PDF to DjVu bash script
Use this script, which converts a PDF document (multiple or single page) into images, automatically crops them with ImageMagick, converts them to DjVu and bundles them. This is very slow (a large PDF document can require days) but a little more efficient than the following method.

The resulting DjVu document is quite big and low-quality, probably because of poor font recognition, which may be fixed by newer versions of poppler (the library it uses): the version available in repositories is usually several months old.

You can also remove the pdftoppm part and use the script to convert multiple images directly into a multi-page DjVu document. If the images are not in PBM format, you can convert them with a single command using mogrify from ImageMagick.

Method 3 - pdf2djvu
Install the pdf2djvu tool from your repository to convert a PDF document (single or multiple pages) directly into DjVu.

If the document contains the results of OCR (as is the case e.g. with FineReader output), they are preserved in the DjVu document as the hidden text layer. Some other properties of the source document, including metadata, are also preserved. The quality and size of the output depend primarily on the features of the source document but can also be controlled with several program parameters, such as the resolution of the foreground and background. The program can use several threads to speed up the conversion.

As of 2019, file size on Wikimedia Commons is less important than image quality (although PDFs around 1 GiB in size can have problems with thumbnails). The simplest way to increase quality is to change --bg-subsample (default 3, max 12) to 2 or 1 (best quality).

An example command may therefore be: pdf2djvu -j0 --bg-subsample=1 -o output.djvu input.pdf
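
When the conversion is scripted, the same command line can be assembled in Python. The sketch below is illustrative only: the function name pdf2djvu_command is hypothetical, and the default of jobs=0 (let pdf2djvu choose the number of threads) is a choice made for this example, not the tool's own default.

```python
def pdf2djvu_command(pdf: str, djvu: str,
                     bg_subsample: int = 3, jobs: int = 0) -> list[str]:
    """Build a pdf2djvu command line; lower bg_subsample means better quality."""
    if not 1 <= bg_subsample <= 12:
        raise ValueError("bg_subsample must be between 1 and 12")
    if jobs < 0:
        raise ValueError("jobs must be 0 (automatic) or a positive number")
    return ["pdf2djvu", f"-j{jobs}", f"--bg-subsample={bg_subsample}",
            "-o", djvu, pdf]


print(" ".join(pdf2djvu_command("input.pdf", "output.djvu", bg_subsample=1)))
```

The range check catches a mistyped subsampling value before a long conversion is started.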

Note on cropping
With pdf2djvu, you need to crop the PDF itself before the conversion. On Linux this may be quite difficult. You could use ImageMagick, but beware: with a big multi-page PDF document, this can take several GB of memory (the limit is 16 TB!) and bring your computer to a halt unless you cap resource use with the -limit option. This makes the conversion very slow.

When using ImageMagick, the resulting PDF document is increased in size and reduced in quality because of rasterisation.

See other crop tools above.

Method 4 - DjVuDigital
Use djvudigital, which like pdf2djvu converts PDF directly to DjVu. There are licensing problems: because the GSDjVu library has a different license, you will need to compile it yourself. The included utilities make this step quite easy, but it is still long (about 1 hour) and a bit annoying.

After that, you can convert a PDF document into DjVu with a single command (see the previous section for cropping). The conversion is slow (a 300-page PDF document takes about 30-40 minutes). The resulting DjVu is of higher quality and smaller file size than with either of the previous two methods. Additionally, DjVuDigital can handle JPEG2000 (aka JPX) files embedded in PDF documents, which is a feature of many Google books. pdf2djvu, Any2Djvu and Internet Archive conversions all fail to convert these files, leaving blank pages in the output.

DjVuDigital has many advanced options to improve results, but they can be difficult to master. In general, altering the --dpi option can give you a quick reduction in file size without too much fiddling.
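
As a rough guide to what changing --dpi does, the number of pixels per page scales with the square of the resolution. The Python sketch below builds a djvudigital command and computes that ratio. The function names and the default of 300 dpi are assumptions made for this example, and the actual file size will not shrink by exactly the same factor.

```python
def djvudigital_command(pdf: str, djvu: str, dpi: int = 300) -> list[str]:
    """Build a djvudigital command line with an explicit resolution."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return ["djvudigital", f"--dpi={dpi}", pdf, djvu]


def pixel_ratio(new_dpi: int, old_dpi: int) -> float:
    """Relative pixel count per page when going from old_dpi to new_dpi."""
    return (new_dpi / old_dpi) ** 2


print(" ".join(djvudigital_command("input.pdf", "output.djvu", dpi=300)))
print(pixel_ratio(300, 600))  # a quarter of the pixels at half the resolution
```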

Any2Djvu
Another method to convert the images to DjVu is to zip them and use the Any2Djvu site to create the DjVu file. Any2Djvu will extract the images from the zip and create an OCRed DjVu file. OCR works only with English text.

Any2Djvu cannot handle huge files. Big files are best dealt with if you upload them by URL (e.g. by entering a link like ftp://ftp.bnf.fr/005/N0051165_PDF_1_-1DM.pdf). Conversion can take several hours. Any2Djvu will sometimes run out of memory on large or highly-detailed files and fail. It will also not convert "JPX" images embedded into PDF documents, which are common in Google Books scans.

The Internet Archive
Another method is to upload a PDF document (or archive of image files) to the Internet Archive. You need to log in (don't use OpenID, it won't work).

Uploading
Click "Upload" at the top-right corner. The Flash uploader (standard "Share" button) won't work with Firefox (use Opera or Internet Explorer instead) or on Linux. You can use the standard JavaScript non-Flash method (there is a file size limit of 2 GB with Firefox, but not with Chromium). FTP upload is deprecated because it is slower and prone to crashing, but it is the only easy-to-learn option if you have to upload many files (which shouldn't be the case here).

OCR tricks
When the upload has been completed, the Internet Archive will start the "derive" work: OCR to create an XML document of detected text based on the uploaded PDF file, then conversion of that to a DjVu file with embedded text, creation of plain text-only dump file, among others.

Don't forget to set the correct language in the metadata before starting the derive (which is run automatically after upload if there's something to derive), otherwise the OCR language will be set to English and results will be poor for works in any other language. It's not possible to set multiple OCR languages, but you can upload the same book twice, once with each language, to get two OCRs. The processing time depends on the size and complexity of your file, as well as the current Internet Archive backlog of conversion tasks. You can check your progress in the queue and see more detailed information about the jobs you submitted (you must be logged in).

The Internet Archive uses professional, proprietary ABBYY software, which gives quite good images and OCR output in many languages and fonts, together with aggressive compression that maintains a high quality in the final DjVu file. However, the Internet Archive sometimes produces over-compressed DjVu files with poor quality. If this happens, you can often download a PDF document and convert it manually. You can also reduce the resolution the derivation aims at, which is normally set automatically by some "guessing", by setting the relevant metadata field to 300 (dpi) or lower; this reduces sizes, processing time and (sometimes) errors.

Image formats
Book scans split into several tiff, jpg, jp2 format images (other formats are not accepted) are converted ("derived") as well, if you put them in a properly created tar or zip archive. It's usually better to upload uncompressed scans or JPEGs; the jp2 files produced in the derivation process are compressed in a way you won't be able to emulate without a lot of effort.

Troubleshooting
If you have severe problems with your deriving process and you need admin intervention (tasks shown in red in your tasks list), ask for help at info@archive.org; they're usually amazingly helpful. General requests for help should be placed in the forums though, so don't bother them for nothing!

OCR via Any2DjVu
The OCR option available at the free conversion service Any2DjVu performs OCR on the scanned image, but the resulting text is embedded within the .djvu file itself and must be extracted before it can be used on Wikisource.

One way to do this is to use the DjVuLibre software to extract the text with its djvutxt or djvused command-line tools.

JVbot can automatically upload the text layer of a DJVU to the pages on Wikisource. For example, Robert the Bruce and the struggle for Scottish independence - 1909.

OCR via the Internet Archive
See above: if you upload a DjVu file, the derive process will OCR it.

OCR with Tesseract
OCR can be done with Tesseract, a free OCR software, and a script:


 * OCR with Tesseract. Perl script.
 * OCR with Tesseract (Python), slightly more user-friendly Python script. Based on the Perl script.

OCR with Tesseract 3.x and other free OCR engines
Use ocrodjvu.

Linux
To extract images from a DjVu file, you can use ddjvu:
ddjvu -page=8 -format=tiff myfile.djvu myfile.tif

If you have extracted all the pages (without -page), you can split the multi-page TIFF into single-page PNG files (or any other format):
convert -limit area 1 myfile.tif myfile.png

To extract all pages to single-page TIFF files with 80% quality:
ddjvu -format=tiff -eachpage -quality=80 myfile.djvu myfile-%03d.tiff
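
For scripted extraction, the ddjvu options used above can be assembled in Python. This is a minimal sketch; the function name ddjvu_command and its parameters are hypothetical, and it only builds the command line.

```python
from typing import Optional


def ddjvu_command(djvu: str, output: str, fmt: str = "tiff",
                  page: Optional[int] = None, each_page: bool = False,
                  quality: Optional[int] = None) -> list[str]:
    """Build a ddjvu command line for extracting page images."""
    command = ["ddjvu", f"-format={fmt}"]
    if page is not None:
        command.append(f"-page={page}")
    if each_page:
        # One output file per page needs a printf-style pattern such as %03d.
        if "%" not in output:
            raise ValueError("output needs a pattern like myfile-%03d.tiff")
        command.append("-eachpage")
    if quality is not None:
        command.append(f"-quality={quality}")
    return command + [djvu, output]


print(" ".join(ddjvu_command("myfile.djvu", "myfile.tif", page=8)))
print(" ".join(ddjvu_command("myfile.djvu", "myfile-%03d.tiff",
                             each_page=True, quality=80)))
```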

Manipulating
There's some advice about manipulating DjVu files or images to be used to generate DjVu elsewhere:
 * (second bullet point)
 * User:GrafZahl/How to digitalise works for Wikisource/pbmextract.c
 * Help:DjVu files/other pages
 * fr:Aide:Créer_un_fichier_DjVu/Linux

Splitting DjVu files
The DjVu documents come in two flavours: bundled and unbundled (indirect); the latter format stores every page in a separate file. The comment below made by the original author concerns only bundled documents, which should be avoided.

Large works cannot be uploaded onto Wikimedia servers, which have a 100 MB upload limit. To split the DjVu, use DjVuLibre "Save as", and specify a page range which will produce a file small enough to be uploaded. Some trial and error may be necessary.

The easiest way to split DjVu files from the command line is with djvmcvt:
mkdir mydoc/ && djvmcvt -i 'mydoc.djvu' 'mydoc/' 'new-mydoc-index.djvu'

Alternatively, djvused can be used from the command line:
djvused myfile.djvu -e 'select 10; save-page-with p10.djvu'

This can be done for every page. To find the number of pages in the file:
djvused myfile.djvu -e 'n'
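
Splitting a large document into uploadable parts can be planned with a short Python sketch: divide the page count into ranges, save each page of a range with djvused, then bundle the saved pages with djvm -c. The function names, the p<N>.djvu page file names and the part file name are hypothetical, and the right number of pages per part still has to be found by trial and error.

```python
def page_ranges(total_pages: int, pages_per_part: int) -> list[tuple[int, int]]:
    """Split pages 1..total_pages into consecutive inclusive ranges."""
    return [(start, min(start + pages_per_part - 1, total_pages))
            for start in range(1, total_pages + 1, pages_per_part)]


def split_commands(source: str, first: int, last: int,
                   part_name: str) -> list[list[str]]:
    """Commands that save pages first..last and bundle them as part_name."""
    commands = []
    page_files = []
    for n in range(first, last + 1):
        page_file = f"p{n}.djvu"
        commands.append(["djvused", source, "-e",
                         f"select {n}; save-page-with {page_file}"])
        page_files.append(page_file)
    commands.append(["djvm", "-c", part_name, *page_files])
    return commands


print(page_ranges(250, 100))
for command in split_commands("myfile.djvu", 1, 3, "myfile-part1.djvu"):
    print(command)
```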

Removing a copyright page
Many of the already-created djvu files available at archive.org and other sites have the Google copyright page attached to the front of the document. Wikimedia policy, based on an analysis of the underlying law, does not accept that copyright is established on a public domain work simply by scanning or copying it or taking a two-dimensional photograph that faithfully represents its subject. See Wikimedia Commons for more information about scans, artwork and the position of the WMF.

Such copyright pages and other extraneous material can be removed with DjVuLibre, an open source program maintained by the inventors of DjVu under the GNU General Public License. Binaries are available for Windows, Mac, Linux, Solaris, and IRIX. It includes djvm.exe, which is run as a command-line utility. If you cannot figure out how to do this, you can message Mkoyle (talk), and he will do it for your file and email the file to you for upload. The command line to delete (-d) the first page (1) is as follows:

djvm -d filename.djvu 1

Inserting new pages (e.g. a placeholder)
If a DJVU file is missing pages, you can insert placeholders, so that if the pages are found and inserted later, existing pages won't need to be moved. You can use File:Generic placeholder page.djvu for the placeholder.

djvm -i main_document.djvu placeholder_file.djvu page_number

Note: work backwards from the last missing page in the file, to avoid having to recalculate the page numbers as you insert pages.
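
The advice to work backwards can be sketched in Python. The function below takes the positions in the current, incomplete file at which a placeholder should be inserted and emits the djvm -i commands in descending order, so that each insertion leaves the lower positions unchanged. The function name placeholder_commands and the example positions are hypothetical.

```python
def placeholder_commands(document: str, placeholder: str,
                         positions: list[int]) -> list[list[str]]:
    """Build djvm -i commands, highest position first.

    Each position is a page number in the current file; the placeholder
    becomes that page and the existing page moves one place later.
    """
    if any(position < 1 for position in positions):
        raise ValueError("page positions start at 1")
    return [["djvm", "-i", document, placeholder, str(position)]
            for position in sorted(positions, reverse=True)]


for command in placeholder_commands("main_document.djvu",
                                    "placeholder_file.djvu", [12, 39]):
    print(" ".join(command))
```

A position listed twice produces two insertions at the same place, which covers two consecutive missing pages.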

Displaying a particular page
The file link accepts a named parameter "page" so that, for example, this wiki code displays the image of page 164 of the file Emily Dickinson Poems (1890).djvu on the right, 150 pixels wide (the rear cover of the book, containing no text):

[[File:Emily Dickinson Poems (1890).djvu|right|150px|page=164]]

The page image can be displayed in the book's Wikisource main space, as with Personal Recollections of Joan of Arc/Book I/Chapter 2, using: