Wikisource:DjVu vs. PDF

The ProofreadPage Extension is the backbone of Wikisource's workspace (the Index namespace and the Page namespace) in which proofreading takes place. The extension uses a scanned file of a physical work to create an Index page, from which individual Page pages are created; these are eventually transcluded into the main namespace for anyone to read.

These days, the ProofreadPage Extension can use different formats for the scanned file. This was not always the case. When the extension was first written, only DjVu files met our needs, so DjVu was the only format the software supported. Since then, the status of the PDF format has changed; it is now acceptable for Wikisource's purposes and supported by the extension.

For historical reasons, DjVu files are still preferred on Wikisource, but either DjVu or PDF files can be used and each format has its advantages. This leaves the question: which file format should be chosen when starting a new transcription project?

DjVu
DjVu is an open container format that holds page images and text in order to replicate scanned documents. Because it was an open format from the start, DjVu files were allowed to be hosted on, and supported by, Wikimedia. DjVu has been supported since the beginning of the ProofreadPage extension and was, for a long time, the only format available for proofreading on Wikisource. This head start is part of why DjVu is still the most popular format on the project.

Advantages

 * Compatible philosophy: The DjVu format is and always has been an open format, while PDF was originally a proprietary format owned by Adobe. PDF became an open standard (ISO 32000) in 2008; it is no longer controlled by Adobe and does not require royalty payments from implementers, but it is often still considered less open than DjVu.
 * Smaller files: DjVu files are generally smaller than equivalent PDF files. Wikimedia Commons used to have a 100 MB limit, and earlier it was even smaller than that, but the limit is now 4 GB (with most uploading methods). Nevertheless, smaller files can be easier to work with because various processes (extracting images, etc.) are quicker.
 * Tried and tested: DjVus have been in use on Wikisource for longer than PDFs. It is more likely that any problems that can occur have already occurred and have been solved. DjVu files are less likely to cause problems with the ProofreadPage Extension or Wikisource in general.

Disadvantages

 * Lower resolution: DjVu files generally have a lower resolution than PDFs. For the most part, this is not a problem for proofreading as long as the text is legible. It can be a problem with smaller text or text on the borderline of legibility. It may also be a problem for illustrations and other images, if the images are being extracted from the file.
 * Glyphs: Due to the compression system used in DjVu files, some glyphs (letters, numbers and other symbols) may be incorrect. The pixels of each page are divided into symbols and a dictionary of these symbols is then created. The pages are then encoded by describing which symbols (from the dictionary) appear where. If two similar but different glyphs are matched to the same dictionary symbol, one will be displayed in place of the other, so the page image may not always be an exact representation of the original work. Note that this should only happen with poor quality, lossy compression.
 * Less external support: The DjVu format is not as widely supported as PDF. There is less software available for creating and editing files in this format.
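
The glyph substitution described under "Glyphs" above can be illustrated with a toy model. The following Python sketch is only a simplified illustration of symbol-dictionary compression, not the real DjVu encoder; the bitmaps, the `tolerance` parameter and the function names are all hypothetical and chosen for this example. With a tolerance of zero, only identical shapes share a dictionary entry and the page decodes exactly; with a looser tolerance, a "c" and an "e" collapse into one symbol and the decoded page no longer matches the original.

```python
from typing import List, Tuple

# A glyph is a tiny bitmap: one string per row, '#' = ink, '.' = blank.
Glyph = Tuple[str, ...]

C: Glyph = (".##.", "#...", "#...", "#...", ".##.")  # rough letter "c"
E: Glyph = (".##.", "#..#", "####", "#...", ".##.")  # rough letter "e"


def hamming(a: Glyph, b: Glyph) -> int:
    """Count the pixels that differ between two same-sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def encode(glyphs: List[Glyph], tolerance: int) -> Tuple[List[Glyph], List[int]]:
    """Build a symbol dictionary and encode the page as indexes into it.

    A glyph reuses an existing dictionary symbol when at most `tolerance`
    pixels differ. tolerance=0 merges only identical shapes (lossless).
    """
    dictionary: List[Glyph] = []
    indexes: List[int] = []
    for glyph in glyphs:
        for i, symbol in enumerate(dictionary):
            if len(symbol) == len(glyph) and hamming(symbol, glyph) <= tolerance:
                indexes.append(i)  # close enough: reuse the stored symbol
                break
        else:
            dictionary.append(glyph)  # new shape: add it to the dictionary
            indexes.append(len(dictionary) - 1)
    return dictionary, indexes


def decode(dictionary: List[Glyph], indexes: List[int]) -> List[Glyph]:
    """Rebuild the page by looking each index up in the dictionary."""
    return [dictionary[i] for i in indexes]


page = [C, E, C, E]

lossless_dict, lossless_idx = encode(page, tolerance=0)
lossy_dict, lossy_idx = encode(page, tolerance=4)

print("lossless round trip exact:", decode(lossless_dict, lossless_idx) == page)
print("lossy round trip exact:", decode(lossy_dict, lossy_idx) == page)
```

In this sketch the lossy setting stores a single symbol for both letters, which saves space but shows every "e" as a "c". This is the kind of error that makes it worth checking doubtful characters against another copy of the scan when a DjVu has been heavily compressed.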

PDF
In the early days of Wikimedia, PDFs were not allowed to be hosted on Commons or Wikisource because PDF was not an open standard. This ended in 2008, when Adobe released the format as an open standard. Since this change, PDFs can be freely uploaded to Wikimedia Commons, and the ProofreadPage extension has been adapted to be compatible with the format.

Advantages

 * Higher resolution: PDFs generally have a higher resolution than DjVu files. This may not make much difference for transcription on Wikisource, where the text only needs to be legible, but PDFs are clearer when the text is particularly small or on the borderline of legibility. It may also be an advantage for illustrations and other images, if the images are being extracted from the file.
 * Wider external support: PDFs are more widely known and supported than DjVu files. Many users can find it easier to create and edit PDFs for this reason and PDFs may be easier to acquire than DjVu files.

Disadvantages

 * Bugs: As a later addition to the original software, PDF support is more likely to exhibit "buggy" behaviour and less likely to be as fully tested as DjVu support. Known bugs include:
   * The inability of the ProofreadPage Extension to recognise accents and diacritics in PDFs.
   * Some recent versions of PDF cannot be read properly by Wikimedia's Ghostscript software. In these cases, all pages appear blank when viewed on a Wikimedia website.
 * Larger file size: PDFs are generally larger than an equivalent DjVu file. Wikimedia Commons used to have a 100 MB file size limit, and an even smaller one before that, which was a problem for very large documents. The limit is now 4 GB (with most uploading methods), but larger files can still be slower to work with.
 * Expensive software: Editing a PDF can still require more expensive, proprietary software, whereas DjVu is free in both senses of the word.

Other options
In some cases, Index pages can be created from individual page images rather than from a single source file. The only real advantage of this is that available resources can be used without first converting them to a different format. In return, the Index page itself is more difficult to set up. This approach is not recommended for works with more than a small handful of pages.