User:Peteforsyth/Images from Internet Archive

Many of the works on Wikisource are derived from scans at the Internet Archive. Typically, we use the DJVU (or PDF) files to create scan-backed texts. These files are primarily designed to present an accurate version of the text, and are optimized for size; so they generally contain very low quality images.

If a work contains photographs or other images, we should upload higher quality versions of these, and then include them in the transcribed work. Once uploaded, an additional benefit is that these images will be available for other uses in the Wikimedia projects and elsewhere.

The Internet Archive offers the high quality original page scans, in addition to the DJVU or PDF files we use for scan-backing. This page describes how to access these files, and how to convert them from the high quality but inconvenient JPEG 2000 (JP2) format to PNG.

It is likely possible to do all the requisite tasks on Windows or macOS, but it is quite convenient to accomplish them with free tools on Linux, so these instructions use the Linux command line. If you don't have access to a Linux machine, please let me know; there may be options.

Find the high quality scans
Most books available in DJVU or PDF format from Internet Archive (IA) have high quality raw page scans available; these are the scans that were used to produce the (compressed) PDF and DJVU files, which are optimized for file size and readability of the text (not for image quality). The links below use Oregon Historical Quarterly/Volume 1 as an example.

First, go to the book's IA page. The file's page on Wikimedia (accessible by clicking the preview on its "source" or "index" page) should have a link to the IA page. On that page, scroll down past the preview. On the right-hand side you'll see "Download Options," and at the bottom of that section, "Show All." Click that link.

On the "Show All" page, you will typically find two lines with "jp2" in the filename. In this case, they are:


 * oregonhistorical01oreguoft_jp2.zip (View Contents)
 * oregonhistorical01oreguoft_raw_jp2.zip (View Contents)

The "raw" archive holds the original scans; the other one may have light editing, such as a loose crop and page rotation. Either one should be fine for our purposes.

If you are looking for one particular page, click one of the "View Contents" links. This lists every page in the book, and on each line the "jpg" link lets you preview the file. The preview makes it easy to confirm exactly which page you want, which matters because the page numbering is sometimes slightly offset from the numbering on Wikisource.

Download the scan(s)
Once you have found the page scans, decide whether you are interested in a small number of pages (which you can download individually) or whether you want a substantial portion of the book (e.g., if the book has 100 photos, and you intend to upload them all to Wikimedia Commons). I typically download the entire book, and then remove the page scans for pages that are all text prior to converting.

Note that these archive files can be quite large, and they will be much larger (as much as 20x the size) once decompressed. So consider your bandwidth and the available space on your drive.
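A quick way to sanity-check the space before unpacking is sketched below. The helper names `free_kb` and `fits_on_disk` are made up for illustration, and the factor of 20 is simply the rough upper estimate mentioned above, not a measurement.

```shell
# Free space, in KB, on the filesystem that holds the given directory.
free_kb() {
    df -Pk "${1:-.}" | awk 'NR == 2 { print $4 }'
}

# Succeeds if an archive of the given size (KB) should still fit after
# growing 20x on decompression. The factor 20 is an assumed upper estimate.
fits_on_disk() {
    archive_kb="$1"
    available_kb="$2"
    [ $(( archive_kb * 20 )) -le "$available_kb" ]
}

# Example, once the archive is on disk:
# fits_on_disk "$(du -k oregonhistorical01oreguoft_jp2.zip | cut -f1)" "$(free_kb .)" \
#     && echo "enough room to unzip"
```

If the check fails, downloading individual pages or working through the book in chunks avoids filling the drive.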

Use the wget command, as follows.

To download an entire archive:
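One way this might look is sketched here. It assumes the archive is served from the usual `https://archive.org/download/<identifier>/<file>` address and is named after the item identifier, as in the example above; the function names are hypothetical.

```shell
# Hypothetical helper: print the download URL of the processed JP2 archive
# for an IA item, assuming the https://archive.org/download/<id>/<file> layout.
ia_jp2_zip_url() {
    printf 'https://archive.org/download/%s/%s_jp2.zip\n' "$1" "$1"
}

# Download the whole archive; -c resumes a partial file if one already exists.
download_jp2_zip() {
    wget -c "$(ia_jp2_zip_url "$1")"
}

# Example:
# download_jp2_zip oregonhistorical01oreguoft
```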

Or, to download one or more files:
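A minimal sketch for individual pages follows. It assumes you have copied the page URLs from the "View Contents" listing; the function names and the file name `pages.txt` are hypothetical.

```shell
# Download each URL given as an argument; -c resumes interrupted downloads.
download_pages() {
    wget -c "$@"
}

# Alternatively, keep one URL per line in a text file and let wget read
# the list with -i.
download_page_list() {
    wget -c -i "$1"
}

# Examples:
# download_pages "URL-of-first-page" "URL-of-second-page"
# download_page_list pages.txt
```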

Note that the wget command has some nice options. It is designed so that you can run it independently of your login session (so, for instance, you can start it running, log out, and come back the next day). You can resume partial downloads, and you can limit the download speed if you're worried about hogging your network's bandwidth. See the wget manual for more information.
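The options mentioned above can be combined as in the following sketch. The function name `polite_download` and the 500k default rate are arbitrary choices for illustration.

```shell
# Background, resumable, rate-limited download.
#   -b            detach and keep running; progress goes to a log file (wget-log)
#   -c            continue a partially downloaded file
#   --limit-rate  cap the transfer speed (the 500k default here is arbitrary)
polite_download() {
    url="$1"
    rate="${2:-500k}"
    wget -b -c --limit-rate="$rate" "$url"
}

# Example:
# polite_download "URL-of-the-archive" 1m
# tail -f wget-log    # check on progress later
```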

You can unzip a zip archive as follows:
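For instance, something along these lines should work, assuming the standard `unzip` tool is installed. The function name `unpack_scans` is hypothetical.

```shell
# Unpack an archive into its own directory so the page scans stay together.
# -q keeps unzip quiet; -d names the target directory (default: the archive
# name without its .zip ending).
unpack_scans() {
    zip_file="$1"
    dest="${2:-${zip_file%.zip}}"
    mkdir -p "$dest" && unzip -q "$zip_file" -d "$dest"
}

# To see the uncompressed total before unpacking:
# unzip -l oregonhistorical01oreguoft_jp2.zip | tail -n 1
#
# Example:
# unpack_scans oregonhistorical01oreguoft_jp2.zip
```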

Remove the non-image pages
Let's assume you downloaded and unzipped an entire archive in the last step. Since converting the files can take a fair amount of time and disk space, in most cases, you will want to delete all the files you don't care about prior to conversion.

First, be sure you are in the directory with all the files.

Now, delete the files you won't need. If you're on the command line, you'll use the rm command. At this stage, though (as a bit of a command line novice), I will often load the directory in a GUI (either directly on the Linux machine, or by mounting a file share from it on a Mac or PC), and use more familiar tools to remove ranges of files. Use whatever tools are available and familiar for this step.

In our example, you can see from the book's index page that the first photo plate occurs after page 340. However, the numbering on the index page reflects the page numbers in the original book; hover over the pages to see what the scan number will be. In this case, we will remove all pages up to and including scan number 354. Remember you can consult the "View Contents" links to confirm which page numbers correspond to which pages. Then, remove the other pages.
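Removing a range of scans from the command line could be sketched as follows. This assumes the files are named `<prefix>_NNNN.jp2` with a four-digit, zero-padded scan number; check your own directory listing before relying on that. The function name `remove_scans` is hypothetical.

```shell
# Delete scans numbered first..last (inclusive) from the current directory.
# Pass the numbers without leading zeros (e.g. 8, not 0008).
remove_scans() {
    prefix="$1"
    first="$2"
    last="$3"
    n="$first"
    while [ "$n" -le "$last" ]; do
        rm -f -- "$(printf '%s_%04d.jp2' "$prefix" "$n")"
        n=$(( n + 1 ))
    done
}

# Example: drop everything up to and including scan 354.
# remove_scans oregonhistorical01oreguoft 0 354
```

Since rm does not ask for confirmation, it is worth listing the directory first and double-checking the range.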

Convert from JP2 to PNG
Now, convert the files. I always convert to PNG, because it's a format that offers lossless compression. You could use JPG, but there will be some amount of degradation in the file.

Keep in mind that each file will expand a great deal, maybe as much as 20x the size. So keep an eye on your disk space before proceeding (and consider breaking the job up into chunks).

The following command (which will require you to install the proper package) operates on the directory you're currently in, indicated by the "." It will convert every JP2 file in the directory to PNG. It will take a while to run; you might want to watch it and then cancel the job (with control-c) after one or two conversions, to ensure that the output is what you expect before proceeding.
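One way such a conversion might look is sketched below. It assumes the `opj_decompress` tool from the OpenJPEG utilities is installed (the package name varies by distribution); other converters would work too. The function name `convert_jp2_dir` is hypothetical.

```shell
# Convert every JP2 file in a directory (default: the current one, ".")
# to a PNG with the same base name. The JP2 originals are left in place.
convert_jp2_dir() {
    dir="${1:-.}"
    for src in "$dir"/*.jp2; do
        [ -e "$src" ] || continue      # nothing matched the pattern
        dst="${src%.jp2}.png"
        echo "converting $src"
        opj_decompress -i "$src" -o "$dst" || return 1
    done
}

# Example:
# convert_jp2_dir .
```

Printing each file name as it is processed makes it easy to stop after the first one or two files and inspect the PNGs before letting the whole job run.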

Further details here: Scriptorium/Help/Archives/2018

Basic processing of the PNG files
Now that you have a directory full of PNG (and JP2) files, you'll probably want to tidy them up a little prior to upload. I like to make sure to upload a full resolution version, with very generous cropping, first; after that, I might make some adjustments (a tighter crop, blurring out halftone dots, etc.) and then upload a new version. That way, the file history on Commons still contains the full quality version, in case somebody else needs it; you'll save them the effort of accessing and converting the original from the IA servers like you just did.

The IA files often have lots of extra space around the edges, and even when they are of monochrome photos, they are typically full color scans. So a loose crop, and conversion to greyscale, are examples of quick edits you might want to make to reduce file size and make the files a bit more usable.
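As a rough sketch, both edits can be done in one step with ImageMagick, assuming its `convert` command is available (newer releases call it `magick`). The function name and the 100-pixel margin are placeholders; the right margin depends on the scan.

```shell
# Loose crop plus greyscale, written to a new file so the full-resolution
# original is kept. -shave trims the given number of pixels from each edge.
crop_and_grey() {
    src="$1"
    dst="$2"
    margin="${3:-100}"
    convert "$src" -shave "${margin}x${margin}" -colorspace Gray "$dst"
}

# Example:
# crop_and_grey plate_0355.png plate_0355_grey.png 150
```

An even crop like this is only a starting point; plates that sit off-center on the page will still need a manual crop in an image editor.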

When saving a PNG file, you may choose the level of compression; the more compression you use, the longer it will take to save, and the more disk space will be saved. There is not a trade-off with file quality, as with JPGs; a highly compressed PNG is in no way inferior to an uncompressed file.

Prior to upload, you may also want to optimize the PNGs, which will save more space. (Note, I believe this is a separate function from compression, but I'm not sure.) A Linux command-line PNG optimizer will further reduce the file size.
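One tool that performs this kind of lossless optimization is `optipng`; naming it here is an assumption, not something stated in the text above. A minimal sketch, with a hypothetical function name:

```shell
# Losslessly recompress every PNG in a directory with optipng, in place.
# -o2 is a moderate optimization level; higher levels are slower.
optimize_pngs() {
    dir="${1:-.}"
    for png in "$dir"/*.png; do
        [ -e "$png" ] || continue      # nothing matched the pattern
        optipng -quiet -o2 "$png" || return 1
    done
}

# Example:
# optimize_pngs .
```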

Upload to Wikimedia
Consider the file names, either prior to or during upload. Ideally, according to Commons policy, each image should have a file name specific to what it depicts (rather than merely the name and page number of the book). You may wish to cut some corners for expedience, but consider it carefully prior to uploading a batch of files.

Categories: Be sure, at minimum, to add all your uploads from a certain book into one category. This will ease any organizing or processing after upload.