Help:Internet Archive



The Internet Archive is a non-profit digital library that holds nearly 3 million digitised books as well as music, audio, video and other files. It is one of the main sources of DjVu files for use on Wikisource. As well as files based on their own scans, the Internet Archive will also derive files (including DjVu files) from scans uploaded by its users. This can be a useful way to convert user-made scans into a DjVu file compatible with Wikisource (as well as making the work available for others).

This help page focuses on DjVu files, because that is the most used file type on Wikisource, but the process can be used for any other file type available from the Internet Archive.

Searching

 * 1. Go to the Internet Archive
 * 2. Search for the book (or other text) you want. The basic search has a text field and a drop-down list.  Type the title of the book in the text field and set the drop-down to "Texts".
 * 3. Click "Go"


 * 4. If the correct file is on the Archive, you should see it in the search results. If there are multiple appropriate files, select the one you deem the best. This is subjective, but a clear scan will work best for proofreading, so aim for the best quality available (also note that some scans may have dirt or writing on the pages, which may or may not make proofreading harder). Different scans may come from different editions. If so, it is up to you which one you pick, but the earliest edition available is a popular choice.


 * 5. If unsuccessful, you can also try following links, searching by subject, searching by author, or using the Advanced Search function.
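The Advanced Search mentioned in step 5 can also be reached through a URL. The following is only a rough sketch: it assumes the Archive's `advancedsearch.php` endpoint accepts the `q`, `fl[]`, `rows` and `output` parameters as it does at the time of writing, and the function name `build_text_search_url` is hypothetical. It only builds the search URL; it does not contact the Archive.

```python
from urllib.parse import urlencode


def build_text_search_url(title: str, rows: int = 20) -> str:
    """Build an Advanced Search URL for texts whose title matches `title`."""
    # Drop double quotes so the title cannot break out of the quoted phrase.
    cleaned = title.replace('"', " ").strip()
    # Restrict the search to texts, like the "Texts" drop-down in step 2.
    query = f'title:("{cleaned}") AND mediatype:texts'
    params = [
        ("q", query),
        ("fl[]", "identifier"),  # fields to return for each result
        ("fl[]", "title"),
        ("rows", str(rows)),     # maximum number of results
        ("output", "json"),
    ]
    return "https://archive.org/advancedsearch.php?" + urlencode(params)


print(build_text_search_url("Treasure Island"))
```

Opening the printed URL in a browser should list matching items; each `identifier` in the result corresponds to a details page on the Archive.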

If you didn't find the intended book but found another that would be interesting to work on, it is strongly recommended that you check whether it is suitable for Wikisource in licensing terms (e.g., whether it is a public domain work or licensed under a compatible copyleft licence). The Internet Archive accepts contributions that are still in copyright or under restrictive licensing terms, but Wikisource will not accept works simply because they are available on archive.org; they must also meet licensing requirements.

DjVu file
Note: DjVu files are no longer created for new uploads to the Internet Archive as of March 2016, so you may not find one if the book was uploaded after this date.

The DjVu file can be downloaded (and uploaded to Wikimedia Commons) by following the steps below or manually tweaking the URL to the default DjVu URL format.


 * 1. On the right of the lower half of the details page is a box titled "DOWNLOAD OPTIONS", as shown in Fig. 3. This section of the page will probably not be visible until you scroll down past the document viewing area.
 * 2. Click on the link indicated by the red arrow to get to the list of files in Fig. 4.


 * 3. This will open a list of files, as shown in Fig. 4.
 * 4. Locate the file with the .djvu suffix. This is indicated by the red arrow in Fig. 3.
 * Other files can be downloaded instead of the DjVu. If required, proceed with the most appropriate file from the list.
 * An alternative format for text is PDF, with the .pdf suffix.
 * Audio files in the Ogg Vorbis format have the .ogg suffix.
 * Video files in the Ogg Theora format have the .ogv suffix.
 * The original scans are available from this list as well. In this example, one of the files is an archive of JPEG 2000 images of individual pages. This can sometimes be useful as it will contain high quality versions of illustrations, photographs and other elements of the book.
 * 5. This is the file that needs to be uploaded to Wikimedia Commons. See Uploading (below).

OR

The DjVu file download link can be retrieved by manually tweaking the book URL to the default DjVu download URL format.
 * https://archive.org/details/$File$ to https://archive.org/stream/$File$/$File$.djvu
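The URL tweak above can be sketched in a few lines of Python. This is a minimal illustration, assuming the DjVu file is named after the item identifier exactly as in the pattern above; the function name `djvu_url_from_details` and the identifier `examplebook00auth` are hypothetical. As noted earlier, items uploaded after March 2016 may have no DjVu file, in which case the resulting link will not work.

```python
from urllib.parse import urlparse


def djvu_url_from_details(details_url: str) -> str:
    """Rewrite https://archive.org/details/<id> to the default DjVu URL."""
    segments = [s for s in urlparse(details_url).path.split("/") if s]
    # A details URL looks like /details/<identifier>[/...]
    if len(segments) < 2 or segments[0] != "details":
        raise ValueError(f"not a details URL: {details_url!r}")
    identifier = segments[1]
    return f"https://archive.org/stream/{identifier}/{identifier}.djvu"


# "examplebook00auth" is a made-up identifier used only for illustration.
print(djvu_url_from_details("https://archive.org/details/examplebook00auth"))
# prints https://archive.org/stream/examplebook00auth/examplebook00auth.djvu
```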

Uploading
There are three main ways to upload the file to Wikimedia Commons.

One: IA Upload tool
The IA Upload tool is currently the easiest way to upload files from archive.org to Wikimedia Commons. You can check or contribute to its open source code.


 * 1) Go to IA-Upload. It will request an "OAuth" (permission to have restricted access) from your account on Wikimedia Commons at each run.
 * 2) Insert the archive.org identifier-access (the $File$ portion of the URL, as in https://archive.org/details/$File$) in the first field.
 * 3) Insert the desired filename for the file to be uploaded on Commons in the second field, without the "File:" prefix or ".djvu" suffix, and proceed.
 * 4) Review the automatic metadata, changing it if needed. It will be based on a Commons template.
 * Note that you can select different source files for the DjVu: if you choose to create the DjVu from either JP2 or PDF, then your request will be placed in a queue (displayed on the tool homepage) and will usually take about 15 minutes. If you select DjVu as the source, the upload will happen immediately (but not all IA items have this as an option).
 * Using the JP2 files as a source will not result in high quality images being uploaded to Wikimedia servers.
 * 5) Proceed, and after a short wait you will find the file uploaded to Commons and listed in your contributions.
 * 6) With certain books on archive.org, the generated DjVu file at Commons will contain a misaligned text layer owing to an unresolved technical issue. If this happens, please mark the generated DjVu for deletion at Commons and post a request for a clean scan of the work on the Scriptorium.

Two: Automatic transfer
Use the URL2Commons tool to automatically transfer the DjVu file from the Internet Archive to Wikimedia Commons.


 * 1) Refer to Help:URL2Commons for information on using the tool.
 * 2) Right click on the appropriate file in the Internet Archive file list and select "Copy Shortcut" or equivalent.
 * 3) Paste this into the top panel of the URL2Commons tool.
 * 4) Proceed as described in the URL2Commons help document.

Three: Manual download and upload
Download the file to your own computer, then upload it to Wikimedia Commons manually.


 * 1) To download, right click on the appropriate file in the Internet Archive file list and select "Save Target As.." or equivalent.
 * This may take some time, depending on the size of the file.
 * If you use download manager software of any kind, follow the instructions for that software.
 * 2) Once downloaded, go to Wikimedia Commons' Upload Wizard (guided upload process with helpful steps) or Upload page (quicker but requires more knowledge of Commons' policies and methods).

Others
There are other ways to upload files to Wikimedia Commons, such as the bulk uploader Commonist. These still require downloading the file(s) to your own computer before uploading to Commons.

Adding files
Files can be added to the Internet Archive by any registered user. The following information is presented for ease of use and reference for Wikisource users. However, Wikisource is not affiliated with the Internet Archive, and any or all of these stages may be changed by the Archive at any time. It is strongly recommended that anyone attempting this refer to the Internet Archive's own instructions and follow those in preference to the steps listed here.

These instructions are:


 * Internet Archive FAQ — Uploading Content

The following Internet Archive blog posts might be useful as well:


 * New Upload Format, *_images.zip, for Scribe-style Uploads
 * Uploading images for text items
 * Book Scan Wizard software now supports Internet Archive uploads!

Preparing the file
If uploading a collection of page scans:


 * 1) The page scans should each be in an image format.  For example, JPEG format.
 * 2) The page scans should be named in the correct alphabetical order.  It may be a good idea to use a naming format such as "MyScan001.jpg", "MyScan002.jpg" etc.  If so, remember to use leading zeroes, otherwise page 10 will come after page 1 but before page 2.
 * 3) Make sure that the page scans are the only files in the folder you are using.
 * 4) Create a zip file of the folder containing your page scans.  The file name should be in the format "Myscan_images.zip", where "Myscan" is whatever you want to call the file.  The "_images" suffix is important; your scan may not derive properly later if this is omitted.
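The naming and zipping steps above can be sketched as a short Python script. This is a minimal sketch under stated assumptions, not the Archive's own tooling: it assumes JPEG scans whose names end in a page number (such as "page10.jpg"), the helper names `page_number` and `zip_page_scans` are hypothetical, and the images are stored at the top level of the zip file (adjust this if the Archive's instructions ask for something different). The scans on disk are left untouched; only the names inside the zip are zero-padded.

```python
import re
import tempfile
import zipfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg"}  # assumption: the page scans are JPEGs


def page_number(path: Path) -> int:
    """Read the trailing number from a name such as 'page10.jpg'."""
    match = re.search(r"(\d+)$", path.stem)
    if match is None:
        raise ValueError(f"no page number in file name: {path.name}")
    return int(match.group(1))


def zip_page_scans(folder: Path, name: str) -> Path:
    """Bundle the scans in `folder` as <name>_images.zip with padded names."""
    numbered = sorted(
        (page_number(p), p)
        for p in folder.iterdir()
        if p.suffix.lower() in IMAGE_SUFFIXES
    )
    if not numbered:
        raise ValueError(f"no page scans found in {folder}")
    # Pad to at least three digits so that page 10 sorts after page 2.
    width = max(3, len(str(numbered[-1][0])))
    archive = folder.parent / f"{name}_images.zip"
    with zipfile.ZipFile(archive, "w") as bundle:
        for number, scan in numbered:
            padded = f"{name}{number:0{width}d}{scan.suffix.lower()}"
            bundle.write(scan, arcname=padded)
    return archive


# Demonstration with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    scans_folder = Path(tmp) / "scans"
    scans_folder.mkdir()
    for n in (1, 2, 10):
        (scans_folder / f"page{n}.jpg").write_bytes(b"placeholder")
    archive_name = zip_page_scans(scans_folder, "MyScan").name
    with zipfile.ZipFile(Path(tmp) / archive_name) as bundle:
        names = bundle.namelist()

print(archive_name)  # MyScan_images.zip
print(names)         # ['MyScan001.jpg', 'MyScan002.jpg', 'MyScan010.jpg']
```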

Files such as PDFs can just be uploaded as they are.

Uploading
Note: the following instructions are for the classic uploader, superseded by the 2013 upload and create item wizard. Most of the instructions below should be unnecessary and ignorable if you use the new, simpler uploader. A blog post How to upload a scanned book to the Internet Archive is available with many screenshots; the advice on identifiers and metadata is just the author's personal opinion and is optional, however.


 * 1) Log in to the Internet Archive.
 * 2) Click the "Upload" button at the top right of the screen.
 * 3) Select the file to upload
 * 4) Fill in the information requested and choose an appropriate licence (this will be similar to the licences on Wikisource).
 * Title (required)
 * Description (required)
 * Keywords (required)
 * Author
 * Creative Commons Licence or Public Domain Mark
 * 5) Wait for the upload to complete.
 * 6) Click the "Share my File(s)" button.
 * 7) You will see the message "Please wait while your page is created..." then "Your Page is Ready!" followed by a link to the page.
 * 8) Clicking the link will result in a "Your item is not yet public" message.
 * 9) Pick a collection for your file. The options will include "movie, audio, text, etree" and "community video, community audio, community text". You will probably be using "text" and "community text". Select the appropriate collection and click the "Submit" button to the right.
 * At this stage, you might be told to wait and come back later. This text is: "Your item is in the process of being derived, and you may not replace the metadata until the derive has finished (because any changes queued now would roll back those being made by the derive). Please try this page again after your item has finished deriving. [Item History]" In this case, simply follow those instructions: try again later.
 * 10) In the Metadata Editor, complete more information (including the information from earlier stages).
 * 11) Click the Submit button. This will enter the file into the log, which will take some time to complete.

Deriving
Derivation can take up to a few days. This can be monitored either through the filename or the 'Contributions' page. The various formats of the work should automatically be derived from the files that were uploaded. If this has not occurred, the "View the book" box in the left-hand sidebar will not show the various available formats (DjVu, EPUB, Kindle, Daisy etc). Derivation can fail for numerous reasons, many of which are internal to IA and have nothing to do with the uploaded file.

First, force the derivation from the file page:


 * 1) Click "Edit item"
 * 2) You will see two choices: "change the information" and "change the files".  Click "change the information".
 * 3) Click "Item Manager"
 * 4) Click "Derive"

In case this fails:


 * 1) Go to the 'Contributions' page.
 * 2) Click on 'See your contribution tasks that are not yet completed.'
 * 3) The screen will display a list similar to this image.
 * 4) If the derivation process is still running, then wait.
 * 5) If the process has stopped and is marked red and 'waiting for Admin', then email info@archive.org, advise them of the problem and request a restart of the derivation process. Be sure to include the link to the uploaded page.

Requested uploads
You can request a mass upload of public domain book scans from any external website to the Internet Archive by preparing:
 * 1) A list of URLs of the books to download
 * 2) A CSV table with title, creator, date, description, sponsor (digitising institution) etc.
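As an illustration of the second item, the CSV table could be produced with Python's standard `csv` module. This is only a sketch: the column names follow the list above, but the exact layout expected by whoever handles the request is an assumption, the function name `metadata_csv` is hypothetical, and the row shown is placeholder data.

```python
import csv
import io

# Columns taken from the list above; the exact set expected is an assumption.
FIELDS = ["title", "creator", "date", "description", "sponsor"]


def metadata_csv(rows: list[dict[str, str]]) -> str:
    """Return CSV text with a header row and one line per book."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)  # missing fields are written as empty cells
    return buffer.getvalue()


# Placeholder values for illustration only.
example_rows = [
    {
        "title": "An Example Book, Volume 1",
        "creator": "Example Author",
        "date": "1900",
        "description": "Placeholder description",
        "sponsor": "Example Library",
    }
]
csv_text = metadata_csv(example_rows)
print(csv_text)
```

Using the `csv` module rather than joining strings by hand means that titles containing commas or quotation marks are quoted correctly.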

Admins who are also Wikisourcerors
Some Internet Archive volunteers are given admin status on specific collections and can edit all items in those collections. No volunteers are known to have admin status on the general "Community texts" collection, but they can still help in the simplest cases, namely a derive.php red row waiting for admin or moving items into collections.

The following users are available for requests if you don't feel like disturbing the staff:
 * Nemo
 * Alex brollo (admin for the opallibriantichi collection, https://archive.org/collection/opallibriantichi)
 * Hydriz