User:George Orwell III/whitepaper2

__NOINDEX__

You are what you upload
In my view, that should be rule number one around here. Of course, no newbie is expected to know or follow such nonsense; for those of you with any respectable amount of contribution time under your belt already, it should be part of your transcription arsenal by now. You can be the most consistent, top-quality transcriber that ever walked the Earth, but the chances of anyone coming to such a realization diminish, to one degree or another, along with the quality of the source files accompanying your work.

A. — What you need to know first

 * 1) Comprehension/ability to tell time & date.
 * 2) Understanding that rarely is anything dealing with technology ever static, and that we depend on that technology. The entire endeavor -- from Google scanning a physical book into a digital format to validating a transcribed source file derived from that book here on Wikisource -- would not be possible were it not for the technology behind it all. Without getting too technical: if technology improves over time, it's safe to say any results based on or in that same technology should also see improvement over time. The newer the version of any piece of hardware or software, the "better" it is likely to be compared to its predecessors (e.g. ABBYY 8.0 is not as good as ABBYY 9.0, which is not as good as ABBYY 10.0, which is not as good as ABBYY 10.1, and so on...)
 * 3) Familiarity with the basics of image and text file types.-- Understanding nuances like: scanning a paper page of text almost always produces an image file - a facsimile of the paper page. In some cases text content might also be embedded within that same file; the presence of such text does not change the fact that the file "type" is an image. Multiple image files, one per scanned paper page, can be compiled into a single document file that mimics the pagination of the original source (a .PDF file for a physical book, with the individual images occupying a contiguous order of positions -- not pages -- in that document file). The virtual position number rarely ever lines up with the printed page number. Etc.
 * 4) Familiarity with various pages associated with a single work hosted on Internet Archive
 * 5) The URL for the "main" page of any work follows this format:  where, for our example, the IDENTIFIER is  , making our target URL.
 * 6) The link to the URL listing all the files related to or needed by that "main" page (otherwise known as an Index) can be found in the left-hand sidebar titled View the book. It's labeled near the bottom next to   as  . While the URL to the Index incorporates the IDENTIFIER introduced in A4-1, there is no consistently easy way to ascertain the exact address other than the HTTPS link found on the "main" page.
 * 7) The link to the entire "history" of both the "main" page of a hosted work and the "Index" of files supporting it follows this format:  where, again, for our example, the IDENTIFIER is   making our target URL.
 * 8) ... I'm sure I'll remember something -- placeholder
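The URL patterns sketched in A5 through A7 can be expressed as simple string templates. A minimal sketch, assuming Internet Archive's current URL layout (a details page, a download/file listing, and an item history page, each keyed by the same identifier) and borrowing the womanswhoswhoofa00leon identifier that appears later in this essay; as noted in A6, the exact Index address is best confirmed from the "main" page itself:

```python
# Sketch of the Internet Archive URL patterns discussed in section A.
# Assumption: these three path prefixes reflect IA's current layout;
# when in doubt, follow the links from the item's "main" page instead.

def ia_urls(identifier: str) -> dict:
    """Build the three URLs for a given IA item identifier."""
    base = "https://archive.org"
    return {
        "main": f"{base}/details/{identifier}",     # A5: the "main" page
        "index": f"{base}/download/{identifier}",   # A6: file listing (Index)
        "history": f"{base}/history/{identifier}",  # A7: item history
    }

for label, url in ia_urls("womanswhoswhoofa00leon").items():
    print(f"{label}: {url}")
```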

B. — What you need to check for

 * 1) How old is the candidate file?.-- If you accept the premise laid out in A4-2, the age of the candidate file is worth taking into consideration in your decision-making process. And by age we mean two things:
 * 2) When the work was first scanned into a digital format, and by whom (e.g. Google).
 * 3) When that work was put through the derive process on Internet Archive.
 * Ascertaining the "original" date of creation in 1-1 is not always easy, nor always worth the effort. What is easy is determining when the current work on IA was first processed. Why? Because the older that date is compared to today, the less likely the file is the most optimal one possible, given the ever-improving-technology premise. How? By analyzing the Index of the candidate file on Internet Archive.
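The age check described above can be sketched programmatically. A minimal sketch, assuming IA's metadata endpoint (https://archive.org/metadata/IDENTIFIER) returns JSON whose "files" entries carry Unix "mtime" timestamps; to keep the example offline, a small hypothetical sample response stands in for a live fetch:

```python
import json
from datetime import datetime, timezone

# Sketch: estimate when an IA item was first processed by finding its
# oldest file. Assumption: the metadata API reports per-file Unix "mtime"
# strings; the sample below is hypothetical, standing in for a live response.
sample_metadata = json.loads("""
{
  "files": [
    {"name": "womanswhoswhoofa00leon.pdf",      "mtime": "1288600000"},
    {"name": "womanswhoswhoofa00leon_jp2.tar",  "mtime": "1288500000"},
    {"name": "womanswhoswhoofa00leon_djvu.txt", "mtime": "1288600100"}
  ]
}
""")

def oldest_file(metadata: dict) -> tuple[str, datetime]:
    """Return the name and UTC timestamp of the oldest file in the item."""
    entry = min(metadata["files"], key=lambda f: int(f["mtime"]))
    when = datetime.fromtimestamp(int(entry["mtime"]), tz=timezone.utc)
    return entry["name"], when

name, when = oldest_file(sample_metadata)
print(f"Oldest file: {name}, created {when:%Y-%m-%d}")
```

The older that earliest timestamp, the more likely a fresh re-derive (or a better scan) exists or could be requested.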

 What exactly is in the Index?.-- Once a file is uploaded to IA and the basic metadata (Title, Author, Language, etc.) has been inserted, the "derive" process begins. Depending on the type and make-up of the uploaded file, a consistent pre-set batch of file-manipulation programs is executed against it until a final set of resulting files has been created -- this processing is more commonly known as the "derive process". The files listed in the Index are the products of the upload and derive stages of processing. Knowing which type of file results from which stage, in addition to inspecting the timestamps of each, helps us determine the age aspect.

Using the previous identifier example again for illustrative purposes, a typical Index looks something like this: Index of /10/items/womanswhoswhoofa00leon/ Without diving too deep into the details just yet, one thing is obvious -- the source file uploaded to Commons is approximately 4 years, 3 months old already. How old was the source file uploaded to Internet Archive that produced the file ultimately uploaded to Commons? We can't say with any certainty, but for argument's sake, let's say this book was scanned and the resulting file or files were uploaded for processing by IA on the same day. So what difference does 4.3 years make? You tell me. Our example file's derive-history highlights are on the left and a 2-day-old file's derive-history highlights are on the right -- note the (v#####) version number for each. Not only have the modules and software been updated in those 4-some-odd years, but we are presented with an unusual opportunity in this case.

Look closer at the Index of files for our example... notice the .PDF file (line 4) is not the "oldest" file of the bunch? This means it was not the original source file uploaded to IA for processing but one of the resulting products of that processing. The .tar archive (line 17) -- presumably one .jp2 file for every page scanned -- is the source file.
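The "which file was uploaded versus which was derived" reasoning above can be sketched as a small filter. A minimal sketch, assuming IA's per-file metadata tags each file with a source field of "original" or "derivative"; the listing below is hypothetical, standing in for a real Index:

```python
# Sketch: separate original uploads from derive-process products.
# Assumption: IA's file metadata tags each file with source="original"
# or source="derivative"; this listing is hypothetical.
files = [
    {"name": "womanswhoswhoofa00leon.pdf",      "source": "derivative"},
    {"name": "womanswhoswhoofa00leon_jp2.tar",  "source": "original"},
    {"name": "womanswhoswhoofa00leon_djvu.txt", "source": "derivative"},
    {"name": "womanswhoswhoofa00leon_meta.xml", "source": "original"},
]

originals = [f["name"] for f in files if f["source"] == "original"]
derivatives = [f["name"] for f in files if f["source"] == "derivative"]

print("Uploaded to IA:", originals)      # candidates for the true source file
print("Derive products:", derivatives)   # e.g. the PDF later sent to Commons
```

Here the .jp2 archive shows up among the originals while the .PDF is a derive product, mirroring the example above: the "pretty" PDF is not the file the scan actually started from.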
Now, a bad scan is a bad scan, and no amount of re-jiggering will dramatically improve the quality of the files derived from it. The flip side: a good scan is a good scan, and re-running the latest derive modules and updated software against it will likely improve results one way or another - maybe even all around (e.g. better thumbnails and a superior text-layer). So what should you take away from all this... INVEST SOME TIME and RESEARCH into what you select for upload & hosting by us and stop letting the 'eye candy of the moment' guide your decision making for you!!! -- George Orwell III (talk) 04:08, 3 January 2015 (UTC)