Wikisource:WikiProject DNB/Raw materials

Dealing with scanned text from the DNB is the nitty-gritty of the project. To put it another way, the work of the project would be made much simpler if a full set of DNB scans were easily available in a form that required only light proof-reading.

There are four serious sources of scanned text right now, three of those being posted at archive.org. Those posts at archive.org have been the backbone of the project, whatever their many deficiencies. In some cases text can just be typed in, but that can only play a minor part in getting the job finished. The fourth source is the ODNB site, and that has its own drawbacks, to be mentioned further down the page.

Scans at archive.org
The project operates in patchwork fashion, taking pieces of text from various scans: but there is a priority order of where to look that will save large amounts of time.

The "Toronto" scans are the best of the bunch at archive.org, and are available for 43 of the 63 volumes (with two of those not being complete). If you intend to work through a single volume, as some participants do, it would make a lot of sense to choose one such volume.

The "Hyderabad" scans for volumes 17, 51, 53 and 62 are a lot better than nothing, but not on average as good as the "Toronto" scans. For the remaining 16 cases there are only the "Google" scans, which can be bad as in "you wouldn't believe". They are free and that is about all that can be said in their favour in some instances. One of our long-term issues will be to get round their deficiencies. In other words, about 25% of the volumes (16 of 63) are only covered by technically poor scans.

Tabulated data is available at WikiProject DNB/Progress. If you want a fuller list of scans try w:Wikipedia:WikiProject Missing encyclopedic articles/DNB scans which has things other than the 63 volumes.

Other formats
Besides the "full text" link on a typical archive.org page, the "read online" and "PDF" links are also of interest for proofing. The "read online" format is much quicker than downloading the whole PDF, if you know the page number; it is rather more rigid as a view in some ways, though. When the djvu provided here is too poor to be useful for proofing, it may be that other choices at archive.org offer a better view (the initial choices of uploads were not the best ones in numerous cases).

Scans elsewhere
There are probably hundreds of DNB biographies posted on the Web somewhere (well proof-read or not, copyedited or not, from the first edition or not, who knows). It could be of interest to collate some information about what else is out there.

The other source of major interest is that on the ODNB site. This is a source of text proof-read to high quality; but unfortunately not of the edition we are working on. Anyone thinking this is a soft option needs to read what follows very closely: in practical terms using this text for this project requires alertness to the possible differences from the first edition. Those differences come in two kinds: altered format, and updates.

Altered format
The ODNB texts show the following systematic changes to what we know from working here on the first edition:


 * They have "(d 1066)" where we would have "(d. 1066)"
 * They have "[q.v.]" where we usually have the spaced "[q. v.]", and punctuate differently, with "[q.v.] ," where we have "[q. v.],", also "[q.v.]." where we have "[q. v.]" at the end of sentences.
 * They have the references section headed "Sources", and not enclosed in [ and ] and in small type.
 * At the end of the article the author's initials are not given necessarily with the same abbreviation as here (e.g. S.L. for S.L.L.).
 * They romanise any text in Greek, which the first edition didn't.
 * Certain titles of nobility have small caps applied in a different way: e.g. "[see, second Earl of Bristol]" instead of "[see , second ]".
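
The most mechanical of these changes can be reversed with find and replace. What follows is only a minimal sketch in Python, assuming the pasted ODNB text uses exactly the forms quoted in the list; the function name odnb_to_first_edition is made up for illustration. It deliberately leaves alone the sentence-final "[q.v.].", the "Sources" heading, author initials, Greek and small caps, all of which need a human eye.

```python
import re


def odnb_to_first_edition(text: str) -> str:
    """Undo a few systematic ODNB format changes (illustrative only)."""
    # "(d 1066)" -> "(d. 1066)": restore the full stop after "d".
    text = re.sub(r"\(d (\d{3,4})\)", r"(d. \1)", text)
    # "[q.v.] ," -> "[q.v.],": close up the stray space before the comma.
    text = text.replace("[q.v.] ,", "[q.v.],")
    # "[q.v.]" -> "[q. v.]": restore the spaced form usual here.
    text = text.replace("[q.v.]", "[q. v.]")
    return text


sample = "Harold (d 1066), son of Godwin [q.v.] , was king."
print(odnb_to_first_edition(sample))
```

Since the spaced "[q. v.]" is only the usual form here, not the invariable one, the result still has to be checked against the page scan.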

Something unsystematic but worth noting is that dehyphenation, i.e. removing the line breaks of the original, is not necessarily carried out the way it would be here. In other words, is it "roadrunner" or "road-runner" where the original has a line break after "road-"? Here the default is to leave in a hyphen when in doubt.

Updating
The text at the ODNB has been updated in various ways, some of which are more obvious than others. The 1904 Errata have apparently been incorporated, but that is not all. Watch out for these:


 * rethought introductory or closing passages;
 * additions of a few lines of solid data (genealogical, political positions are typical);
 * updated references in the section at the end, typically books published between the first edition and the later edition; these may replace older references as well as supplementing them.

The inline references may also have been altered, either to bring in a new source, or to change one abbreviated form of title to another considered clearer.

When it comes down to it, the differences can be just about anywhere, and one has to work at spotting them.
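
One way to make the spotting less of a strain is to let a program point at the places where two versions disagree. The sketch below is an assumption about how that might be done, not a project tool: it uses Python's standard difflib module to compare a first-edition scan with ODNB text word by word, and the function name word_differences is invented for the purpose.

```python
import difflib


def word_differences(first_edition: str, odnb: str):
    """List the word-level differences between two versions of a passage."""
    a = first_edition.split()
    b = odnb.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Record the kind of change and the words on each side.
            changes.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


scan = "He died in 1701 at Bath."
odnb = "He died on 3 May 1701 at Bath."
for change in word_differences(scan, odnb):
    print(change)
```

With scanned text on one side, many of the reported differences will be scanning errors and not ODNB updates, so the list is a prompt for where to look and nothing more.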

Way of working
For the less light-hearted jobs of proofing, copying text into a text editor may be worthwhile, to take advantage of spell-checking or find and replace functions. One point that comes up is the nuisance of hard line-breaks in scanned text. They aren't visible in the editing window, but backspace can "detect" them all right. They interfere with the format markup (italic text will finish at a hard linebreak) and are particular nuisances in the small text references sections. A text editor can remove them all (and probably the paragraph structure too); this may be well worth while. Having done that, you can remove "- ", i.e. hyphens from linebreaks of the original that are now followed by spaces; and another worthwhile replace is " ;" by ";", because the scans very often insert an unwanted space before a semicolon. These three passes take only seconds and can make it less wearisome to proof-read a page.
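
For anyone who would sooner script the three passes than click through them in an editor, here is a minimal sketch in Python. It assumes the scanned text has been pasted into a string; the function name clean_scan_text is hypothetical.

```python
def clean_scan_text(text: str) -> str:
    """Apply the three clean-up passes to raw scanned text."""
    # Pass 1: turn every hard line-break into a space.
    # This flattens the paragraph structure as well.
    text = text.replace("\r\n", "\n").replace("\n", " ")
    # Pass 2: remove "- ", i.e. an end-of-line hyphen now followed by a space.
    text = text.replace("- ", "")
    # Pass 3: drop the unwanted space before a semicolon.
    text = text.replace(" ;", ";")
    return text


raw = "He was edu-\ncated at Oxford ;\nand died in 1701."
print(clean_scan_text(raw))
```

The second pass is blunt: it also closes up words that ought to keep their hyphen, and it removes the hyphen from any spaced " - ", so the output still needs reading against the page.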

To help identify hard line-breaks, it can be useful to switch between vertical and horizontal presentations.

The interleaved text bogeyman
Where the scanner has gone straight across the page you get lines in the pattern ABABAB... from column A and column B, usually with some corruption from the central line or at the far ends. This "interleaved text" can be sorted out, if it is the only scan you have to work with, but this is time-consuming. It seems best to report bigger patches of interleaved text as you find it, in the hope of getting replacement text from somewhere. Please use the Talk page of this page for the moment.