Help digitize 20 years of Dickens’ weekly magazines

As you’ve doubtless noticed by now, I love digitization projects that allow, nay, desperately need random nerds such as ourselves to root around in historical archives. This one is easily my favorite because it gives you the chance to read entire issues of the weekly magazine Charles Dickens owned and edited for 20 years.

The magazine saw the debut in serial form of some of Dickens’ most famous works like Great Expectations and A Tale of Two Cities, but that’s just the tip of the iceberg. The journals contain articles about society and politics, exposés of appalling conditions in factories and prisons, dispatches from the front of mid-19th century conflicts, and the literary stylings of luminaries like Wilkie Collins, Elizabeth Barrett Browning and Elizabeth Gaskell. Every Christmas Dickens would collaborate with some of them to create seasonal plays and stories.

The journal began in 1850 as Household Words. Dickens owned half of it and his agents Forster and Wills owned another quarter. The remaining shares belonged to his publishers, Bradbury and Evans. That 25% was enough to guarantee them interference, so when they and Dickens had a falling out in 1859 the author decided to start a new magazine over which he would have complete creative control.

He took Bradbury and Evans to court (Chancery Court, no less, the systemic maelström at the center of Bleak House) to win back the rights to the trade name “Household Words” but wasted no time getting the new venture off the ground. All the Year Round debuted on April 30, 1859. One month later Dickens won his case and folded Household Words into All the Year Round. He continued to edit the magazine until his death in 1870.

There are 1,101 editions of Dickens’ weeklies, which comes to 33,000 pages. Starting in 2006, a valiant team of three people at the University of Buckingham has been working on scanning and digitizing them all so that they can be readable and searchable on the Internet. It’s an immense project, however, because even a high-resolution document scan is still replete with OCR errors and extraneous data, and it takes one person a lot of time to copy-edit a billion words. They’ve only managed to go through about 15% of the archive thus far.

Enter the Online Text Correction (OTC) Project.

Although the image files were created using a state-of-the-art scanning device, the quality of the original journal pages varied and some contained paper folds, smudge marks, transparency, etc., and as a result the text files contain a number of errors that vary from file to file. This is the main dilemma that we are trying to correct. A secondary problem, relatively trivial, is that the text file contains unwanted information and styling, which can also be corrected at the same time as the actual mistakes.

We have decided to make a magazine, typically 24 pages long, the smallest unit of contribution and as a result we will have 1,101 units of work at the end of the day. So if we find around 1,000 volunteers to take on 1 or 2 magazines each, we will reach the target between us. We reckon that with a typical magazine, it will take about 10 minutes to review and correct each page (= 240 minutes or 4 hours’ work).
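
For anyone who wants to check the project’s sums, here is a minimal Python sketch of the estimate. The figures are the ones quoted above; the variable names are my own, and the total assumes every issue really is a typical 24 pages.

```python
# Figures quoted by the OTC Project; variable names are illustrative.
MAGAZINES = 1_101          # units of work (one magazine each)
PAGES_PER_MAGAZINE = 24    # "typically 24 pages long"
MINUTES_PER_PAGE = 10      # estimated time to review and correct a page

# Time for one volunteer to correct one typical magazine.
minutes_per_magazine = PAGES_PER_MAGAZINE * MINUTES_PER_PAGE
hours_per_magazine = minutes_per_magazine / 60

# Rough total if every magazine were a typical 24 pages.
total_hours = MAGAZINES * hours_per_magazine

print(minutes_per_magazine)  # 240
print(hours_per_magazine)    # 4.0
print(total_hours)           # 4404.0
```

On those assumptions the whole archive comes to roughly 4,400 hours of correction, which is a lot for a team of three but only about four hours apiece when each volunteer takes one magazine.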

I love this approach. It means that you get to read an entire issue cover to cover, fixing OCR and formatting errors. You get all the pleasure of curling up with a stack of old Dickens magazines while at the same time helping ensure they will be available at the click of a mouse in perpetuity. The goal is to get the entire archive online in time to launch the new Dickens Journals Online website by February 7, 2012, the bicentennial of the birth of Dickens.

To help them accomplish this laudable goal while getting the chance to immerse yourself in Victorian society, register on the OTC Project website. Once you’re logged in, you select an uncorrected issue from the Magazine Index and dig in. Scroll down on this page for details on how they want the text formatted and corrected.

7 Comments »

Comment by Mr. Murphy in VA
2011-08-07 09:38:26

It seems odd that they don’t capture the pages as digital camera (vs scanner) images, CORRECT the images and THEN print them to PDFs and afterwards OCR scan the images that way. All of the image corrections could be performed in an application such as Adobe Photoshop. This method would enable scanning of images that are free from many of the defects that make OCR so difficult. The technology to correct the original images is powerful, readily available and produces very surprising results.

Comment by livius drusus
2011-08-07 14:09:44

You mean like remove the smudges and page numbers and whatnot? I’m not sure it would be worthwhile given that they’re going to have to visually copy-edit the final OCR product no matter what, so might as well just correct it all at once.

 
 
Comment by Ral
2011-08-07 10:00:05

Are you kidding? 90900 words per page.

2 724 795 words per journal.

181,253 in the New Testament so every two pages of the journal = 1 x New Testament.

Get real.

Comment by livius drusus
2011-08-07 14:14:40

Good point. I’ll take the 3 billion figure out since I can’t confirm it and it seems like it must have been an error.

 
 
Comment by Mr. Murphy in VA
2011-08-07 16:47:16

Having the best possible image (increasing contrast, knocking out goobers, etc.) reduces OCR errors. Unwanted copy and other elements can be removed much faster from the image than from the gobbledygook in the OCR scan. This is a better workflow than having to extract this unwanted material by sorting through the text. I speak from experience rather than just speculating on the subject.

 
Comment by Ral
2011-08-08 03:01:07

Sorry about being rude. I saw this report in the Guardian. I would have thought that they would have checked it. Should be three MILLION? More logical BUT that raises the question as to the time. Look at the Gutenberg Project ~ http://www.gutenberg.org They have done something like 3 Billion. And if they can, why do the boffins not consult with them?

Comment by livius drusus
2011-08-09 23:49:11

Oh, no problem at all. They really should have fact-checked, as should I.

 
 