What lurks beneath the ECCO page

This week’s DH adventure has involved peeking under page images in Eighteenth Century Collections Online (ECCO), a proprietary database from Gale Cengage containing books and journals digitized from microfilm copies. Not only does this database make rare works widely available, but Gale Cengage has also run the page images through a customized optical character recognition (OCR) program to make them machine-readable. Scholars can now run word and phrase searches, which has opened up a new way of doing literary research.

It is magical that text recognition of an eighteenth-century document can even be done, but it is not perfect. Even a 90% accuracy rate means one letter in ten is wrong. So Gale Cengage has partnered with 18thConnect to crowd-source the correction of many ECCO texts using a tool called TypeWright. As there are no Austen texts available (it’s all pre-1800), I’ve been working with booklets published in the 1780s by the Toxophilite Society, a London archery club.
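To get a feel for what "one letter in ten" does to whole words, here is a small back-of-the-envelope sketch in Python. It assumes every letter is misread independently with the same probability, which is a simplification of mine (real OCR errors cluster around blots and odd type), and the function name is just for illustration.

```python
def word_accuracy(char_accuracy: float, word_length: int) -> float:
    """Chance that every letter of a word is read correctly,
    ASSUMING each letter is an independent trial (a simplification)."""
    return char_accuracy ** word_length


# At 90% per-letter accuracy, longer words are right less and less often.
for length in (3, 5, 8):
    print(length, round(word_accuracy(0.90, length), 2))
```

Under that assumption, a five-letter word comes through fully intact only about 59% of the time, which is why a seemingly high per-letter accuracy rate still leaves so much to correct.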

Here’s how it works. TypeWright feeds me a line of printed text and shows me how the OCR program interpreted it:

[Image: TypeWright sample 1]

I then mark the line as correct (green check) or make changes to it. A little pencil appears next to lines I edit.

[Image: TypeWright sample 2]

I work my way through the booklet, line by line and page by page, until I reach the end. Then I proofread my corrections and make additional edits. Then I proofread again. (I always catch something I missed.) The OCR does a remarkably good job, but I still have to make a lot of corrections.

As you can see, line 14 was fine, line 15 a mess, and line 16 pretty good. That’s fairly representative. I was anticipating trouble with the long-s (see “ECCO OCR Troubleshooting” by Sayre Greenfield) but also encountered some surprises. The lower-case “e” was sometimes read as “c” (see “the” and “Treasurer”), and ligatures (single pieces of type containing two letters, as in the “st” in “first”) were difficult for the machine to figure out. So were words or phrases set in upper-case type with loose tracking.
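For anyone curious what these misreadings look like once they are lined up against a correction, here is a rough sketch using Python's standard `difflib` module. The sample line is one I made up to mimic the e-as-c and long-s-as-f errors described above; it is not copied from ECCO, and `list_corrections` is a hypothetical helper, not part of TypeWright.

```python
from difflib import SequenceMatcher


def list_corrections(ocr_line, corrected_line):
    """Return (ocr_fragment, corrected_fragment) pairs where the lines differ."""
    matcher = SequenceMatcher(None, ocr_line, corrected_line, autojunk=False)
    return [
        (ocr_line[i1:i2], corrected_line[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only the stretches a human had to change
    ]


# Invented example: "e" read as "c", long-s read as "f".
ocr = "thc Trcafurer of the Society"
corrected = "the Treasurer of the Society"

for wrong, right in list_corrections(ocr, corrected):
    print(repr(wrong), "->", repr(right))
```

Tallying pairs like these over a whole booklet would give a quick picture of which letters the OCR program confuses most often.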

There were also difficulties with bad printing, faded ink, random blots, and broken type. The human eye can see past these things easily; a machine, it seems, cannot. It tried to OCR the blots; sometimes, it missed entire lines or left out the start or end of a line. But a patient and meticulous human can fix all that.

Text clean-up with TypeWright would be a good project for classes on digital humanities, research methods, or eighteenth-century studies, at either the graduate or undergraduate level. Among other things, students would learn just how imprecise their word searches in full-text databases can be: the machine-readable text underneath a page image is not accurate enough to ensure thoroughness. One is really searching the underlying OCRed text, which, as TypeWright shows us, is pretty good but by no means perfect.
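A tiny Python sketch makes the search problem concrete. The OCR line is invented, the confusion table covers only the two misreadings mentioned in this post (long-s as “f”, “e” as “c”), and `ocr_tolerant_pattern` is a hypothetical helper of my own; none of this reflects how ECCO's search actually works.

```python
import re

# Assumed confusions, limited to the misreadings described in this post.
CONFUSIONS = {"s": "sf", "e": "ec"}


def ocr_tolerant_pattern(word):
    """Build a regex matching the word or its likely OCR misreadings."""
    parts = []
    for ch in word:
        options = CONFUSIONS.get(ch.lower())
        # Allow either the true letter or its usual misreading.
        parts.append(f"[{options}]" if options else re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


ocr_text = "thc Trcafurer of the Society"  # invented OCR output

# An exact search misses the word entirely...
print("Treasurer" in ocr_text)
# ...while a pattern that allows for the known misreadings finds it.
print(bool(ocr_tolerant_pattern("Treasurer").search(ocr_text)))
```

A pattern this loose will also match some strings that are not misreadings at all, so it trades precision for recall; it is a workaround for searching uncorrected text, not a substitute for correcting it.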


First things first

This blog will document my adventures, successes, and failures in the digital humanities as I learn how to do things with the novels of Jane Austen.
