What lurks beneath the ECCO page

This week’s DH adventure has involved peeking under page images in Eighteenth-Century Collections Online (ECCO), a proprietary database from Gale Cengage containing books and journals digitized from microfilm copies. This database makes rare works widely available, and Gale Cengage has also run the page images through a customized optical character recognition (OCR) program to make them machine-readable. Scholars can now run word and phrase searches, which has opened up a new way of doing literary research.

It is magical that text recognition of an eighteenth-century document can even be done, but it is not perfect. Even a 90% accuracy rate means one letter in ten is wrong. So Gale Cengage has partnered with 18thConnect to do crowd-sourced correction of many ECCO texts using a tool called TypeWright. As there are no Austen texts available (it’s all pre-1800), I’ve been working with booklets published in the 1780s by the Toxophilite Society, a London archery club.
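To see what the 90% figure above does to whole words, here is a minimal Python sketch. It assumes, as a simplification of my own, that each letter is misread independently of its neighbours; real OCR errors tend to cluster around blots and bad type, so treat the numbers as a rough illustration rather than a measurement of ECCO. The function name `word_accuracy` is made up for this example.

```python
def word_accuracy(char_accuracy: float, word_length: int) -> float:
    """Chance that every letter of a word is read correctly.

    Assumption (not from ECCO): each letter is right or wrong
    independently of the others.
    """
    return char_accuracy ** word_length


# At 90% per-letter accuracy, longer words are much less likely to survive intact.
for length in (3, 5, 8):
    print(length, round(word_accuracy(0.90, length), 2))
```

Under that assumption, a five-letter word comes through entirely correct only about 59% of the time, which is why a "pretty good" letter rate can still leave a lot of words to fix.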

Here’s how it works. TypeWright feeds me a line of printed text and shows me how the OCR program interpreted it:

TypeWright sample 1
I then mark the line as correct (green check) or make changes to it. A little pencil appears next to lines I edit.

TypeWright sample 2
I work my way through the booklet, line by line and page by page, until I reach the end. Then I proofread my corrections and make additional edits. Then I proofread again. (I always catch something I missed.) The OCR does a remarkably good job, but I still have to make a lot of corrections.

As you see, line 14 was fine, line 15 a mess, and line 16 pretty good. That’s fairly representative. I was anticipating trouble with the long-s (see ECCO OCR Troubleshooting by Sayre Greenfield) but also encountered some surprises. The lower-case “e” was sometimes read as “c” (see “the” and “Treasurer”) and ligatures (single pieces of type containing two letters as in the “st” in first) were difficult for the machine to figure out. So were words or phrases set in upper-case type with loose tracking.

There were also difficulties with bad printing, faded ink, random blots, and broken type. The human eye can see past these things easily; a machine, it seems, cannot. It tried to OCR the blots; sometimes, it missed entire lines or left out the start or end of a line. But a patient and meticulous human can fix all that.

Text clean-up with TypeWright would be a good project for classes on digital humanities, research methods, or eighteenth-century studies, at either the graduate or undergraduate level. Among other things, students would learn just how imprecise their word searches in full-text databases can be: the machine-readable text underneath a page image is not accurate enough to ensure thoroughness. One is really searching the underlying OCRed text, and, as TypeWright shows us, that text is pretty good, but by no means perfect.
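For a class exercise, the gap between the page image and the searchable text can be shown in a few lines of Python. This is only a sketch under stated assumptions: the confusion table is hypothetical, the “e” read as “c” entry comes from the errors described in this post, the long-s read as “f” entry is my own assumption about a common misreading, and the sample OCR line is invented for illustration. It has nothing to do with how ECCO or TypeWright actually search.

```python
import re

# Hypothetical confusion table: each letter maps to the characters
# the OCR might have produced in its place.
CONFUSIONS = {
    "e": "ec",   # from the post: "e" sometimes read as "c"
    "s": "sfſ",  # assumption: long-s read as "f", or kept as "ſ"
}


def tolerant_pattern(word: str) -> re.Pattern:
    """Build a regex that matches `word` or its likely OCR misreadings."""
    parts = []
    for ch in word:
        options = CONFUSIONS.get(ch.lower())
        if options:
            parts.append("[" + options + "]")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


# Invented sample of uncorrected OCR output.
ocr_line = "thc Trcafurer of the Society"

# A plain search misses the word; the tolerant pattern finds it.
print("Treasurer" in ocr_line)
print(bool(tolerant_pattern("Treasurer").search(ocr_line)))
```

The plain search fails on this line and the tolerant one succeeds. The trade-off is that a looser pattern will also match things that were never the target word, so it is a way to find candidates for a human to check, not a substitute for corrected text.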


About L Troost

academic, Austenite, student of the digital humanities

8 Responses to What lurks beneath the ECCO page

  1. Tara F. says:

    I like many things about this blog, among them the picture that shows that someone else organizes book by publisher as well as author. So pretty that way. Will be looking forward to learning about the digital humanities!

  2. Pingback: Documenting successes and failures | NixoNARA

  3. JL says:

    What benefit does ECCO offer to those editors who aren’t fortunate enough to belong to an institution that pays for access to their quite expensive data base? Will they provide free access once a text is fully edited (at no cost to them)? Or will they simply have a marketing tool – “now we have more accurate searching!” But this may not be an issue – given how messy the text I’ve looked at so far is, I don’t see how any scholar will have time to do much with TypeWright, except perhaps for texts they are planning to work with intensively and wish to have available in searchable form for their own work. — But I agree that exposure to the gap between the image and the underlying text is an excellent teaching tool.

  4. ltroost says:

    There is such a plan. Laura Mandell at Texas A&M (and the moving force behind 18thConnect) has arranged a preconvention workshop at ASECS 2014, “Liberate the Text!! (while Creating a Publishable, Digital Textual Edition),” as a start. I think we’ll be hearing more about this in months to come. (I do not have access to ECCO, either.)

  5. All I can say is that if ever there were “a patient and meticulous human” capable of fixing all that and more, it is thou! Thank you for bringing so much of the past to life for so many.

  6. Just read about Laura Mandell’s eMop project–which involves Typewright, among other things, in a really ingenious workflow–mopping up the mess of Early Modern texts? 🙂 See http://emop.tamu.edu/about

