Libraries Look a Gift Horse in the Mouth
UC Berkeley I-School professor Paul Duguid is quoted in an article from tomorrow’s NYT about libraries rejecting Google’s digitization program in favor of working with the Internet Archive. The article focuses on Google’s clauses against allowing other commercial search engines to index the scans, but doesn’t mention another aspect of the deals that is worse: the OCR output of Google-scanned books isn’t made available to the participating libraries or to the public. Thus researchers who need digitized corpuses for developing information retrieval or natural language processing technology can’t make use of their own university libraries’ resources. This isn’t the case with books scanned by the Internet Archive, the OCR output of which is made available to everyone. Fortunately, UC Berkeley is one of the libraries working with the Internet Archive’s scanning program, and the OCR output of those scans is proving to be very useful for my own research. As Clifford Lynch has written, providing access to library resources must go beyond simply making them available to human readers, toward making them available to be computed upon. Kudos to the libraries that are realizing this and choosing to work with the Internet Archive.
November 8, 2007 update: Some people have made the point that many of the library contracts that are publicly available specify that the libraries should receive OCR output. (Some of the links on the Google Book Search Library Partners page lead to pages that link to contracts, but you have to dig a bit.) So the contracts do mention OCR, but as I suspected they do not specify what the OCR output should consist of, because the libraries were thinking only of access to the digital files (i.e. people reading them), not computing on those files (i.e. machines processing them). Apparently (according to Peter Brantley) only UC had the foresight to think about that (and you can be sure that Google was thinking about it). So I stand by my assertion that the libraries that did not negotiate for the full OCR output made a mistake, and ceded a tremendous amount to Google.
October 22nd, 2007 at 1:21 pm
I would make an additional distinction that I wish the libraries understood before they signed these deals: There is text, and there is text with some format hinting. Knowing that one line is larger than those following is a hint that it is a section heading. Knowing how much bigger it is, where it occurred on a page, etc. adds additional information that allows you to recreate the structure of the work and not just its text. This in turn allows you to understand the knowledge model used in the book. And this finally lets you mine textbooks and other works of nonfiction for ontology. This is hugely valuable, and generally overlooked in the whole exercise.
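To make the font-size hint concrete, here is a minimal Python sketch. It assumes each OCR line arrives as a (text, font size) pair; that record format, the function name find_headings, the sample page, and the 1.2 threshold are all hypothetical choices for illustration, not anything Google or the Internet Archive is known to deliver.

```python
from statistics import median

def find_headings(lines, ratio=1.2):
    """Return the text of lines whose font size is at least `ratio` times
    the median font size on the page (a rough stand-in for body text size)."""
    body_size = median(size for _, size in lines)
    return [text for text, size in lines if size >= ratio * body_size]

# Hypothetical OCR output for one page: (text, font size in points).
page = [
    ("Chapter 3: Acids and Bases", 18.0),
    ("An acid is a substance that donates protons.", 11.0),
    ("A base is a substance that accepts them.", 11.0),
    ("3.1 Strength of Acids", 14.0),
    ("Strong acids dissociate completely in water.", 11.0),
]

headings = find_headings(page)
print(headings)
```

With plain text alone, all five lines look the same; with the size hint, the two headings fall out and give a first cut at the book's outline.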
It can be argued that these old books are out of date, but wouldn’t it be cool to compare the domain models for something like chemistry or EE across time? If you want to find the history of a concept, you need to be able to search for that concept in old texts *using the then-contemporary knowledge models* for that concept.
I actually argued in our IP law class that this raised additional IP issues for Google Print (that of derivative works) not covered by the “we only expose snippets” excuse they hide behind. It was an interesting exchange with the Google Print IP counsel.
Your friendly neighborhood ontogeek -
November 5th, 2007 at 11:54 am
Thanks for this. Hey, Patrick, I made a special post on my Google blog about your comment here. Please go to my blog and elaborate on this idea. It’s very interesting.
Siva
November 5th, 2007 at 3:55 pm
I can’t speak for the other partner libraries, but Google does provide the OCR output to the University of Michigan.
You can see the files, in all their awkward glory, by taking a look at any of the public domain books that Michigan is making available through their MBooks interface. At the bottom of each page, you can choose to view the page as an image, as text (the OCR output), or as a PDF.
For an example, see The Acquisitive Society, by R. H. Tawney. The MBooks FAQ explains how to find more.
November 5th, 2007 at 4:29 pm
That’s great; I had missed that option last time I looked at MBooks. Paul should make this clearer, as it’s an important distinction that talking generally about “digital files” obscures. I suspect, however, that this still doesn’t address Patrick’s concern above: from the MBooks interface it appears that only the text is available, without structural information. Note how a “search within the book” on MBooks does not show results as highlighted boxes on the image (which would indicate that Michigan has structural OCR data), as it does on Google Books or the Open Library. This is confirmed in the FAQ. Do you know if this is in fact because Michigan does not have this data, or have they simply not provided an interface to it?
November 8th, 2007 at 3:10 pm
We do get the OCR data along with the page images at UVA as well. Just because something isn’t delivered doesn’t mean that we don’t have the data.
It’s not true that the OCR text is never available to the public in GBS. For works in the public domain, GBS offers a “View Plain Text” option where you can read the OCR instead of browsing the page images. Here’s a link to “The Poetical Works of Sir Walter Scott” with such a link: http://books.google.com/books?id=iza0kGRRvEEC&printsec=frontcover
November 8th, 2007 at 4:17 pm
Leslie, I’ve heard other Google partners say that they are not getting the OCR data, or that there are restrictions on how they can use it. There is a lot of misinformation floating around due to the secretive nature of the deals and Google in general.
As for “View Plain Text” options at GBS, that is nice but not really what I’m talking about, which is OCR output in machine-readable formats for computing upon text corpuses, not just reading them.
January 18th, 2008 at 6:11 pm
Well, here in Europe a big discussion about this is starting now too. Have opinions in the USA already changed?
If so, I would be happy to know.
Many thanks,
GL
The Netherlands
January 18th, 2008 at 8:28 pm
I would say that there isn’t a clear consensus here in the US. There are strong feelings on both sides. Everyone agrees that cooperating with Google is the fastest way to get collections digitized, but many people have concerns about: