UC Berkeley I-School professor Paul Duguid is quoted in an article from tomorrow’s NYT about libraries rejecting Google’s digitization program in favor of working with the Internet Archive. The article focuses on Google’s clauses against allowing other commercial search engines to index the scans, but doesn’t mention another aspect of the deals that is worse: the OCR output of Google-scanned books isn’t made available to the participating libraries or to the public. Thus researchers who need digitized corpora for developing information retrieval or natural language processing technology can’t make use of their own university libraries’ resources. This isn’t the case with books scanned by the Internet Archive, the OCR output of which is made available to everyone. Fortunately, UC Berkeley is one of the libraries working with the Internet Archive’s scanning program, and the OCR output of those scans is proving to be very useful for my own research. As Clifford Lynch has written, providing access to library resources must go beyond simply making them available to human readers, toward making them available to be computed upon. Kudos to the libraries that are realizing this and choosing to work with the Internet Archive.
November 8, 2007 update: Some people have made the point that many of the library contracts that are publicly available specify that the libraries should receive OCR output. (Some of the links on the Google Book Search Library Partners page lead to pages that link to contracts, but you have to dig a bit.) So the contracts do mention OCR, but as I suspected, they do not specify what the OCR output should consist of, because the libraries were thinking only of access to the digital files (i.e. people reading them), not computing on those files (i.e. machines processing them). Apparently (according to Peter Brantley) only UC had the foresight to think about that (and you can be sure that Google was thinking about it). So I stand by my assertion that the libraries that did not negotiate for the full OCR output made a mistake, and ceded a tremendous amount to Google.