65,000 Texts to Mine? Matthew L. Jockers

A story in the Feb. 7th issue of the Telegraph reports that the British Library is going to make 65,000 first-edition texts available for public download via Amazon’s Kindle. This news is almost as exciting as Google’s decision some years ago to partner with a consortium of big libraries in order to digitize all their books. What makes this project from the British Library particularly exciting is that the texts being offered are all works of 19th-century fiction.

Unlike the Google project, which is digitizing everything, this offering from the BL is already presorted to include just the kind of content that literary researchers can really use. With Google, I assume, one is going to have to figure out how to sort the legal books from the cookbooks, the memoirs from the fiction. Here, however, the BL has already done a big part of the work.

It will be interesting to see how this material gets offered and what sort of metadata is included with the individual files. For those of us who are interested in corpus-mining and macroanalysis (as opposed to just reading a single book at a time), the metadata is crucial. If, for example, we have the publication date of each text in an easily extractable format (e.g. TEI XML), we could pursue all kinds of chronological investigations.
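To make "easily extractable" concrete, here is a minimal sketch of pulling a publication year out of a TEI header with Python's standard library. Everything about the record is an assumption for illustration: the sample document is invented, and I am guessing that the date would sit in a `date` element (with a `when` attribute) somewhere under `sourceDesc`. Nobody yet knows what metadata the BL files will actually carry.

```python
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample record; the real BL metadata layout is unknown.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Hypothetical Novel</title></titleStmt>
      <publicationStmt><p>Illustrative record only</p></publicationStmt>
      <sourceDesc>
        <bibl><date when="1847">1847</date></bibl>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text><body><p>...</p></body></text>
</TEI>"""


def publication_year(tei_xml):
    """Return the year from the first date under sourceDesc, or None."""
    root = ET.fromstring(tei_xml)
    date = root.find(".//tei:sourceDesc//tei:date", TEI_NS)
    if date is None:
        return None
    # Prefer the machine-readable @when value; fall back to the element text.
    value = (date.get("when") or date.text or "").strip()
    year = value[:4]
    return int(year) if year.isdigit() else None


print(publication_year(SAMPLE_TEI))
```

With a year attached to every file, sorting the corpus by decade or period becomes a one-line grouping operation rather than a research project in itself.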

In prior research, working with a corpus of just 250 19th-century British novels, I explored the “theme” of childhood by quantifying the relative frequency of a “cluster” or “semantic field” of words suggestive of “childhood”. In that work, I discovered a proportionally higher incidence of the theme during the Victorian period, a finding that tends to confirm the idea that childhood was an “invention” of the Victorians. But, then again, a corpus of 250 novels doesn’t even scratch the surface.
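For readers unfamiliar with the measure, the relative frequency of a semantic field is simply the share of a text's tokens that belong to a chosen word list. The sketch below shows the arithmetic only; the word list and the sample sentence are placeholders I made up here, not the cluster used in the original study, and the tokenizer is deliberately crude (lowercase alphabetic runs, so contractions get split).

```python
import re

# Illustrative word list only; NOT the cluster used in the study described above.
CHILDHOOD_FIELD = {"child", "children", "childhood", "infant", "nursery", "boy", "girl"}


def relative_frequency(text, field):
    """Share of tokens in `text` that belong to the word set `field`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in field)
    return hits / len(tokens)


sample = "The child ran to the nursery, where the other children waited."
print(relative_frequency(sample, CHILDHOOD_FIELD))
```

Computed per novel and paired with publication dates, scores like this are what let one compare the incidence of a theme across periods.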

I’m not sure just what’s included in the British Library’s 65,000 texts. I assume these are not just British texts, but American, German, etc. Franco Moretti has estimated that there were 8,000 to 10,000 novels published in Great Britain in the 19th century (20,000 to 40,000 works of prose fiction). Surely a good many of these are part of the BL’s 65,000. Which brings us back to the metadata question. Will it be possible to generate a list of which texts in the 65,000 are British-authored and British-published *novels*? If the answer is yes, then the game is on.

Get the texts, convert from mobi to pdf, html, or other text format using any number of open-source apps, and then poof! You’ve got a COUS, a Corpus of Unusual Size! Of course, it’d be a lot easier if the BL would make the texts available (for researchers at least) through a channel that doesn’t involve Amazon or one of the eBook formats. I’m investigating that path now and will report on any progress.
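As a rough illustration of the last leg of the conversion step described above, here is one way to reduce an HTML file to plain text using only Python's standard library. It assumes the mobi-to-HTML conversion has already been done with some other tool; the `TextExtractor` class and `html_to_text` function are names I made up for this sketch, and a real pipeline would want more care around headings, footnotes, and front matter.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of script and style tags."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the text content of `html` with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


page = "<html><head><style>p{}</style></head><body><h1>Chapter 1</h1><p>It was a dark &amp; stormy night.</p></body></html>"
print(html_to_text(page))
```

Run over a directory of converted files, something like this would yield the plain-text corpus that frequency counts and other mining steps expect.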