Those who do corpus level computational text analysis are always hungry for more and more texts to analyze. Though we’ve become adept at locating texts from a wide range of sources (our own institutional repositories as well as a number of other places including Google Books, the Internet Archive, and Project Gutenberg), we still face a number of preprocessing tasks to bring those various files into some standard format. The texts found at these resources are not always in a format friendly to the tools we use for processing those texts. For example, I’ve developed lots of processing scripts that are designed to leverage the metadata that is frequently encoded into TEI-based xml. A text from Project Gutenberg, however, is not only just plain text, but it has a lot of boilerplate text at the beginning and end of each file that needs to be removed prior to text analysis.

I’m currently building a corpus of 19th century novels and discovered that many of the texts I would like to include have already been digitized by Project Gutenberg. This, of course, was great news. But, the system I have developed for ingesting texts into my corpus assumes that the texts will all be in TEI-XML with markup indicating such important things as “author,” “title”, and “date” of publication. I downloaded about 100 novels and was about to begin opening them up one by one and adding the metadata. . .eek! I quickly realized the mundanity of the task and thought, “hmm, I bet someone has written a nice regex script for doing this sort of thing.” A quick trolling of the web led me to the web page of Michiel Overtoom who had developed some python scripts for downloading and cleaning up (“beautifying” in his language) Dutch Gutenberg texts for his eBook Reader. Overtoom’s process is mainly designed to strip out the boilerplate and then rename the files with naming conventions that reflect the author and title of the books.

With Overtoom’s script as a base, I reengineered the code to convert a Gutenberg text into a minimally encoded and TEI-compliant XML file. The script builds a teiHeader that includes the author and title of the work (unfortunately, Project Gutenberg texts do not include publication dates, why?) and then adds “text”, “body”, div, and all the p tags. The final result is a document that meets basic TEI requirements. The script is copied below, but since the all important python spacing may be destroyed by this posting, it’s better to download it here and then change the file extension from .txt. to “.py”. Enjoy!