Those who do corpus-level computational text analysis are always hungry for more texts to analyze. Though we’ve become adept at locating texts from a wide range of sources (our own institutional repositories as well as Google Books, the Internet Archive, and Project Gutenberg), we still face a number of preprocessing tasks to bring those various files into some standard format. The texts found at these sources are not always in a format friendly to the tools we use to process them. For example, I’ve developed lots of processing scripts that are designed to leverage the metadata that is frequently encoded in TEI-based XML. A text from Project Gutenberg, however, is not only plain text; it also has a lot of boilerplate at the beginning and end of each file that needs to be removed prior to text analysis.
I’m currently building a corpus of 19th-century novels and discovered that many of the texts I would like to include have already been digitized by Project Gutenberg. This, of course, was great news. But the system I have developed for ingesting texts into my corpus assumes that the texts will all be in TEI-XML, with markup indicating such important things as “author,” “title,” and “date” of publication. I downloaded about 100 novels and was about to begin opening them up one by one and adding the metadata . . . eek! I quickly realized the mundanity of the task and thought, “hmm, I bet someone has written a nice regex script for doing this sort of thing.” A quick trolling of the web led me to the web page of Michiel Overtoom, who had developed some Python scripts for downloading and cleaning up (“beautifying” in his language) Dutch Gutenberg texts for his eBook reader. Overtoom’s process is mainly designed to strip out the boilerplate and then rename the files with naming conventions that reflect the author and title of the books.
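To make the boilerplate problem concrete, here is a minimal sketch of the stripping step on its own. It is not Overtoom’s code; the helper name strip_gutenberg_boilerplate and the sample text are made up for illustration, and it assumes the body sits between lines containing "*** START" and "*** END" (or the variants without a space), which are the markers the script further down looks for.

```python
def strip_gutenberg_boilerplate(raw_text):
    """Return only the text between the START and END marker lines.

    Hypothetical helper for illustration; assumes the markers appear
    on lines of their own, as in typical Project Gutenberg files.
    """
    body = []
    collecting = False
    for line in raw_text.splitlines():
        if "*** START" in line or "***START" in line:
            collecting = True   # body begins on the next line
            continue
        if "*** END" in line or "***END" in line:
            break               # everything after this is license text
        if collecting:
            body.append(line)
    return "\n".join(body).strip()


# Made-up sample that mimics the layout of a Gutenberg file.
sample = """Title: Example Novel
Author: A. Writer

*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE NOVEL ***

Chapter 1

It was a dark night.

*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE NOVEL ***
License text follows here.
"""

print(strip_gutenberg_boilerplate(sample))
```

On the sample, only the chapter heading and the sentence after it survive; the metadata lines before the START marker and the license text after the END marker are dropped.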
With Overtoom’s script as a base, I reengineered the code to convert a Gutenberg text into a minimally encoded and TEI-compliant XML file. The script builds a teiHeader that includes the author and title of the work (unfortunately, Project Gutenberg texts do not include publication dates; why?) and then adds the “text,” “body,” “div,” and all the “p” tags. The final result is a document that meets basic TEI requirements. The script is copied below, but since the all-important Python indentation may be destroyed by this posting, it’s better to download it here and then change the file extension from “.txt” to “.py”. Enjoy!
# gutenbergToTei.py
#
# Reformats and renames etexts downloaded from Project Gutenberg.
#
# Software adapted from Michiel Overtoom, motoom@xs4all.nl, july 2009.
#
# Modified by Matthew Jockers August 17, 2010 to encode result into TEI based XML
#
# October 26, 2015: Peeter Tinits notes that for non-latin characters ', encoding="utf8"' could be added to both "open" functions.

import os
import re

# Paragraphs starting with one of these phrases are treated as boilerplate.
remove = ["Produced by", "End of the Project Gutenberg", "End of Project Gutenberg"]


def beautify(fn, outputDir, filename):
    '''
    Reads a raw Project Gutenberg etext, reformats paragraphs, and removes fluff.
    Wraps the result in a minimal TEI header and body and writes it to outputDir
    under the source file name, with "txt" replaced by "xml".
    '''
    with open(fn) as infile:
        lines = [line.strip() for line in infile]
    collect = False
    lookforsubtitle = False
    outlines = []
    startseen = endseen = False
    title = ""
    # Initialised here so a file without "Title: " or "Author: " lines
    # does not raise a NameError further down.
    titleTemp = ""
    authorTemp = ""
    one = "<?xml version=\"1.0\" encoding=\"utf-8\"?><TEI xmlns=\"http://www.tei-c.org/ns/1.0\" version=\"5.0\"><teiHeader><fileDesc><titleStmt>"
    two = "</titleStmt><publicationStmt><publisher></publisher><pubPlace></pubPlace><availability status=\"free\"><p>Project Gutenberg</p></availability></publicationStmt><seriesStmt><title>Project Gutenberg Full-Text Database</title></seriesStmt><sourceDesc default=\"false\"><biblFull default=\"false\"><titleStmt>"
    three = "</titleStmt><extent></extent><publicationStmt><publisher></publisher><pubPlace></pubPlace><date></date></publicationStmt></biblFull></sourceDesc></fileDesc><encodingDesc><editorialDecl default=\"false\"><p>Preliminaries omitted.</p></editorialDecl></encodingDesc></teiHeader><text><body><div>"
    for line in lines:
        if line.startswith("Author: "):
            author = line[8:]
            authorTemp = line[8:]
            continue
        if line.startswith("Title: "):
            title = line[7:]
            titleTemp = line[7:]
            lookforsubtitle = True
            continue
        if lookforsubtitle:
            if not line.strip():
                lookforsubtitle = False
            else:
                subtitle = line.strip()
                subtitle = subtitle.strip(".")
                title += ", " + subtitle
        if ("*** START" in line) or ("***START" in line):
            collect = startseen = True
            paragraph = ""
            continue
        if ("*** END" in line) or ("***END" in line):
            endseen = True
            break
        if not collect:
            continue
        # Emit the TEI header once, on the first line after the START marker.
        if (titleTemp) and (authorTemp):
            outlines.append(one)
            outlines.append("<title>")
            outlines.append(titleTemp)
            outlines.append("</title>")
            outlines.append("<author>")
            outlines.append(authorTemp)
            outlines.append("</author>")
            outlines.append(two)
            outlines.append("<title>")
            outlines.append(titleTemp)
            outlines.append("</title>")
            outlines.append("<author>")
            outlines.append(authorTemp)
            outlines.append("</author>")
            outlines.append(three)
            authorTemp = False
            titleTemp = False
            continue
        # A blank line closes the current paragraph and opens the next one.
        if not line:
            paragraph = paragraph.strip()
            for term in remove:
                if paragraph.startswith(term):
                    paragraph = ""
            if paragraph:
                # Escape ampersands so the output stays well-formed XML.
                paragraph = paragraph.replace("&", "&amp;")
                outlines.append(paragraph)
                outlines.append("</p>")
            paragraph = "<p>"
        else:
            paragraph += " " + line

    # Compose a filename. Replace some illegal file name characters with alternatives.
    #ofn = author + title[:150] + ".xml"
    ofn = filename
    ofn = ofn.replace("&", "")
    ofn = ofn.replace("/", "")
    ofn = ofn.replace("\"", "")
    ofn = ofn.replace(":", "")
    ofn = ofn.replace(",,", "")
    ofn = ofn.replace(" ", "")
    ofn = ofn.replace("txt", "xml")

    outlines.append("</div></body></text></TEI>")
    text = "\n".join(outlines)
    # flags must be passed by keyword; the fourth positional argument of re.sub is count.
    text = re.sub("End of the Project Gutenberg .*", "", text, flags=re.M)
    text = re.sub("Produced by .*", "", text, flags=re.M)
    text = re.sub(r"<p>\s+</p>", "", text)
    text = re.sub(r"\s+", " ", text)

    with open(os.path.join(outputDir, ofn), "wt") as f:
        f.write(text)


sourcepattern = re.compile(r".*\.txt$")
sourceDir = "/Path/to/your/ProjectGutenberg/files/"
outputDir = "/Path/to/your/ProjectGutenberg/TEI/Output/files/"

for fn in os.listdir(sourceDir):
    if sourcepattern.match(fn):
        beautify(os.path.join(sourceDir, fn), outputDir, fn)
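As a quick sanity check on the output, something like the following sketch could confirm that a converted file is well-formed and that the author and title landed in the teiHeader. It is not part of the script above: the helper name read_tei_metadata and the abbreviated sample document are made up for illustration, and it relies only on Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# The TEI namespace declared on the root element by the script.
TEI = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI}


def read_tei_metadata(xml_bytes):
    """Parse TEI bytes and return (title, author, paragraph_count).

    Hypothetical helper for illustration. ET.fromstring raises
    ET.ParseError if the document is not well-formed XML.
    """
    root = ET.fromstring(xml_bytes)
    title_stmt = root.find("tei:teiHeader/tei:fileDesc/tei:titleStmt", NS)
    if title_stmt is None:
        raise ValueError("no titleStmt found in teiHeader")
    # The script pads element text with spaces, hence the strip().
    title = title_stmt.findtext("tei:title", default="", namespaces=NS).strip()
    author = title_stmt.findtext("tei:author", default="", namespaces=NS).strip()
    paragraph_count = sum(1 for _ in root.iter("{%s}p" % TEI))
    return title, author, paragraph_count


# Abbreviated, made-up sample; real output carries a fuller header.
sample = (
    '<?xml version="1.0" encoding="utf-8"?>'
    '<TEI xmlns="http://www.tei-c.org/ns/1.0" version="5.0">'
    "<teiHeader><fileDesc><titleStmt>"
    "<title> Example Novel </title><author> A. Writer </author>"
    "</titleStmt></fileDesc></teiHeader>"
    "<text><body><div>"
    "<p> First paragraph. </p><p> Second paragraph. </p>"
    "</div></body></text></TEI>"
)

print(read_tei_metadata(sample.encode("utf-8")))
```

To check a real file, read it in binary mode and pass the bytes to the helper. Well-formedness is a weaker guarantee than validity against the TEI schema, so this only catches problems such as an unescaped "<" or an unclosed tag.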