65,000 Texts to Mine?

A story in the Feb. 7th issue of the Telegraph reports that the British Library is going to make 65,000 first edition texts available for public download via Amazon’s Kindle. This news is almost as exciting as Google’s decision some years ago to partner with a consortium of big libraries in order to digitize all their books. What makes this project from the British Library particularly exciting is that the texts being offered are all works of 19th century fiction.

Unlike the Google project that is digitizing everything, this offering from the BL is already presorted to include just the kind of content that literary researchers can really use. With Google, I assume, one is going to have to figure out how to sort the legal books from the cook books, the memoirs from the fiction. Here, however, the BL has already done a big part of the work.

It will be interesting to see how this material gets offered and what sort of metadata is included with the individual files. For those of us who are interested in corpus-mining and macroanalysis (as opposed to just reading a single book at a time) the metadata is crucial. If, for example, we have the publication date of each text in an easily extractable format (e.g. TEI XML) we could explore all kinds of chronological investigations.

In prior research, working with a corpus of just 250 19th century British novels, I explored the “theme” of childhood by quantifying the relative frequency of a “cluster” or “semantic field” of words suggestive of “childhood”. In that work, I discovered a proportionally higher incidence of the theme during the Victorian period, a finding that tends to confirm the idea that childhood was an “invention” of the Victorians. But, then again, a corpus of 250 novels doesn’t even scratch the surface.
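The measurement described above can be sketched in a few lines of Python. This is only an illustration, not the code used in the original study: the word list `CHILDHOOD_FIELD` and the function name are hypothetical, and the tokenizer is deliberately crude.

```python
import re

def field_relative_frequency(text, field_words):
    """Share of word tokens in `text` that belong to the set `field_words`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in field_words)
    return hits / len(tokens)

# Hypothetical semantic field, for illustration only
CHILDHOOD_FIELD = {"child", "children", "childhood", "boy", "girl", "infant", "nursery"}

sample = "The child ran to the nursery, where the other children waited."
print(round(field_relative_frequency(sample, CHILDHOOD_FIELD), 3))
```

Computed per novel and plotted against publication date, a value like this is what would let one compare the incidence of a theme across periods.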

I’m not sure just what’s included in the British Library’s 65,000 texts. I assume these are not just British texts, but American, German, etc. Franco Moretti has estimated that there were 8,000 to 10,000 novels published in Great Britain in the 19th century (20-40,000 works of prose fiction). Surely a good many of these are part of the BL’s 65,000. Which brings us back to the metadata question. Will it be possible to generate a list of which texts in the 65,000 are British-authored and British-published *novels*? If the answer is yes, then the game is on.

Get the texts, convert from mobi to pdf, html, or other text format using any number of open source apps and then poof! You’ve got a COUS–Corpus of Unusual Size! Of course, it’d be a lot easier if the BL would make the texts available (for researchers at least) through a channel that doesn’t involve Amazon or one of the eBook formats. I’m investigating that path now and will report on any progress.

Is it the Joyce Industry or the Shakespeare Industry?

At the recent Digital Humanities Conference in Maryland, Matthew Wilkins and I got into a discussion about famous authors and the “industries” of scholarship that their works have inspired (see Matt’s blog post about our discussion and his survey analysis of the MLA bibliography).

The first time I ever heard the term “industry” used in this context was in reference to the scholarship generated by Joyce’s novel Ulysses. As Joyce himself predicted (bragged) the book would keep scholars busy for centuries to come, and, of course, Joyce was right–well maybe not centuries, but you get the idea. But can we really compare the Shakespeare “industry” to the Joyce “industry” given that the Bard had such a significant head start in terms of establishing his scholarly “fan base”?

Using the MLA bibliography, Matthew W. took a stab at this and compiled some rough figures of recent scholarship on the two masters. By Matt’s count, since 1923, Joyce has inspired just 9315 citations to Shakespeare’s massive 35,489.

But there is an obvious problem here: the figures begin in 1923 and Ulysses, the book that really puts Joyce on the map, was only published in 1922. So Joyce is getting into the industry-building business a bit late. Clearly we must do some norming here to account for the Bard’s head start.

Now, since I am pretty sure that I owe Matt a beer if the Bard has a bigger industry, I think some well thought out math is warranted here:-) . . .

Shakespeare died in 1616 and Joyce died in 1941. Subtracting each death date from the last year of Matt’s analysis (2008) means that Shakespeare had 392 years to develop his industry and Joyce only 67 years. If we divide the total number of citations Matt found in the MLA bibliography by the total number of industry-building years, then the figures tell a very different story. Joyce averages 139 citations per year whereas Shakespeare manages only a paltry 90.5.

But wait, there’s more. . . Querying the MLA bibliography using the search terms “shakespeare and hamlet” results in 4079 citations. A similar query for “joyce and ulysses” returns 3269. Normed for years of industry-building time these figures tell a sad, sad tale for the man from Stratford. Ulysses inspires 48.8 citations per year and Hamlet a meager 10.4.
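For anyone who wants to check the norming, here is the arithmetic as a short Python sketch. The citation counts and dates are the ones quoted above; the function name is just for illustration.

```python
LAST_YEAR = 2008  # final year covered by the MLA bibliography counts

def citations_per_year(citations, death_year, last_year=LAST_YEAR):
    """Citations divided by the years elapsed since the author's death."""
    return citations / (last_year - death_year)

figures = {
    "Shakespeare (all)": (35489, 1616),
    "Joyce (all)": (9315, 1941),
    "Hamlet": (4079, 1616),
    "Ulysses": (3269, 1941),
}

for label, (count, death_year) in figures.items():
    print(f"{label}: {citations_per_year(count, death_year):.1f} per year")
```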

In this sense, the Bard can be thought of as the steady industrial giant. His stock increases little by little, and he is a generally good investment. For the sake of convenience, let’s call him “GM.”

Joyce, on the other hand, is a relative newcomer to the marketplace. He is more like a Silicon Valley startup and his stock starts off slow and then sky-rockets. For the sake of convenience, we’ll call him “Google.”

Now, getting back to the central question, who’s bigger. . . You’ll find the answer here.

Machine-Classifying Novels and Plays by Genre

In the post that follows here, I describe some recent experiments that I (and others) have conducted. The goal of these experiments was to accurately machine-classify novels and plays (Shakespeare’s) by genre. One of the most interesting results ends up having more to do with feature extraction than classification algorithm.


Several weeks ago, Mike Witmore visited the Beyond Search workshop that I organize here at Stanford. In prior work, Witmore and some colleagues utilized a program called Docuscope (developed at Carnegie Mellon) to distinguish between and classify (statistically) Shakespeare’s histories and comedies.

“Equipped with a specialized dictionary, Docuscope is able to divide texts into strings of words that are then sorted into one of eighteen word categories, such as “Inner Thinking” and “Past Events.” The program turns differentiating amongst genres into a statistical task by testing the frequency of occurence of words in each of the categories for each individual genre and recognizing where significant differences occur.”

Docuscope was designed as a tool for analyzing student writing, but Witmore (et al.) discovered that it could also be employed as a specialized sort of feature extraction tool.

To test the efficacy of Docuscope as a tool for detecting and clustering novels by genre, Franco Moretti and I created a full text corpus that included 36 19th century novels (stripped of title page and other identifying information). We divided this corpus into three groups and organized them by genre:

  • Group one consisted of 12 texts belonging to 3 different (but fairly similar) genres (gothic, historical tale, and national tale)
  • Group two consisted of 12 texts belonging to 3 different genres that were quite different (industrial, silver-fork, bildungsroman).
  • Group three consisted of 12 texts belonging to 6 different genres that mix 3 genres from those already included in group one or two and 3 new genres (evangelical, newgate, and anti-jacobin).

Witmore was given this corpus in electronic form (each novel in plain text). For identification purposes (since Mike was not privy to the actual genres or titles of the novels), he labeled each of the 12 genre groups with a number 1-12. Witmore’s numberings correspond to genres as follows:

  1. Gothic
  2. Historical Novels
  3. National Tales
  4. Industrial Novels
  5. Silver-Fork Novels
  6. Bildungsroman
  7. Anti-Jacobin
  8. Industrial
  9. Gothic
  10. Evangelical
  11. Newgate
  12. Bildungsroman

Using Docuscope, Witmore ran a series of tests in an attempt to cluster the similar genres together. The experiment was designed to pick the three groups from 7-12 that have genre cognates in 1-6. Witmore’s results for the closest affiliated genres were impressive:

  • 2:9 (Historical with Gothic)
  • 1:9 (Gothic with Gothic); Witmore notes that this second cluster was a close (statistically) second to the one above
  • 4:8 (Industrial with Industrial)
  • 6:12 (Bildungsroman with Bildungsroman)

Witmore’s results also suggested an especially close relationship between the Gothic and Historical; Witmore writes that “groups 1 and 2 looked like they paired with the same candidate group (9).”

Additional Experiments

All of this work Witmore had done and the results he derived got me thinking more completely about the problem of genre classification. In many ways, genre classification is akin to authorship attribution. Generally speaking, though, with authorship problems one attempts to extract a feature set that excludes context-sensitive features from the analysis. (The consensus in most authorship attribution research suggests that a feature set made up primarily of frequent, or closed-class, word features yields the most accurate results.) For genre classification, however, one would intuitively assume that context words would be critical (e.g. Gothic novels often have “castles,” so we would not want to exclude context-sensitive words like “castle”). But my preliminary experiments have suggested just the opposite, namely that a distinct and detectable genre “signal” may be derived from a limited set of high-frequency features.

Using just 42 word and punctuation features, I was able to classify the novels in the corpus described above as well as Witmore did using Docuscope (and a far more complex feature set). To derive my feature set, I lowercase the texts, count the various feature types and convert the counts to relative frequencies, and then winnow the feature set by choosing only those features that have a mean relative frequency of 3% or greater. This results in the following 42 features (the prefix “p_” indicates a punctuation token instead of a word token):

“a”, “all”, “an”, “and”, “as”, “at”, “be”, “but”, “by”, “for”, “from”, “had”, “have”, “he”, “her”, “his”, “i”, “in”, “is”, “it”, “me”, “my”, “not”, “of”, “on”, “p_apos”, “p_comma”, “p_exlam”, “p_hyphen”, “p_period”, “p_quote”, “p_semi”, “she”, “that”, “the”, “this”, “to”, “was”, “were”, “which”, “with”, “you”
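The feature derivation described above (lowercase, count, convert to relative frequency, winnow by mean relative frequency) might be sketched in Python as follows. This is not the script used for the experiment: the tokenizer, the `PUNCT_NAMES` mapping, and the function names are assumptions made for illustration, and the threshold is left as a parameter.

```python
import re
from collections import Counter

# Hypothetical mapping from punctuation marks to "p_" feature names;
# only a subset of the marks listed above is handled here.
PUNCT_NAMES = {",": "p_comma", ".": "p_period", ";": "p_semi",
               "!": "p_exlam", "'": "p_apos"}

def relative_frequencies(text):
    """Lowercase the text, tokenize into words and punctuation marks,
    and return each feature's frequency as a percentage of all tokens."""
    tokens = re.findall(r"[a-z]+|[,.;!']", text.lower())
    counts = Counter(PUNCT_NAMES.get(tok, tok) for tok in tokens)
    total = sum(counts.values())
    return {feat: 100.0 * n / total for feat, n in counts.items()}

def winnow(profiles, threshold):
    """Keep features whose mean relative frequency across all texts
    is at least `threshold` (a feature absent from a text counts as 0)."""
    features = set().union(*profiles)
    kept = []
    for feat in sorted(features):
        mean = sum(p.get(feat, 0.0) for p in profiles) / len(profiles)
        if mean >= threshold:
            kept.append(feat)
    return kept

texts = ["The cat sat, and the dog sat.", "The dog ran; the cat ran!"]
profiles = [relative_frequencies(t) for t in texts]
print(winnow(profiles, 15.0))
```

On a real corpus each novel would contribute one profile, and the surviving features become the columns of the matrix that is passed to the clustering step.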

Using the “dist” and “hclust” functions in the open-source “R” statistics application, I cluster the texts and output the following dendrogram:

These results were compelling, and after I shared them with Mike Witmore, he suggested testing this methodology on his Shakespeare corpus. Again the results were compelling and this process accurately clustered the majority of Shakespeare’s plays into appropriate clusters of “tragedy,” “comedy,” and “history”. The dendrogram below shows the results of my Shakespeare experiment using these 37 features:

“a”, “and”, “as”, “be”, “but”, “for”, “have”, “he”, “him”, “his”, “i”, “in”, “is”, “it”, “me”, “my”, “not”, “of”, “p_apos”, “p_colon”, “p_comma”, “p_exlam”, “p_hyphen”, “p_period”, “p_ques”, “p_semi”, “so”, “that”, “the”, “this”, “thou”, “to”, “what”, “will”, “with”, “you”, “your”.
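For readers who do not use R, the clustering step behind both dendrograms can be approximated in plain Python. In R, “dist” computes Euclidean distances by default and “hclust” uses complete linkage by default; the sketch below assumes those defaults were used, and the toy vectors and text names are invented for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def complete_linkage(vectors):
    """Agglomerative clustering with complete linkage.
    `vectors` maps a text name to its feature vector.
    Returns the merges in order as (members_a, members_b, height)."""
    clusters = [(name,) for name in vectors]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: distance between two clusters is the
                # largest distance between any pair of their members.
                d = max(euclidean(vectors[a], vectors[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Invented relative frequencies for two features in four texts
toy = {
    "gothic_a": [6.0, 1.0],
    "gothic_b": [6.2, 1.1],
    "industrial_a": [4.0, 3.0],
    "industrial_b": [4.3, 3.1],
}
for left, right, height in complete_linkage(toy):
    print(left, "+", right, f"at height {height:.3f}")
```

The sequence of merges and their heights is exactly the information a dendrogram draws.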

These initial tests raise a number of important questions, not the least of which is the question of how much of a factor genre plays in determining the usage of high frequency word and punctuation tokens. We have plans to conduct a series of more rigorous experiments, and the results of these tests will be forthcoming. In the meantime, my initial tests appear to confirm, again, the significant role that common function words play in defining literary style.

Executing R in Php

For their final project, the students in my Introduction to Digital Humanities seminar decided to analyze narrative style in Faulkner’s The Sound and the Fury. In addition to significant off-line analysis, we are building a web-based application that allows visitors to compare the different sections of the novel to each other and also to new, unseen texts that visitors to the site can enter themselves.


To achieve this end, the web application must be able to “ingest” a text sample, tokenize it, extract data about the various features that will be used in the comparison, and then prepare (organize, tabulate) those features in a manner that will allow for statistical analysis/comparison. Since my course is not a course in statistics, we decided that I would be responsible for the number crunching.

So, while my students work out the text ingestion, tokenization, and preparation part of the project, I was tasked with figuring out how to crunch the numbers. My first instinct (not good) was to begin thinking about how to do the required math in php, which is the language the students are using for the rest of the project. Writing a complex statistical algorithm in php did not sound like much fun. My facility with statistics is almost entirely limited to descriptive statistics and though I do employ more complex statistical procedures in my work, I can’t say that I fully understand the equations that underlie them. So I quickly found myself wishing for a web-based version of (or maybe an API for) the open-source stats application “R”, which I use frequently in my own work. It turns out, of course, that lots of other folks had thought of this before me, and there are all sorts of web implementations of R. This was good news, but unfortunately, it wasn’t exactly what I was after. I did not want to replicate the R interface online. Instead, I wanted to be able to utilize the power of R through a user friendly front end. After five or six hours of hammering away, I eventually got what I was after: a way to call R from within php and return the results to a web interface.

What follows here is a simple step by step for replicating what I did. I offer this not as an example of something revolutionary or unprecedented–others have figured this out and there is nothing exceptional here–but instead as a way of documenting, in one place, the process I discovered after scraping the R archives, the various php sites, and the brain of my more R-savvy colleague Claudia. Hopefully this will prove useful to other rookies who may want to take a stab at something similar.

The steps below outline how to set up a php web page that accesses the “R” statistics package and outputs a .jpg Cluster Dendrogram showing how the texts clustered. The steps assume that you have already developed a script that ingests and processes a user-submitted text file and then adds that data to an existing data file containing data for the “canned” texts you wish to compare the user-submitted file against. I also assume that you have php and “R” installed on your server.

It warrants noting that what I present here is not exactly how I finally implemented the solution. What I show here provides for the clearest explanation of the process and works perfectly well, but it is not the “streamlined” final version of the script.

For this solution, I use four separate files:

  • form.html –a simple html form with two fields. The first field allows the user to enter a name for their file (e.g. “MySample”), and the second field is a “textarea” where the user can paste the text for comparison.
  • rtest.php –a result page for the form that gets executed after the user hits submit on the html form page. The php code in this file executes the R code on the server.
  • rCommands.txt –a text file containing a “canned” series of R commands.
  • cannedData.txt –a tab delimited text file containing rows of data for analysis. Each row contains three columns: “textId,” “word,” and “frequency” where textId is the name of the text sample (e.g. “Faulkner’s Benjy”) and frequency is a normalized (percentage) value representing the percentage frequency of the “word” in the given text sample.

Now the Steps:

  1. The user submits a text and text name at form.html.
  2. The text of the user submitted sample is processed into a term frequency array “$sampleData” using the built in php functions “str_word_count” and “array_count_values” to compute term-frequencies (relative word frequencies) for every word type in the file.
  3. The contents of cannedData.txt are read into a new php variable.
  4. A temporary file “data.txt” is created on the server and the contents of cannedData are written to “data.txt.”
  5. The contents of $sampleData are appended to the end of “data.txt” in the format: “\"$sampleName\"\t\"$word\"\t\"$percent\"\n” where “$sampleName” is the user entered name for the text sample, “$word” is a given word-type from the sample, and “$percent” is the normalized frequency of the word in the sample. Upon completion “data.txt” is closed.
  6. Using the php “exec” function, the script executes the unix cmd: “cat rCommands.txt | /usr/bin/R --vanilla”, which launches “R” and executes the R commands found inside the file “rCommands.txt”.
  7. “rCommands.txt” is a canned sequence of commands that loads “data.txt” into a data frame, runs the cross tabulation function to create a term frequency matrix, and then processes that matrix with the dist and hclust functions.
  8. The result is a .jpg file (“plot.jpg”) of the resulting cluster dendrogram created in the current directory.
  9. “plot.jpg” is then called in simple html (e.g. “<img src="plot.jpg">”) as a final step of the script.
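
Steps 2 through 6 can also be sketched outside of php. The Python below is a rough analogue, not the project’s script: the function names are hypothetical, the regular expression only approximates what “str_word_count” treats as a word, and lowercasing the sample is an added assumption. The file names and the R invocation are taken from the steps above.

```python
import re
import subprocess
from collections import Counter

def term_percentages(text):
    """Step 2: percentage frequency of every word type in the sample
    (rough analogue of str_word_count + array_count_values)."""
    words = re.findall(r"[a-z'-]+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {word: 100.0 * n / total for word, n in counts.items()}

def format_rows(sample_name, percentages):
    """Step 5: one quoted, tab-delimited row per word type."""
    return "".join(f'"{sample_name}"\t"{word}"\t"{pct}"\n'
                   for word, pct in percentages.items())

def build_data_file(canned_path, data_path, sample_name, text):
    """Steps 3-5: copy the canned data, then append the user's rows."""
    with open(canned_path, encoding="utf-8") as src:
        canned = src.read()
    with open(data_path, "w", encoding="utf-8") as out:
        out.write(canned)
        out.write(format_rows(sample_name, term_percentages(text)))

def run_r(commands_path="rCommands.txt", r_binary="/usr/bin/R"):
    """Step 6: feed the canned commands to R on standard input,
    the equivalent of `cat rCommands.txt | /usr/bin/R --vanilla`."""
    with open(commands_path, encoding="utf-8") as cmds:
        subprocess.run([r_binary, "--vanilla"], stdin=cmds, check=True)

print(format_rows("MySample", term_percentages("the cat the dog")), end="")
```

Passing the command to `subprocess.run` as a list, with the commands file attached to standard input, avoids building a shell string from user-supplied values.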

Readers can download all the source files here: execRSourceFiles.zip.

Chronicle of Higher Education Article

This week the Chronicle of Higher Education ran an article written by Jennifer Howard about “literary geospaces.” The article featured some work I have done mapping Irish-American literature using Google Earth (and also profiled the work of Janelle Jenstad who has been mapping early modern London).

Photo by Noah Berger

The bit about my Google Earth/Irish-American literature mash up resulted in several emails from folks wanting to know more about the project and more specifics about my findings. . . beware what you ask for. . .

I began building a bibliographic database of Irish-American literature many years ago when I was working on my dissertation (Jockers, Matthew L. “In search of Tir-Na-Nog: Irish and Irish-American Literature in the West.” Southern Illinois University, 1997). In 2002 I received a grant from the Stanford Humanities Laboratory to fund a web project called “The Irish-American West.” At that point I moved the database into MySql and put the whole thing on line with a search interface. As part of the grant, I also began digitizing and putting on line a number of specific Irish-American novels from the west. All of this work was later moved to the web site of the Western Institute of Irish Studies, a non-profit that I helped establish with then Irish Consul Donal Denham and a few other Bay Area enthusiasts. The archive and the database are alive and well at the Institute, and each year students who take my Introduction to Humanities Computing course help the archive grow by encoding one or two more full texts. (The group projects my students complete each year can be found on my courses page.)

Ironically, on St. Patrick’s day in 2007, I was invited to present a paper at the 2007 MLA meeting in Chicago as part of a panel session titled “Literary Geospaces.” The paper I delivered “Beyond Boston: Georeferencing Irish-American Literature” utilized Google Earth to help the audience visualize both the landscape and chronology of Irish-American literary history. I warned the audience at the time not to be seduced by the incredible visual appeal of Google Earth; GE is a stunning application, and I was honestly worried that my audience would lose track of my central thesis about the literary history of Irish-America if they got too caught up in the visualization of the data. I was also worried about the amount of time that went into the preparation of the Google Earth mash-up. The MLA is a meeting of literature and language professors, and I didn’t want to give the impression that putting something like this together was a simple matter (along with the Google Earth app itself, I’d utilized php, xml, xsl, html, and Mysql to build the .kml file that runs the whole show).

The central thesis of the paper was that in order to understand Irish-American literature we need to look not simply to the watershed moments of Irish-American history, but we must look to the very geography of America. As long ago as 1997, my research had shown that the Irish experience in America was largely determined by place. It’s true, of course, that the time of immigration to the U.S. was important in coloring the Irish experience: were these pre-famine immigrants, famine refugees, or the so-called “commuter Irish” of the 1980s? But I discovered that equally important to chronology was place and the business of where the immigrants settled. For my research, I divided the country up into a number of regions (Midwest, mountain, southwest, pacific. . .) and each one of these regions turned out to have a distinct “brand” of Irish-American writing. Generally speaking, though, the further west we go the more likely we are to find writers describing the Irish-American experience in positive terms. And perhaps more importantly, the further west we go the more Irish writing there seems to be if we view “more” in relative terms, as a percentage of the Irish population.

I suppose one of the most interesting things I discovered along the way involves what was happening in the early part of the 20th century. My colleague Charles Fanning has speculated that in the early 1900s, from around 1900 to 1930, Irish-Americans turned away from writing about their experience in the United States. These were difficult times for Irish-Americans, and Fanning writes in his impressive book The Irish Voice in America how “a number of circumstances–historical, cultural, and political, including the politics of literature–combined to [create] a form of wholesale cultural amnesia (3).”

What I discovered was that Irish writers in the western U.S. were largely undeterred.

And this all made perfectly good sense: Irish writers in the West did not have to face the same prejudice that their counterparts in the East faced. There was no established Anglo-Protestant majority in the West, there was far less competition for good jobs, and generally speaking the Irish who ventured west were better off and typically better educated than their countrymen in the East. Thus they had more means and more opportunity for writing. So if we look at the entire corpus we find not a period of literary recession in the early 1900s, but instead a period of heightened activity. It’s only when we probe that activity that we discover that writers from west of the Mississippi are the ones being active.

Here is a link to a Quicktime video of the Google Earth mash-up. I’m still working on setting up an interactive version that will query my database dynamically and allow visitors to sort and probe the entire collection. . . more on that later.

POS Tagging XML with xGrid and the Stanford Log-linear Part-Of-Speech Tagger

Recently (4/2008) I had reason to Part-Of-Speech tag a whole mess of novels, around 1200. I installed the Stanford Tagger and ran my first job of 250 novels on an old G4 under my desk. Everything worked fine, but the job took six days. After that experience, I figured out how to utilize xGrid for “distributed” tagging, or what I’ll call, according to convention, “Tagging@Home.” At the time that I was working on this tagging project, the folks in Stanford’s NLP group, especially Chris Manning and Anna Rafferty, were improving the tagger and adding some XML functionality to the program. I’m very grateful to Chris and Anna for their work. What follows is a practical guide for those who might wish to employ the tagger for use with XML or who might want to understand how to set up xGrid to help distribute a big tagging job.

First I provide some simple examples showing how to take advantage of the XML functionality added in the May 19, 2008 release of the Stanford Log-linear Part-Of-Speech Tagger. Further down in this page I include information about setting up xGrid to farm out a large tagging job to a network of Macs, useful if you want to POS tag a large corpus. These examples assume that you have installed the tagger and understand how to invoke the tagger from the cmd line. If you are not yet familiar with the tagger, you should consult the ReadMe.txt file and javadoc that come with it. In the javadoc, see specifically the link for “MaxentTagger” where there is a useful “parameter description” table.

Example One: Tagging XML with the Stanford Log-linear Part-Of-Speech Tagger

Many texts are currently available to us in XML format, and in literary circles the most common flavor of XML is TEI. In this example we will POS tag a typical TEI encoded XML file.

Begin by examining the structure of your source XML file to determine which XML elements “contain” the content that you wish to POS tag. More than likely, you don’t want to tag the title page, for example, but are interested primarily in the main text. To complete the exercises below, you may want to download a shortened version of James McHenry’s The Wilderness, which is the text I use in the examples; alternatively, you may use your own .xml file. The example file “wilderness.xml” is marked up according to TEI standards and thus contains two major structural divisions: “teiHeader” and “text.” The “teiHeader” element contains metadata about the file and the “text” element contains the actual text that has been marked up. For purposes of this example, I shortened the text to include only the first two chapters.

For this example, I assume that you wish to POS tag the text portions of the book that are contained in the main “body” of the book, that is to say, you are not interested in POS tagging the title page(s) or any ancillary material that may come before or after the primary text. To separate the main body of the text, TEI uses a “body” element, so we might begin by having the tagger focus only on the text that occurs within the body element.

If we were simply tagging a plain text file, such as what you might find at Project Gutenberg, the usual command (as found in the tagger “readme.txt” file) would be as follows:

java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/bidirectional-wsj-0-18.tagger -textFile sample-input.txt > sample-output.txt

To deal with XML based input, however, we need to add another parameter indicating to the tagger that the source file is XML and more importantly that we only want to tag the contents of a specific XML element (or several specific elements, which is also possible). The additional parameter is “-xmlInput” and the parameter is followed by a space delimited list of XML tags whose content we want the POS tagger to tag. The revised command for tagging just the contents of the “body” element would look like this:

java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/bidirectional-wsj-0-18.tagger -xmlInput body -textFile ./pathToYourDirectory/Wilderness.xml > ./pathToYourDirectory/wilderness-output.xml

When invoked in this manner, the tagger grabs the content of the body tag for processing and ignores (or strips out) any XML markup contained within the selected tag. Running the command above thus has the effect of stripping out all of the rich markup for chapters, paragraphs, etc. In order to preserve more of the structural markup, a slightly better approach is to use the “p” tag instead of the “body” tag, as follows:

java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/bidirectional-wsj-0-18.tagger -xmlInput p -textFile ./pathToYourDirectory/Wilderness.xml > ./pathToYourDirectory/wilderness-output.xml

If you run this command, you’ll be a bit closer, but you might notice that our source file includes a series of poetic epigraphs at the beginning of chapters and a periodic bit of poetry dispersed elsewhere throughout the prose. Using just the “p” tag above, we fail to POS tag the poetic sections. Fortunately, the tagger allows us to specify more than one XML tag in the command, and we can thus POS tag both “p” tags and “l” tags (which contain lines of poetry) *see note*:

java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/bidirectional-wsj-0-18.tagger -xmlInput p\ l -textFile ./pathToYourDirectory/Wilderness.xml > ./pathToYourDirectory/wilderness-output.xml

Complete execution time for the command above using McHenry’s Wilderness on a typical desktop computer is about 30-40 seconds, but the processing time will vary depending on your machine and the speed of your processor. Here is a snippet of what the resulting output will look like:

<l>As/IN slow/JJ our/PRP$ ship/NN her/PRP$ foamy/NN track/NN ,/,</l>
<l>Against/IN the/DT wind/NN was/VBD cleaving/VBG ,/,</l>
<l>Her/PRP$ trembling/VBG pendant/JJ still/RB look/VB 'd/NNP back/RB</l>
<l>To/TO that/DT dear/RB isle/VB 'twas/NNS leaving/VBG ;/:</l>
<l>So/RB loth/NN we/PRP part/VBP from/IN all/DT we/PRP love/VBP ,/,</l>
<l>From/IN all/DT the/DT links/NNS that/IN bind/NN us/PRP ,/,</l>
<l>So/RB turn/VB our/PRP$ hearts/NNS where'er/VBP we/PRP rove/VBP ,/,</l>
<l>To/TO those/DT we/PRP 've/VBP left/VBN behind/IN us/PRP !/. Moore/NNP</l>

<p>Let/VB melancholy/JJ spirits/NNS talk/VBP as/IN they/PRP please/VBP concerning/VBG the/DT
degeneracy/NN and/CC increasing/VBG miseries/NNS of/IN mankind/NN ,/, I/PRP will/MD not/RB
believe/VB them/PRP ./. They/PRP have/VBP been/VBN speaking/VBG ill/JJ of/IN themselves/PRP ,/,
and/CC predicting/VBG worse/JJR of/IN their/PRP$ posterity/NN ,/, from/IN time/NN immemorial/JJ ;/:
and/CC yet/RB ,/, in/IN the/DT present/JJ year/NN ,/, 1823/CD ,/, when/WRB ,/, if/IN the/DT one/CD
hundreth/NN part/NN of/IN their/PRP$ gloomy/JJ forebodings/NNS had/VBD been/VBN realized/VBN ,/,
the/DT earth/NN must/MD have/VB become/VBN a/DT Pandemonium/NN ,/, and/CC men/NNS something/NN
worse/JJR than/IN devils/NNS ,/, -LRB-/-LRB- for/IN devils/NNS they/PRP have/VBP been/VBN long/JJ
ago/RB ,/, in/IN the/DT opinion/NN of/IN these/DT charitable/JJ denunciators/NNS ,/, -RRB-/-RRB-
I/PRP am/VBP free/JJ to/TO assert/VB ,/, that/IN we/PRP have/VBP as/IN many/JJ honest/JJ men/NNS ,/,
pretty/RB women/NNS ,/, healthy/JJ children/NNS ,/, cultivated/VBN fields/NNS ,/, convenient/JJ
houses/NNS ,/, elegant/JJ kinds/NNS of/IN furniture/NN ,/, and/CC comfortable/JJ clothes/NNS ,/,
as/IN any/DT generation/NN of/IN our/PRP$ ancestors/NNS ever/RB possessed/VBN ./.</p>
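
Output in this form is easy to post-process. As a hedged illustration (the function name is hypothetical, and the regular expression simply drops any XML tags left in the line), the following Python splits each “word/TAG” token on its last slash:

```python
import re
from collections import Counter

def parse_tagged(line):
    """Return (word, tag) pairs from a line of word/TAG output.
    Any XML tags such as <l> or </p> are removed first."""
    plain = re.sub(r"<[^>]+>", " ", line)
    pairs = []
    for token in plain.split():
        # Split on the LAST slash so tokens like ",/," parse correctly
        word, sep, tag = token.rpartition("/")
        if sep:
            pairs.append((word, tag))
    return pairs

line = "<l>Against/IN the/DT wind/NN was/VBD cleaving/VBG ,/,</l>"
pairs = parse_tagged(line)
print(pairs)
print(Counter(tag for _, tag in pairs))
```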

*Note that after the -xmlInput parameter we include the “p” tag and the “l” tag separated by an *escaped* space character. UNIX chokes on the space character if we don’t escape it with a backslash. Martin Holmes pointed out to me that if you are using Windows, then you should put the space delimited tags inside quotes (like this: “p l”) and disregard the backslash.
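
One way to sidestep the escaping question on either platform is to launch the tagger from a script that passes each argument separately. The Python below is a sketch under that assumption: the helper names are hypothetical, while the flags, class name, and model path are copied from the commands above.

```python
import subprocess

def tagger_command(xml_tags, input_path,
                   model="models/bidirectional-wsj-0-18.tagger",
                   xml_output=False):
    """Build the argument list for the tagger. The space-delimited tag
    list is a single argument, so no shell escaping is needed."""
    cmd = ["java", "-mx300m", "-classpath", "stanford-postagger.jar",
           "edu.stanford.nlp.tagger.maxent.MaxentTagger",
           "-model", model,
           "-xmlInput", " ".join(xml_tags)]
    if xml_output:
        cmd.append("-xmlOutput")
    cmd += ["-textFile", input_path]
    return cmd

def run_tagger(xml_tags, input_path, output_path, **options):
    """Run the tagger and write its standard output to output_path."""
    with open(output_path, "w", encoding="utf-8") as out:
        subprocess.run(tagger_command(xml_tags, input_path, **options),
                       stdout=out, check=True)

print(tagger_command(["p", "l"], "./pathToYourDirectory/Wilderness.xml"))
```

Calling `run_tagger(["p", "l"], "Wilderness.xml", "wilderness-output.xml")` would then correspond to the last command shown above, assuming java and the tagger jar are available in the working directory.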

Example Two: POS Tagging an XML source file and outputting results in XML

In addition to being able to “read” xml, the 5-19-2008 tagger release also includes functionality allowing users to output POS tagged results as well-formed xml. The process for doing so is very much the same as Example One above; to output XML, however, we need to add an additional “-xmlOutput” parameter to the command.

java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/bidirectional-wsj-0-18.tagger -xmlInput p\ l -xmlOutput -textFile ./pathToYourDirectory/Wilderness.xml > ./pathToYourDirectory/wilderness-output.xml

In this case, the resulting output is a far richer XML in which each sentence is wrapped in a “sentence” tag and each word in a “word” tag. The sentence tag includes an “id” attribute indicating its number in the sequence of sentences in the entire document. Likewise, the word element contains an “id” attribute referencing the word’s position in the sentence, followed by a “pos” attribute indicating what part of speech the tagger assigned the word. Here is a sample of the output:

<sentence id="17">
<word id="0" pos="VB">Let</word>
<word id="1" pos="JJ">melancholy</word>
<word id="2" pos="NNS">spirits</word>
<word id="3" pos="VBP">talk</word>
<word id="4" pos="IN">as</word>
<word id="5" pos="PRP">they</word>
<word id="6" pos="VBP">please</word>
<word id="7" pos="VBG">concerning</word>
<word id="8" pos="DT">the</word>
<word id="9" pos="NN">degeneracy</word>
<word id="10" pos="CC">and</word>
<word id="11" pos="VBG">increasing</word>
<word id="12" pos="NNS">miseries</word>
<word id="13" pos="IN">of</word>
<word id="14" pos="NN">mankind</word>
<word id="15" pos=",">,</word>
<word id="16" pos="PRP">I</word>
<word id="17" pos="MD">will</word>
<word id="18" pos="RB">not</word>
<word id="19" pos="VB">believe</word>
<word id="20" pos="PRP">them</word>
<word id="21" pos=".">.</word>
</sentence>
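Once the results are in this form, pulling counts out of them takes only a few lines. The following Python sketch uses the standard library’s ElementTree to tally the “pos” attributes and to list the words carrying a given tag. It assumes the tagged sentences sit inside a single root element; the doc wrapper in the sample is my own addition for illustration, not something the tagger is known to emit, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# A shortened sample in the shape shown above. The <doc> wrapper is an
# assumption added so the fragment parses as one XML document.
SAMPLE = """<doc>
<sentence id="17">
<word id="0" pos="VB">Let</word>
<word id="1" pos="JJ">melancholy</word>
<word id="2" pos="NNS">spirits</word>
<word id="3" pos="VBP">talk</word>
</sentence>
</doc>"""

def count_pos_tags(xml_text):
    """Count how often each part-of-speech tag occurs."""
    root = ET.fromstring(xml_text)
    return Counter(word.get("pos") for word in root.iter("word"))

def words_with_tag(xml_text, tag):
    """Return the words, in document order, that carry the given tag."""
    root = ET.fromstring(xml_text)
    return [word.text for word in root.iter("word") if word.get("pos") == tag]

counts = count_pos_tags(SAMPLE)
print(counts.most_common())
print(words_with_tag(SAMPLE, "NNS"))
```

For a whole novel you would read the output file from disk instead of using an inline string; if the real output has no single root element, it would need to be wrapped before parsing.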

Using the Stanford Log-linear Part-Of-Speech Tagger with xGrid

There is really nothing special about using the tagger on an xGrid, but since it took me six hours to figure out how to do it and how to set everything up, I provide below a basic setup guide that will have you tagging in ten minutes (give or take an hour).

Not being a sys admin, I found the available documentation about xGrid a bit tricky to navigate, and frankly it just didn’t address what I feel to be rather fundamental questions about exactly how xGrid does what it does. More useful than Apple’s own xGrid documentation were Charles Parnot’s xGrid tutorials available here (the Apple manual even refers to these). Especially useful is Parnot’s “GridStuffer” application (more on that in a minute). The one problem I had with Parnot’s tutorials is that they all assumed that I was using a single machine as Client, Controller, and Agent. In xGrid lingo, the Client is the machine that submits a job to the Controller. The Controller is where xGrid “lives,” and it is the Controller that distributes the job to the “Agents.” Agents are the machines that are enlisted to do the heavy lifting; that is, they are all the machines on the network that are signed up to process your job in parallel.

Example Three: POS Tagging with xGrid from the Command Line

To make life easy, the first thing I did was to figure out how to submit the job to xGrid from the cmd line. In this case I had ssh access to the server hosting the Controller, so I installed the tagger in a folder I created inside /Users/Shared/. The full path to my installation of the tagger was “/Users/Shared/Tagger/stanford-postagger-full-2008-05-19/”. Once this was done, I cd’ed (changed directory) down into this directory and then entered the following cmd at the prompt (you would substitute “hostname” and “password” with your information, e.g. -h myserver.mydomain.edu -p myPW):

xgrid -h <hostname> -p <password> -job submit /usr/bin/java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -arch bidirectional -lang english -model models/bidirectional-wsj-0-18.tagger -textFile sample-input.txt

You will notice one thing about the tagger portion of this cmd that is different: there is no redirect of the output. Above, when we were just tagging files without xGrid, the cmd ended like this: “-textFile sample-input.txt > sample-output.txt”, telling the program to print its output to a new file titled “sample-output.txt”. xGrid has its own way of handling “results” or stdout; it saves the results of a job itself and then allows us to retrieve them using a job id.

So, after submitting the cmd above, xGrid returns some important information to the terminal, something like this:

jobIdentifier = 2822;

This tells us that our job is id number 2822, and we’ll need that number in order to retrieve the results after the job has finished. (The commands that follow were captured from a different job, id 2195; substitute your own job’s id.) After we submit the job, we can also check on its status by entering the following cmd:

xgrid -h <hostname> -p <password> -job attributes -id 2195

Which returns something like this:

jobAttributes = {
activeCPUPower = 0;
applicationIdentifier = "com.apple.xgrid.cli";
dateNow = 2008-05-23 15:15:16 -0700;
dateStarted = 2008-05-23 15:14:45 -0700;
dateStopped = 2008-05-23 15:14:49 -0700;
dateSubmitted = 2008-05-23 15:14:41 -0700;
jobStatus = Finished;
name = "/usr/bin/java";
percentDone = 100;
taskCount = 1;
undoneTaskCount = 0;
};

jobStatus here tells us if the job is running, if it is finished or if it has failed. This job has finished, and so I enter the next cmd to retrieve the results:

xgrid -h <hostname> -p <password> -job results -id 2195

This command tells the xGrid to print the results of the job to my Terminal window, which isn’t all that great if I’ve just tagged a big file. What I really want is to send the result to a new file, so I can modify the cmd above to redirect the output to a file, like this:

xgrid -h <hostname> -p <password> -job results -id 2195 > sample-output.txt

You’ll now find a nicely POS-Tagged file titled “sample-output.txt” in your working directory.
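The submit, check, retrieve cycle above is easy to script once you can pull the job id and status out of the text xGrid prints. The following Python sketch does only the text handling: it assumes the output looks like the samples shown in this post (“jobIdentifier = 2822;” and “jobStatus = Finished;”), and the function names are hypothetical. It does not contact a Controller; it just builds the argument list for the results command.

```python
import re

def parse_job_id(submit_output):
    """Extract the numeric job id from the text printed by a job submission."""
    match = re.search(r"jobIdentifier\s*=\s*(\d+)\s*;", submit_output)
    return int(match.group(1)) if match else None

def parse_job_status(attributes_output):
    """Extract the jobStatus value from the text printed by '-job attributes'."""
    match = re.search(r"jobStatus\s*=\s*(\w+)\s*;", attributes_output)
    return match.group(1) if match else None

def results_command(hostname, password, job_id):
    """Build the argument list for retrieving a finished job's results."""
    return ["xgrid", "-h", hostname, "-p", password,
            "-job", "results", "-id", str(job_id)]

# Sample text copied from the output shown in this post.
submit_text = "jobIdentifier = 2822;"
attributes_text = """jobAttributes = {
jobStatus = Finished;
percentDone = 100;
"""

job_id = parse_job_id(submit_text)
if parse_job_status(attributes_text) == "Finished":
    # Placeholder hostname and password, as in the example above.
    print(results_command("myserver.mydomain.edu", "myPW", job_id))
```

In a real script the two sample strings would be replaced by the captured output of the corresponding xgrid commands, and the results command would be run with its stdout written to a file.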

Example Four: POS Tagging with xGrid Using Charles Parnot’s GridStuffer

You can download a copy of Charles Parnot’s GridStuffer at the xGrid@Stanford website. GridStuffer is a slick little application that greatly simplifies the work involved in putting together a large xGrid job. You should read Parnot’s GridStuffer Tutorial to understand what GridStuffer does and how it works. What I do here is more or less what is done in the tutorial, with one real point of difference: the input file that contains all the “commands.” Because the Stanford Tagger is a Java application (and not quite like Parnot’s Fasta program) I had to figure out how to articulate the command lines for the GridStuffer input file. In retrospect it all seems very logical and simple, but trying to move from Parnot’s example to my own case actually proved quite challenging. In fact, it was only by reading through one of the forums that I discovered a key missing piece to my puzzle. Other users had reported problems calling Java applications, and a common thread had to do with paths and file locations.

The real trick (and it’s no trick once you understand how xGrid and GridStuffer work) involves how you define your paths and where you place files. I spent a lot of time trying to figure this all out, and even Parnot admits that understanding which paths are relative and which paths are absolute can get challenging. Let me explain. With GridStuffer, you are not, as I did above, logging into the Controller machine (server) and running an xGrid command locally. Instead, you are running GridStuffer on your own machine, separate from the Controller machine, and thus acting as a true “Client.” This makes one wonder, right from the start, whether you need to be concerned about file paths on your machine or on the Controller or on both. The answer is, “yes.”

Now there are always many ways to skin the cliche, so don’t assume that the way I set things up is the only approach or even the best approach; it is, however, an approach that works. I began by installing the tagger on my local machine (the machine that would serve as the Client and from which I would launch and invoke GridStuffer). I installed the tagger in my “Shared” directory at the following path: /Users/Shared/tagger/stanford-postagger-full-2008-05-19

Inside of this root folder of the tagger (stanford-postagger-full-2008-05-19), I then added another folder titled “input” into which I copied all of the files that I wanted to tag (in this case several hundred novels marked up in TEI XML). Next I created the “input file” (or “commands”) that GridStuffer requires and, most importantly, I put it into the exact same directory (stanford-postagger-full-2008-05-19). Well, the truth is that I did not actually do this at first and spent a good deal of time trying to figure out why the program was choking. In Parnot’s tutorial, he has you store these files in a folder on your Desktop. This practice apparently works just fine with his Fasta tutorial, but it makes a Java app like the POS Tagger (and others) choke. Moving the input file to the root directory of the Tagger application solved all my problems (well, almost all of them). Anyhow, not only is the placement of this file important, but this is the critical file in the entire shebang; it is the file with *all* the calls to the tagger. Here is a snippet from my file:

/usr/bin/java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -arch bidirectional -lang english -model models/bidirectional-wsj-0-18.tagger -xmlInput p -textFile input/book.3.xml
/usr/bin/java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -arch bidirectional -lang english -model models/bidirectional-wsj-0-18.tagger -xmlInput p -textFile input/book.4.xml
/usr/bin/java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -arch bidirectional -lang english -model models/bidirectional-wsj-0-18.tagger -xmlInput p -textFile input/book.5.xml
/usr/bin/java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -arch bidirectional -lang english -model models/bidirectional-wsj-0-18.tagger -xmlInput p -textFile input/book.6.xml

Now for my big job of several hundred novels, I didn’t actually cut and paste each of these. I wrote a simple script to iterate through a directory of files and then write this “commands” file. I used a little PHP script run from the command line of the terminal; you could use Perl or Python or some other language for the same purpose. Note in the code above that I am tagging XML files and selecting the contents of the “p” elements just as I did in the tagger examples above.
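For anyone who would rather not write the PHP, here is a rough Python equivalent. It is a sketch, not the script I actually used: the function names are hypothetical, and it assumes the files to tag end in “.xml”, sit in the “input” folder described above, and have no spaces in their names. Each line it produces matches the pattern of the snippet shown above.

```python
import os

# One tagger call per input file; {path} is filled in with the relative path.
TAGGER_CALL = (
    "/usr/bin/java -mx300m -classpath stanford-postagger.jar "
    "edu.stanford.nlp.tagger.maxent.MaxentTagger -arch bidirectional "
    "-lang english -model models/bidirectional-wsj-0-18.tagger "
    "-xmlInput p -textFile {path}"
)

def build_command_lines(filenames, relative_prefix="input"):
    """Return one tagger command per .xml filename, in sorted order."""
    return [
        TAGGER_CALL.format(path=f"{relative_prefix}/{name}")
        for name in sorted(filenames)
        if name.endswith(".xml")
    ]

def write_commands_file(input_dir, commands_path):
    """List input_dir, write the commands file, and return the line count."""
    lines = build_command_lines(os.listdir(input_dir))
    with open(commands_path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
    return len(lines)
```

Run from the root folder of the tagger, a call such as write_commands_file("input", "commands.txt") would write the file alongside the jar, which is where it needs to live. The name “commands.txt” is only an example.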

It took me about six bangs of my head on the desk to figure out that I needed to provide the full path to “java” (i.e. /usr/bin/java). That was the only other counterintuitive bit. With this as my input file, I then selected an output directory using the handy GridStuffer interface and Voila! I was soon tagging away.