POS Tagging XML with xGrid and the Stanford Log-linear Part-Of-Speech Tagger
Matthew L. Jockers

Recently (4/2008) I had reason to Part-Of-Speech tag a whole mess of novels, around 1200. I installed the Stanford Tagger and ran my first job of 250 novels on an old G4 under my desk. Everything worked fine, but the job took six days. After that experience, I figured out how to utilize xGrid for “distributed” tagging, or what I’ll call, according to convention, “Tagging@Home.” At the time that I was working on this tagging project, the folks in Stanford’s NLP group, especially Chris Manning and Anna Rafferty, were improving the tagger and adding some XML functionality to the program. I’m very grateful to Chris and Anna for their work. What follows is a practical guide for those who might wish to employ the tagger for use with XML or who might want to understand how to set up xGrid to help distribute a big tagging job.

First I provide some simple examples showing how to take advantage of the XML functionality added in the May 19, 2008 release of the Stanford Log-linear Part-Of-Speech Tagger. Further down in this page I include information about setting up xGrid to farm out a large tagging job to a network of Macs, useful if you want to POS tag a large corpus. These examples assume that you have installed the tagger and understand how to invoke the tagger from the command line. If you are not yet familiar with the tagger, you should consult the ReadMe.txt file and javadoc that come with it. In the javadoc, see specifically the link for “MaxentTagger” where there is a useful “parameter description” table.

Example One: Tagging XML with the Stanford Log-linear Part-Of-Speech Tagger

Many texts are currently available to us in XML format, and in literary circles the most common flavor of XML is TEI. In this example we will POS tag a typical TEI encoded XML file.

Begin by examining the structure of your source XML file to determine which XML elements “contain” the content that you wish to POS tag. More than likely, you don’t want to tag the title page, for example, but are interested primarily in the main text. To complete the exercises below, you may want to download a shortened version of James McHenry’s The Wilderness, which is the text I use in the examples; alternatively, you may use your own .xml file. The example file “wilderness.xml” is marked up according to TEI standards and thus contains two major structural divisions: “teiHeader” and “text.” The “teiHeader” element contains metadata about the file and the “text” element contains the actual text that has been marked up. For purposes of this example, I shortened the text to include only the first two chapters.

For this example, I assume that you wish to POS tag the text portions of the book that are contained in the main “body” of the book, that is to say, you are not interested in POS tagging the title page(s) or any ancillary material that may come before or after the primary text. To separate the main body of the text, TEI uses a “body” element, so we might begin by having the tagger focus only on the text that occurs within the body element.
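If you are not sure which elements hold the text in your own file, a quick tally of the element names inside “body” can help. What follows is only a rough sketch in Python, not part of the tagger: the sample markup is an invented stand-in for a TEI file, and the helper names (`local_name`, `count_elements`) are my own. For a real file you would load the tree with `ET.parse(path).getroot()` instead of `ET.fromstring`.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented stand-in for a TEI file (an assumption for illustration only).
SAMPLE_TEI = """<TEI>
  <teiHeader><fileDesc><titleStmt><title>The Wilderness</title></titleStmt></fileDesc></teiHeader>
  <text>
    <front><titlePage><docTitle><titlePart>The Wilderness</titlePart></docTitle></titlePage></front>
    <body>
      <div type="chapter">
        <lg><l>As slow our ship her foamy track,</l><l>Against the wind was cleaving,</l></lg>
        <p>Let melancholy spirits talk as they please.</p>
        <p>They have been speaking ill of themselves.</p>
      </div>
    </body>
  </text>
</TEI>"""


def local_name(tag):
    """Drop a '{namespace}' prefix, if any, from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def count_elements(root, container="body"):
    """Count the element names found inside the first `container` element."""
    for elem in root.iter():
        if local_name(elem.tag) == container:
            return Counter(
                local_name(child.tag) for child in elem.iter() if child is not elem
            )
    return Counter()


counts = count_elements(ET.fromstring(SAMPLE_TEI))
for name, n in counts.most_common():
    print(name, n)
```

The names that dominate the tally (here “p” and “l”) are the likely candidates to hand to the tagger.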

If we were simply tagging a plain text file, such as what you might find at Project Gutenberg, the usual command (as found in the tagger “readme.txt” file) would be as follows:

java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/bidirectional-wsj-0-18.tagger -textFile sample-input.txt > sample-output.txt

To deal with XML based input, however, we need to add another parameter indicating to the tagger that the source file is XML and more importantly that we only want to tag the contents of a specific XML element (or several specific elements, which is also possible). The additional parameter is “-xmlInput” and the parameter is followed by a space delimited list of XML tags whose content we want the POS tagger to tag. The revised command for tagging just the contents of the “body” element would look like this:

java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/bidirectional-wsj-0-18.tagger -xmlInput body -textFile ./pathToYourDirectory/Wilderness.xml > ./pathToYourDirectory/wilderness-output.xml

When invoked in this manner, the tagger grabs the content of the body tag for processing and ignores (or strips out) any XML markup contained within the selected tag. Running the command above thus has the effect of stripping out all of the rich markup for chapters, paragraphs, etc. In order to preserve more of the structural markup, a slightly better approach is to use the “p” tag instead of the “body” tag, as follows:

java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/bidirectional-wsj-0-18.tagger -xmlInput p -textFile ./pathToYourDirectory/Wilderness.xml > ./pathToYourDirectory/wilderness-output.xml

If you run this command, you’ll be a bit closer, but you might notice that our source file includes a series of poetic epigraphs at the beginning of chapters and a periodic bit of poetry dispersed elsewhere throughout the prose. Using just the “p” tag above, we fail to POS tag the poetic sections. Fortunately, the tagger allows us to specify more than one XML tag in the command, and we can thus POS tag both “p” tags and “l” tags (which contain lines of poetry) *see note*:

java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/bidirectional-wsj-0-18.tagger -xmlInput p\ l -textFile ./pathToYourDirectory/Wilderness.xml > ./pathToYourDirectory/wilderness-output.xml

Complete execution time for the command above using McHenry’s Wilderness on a typical desktop computer is about 30-40 seconds, but the processing time will vary depending on your machine and the speed of your processor. Here is a snippet of what the resulting output will look like:

<l>As/IN slow/JJ our/PRP$ ship/NN her/PRP$ foamy/NN track/NN ,/,</l> <l>Against/IN the/DT wind/NN was/VBD cleaving/VBG ,/,</l> <l>Her/PRP$ trembling/VBG pendant/JJ still/RB look/VB 'd/NNP back/RB</l> <l>To/TO that/DT dear/RB isle/VB 'twas/NNS leaving/VBG ;/:</l> <l>So/RB loth/NN we/PRP part/VBP from/IN all/DT we/PRP love/VBP ,/,</l> <l>From/IN all/DT the/DT links/NNS that/IN bind/NN us/PRP ,/,</l> <l>So/RB turn/VB our/PRP$ hearts/NNS where'er/VBP we/PRP rove/VBP ,/,</l> <l>To/TO those/DT we/PRP 've/VBP left/VBN behind/IN us/PRP !/. Moore/NNP</l>
<p>Let/VB melancholy/JJ spirits/NNS talk/VBP as/IN they/PRP please/VBP concerning/VBG the/DT degeneracy/NN and/CC increasing/VBG miseries/NNS of/IN mankind/NN ,/, I/PRP will/MD not/RB believe/VB them/PRP ./. They/PRP have/VBP been/VBN speaking/VBG ill/JJ of/IN themselves/PRP ,/, and/CC predicting/VBG worse/JJR of/IN their/PRP$ posterity/NN ,/, from/IN time/NN immemorial/JJ ;/: and/CC yet/RB ,/, in/IN the/DT present/JJ year/NN ,/, 1823/CD ,/, when/WRB ,/, if/IN the/DT one/CD hundreth/NN part/NN of/IN their/PRP$ gloomy/JJ forebodings/NNS had/VBD been/VBN realized/VBN ,/, the/DT earth/NN must/MD have/VB become/VBN a/DT Pandemonium/NN ,/, and/CC men/NNS something/NN worse/JJR than/IN devils/NNS ,/, -LRB-/-LRB- for/IN devils/NNS they/PRP have/VBP been/VBN long/JJ ago/RB ,/, in/IN the/DT opinion/NN of/IN these/DT charitable/JJ denunciators/NNS ,/, -RRB-/-RRB- I/PRP am/VBP free/JJ to/TO assert/VB ,/, that/IN we/PRP have/VBP as/IN many/JJ honest/JJ men/NNS ,/, pretty/RB women/NNS ,/, healthy/JJ children/NNS ,/, cultivated/VBN fields/NNS ,/, convenient/JJ houses/NNS ,/, elegant/JJ kinds/NNS of/IN furniture/NN ,/, and/CC comfortable/JJ clothes/NNS ,/, as/IN any/DT generation/NN of/IN our/PRP$ ancestors/NNS ever/RB possessed/VBN ./.</p>
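Output in this word/TAG form is easy to pull apart for counting. The following is a minimal Python sketch, assuming the layout shown in the snippet above (tokens separated by whitespace, each ending in a slash and a tag, wrapped in the original “l” or “p” markup). The function name `parse_tagged_line` is hypothetical.

```python
import re
from collections import Counter

# Matches any XML tag such as <l> or </p> so that it can be blanked out.
XML_TAG = re.compile(r"<[^>]+>")


def parse_tagged_line(line):
    """Return a list of (word, pos) pairs from one line of tagger output."""
    text = XML_TAG.sub(" ", line)
    pairs = []
    for token in text.split():
        # Split on the LAST slash so a word that itself contains a slash survives.
        word, sep, tag = token.rpartition("/")
        if sep:
            pairs.append((word, tag))
    return pairs


sample = "<l>As/IN slow/JJ our/PRP$ ship/NN her/PRP$ foamy/NN track/NN ,/,</l>"
pairs = parse_tagged_line(sample)
tag_counts = Counter(tag for _, tag in pairs)
print(pairs)
print(tag_counts)
```

Splitting on the last slash rather than the first is a small design choice: the tag never contains a slash in this output, but a word might.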

*Note that after the -xmlInput parameter we include the “p” tag and the “l” tag separated by an *escaped* space character. UNIX chokes on the space character if we don’t escape it with a backslash. Martin Holmes pointed out to me that if you are using Windows, then you should put the space delimited tags inside quotes (like this: "p l") and disregard the backslash.
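One way to sidestep the quoting question altogether is to launch the tagger from a script that hands over each argument separately, so that no shell ever sees the space. Here is a rough Python sketch of that idea. The flags are the ones used in the commands above; the function names `build_tagger_command` and `run_tagger` and the default paths are my own assumptions.

```python
import shlex
import subprocess


def build_tagger_command(xml_path, elements,
                         jar="stanford-postagger.jar",
                         model="models/bidirectional-wsj-0-18.tagger"):
    """Build the tagger invocation as a list of separate arguments."""
    return [
        "java", "-mx300m",
        "-classpath", jar,
        "edu.stanford.nlp.tagger.maxent.MaxentTagger",
        "-model", model,
        # The element names travel as ONE argument, e.g. "p l",
        # which is what the escaped space achieves on the command line.
        "-xmlInput", " ".join(elements),
        "-textFile", xml_path,
    ]


def run_tagger(cmd, output_path):
    """Run the tagger and send its stdout to output_path (needs java installed)."""
    with open(output_path, "w", encoding="utf-8") as out:
        subprocess.run(cmd, stdout=out, check=True)


cmd = build_tagger_command("./pathToYourDirectory/Wilderness.xml", ["p", "l"])
# Show the equivalent, safely quoted, POSIX shell command without running it.
print(shlex.join(cmd))
```

Calling `run_tagger(cmd, "wilderness-output.xml")` from inside the tagger directory would then do the actual work, on UNIX or Windows alike.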

Example Two: POS Tagging an XML source file and outputting results in XML

In addition to being able to “read” XML, the 5-19-2008 tagger release also includes functionality allowing users to output POS tagged results as well-formed XML. The process for doing so is very much the same as Example One above; however, to output XML, we need to add an additional “-xmlOutput” parameter to the command.

java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/bidirectional-wsj-0-18.tagger -xmlInput p\ l -xmlOutput -textFile ./pathToYourDirectory/Wilderness.xml > ./pathToYourDirectory/wilderness-output.xml

In this case, the resulting output is a far richer XML in which each sentence is wrapped in a “sentence” tag and each word in a “word” tag. The sentence tag includes an “id” attribute indicating its number in the sequence of sentences in the entire document. Likewise, the word element contains an “id” attribute referencing the word’s position in the sentence, followed by a “pos” attribute indicating what part of speech the tagger assigned the word. Here is a sample of the output:

<p> <sentence id="17"> <word id="0" pos="VB">Let</word> <word id="1" pos="JJ">melancholy</word> <word id="2" pos="NNS">spirits</word> <word id="3" pos="VBP">talk</word> <word id="4" pos="IN">as</word> <word id="5" pos="PRP">they</word> <word id="6" pos="VBP">please</word> <word id="7" pos="VBG">concerning</word> <word id="8" pos="DT">the</word> <word id="9" pos="NN">degeneracy</word> <word id="10" pos="CC">and</word> <word id="11" pos="VBG">increasing</word> <word id="12" pos="NNS">miseries</word> <word id="13" pos="IN">of</word> <word id="14" pos="NN">mankind</word> <word id="15" pos=",">,</word> <word id="16" pos="PRP">I</word> <word id="17" pos="MD">will</word> <word id="18" pos="RB">not</word> <word id="19" pos="VB">believe</word> <word id="20" pos="PRP">them</word> <word id="21" pos=".">.</word> </sentence> </p>
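Because this output is XML, any XML library can read it back. Below is a minimal Python sketch that pulls the sentence id and the (word, pos) pairs out of a fragment like the one above. I wrap the fragment in an invented “root” element because I am not assuming the output file has a single top-level element; the function name `read_sentences` is hypothetical as well.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the sample output shown above.
SAMPLE = (
    '<p> <sentence id="17"> <word id="0" pos="VB">Let</word> '
    '<word id="1" pos="JJ">melancholy</word> '
    '<word id="2" pos="NNS">spirits</word> </sentence> </p>'
)


def read_sentences(xml_fragment):
    """Return [(sentence_id, [(word, pos), ...]), ...] from tagged XML."""
    # The wrapper element guarantees a single root for the parser.
    root = ET.fromstring("<root>" + xml_fragment + "</root>")
    result = []
    for sentence in root.iter("sentence"):
        words = [(w.text, w.get("pos")) for w in sentence.iter("word")]
        result.append((int(sentence.get("id")), words))
    return result


for sentence_id, words in read_sentences(SAMPLE):
    print(sentence_id, words)
```

If your output file begins with an XML declaration, the wrapping trick will not work as written; in that case parse the file directly and keep the rest of the function.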

Using the Stanford Log-linear Part-Of-Speech Tagger with xGrid

There is really nothing special about using the tagger on an xGrid, but since it took me six hours to figure out how to do it and how to set everything up, I provide below a basic setup guide that will have you tagging in ten minutes (give or take an hour).

Not being a sys admin, I found the available documentation about xGrid a bit tricky to navigate, and frankly it just didn’t address what I feel to be rather fundamental questions about exactly how xGrid does what it does. More useful than Apple’s own xGrid documentation were Charles Parnot’s xGrid tutorials (the Apple manual even refers to these). Especially useful is Parnot’s “GridStuffer” application (more on that in a minute). The one problem I had with Parnot’s tutorials is that they all assumed that I was using a single machine as Client, Controller, and Agent. In xGrid lingo, the Client is the machine that submits a job to the Controller. The Controller is where xGrid “lives,” and it is the Controller that serves as the distributor and the distribution point for sending a job out to the “Agents.” Agents are the machines that are enlisted to do the heavy lifting, that is, they are all the machines on the network that are signed up to parallel process your job.

Example Three: POS Tagging with xGrid from the Command Line

To make life easy, the first thing I did was to figure out how to submit the job to xGrid from the cmd line. In this case I had ssh access to the server hosting the controller, so I installed the Tagger in a folder I created inside /Users/Shared/. The full path to my installation of the tagger was “/Users/Shared/Tagger/stanford-postagger-full-2008-05-19/”. Once this was done, I cd’ed (changed directory) down into this directory and then entered the following cmd at the prompt (you would substitute “hostname” and “password” with your information, e.g. -h myserver.mydomain.edu -p myPW):

xgrid -h <hostname> -p <password> -job submit /usr/bin/java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -arch bidirectional -lang english -model models/bidirectional-wsj-0-18.tagger -textFile sample-input.txt

You will notice one thing about the tagger portion of this cmd that is different: there is no redirect of the output. Above, when we were just tagging files without xGrid, the cmd ended like this “-textFile sample-input.txt > sample-output.txt” telling the program to print its output to a new file titled “sample-output.txt”. xGrid has its own way of handling “results” or stdout; xGrid saves the results of a job on its own and then allows us to retrieve the results using a job id.

So, after submitting the cmd above, xGrid returns some important information to the terminal, something like this:

{ jobIdentifier = 2822; }

This tells us that our job is id number 2822, and we’ll need that number in order to “return” the results after the job has finished. But after we submit the job we can also check on its status by entering the following cmd (the commands below show the id 2195; substitute whatever id xGrid returned for your own job):

xgrid -h <hostname> -p <password> -job attributes -id 2195

Which returns something like this:

{
    jobAttributes = {
        activeCPUPower = 0;
        applicationIdentifier = "com.apple.xgrid.cli";
        dateNow = 2008-05-23 15:15:16 -0700;
        dateStarted = 2008-05-23 15:14:45 -0700;
        dateStopped = 2008-05-23 15:14:49 -0700;
        dateSubmitted = 2008-05-23 15:14:41 -0700;
        jobStatus = Finished;
        name = "/usr/bin/java";
        percentDone = 100;
        taskCount = 1;
        undoneTaskCount = 0;
    };
}

jobStatus here tells us if the job is running, if it is finished or if it has failed. This job has finished, and so I enter the next cmd to retrieve the results:

xgrid -h <hostname> -p <password> -job results -id 2195

This command tells the xGrid to print the results of the job to my Terminal window, which isn’t all that great if I’ve just tagged a big file. What I really want is to send the result to a new file, so I can modify the cmd above to redirect the output to a file, like this:

xgrid -h <hostname> -p <password> -job results -id 2195 > sample-output.txt

You’ll now find a nicely POS-Tagged file titled “sample-output.txt” in your working directory.
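The submit, check, retrieve cycle above can also be scripted. The following is only a sketch in Python, under some loud assumptions: it uses just the xgrid flags shown above, it scrapes the job id and status out of the text xGrid prints with regular expressions, and it assumes a failed job reports the status string “Failed” (I have only shown you “Finished”). All of the function names are my own.

```python
import re
import subprocess
import time


def xgrid_command(host, password, *job_args):
    """Build an xgrid call such as: xgrid -h HOST -p PW -job results -id 2195"""
    return ["xgrid", "-h", host, "-p", password, "-job", *job_args]


def parse_job_id(submit_output):
    """Pull the id out of text like '{ jobIdentifier = 2822; }'."""
    match = re.search(r"jobIdentifier\s*=\s*(\d+)", submit_output)
    if match is None:
        raise ValueError("no jobIdentifier found in xgrid output")
    return match.group(1)


def parse_job_status(attributes_output):
    """Pull the status word out of the '-job attributes' text, or None."""
    match = re.search(r"jobStatus\s*=\s*(\w+)", attributes_output)
    return match.group(1) if match else None


def fetch_results(host, password, job_id, output_path, poll_seconds=30):
    """Poll until the job is finished, then save its results to a file."""
    while True:
        attributes = subprocess.run(
            xgrid_command(host, password, "attributes", "-id", job_id),
            capture_output=True, text=True, check=True,
        ).stdout
        status = parse_job_status(attributes)
        if status == "Finished":
            break
        if status == "Failed":  # assumed status string, see the note above
            raise RuntimeError("xGrid job %s failed" % job_id)
        time.sleep(poll_seconds)
    with open(output_path, "w", encoding="utf-8") as out:
        subprocess.run(
            xgrid_command(host, password, "results", "-id", job_id),
            stdout=out, check=True,
        )


# The parsing helpers can be tried without a grid:
print(parse_job_id("{ jobIdentifier = 2822; }"))
print(parse_job_status("{ jobAttributes = { jobStatus = Finished; }; }"))
```

Only `fetch_results` needs a working xGrid; the two parsing helpers run anywhere.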

Example Four: POS Tagging with xGrid Using Charles Parnot’s GridStuffer

You can download a copy of Charles Parnot’s GridStuffer at the xGrid@Stanford website. GridStuffer is a slick little application that greatly simplifies the work involved in putting together a large xGrid job. You should read Parnot’s GridStuffer Tutorial to understand what GridStuffer does and how it works. Everything that I do here is more or less exactly what is done in the tutorial, well, not exactly. The real point of difference pertains to the input file that contains all the “commands.” Because the Stanford Tagger is a Java application (and not quite like Parnot’s Fasta program) I had to figure out how to articulate the command lines for the GridStuffer input file. In retrospect it all seems very logical and simple, but trying to move from Parnot’s example to my own case actually proved quite challenging. In fact, it was only by reading through one of the forums that I discovered a key missing piece to my puzzle. Other users had reported problems calling Java applications, and a common thread had to do with paths and file locations. . .

The real trick (and it’s no trick once you understand how xGrid and GridStuffer work) involves how you define your paths and where you place files. I spent a lot of time trying to figure this all out, and even Parnot admits that understanding which paths are relative and which paths are absolute can get challenging. Let me explain. With GridStuffer, you are not, as I did above, logging into the Controller machine (server) and running an xGrid command locally. Instead, you are running GridStuffer on your own machine, separate from the Controller machine and thus acting as a true “Client.” This makes one wonder, right from the start, whether you need to be concerned about file paths on your machine or on the controller or on both. The answer is, “yes.”

Now there are always many ways to skin the cliche, so don’t assume that the way I set things up is the only approach or even the best approach; it is, however, an approach that works. I began by installing the tagger on my local machine (the machine that would serve as the Client and from which I would launch and invoke GridStuffer). I installed the tagger in my “Shared” directory at the following path: /Users/Shared/tagger/stanford-postagger-full-2008-05-19

Inside of this root folder of the tagger (stanford-postagger-full-2008-05-19), I then added another folder titled “input” into which I copied all of the files that I wanted to tag (in this case several hundred novels marked up in TEI XML). Next I created the “input file” (or “commands”) that GridStuffer requires and most importantly, I put it into the exact same directory (stanford-postagger-full-2008-05-19). Well, the truth is that I did not actually do this at first and spent a good deal of time trying to figure out why the program was choking. In Parnot’s tutorial, he has you store these files in a folder on your Desktop. This practice apparently works just fine with his Fasta tutorial, but it makes a Java app like the POS Tagger (and others) choke. Moving the input file to the root directory of the Tagger application solved all my problems (well, almost all of them). Anyhow, not only is the placement of this file important, but this is the critical file in the entire shebang; it is the file with *all* the calls to the tagger. Here is a snippet from my file:

/usr/bin/java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -arch bidirectional -lang english -model models/bidirectional-wsj-0-18.tagger -xmlInput p -textFile input/book.3.xml
/usr/bin/java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -arch bidirectional -lang english -model models/bidirectional-wsj-0-18.tagger -xmlInput p -textFile input/book.4.xml
/usr/bin/java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -arch bidirectional -lang english -model models/bidirectional-wsj-0-18.tagger -xmlInput p -textFile input/book.5.xml
/usr/bin/java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -arch bidirectional -lang english -model models/bidirectional-wsj-0-18.tagger -xmlInput p -textFile input/book.6.xml

Now for my big job of several hundred novels, I didn’t actually cut and paste each of these. I wrote a simple script to iterate through a directory of files and then write this “commands” file. I used a little PHP script run from the command line of the terminal; you could use Perl or Python or some other language for the same purpose. Note in the code above that I am tagging XML files and selecting the contents of the “p” elements just as I did in the tagger examples above.
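My own script was in PHP, so what follows is not that script but a rough Python sketch of the same idea: walk the “input” folder and write one tagger call per XML file. The command text is copied from the snippet above; the file name “commands.txt” and the function names are my assumptions, and the demonstration at the bottom uses a throwaway temporary folder rather than a real tagger installation.

```python
import tempfile
from pathlib import Path

# One line of the GridStuffer input file; {name} is the XML file to tag.
COMMAND_TEMPLATE = (
    "/usr/bin/java -mx300m -classpath stanford-postagger.jar "
    "edu.stanford.nlp.tagger.maxent.MaxentTagger -arch bidirectional "
    "-lang english -model models/bidirectional-wsj-0-18.tagger "
    "-xmlInput p -textFile input/{name}"
)


def build_commands(input_dir):
    """Return one tagger command per .xml file in input_dir, sorted by name."""
    return [
        COMMAND_TEMPLATE.format(name=path.name)
        for path in sorted(Path(input_dir).glob("*.xml"))
    ]


def write_commands_file(tagger_root, filename="commands.txt"):
    """Write the commands file into the tagger root, next to the jar."""
    root = Path(tagger_root)
    lines = build_commands(root / "input")
    (root / filename).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


# Demonstration in a temporary folder standing in for the tagger root.
with tempfile.TemporaryDirectory() as tmp:
    input_dir = Path(tmp) / "input"
    input_dir.mkdir()
    for name in ("book.3.xml", "book.4.xml", "notes.txt"):
        (input_dir / name).write_text("<TEI/>", encoding="utf-8")
    demo_lines = write_commands_file(tmp)
    print("\n".join(demo_lines))
```

Note that the paths written into the file are relative (“input/…”), which is why the commands file has to live in the tagger’s root directory. File names containing spaces would need quoting that this sketch does not attempt.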

It took me about six bangs of my head on the desk to figure out that I needed to provide the full path to “java” (i.e. /usr/bin/java). That was the only other counterintuitive bit. With this as my input file, I then selected an output directory using the handy GridStuffer interface and Voila! I was soon tagging away.