Matthew L. Jockers

Category Archives: Tips and Code

Text Analysis with R . . . coming soon.

Monday, 21 April 2014

Posted by Matthew Jockers in R-Code, Text-Mining, Tips and Code

My new book, Text Analysis with R for Students of Literature is due from Springer sometime in May. I got the cover proofs this week (below). Looking good:-)

[Image: cover proof of Text Analysis with R for Students of Literature]

Experimenting with “gender” package in R

Tuesday, 25 February 2014

Posted by Matthew Jockers in R-Code, Text-Mining, Tips and Code

Yesterday afternoon, Lincoln Mullen and Cameron Blevins released a new R package that is designed to guess (infer) the gender of a name. In my class on literary characterization at the macroscale, students are working on a project that involves a computational study of character genders . . . needless to say, the “gender” package couldn’t have come at a better time. I’ve only begun to experiment with the package this morning, but so far it looks very promising.

It doesn’t do everything that we need, but it’s a great addition to our workflow. I’ve copied below some R code that uses the gender package in combination with some named entity recognition to try to extract character names and genders from a small sample of prose in Twain’s Tom Sawyer. I tried a few other text samples and discovered some significant challenges (e.g. “Mrs. Dashwood”), but these have more to do with last names and the difficult problem of accurate NER than with the gender package itself.

Anyhow, I’ve just begun to experiment, so no big conclusions here, just some starter code to get folks thinking. Hopefully others will take this idea and run with it!

library(gender)
library(openNLP)
require(NLP)
 
# openNLP annotators for sentence and word tokenization
sent_token_annotator <- Maxent_Sent_Token_Annotator()
word_token_annotator <- Maxent_Word_Token_Annotator()
 
s <- as.String("Tom did play hookey, and he had a very good time. He got back home barely in season to help Jim, the small colored boy, saw next-day's wood and split the kindlings before supper—at least he was there in time to tell his adventures to Jim while Jim did three-fourths of the work. Tom's younger brother (or rather half-brother) Sid was already through with his part of the work (picking up chips), for he was a quiet boy, and had no adventurous, trouble-some ways. While Tom was eating his supper, and stealing sugar as opportunity offered, Aunt Polly asked him questions that were full of guile, and very deep—for she wanted to trap him into damaging revealments. Like many other simple-hearted souls, it was her pet vanity to believe she was endowed with a talent for dark and mysterious diplomacy, and she loved to contemplate her most transparent devices as marvels of low cunning.")
 
# tokenize, then run named entity recognition over the annotations
# (Maxent_Entity_Annotator requires the openNLPmodels.en package)
a2 <- annotate(s, list(sent_token_annotator, word_token_annotator))
entity_annotator <- Maxent_Entity_Annotator()
named.ents <- s[entity_annotator(s, a2)]
 
# split multi-word entities on non-word characters and drop blanks
named.ents.l <- strsplit(named.ents, "\\W")
named.ents.v <- unlist(named.ents.l)
not.blanks.v <- which(named.ents.v != "")
named.ents.v <- named.ents.v[not.blanks.v]
 
# look up the (probable) gender of each extracted name
gender(tolower(named.ents.v))

And here is the output:

   name gender proportion_male proportion_female
1   tom   male          0.9971            0.0029
2   jim   male          0.9968            0.0032
3   tom   male          0.9971            0.0029
4   tom   male          0.9971            0.0029
5  aunt   <NA>              NA                NA
6 polly female          0.0000            1.0000
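
One easy refinement (my suggestion, not part of the original post): dedupe the extracted names before the lookup so that each character is queried only once.

# Dedupe the extracted names so each character is looked up once
# (a small refinement of the code above, not in the original post).
gender(unique(tolower(named.ents.v)))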

A Festivus Miracle: Some R Bingo code

Monday, 9 December 2013

Posted by Matthew Jockers in Just for Fun, R-Code, Tips and Code

A few weeks ago my daughter’s class was gearing up to celebrate the Thanksgiving holiday, and I was asked to help prepare some “holiday bingo cards” for the kids’ party. Naturally, I wrote a program in R for the job! (I know, I know, Maslow’s hammer.)

Since I learned a few R tricks for making a grid and placing text, and since today is the first day of the Hour of Code, I’ve decided to release the code;-)

The 13 lines (excluding comments) of code below will produce 25 random Festivus Bingo Cards and write them out to a single pdf file for easy printing. You supply the Festivus Nog.

bingo.cards.script.r
# file to make random holiday bingo boards in R. . .
 
# How many different cards do you want to make?
num.unique.bingo.cards <- 25
 
# 25 words to populate the board with:
my.words <- toupper(c("Seinfeld", "Costanza", "Festivus", "Kramer", "pole", "Grievances", "Strength", "meatloaf", "wrestling", "Miracle", "Elaine", "Jerry", "George", "Dinner", "NBC", "December", "Sitcom", "1997", "two-face", "donation", "Kruger", "Flask", "Strike", "Submarine", "Gwen"))
 
# Put it all in a single pdf file for easy printing!
pdf("bingo.cards.pdf", width=7.5, height=7.5)
for(i in 1:num.unique.bingo.cards){
  # Build a 5 x 5 Matrix for the Grid; ignore the recycling warnings:-)
  m <- matrix(0:1, nrow=5, ncol=5)
  image(m, col=c("darkgreen", "red"), xaxt="n", yaxt="n", main="Festivus Bingo")
  # the mirrored values below yield all 25 cell-center coordinates via combn
  the.vals <- c(0, .25, .5, .75, 1, 1, .75, .5, .25, 0)
  x.y.vals <- combn(the.vals, 2)
  the.coords <- unique(t(x.y.vals))
  # shuffle the words and place one at each cell center
  word.sample <- sample(my.words)
  text(the.coords, word.sample, col="white", cex=.75)
}
dev.off()

Text Analysis with R for Students of Literature

Tuesday, 3 September 2013

Posted by Matthew Jockers in Text-Mining, Tips and Code

[Update (9/3/13 8:15 CST): Contributors list now active at the main Text Analysis with R for Students of Literature Resource Page]

Below this post you will find a link where you can download a draft of Text Analysis with R for Students of Literature. The book is under review with Springer as part of a new series titled “Quantitative Methods in the Humanities and Social Sciences.”

Springer agreed to let me post the draft manuscript here (thank you, Springer), and my hope is that you will download the manuscript, take it for a test drive, and then send me your thoughts. I’m especially interested to hear about areas of confusion, places where you get lost, or where you feel your students might get lost. I’m also interested in hearing about the errors (hopefully not too many), and, naturally, I’ll be delighted to hear about anything you like.

I’m open to suggestions for new sections, but before you suggest that I include another chapter on “your favorite topic,” please read the Preface where I lay out the scope of the book. It’s a beginner’s book, and helping “literary folk” get started with R is my primary goal. This is not the place to get into debates or details about hyperparameter optimization or the relative merits of p-values.*

Please also read the Acknowledgements. It is there that I hint at the spirit and intent behind the book and behind this call for feedback. I did not learn R without help, and there is still a lot about R that I have to learn. I want to acknowledge both of these facts directly and specifically. Those who offer feedback will be added to a list of contributors to be included in the print and online editions of the final text. Feedback of a substantial nature will be acknowledged directly and specifically.

[Update: The book is now in production and the draft has been removed.] That’s it. Download Text Analysis with R for Students of Literature (1.3MB .pdf)

* Besides, that ground has been well covered by Scott Weingart.

“Secret” Recipe for Topic Modeling Themes

Friday, 12 April 2013

Posted by Matthew Jockers in Text-Mining, Tips and Code

The recently (yesterday) published issue of JDH is all about topic modeling. It’s a great issue, and it got me thinking about some of the lessons I have learned over seven or eight years of modeling literary corpora. One of the important things I have learned is that the quality of the final model (which is to say the coherence and usefulness of the topics) is largely dependent upon preprocessing. I know, I know: “that’s not much fun.”

Fun or no, it is the reality, and there’s no getting around it. One of the first things you discover when you begin modeling literary materials is that books have a lot of characters. And here I don’t mean the letters “A, B, C,” but actual literary characters as in “Ahab, Beowulf, and Copperfield.” These characters can cause no end of headaches in topic modeling. Let me explain. . .

As I write this blog post, I am running a smallish topic modeling job over a corpus of 50 novels that I have selected for use in a topic modeling workshop I am teaching next week in Milwaukee. Without any preprocessing I get topics that look like these two:

[Topic cloud: a topic of words from Moby Dick]

[Topic cloud: a topic of words from Dracula]

There is nothing wrong with these topics except that one is obviously a “Moby Dick” topic and the other a “Dracula” topic. A big part of the reason these topics formed this way is the power of the character names (yes, “whale” is a character). The presence of the character names tends to bias the model and make it collect collocates that cluster around character names. Instead of getting a topic having to do with “seafaring” (a theme, by the way, that appears in both Moby Dick and Dracula), we get these broad novel-specific topics instead.

That is not what we want.

To deal with this character “problem,” I begin by expanding the usual topic modeling “stop list” from the 100 or so high-frequency, closed-class words (such as “the, of, a, and . . .”) to include about 5,600 common names, or “named entities.” I posted this “expanded stoplist” to my blog some months ago as ancillary material for my book; feel free to copy it for your own work. I built my expanded stop list through a combination of named entity recognition and the scraping of baby name web sites:-)
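
As an illustration of what that expansion looks like in practice, here is a minimal R sketch. The file name expanded.stoplist.txt is a hypothetical stand-in for the expanded stopwords list posted on this site, and the tm package supplies the standard closed-class list.

# A minimal sketch, assuming a plain-text file of named entities,
# one name per line ("expanded.stoplist.txt" is a hypothetical name).
library(tm)
names.v <- tolower(scan("expanded.stoplist.txt", what = "character", sep = "\n"))
# merge the ~100 closed-class words with the ~5,600 names
full.stoplist.v <- unique(c(stopwords("english"), names.v))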

Using the exact same model parameters that produced the two topics above, but now with the expanded stop list, I get topics that are much less about individual novels and much more about themes that cross novels. Here are two examples.

[Topic cloud: a topic of seafaring words]

[Topic cloud: a topic of words relating to Native Americans, but mostly from Last of the Mohicans?]

The first topic cloud seems pretty good. In the previous run of the model, without the expanded stop list, there was no such topic. The second one, however, is still problematic, largely because my expanded stopwords list, even at 5,631 words, is still imperfect. “Heyward” is a character from Last of the Mohicans whose name is not in my stop list.

But in addition to this imperfection, I would argue that there are other problems as well, at least if our objective is to harvest topics of a thematic nature. Notice, for example, the word “continued” just to the left of “heyward” and then notice “demanded” near the bottom of the cloud. These words contribute very little to the thematic sense of the topic, so ideally they too should be stopped out.

As a next step in preprocessing, therefore, I employ part-of-speech tagging, or “POS tagging,” to identify and ultimately “stop out” all of the words that are not nouns! Since I can already hear my friend Ted Underwood screaming about “discourses,” let me justify this suggestion with a small but important caveat: I think this is a good way to capture thematic information; it certainly does not capture such things as affect (i.e. attitudes toward the theme) or other nuances that may be very important to literary analysis and interpretation.

POS tagging is well documented, so I’m not going to foreground it here other than to say that it’s an imperfect method. It does make mistakes, but the best taggers (such as the Stanford Tagger that I usually use) have very high (97%+) accuracy (see, for example, Manning 2011).
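
The post doesn’t include tagging code, but for readers who want to stay in R, here is a sketch using openNLP’s Maxent POS annotator as a stand-in for the Stanford Tagger (my substitution; it produces the same word/TAG format shown below):

# A sketch of POS tagging in R with openNLP, substituting for the
# Stanford Tagger mentioned above; the output format is the same.
library(NLP)
library(openNLP)
 
s <- as.String("The family of Dashwood had been long settled in Sussex.")
a2 <- annotate(s, list(Maxent_Sent_Token_Annotator(),
                       Maxent_Word_Token_Annotator()))
a3 <- annotate(s, Maxent_POS_Tag_Annotator(), a2)
 
# pull out the word annotations and their POS features
a3w <- subset(a3, type == "word")
tags <- sapply(a3w$features, `[[`, "POS")
paste(s[a3w], tags, sep = "/")  # "The/DT" "family/NN" "of/IN" ...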

After running a POS tagger, I have a simple little script that uses a simple little regular expression to change the following tagged sentences:

The/DT family/NN of/IN Dashwood/NNP had/VBD been/VBN long/RB settled/VBN in/IN Sussex./NNP Their/PRP$ estate/NN was/VBD large,/RB and/CC their/PRP$ residence/NN was/VBD at/IN Norland/NNP Park,/NNP in/IN the/DT centre/NN of/IN their/PRP$ property,/NN where,/, for/IN many/JJ generations,/NNS they/PRP had/VBD lived/VBN in/IN so/RB respectable/JJ a/DT manner,/JJ as/IN to/TO engage/VB the/DT general/JJ good/JJ opinion/NN of/IN their/PRP$ surrounding/VBG acquaintance./NN

into

family estate residence centre property generations opinion acquaintance
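
That “simple little script” isn’t reproduced in the post, but the noun-filtering step is easy to sketch. The regular expression below is mine, not the author’s; it assumes the word/TAG format shown above and keeps common nouns (NN, NNS) while dropping proper nouns, which the expanded stop list handles anyway.

# A sketch of the noun-filtering step (my regex, not the author's),
# assuming tokens in the "word/TAG" format shown above.
tagged.s <- "The/DT family/NN of/IN Dashwood/NNP had/VBD been/VBN long/RB settled/VBN in/IN Sussex./NNP"
tokens.v <- unlist(strsplit(tagged.s, "\\s+"))
nouns.v <- grep("/NNS?$", tokens.v, value = TRUE)  # NN and NNS only
tolower(gsub("/NNS?$", "", nouns.v))               # strips tags: "family"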

Just with this transformation to nouns alone, you can begin to see how a theme of “property” or “family estates” might eventually evolve from these words during the topic modeling process. But there is still one more preprocessing step before we can run the LDA. The next step (which can really be the first step) is text chunking or segmentation.

Topic models like to have lots of texts, or, more precisely, lots of bags of words. Topic models such as LDA do not take word order into account; they assume that each text or document is a bag of words. Novels are very big bags, and if we don’t chunk them up into smaller pieces we end up getting topics of a very general nature. By chunking each novel into smaller pieces, we allow the model to discover themes that occur only in specific places within novels and not just across entire novels. Consider the theme of death, for example. While there may be entire novels about death, more than likely death is going to pop up once or twice in every novel. In order for the topic model to detect or find a death topic, however, it needs to encounter bags of words that are largely about death. If the whole novel is a single bag of words, then death might not be prominent enough to rise to the level of “topicdom.”

I have found through lots and lots of experimentation that 500-1000 word chunks are pretty darn good when modeling novels. It might help to think in terms of pages: 500-1000 words is roughly 2-4 pages. The argument for this length goes something like this: a good death scene takes several pages to develop. . . etc.

Exactly what chunk length I choose is a matter of some witchcraft and alchemy; it is similar to the witchcraft and tarot involved in choosing the number of topics. I’ll not unpack either of those here, but you can read more in chapter 8 of my book (plug). Here the point is simply that some chunking needs to happen if you are working with big documents.
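
For concreteness, here is a minimal chunking sketch (my illustration; the author’s own chunking code is not shown in the post):

# Split a vector of words into ~1000-word bags; each bag becomes
# one "document" for the topic model (a sketch, not the author's script).
chunk.text <- function(words.v, chunk.size = 1000) {
  breaks.v <- ceiling(seq_along(words.v) / chunk.size)
  lapply(split(words.v, breaks.v), paste, collapse = " ")
}
# e.g. chunks.l <- chunk.text(novel.nouns.v)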

So here, finally, is my “secret” recipe in pseudocode:

for each novel as novel {
    POS tag novel
    split tagged novel into 1000 word chunks
    for each chunk as chunk {
        remove non-nouns from chunk
        lowercase everything
        remove stop list words from chunk
    }
}
run LDA over chunks
analyze data
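
The post stops at pseudocode and does not name an LDA implementation, so here is one concrete possibility for the final step, using the topicmodels package (my choice, purely for illustration). It assumes chunks.v is a character vector of noun-only, stoplisted chunks, e.g. unlist(chunks.l) from the sketch above.

# One way to run the LDA step (my choice of package; the post does
# not specify an implementation). chunks.v holds the prepared chunks.
library(tm)
library(topicmodels)
 
corpus <- Corpus(VectorSource(chunks.v))
dtm <- DocumentTermMatrix(corpus)
lda.model <- LDA(dtm, k = 50, method = "Gibbs", control = list(iter = 500))
terms(lda.model, 10)  # top ten words in each of the fifty topics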

Of course, there is a lot more to it than this: you need to keep track of which chunks go with which novels and so on. But this is the general recipe.* Here are two topics derived from the same corpus of novels, now without character names and without non-nouns.

[Topic cloud: Art and Music]

[Topic cloud: Crime and Justice]

* The word “Secret” in my title is in quotes because there is nothing secret about the ingredients in this particular recipe. The idea of combining POS tagging, text chunking, and LDA is well established in various papers, including, for example, “TagLDA: Bringing Document Structure Knowledge into Topic Models” (2006) and “Reading Tea Leaves: How Humans Interpret Topic Models” (2009).

Stalker (R) and the journey of the Jockers iPhone

Tuesday, 27 April 2010

Posted by Matthew Jockers in Commentary, Tips and Code

Lots of hoopla in the last few days over the discovery that the iPhone keeps a database of the locations it has traveled. It wasn’t long before someone in the R community figured out how to tap into this file; with a mere two lines of code you can visualize where your phone has been on a map.

The code library comes compliments of Drew Conway over on the r-bloggers page. I installed the app and within a few seconds had several maps of my recent travels. I attach two images below (don’t tell my mom).
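
For the curious, here is a hedged reconstruction of the idea (my sketch, not Drew Conway’s actual code). It assumes a local copy of the phone’s consolidated.db containing a CellLocation table with Latitude and Longitude columns; the table and column names are assumptions based on contemporary reports of the file’s layout.

# A hedged reconstruction (not Drew Conway's actual code): read the
# phone's location cache and plot it. Assumes consolidated.db holds
# a CellLocation table with Latitude and Longitude columns.
library(DBI)
library(RSQLite)
library(maps)
 
con <- dbConnect(SQLite(), "consolidated.db")
locs.df <- dbGetQuery(con, "SELECT Latitude, Longitude FROM CellLocation")
dbDisconnect(con)
 
map("usa")
points(locs.df$Longitude, locs.df$Latitude, col = "red", pch = 20, cex = 0.5)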


Executing R in PHP

Tuesday, 11 November 2008

Posted by Matthew Jockers in Tips and Code

For their final project, the students in my Introduction to Digital Humanities seminar decided to analyze narrative style in Faulkner’s The Sound and the Fury. In addition to significant off-line analysis, we are building a web-based application that allows visitors to compare the different sections of the novel to each other and also to new, unseen texts that visitors to the site can enter themselves.

[Image: sample cluster dendrogram produced by the application]

To achieve this end, the web application must be able to “ingest” a text sample, tokenize it, extract data about the various features that will be used in the comparison, and then prepare (organize, tabulate) those features in a manner that will allow for statistical analysis/comparison. Since my course is not a course in statistics, we decided that I would be responsible for the number crunching.

So, while my students work out the text ingestion, tokenization, and preparation part of the project, I was tasked with figuring out how to crunch the numbers. My first instinct (not good) was to begin thinking about how to do the required math in PHP, which is the language the students are using for the rest of the project. Writing a complex statistical algorithm in PHP did not sound like much fun. My facility with statistics is almost entirely limited to descriptive statistics, and though I do employ more complex statistical procedures in my work, I can’t say that I fully understand the equations that underlie them. So I quickly found myself wishing for a web-based version of (or maybe an API for) the open-source stats application “R”, which I use frequently in my own work. It turns out, of course, that lots of other folks had thought of this before me, and there are all sorts of web implementations of R. This was good news, but unfortunately it wasn’t exactly what I was after. I did not want to replicate the R interface online. Instead, I wanted to be able to utilize the power of R through a user-friendly front end. After five or six hours of hammering away, I eventually got what I was after: a way to call R from within PHP and return the results to a web interface.

What follows here is a simple step-by-step for replicating what I did. I offer this not as an example of something revolutionary or unprecedented–others have figured this out and there is nothing exceptional here–but instead as a way of documenting, in one place, the process I discovered after scraping the R archives, the various PHP sites, and the brain of my more R-savvy colleague Claudia. Hopefully this will prove useful to other rookies who may want to take a stab at something similar.

The steps below outline how to set up a PHP web page that accesses the R statistics package and outputs a .jpg cluster dendrogram showing how the texts clustered. The steps assume that you have already developed a script that ingests and processes a user-submitted text file and then adds that data to an existing data file containing data for the “canned” texts you wish to compare the user-submitted file against. I also assume that you have PHP and R installed on your server.

It warrants noting that what I present here is not exactly how I finally implemented the solution. What I show here provides the clearest explanation of the process and works perfectly well, but it is not the “streamlined” final version of the script.

For this solution, I use four separate files:

  • form.html – a simple HTML form with two fields. The first field allows the user to enter a name for their file (e.g. “MySample”), and the second field is a “textarea” where the user can paste the text for comparison.
  • rtest.php – the result page for the form, executed after the user hits submit on the HTML form page. The PHP code in this file executes the R code on the server.
  • rCommands.txt – a text file containing a “canned” series of R commands.
  • cannedData.txt – a tab-delimited text file containing rows of data for analysis. Each row contains three columns: “bookId,” “word,” and “freq” (the column names the R commands below expect), where bookId is the name of the text sample (e.g. “Faulkner’s Benjy”) and freq is a normalized value giving the percentage frequency of the word in the given text sample.

Now the Steps:

  1. The user submits a text and text name at form.html.
  2. The text of the user-submitted sample is processed into a term frequency array “$sampleData” using the built-in PHP functions “str_word_count” and “array_count_values” to compute term frequencies (relative word frequencies) for every word type in the file.
  3. The contents of cannedData.txt are read into a new PHP variable.
  4. A temporary file “data.txt” is created on the server and the contents of cannedData.txt are written to “data.txt.”
  5. The contents of $sampleData are appended to the end of “data.txt” in the format: “\”$sampleName\”\t\”$word\”\t\”$percent\”\n” where “$sampleName” is the user-entered name for the text sample, “$word” is a given word type from the sample, and “$percent” is the normalized frequency of the word in the sample. Upon completion “data.txt” is closed.
  6. Using the PHP “exec” function, the script executes the Unix command “cat rCommands.txt | /usr/bin/R --vanilla”, which launches R and executes the R commands found inside the file “rCommands.txt”.
  7. “rCommands.txt” is a canned sequence of commands that loads “data.txt” into a data frame and runs the cross-tabulation function to create a term frequency matrix that can be processed with the dist and hclust functions, as follows:
    # read the combined data and cross-tabulate it into a term frequency matrix
    file <- read.csv("data.txt", header=T, sep="\t")
    xt <- xtabs(freq ~ bookId+word, data=file)
    # cluster the texts and write the dendrogram to a .jpg
    # (jpeg() sizes in pixels by default, so specify inches and a resolution)
    cluster <- hclust(dist(xt))
    jpeg("plot.jpg", width=8, height=6, units="in", res=150)
    par(family="mono")
    plot(cluster)
    dev.off()
    # also write out the full distance matrix and the term frequency table
    distanceMatrix <- as.matrix(dist(xt, upper=T, diag=T))
    x <- row.names(distanceMatrix)
    write.table(distanceMatrix, file="dataMatrix.txt", sep="\t", eol="\n", col.names=x, row.names=FALSE)
    write.table(xt, file="xt.txt", sep="\t", eol="\n")
  8. The result is a .jpg file (“plot.jpg”) containing the cluster dendrogram, created in the current directory.
  9. “plot.jpg” is then displayed with simple HTML (e.g. <img src=”plot.jpg”>) as a final step of the script.

Readers can download all the source files here: execRSourceFiles.zip.
