Experimenting with “gender” package in R

Yesterday afternoon, Lincoln Mullen and Cameron Blevins released a new R package that is designed to guess (infer) the gender of a name. In my class on literary characterization at the macroscale, students are working on a project that involves a computational study of character genders. . . needless to say, the ‘gender’ package couldn’t have come at a better time. I only began experimenting with the package this morning, but so far it looks very promising.

It doesn’t do everything that we need, but it’s a great addition to our workflow. I’ve copied below some R code that uses the gender package in combination with some named entity recognition to try to extract character names and genders from a small sample of prose from Twain’s Tom Sawyer. I tried a few other text samples and discovered some significant challenges (e.g. Mrs. Dashwood), but these have more to do with last names and the difficult problems of accurate NER than with the gender package itself.

Anyhow, I’ve just begun to experiment, so no big conclusions here, just some starter code to get folks thinking. Hopefully others will take this idea and run with it!

And here is the output:

Characterization in Literature and the Macroanalysis Lab

I have just posted the syllabus for my spring macroanalysis class focusing on Characterization in Literature. The class is experimental in many senses of the word. We will be experimenting in the class and the class will be an experiment. If all goes according to plan, the only thing about this class that will be different from a research lab is the grade I have to assign at the end—that is the one remaining bit about collaborative learning that still kicks me . . .

To be successful, everyone is going to have to be high-performing and self-motivated, me included. For me, at least, the motivation comes from what I think is a really tough nut to crack: algorithmic detection and analysis of character and character types. So far the work in this area has been largely about character networks: how is Hamlet related to Gertrude, etc. That’s good work, but it depends heavily upon the human coding of character metadata before processing. That is precisely why our early experiments at the Stanford Literary Lab focused on drama. . . the character names are already explicit in the speaker markup. Beyond drama, there have been some important steps taken in the direction of auto-detection of character in fiction, such as those by Graham Sack and Elson et al., but I think we still have a lot more stepping to do, a whole lot more.

The work I envision for the course will include leveraging obvious tools such as those for named entity recognition and then thinking through and dealing with the more complicated problems of pronoun disambiguation. But my deeper interest here goes far beyond simple detection of entities. The holy grail that I see here lies not in detecting the presence or absence of individual characters but in detecting and tracking character archetypes on a grand macroscale. What if we could begin to answer questions such as these:

  • Are there different classes of villains in the 19th century novel?
  • Do we see a rise in the number of minor characters over the 20th century?
  • What are the qualities that define heroines?
  • How, if at all, do those qualities change/evolve over time? (think Jane Austen’s Emma vs. Stieg Larsson’s Lisbeth).
  • Etc.

We may get nowhere; we may fail miserably. (Of course if I did not already have a couple of pretty good ideas for how to get at these questions I would not be bothering. . . but that, for now, is the secret sauce 😉 )

At the more practical, “skills” level, I’m requiring students to learn and submit all their work using LaTeX! (This may prove to be controversial or crazy–I only learned LaTeX six months ago.) For that they will also be learning how to use the knitr package for R in order to embed R code directly into the LaTeX, and all of this work will take place inside the (awesome) R IDE, RStudio. Hold on to your hats; it’s going to be a wild ride!

A Festivus Miracle: Some R Bingo code

A few weeks ago my daughter’s class was gearing up to celebrate the Thanksgiving holiday, and I was asked to help prepare some “holiday bingo cards” for the kids’ party. Naturally, I wrote a program in R for the job! (I know, I know, Maslow’s hammer.)

Since I learned a few R tricks for making a grid and placing text, and since today is the first day of the Hour of Code, I’ve decided to release the code;-)

The 13 lines (excluding comments) of code below will produce 25 random Festivus Bingo Cards and write them out to a single pdf file for easy printing. You supply the Festivus Nog.
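For readers who just want the gist of the card-making logic, here is a minimal sketch in Python (not the R code described above, and printing plain text rather than writing a pdf). The word list is a placeholder, and the FREE centre square is my assumption about the card layout, not something stated in the post.

```python
import random

def make_card(words, size=5, rng=random):
    """Return a size x size grid of distinct words with a FREE centre square.

    The FREE square is an assumption for illustration.
    """
    picks = rng.sample(words, size * size - 1)   # sample without replacement
    mid = (size * size) // 2
    cells = picks[:mid] + ["FREE"] + picks[mid:]
    return [cells[r * size:(r + 1) * size] for r in range(size)]

def make_cards(words, n_cards=25, seed=None):
    """Build n_cards independent random cards from one word list."""
    rng = random.Random(seed)
    return [make_card(words, rng=rng) for _ in range(n_cards)]

def render(card):
    """Format one card as aligned plain text."""
    width = max(len(w) for row in card for w in row)
    return "\n".join(" | ".join(w.ljust(width) for w in row) for row in card)

# Hypothetical placeholder vocabulary; swap in your own holiday words.
WORDS = [f"word{i:02d}" for i in range(1, 41)]
cards = make_cards(WORDS, n_cards=25, seed=1223)
print(render(cards[0]))
```

Sampling without replacement is what keeps a word from appearing twice on the same card; the same word can still turn up on different cards.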

Text Analysis with R for Students of Literature

[Update (9/3/13 8:15 CST): Contributors list now active at the main Text Analysis with R for Students of Literature Resource Page]

Below this post you will find a link where you can download a draft of Text Analysis with R for Students of Literature. The book is under review with Springer as part of a new series titled “Quantitative Methods in the Humanities and Social Sciences.”

Springer agreed to let me post the draft manuscript here (thank you, Springer), and my hope is that you will download the manuscript, take it for a test drive, and then send me your thoughts. I’m especially interested to hear about areas of confusion, places where you get lost, or where you feel your students might get lost. I’m also interested in hearing about the errors (hopefully not too many), and, naturally, I’ll be delighted to hear about anything you like.

I’m open to suggestions for new sections, but before you suggest that I include another chapter on “your favorite topic,” please read the Preface where I lay out the scope of the book. It’s a beginner’s book, and helping “literary folk” get started with R is my primary goal. This is not the place to get into debates or details about hyperparameter optimization or the relative merits of p-values.*

Please also read the Acknowledgements. It is there that I hint at the spirit and intent behind the book and behind this call for feedback. I did not learn R without help, and there is still a lot about R that I have to learn. I want to acknowledge both of these facts directly and specifically. Those who offer feedback will be added to a list of contributors to be included in the print and online editions of the final text. Feedback of a substantial nature will be acknowledged directly and specifically.

The book is now in production and the draft has been removed. That’s it. Download Text Analysis with R for Students of Literature (1.3MB .pdf)

* Besides, that ground has been well-covered by Scott Weingart

Obi Wan McCarty

[Below is the text of my introduction of Willard McCarty, winner of the 2013 Busa Award.]

As the chair of the awards committee that selected Prof. McCarty for this award it is my pleasure to offer a few words of introduction.

I’m going to go out on a limb this afternoon and assume that you already know that Willard McCarty is Professor of Humanities Computing and Director of the Doctoral Program in the Department of Digital Humanities at King’s College London, that he is Professor in the Digital Humanities Research Group, University of Western Sydney, and that he is a Fellow of the Royal Anthropological Institute (London). I’ll assume that you already know that he is Editor of the British journal Interdisciplinary Science Reviews and that he’s founding Editor of the online seminar Humanist. And I am sure you know that Willard is recipient of the Canadian Award for Outstanding Achievement in Computing in the Arts and Humanities, and of the prestigious Richard W. Lyman Award of the National Humanities Center. You have probably already read his 2005 book titled Humanities Computing, and you know of his many, many other writings and musings.

So I’m not going to talk about any of that stuff.

And since I’m sure that everyone here knows that the Roberto Busa Award was established in 1998, I’m not going to explain how the Busa award was set up to recognize outstanding lifetime achievement in the application of information and communications technologies to humanities research.

No, I’m not going to say anything about that either.

Instead, I wish to say a few words about this fellow here.

[Image: screenshot]

This is Obi-Wan McCarty. Long before I met him in person, he had become a virtual friend, model, and mentor.

I began computing in the humanities in 1993, and like so many of us in those early days I was a young maverick with little or no idea what had been done before. Those were the days before the rebellion, when the dark forces of the Empire were still quite strong. It was a time when an English major with a laptop was considered a dangerous rebel. At times I was scared, and I felt alone in a dark side of a galaxy far, far, away.

And then somewhere between 1993 and 2001 I began to sense a force in the galaxy.

One day, in early 2001, I was walking with my friend Glen Worthey, and I mentioned how I had recently discovered the Humanist list and how there had been this message posted by Willard McCarty with the cryptic subject line “14.”

“Ah yes,” Glen said, “Obi-Wan McCarty. The force is strong with him.”

Message 14 from Obi-Wan was a birthday message. Humanist was 14 that day and Willard began his message with a reflection on “repetition” and how frequently newcomers to the list would ask questions that had already been asked. Rather than chastise those newbies, and tell them to go STFA (search the freakin’ archive), Willard encouraged them. He wrote in that message of how “repetition is a means of maintaining group memory.” I was encouraged by those words and by Willard’s ongoing and relentless commitment not simply to deep, thoughtful, and challenging scholarship, but to nurturing, teaching, welcoming, and mentoring each new generation.

So Willard, thank you for your personal mentorship, thank you for continuing to demonstrate that scholarly excellence and generosity are kindred spirits. Congratulations on this award. May the force be with you.

25 days until the 2013 DH Fun Run

Below is the route/elevation for the July 18, 2013 Unofficial (as in run at your own risk this has nothing to do with the conference) DH 2013 Fun Run. The route begins and ends on the north side of the UNL Student Union (fountain area). From campus we will go a few blocks east to the Billy Wolff Trail. This is a paved bike/walk trail that runs southeast along Antelope Creek. We’ll run about 2.25 miles on Billy Wolff and then loop around the Lincoln Children’s Zoo briefly traveling the Rock Island trail before getting back on to Billy Wolff.

Please arrive at the start at 6:00AM; we will depart at exactly 6:15AM. The run is 4.75 miles (that’s about 7.65 kilometers). I’ll plan on setting a roughly 9-minute-mile pace, making the run about 45 minutes. We should be back on campus at 7AM. Conference sessions begin at 8:30.

Morning temperatures in July are likely to be in the mid 70s (that’s ~24 degrees centigrade) and it will be humid. A water bottle is recommended (and you’ll find one in your conference swag bag!).

If you are planning to run with us, it would be useful, though not required, if you would send me an email.


“Secret” Recipe for Topic Modeling Themes

The recently (yesterday) published issue of JDH is all about topic modeling. It’s a great issue, and it got me thinking about some of the lessons I have learned over seven or eight years of modeling literary corpora. One of the important things I have learned is that the quality of the final model (which is to say the coherence and usefulness of the topics) is largely dependent upon preprocessing. I know, I know: “that’s not much fun.”

Fun or no, it is the reality, and there’s no getting around it. One of the first things you discover when you begin modeling literary materials is that books have a lot of characters. And here I don’t mean the letters “A, B, C,” but actual literary characters as in “Ahab, Beowulf, and Copperfield.” These characters can cause no end of headaches in topic modeling. Let me explain. . .

As I write this blog post, I am running a smallish topic modeling job over a corpus of 50 novels that I have selected for use in a topic modeling workshop I am teaching next week in Milwaukee. Without any preprocessing I get topics that look like these two:

A topic of words from Moby Dick

A topic of words from Dracula

There is nothing wrong with these topics except that one is obviously a “Moby Dick” topic and the other a “Dracula” topic. A big part of the reason these topics formed in this way is because of the power of the character names (yes, “whale” is a character). The presence of the character names tends to bias the model and make it collect collocates that cluster around character names. Instead of getting a topic having to do with “seafaring” (a theme, by the way, that appears in both Moby Dick and Dracula) we get these broad novel-specific topics instead.

That is not what we want.

To deal with this character “problem,” I begin by expanding the usual topic modeling “stop list” from the 100 or so high frequency, closed class words (such as “the, of, a, and. . .”) to include about 5,600 common names, or “named entities.” I posted this “expanded stoplist” to my blog some months ago as ancillary material for my book; feel free to copy it for your own work. I built my expanded stop list through a combination of named entity recognition and the scraping of baby name web sites:-)
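As a rough illustration of what the expanded stop list does, here is a minimal Python sketch (not the code I actually use). The two tiny word sets are stand-ins I made up for the example; the real list has thousands of entries.

```python
import re

# Illustrative stand-ins only: a handful of closed-class words plus a few
# character names. The real expanded stop list is far longer.
FUNCTION_WORDS = {"the", "of", "a", "and", "to", "in"}
CHARACTER_NAMES = {"ahab", "starbuck", "dracula", "harker"}
STOPLIST = FUNCTION_WORDS | CHARACTER_NAMES

def remove_stopwords(text, stoplist=STOPLIST):
    """Lowercase, tokenize on runs of letters/apostrophes, drop stopped words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in stoplist]

print(remove_stopwords("Ahab and Starbuck paced the deck of the ship."))
# -> ['paced', 'deck', 'ship']
```

Lowercasing before the lookup matters here: names are capitalized in the text but the stop list is stored in lowercase.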

Using the exact same model parameters that produced the two topics above, but now with the expanded stop list, I get topics that are much less about individual novels and much more about themes that cross novels. Here are two examples.

A topic of seafaring words

A topic of words relating to Native Americans, but mostly from Last of the Mohicans?

The first topic cloud seems pretty good. In the previous run of the model, without the expanded stop list, there was no such topic. The second one, however, is still problematic, largely because my expanded stop list, even at 5,631 words, is still imperfect. “Heyward” is a character from Last of the Mohicans whose name is not in my stop list.

But in addition to this imperfection, I would argue that there are other problems as well, at least if our objective is to harvest topics of a thematic nature. Notice, for example, the word “continued” just to the left of “heyward” and then notice “demanded” near the bottom of the cloud. These words do not contribute very much at all to the thematic sense of the topic, so ideally they too should be stopped out.

As a next step in preprocessing, therefore, I employ Part-of-Speech tagging or “POS-Tagging” in order to identify and ultimately “stop out” all of the words that are not nouns! Since I can already hear my friend Ted Underwood screaming about “discourses,” let me justify this suggestion with a small but important caveat: I think this is a good way to capture thematic information; it certainly does not capture such things as affect (i.e. attitudes towards the theme) or other nuances that may be very important to literary analysis and interpretation.

POS tagging is well documented, so I’m not going to foreground it here other than to say that it’s an imperfect method. It does make mistakes, but the best taggers (such as the Stanford Tagger that I usually use) have very high (97%+) accuracy (see, for example, Manning 2011).

After running a POS tagger, I have a simple little script that uses a simple little regular expression to change the following tagged sentences:

The/DT family/NN of/IN Dashwood/NNP had/VBD been/VBN long/RB settled/VBN in/IN Sussex./NNP Their/PRP$ estate/NN was/VBD large,/RB and/CC their/PRP$ residence/NN was/VBD at/IN Norland/NNP Park,/NNP in/IN the/DT centre/NN of/IN their/PRP$ property,/NN where,/, for/IN many/JJ generations,/NNS they/PRP had/VBD lived/VBN in/IN so/RB respectable/JJ a/DT manner,/JJ as/IN to/TO engage/VB the/DT general/JJ good/JJ opinion/NN of/IN their/PRP$ surrounding/VBG acquaintance./NN

into this:

family estate residence centre property generations opinion acquaintance
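The noun-filtering step can be sketched in a few lines of Python. This is an illustration, not my actual script: it splits each `word/TAG` token on the last slash rather than using a regular expression, and it assumes Penn Treebank tags, keeping only NN and NNS (common nouns) so that proper nouns (NNP) are dropped too.

```python
import string

def keep_common_nouns(tagged_text, keep=("NN", "NNS")):
    """Keep words whose tag is in `keep`; strip attached punctuation."""
    nouns = []
    for token in tagged_text.split():
        # rpartition splits on the LAST slash, so the tag is always the tail.
        word, _, tag = token.rpartition("/")
        if tag in keep:
            nouns.append(word.strip(string.punctuation).lower())
    return " ".join(nouns)

TAGGED = (
    "The/DT family/NN of/IN Dashwood/NNP had/VBD been/VBN long/RB "
    "settled/VBN in/IN Sussex./NNP Their/PRP$ estate/NN was/VBD large,/RB "
    "and/CC their/PRP$ residence/NN was/VBD at/IN Norland/NNP Park,/NNP "
    "in/IN the/DT centre/NN of/IN their/PRP$ property,/NN"
)
print(keep_common_nouns(TAGGED))
# -> family estate residence centre property
```

Stripping punctuation from the word is needed because this tagger output leaves commas and periods attached (“property,/NN”).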

Just with this transformation to nouns alone, you can begin to see how a theme of “property” or “family estates” might eventually evolve from these words during the topic modeling process. But there is still one more preprocessing step before we can run the LDA. The next step (which can really be the first step) is text chunking or segmentation.

Topic models like to have lots of texts, or more precisely, lots of bags of words. Topic models such as LDA do not take word order into account; they assume that each text or document is a bag of words. Novels are very big bags, and if we don’t chunk them up into smaller pieces we end up getting topics of a very general nature. By chunking each novel into smaller pieces, we allow the model to discover themes that occur only in specific places within novels and not just across entire novels. Consider the theme of death, for example. While there may be entire novels about death, more than likely death is going to pop up once or twice in every novel. In order for the topic model to detect or find a death topic, however, it needs to encounter bags of words that are largely about death. If the whole novel is a single bag of words, then death might not be prominent enough to rise to the level of “topicdom.”

I have found through lots and lots of experimentation that 500-1000 word chunks are pretty darn good when modeling novels. It might help to think in terms of pages: 500-1000 words is roughly 2-4 pages. The argument for this length goes something like this: a good death scene takes several pages to develop. . . etc.

Exactly what chunk length I choose is a matter of some witchcraft and alchemy; it is similar to the witchcraft and tarot involved in choosing the number of topics. I’ll not unpack either of those here, but you can read more in chapter 8 of my book (plug). Here the point is to simply say that some chunking needs to happen if you are working with big documents.
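Here is one way the chunking might look, sketched in Python. The 1000-word target and the rule for folding a short final chunk into its neighbor are choices I made for this example, not fixed parts of the recipe.

```python
def chunk_words(words, size=1000, min_last=500):
    """Split a list of words into consecutive chunks of `size` words.

    If the final chunk is shorter than `min_last`, it is merged into the
    previous chunk (an illustrative choice, so no chunk is too small).
    """
    chunks = [words[i:i + size] for i in range(0, len(words), size)]
    if len(chunks) > 1 and len(chunks[-1]) < min_last:
        tail = chunks.pop()
        chunks[-1].extend(tail)
    return chunks

novel = ("word " * 2300).split()          # stand-in for a tokenized novel
chunks = chunk_words(novel)
print([len(c) for c in chunks])
# -> [1000, 1300]
```

Merging the short tail means the last chunk can run longer than the target, which seems a smaller evil than feeding the model a 300-word bag.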

So here, finally, is my “secret” recipe in pseudo code:

Of course, there is a lot more to it than this: you need to keep track of which chunks go with which novels and so on. But this is the general recipe.* Here are two topics derived from the same corpus of novels, now without character names and without non-nouns.

Art and Music

Crime and Justice

* The word “Secret” in my title is in quotes because there is nothing secret about the ingredients in this particular recipe. The idea of combining POS tagging, text chunking, and LDA is well established in various papers, including, for example, “TagLDA: Bringing document structure knowledge into topic models” (2006) and Reading Tea Leaves: How Humans Interpret Topic Models (2009).

“A Matter of Scale”

Back in November, Julia Flanders and I were invited to stage a debate on the matter of “scale” in digital humanities research for the “Boston Area Days of DH” conference keynote: Julia was to represent the micro scale and I the macro.

Julia and I met up during the MLA conference in January and began sketching out how the talk might go. The first thing we discovered, of course, is that we did not in fact have a real difference of opinion on this matter of scale. Big data, small data, close reading and distant . . . these things matter much less than what a scholar actually decides to do and say. In other words, we were both ultimately interested in new knowledge and not too much concerned with the level of scale necessary to derive that new knowledge.

In other words, it’s a false and probably irrelevant debate. And while we agreed on this point in general terms, we discovered in the course of composing and editing the script for our mock debate that there were legitimate nuances that deserved to be put into the light of day. The script from our “debate” and all of the slides are now available via UNL’s open access repository as “A Matter of Scale.”

Julia has posted a few comments on the experience of co-authoring this presentation with me on her blog. Check it out at http://juliaflanders.wordpress.com/2013/03/28/a-matter-of-scale/.

Pronouns in 19th Century Fiction

Some folks I follow on Twitter (@scott_bot, @benmschmidt, @rayncordell, @foxyfolklorist, and others) were engaged in a conversation this week about the frequency of gendered pronouns in a corpus of 233 fairy tales from @foxyfolklorist’s dissertation. For a bit of literary contextualization, I tweeted a bar graph showing the frequency of 13 pronouns in a corpus of ~3,500 19th century novels. The bar graph (seen again here) breaks down pronoun usage by author gender (M, F, and U).

Pronoun Use by Gender in 19th C. Fiction

It is natural to wonder, as David Mimno (@dmimno) did this morning, if there is any significance to the gender results: is gender really correlated to these observed means, or are the observed means just an artifact of messy data? One way to explore the extent to which these observed means really are an entailment of gender is to ask what the means would look like if gender were not a factor. In other words, what would happen if all the data about author gender were shuffled and the means then recalculated?*

If we do this shuffling and recalculating a whole bunch of times, say 100 times, we can then plot all the fake “genderless” permutations alongside the actual observed means and thereby see whether the observed means are outside or inside what we would expect if gender were not a factor influencing pronoun use.**
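The shuffle-and-recalculate idea can be sketched in Python as follows. This is an illustration of the general permutation approach, not the code behind the plots, and the frequency values are invented toy numbers, not figures from the corpus.

```python
import random
from statistics import mean

def group_means(values, labels):
    """Mean of `values` within each label (e.g. 'F', 'M')."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    return {label: mean(vals) for label, vals in groups.items()}

def permuted_means(values, labels, n_perm=100, seed=0):
    """Shuffle the labels n_perm times, recomputing group means each time."""
    rng = random.Random(seed)
    shuffled = list(labels)
    results = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)               # breaks any link to gender
        results.append(group_means(values, shuffled))
    return results

# Invented toy data: relative frequency of one pronoun in 20 novels.
values = [0.030 + 0.001 * i for i in range(10)] + \
         [0.010 + 0.001 * i for i in range(10)]
labels = ["F"] * 10 + ["M"] * 10

observed = group_means(values, labels)
permuted = permuted_means(values, labels, n_perm=100)
sims = [p["F"] for p in permuted]
print("observed F mean:", observed["F"])
print("permuted F range:", min(sims), "to", max(sims))
```

If the observed mean sits outside the range of the permuted means, the label is doing real work; if it sits inside, the difference could easily be noise.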

Below are the plots for the 13 pronouns from my original bar graph (above). What you’ll see below is that for certain pronouns, such as “him,” “I,” “me,” “my” and “your”, the observed (“real”) means are within the range of “expected” values if gender were not a consideration. For other pronouns, however, such as “he,” “her,” “she” and “we,” the observed values are outside the values in the randomized “fake” data generated by taking gender out of the equation.

Another fascinating element of these graphs is found in the third “U” column. These are authors of unknown gender. It is hard not to look at these observed values and wonder about the most likely genders of those anonymous writers. . .

* [As it happens, this is precisely the approach that David Mimno suggested we take in some other work (under review) in which we assess the significance of topic use (rather than pronoun use) by male and female authors.]

** [Naturally, it could be that the determining factor here is not really gender at all. It could be that “we” (readers, editors, publishers, etc) have selected for books authored by men that express one set of linguistic qualities and books by women that express another set. In other words, these graphs don’t prove that women and men necessarily use pronouns differently, only that they do so (or don’t depending on the pronoun in question) in this particular corpus of 19th century fiction.]

Unfolding the Novel

I’m excited to announce a new research project dubbed “Unfolding the Novel” (which is a play on both “paper” and “protein” folding). In collaboration with colleagues from the Stanford Literary Lab and Arizona State University and in partnership with researchers of the Book Genome project of BookLamp.com we have begun work that traces stylistic and thematic change across 300 years of fiction, from 1700-2000! Today UNL posted a news release announcing the partnership and some of our goals.

The primary goal of the project is to map major stylistic and thematic trends over 300 years of creative literature. To facilitate this work, BookLamp is providing access to a large store of metadata pertaining to mostly 20th and 21st century works of fiction. This data will be combined with similar data we have already compiled from the 19th century and new data we are curating now from the 18th century. The research team will not access the actual books but will explore at the macroscale in ways that are similar to what one can do with the data provided to researchers at the Google Ngrams project. A major difference, however, is that the data in the “Unfolding” project is highly curated, limited to fiction in English, and enriched with additional metadata including information about both gender and genre distribution.

Our initial data set consists of token frequency information that has been aggregated across one or more global metadata facets including but not limited to publication year, author gender, and book genre. Such data includes, for example, a table containing the year-to-year mean relative frequencies of the most common words in the corpus (e.g., the relative frequencies of the words “the, a, an, of, and” etc.).
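To make “year-to-year mean relative frequencies” concrete, here is a small Python sketch of how such a table might be computed from tokenized texts. It is only an illustration: the project data arrives already aggregated, and the two toy “books” and their year are made up.

```python
from collections import Counter, defaultdict
from statistics import mean

def relative_frequencies(tokens, vocabulary):
    """Share of all tokens in one book accounted for by each word."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: counts[w] / total for w in vocabulary}

def yearly_means(books, vocabulary):
    """Average each word's relative frequency over the books of each year."""
    by_year = defaultdict(list)
    for year, tokens in books:
        by_year[year].append(relative_frequencies(tokens, vocabulary))
    return {
        year: {w: mean(f[w] for f in freqs) for w in vocabulary}
        for year, freqs in sorted(by_year.items())
    }

# Made-up toy corpus: (publication year, list of tokens).
books = [
    (1850, "the cat and the dog".split()),
    (1850, "a dog of the town".split()),
]
table = yearly_means(books, vocabulary=["the", "a", "of", "and"])
print(table[1850])
```

Averaging per-book relative frequencies (rather than pooling all tokens for a year) keeps one very long novel from dominating its year.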

I’ll be reporting on the project here as things progress, but for now, it’s back to the drudgery of the text mines. . . 😉