Characterization in Literature and the Macroanalysis Lab

I have just posted the syllabus for my spring macroanalysis class focusing on Characterization in Literature. The class is experimental in many senses of the word. We will be experimenting in the class and the class will be an experiment. If all goes according to plan, the only thing about this class that will be different from a research lab is the grade I have to assign at the end—that is the one remaining bit about collaborative learning that still kicks me . . .

To be successful everyone is going to have to be high-performing and self-motivated, me included. For me, at least, the motivation comes from what I think is a really tough nut to crack: algorithmic detection and analysis of character and character types. So far the work in this area has been largely about character networks: how is Hamlet related to Gertrude, etc. That’s good work, but it depends heavily upon the human coding of character metadata before processing. That is precisely why our early experiments at the Stanford Literary Lab focused on drama. . . the character names are already explicit in the speaker markup. Beyond drama, there have been some important steps taken in the direction of auto-detection of character in fiction, such as those by Graham Sack and Elson et al., but I think we still have a lot more stepping to do, a whole lot more.

The work I envision for the course will include leveraging obvious tools such as those for named entity recognition and then thinking through and dealing with the more complicated problems of pronoun disambiguation. But my deeper interest here goes far beyond simple detection of entities. The holy grail that I see here lies not in detecting the presence or absence of individual characters but in detecting and tracking character archetypes on a grand macroscale. What if we could begin to answer questions such as these:

  • Are there different classes of villains in the 19th century novel?
  • Do we see a rise in the number of minor characters over the 20th century?
  • What are the qualities that define heroines?
  • How, if at all, do those qualities change/evolve over time? (think Jane Austen’s Emma vs. Stieg Larsson’s Lisbeth).
  • Etc.

We may get nowhere; we may fail miserably. (Of course if I did not already have a couple of pretty good ideas for how to get at these questions I would not be bothering. . . but that, for now, is the secret sauce ;-) )
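
In the meantime, here is a minimal sketch of the sort of entity detection we will be starting from, using the openNLP package for R. This assumes the openNLPmodels.en model package is installed, and the example sentence is adapted from the opening of Emma; none of this is from the course materials.

    # A minimal NER sketch; assumes openNLPmodels.en is installed.
    library(NLP)
    library(openNLP)
    text <- as.String(paste("Emma Woodhouse, handsome, clever, and rich,",
                            "had lived nearly twenty-one years in the world",
                            "with very little to distress or vex her."))
    annotations <- NLP::annotate(text,
                                 list(Maxent_Sent_Token_Annotator(),
                                      Maxent_Word_Token_Annotator(),
                                      Maxent_Entity_Annotator(kind = "person")))
    # keep only the person-entity spans and extract the matching substrings
    people <- annotations[sapply(annotations$features,
                                 function(f) identical(f$kind, "person"))]
    text[people]  # should return something like "Emma Woodhouse"

Pronoun disambiguation, of course, is the part that no off-the-shelf annotator will solve for us.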

At the more practical, “skills” level, I’m requiring students to learn and submit all their work using LaTeX! (This may prove to be controversial or crazy–I only learned LaTeX six months ago.) For that they will also be learning how to use the knitr package for R in order to embed R code directly into the LaTeX, and all of this work will take place inside the (awesome) R IDE, RStudio. Hold on to your hats; it’s going to be a wild ride!

A Festivus Miracle: Some R Bingo code

A few weeks ago my daughter’s class was gearing up to celebrate the Thanksgiving holiday, and I was asked to help prepare some “holiday bingo cards” for the kids’ party. Naturally, I wrote a program in R for the job! (I know, I know: Maslow’s hammer.)

Since I learned a few R tricks for making a grid and placing text, and since today is the first day of the Hour of Code, I’ve decided to release the code ;-)

The 13 lines (excluding comments) of code below will produce 25 random Festivus Bingo Cards and write them out to a single pdf file for easy printing. You supply the Festivus Nog.
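
Here is a sketch of the kind of script I mean; the Festivus term list and the card layout are my own stand-ins rather than the original 13 lines.

    # A sketch of a Festivus bingo generator; the terms and layout here
    # are stand-ins, not the original code.
    terms <- c("Festivus Pole", "Airing of Grievances", "Feats of Strength",
               "Festivus Miracle", "Aluminum", "No Tinsel", "Frank Costanza",
               "Serenity Now", "Human Fund", "Kramer", "George", "Jerry",
               "Elaine", "Newman", "December 23rd", "Dinner", "Wrestling",
               "Donation", "Bagel", "Grievances", "Strength", "Miracle",
               "Pole", "Tradition", "Celebrate")
    pdf("festivus_bingo.pdf", width = 8.5, height = 11)
    for (card in 1:25) {
      plot.new()                                        # one page per card
      abline(h = seq(0, 1, 0.2), v = seq(0, 1, 0.2))    # draw the 5x5 grid
      x <- rep(seq(0.1, 0.9, length.out = 5), times = 5)
      y <- rep(seq(0.9, 0.1, length.out = 5), each = 5)
      text(x, y, sample(terms, 25), cex = 0.5)          # shuffle terms into cells
      title("Festivus Bingo")
    }
    dev.off()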

Text Analysis with R for Students of Literature

[Update (9/3/13 8:15 CST): Contributors list now active at the main Text Analysis with R for Students of Literature Resource Page]

Below this post you will find a link where you can download a draft of Text Analysis with R for Students of Literature. The book is under review with Springer as part of a new series titled “Quantitative Methods in the Humanities and Social Sciences.”

Springer agreed to let me post the draft manuscript here (thank you, Springer), and my hope is that you will download the manuscript, take it for a test drive, and then send me your thoughts. I’m especially interested to hear about areas of confusion, places where you get lost, or where you feel your students might get lost. I’m also interested in hearing about the errors (hopefully not too many), and, naturally, I’ll be delighted to hear about anything you like.

I’m open to suggestions for new sections, but before you suggest that I include another chapter on “your favorite topic,” please read the Preface where I lay out the scope of the book. It’s a beginner’s book, and helping “literary folk” get started with R is my primary goal. This is not the place to get into debates or details about hyperparameter optimization or the relative merits of p-values.*

Please also read the Acknowledgements. It is there that I hint at the spirit and intent behind the book and behind this call for feedback. I did not learn R without help, and there is still a lot about R that I have to learn. I want to acknowledge both of these facts directly and specifically. Those who offer feedback will be added to a list of contributors to be included in the print and online editions of the final text. Feedback of a substantial nature will be acknowledged directly and specifically.

[Update: The book is now in production and the draft has been removed.] That’s it. Download Text Analysis with R for Students of Literature (1.3MB .pdf)

* Besides, that ground has been well-covered by Scott Weingart

Obi Wan McCarty

[Below is the text of my introduction of Willard McCarty, winner of the 2013 Busa Award.]

As the chair of the awards committee that selected Prof. McCarty for this award it is my pleasure to offer a few words of introduction.

I’m going to go out on a limb this afternoon and assume that you already know that Willard McCarty is Professor of Humanities Computing and Director of the Doctoral Program in the Department of Digital Humanities at King’s College London, that he is Professor in the Digital Humanities Research Group, University of Western Sydney, and that he is a Fellow of the Royal Anthropological Institute (London). I’ll assume that you already know that he is Editor of the British journal Interdisciplinary Science Reviews and that he’s founding Editor of the online seminar Humanist. And I am sure you know that Willard is recipient of the Canadian Award for Outstanding Achievement in Computing in the Arts and Humanities, and of the prestigious Richard W. Lyman Award of the National Humanities Center. You have probably already read his 2005 book titled Humanities Computing, and you know of his many, many other writings and musings.

So I’m not going to talk about any of that stuff.

And since I’m sure that everyone here knows that the Roberto Busa Award was established in 1998, I’m not going to explain how the Busa award was set up to recognize outstanding lifetime achievement in the application of information and communications technologies to humanities research.

No I’m not going to say anything about that either.

Instead, I wish to say a few words about this fellow here.

[Photo: Obi-Wan McCarty]

This is Obi-Wan McCarty. Long before I met him in person, he had become a virtual friend, model, and mentor.

I began computing in the humanities in 1993, and like so many of us in those early days I was a young maverick with little or no idea of what had been done before. Those were the days before the rebellion, when the dark forces of the Empire were still quite strong. It was a time when an English major with a laptop was considered a dangerous rebel. At times I was scared, and I felt alone on the dark side of a galaxy far, far away.

And then somewhere between 1993 and 2001 I began to sense a force in the galaxy.

One day, in early 2001, I was walking with my friend Glen Worthey, and I mentioned how I had recently discovered the Humanist list and how there had been this message posted by Willard McCarty with the cryptic subject line “14.”

“Ah yes,” Glen said, “Obi-Wan McCarty. The force is strong with him.”

Message 14 from Obi-Wan was a birthday message. Humanist was 14 that day and Willard began his message with a reflection on “repetition” and how frequently newcomers to the list would ask questions that had already been asked. Rather than chastise those newbies, and tell them to go STFA (search the freakin’ archive), Willard encouraged them. He wrote in that message of how “repetition is a means of maintaining group memory.” I was encouraged by those words and by Willard’s ongoing and relentless commitment not simply to deep, thoughtful, and challenging scholarship, but to nurturing, teaching, welcoming, and mentoring each new generation.

So Willard, thank you for your personal mentorship, thank you for continuing to demonstrate that scholarly excellence and generosity are kindred spirits. Congratulations on this award. May the force be with you.

25 days until the 2013 DH Fun Run

Below is the route/elevation for the July 18, 2013 Unofficial (as in run at your own risk; this has nothing to do with the conference) DH 2013 Fun Run. The route begins and ends on the north side of the UNL Student Union (fountain area). From campus we will go a few blocks east to the Billy Wolff Trail. This is a paved bike/walk trail that runs southeast along Antelope Creek. We’ll run about 2.25 miles on Billy Wolff and then loop around the Lincoln Children’s Zoo, briefly traveling the Rock Island Trail before getting back onto Billy Wolff.

Please arrive at the start at 6:00AM; we will depart exactly at 6:15AM. The run is 4.75 miles (that’s 7.65 kilometers). I’ll plan on setting a ~9-minute-mile pace, making the run about 45 minutes. We should be back on campus at 7AM. Conference sessions begin at 8:30.

Morning temperatures in July are likely to be in the mid-70s (that’s ~24 degrees centigrade) and it will be humid. A water bottle is recommended (and you’ll find one in your conference swag bag!).

If you are planning to run with us, it would be useful, though not required, if you would send me an email.


“Secret” Recipe for Topic Modeling Themes

The recently (yesterday) published issue of JDH is all about topic modeling. It’s a great issue, and it got me thinking about some of the lessons I have learned over seven or eight years of modeling literary corpora. One of the important things I have learned is that the quality of the final model (which is to say the coherence and usefulness of the topics) is largely dependent upon preprocessing. I know, I know: “that’s not much fun.”

Fun or no, it is the reality, and there’s no getting around it. One of the first things you discover when you begin modeling literary materials is that books have a lot of characters. And here I don’t mean the letters “A, B, C,” but actual literary characters as in “Ahab, Beowulf, and Copperfield.” These characters can cause no end of headaches in topic modeling. Let me explain. . .

As I write this blog post, I am running a smallish topic modeling job over a corpus of 50 novels that I have selected for use in a topic modeling workshop I am teaching next week in Milwaukee. Without any preprocessing I get topics that look like these two:

[Topic cloud: a topic of words from Moby Dick]

[Topic cloud: a topic of words from Dracula]

There is nothing wrong with these topics except that one is obviously a “Moby Dick” topic and the other a “Dracula” topic. A big part of the reason these topics formed in this way is the power of the character names (yes, “whale” is a character). The presence of the character names tends to bias the model and make it collect collocates that cluster around character names. Instead of getting a topic having to do with “seafaring” (a theme, by the way, that appears in both Moby Dick and Dracula) we get these broad novel-specific topics.

That is not what we want.

To deal with this character “problem,” I begin by expanding the usual topic modeling “stop list” from the 100 or so high-frequency, closed-class words (such as “the, of, a, and. . .”) to include about 5,600 common names, or “named entities.” I posted this expanded stop list to my blog some months ago as ancillary material for my book; feel free to copy it for your own work. I built my expanded stop list through a combination of named entity recognition and the scraping of baby name web sites :-)
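
For what it is worth, here is a sketch of where such a stop list plugs into a modeling run, using David Mimno’s mallet package for R; the toy chunks and the stop list file name are placeholders.

    # A sketch of an LDA run with the expanded stop list via the mallet
    # package; the toy documents and the file name are placeholders.
    library(mallet)
    chunk_texts <- c("call me ishmael some years ago never mind how long",
                     "the ship sailed and the harpooneers watched the sea")
    chunk_ids   <- paste0("chunk_", seq_along(chunk_texts))
    docs  <- mallet.import(chunk_ids, chunk_texts,
                           stoplist.file = "expanded_stoplist.txt",
                           token.regexp  = "[\\p{L}]+")
    model <- MalletLDA(num.topics = 50)
    model$loadDocuments(docs)
    model$train(400)                                   # Gibbs sampling iterations
    topic_words <- mallet.topic.words(model, smoothed = TRUE, normalized = TRUE)
    mallet.top.words(model, topic_words[1, ])          # top words in topic 1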

Using the exact same model parameters that produced the two topics above, but now with the expanded stop list, I get topics that are much less about individual novels and much more about themes that cross novels. Here are two examples.

[Topic cloud: a topic of seafaring words]

[Topic cloud: a topic of words relating to Native Americans, but mostly from Last of the Mohicans]

The first topic cloud seems pretty good. In the previous run of the model, without the expanded stop list, there was no such topic. The second one, however, is still problematic, largely because my expanded stop list, even at 5,631 words, is still imperfect. “Heyward” is a character from Last of the Mohicans whose name is not in my stop list.

But in addition to this imperfection, I would argue that there are other problems as well, at least if our objective is to harvest topics of a thematic nature. Notice, for example, the word “continued” just to the left of “heyward” and then notice “demanded” near the bottom of the cloud. These words do not contribute very much at all to the thematic sense of the topic, so ideally they too should be stopped out.

As a next step in preprocessing, therefore, I employ Part-of-Speech tagging or “POS-Tagging” in order to identify and ultimately “stop out” all of the words that are not nouns! Since I can already hear my friend Ted Underwood screaming about “discourses,” let me justify this suggestion with a small but important caveat: I think this is a good way to capture thematic information; it certainly does not capture such things as affect (i.e. attitudes towards the theme) or other nuances that may be very important to literary analysis and interpretation.

POS tagging is well documented, so I’m not going to foreground it here other than to say that it’s an imperfect method. It does make mistakes, but the best taggers (such as the Stanford Tagger that I usually use) achieve very high (97%+) accuracy (see, for example, Manning 2011).
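
If you want to stay inside R, openNLP ships a Maxent tagger that can stand in for the Stanford Tagger; here is a minimal sketch run on the opening sentence of Sense and Sensibility.

    # A minimal POS-tagging sketch with openNLP's Maxent tagger, used here
    # as a stand-in for the Stanford Tagger.
    library(NLP)
    library(openNLP)
    s <- as.String("The family of Dashwood had been long settled in Sussex.")
    a <- NLP::annotate(s, list(Maxent_Sent_Token_Annotator(),
                               Maxent_Word_Token_Annotator(),
                               Maxent_POS_Tag_Annotator()))
    w <- a[a$type == "word"]
    paste(s[w], sapply(w$features, `[[`, "POS"), sep = "/")
    # e.g., "The/DT" "family/NN" "of/IN" "Dashwood/NNP" ...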

After running a POS tagger, I have a simple little script that uses a simple little regular expression to change the following tagged sentences:

The/DT family/NN of/IN Dashwood/NNP had/VBD been/VBN long/RB settled/VBN in/IN Sussex./NNP Their/PRP$ estate/NN was/VBD large,/RB and/CC their/PRP$ residence/NN was/VBD at/IN Norland/NNP Park,/NNP in/IN the/DT centre/NN of/IN their/PRP$ property,/NN where,/, for/IN many/JJ generations,/NNS they/PRP had/VBD lived/VBN in/IN so/RB respectable/JJ a/DT manner,/JJ as/IN to/TO engage/VB the/DT general/JJ good/JJ opinion/NN of/IN their/PRP$ surrounding/VBG acquaintance./NN

into

family estate residence centre property generations opinion acquaintance
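
The script itself is not reproduced here, but the step is easy to sketch; the regular expression below is my guess at the “simple little” one, keeping common nouns (NN, NNS) and dropping everything else, proper nouns included.

    # A sketch of the noun-harvesting step; the regex is my reconstruction.
    tagged <- paste(
      "The/DT family/NN of/IN Dashwood/NNP had/VBD been/VBN long/RB",
      "settled/VBN in/IN Sussex./NNP Their/PRP$ estate/NN was/VBD large,/RB",
      "and/CC their/PRP$ residence/NN was/VBD at/IN Norland/NNP Park,/NNP",
      "in/IN the/DT centre/NN of/IN their/PRP$ property,/NN where,/, for/IN",
      "many/JJ generations,/NNS they/PRP had/VBD lived/VBN in/IN so/RB",
      "respectable/JJ a/DT manner,/JJ as/IN to/TO engage/VB the/DT",
      "general/JJ good/JJ opinion/NN of/IN their/PRP$ surrounding/VBG",
      "acquaintance./NN")
    tokens <- unlist(strsplit(tagged, "\\s+"))
    nouns  <- grep("/NNS?$", tokens, value = TRUE)  # keep NN/NNS, drop NNP/NNPS
    words  <- gsub("/NNS?$", "", nouns)             # strip the POS tags
    words  <- gsub("[[:punct:]]+$", "", words)      # strip trailing punctuation
    paste(words, collapse = " ")
    # "family estate residence centre property generations opinion acquaintance"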

Just with this transformation to nouns alone, you can begin to see how a theme of “property” or “family estates” might eventually evolve from these words during the topic modeling process. But there is still one more preprocessing step before we can run the LDA. The next step (which can really be the first step) is text chunking or segmentation.

Topic models like to have lots of texts, or more precisely, they like to have lots of bags of words. Topic models such as LDA do not take word order into account; they assume that each text or document is a bag of words. Novels are very big bags, and if we don’t chunk them up into smaller pieces we end up getting topics of a very general nature. By chunking each novel into smaller pieces, we allow the model to discover themes that occur only in specific places within novels and not just across entire novels. Consider the theme of death, for example. While there may be entire novels about death, more than likely death is going to pop up once or twice in every novel. In order for the topic model to detect or find a death topic, however, it needs to encounter bags of words that are largely about death. If the whole novel is a single bag of words, then death might not be prominent enough to rise to the level of “topicdom.”

I have found through lots and lots of experimentation that 500-1000 word chunks are pretty darn good when modeling novels. It might help to think in terms of pages: 500-1000 words is roughly 2-4 pages. The argument for this length goes something like this: a good death scene takes several pages to develop. . . etc.
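
A chunking step of this kind is only a few lines of R; the helper function and the file name below are mine.

    # A sketch of the 1,000-word chunking step; chunk_text is my own helper.
    chunk_text <- function(words, chunk_size = 1000) {
      split(words, ceiling(seq_along(words) / chunk_size))
    }
    novel  <- tolower(paste(readLines("moby_dick.txt"), collapse = " "))
    words  <- unlist(strsplit(novel, "[^a-z]+"))
    words  <- words[words != ""]                     # drop empty tokens
    chunks <- chunk_text(words, chunk_size = 1000)   # a list of 1,000-word bags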

Exactly what chunk length I choose is a matter of some witchcraft and alchemy; it is similar to the witchcraft and tarot involved in choosing the number of topics. I’ll not unpack either of those here, but you can read more in chapter 8 of my book (plug). Here the point is to simply say that some chunking needs to happen if you are working with big documents.

So here, finally, is my “secret” recipe in pseudo code:
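
(Reconstructed from the steps described above; not the original listing.)

    # 1. expand the stop list: standard function words + ~5,600 character names
    # 2. POS-tag each novel (e.g., with the Stanford Tagger)
    # 3. keep only the common nouns (NN and NNS)
    # 4. remove anything found in the expanded stop list
    # 5. chunk each novel's noun sequence into 500-1000 word segments
    # 6. run LDA over the chunked segments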

Of course, there is a lot more to it than this: you need to keep track of which chunks go with which novels and so on. But this is the general recipe.* Here are two topics derived from the same corpus of novels, now without character names and without non-nouns.

[Topic cloud: Art and Music]

[Topic cloud: Crime and Justice]

* The word “Secret” in my title is in quotes because there is nothing secret about the ingredients in this particular recipe. The idea of combining POS tagging, text chunking, and LDA is well established in various papers, including, for example, “TagLDA: Bringing Document Structure Knowledge into Topic Models” (2006) and “Reading Tea Leaves: How Humans Interpret Topic Models” (2009).

“A Matter of Scale”

Back in November, Julia Flanders and I were invited to stage a debate on the matter of “scale” in digital humanities research for the “Boston Area Days of DH” conference keynote: Julia was to represent the micro scale and I the macro.

Julia and I met up during the MLA conference in January and began sketching out how the talk might go. The first thing we discovered, of course, is that we did not in fact have a real difference of opinion on this matter of scale. Big data, small data, close reading and distant . . . these things matter much less than what a scholar actually decides to do and say. In other words, we were both ultimately interested in new knowledge and not too much concerned with the level of scale necessary to derive that new knowledge.

In other words, it’s a false and probably irrelevant debate. And while we agreed on this point in general terms, we discovered in the course of composing and editing the script for our mock debate that there were legitimate nuances that deserved to be put into the light of day. The script from our “debate” and all of the slides are now available via UNL’s open access repository as “A Matter of Scale.”

Julia has posted a few comments on the experience of co-authoring this presentation with me on her blog. Check it out at http://juliaflanders.wordpress.com/2013/03/28/a-matter-of-scale/.

Pronouns in 19th Century Fiction

Some folks I follow on Twitter (@scott_bot, @benmschmidt, @ryancordell, @foxyfolklorist, and others) were engaged in a conversation this week about the frequency of gendered pronouns in a corpus of 233 fairy tales from @foxyfolklorist’s dissertation. For a bit of literary contextualization, I tweeted a bar graph showing the frequency of 13 pronouns in a corpus of ~3,500 19th century novels. The bar graph (seen again here) breaks down pronoun usage by author gender (M, F, and U).

Pronoun Use by Gender in 19th C. Fiction

It is natural to wonder, as David Mimno (@dmimno) did this morning, if there is any significance to the gender results: is gender really correlated with these observed means, or are the observed means just an artifact of messy data? One way to explore the extent to which these observed means really are an entailment of gender is to ask what the means would look like if gender were not a factor. In other words, what would happen if all the data about author gender were shuffled and the means then recalculated?*

If we do this shuffling and recalculating a whole bunch of times, say 100 times, we can then plot all the fake “genderless” permutations alongside the actual observed means and thereby see whether the observed means are outside or inside what we would expect if gender were not a factor influencing pronoun use.**
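
In R, the shuffling test looks something like the sketch below; the data frame and column names (novels$she, novels$gender) are hypothetical stand-ins for the real corpus data.

    # A sketch of the permutation test; novels$she holds each novel's relative
    # frequency of "she" and novels$gender holds the M/F/U author labels
    # (both hypothetical names).
    observed <- tapply(novels$she, novels$gender, mean)    # the "real" means
    perm <- replicate(100, tapply(novels$she, sample(novels$gender), mean))
    boxplot(t(perm), ylab = "mean relative frequency of 'she'")
    points(seq_len(nrow(perm)), observed[rownames(perm)],  # observed vs. fake
           col = "red", pch = 19)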

Below are the plots for the 13 pronouns from my original bar graph (above). What you’ll see below is that for certain pronouns, such as “him,” “I,” “me,” “my” and “your”, the observed (“real”) means are within the range of “expected” values if gender were not a consideration. For other pronouns, however, such as “he,” “her,” “she” and “we,” the observed values are outside the values in the randomized “fake” data generated by taking gender out of the equation.

Another fascinating element of these graphs is found in the third “U” column. These are authors of unknown gender. It is hard not to look at these observed values and wonder about the most likely genders of those anonymous writers. . .

[Permutation plots for each of the 13 words, in order: he, she, him, her, I, me, my, you, your, we, it, mrs, mr]

* [As it happens, this is precisely the approach that David Mimno suggested we take in some other work (under review) in which we assess the significance of topic use (rather than pronoun use) by male and female authors.]

** [Naturally, it could be that the determining factor here is not really gender at all. It could be that “we” (readers, editors, publishers, etc) have selected for books authored by men that express one set of linguistic qualities and books by women that express another set. In other words, these graphs don’t prove that women and men necessarily use pronouns differently, only that they do so (or don’t depending on the pronoun in question) in this particular corpus of 19th century fiction.]

Unfolding the Novel

I’m excited to announce a new research project dubbed “Unfolding the Novel” (which is a play on both “paper” and “protein” folding). In collaboration with colleagues from the Stanford Literary Lab and Arizona State University, and in partnership with researchers of the Book Genome project of BookLamp.com, we have begun work that traces stylistic and thematic change across 300 years of fiction, from 1700 to 2000! Today UNL posted a news release announcing the partnership and some of our goals.

The primary goal of the project is to map major stylistic and thematic trends over 300 years of creative literature. To facilitate this work, BookLamp is providing access to a large store of metadata pertaining to mostly 20th and 21st century works of fiction. This data will be combined with similar data we have already compiled from the 19th century and new data we are curating now from the 18th century. The research team will not access the actual books but will explore at the macroscale in ways that are similar to what one can do with the data provided to researchers at the Google Ngrams project. A major difference, however, is that the data in the “Unfolding” project is highly curated, limited to fiction in English, and enriched with additional metadata including information about both gender and genre distribution.

Our initial data set consists of token frequency information that has been aggregated across one or more global metadata facets, including but not limited to publication year, author gender, and book genre. Such data includes, for example, a table containing the year-to-year mean relative frequencies of the most common words in the corpus (e.g., the relative frequencies of the words “the, a, an, of, and,” etc.).

I’ll be reporting on the project here as things progress, but for now, it’s back to the drudgery of the text mines. . . ;-)

Thoughts on a Literary Lab

[For the “Theories and Practices of the Literary Lab” roundtable at MLA yesterday, panelists were asked to speak for 5 minutes about their vision of a literary lab. Here are my remarks from that session–#147]

I take the descriptor “literary lab” literally, and to help explain my vision of a literary lab I want to describe how the Stanford Literary Lab that I founded with Franco Moretti came into being.

The Stanford Lab was born out of a class that I taught in the fall of 2009. In that course I assigned 1200 novels and challenged students to explore ways of reading, interpreting, and understanding literature at the macro-scale, as an aggregate system. Writing about the course and the lab that evolved from the course, Chronicle of Higher Ed reporter Marc Parry described it as being based on “a controversial vision for changing a field still steeped in individual readers’ careful analyses of texts.” That may be how it looks from the outside, but there was no radical agenda then and no radical agenda today.

In the class, I asked the students to form into two research teams and to construct research projects around this corpus of 1200 novels. One group chose to investigate whether novel serialization in the 19th century had a detectable/measurable effect upon novelistic style. The other group pursued a project dealing with lexical change over the century, and they wrote a program called “the correlator” that was used to observe and measure semantic change.

After the class ended, two students, one from each group, asked to continue their work as independent studies; I agreed. Over the Christmas holiday, word spread to the other students from the seminar, and by the New Year 13 of the original 14 in the seminar wanted to keep working. Instead of 13 independent studies, we formed an ad-hoc seminar group, and I found an empty office on the 4th floor where we began meeting, sometimes for several hours a day. We began calling this ugly, windowless room the lab.

Several of the students in my fall class were also in a class with Franco Moretti, and the crossover in terms of subject matter and methodology was fairly obvious. As the research deepened and became more nuanced, Franco began joining us for lab sessions, and over the next few months other faculty and grad students were sucked into this evolving vortex. It was a very exciting time.

At some point, Franco and I (and perhaps a few of the students) began having conversations about formalizing this notion of a literary lab. I think at the time our motivation had more to do with the need to lobby for space and resources than anything else. As the projects grew and gained more steam, the room got smaller and smaller.

I mention all of this because I do not believe in the “if we build it they will come” notion of digital humanities labs. While it is true that they may come if we build them, it is also true, and I have seen this firsthand, that they may come with absolutely no idea of what to do.

First and foremost a lab needs a real and specific research agenda. “Enabling Digital Humanities projects” is not a research agenda for a lab. Advancing or enabling digital humanities oriented research is an appropriate mission for a Center, such as our Center for Digital Humanities Research at Nebraska, but it is not the function of a lab, at least not in the limited literal sense that I imagine it. For me, a lab is not specifically an idea generator; a lab is a place in which ideas move from birth to maturation.

It would be incredible hyperbole to say that we formally articulated any of this in advance. Our lab was the opposite of premeditated. We did, however, have a loosely expressed set of core principles. We agreed that:

1. Our work would be narrowly focused on literary research of a quantitative nature.
2. All research would be collaborative, even when the outcome ends up having a single author.
3. All research would take the form of “experiments,” and we would be open to the possibilities of failure; indeed, we would see failure as new knowledge.
4. The lab would be open to students and faculty at all levels–and, on a more ad hoc basis, to students and faculty from other institutions.
5. In internal and external presentation and publication, we would favor the narrative genre of “lab reports” and attempt to show not only where we arrived, but how we got there.

I continue to believe that these were and are the right principles for a lab even while they conflict with much about the way Universities are organized.

In our lab we discovered that to focus, to really focus on the work, we had to resist and even reject some of the established standards of pedagogy, of academic hierarchy, and of publishing convention. We discovered that we needed to remove instructional barriers both internal and external in order to find and attract the right people and the right expertise. We did not do any of this in order to make a statement. We were not academic radicals bent on defying the establishment.

Nor should I leave you with the impression that we figured anything out. The lab remains an organic entity unified by what some might characterize as a monomaniacal focus on literary research. If there was any genius to what we did, it was in the decision to never compromise our focus, to do whatever was necessary to keep our focus on the literature.