25 days until the 2013 DH Fun Run

Below is the route/elevation for the July 18, 2013 Unofficial (as in: run at your own risk; this has nothing to do with the conference) DH 2013 Fun Run. The route begins and ends on the north side of the UNL Student Union (fountain area). From campus we will go a few blocks east to the Billy Wolff Trail. This is a paved bike/walk trail that runs southeast along Antelope Creek. We’ll run about 2.25 miles on Billy Wolff and then loop around the Lincoln Children’s Zoo, briefly traveling the Rock Island Trail, before getting back onto Billy Wolff.

Please arrive at the start at 6:00AM; we will depart at exactly 6:15AM. The run is 4.75 miles (that’s 7.65 kilometers). I’ll plan on setting a roughly 9-minute-mile pace, making the run about 45 minutes. We should be back on campus at 7AM. Conference sessions begin at 8:30.

Morning temperatures in July are likely to be in the mid-70s (that’s ~24 degrees centigrade), and it will be humid. A water bottle is recommended (and you’ll find one in your conference swag bag!).

If you are planning to run with us, it would be useful, though not required, if you would send me an email.


“Secret” Recipe for Topic Modeling Themes

The recently (yesterday) published issue of JDH is all about topic modeling. It’s a great issue, and it got me thinking about some of the lessons I have learned over seven or eight years of modeling literary corpora. One of the important things I have learned is that the quality of the final model (which is to say the coherence and usefulness of the topics) is largely dependent upon preprocessing. I know, I know: “that’s not much fun.”

Fun or no, it is the reality, and there’s no getting around it. One of the first things you discover when you begin modeling literary materials is that books have a lot of characters. And here I don’t mean the letters “A, B, C,” but actual literary characters as in “Ahab, Beowulf, and Copperfield.” These characters can cause no end of headaches in topic modeling. Let me explain. . .

As I write this blog post, I am running a smallish topic modeling job over a corpus of 50 novels that I have selected for use in a topic modeling workshop I am teaching next week in Milwaukee. Without any preprocessing I get topics that look like these two:

A topic of words from Moby Dick

A topic of words from Dracula

There is nothing wrong with these topics except that one is obviously a “Moby Dick” topic and the other a “Dracula” topic. A big part of the reason these topics formed in this way is the power of the character names (yes, “whale” is a character). The presence of the character names tends to bias the model and make it collect collocates that cluster around character names. Instead of getting a topic having to do with “seafaring” (a theme, by the way, that appears in both Moby Dick and Dracula), we get these broad, novel-specific topics.

That is not what we want.

To deal with this character “problem,” I begin by expanding the usual topic modeling “stop list” from the 100 or so high-frequency, closed-class words (such as “the, of, a, and. . .”) to include about 5,600 common names, or “named entities.” I posted this “expanded stoplist” to my blog some months ago as ancillary material for my book; feel free to copy it for your own work. I built my expanded stop list through a combination of named entity recognition and the scraping of baby name websites. :-)

Using the exact same model parameters that produced the two topics above, but now with the expanded stop list, I get topics that are much less about individual novels and much more about themes that cross novels. Here are two examples.

A topic of seafaring words

A topic of words relating to Native Americans, but mostly from Last of the Mohicans?

The first topic cloud seems pretty good. In the previous run of the model, without the expanded stop list, there was no such topic. The second one, however, is still problematic, largely because my expanded stopwords list, even at 5,631 words, is still imperfect. “Heyward” is a character from Last of the Mohicans whose name is not in my stop list.

But in addition to this imperfection, I would argue that there are other problems as well, at least if our objective is to harvest topics of a thematic nature. Notice, for example, the word “continued” just to the left of “heyward” and then notice “demanded” near the bottom of the cloud. These words do not contribute very much at all to the thematic sense of the topic, so ideally they too should be stopped out.

As a next step in preprocessing, therefore, I employ Part-of-Speech tagging or “POS-Tagging” in order to identify and ultimately “stop out” all of the words that are not nouns! Since I can already hear my friend Ted Underwood screaming about “discourses,” let me justify this suggestion with a small but important caveat: I think this is a good way to capture thematic information; it certainly does not capture such things as affect (i.e. attitudes towards the theme) or other nuances that may be very important to literary analysis and interpretation.

POS tagging is well documented, so I’m not going to foreground it here other than to say that it’s an imperfect method. It does make mistakes, but the best taggers (such as the Stanford Tagger that I usually use) have very high (97%+) accuracy (see, for example, Manning 2011).

After running a POS tagger, I have a simple little script that uses a simple little regular expression to change the following tagged sentences:

The/DT family/NN of/IN Dashwood/NNP had/VBD been/VBN long/RB settled/VBN in/IN Sussex./NNP Their/PRP$ estate/NN was/VBD large,/RB and/CC their/PRP$ residence/NN was/VBD at/IN Norland/NNP Park,/NNP in/IN the/DT centre/NN of/IN their/PRP$ property,/NN where,/, for/IN many/JJ generations,/NNS they/PRP had/VBD lived/VBN in/IN so/RB respectable/JJ a/DT manner,/JJ as/IN to/TO engage/VB the/DT general/JJ good/JJ opinion/NN of/IN their/PRP$ surrounding/VBG acquaintance./NN

into

family estate residence centre property generations opinion acquaintance
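
For the curious, here is a minimal sketch of that tag-stripping step in R (my actual script differs in its details); the idea is simply to keep the tokens tagged NN or NNS and throw everything else away:

# Illustrative sketch only; variable names and the paste() reconstruction
# of the tagged sentence are for demonstration, not the original script.
tagged <- paste(
  "The/DT family/NN of/IN Dashwood/NNP had/VBD been/VBN long/RB settled/VBN",
  "in/IN Sussex./NNP Their/PRP$ estate/NN was/VBD large,/RB and/CC their/PRP$",
  "residence/NN was/VBD at/IN Norland/NNP Park,/NNP in/IN the/DT centre/NN",
  "of/IN their/PRP$ property,/NN where,/, for/IN many/JJ generations,/NNS",
  "they/PRP had/VBD lived/VBN in/IN so/RB respectable/JJ a/DT manner,/JJ",
  "as/IN to/TO engage/VB the/DT general/JJ good/JJ opinion/NN of/IN",
  "their/PRP$ surrounding/VBG acquaintance./NN")

tokens <- unlist(strsplit(tagged, "\\s+"))       # one word/TAG pair per element
nouns  <- grep("/NNS?$", tokens, value = TRUE)   # keep common nouns (NN, NNS) only
nouns  <- gsub("/NNS?$", "", nouns)              # strip the tags
nouns  <- gsub("[[:punct:]]+$", "", nouns)       # strip trailing commas and periods
cat(paste(nouns, collapse = " "))
# family estate residence centre property generations opinion acquaintance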

Just with this transformation to nouns alone, you can begin to see how a theme of “property” or “family estates” might eventually evolve from these words during the topic modeling process. But there is still one more preprocessing step before we can run the LDA. The next step (which can really be the first step) is text chunking or segmentation.

Topic models like to have lots of texts, or more precisely, they like to have lots of bags of words. Topic models such as LDA do not take word order into account; they assume that each text or document is a bag of words. Novels are very big bags, and if we don’t chunk them up into smaller pieces, we end up getting topics of a very general nature. By chunking each novel into smaller pieces, we allow the model to discover themes that occur only in specific places within novels and not just across entire novels. Consider the theme of death, for example. While there may be entire novels about death, more than likely death is going to pop up once or twice in every novel. In order for the topic model to detect or find a death topic, however, it needs to encounter bags of words that are largely about death. If the whole novel is a single bag of words, then death might not be prominent enough to rise to the level of “topicdom.”

I have found through lots and lots of experimentation that 500-1000 word chunks are pretty darn good when modeling novels. It might help to think in terms of pages: 500-1000 words is roughly 2-4 pages. The argument for this length goes something like this: a good death scene takes several pages to develop. . . etc.
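
Here is a minimal sketch of that chunking step in R; the 1,000-word chunk size, the function name, and the file name in the usage note are illustrative assumptions, not the exact code I use:

# Split a vector of word tokens into roughly equal-sized bags of words.
chunk_text <- function(words, chunk_size = 1000) {
  breaks <- ceiling(seq_along(words) / chunk_size)  # chunk number for each word
  split(words, breaks)                              # list: one bag of words per chunk
}

# Usage (hypothetical file name):
# words  <- scan("sense_and_sensibility_nouns.txt", what = "character")
# chunks <- chunk_text(words, chunk_size = 1000)
# Each element of "chunks" is then treated as a separate document by the model.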

Exactly what chunk length I choose is a matter of some witchcraft and alchemy; it is similar to the witchcraft and tarot involved in choosing the number of topics. I’ll not unpack either of those here, but you can read more in chapter 8 of my book (plug). Here the point is to simply say that some chunking needs to happen if you are working with big documents.

So here, finally, is my “secret” recipe in pseudo code:
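
(In outline, with the details of the actual scripts omitted:)

for each novel in the corpus:
    remove the ~5,600 character names and other "named entities" (expanded stop list)
    run a POS tagger over the text
    keep only the tokens tagged as nouns
    split the remaining words into 500-1000 word chunks
    treat each chunk as its own "document" (bag of words)
run LDA over the full set of chunks
inspect and label the resulting topics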

Of course, there is a lot more to it than this: you need to keep track of which chunks go with which novels and so on. But this is the general recipe.* Here are two topics derived from the same corpus of novels, now without character names and without non-nouns.

Art and Music

Crime and Justice

* The word “Secret” in my title is in quotes because there is nothing secret about the ingredients in this particular recipe. The idea of combining POS tagging, text chunking, and LDA is well established in various papers, including, for example, “TagLDA: Bringing document structure knowledge into topic models” (2006) and “Reading Tea Leaves: How Humans Interpret Topic Models” (2009).

“A Matter of Scale”

Back in November, Julia Flanders and I were invited to stage a debate on the matter of “scale” in digital humanities research for the “Boston Area Days of DH” conference keynote: Julia was to represent the micro scale and I the macro.

Julia and I met up during the MLA conference in January and began sketching out how the talk might go. The first thing we discovered, of course, is that we did not in fact have a real difference of opinion on this matter of scale. Big data, small data, close reading and distant . . . these things matter much less than what a scholar actually decides to do and say. In other words, we were both ultimately interested in new knowledge and not too much concerned with the level of scale necessary to derive that new knowledge.

In other words, it’s a false and probably irrelevant debate. And while we agreed on this point in general terms, we discovered in the course of composing and editing the script for our mock debate that there were legitimate nuances that deserved to be put into the light of day. The script from our “debate” and all of the slides are now available via UNL’s open access repository as “A Matter of Scale.”

Julia has posted a few comments on the experience of co-authoring this presentation with me on her blog. Check it out at http://juliaflanders.wordpress.com/2013/03/28/a-matter-of-scale/.

Pronouns in 19th Century Fiction

Some folks I follow on Twitter (@scott_bot, @benmschmidt, @rayncordell, @foxyfolklorist, and others) were engaged in a conversation this week about the frequency of gendered pronouns in a corpus of 233 fairy tales from @foxyfolklorist’s dissertation. For a bit of literary contextualization, I tweeted a bar graph showing the frequency of 13 pronouns in a corpus of ~3,500 19th century novels. The bar graph (seen again here) breaks down pronoun usage by author gender (M, F, and U).

Pronoun Use by Gender in 19th C. Fiction

It is natural to wonder, as David Mimno (@dmimno) did this morning, if there is any significance to the gender results: is gender really correlated with these observed means, or are the observed means just an artifact of messy data? One way to explore the extent to which these observed means really are an entailment of gender is to ask what the means would look like if gender were not a factor. In other words, what would happen if all the data about author gender were shuffled and the means then recalculated?*

If we do this shuffling and recalculating a whole bunch of times, say 100 times, we can then plot all the fake “genderless” permutations alongside the actual observed means and thereby see whether the observed means are outside or inside what we would expect if gender were not a factor influencing pronoun use.**
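
For the statistically curious, the procedure is a simple permutation test. Here is a minimal sketch in R, assuming a data frame d with one row per novel, a gender column (F, M, U), and a column (here called she) holding each novel’s relative frequency of the pronoun in question; these names are illustrative, not my actual code:

# Shuffle the gender labels, recompute the group means, repeat 100 times.
observed <- tapply(d$she, d$gender, mean)   # the "real" means by author gender

permuted <- replicate(100, {
  fake_gender <- sample(d$gender)           # shuffle gender out of the equation
  tapply(d$she, fake_gender, mean)          # means under the shuffled labels
})

# "permuted" is a 3 x 100 matrix (one row per gender label, one column per
# shuffle); plotting its rows alongside "observed" shows whether the real
# means fall inside or outside the range we would expect by chance.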

Below are the plots for the 13 pronouns from my original bar graph (above). What you’ll see below is that for certain pronouns, such as “him,” “I,” “me,” “my” and “your”, the observed (“real”) means are within the range of “expected” values if gender were not a consideration. For other pronouns, however, such as “he,” “her,” “she” and “we,” the observed values are outside the values in the randomized “fake” data generated by taking gender out of the equation.

Another fascinating element of these graphs is found in the third “U” column. These are authors of unknown gender. It is hard not to look at these observed values and wonder about the most likely genders of those anonymous writers. . .

[Permutation plots for each of the 13 pronouns: he, she, him, her, I, me, my, you, your, we, it, mrs, mr]

* [As it happens, this is precisely the approach that David Mimno suggested we take in some other work (under review) in which we assess the significance of topic use (rather than pronoun use) by male and female authors.]

** [Naturally, it could be that the determining factor here is not really gender at all. It could be that “we” (readers, editors, publishers, etc) have selected for books authored by men that express one set of linguistic qualities and books by women that express another set. In other words, these graphs don’t prove that women and men necessarily use pronouns differently, only that they do so (or don’t depending on the pronoun in question) in this particular corpus of 19th century fiction.]

Unfolding the Novel

I’m excited to announce a new research project dubbed “Unfolding the Novel” (which is a play on both “paper” and “protein” folding). In collaboration with colleagues from the Stanford Literary Lab and Arizona State University, and in partnership with researchers from the Book Genome project at BookLamp.com, we have begun work that traces stylistic and thematic change across 300 years of fiction, from 1700 to 2000! Today UNL posted a news release announcing the partnership and some of our goals.

The primary goal of the project is to map major stylistic and thematic trends over 300 years of creative literature. To facilitate this work, BookLamp is providing access to a large store of metadata pertaining mostly to 20th and 21st century works of fiction. This data will be combined with similar data we have already compiled from the 19th century and new data we are curating now from the 18th century. The research team will not access the actual books but will explore the data at the macro scale, in ways similar to what researchers can do with the data provided by the Google Ngrams project. A major difference, however, is that the data in the “Unfolding” project is highly curated, limited to fiction in English, and enriched with additional metadata, including information about both gender and genre distribution.

Our initial data set consists of token frequency information that has been aggregated across one or more global metadata facets, including but not limited to publication year, author gender, and book genre. Such data includes, for example, a table containing the year-to-year mean relative frequencies of the most common words in the corpus (e.g., the relative frequencies of the words “the, a, an, of, and,” etc.).
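
As a purely illustrative example of the kind of table I mean, here is a sketch in R, assuming a data frame freqs with one row per book, a pub_year column, and a column of relative frequencies for each word (names and structure are assumptions, not the project’s actual data format):

# Mean relative frequency of a few common words, by publication year.
year_means <- aggregate(cbind(the, a, an, of, and) ~ pub_year,
                        data = freqs, FUN = mean)
head(year_means)   # one row per year, one column per word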

I’ll be reporting on the project here as things progress, but for now, it’s back to the drudgery of the text mines. . . ;-)

Thoughts on a Literary Lab

[For the “Theories and Practices of the Literary Lab” roundtable at MLA yesterday, panelists were asked to speak for 5 minutes about their vision of a literary lab. Here are my remarks from that session–#147]

I take the descriptor “literary lab” literally, and to help explain my vision of a literary lab I want to describe how the Stanford Literary Lab that I founded with Franco Moretti came into being.

The Stanford Lab was born out of a class that I taught in the fall of 2009. In that course I assigned 1200 novels and challenged students to explore ways of reading, interpreting, and understanding literature at the macro-scale, as an aggregate system. Writing about the course and the lab that evolved from the course, Chronicle of Higher Ed reporter Marc Parry described it as being based on “a controversial vision for changing a field still steeped in individual readers’ careful analyses of texts.” That may be how it looks from the outside, but there was no radical agenda then, and there is no radical agenda today.

In the class, I asked the students to form into two research teams and to construct research projects around this corpus of 1200 novels. One group chose to investigate whether novel serialization in the 19th century had a detectable/measurable effect upon novelistic style. The other group pursued a project dealing with lexical change over the century, and they wrote a program called “the correlator” that was used to observe and measure semantic change.

After the class ended, two students, one from each group, asked to continue their work as independent study; I agreed. Over the Christmas holiday, word spread to the other students from the seminar, and by the New Year 13 of the original 14 in the seminar wanted to keep working. Instead of 13 independent studies, we formed an ad-hoc seminar group, and I found an empty office on the 4th floor where we began meeting, sometimes for several hours a day. We began calling this ugly, windowless room the lab.

Several of the students in my fall class were also in a class with Franco Moretti, and the crossover in terms of subject matter and methodology was fairly obvious. As the research deepened and became more nuanced, Franco began joining us for lab sessions, and over the next few months other faculty and grad students were sucked into this evolving vortex. It was a very exciting time.

At some point, Franco and I (and perhaps a few of the students) began having conversations about formalizing this notion of a literary lab. I think at the time our motivation had more to do with the need to lobby for space and resources than anything else. As the projects grew and gained more steam, the room got smaller and smaller.

I mention all of this because I do not believe in the “if we build it, they will come” notion of digital humanities labs. While it is true that they may come if we build them, it is also true, and I have seen this first hand, that they may come with absolutely no idea of what to do.

First and foremost a lab needs a real and specific research agenda. “Enabling Digital Humanities projects” is not a research agenda for a lab. Advancing or enabling digital humanities oriented research is an appropriate mission for a Center, such as our Center for Digital Humanities Research at Nebraska, but it is not the function of a lab, at least not in the limited literal sense that I imagine it. For me, a lab is not specifically an idea generator; a lab is a place in which ideas move from birth to maturation.

It would be incredible hyperbole to say that we formally articulated any of this in advance. Our lab was the opposite of premeditated. We did, however, have a loosely expressed set of core principles. We agreed that:

1. Our work would be narrowly focused on literary research of a quantitative nature.
2. All research would be collaborative, even when the outcome ends up having a single author.
3. All research would take the form of “experiments,” and we would be open to the possibilities of failure; indeed, we would see failure as new knowledge.
4. The lab would be open to students and faculty at all levels–and, on a more ad hoc basis, to students and faculty from other institutions.
5. In internal and external presentation and publication, we would favor the narrative genre of “lab reports” and attempt to show not only where we arrived, but how we got there.

I continue to believe that these were and are the right principles for a lab even while they conflict with much about the way Universities are organized.

In our lab we discovered that to focus, to really focus on the work, we had to resist and even reject some of the established standards of pedagogy, of academic hierarchy, and of publishing convention. We discovered that we needed to remove instructional barriers both internal and external in order to find and attract the right people and the right expertise. We did not do any of this in order to make a statement. We were not academic radicals bent on defying the establishment.

Nor should I leave you with the impression that we figured anything out. The lab remains an organic entity unified by what some might characterize as a monomaniacal focus on literary research. If there was any genius to what we did, it was in the decision to never compromise our focus, to do whatever was necessary to keep our focus on the literature.

Some Advice for DH Newbies

In preparation for a panel session at DH Commons today, I was asked to consider the question: “What one step would you recommend a newcomer to DH take in order to join current conversations in the field?” and then speak for 3-4 minutes. Below is the 5-minute version of my answer. . .

With all the folks assembled here today, I figured we’d get some pretty good advice about what constitutes DH and how to get started, so I decided that I ought to say something different from what I’d expect others to say. I have two specific bits of advice, and I suppose that the second bit will be a little more controversial.

But let me foreground that by going back to 2011 when my colleague Glen Worthey and I organized the annual Digital Humanities conference at Stanford around a big tent, summer of love theme. We flung open the flaps on the Big Tent and said come on in . . . We believed, and we continue to believe, that there is a wide range of very good and very interesting work being done in “digital humanities.” We felt that we needed a big tent to enclose all that good work. But let’s face it, inside the big tent it’s a freakin’ three ring circus. Some folks like clowns and others want to see the jugglers. The DH conference is not like a conference on Victorian Literature. And that, of course, is the charm and the curse.

While it probably makes sense for a newcomer to poke around and gain some sense of the “disciplinary” history of the “field,” I think the best advice I can give a newcomer is to spend very little time thinking about what DH is and spend as much time as possible doing DH.

It doesn’t really matter if the world looks at your research and says of it: “Ahhhh, that’s some good Digital Humanities, man.” What matters, of course, is if the world looks at it and says, “Holy cow, I never thought of Jane Austen in those terms” or “Wow, this is really strong evidence that the development of Roman road networks was entirely dependent upon seasonal shifts.” The bottom line is that it is the work you do that is important, not how it gets defined.

So I suppose that is a bit of advice for newcomers, but let me answer the question more concretely and more controversially by speaking as someone who hangs out in one particular ring of the DH Big Tent.

If you understand what I have said thus far, then you know that it is impossible to speak for the Digital Humanities as a group, so, for some, what I am going to say is going to sound controversial. And if I hear that one of you newcomers ran out at the end of this session yelling “Jockers thinks I need to learn a programming language to be a digital humanist,” then I’m going to have to kick your butt right out of the big tent!

Learning a programming language, though, is precisely what I am going to recommend. I’m even going to go a bit further and suggest a specific language called R.

By recommending that you learn R, I am also advocating learning some statistics. R is primarily a language used for statistical computing, which is more or less the flavor of Digital Humanities that I practice. If you want to be able to read and understand the work that we do in this particular ring of the big tent you will need some understanding of statistics; if you want to be able to replicate and expand upon this kind of work, you are going to need to know a programming language, so I recommend learning some R and killing two birds with one stone.

And for those of you who don’t get turned on by p-values, for loops, and latent Dirichlet allocation, I think learning a programming language is still in your best interests. Even if you never write a single line of code, knowing a programming language will allow you to talk to the natives; that is, you will be able to converse with the non-humanities programmers and webmasters and DBAs and systems administrators with whom we so often collaborate as digital humanists. Whether or not you program yourself, you will need to translate your humanistic questions into terms that a non-specialist in the humanities will understand. You may never write poetry in Italian, but if you are going to travel in Rome, you should at least know how to ask for directions to the Colosseum.

DH2012 and the 2013 Busa Award

I could not make it to the DH conference in Hamburg this year (though I did manage to appear virtually). As chair of the Busa Award committee I had the pleasure of announcing that Willard McCarty had won the award. Willard will accept the award in 2013 when DH meets at the University of Nebraska. Here is the text of my announcement which was read today in Hamburg:

I was very pleased to serve as the Chair of the Busa Award committee this cycle, and though I am disappointed that I was unable to travel to Hamburg this year to make this announcement in person, I’m delighted with the end result. I am also delighted that the award will be given at the 2013 conference hosted by the University of Nebraska. Having recently joined the faculty there, I’m quite certain I will be attending next year’s meeting!

The winner of the 2013 Busa Award is a man of legendary kindness and generosity. His contributions to the growth and prominence of Digital Humanities will be familiar to us all. He is a gentleman, a scholar, a philosopher, and a longtime fighter for the cause. He is, by one colleague’s accounting, the “Obi-Wan Kenobi” of Digital Humanities. And I must concur that “the force” is strong with this one. Please join me in congratulating Willard McCarty on his selection for the 2013 Busa Award.

Amicus Brief Filed

In the last chapter of my forthcoming book, I write about the challenges of copyright law and how many a digital humanist is destined to become a 19th-centuryist if the law isn’t reformed to specifically allow for and recognize the importance of “non-expressive” use of digitized content.*

This week the Amicus Brief that I co-authored with Matthew Sag and Jason Schultz was submitted. The brief (see Brief of Digital Humanities and Law Scholars as Amici Curiae in Authors Guild, Inc. et al. v. HathiTrust et al.) includes official endorsement from the Association for Computers and the Humanities as well as the support and signatures of many individual scholars working in the field.

* “Non-expressive use” is Matthew Sag’s far more pleasing formulation of what many have come to call “non-consumptive use.”