The LDA Buffet is Now Open; or, Latent Dirichlet Allocation for English Majors

For my forthcoming book, which includes a chapter on the uses of topic modeling in literary studies, I wrote the following vignette. It is my imperfect attempt at making the mathematical magic of LDA palatable to the average humanist. Imperfect, but hopefully more fun than plate notation. . .

. . . imagine a quaint town, somewhere in New England perhaps. The town is a writer’s retreat, a place they come in the summer months to seek inspiration. Melville is there, Hemingway, Joyce, and Jane Austen just fresh from across the pond. In this mythical town there is a spot popular among the inhabitants; it is a little place called the “LDA Buffet.” Sooner or later all the writers go there to find themes for their novels. . .

One afternoon Herman Melville bumps into Jane Austen at the bocce ball court, and they get to talking.

“You know,” says Austen, “I have not written a thing in weeks.”

“Arrrrgh,” Melville replies, “me neither.”

So hand in hand they stroll down Gibbs Lane to the LDA Buffet. Now, down at the LDA Buffet no one gets fat. The buffet only serves light (leit?) motifs, themes, topics, and tropes (seasonal). Melville hands a plate to Austen, grabs another for himself, and they begin walking down the buffet line. Austen is finicky; she spoons a dainty helping of words out of the bucket marked “dancing.” A slightly larger spoonful of words, she takes from the “gossip” bucket and then a good ladle’s worth of “courtship.”

Melville makes a beeline for the “whaling” trough, and after piling on an Ahab-sized handful of whaling words, he takes a smaller spoonful of “seafaring” and then just a smidgen of “cetological jargon.”

The two companions find a table where they sit and begin putting all the words from their plates into sentences, paragraphs, and chapters.

At one point, Austen interrupts this business: “Oh Herman, you must try a bit of this courtship.”

He takes a couple of words but is not really fond of the topic. Then Austen, to her credit, asks permission before reaching across the table and sticking her fork in Melville’s pile of seafaring words, “just a taste,” she says. This work goes on for a little while; they order a few drinks and after a few hours, voila! Moby Dick and Persuasion are written . . .

[Now, dear reader, our story thus far provides an approximation of the first assumption made in LDA. We assume that documents are constructed out of some finite set of available topics. It is in the next part that things become a little complicated, but fear not, for you shall sample themes both grand and beautiful.]

. . . Filled with a sense of deep satisfaction, the two begin walking back to the lodging house. Along the way, they bump into a blurry-eyed Hemingway, who is just then stumbling out of the Rising Sun Saloon.

Having taken on a bit too much cargo, Hemingway stops on the sidewalk in front of the two literati. Holding out a shaky pointer finger, and then feigning an English accent, Hemingway says: “Stand and Deliver!”

To this, Austen replies, “Oh come now, Mr. Hemingway, must we do this every season?”

More gentlemanly then, Hemingway replies, “My dear Jane, isn’t it pretty to think so. Now if you could please be so kind as to tell me what’s in the offing down at the LDA Buffet.”

Austen turns to Melville and the two writers frown at each other. Hemingway was recently banned from the LDA Buffet. Then Austen turns toward Hemingway and holds up six fingers, the sixth in front of her now pursed lips.

“Six topics!” Hemingway says with surprise, “but what are today’s themes?”

“Now wouldn’t you like to know that, you old sot,” says Melville.

The thousand injuries of Melville, Hemingway had borne as best he could, but when Melville ventured upon insult he vowed revenge. Grabbing their recently completed manuscripts, Hemingway turns and runs toward the South. Just before disappearing down an alleyway, he calls back to the dumbfounded writers: “All my life I’ve looked at words as though I were seeing them for the first time. . . tonight I will do so again! . . . ”

[Hemingway has thus overcome the first challenge of topic modeling. He has a corpus and a set number of topics to extract from it. In reality determining the number of topics to extract from a corpus is a bit trickier. If only we could ask the authors, as Hemingway has done here, things would be so much easier.]

. . . Armed with the manuscripts and the knowledge that there were six topics on the buffet, Hemingway goes to work.

After making backup copies of the manuscripts, he then pours all the words from the originals into a giant Italian-leather attache. He shakes the bag vigorously and then begins dividing its contents into six smaller ceramic bowls, one for each topic. When each of the six bowls is full, Hemingway gets a first glimpse of the topics that the authors might have found at the LDA Buffet. Regrettably, these topics are not very good at all; in fact, they are terrible, a jumble of random unrelated words . . .

[And now for the magic that is Gibbs Sampling.]

. . . Hemingway knows that the two manuscripts were written based on some mixture of topics available at the LDA Buffet. So to improve on this random assignment of words to topic bowls, he goes through the copied manuscripts that he kept as backups. One at a time, he picks a manuscript and pulls out a word. He examines the word in the context of the other words that are distributed throughout each of the six bowls and in the context of the manuscript from which it was taken. The first word he selects is “heaven,” and at this word he pauses, and asks himself two questions:

  1. “How much of ‘Topic A,’ as it is presently represented in bowl A, is present in the current document?”
  2. “Which topic, of all of the topics, has the most ‘heaven’ in it?” . . .

[Here again dear reader, you must take with me a small leap of faith and engage in a bit of further make believe. There are some occult statistics here accessible only to the initiated. Nevertheless, the assumptions of Hemingway and of the topic model are not so far-fetched or hard to understand. A writer goes to his or her imaginary buffet of themes and pulls them out in different proportions. The writer then blends these themes together into a work of art. That we might now be able to discover the original themes by reading the book is not at all amazing. In fact we do it all the time--every time we say that such and such a book is about "whaling" or "courtship." The manner in which the computer (or dear Hemingway) does this is perhaps less elegant and involves a good degree of mathematical magic. Like all magic tricks, however, the explanation for the surprise at the end is actually quite simple: in this case our magician simply repeats the process 10 billion times! NOTE: The real magician behind this LDA story is David Mimno. I sent David a draft, and along with other constructive feedback, he supplied this beautiful line about computational magic.]

. . . As Hemingway examines each word in its turn, he decides based on the calculated probabilities whether that word would be more appropriately moved into one of the other topic bowls. So, if he were examining the word “whale” at a particular moment, he would assume that all of the words in the six bowls except for “whale” were correctly distributed. He’d now consider the words in each of those bowls and in the original manuscripts, and he would choose to move a certain number of occurrences of “whale” to one bowl or another.

Fortunately, Hemingway has by now bumped into James Joyce who arrives bearing a cup of coffee on which a spoon and napkin lay crossed. Joyce, no stranger to bags-of-words, asks with compassion: “Is this going to be a long night.”

“Yes,” Hemingway said, “yes it will, yes.”

Hemingway must now run through this whole process over and over again many times. Ultimately, his topic bowls reach a steady state where words no longer need to be reassigned to other bowls; the words have found their proper context.

After pausing for a well-deserved smoke, Hemingway dumps out the contents of the first bowl and finds that it contains the following words:

“whale sea men ship whales penfon air side life bounty night oil natives shark seas beard sailors hands harpoon mast top feet arms teeth length voyage eye heart leviathan islanders flask soul ships fishery sailor sharks company. . .”

He peers into another bowl that looks more like this:

“marriage happiness daughter union fortune heart wife consent affection wishes life attachment lover family promise choice proposal hopes duty alliance affections feelings engagement conduct sacrifice passion parents bride misery reason fate letter mind resolution rank suit event object time wealth ceremony opposition age refusal result determination proposals. . .”

After consulting the contents of each bowl, Hemingway immediately knows what topics were on the menu at the LDA Buffet. And, not only this, Hemingway knows exactly what Melville and Austen selected from the Buffet and in what quantities. He discovers that Moby Dick is composed of 40% whaling, 18% seafaring and 2% gossip (from that little taste he got from Jane) and so on . . .

[Thus ends the fable.]
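For readers who would like to peek behind the curtain, what follows is a minimal sketch of Hemingway's night of work, written as a toy collapsed Gibbs sampler. It is an illustration under stated assumptions, not the code behind any real topic modeling package: the function name `gibbs_lda` and the smoothing values `alpha` and `beta` are my own choices and appear nowhere in the fable, and each manuscript is assumed to be a non-empty list of words.

```python
import random
from collections import Counter

def gibbs_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler; docs is a list of non-empty word lists.

    alpha and beta are smoothing values assumed for this sketch.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Random start: every word token is dropped into an arbitrary bowl.
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    doc_topic = [[0] * num_topics for _ in docs]         # tokens of doc d in bowl k
    topic_word = [Counter() for _ in range(num_topics)]  # copies of word w in bowl k
    topic_total = [0] * num_topics                       # total tokens in bowl k
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = assignments[d][i]
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Lift this one token out; assume every other token is placed correctly.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                # Question 1 (how much of bowl t is in this manuscript?) multiplied by
                # question 2 (how much of this word is in bowl t?), for every bowl.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                # Draw a new bowl in proportion to those weights.
                k = rng.choices(range(num_topics), weights=weights)[0]

                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Share of each manuscript that ended up in each bowl.
    proportions = [[c / len(doc) for c in counts]
                   for counts, doc in zip(doc_topic, docs)]
    return proportions, topic_word
```

Calling `gibbs_lda(manuscripts, num_topics=6)` on tokenized manuscripts returns the per-document topic shares (the percentages Hemingway reads off at the end) along with the word counts in each bowl. Note that the sketch draws a bowl at random in proportion to the two answers rather than always picking the best one; that randomness is what makes it sampling.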

For the rest of the (LDA) story, see David Mimno’s Topic Modeling Bibliography

Aberrant Adjectives in 19th Century Novels

I created the visualization below using Many Eyes and a data set derived from part-of-speech tagged novels from 19th century Britain. Found here are the 100 most “aberrant adjectives.” Aberrant here is determined by selecting those words that have the greatest amount of usage deviation (measured by relative frequency) over a 13 decade time period. To qualify a word must also appear in every decade.
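The selection rule can be sketched roughly as follows. This is only one plausible reading: the post does not say exactly how deviation was computed, so the use of the population standard deviation of per-decade relative frequencies is an assumption, and `aberrant_words` and `decade_tokens` are hypothetical names. Every decade is assumed to contain at least one token.

```python
from collections import Counter
from statistics import pstdev

def aberrant_words(decade_tokens, top_n=100):
    """decade_tokens maps a decade label to a non-empty list of adjective tokens."""
    # Relative frequency of every word within each decade.
    rel_freq = {}
    for decade, tokens in decade_tokens.items():
        counts = Counter(tokens)
        total = len(tokens)
        rel_freq[decade] = {w: c / total for w, c in counts.items()}

    # To qualify, a word must appear in every decade.
    shared = set.intersection(*(set(freqs) for freqs in rel_freq.values()))

    # Assumed measure of "usage deviation": spread of relative frequency across decades.
    scores = {w: pstdev([freqs[w] for freqs in rel_freq.values()]) for w in shared}

    # Largest spread first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]
```

With thirteen decade keys in `decade_tokens`, the default `top_n=100` would return a list comparable in shape to the one visualized here.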

 

On Distant Reading and Macroanalysis

Earlier this week Kathryn Schulz of the New York Times published a rather provocative, challenging, and in my opinion under-researched and over-sensationalized article about my colleague Franco Moretti’s work theorizing a mode of literary analysis that he has termed “distant-reading.” Others have already pointed out some of the errors Schulz made, and I’m fairly certain Moretti would be happy to clarify any confusion Schulz may have about his work if she were to actually interview him (i.e. before paraphrasing him). My interest here is to offer some specific thoughts and some background on “distant-reading” or what I have preferred to call “macroanalysis.”[1]

The approach to the study of literature that I call macroanalysis, instead of distant-reading (for reasons explained below), is in general ways akin to the social-science of economics or, more specifically, macroeconomics. Before the 20th century there wasn’t a defined field of “Macroeconomics.” There was, however, microeconomics, which studies the economic behavior of individual consumers and individual businesses. As such, microeconomics can be seen as analogous to the study of individual texts via “close-readings” of the material. Macroeconomics, however, is about the study of the entire economy. It tends toward enumeration and quantification and is in this sense similar to literary inquiries that are not highly theorized: bibliographic studies, biographical studies, literary history, philology, and the enumerative analysis that is the foundation of humanities computing.

By way of an analogy, we might think about interpretive close-readings as corresponding to microeconomics while quantitative macroanalysis corresponds to macroeconomics. Consider, then, that in many ways the study of literary genres or literary periods is a type of macro approach to literature. Say, for example, a scholar specializes in early 20th century poetry. Presumably, this scholar could be called upon to provide sound generalizations, or “distant-readings” about 20th century poetry based on a broad reading of individual works within that period. This would be a sort of “macro-, or distant-, reading” of the period. But this parallel falls short of approximating for literature what macroeconomics is to economics, and it is in this context that I prefer the term macroanalysis over distant-reading. The former term places the emphasis on the quantifiable methodology over the more interpretive practice of “reading.” Broad attempts to generalize about a period or about a genre are frequently just another sort of micro-analysis, in which multiple “cases” or “close-readings” of individual texts are digested before generalizations about them are drawn in very qualitative ways. Macroeconomics, on the other hand, is a more number-based discipline, one grounded in quantitative analysis not qualitative assessments. Moreover, macroeconomics employs a number of quantitative benchmarks for assessing, scrutinizing, and even forecasting the macro-economy. While there is an inherent need for understanding the economy at the micro level, in order to contextualize the macro-results, macroeconomics does not directly involve itself in the specific cases, choosing instead to see the cases in the aggregate, looking to those elements of the specific cases that can be generalized, aggregated, and quantified.

Micro-oriented approaches to literature, highly interpretive readings of literature, remain fundamentally important, just as microeconomics offers important perspectives on the economy. It is the exact interplay between the macro and micro scale that promises a new, enhanced, and perhaps even better understanding of the literary record. The two approaches work in tandem and inform each other. Human interpretation of the “data,” whether it be mined at the macro or micro level, remains essential. While the methods of enquiry, of evidence gathering, are different, they are not antithetical, and they share the same ultimate goal of informing our understanding of the literary record, be it writ large or small. The most fundamental and important difference in the two approaches is that the macroanalytic approach reveals details about texts that are for all intents and purposes unavailable to close-readers of the texts. Writing of John Burrows’s study of Jane Austen’s oeuvre, Julia Flanders points out how Burrows’s computational study brings the most common words such as “the” and “of” into our field of view.

Flanders writes: “His [Burrows] effort, in other words, is to prove the stylistic and semantic significance of these words, to restore them to our field of view. Their absence from our field of view, their non-existence as facts for us, is precisely because they are so much there, so ubiquitous that they seem to make no difference.” (Flanders 2005)

At its most basic, the macroanalytic approach I’m advocating is simply another method of gathering information about texts, of accessing the details. The information is different from what is derived via close reading, but it is not of lesser or greater value to scholars for being such.

Flanders goes on: “Burrows’ approach, although it wears its statistics prominently, foreshadows a subtle shift in the way the computer’s role vis-à-vis the detail is imagined. It foregrounds the computer not as a factual substantiator whose observations are different in kind from our own—because more trustworthy and objective—but as a device that extends the range of our perceptions to phenomena too minutely disseminated for our ordinary reading.” (Flanders 2005)

A macroanalytic approach not only helps us to see and understand the larger “literary economy” but, by means of its scope, to better see and understand the degree to which literature and the individual authors who manufacture the literature respond to or react against literary and cultural trends within their realm of experience. If authors are inevitably influenced by their predecessors, then we may even be able to chart and understand “anxieties of influence” in concrete, quantitative ways.

For historical and stylistic questions in particular, the macroanalytic approach has distinct advantages over the more traditional practice of studying literary periods and genres by means of a close study of “representative” texts. Speaking of his own efforts to provide a more encompassing view of literary history, Franco Moretti writes that “a field this large cannot be understood by stitching together separate bits of knowledge about individual cases, because it isn’t a sum of individual cases: it’s a collective system, that should be grasped as a whole . . .” (2005). To generalize about a “period” of literature based on a study of a relatively small number of books is to take a significant leap. It is less problematic, though, to consider how a macroanalytic study of several thousand texts might lead us to a better understanding of the individual texts. Until recently, we have not had the opportunity to even consider this latter option, and it seems reasonable to imagine that we might, through the application of both approaches, reach a new and better informed understanding of our primary materials. This is what Juri Tynjanov imagined in 1927: “Strictly speaking,” writes Tynjanov, “one cannot study literary phenomena outside of their interrelationships.” Fortunately for me and for scholars such as Moretti, the multitude of interrelationships that overwhelmed and eluded Tynjanov and pushed the limits of close-reading can now be explored with the aid of computation, statistics and huge digital libraries.

My book on this subject, Literary Studies, the Digital Library, and the Inevitability of Influence, is now under contract. [Update: will be published in 2013 as Macroanalysis: Digital Methods and Literary History by University of Illinois Press.]

[1] I began using the term macroanalysis in late 2003. At the time, Moretti and I were putting together plans for a co-taught course titled “Electronic Data and Literary Theory.” The course we imagined would be a research seminar in the full sense of the word and in our syllabus (dated 11/3/2003) we wrote: “the main purpose of this seminar is methodological rather than historical: learning how to use electronic search systems to analyze large quantities of data — and hence get a new, better understanding of literary and cultural history.” During the course I began work developing a text analysis toolkit that I later called CATools (for Corpus Analysis Tools). In terms of methodology, I was learning a lot at the time from work in corpus linguistics but also discovering that we (literary folks) have an entirely different set of questions. So it made sense to do at least a bit of wheel reinvention. My first experiments with the macroanalytic methodology were constructed around a corpus of Irish-American novels that I had been building since my dissertation research. I presented the first results of this work in Liverpool, at the 2004 meeting of the American Conference for Irish Studies. My paper, titled “Making and Mining a Digital Archive: the Case of the Irish-American West Project,” was part how-to and part results–I’d made one non-trivial discovery about Irish-American literary history based on this new methodology. In the spring of 2005, I offered a more detailed methodological overview of the toolkit at the inaugural meeting of the Text Analysis Developer’s Alliance. An overview of my project was documented on the TADA blog. Later that summer (2005), I presented a more generalized methodological paper titled “A Macro-Economic Model for Literary Research” at the joint meeting of the ACH and ALLC in Victoria, BC. 
It was there that I first articulated the economic analogy that I have come to find most useful for explaining Moretti’s idea of “distant-reading.” In 2006, while I was in residence as Research Scholar in the Digital Humanities at the Stanford Humanities Center, I spent a good deal of time thinking about macro-scale approaches to literature and then writing corpus analysis code. By the summer of 2007, I had developed a whole new toolkit and presented the first significant findings in a paper titled “Macro-Analysis (2.0)” which I delivered at the 2007 Digital Humanities meeting in Illinois. Coincidentally, this was the same conference at which Moretti presented the opening keynote lecture, a paper exploring a corpus of 19th century novel titles, which would eventually be published in Critical Inquiry. That research utilized software that I had developed in the CATools package.

Kansas Irish Reprint

Rowfont Press of Wichita, Kansas has just published a newly illustrated edition of Charles Driscoll’s memoir Kansas Irish (with my Critical Introduction). The book is available at Amazon. Kansas Irish and the two sequels that follow provide the most complete and authentic rendering of Irish life on the American prairie in the 19th Century.

On Pamphleteering and Pamphlet One

Several months ago, a group of us from the Stanford Literary Lab wrote and sent out for review the article that now appears in Pamphlet 1 of the Lab. The article, titled “Quantitative Formalism: an Experiment” was submitted, peer-reviewed, and approved for publication in a prestigious literary journal. There was, however, a catch. The editors of the journal asked that we trim the number of charts in the article and that we alter the tone and character of the article to make it less of a narrative. In other words, the article was of a style and content that the editors found to be too foreign to their traditions.

Rather than revise the article in ways we felt would misrepresent its function and intent, we turned to that most traditional of literary forms, the pamphlet. Considering the largely quantitative and digital methodology employed in our research, a pamphlet was a seemingly ironic choice. The most obvious venue would seem to have been the web. After all, aren’t the blog and the web site the pamphlet forms of the digital age (see, for example, Pamphleteers and Web Sites)? We certainly considered an electronic format, and we have posted a pdf version of the essay on our web site, but we decided to print.

Why print the pamphlet? As a literary form, the pamphlet has a long tradition of going against the grain; it’s an alternative form that is malleable. In the pamphlet, George Orwell wrote, “one has complete freedom of expression, including, if one chooses, the freedom to be scurrilous, abusive, and seditious; or, on the other hand, to be more detailed, serious and ‘high-brow’ than is ever possible in a newspaper or in most kinds of periodicals.” It has been used in campaigning and marketing, but most famously, the pamphlet has been employed by political, religious, and social “provocateurs” and “radicals.” Many of these objects have been lost, but the best have been preserved, such as those crafted by talented satirists (Swift), sober thinkers (Paine), and social critics (Voltaire). And they are preserved because they are historical objects, “ephemera” as the librarians say. The pamphlet is an ephemeral object, an object “lasting only for a day.” As “an experiment” our “quantitative formalism” pamphlet is a middle point, a hash mark on the line of time, not an end point or destination, not even a beginning.

It’s interesting to consider how the preferred citation style of literary scholarship, the MLA Style, places emphasis on page reference over time, over moment of publication. With some obvious exceptions, the basic logic here is that what someone says about Shakespeare today has the same validity, the same scholarly purchase, as something said fifty years ago. As a discipline, our “results” tend to be qualitative and interpretive, not bound in time or subject to revision based on the introduction of new evidence. They are, of course, subject to reinterpretation, but that is fundamentally different from the changes wrought by new discoveries. The date-based citation style, on the other hand, places emphasis on points in time and acknowledges the fast-paced and ephemeral nature of certain fields of research. This is most obvious in the sciences, in medicine, for example, where new experiments and new discoveries are constantly and quickly changing the field. One need only eavesdrop on the Digital Humanities twittersphere for a few hours to note the similarities; the pace of change is rapid.

So why not a pamphlet? Why not recognize in form, title, and narrative style that ours is an experiment? It is a bit of research that is useful for today but also something we entirely expect to change. Indeed, we are already working on the next iteration(s), the next experiments. Unlike the neatly closed arguments of our traditional work, these experiments of our Literary Lab open as many doors as they close. In fact, in the course of our research on novel genres, it became apparent that we could, must, go on forever. Each test led us to some new idea, some new direction to explore. There are some discoveries to be sure, and some of our results will likely, hopefully, stand the test of time. But my co-authors and I understand, or perhaps simply “believe,” that there is still much, much more work to be done. If it reads more like a lab report than a traditional essay, it’s because it is a lab report and self-consciously so, intentionally so.

Readers wishing to experience the full pleasure of a touchable paper pamphlet may contact me with their name and address. No charge, while supplies last:-)

Unigrams, and bigrams, and trigrams, oh my

I’ve been watching the ngrams flurry online, in twitter, and on various email lists over the last couple of days. Though I think there is great stuff to be learned from Google’s ngram viewer, I’m advising colleagues to exercise restraint and caution. First, we still have a lot to learn about what can and cannot be said, reliably, with this kind of data–especially in terms of “culture.” And second, the eye candy charts can be deceiving, especially if the data is not analyzed in terms of statistical significance.

It’s not my intention here to be a “nay-sayer” or a wet blanket; as I said, there is much to learn from the google data, and I too have had fun playing with the ngram viewer. That said, here are a few things that concern me.

  1. We have no metadata about the texts that are being queried. This is a huge problem. Take the “English Fiction” corpus, for example. What kinds of texts does it contain? Poetry, Drama, Novels, Short Stories, etc.? From what countries do these works originate? Is there an even distribution of author genders? Is the sample biased toward a particular genre? What is the distribution of texts over time? At least this last one we can get from downloading the Google data.
  2. There are lots of “forces” at work on patterns of ngram usage, and without access to the metadata, it will be hard to draw meaningful conclusions about what any of these charts actually mean. To call these charts representations of “culture” is, I think, a dangerous move. Even at this scale, we cannot assume the corpus is representative of culture–it may be, but we just don’t know. More than likely the corpus is something quite other than representative of culture. It probably represents the collection practices of major research libraries. Again, without the metadata to tell us what these texts are and where they are from, we must be awfully careful about drawing conclusions that reach beyond the scope of the corpus. The leap from corpus to culture is a big one.
  3. And then there is the problem of “linguistic drift”, a phenomenon mathematically analogous to genetic drift in evolution. In simple terms, some share of the change observed in ngram frequency over time is probably the result of what can be thought of as random mutations. An excellent article about this process can be found here–>“Words as alleles: connecting language evolution with Bayesian learners to models of genetic drift”.
  4. Data noise and bad OCR. Ted Underwood has done a fantastic job of identifying some problems related to the 18th century long s. It’s a big problem, especially if users aren’t ready to deal with it by substitution of f’s for s’s. But the long s problem is fairly easy to deal with compared to other types of OCR problems–especially cases where the erroneous OCR’ed word spells another word that is correct: e.g. “fame” and “same”. But even these we can live with at some level. I have made the argument over and over again that at a certain scale these errors become less important, but not unimportant. That is, of course, if the errors are only short term aberrations, “blips,” and not long term consistencies. Having spent a good many years looking at bad OCR, I thought it might be interesting to type in a few random character sequences and see what the n-gram viewer would show. The first graph below plots the usage of “asdf” over time. Wow, how do we account for the spike in usage of “asdf” in the 1920s and again in the late 1990s? And what about the seemingly cyclical pattern of rising and falling over time? (HINT: Check the y-axis).

    [Chart: ngram viewer plot of “asdf” usage over time]

    And here’s another chart comparing the usage of “asdf” to “qwer.”

    [Chart: ngram viewer plot comparing usage of “asdf” and “qwer”]

    And there are any number of these random character sequences. At my request, my three year old made up and typed in “asdv”, “mlik”, “puas”, “puase”, “pux”–all of these “ngrams” showed up in the data, and some of them had tantalizing patterns of usage. My daughter’s typing away on my laptop reminded me of Borges’s Library of Babel as well as the old story about how a dozen monkeys typing at random will eventually write all of the great works of literature. It would seem that at least a few of the non-canonical primate masterpieces found their way into Google’s Library of Babel.

  5. And then there is the legitimate data in the data that we don’t really care about–title pages and library book plates, for example. After running a Named Entity Extraction algorithm over 2600 novels from the Internet Archive’s 19th century fiction collection, I was surprised to see the popularity of “Illinois.” It was one of the most common place names. Turns out that is because all these books came from the University of Illinois and all contained this information in the first page of the scans. It was not because 19th century authors were all writing about the Land of Lincoln. Follow this link to get a sense of the role that the partner libraries may be playing in the ngram data: Libraries in the Google Data

    In other words, it is possible that a lot of the words in the data are not words we actually want in the data. Would it be fair, for example, to say that this chart of the word “Library” in fiction is a fair representation of the interest in libraries in our literary culture? Certainly not. Nor is this chart for the word University an accurate representation of the importance of Universities in our literary culture.

So, these are some problems; some are big and some are small.

Still, I’m all for moving ahead and “playing” with the google data. But we must not be seduced by the graphs or by the notion that this data is quantitative and therefore accurate, precise, objective, representative, etc. What Google has given us with the ngram viewer is a very sophisticated toy, and we must be cautious in using the toy as a tool. The graphs are incredibly seductive, but peaks and valleys must be understood both in terms of the corpus from which they are harvested and in terms of statistical significance (and those light-grey percentages listed on the y-axis).
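To make the point about statistical significance concrete, here is a rough sketch of one way to ask whether a “spike” is distinguishable from noise. It assumes one has the raw yearly counts (occurrences of the ngram and total ngrams for that year) rather than only the plotted percentages. The function names are hypothetical, the choice of a Wilson score interval is my own, and treating non-overlapping intervals as “distinguishable” is a conservative rule of thumb rather than a formal test.

```python
from math import sqrt

def wilson_interval(hits, total, z=1.96):
    """Approximate 95% Wilson score interval for a relative frequency."""
    if total == 0:
        return (0.0, 0.0)
    p = hits / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    margin = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (max(0.0, centre - margin), min(1.0, centre + margin))

def peak_is_distinguishable(hits_a, total_a, hits_b, total_b):
    """True only when the two years' intervals do not overlap (rule of thumb)."""
    low_a, high_a = wilson_interval(hits_a, total_a)
    low_b, high_b = wilson_interval(hits_b, total_b)
    return low_a > high_b or low_b > high_a
```

With made-up counts, a jump from 1 to 3 occurrences per million tokens looks like a tripling on a chart, yet `peak_is_distinguishable(3, 1_000_000, 1, 1_000_000)` reports that the two years cannot be told apart, whereas a jump from 1,000 to 3,000 per million can.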

SEASR Grant

This month a group of researchers at Stanford, University of Illinois, University of Maryland, and George Mason were awarded a $790,000 grant from the Mellon Foundation to advance the prior work of the SEASR project. I’ll be serving as the overall Project Director and as one of the researchers in the Stanford component of the grant. In this phase of the SEASR project, we will focus on leveraging the existing SEASR infrastructure in support of four “use cases.” But “use case” hardly describes the research intensive nature of the proposed work, nor does it capture the strongly humanistic bias of the work proposed. Each partner has committed to a specific research project and each has the expressed goal of advancing humanities research and publishing their results. I’d like to emphasize this point about advancing humanities research.

This grant represents an important step beyond the tool building, QA and UI testing stages of software development. All too often, it seems, our digital humanities projects devote a great deal of time, money, and labor to infrastructure and prototyping and then all too frequently the results languish in the great sea of hammers without a nail. Sure, a few journeymen carpenters stick these tools in their belts and hammer away, but all too often it seems that more effort goes into building the tools and then the resources sit around gathering dust while humanities research marches on in the time-tested modes with which we are most familiar.

Of course, I don’t mean this to be a criticism of the tool builders or the tools built. The TAPOR project, for example, offers many useful text analysis widgets, and I frequently send my colleagues and students there for quick and dirty text analysis. And just last month I had occasion to use and cite Stefan Sinclair’s Voyeur application. I was thrilled to have Voyeur at my fingertips; it provided a quick and easy way to do exactly what I wanted.

But often, the analytic tasks involved in our projects are multifaceted and cannot be addressed by any one tool. Instead, these projects involve “flows” in which our “humanistic” data travels through a series of analytic “filters” and comes out on the other end in some altered form. The TAPOR project attempts to be a virtual text analysis “workbench” in which the craftsman can slide a project around the bench from one tool to the next. This model works well for smallish projects but is not robust enough for large scale projects and, despite some significant interface improvements over the years, remains, for me at least, a bit clunky. I find it great for quick tasks with one or two texts, but inefficient for processing multiple texts or multiple processes. Part of the TAPOR mission was to develop a suite of tools that could be used by the average, ordinary humanist: which is to say, the humanist without any real technical chops. It succeeds on that front to be sure.

SEASR offers an alternative approach, and what it provides in terms of processing power and computational elegance it gives up in terms of ease of use and transparency. The SEASR “interface” is one that involves constructing modular “workflows” in which each module corresponds to some computational task. These modules are linked together such that one process feeds into the next, and the business of “sliding” a project around from one tool to another on the virtual workbench is taken over by the workflow manager.
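To make the idea of linked modules concrete, here is a minimal sketch in Python. It is not SEASR code and does not use any SEASR API; the module names (lowercase, tokenize, count_words) and the run_workflow function are hypothetical, invented only to show how one process can feed into the next without anyone sliding the project around by hand.

```python
from collections import Counter
from functools import reduce
from typing import Any, Callable, List

# Hypothetical modules: each performs one small task and hands its
# output to the next. None of these names come from SEASR itself.
def lowercase(text: str) -> str:
    return text.lower()

def tokenize(text: str) -> List[str]:
    return text.split()

def count_words(tokens: List[str]) -> Counter:
    return Counter(tokens)

def run_workflow(data: Any, modules: List[Callable[[Any], Any]]) -> Any:
    """Pass the data through each module in order, as a workflow manager would."""
    return reduce(lambda value, module: module(value), modules, data)

# The workflow is just an ordered list of modules.
workflow = [lowercase, tokenize, count_words]
counts = run_workflow("Call me Ishmael call me", workflow)
print(counts.most_common())
```

Swapping a module in or out means editing the list, which is roughly the trade being described: more power and reuse, at the price of having to think in terms of a pipeline.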

In this grant we have specifically deemphasized UI development in favor of output, in favor of “results” in the humanities sense of the word. As we write in the proposal, “The main emphasis of the project will be on developing, coordinating, and investigating the research questions posed by the participating humanities scholars.” The scholars in this project include myself and Franco Moretti at Stanford, Dan Cohen at GMU, Tanya Clement at University of Maryland, and Ted Underwood and John Unsworth, both of UIUC. On the technical end, we have Michael Welge and Loretta Auvil of the Automated Learning Group, of the National Center for Supercomputing Applications.

As the project gets rolling, I will have more to post about the specific research questions we are each addressing and the ongoing results of our work. . .

On Collaboration

I’ve been hearing a lot about “collaboration,” especially in the digital humanities. Lisa Spiro at Rice University has written a very informative post about Collaborative Authorship in the Humanities as well as another post providing Examples of Collaborative Digital Humanities Projects. Both of these posts are worth reading, and Spiro offers some well-thought-out and researched perspectives.

My own experiences with collaboration include both research and authorship. I have seen firsthand how fruitful collaboration, especially interdisciplinary collaboration, can be. It is safe to say that I’m a believer. In fact, the course I have been teaching for the last two years, Literary Studies and the Digital Library, is designed entirely around collaborative research projects. And yet I have to say that I am entirely suspicious of the current rage for “collaboration.”

No doubt the current popularity of collaboration, at least in the humanities, is a natural extension of the movement toward interdisciplinary studies. Though collaboration with people outside our individual disciplines has led to fruitful work, there seems to be an unnatural desire on the part of some administrators and even some colleagues to “foster collaboration,” as if collaboration were something that occurs in a petri dish, something that needs only to be “fostered” in order to evolve.

But collaboration does not arise out of a petri dish; it arises out of need. Sure, there are serendipitous collaborations that arise out of proximity: X bumps into Y at the water cooler and they get to talking . . . but more often successful collaboration arises out of need: X wishes to investigate a topic but requires the skills of Y in order to do a good job.

Failed collaborations, on the other hand, are all too often the result of well-intentioned but overly forced attempts to bring people together. I attended a seminar a couple of years ago on the subject of “fostering collaboration in the humanities.” The organizers of the meeting certainly understood the promise of new knowledge that might be derived through interaction, but they entirely miscalculated when it came to individual motivation to collaborate. It’s a classic case of putting the cart before the horse. In my experience, fruitful collaboration evolves organically and is motivated by the underlying research questions, questions that are always too big and too complex to be addressed by a single researcher.

Auto Converting Project Gutenberg Text to TEI

Those who do corpus level computational text analysis are always hungry for more and more texts to analyze. Though we’ve become adept at locating texts from a wide range of sources (our own institutional repositories as well as a number of other places including Google Books, the Internet Archive, and Project Gutenberg), we still face a number of preprocessing tasks to bring those various files into some standard format. The texts found at these resources are not always in a format friendly to the tools we use for processing those texts. For example, I’ve developed lots of processing scripts that are designed to leverage the metadata that is frequently encoded into TEI-based xml. A text from Project Gutenberg, however, is not only just plain text, but it has a lot of boilerplate text at the beginning and end of each file that needs to be removed prior to text analysis.

I’m currently building a corpus of 19th century novels and discovered that many of the texts I would like to include have already been digitized by Project Gutenberg. This, of course, was great news. But the system I have developed for ingesting texts into my corpus assumes that the texts will all be in TEI-XML with markup indicating such important things as “author,” “title,” and “date” of publication. I downloaded about 100 novels and was about to begin opening them up one by one and adding the metadata. . .eek! I quickly realized the mundanity of the task and thought, “hmm, I bet someone has written a nice regex script for doing this sort of thing.” A quick trolling of the web led me to the web page of Michiel Overtoom, who had developed some Python scripts for downloading and cleaning up (“beautifying” in his language) Dutch Gutenberg texts for his eBook Reader. Overtoom’s process is mainly designed to strip out the boilerplate and then rename the files with naming conventions that reflect the author and title of the books.

With Overtoom’s script as a base, I reengineered the code to convert a Gutenberg text into a minimally encoded and TEI-compliant XML file. The script builds a teiHeader that includes the author and title of the work (unfortunately, Project Gutenberg texts do not include publication dates; why?) and then adds the “text,” “body,” “div,” and “p” tags. The final result is a document that meets basic TEI requirements. The script is copied below, but since the all-important Python spacing may be destroyed by this posting, it’s better to download it here and then change the file extension from “.txt” to “.py”. Enjoy!
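Separately from that script, here is a minimal sketch of the same kind of conversion. It is neither Overtoom’s code nor my reengineered version, just an illustration under some loud assumptions: that the file carries “Title:” and “Author:” header lines, that the novel sits between marker lines beginning “*** START OF” and “*** END OF”, that blank lines separate paragraphs, and that a TEI P5 namespace is wanted. The function name gutenberg_to_tei and the sample text are made up for the example.

```python
import re
from xml.sax.saxutils import escape

def gutenberg_to_tei(raw: str) -> str:
    """Wrap a Project Gutenberg plain-text file in minimal TEI markup."""
    # Assumption: the header carries "Title:" and "Author:" lines.
    title = re.search(r"^Title:[ \t]*(.+)$", raw, re.MULTILINE)
    author = re.search(r"^Author:[ \t]*(.+)$", raw, re.MULTILINE)
    title_text = title.group(1).strip() if title else "Unknown title"
    author_text = author.group(1).strip() if author else "Unknown author"

    # Assumption: the body sits between "*** START OF" and "*** END OF" lines.
    start = re.search(r"^\*\*\* ?START OF.*$", raw, re.MULTILINE)
    end = re.search(r"^\*\*\* ?END OF.*$", raw, re.MULTILINE)
    body = raw[start.end() if start else 0 : end.start() if end else len(raw)]

    # Blank lines separate paragraphs; collapse line breaks inside each one.
    paragraphs = [" ".join(p.split()) for p in re.split(r"\n\s*\n", body) if p.strip()]
    p_tags = "\n".join(f"        <p>{escape(p)}</p>" for p in paragraphs)

    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<TEI xmlns="http://www.tei-c.org/ns/1.0">\n'
        "  <teiHeader>\n    <fileDesc>\n      <titleStmt>\n"
        f"        <title>{escape(title_text)}</title>\n"
        f"        <author>{escape(author_text)}</author>\n"
        "      </titleStmt>\n"
        "      <publicationStmt><p>Converted from Project Gutenberg.</p></publicationStmt>\n"
        "      <sourceDesc><p>Project Gutenberg plain text.</p></sourceDesc>\n"
        "    </fileDesc>\n  </teiHeader>\n"
        "  <text>\n    <body>\n      <div>\n"
        f"{p_tags}\n"
        "      </div>\n    </body>\n  </text>\n</TEI>\n"
    )

# Made-up sample standing in for a downloaded file.
sample = """Title: Example Novel
Author: A. Writer

*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE NOVEL ***

First paragraph, broken
across two lines.

Second paragraph with an & ampersand.

*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE NOVEL ***
License text follows here.
"""
tei = gutenberg_to_tei(sample)
print(tei)
```

Real Gutenberg files vary in their marker wording and header layout, so the regular expressions here would need checking against the actual downloads.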

Panning for Memes

Over in the English Department Literature Lab, we have been experimenting with Topic Modeling as a means of discovering latent themes (aka topics) in a corpus of 19th century novels. Topic Modeling is an unsupervised machine learning process that employs Latent Dirichlet allocation. “It posits that each document is a mixture of a small number of topics and that each word’s creation is attributable to one of the document’s topics.”
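For readers who think better in code, the quoted description can be sketched as a toy generative process in Python. This is an illustration of the idea, not Mallet’s implementation and not how topics are actually inferred: the two topics, their word weights, and the function names are invented, and the topics are fixed by hand here, whereas LDA proper treats the word distributions as unknowns to be estimated from the corpus.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Generate a toy document the way LDA imagines it was written."""
    topic_names = list(topics)
    # 1. Choose this document's mixture of topics.
    mixture = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # 2. For each word slot, choose a topic according to the mixture...
        topic = rng.choices(topic_names, weights=mixture)[0]
        # 3. ...then choose a word from that topic's weighted vocabulary.
        vocab, weights = zip(*topics[topic].items())
        words.append(rng.choices(vocab, weights=weights)[0])
    return mixture, words

# Invented topics with made-up word weights (illustrative only).
topics = {
    "whaling": {"whale": 5, "harpoon": 3, "ship": 2},
    "courtship": {"marriage": 4, "letter": 3, "ball": 3},
}
rng = random.Random(42)
mixture, words = generate_document(topics, alpha=[0.5, 0.5], length=12, rng=rng)
print(dict(zip(topics, mixture)))
print(" ".join(words))
```

Topic modeling runs this story in reverse: given only the words on the page, it estimates which mixtures and which topics most plausibly produced them.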

We’ve been experimenting using the Java-based MAchine Learning for LanguagE Toolkit (MALLET) from UMass Amherst and a corpus of British and American novels from the 19th century. In one experiment we ran the topic modeler over just the British corpus, in another over just the American corpus. But when we combined the two collections and ran the model over the whole corpus, we discovered that certain topics showed up in only one or the other corpus. For example, one solely American topic was composed of words related to slavery and words written in southern dialect. And there was a strictly British topic clearly indicative of the royalty and aristocracy: words such as “lord,” “king,” “duke,” “sir,” and “lady.” This was an interesting result, and not simply because it provides a quantitative way of distinguishing topics or themes that are distinct to one nation or another, but also because the topics themselves could be read and interpreted in context.
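One simple way to see whether a topic belongs mostly to one national corpus is to average its share across the documents on each side. The sketch below assumes we already have per-document topic proportions of the kind a topic modeler reports; the document names, topic labels, and numbers are invented for illustration and are not results from our experiments, and mean_topic_share is a hypothetical helper.

```python
from collections import defaultdict

def mean_topic_share(doc_topics, labels):
    """Average each topic's proportion within each labeled sub-corpus."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for doc_id, proportions in doc_topics.items():
        label = labels[doc_id]
        counts[label] += 1
        for topic, share in proportions.items():
            sums[label][topic] += share
    return {
        label: {topic: total / counts[label] for topic, total in topic_sums.items()}
        for label, topic_sums in sums.items()
    }

# Invented numbers for illustration; not results from the experiment.
doc_topics = {
    "novel_a": {"aristocracy": 0.5, "slavery": 0.0},
    "novel_b": {"aristocracy": 0.25, "slavery": 0.0},
    "novel_c": {"aristocracy": 0.0, "slavery": 0.5},
}
labels = {"novel_a": "British", "novel_b": "British", "novel_c": "American"}

shares = mean_topic_share(doc_topics, labels)
for label, topic_shares in shares.items():
    print(label, topic_shares)
```

A topic whose average share is near zero on one side and substantial on the other is a candidate for a nationally distinctive theme, though the words themselves still have to be read in context.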

More interesting for me, however, were two topics that appeared in both corpora. The first, which appeared more in the British corpus, was related to “soldiering.” A second topic, which was more common in the American corpus, had to do with Indian wars. The “soldiering” topic was composed of the following words:

“men,” “general,” “captain,” “colonel,” “army,” “horse,” “sir,” “enemy,” “soldier,” “battle,” “day,” “war,” “officer,” “great,” “country,” “house,” “time,” “head,” “left,” “road,” “british,” “soldiers,” “washington,” “night,” “fire,” “father,” “officers,” “heard,” “moment.”

The Indians topic included:

“indian,” “men,” “indians,” “great,” “time,” “chief,” “river,” “party,” “red,” “white,” “place,” “savages,” “woods,” “day,” “side,” “fire,” “war,” “savage,” “water,” “canoe,” “rifle,” “people,” “warriors,” “returned,” “feet,” “friends,” “tree,” “night,” “distance.”

What was most fascinating, however, was that when the soldiering topic was found in the American corpus it usually had to do with Indians, and when the Indian topic appeared in the British corpus it was almost completely in the context of the Irish! For an Irish Studies scholar who wrote a thesis on the role of the American West in Irish and Irish-American literature, this was an incredibly rich discovery. The literature of the Irish and the Irish Diaspora is filled with comparisons between the Irish situation vis-à-vis the British and the Native American situation vis-à-vis what one Irish American author described as the “Tide of Empire.”

Readers wishing to follow this line of comparison in some more contemporary works might want to have a look at Joyce’s short story “An Encounter,” Flann O’Brien’s book At Swim-Two-Birds, Paul Muldoon’s Madoc, and Patrick McCabe’s The Butcher Boy.