
Rethinking Range in the Age of Generative AI 

I recently reread David Epstein’s Range (2019), a book I first encountered a few years ago when it seemed every leadership forum was extolling the virtues of grit, 10,000 hours, and early specialization. Epstein pushed back, persuasively arguing that generalists, not specialists, are better equipped to solve complex problems, especially in domains where rules are unclear and outcomes are unpredictable. His thesis struck me as a welcome corrective and a fitting principle for the Dean of a College of Arts and Sciences (which I was at the time) to embrace. Reading it again now, in the post–generative AI world, I find it more than just persuasive; I find it essential.

Epstein’s central claim is that those who explore broadly, delay specialization, and learn through analogy and synthesis are better prepared for the “wicked” problems of the world—problems that don’t come with tidy instructions or immediate feedback. That idea was always relevant. But as generative AI takes on more and more of the tasks traditionally associated with specialized expertise (software programming, legal research, medical diagnostics, financial analysis, language translation, writing, and so on), Epstein’s argument takes on new urgency. We need a different kind of pedagogy now, one that privileges range, depth, and judgment over memorization and narrow skill-building.

Memorization Is Obsolete. But Thinking Isn’t.

Let’s be honest…if you want fast facts, crisp summaries, or a list of references, a large language model can do the job faster and more reliably than most humans. The days when being able to recall information conferred professional advantage are behind us. But here’s the rub: what AI cannot do, at least not yet, is to make meaningful analogies across domains, or to recognize when a familiar pattern no longer applies (AI clings to statistical priors and probabilities), or to ask truly generative questions, which is to say questions that open new avenues of inquiry rather than simply remixing what’s already known.

Those capacities are learned not through repetition or drill, but through what Epstein calls “sampling”: through exposure to different ways of thinking, working, and seeing. This is precisely what a traditional liberal arts education aimed to foster. In fact, I’d argue that the habits of mind developed through broad study of mathematics and science, but also of literature, history, philosophy, and the arts (disciplines too often marginalized in the STEM-obsessed discourse) are exactly what we need to cultivate in students if we want them to thrive alongside AI. And I say this as someone who has invested heavily in STEM, both personally and professionally.

The more things change, the more they stay the same

When my father was considering college, a liberal arts major was seen as the doorway to anything. Higher Ed was still a rather elite pursuit: medicine, teaching, and law were represented to him as “respectable” pursuits, but, then again, so was classics. He majored in English and math and felt well-prepared for a variety of roles in business. By the time I was in high school, the conventional wisdom about a broad foundation had shifted.  My father still valued his liberal arts foundation, but he advised me to specialize and pursue finance, accounting, business, or possibly law.  I did not, but I was convinced I needed a “practical” degree and spent one year as an architecture student before decamping to the liberal arts and an English major with several minors.  

By the time I was graduating from college, the conventional advice was shifting again, this time in a big way toward computer science and engineering.  I caught that wave and became an “early adopter” of programming, but mine was intended as a hobbyist’s pursuit, definitely not a career. Or so I thought. 

By the 2010s, CS and engineering had broadened to anything STEM, and quantitative degrees were touted as the surefire and sensible choice for job security in the modern world. Healthcare, especially “Pre-Med,” was an emerging area of attention and received honorable mention. Meanwhile the edusphere was rife with jokes about the most effective way to get an English major’s attention: just yell “waiter.”

The pendulum of conventional wisdom swung wide in the direction of increased specialization. Computer science and engineering came to dominate the conversation.  But soon a problem surfaced.  Higher ed was producing a lot of experts, but these experts weren’t very well rounded.  In 2019, the Wall Street Journal profiled how Northeastern University began requiring CS majors to take theater classes (specifically improv!) in order to “sharpen uniquely human skills” and build “empathy, creativity and teamwork [as a] competitive advantage over machines in the era of artificial intelligence.”1  Prescient?

Wicked Learning Environments Are the Norm Now

Epstein draws a sharp contrast between “kind” environments (like chess or golf) where patterns repeat and feedback is immediate, and “wicked” environments where feedback is sparse, misleading, or delayed. The world of work is not kind, and the world of generative AI is wicked in spades. These models are probabilistic, opaque, and massively influential. They’re already reshaping industries and knowledge work, and their decisions are often unexplainable—even to their creators.

Navigating this world demands not just technical fluency, but epistemic humility and conceptual agility. It requires the ability to think critically about systems, to understand where they might go wrong, and to imagine alternative futures. These are not traits we cultivate by marching students through test prep or narrow curricula. They’re cultivated through play, analogy, experimentation, and yes, through wandering widely around the course catalog and thinking deeply.

AI Is a Specialist. We Can Be Generalists.

Ironically, the very thing that makes AI powerful, especially when the models are fine-tuned to a particular task or adapted for a specific domain, is also a potential blind spot. Generative models are trained on what already exists. They can remix, but they can’t reimagine, not really. They can simulate reasoning, but they don’t have perspective. They can write beautifully fluent text, but they don’t have skin in the game or any real sense of how the words on the page convey meaning(s). That’s our job.

In a recent article for The Atlantic, Matteo Wong recounts a conversation with an AI researcher who was “rethinking the value of school.”2 Wong writes: “One entrepreneur told me that today’s bots may already be more scholastically capable than his teenage son will ever be, leading him to doubt the value of a traditional education.”  I can’t help wondering what that entrepreneur was thinking when using the word “traditional.”

If anything, the rise of generative AI reopens space for the (very traditional) Renaissance mind, for thinkers who can roam across domains, connect unlikely dots, and bring ethical insight to technical problems. The human edge isn’t in being faster or more encyclopedic or more “scholastically capable”; it’s in being wiser. That’s a distinctly generalist strength.

Toward a Post-AI Pedagogy

So what does this mean for teaching and learning? Arguably, it means we need to stop confusing learning with content acquisition and specialization. When it comes to content acquisition, the AIs will beat us every time. This means doubling down on slow learning, on open-ended inquiry, on the value of taking time to understand why something matters, not just how to do it. It means encouraging students to read outside their major, to embrace intellectual detours, and to reflect on what they know and don’t know. 

To be clear, I’m not suggesting we abandon technical training or STEM. But we need to reframe its purpose. In a world where tools evolve faster than syllabi, the lasting value of higher education lies not in tool mastery but in the transferability of judgment, in the ability to reason analogically and ethically under conditions of uncertainty. 

Reading Range again has reminded me that the best preparation for a world shaped by AI might not be more AI—but more humanity. More slow thinking, more curiosity, and more range.


  1. Castellanos, Sara. “‘Oh, My God, Where Is This Going?’ When Computer-Science Majors Take Improv.” Wall Street Journal. May 14, 2019. (https://www.wsj.com/articles/oh-my-god-where-is-this-going-when-computer-science-majors-take-improv-11557846729)
  2. Wong, Matteo. “The AI Industry is Radicalizing.” The Atlantic. July 8, 2025. (https://www.theatlantic.com/technology/archive/2025/07/ai-radicalization-civil-war/683460/)

Revisiting Chapter Nine of Macroanalysis

Back when I was working on Macroanalysis, Gephi was a young and sometimes buggy application. So when it came to the network analysis in Chapter 9, I was limited in terms of the amount of data that could be visualized. For the network graphs, I reduced the number of edges from 5,660,695 down to 167,770 by selecting only those edges where the distances were quite close.

Gephi can now handle one million edges, so I thought it would be interesting to see how/if the results of my original analysis might change if I went from graphing 3% of the edges to 18%.

Readers familiar with my approach will recall that I calculated the similarity between every book in my corpus using Euclidean distance. My feature set was a combination of topic data from the topic model discussed in chapter 8 and the stylistic data explored in chapter 6. Basically, every single book was compared to every other single book using the Euclidean formula, the output of which is a distance matrix in which the number of rows and the number of columns are both equal to the number of books in the corpus. The values in the cells of the matrix are the computed Euclidean distances.
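To make that concrete, here is a minimal sketch in R. The feature matrix below is a random stand-in (not the actual Macroanalysis data); the point is only to show how a books-by-books Euclidean distance matrix is produced.

# Hypothetical feature matrix: one row per book, one column per feature
# (topic proportions plus stylistic frequencies)
set.seed(42)
features <- matrix(runif(6 * 10), nrow = 6,
                   dimnames = list(paste0("book_", 1:6), paste0("f", 1:10)))
# dist() computes the pairwise Euclidean distances; as.matrix() yields the
# square books-by-books matrix described above
dist_matrix <- as.matrix(dist(features, method = "euclidean"))
dim(dist_matrix)  # 6 x 6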

If you take any single row (or column) in the matrix and sort it from smallest to largest, the smallest value will always be a 0 and that is because the distance from any book to itself is always zero. The next value will be the book that has the most similar composition of topics and style. So if you select the row for Jane Austen’s Pride and Prejudice, you’ll find that Sense and Sensibility and other books by Austen are close by in terms of distance. Austen has a remarkably stable style across her novels and the same topics tend to appear across her books.
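Continuing that toy sketch, sorting one row of the matrix surfaces a book’s nearest neighbors; the zero at the front is the book’s distance to itself.

# Sort one row of the distance matrix from smallest to largest;
# the first value is always 0 (the book compared with itself)
sort(dist_matrix["book_1", ])[1:4]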

For any given book, there are a handful of books that are very similar (short distances), then a series of books that are fairly similar, and then a whole bunch of books that have little to no similarity. Consider the case of Pride and Prejudice. Figure 1 shows the sorted distances from Pride and Prejudice to the 35 most similar books in the corpus. You’ll notice there is a “knee” in the line right around the 7th book on the x-axis. Those first seven books are very similar. After that we see books becoming more and more distant along a fairly regular slope. If we were to plot the entire distribution, there would be another “knee” where books become incredibly dissimilar and the line shoots upward.

In chapter 9 of Macroanalysis, I was curious about influence and the relationship between individual books and the other books that were most similar to them. To explore these relationships at scale, I devised an ad hoc approach to culling the number of edges of interest to only those where the distances were comparatively short. In the case of Pride and Prejudice, the most similar books included other works by Austen, but also books stretching into the future as far as 1886. In other words, the most similar books are not necessarily colocated in time.

I admit that this culling process was not very well described in Macroanalysis and there is, I see now, one error of omission and one outright mistake. Neither of these impacted the results described in the book, but it’s definitely worth setting the record straight here. In the book (page 165), I write that I “removed those target books that were more than one standard deviation from the source book.” That’s not clear at all, and it’s probably misleading.

For each book, call it the “base” book, I first excluded all books published in the same year or before the publication year of the base book (i.e. a book could not influence a book published in the same year or before, so these should not be examined). I then calculated the mean distance of the remaining books from the base book. I then kept only those books that were less than 3/4 of a standard deviation below the mean (not one whole standard deviation as suggested in my text). For Pride and Prejudice, this formula meant that I retained the 26 most similar books. For the larger corpus, this is how I got from 5,660,695 edges down to 167,770.
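Here is a small R sketch of that culling rule using the toy distance matrix from above. The publication years are invented, and the cutoff (keep a neighbor only when its distance falls below the mean minus 0.75 standard deviations) is my reading of the rule described in the previous paragraph.

# For a given "base" book: drop books published in the same year or earlier,
# then keep only neighbors whose distance is below mean - 0.75 * sd
cull_edges <- function(base_id, dist_matrix, pub_years) {
  d <- dist_matrix[base_id, ]
  d <- d[names(d) != base_id]                        # drop the self-comparison
  d <- d[pub_years[names(d)] > pub_years[base_id]]   # later publications only
  cutoff <- mean(d) - 0.75 * sd(d)
  names(d)[d < cutoff]
}
# Invented publication years for the toy corpus
pub_years <- setNames(c(1811, 1813, 1820, 1847, 1860, 1886), rownames(dist_matrix))
cull_edges("book_1", dist_matrix, pub_years)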

For this blog post, I recreated the entire process. The next two images (figures 2 and 3) show the same results reported in the book. The network shapes look slightly different and the orientations are slightly different, but there is still clear evidence of a chronological signal (figure 2) and there is still a clear differentiation between books authored by males and books authored by females (figure 3).

Figure 2: Using 167,770 Edges
Figure 3: Using 167,770 Edges

Figures 4 and 5, below, show the same chronological and gender sorting, but now using 1 million edges instead of the original 167,770.

Figure 4: Using 1,000,000 Edges
Figure 5: Using 1,000,000 Edges

One might wonder whether what’s being graphed here is obvious. After all, wouldn’t we expect topics to be time sensitive, even faddish, and wouldn’t we expect style to be likewise? Well, I suppose expectations are a matter of personal opinion.

What my data show is that some topics appear and disappear over time (e.g. vampires) in what seem to be faddish ways, others appear with regularity and even predictability (love), and some are just downright odd, appearing and disappearing in no recognizable pattern (animals). Such is also the case with the word frequencies that we often speak of as a proxy for “style.” In the 19th century, for example, use of the word “like” in English fiction was fairly consistent and flat compared to other frequent words that fluctuate more from year to year or decade to decade: e.g. “of” and “it.”

So, I don’t think it is a foregone conclusion that novels published in a particular time period are necessarily similar. It is possible that a particularly popular topic might catch on or that a powerful writer’s style might get imitated. It is equally plausible that in a race to “make it new” writers would intentionally avoid working with popular topics or imitating a typical style.

And when it comes to author gender/sex, I don’t think it is obvious that male writers will write like other males and females like other females. The data reveal that even while the majority (roughly 80%) in each class write more like members of their class, many women (~20%) write more like men and many men (~20%) write more like women. Which is to say, there are central tendencies and there are outliers. When it comes to author gender, study after study indicates that this central tendency holds for about 80% of writers. Looking at how these distributions evolve over time seems to me an especially interesting place for ongoing research.

But what we are ultimately dealing with here, in these graphs, are the central tendencies. I continue to believe, as I have argued in Macroanalysis and in The Bestseller Code, that it is only through an understanding of the central tendencies that we can begin to understand and appreciate what it means to be an outlier.

Syuzhet 1.0.4 now on CRAN

On Friday I posted an updated version of Syuzhet (1.0.4) to CRAN. This version has been available over on GitHub for a while now. In version 1.0.4, support for sentiment detection in several languages was added by using the expanded NRC lexicon from Saif Mohammad. The lexicon includes sentiment values for 13,901 words in each of the following languages: Arabic, Basque, Bengali, Catalan, Chinese_simplified, Chinese_traditional, Danish, Dutch, English, Esperanto, Finnish, French, German, Greek, Gujarati, Hebrew, Hindi, Irish, Italian, Japanese, Latin, Marathi, Persian, Portuguese, Romanian, Russian, Somali, Spanish, Sudanese, Swahili, Swedish, Tamil, Telugu, Thai, Turkish, Ukranian, Urdu, Vietnamese, Welsh, Yiddish, Zulu.

At the time of this release, however, Syuzhet will only work with languages that use Latin character sets. This effectively means that “Arabic”, “Bengali”, “Chinese_simplified”, “Chinese_traditional”, “Greek”, “Gujarati”, “Hebrew”, “Hindi”, “Japanese”, “Marathi”, “Persian”, “Russian”, “Tamil”, “Telugu”, “Thai”, “Ukranian”, “Urdu”, “Yiddish” are not supported even though these languages are part of the extended NRC dictionary and can be accessed via the get_sentiment_dictionary() function. I have heard from several of my non-English native speaking students and a few others on Twitter that the German, French, and Spanish results seem to be good. Your mileage may vary. For details on the lexicon, see NRC Emotion Lexicon.
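For readers who want to experiment with the new languages, here is a minimal sketch. The argument names (language in get_sentiment, and dictionary/language in get_sentiment_dictionary) are my reading of the function documentation, so check the help pages if your version differs.

library(syuzhet)
# A couple of Spanish sentences
frases <- get_sentences("Odio la lluvia fria. Amo el sol y la playa.")
# The expanded NRC lexicon is selected with method = "nrc" plus a language name
valores <- get_sentiment(frases, method = "nrc", language = "spanish")
valores
# The underlying multilingual dictionary can also be inspected directly
head(get_sentiment_dictionary(dictionary = "nrc", language = "spanish"))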

Also in this release is support for user-created lexicons. To use a custom lexicon, create a data frame with at least two columns named “word” and “value.” Here is a simplified example:

library(syuzhet)
# Split the example text into sentences
my_text <- "I love when I see something beautiful. I hate it when ugly feelings creep into my head."
char_v <- get_sentences(my_text)
# A custom lexicon needs at least two columns named "word" and "value"
custom_lexicon <- data.frame(word = c("love", "hate", "beautiful", "ugly"), value = c(1, -1, 1, -1))
# Score each sentence against the custom lexicon
my_custom_values <- get_sentiment(char_v, method = "custom", lexicon = custom_lexicon)
my_custom_values

With contributions from Philip Bulsink, support for parallel processing was added so that one can call get_sentiment() and provide cluster information from parallel::makeCluster() to achieve results more quickly on systems with multiple cores.
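A minimal sketch of the parallel option follows; I am assuming the cluster is handed to get_sentiment() through a cl argument, so verify the argument name against the package documentation for your version.

library(syuzhet)
library(parallel)
sentences <- get_sentences("I love this. I hate that. The weather is fine. The food was awful.")
cl <- makeCluster(2)                                         # two worker processes
vals <- get_sentiment(sentences, method = "afinn", cl = cl)  # cl argument assumed
stopCluster(cl)                                              # release the workers
vals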

Thanks also to Jennifer Isasi, Tyler Rinker, “amrrs,” and Oliver Keyes for recent suggestions/contributions/QA.

Examples of how to use these new functions and languages are in the updated vignette.

More Syuzhet Validation

Back in December I posted results from a human validation experiment in which machine extracted sentiment values were compared to human coded values. The results were encouraging. In the spring, we mined the human coded sentences to help create a new sentiment dictionary that would, in theory, be more sensitive to the sort of sentiment words common to fiction (whereas existing sentiment dictionaries tend to be derived from movie and/or product review corpora). This dictionary was implemented as the default in the latest release of the Syuzhet R package (2016-04-28).

Over the summer, a new group of six human-coders was hired to read novels and score the sentiment of every sentence. Each novel was read by three human-coders. In the graphs below, a simple moving average is used to plot the mean sentiment of the three coders (black line) alongside the values derived from the new “Syuzhet” dictionary (red line). Each graph reports the Pearson product-moment correlation coefficient.
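The mechanics behind these graphs can be sketched as follows. The two vectors here are random stand-ins for the coder means and the dictionary values, not the actual study data; the rolling mean and the cor() call show how the smoothed lines and the reported correlations are produced.

# Stand-in vectors: per-sentence means from three coders and machine values
set.seed(1)
n <- 300
human_mean   <- sin(seq(0, 6, length.out = n)) + rnorm(n, sd = 0.4)
machine_vals <- sin(seq(0, 6, length.out = n)) + rnorm(n, sd = 0.4)
# Simple moving average (window of 25 sentences), then the Pearson correlation
sma <- function(x, k) stats::filter(x, rep(1 / k, k), sides = 2)
cor(sma(human_mean, 25), sma(machine_vals, 25), use = "complete.obs")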

This fall we will continue gathering human data by reading additional books. Once we have a few more books read, we’ll post a more detailed report, including data about inter-coder agreement and which machine methods produced results closest to the humans.

[Figures: human vs. machine sentiment plots for four novels (train, alex, bernadette, circle)]

Requiem for a low pass filter

Ben Schmidt’s and Scott Enderle’s recent entries into the syuzhet discussion have beaten the last of the low pass filter out of me. I’m not entirely ready to concede that Fourier is useless for the larger problem, but they have convinced me that a better solution than the low pass is possible and probably warranted. What that better solution is remains an open question, but Ben has given us some things to consider.

In a nutshell, there were two essential elements to Vonnegut’s challenge that the low pass method seemed to be solving.  According to Vonnegut, this business of story shape “is an exercise in relativity” in which “it is the shape of the curves that matter and not their point of origin.”  Vonnegut imagined a system of plot in which the high and lows of good fortune and ill fortune are internally relative.  In this way, a very negative book such as Blood Meridian will have an absolute high and an absolute low that can be compared to another book that, though more positive on a whole, will also have an absolute high and an absolute low. The object of analysis is not the degree of positive or negative valence but the location of the spikes and troughs of that valence relative to the beginning and end of the book.  When conceived of in these terms, the ringing artifacts of the low pass filter seem rather trivial because the objective was not to perfectly represent the valence but to dramatize the shifts in valence.

As Ben has pointed out, however, the edges of the Fourier method present a different sort of problem; they assume that story plots are periodic, repeating signals.  The problem, as Ben puts it, is that the method “imposes an assumption that the start of [a] plot lines up with the end of a plot.”

Over the weekend, Ben and I exchanged a few emails, and I acknowledged that I had been overlooking these edge distortions in favor of a big picture perspective of the general shape.  Some amount of distortion, after all, must be tolerated if we want to produce a smooth shape.  As Israel Arroyo pointed out in a tweet, “endpoints are problematic in most smoothers and filters.”  With a simple rolling window, for example, the averaging can’t start until we are already half the distance of the window into the sequence.  Figure 1, which shows four options for smoothing Portrait of the Artist, highlights the moving average problem in blue.[1]
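A toy illustration of that edge problem (the numbers are arbitrary): with a centered five-value window, the first and last two positions have no defined average, which is why a plain moving-average line starts and ends short of the data.

x <- c(2, 4, 1, 5, 3, 6, 2, 7)
stats::filter(x, rep(1 / 5, 5), sides = 2)
# Returns NA NA 3.0 3.8 3.4 4.6 NA NA: the ends cannot be averaged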


Figure 1

Looking only at figure one, it would be hard to argue against Fourier as a beautiful representation of the plot shape.  Figure 2 shows the same four methods applied to Dorian Gray.  Here again, the Fourier method seems to provide a fair representation.  In this case, however, we begin to see a problem forming at the end of the book.  The red lowess line is trending down while the green Fourier is reaching up in order to complete its cycle.  The beginning still looks good, and perhaps the distortion at the end can be tolerated, but it’s certainly not ideal.


Figure 2

Unfortunately, some sentiment trajectories appear to create a far more pronounced problem.  At Ben’s suggestion, I ran the same experiments with Madame Bovary.  The resulting plot is shown in figure 3.  I’ve not read Bovary in many years, so I can’t recall too many details about plot, but I do remember that it does not end well for anyone.  The shape of the green Fourier line at the end of figure 3, however, suggests some sort of uptick in positive sentiment that I suspect is not present in the text. The start of the shape, on the left, also looks problematic compared to the other smoothers.


Figure 3

With the first two figures, I think a case can be made that the Fourier line offers a fair representation of the emotional trajectory.  Making such a case for Bovary is not inconceivable if we ignore the edges, but it is clearly a stretch, and there is no denying that the lowess smoother does a better job.

In our email exchange about these different options, Ben included a graphic showing how various methods model four different books.  At least in these examples, loess (fifth row of figure 4) appears to be the top contender if we seek a representation that is both maximally smooth and maximally approximate.


Figure 4

In order to fully solve Vonnegut’s challenge, an alternative to percentage chunking is still necessary.  Longer segments in longer books will tend toward a neutral valence.  Figuring that out is work for the future.  For now, the Bovary example provides precisely the sort of validation/invalidation I was hoping to elicit by putting the package online.

RIP low-pass filter.[2]

FOOTNOTES:

[1] There are some more elegant ways to deal with filling in the flat edges, but keeping it simple here for illustration.

[2] I’m grateful to everyone who has engaged in this discussion, especially Annie Swafford, Daniel Lepage, Ted Underwood, Andrew Piper, David Bamman, Scott Enderle, and Ben Schmidt.  It has been a very engaging couple of weeks, and along the way I could not help but think of what this discussion might have looked like in print: it would have taken years to unfold!  Despite some emotional highs and lows of its own, this has been a productive exercise and a great example of how valuable open code and the digital commons can be for progress.

My Sentiments (Exactly?)

While developing the Syuzhet package–a tool for tracking relative shifts in narrative sentiment–I spent a fair amount of time gut-checking whether the sentiment values returned by the machine methods were a good match for my own sense of the narrative sentiment.  Between 70% and 80% of the time, they were what I considered to be good sentence level matches. . . but sentences were not my primary unit of interest.

Rather, I wanted a way to assess whether the story shapes that the tool produced by tracking changes in sentiment were a good approximation of central shifts in the “emotional trajectory” of a narrative.  This emotional trajectory was something that Kurt Vonnegut had described in a lecture about the simple shapes of stories.  On a chalkboard, Vonnegut graphed stories of good fortune and ill fortune in a demonstration that he calls “an exercise in relativity.”  He was not interested in the precise high and lows in a given book, but instead with the highs and lows of the book relative to each other.

Blood Meridian and The Devil Wears Prada are two very different books. The former is way, way more negative.  What Vonnegut was interested in understanding was not whether McCarthy’s book was more wholly negative than Weisberger’s; he was interested in the internal dynamics of shifting sentiment: where in a book we would find the lowest low relative to the highest high. Implied in Vonnegut’s lecture was the idea that this tracking of relative highs and lows could serve as a proxy for something like “plot structure” or “syuzhet.”

This was an interesting idea, and sentiment analysis offered a possible way forward.  Unfortunately, the best work in sentiment analysis has been in very different domains.  Could sentiment analysis tools and dictionaries that were designed to assess sentiment in movie reviews also detect subtle shifts in the language of prose fiction? Could these methods handle irony, metaphor, and so forth?  Some people, especially if they looked only at the results of a few sentences, might reject the whole idea out of hand. Movie reviews and fiction, hogwash!  Instead of rejecting the idea, I sat down and human coded the sentiment of every sentence in Joyce’s Portrait of the Artist. I then developed Syuzhet so that I could apply and compare four different sentiment detection techniques to my own human codings.

This human coding business is nuanced.  Some sentences are tricky.  But it’s not the sarcasm or the irony or the metaphor that is tricky. The really hard sentences are the ones that are equal parts positive and negative sentiment. Consider this contrived example:

“I hated the way he looked at me that morning, and I was glad that he had become my friend.”

Is that a positive or negative sentence?  Given the coordinating “and” perhaps the second half is more important than the first part?  I coded sentences such as this as neutral, and thankfully these were the outliers and not the norm. Most of the time–even in a complex novel like Portrait where the style and complexity of the sentences are both evolving with the maturation of the protagonist–it was fairly easy to make a determination of positive, negative, or neutral.

It turns out that when you do this sort of close reading you learn a lot about the way that authors write/express/manipulate “sentiment.”  One thing I learned was that tricky sentences, such as the one above, are usually surrounded by other sentences that are less tricky.  In fact, in many random passages that I examined from other books, and in the entirety of Portrait, tricky sentences were usually followed or preceded by other simple sentences that would clarify the sentiment of the larger passage.  This is an important observation because at the level of an individual sentence, we know that the various machine methods are not super effective.[1]  That said, I was pretty surprised by the amount of sentence level agreement in my ad hoc test.  On a sentence by sentence basis, here is how the four methods in the package performed:[2]

Bing 84% agreement
Afinn 80% agreement
Stanford 60% agreement
NRC 50% agreement
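For anyone who wants to run this kind of check on their own text, here is a rough sketch. The sentences and the human_codes vector are placeholders (a real test would use a full novel and actual -1/0/1 codings), and agreement here means matching sign, as described in note 2 below.

library(syuzhet)
sentences   <- get_sentences("I loved the morning. The storm ruined everything. He sang that song.")
human_codes <- c(1, -1, 0)                          # hypothetical -1/0/1 codings
bing_vals   <- get_sentiment(sentences, method = "bing")
mean(sign(bing_vals) == sign(human_codes))          # proportion of sign agreement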

These results surprised me.  I was shocked that the more awesome Stanford method did not outperform the others. I was so shocked, in fact, that I figured I must have done something wrong.  The Stanford sentiment tagger, for example, thinks that the following sentence from Joyce’s Portrait is negative.

“Once upon a time and a very good time it was there was a moocow coming down along the road and this moocow that was coming down along the road met a nicens little boy named baby tuckoo.”

It was a “very good time.” How could that be negative?  I think “a very good time” is positive and so do the other methods. The Stanford tagger also indicated that the sentence “He sang that song” is slightly negative.  All of the other methods scored it as neutral, and so did I.

I’m a huge fan of the Stanford tagger; I’ve been impressed by the way that it handles negation, but perhaps when all is said and done it is simply not well-suited to literary prose, where the syntactical constructions can be far more complicated than in typical utilitarian prose. I need more time to study how the Stanford tagger behaved on this problem, so I’m just going to exclude it from the rest of this report.  My hypothesis, however, is that it is far more sensitive to register/genre than the dictionary-based methods.

So, as I was saying, sentiment in actual prose fiction is usually established over a series of sentences. That simile, that bit of irony, that negated sentence is typically followed and/or preceded by a series of more direct sentences expressing the sentiment of the passage.  For example,

“She was not ugly.  She was exceedingly beautiful.”
“I watched him with disgust. He ate like a pig.”

Prose, at least the prose that I studied in this experiment, is rarely composed of sustained irony, sustained negation, sustained metaphor, etc.  Usually authors provide us with lots of clues about the sentiment we are meant to experience, and over the course of several sentences, a paragraph, or a page, the sentiment tends to become less ambiguous.

So instead of just testing the machine methods against my human sentiments on a sentence-by-sentence basis, I split Joyce’s Portrait into 20 equally sized chunks and calculated the mean sentiment of each.  I then compared those means to the means of my own human-coded sentiments.  These were the results:

Bing 80% agreement
Afinn 85% agreement
NRC 90% agreement

Not bad.  But of course any time we apply a chunking method like this we risk breaking the text right in the middle of a key passage.  And, as we increase the number of chunks and effectively decrease the size of each passage, the agreement tends to decrease. I ran the same test using 100 segments and saw this:

Bing 73% agreement
Afinn 77% agreement
NRC 58% agreement (ouch)
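The chunked comparison works the same way, using get_percentage_values() to split a sentiment vector into a chosen number of equal-sized bins and take the mean of each. The vectors below are random stand-ins rather than the Portrait data.

library(syuzhet)
set.seed(2)
machine_vals <- rnorm(2000)                          # stand-in sentence values
human_vals   <- machine_vals + rnorm(2000, sd = 0.5) # stand-in human codings
machine_20 <- get_percentage_values(machine_vals, bins = 20)
human_20   <- get_percentage_values(human_vals, bins = 20)
mean(sign(machine_20) == sign(human_20))             # agreement across 20 chunks
mean(sign(get_percentage_values(machine_vals, bins = 100)) ==
       sign(get_percentage_values(human_vals, bins = 100)))  # and across 100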

Figure 1 graphs how the AFinn method (with 77% agreement over 100 segments) tracked the sentiment compared to my human sentiments.


Figure 1

Next I transformed all of the sentiment vectors (machine and human) using the get_transformed_values function.  I then calculated the amount of agreement. With the low pass filter set to the default of 3, I observed the following agreement:

Bing 73% agreement
Afinn 74% agreement
NRC 86% agreement

With the low pass filter set to 5, I observed the following agreement:

Bing 87% agreement
Afinn 93% agreement
NRC 90% agreement
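The transformation step looks roughly like this; the low_pass_size and scale_vals argument names follow the function documentation, and the input vector is again a stand-in rather than the real data.

library(syuzhet)
set.seed(3)
raw_vals  <- rnorm(2000)                             # stand-in sentence values
shape_lp3 <- get_transformed_values(raw_vals, low_pass_size = 3, scale_vals = TRUE)
shape_lp5 <- get_transformed_values(raw_vals, low_pass_size = 5, scale_vals = TRUE)
length(shape_lp3)   # normalized to a common length for cross-text comparison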

Figure 2 graphs how the transformed AFinn method tracked narrative changes in sentiment compared to my human sentiments.[3]


Figure 2

As I have said elsewhere, my primary reason for open-sourcing this code was so that others could plot some narratives of their own and see if the shapes track well with their human sense of the emotional trajectories.  If you do that, and you have successes or failures, I’d be very interested in hearing from you (please send me an email).

Given all of the above, I suppose my current working benchmark for human to machine accuracy is something like ~80%.  Frankly, though, I’m more interested in the big picture and whether or not the overall shapes produced by this method map well onto our human sense of a book’s emotional trajectory.  They certainly do seem to map well with my sense of Portrait of the Artist, and with many other books in my library, but what about your favorite novel?

FOOTNOTES:
[1] For what it is worth, the same can probably be said about us, the human beings.  Given a single sentence with no context, we could probably argue about whether it is positive or negative.
[2] Each method uses a slightly different value range, so when I write of “agreement,”  I mean only that the machine method agreed with the human (me) that a given sentence was positively or negatively charged.  My rating scale consisted of three values: 1, 0, -1 (positive, neutral, negative). I did not test the extent of the positiveness or the degree of negativeness.
[3] I explored low-pass values in increments of 5 all the way to 100.  The percentages of agreement were consistently between 70 and 90.

A Ringing Endorsement of Smoothing

On March 7, Annie Swafford posted an interesting critique of the transformation method implemented in Syuzhet.  Her basic argument is that setting the low-pass filter too low may result in misleading ringing artifacts.[1]  This post takes up the issue of ringing artifacts more directly and explains how Annie’s clever method of neutralizing values actually demonstrates just how effective the Syuzhet tool is in doing what it was designed to do!   But lest we begin chasing any red herring, let me be very clear about the objectives of the software.

  1. The tool is meant to reveal the simple (and latent) shape of stories, not the complex shape of stories, not the perfect shape of stories, not the absolute shape of stories, just the simple foundational shapes.[2]  This was the challenge that Vonnegut put forth when he said “There is no reason why the simple shape of stories cannot be fed into computers.”
  2. The tool uses sentiment, as detected by four possible methods, as a proxy for “plot.”  This is in keeping with Vonnegut’s conception of “plot” as a movement between what he called “good fortune” and “ill fortune.”  The gamble Syuzhet makes is that the sentiment detection methods are both “good enough” and also may serve as a satisfying proxy for the “good” and “ill” fortune Vonnegut describes in his essay and lecture.
  3. Despite some complex mathematics, there is an interpretive dimension to this work. I suppose this is why folks call it “digital humanities” instead of physics. Syuzhet was designed to estimate and smooth the emotional highs and lows of a narrative; it was not designed to provide a perfect mapping of emotional valence. I don’t think such perfect mapping is computationally possible; if you want/need that kind of experience, then you need to read the book (some of ’em are even worth it).  I’m interested in detecting/revealing the simple shape of stories by approximating the fundamental highs and lows of emotional valence. I believe that this is what Vonnegut had in mind.
  4. Finally, when examining the shapes produced by graphing the Syuzhet values, we must remember what Vonnegut said: “This is an exercise in relativity, really. It is the shape of the curves that matters and not their origins.”  When Vonnegut speaks of the shapes, he speaks of them as “simple” shapes.

In her critique of the software, Annie expresses concern over the potential for ringing artifacts when using a Fourier transformation and a very low, low-pass filter.  She introduces an innovative method for detecting this possible ringing.  To demonstrate the efficacy of her method, she “neutralizes” one third of the sentiment values in Joyce’s Portrait of the Artist as a Young Man and then retransforms and graphs the new neutralized shape against the original foundation shape of the story.

Annie posits that if the Syuzhet tool is working as she thinks it should, then the last third of the foundational shape should change in reaction to this neutralization.  In Annie’s example, however, no significant change is observed, and she concludes that this must be due to a ringing artifact.  Figure 1 (below) is the evidence she presents on her blog.

Figure 1: last third neutralized

For what it is worth, we do see some minor differences between the blue and the orange lines, but really, these look like the same “Man in Hole” plot shapes.  Ouch, this does look like a bad ringing artifact.  But could there be another explanation?

There may, indeed, be some ringing here, but it’s not nearly so extreme as Figure 1 suggests.  An alternative conclusion is that the similarity we observe in the two lines is due to a similarity between the actual values and the neutralized values.  As it happens, the last third of the novel is already pretty neutral compared to the rest of the novel.  In fact, the mean valence for the entire last third of the novel is -0.05.  So all we have really achieved in this test is to replace a section of relatively neutral valence with another segment of totally neutral valence.

This is not, therefore, a very good book in which to test for the presence of ringing artifacts using this particular method of neutralization.  What we see here is a case of the right result but the wrong conclusion.  Which is not to say that there is not some ringing present; I’ll get to that in a moment.  But first another experiment.

If, instead of resetting those values to zero, we set them to 3 (making Portrait end on a very happy note indeed), we get a much different shape (blue line in figure 3).  The earlier parts of the novel are now represented as comparatively less negative and the end of the novel is mucho positive.
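For anyone who wants to rerun these experiments, the manipulation is easy to sketch. The file path is a placeholder for a plain-text Portrait, and the transformation arguments mirror the package defaults with scaling turned on.

library(syuzhet)
raw <- get_sentiment(get_sentences(get_text_as_string("portrait.txt")))   # placeholder path
last_third  <- (floor(2 * length(raw) / 3) + 1):length(raw)
neutral_end <- replace(raw, last_third, 0)    # Swafford-style neutralization
happy_end   <- replace(raw, last_third, 3)    # the artificially happy ending described above
orig_shape    <- get_transformed_values(raw,         low_pass_size = 3, scale_vals = TRUE)
neutral_shape <- get_transformed_values(neutral_end, low_pass_size = 3, scale_vals = TRUE)
happy_shape   <- get_transformed_values(happy_end,   low_pass_size = 3, scale_vals = TRUE)
plot(orig_shape, type = "l"); lines(happy_shape, lty = 2)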


Figure 3: Portrait with artificial positive ending

And, naturally, we can also set those values very negative and produce the graph seen in figure 4.  Oh, poor Stephen.


Figure 4: Portrait with artificial negative ending

“But wait, Jockers, you can’t deny that there is still an artificial ‘hump’ there at the end of figure 3 and an artificial trough at the end of figure 4.”   Nope, I won’t deny it; there really can be ringing artifacts.  Let’s see if we can find some that actually matter . . .

First let’s test the beginning of the novel using the Swafford method.  We neutralize the beginning third of the novel and graph it against the original shape (figure 5).  Hmm, again it seems that the foundation shape is pretty close to the original.  Is this a ringing artifact?


Figure 5: first third neutralized

Could be, but in this case it is probably just another false ringer.  Guess what, the beginning of Joyce’s novel is also comparatively neutral.  This is why the Swafford method results in something similar when we neutralize the first third of the book.  Do note that the first third is a little bit less neutral than the last third.  This is why we see a slightly larger difference between the blue and orange lines in figure 5 compared to figure 1.

But what about the middle section?

If we set the middle third of the novel to neutral, what we get is a very different shape (and a very different novel)!  Figure 6 uses the Swafford method to remove the central crisis of the novel. This is no longer a “man in hole” story, and the resulting shape is precisely what we would expect.  Make no mistake, that hump of happiness is not a ringing artifact.  That hump in the middle is now the most sustained non-negative moment in the book.  We have replaced hell with limbo (not heaven because these are neutral values), and in comparison to the other parts of the book, limbo looks pretty good!  Keep in mind Vonnegut’s message from #4 above: “This is an exercise in relativity.”  Also keep in mind that there is some scaling going on over the y-axis; in other words, we should not get too hung up on the precise position on the y-axis at the expense of seeing the simple shape.

In the new graph, the deepest trough has now shifted to the early part of the novel, which is now the location of the greatest negative valence in the story (it’s the section where Stephen gets sick and is then beaten by Father Dolan). The end of the book now looks relatively darker since we no longer have the depths of hell from the midsection for comparison, but the end third of Portrait is definitely not as negative as the beginning third and this is reflected nicely in figure 6.  (This more positive ending is also evident, by the way, in the original shape–orange line–where the hump in the last third is slightly higher than the early hump.)


Figure 6: Portrait with Swaffordized Middle

So, the Swafford method proves to be a very useful tool for testing and confirming our expectations.  If we remove the most negative section of the novel, then we should see the nadir of the simple shape shift to the next most negative section.  That is precisely what we see.  I have tested this over a series of other novels, and the effect is the same (see figure 9 below, for example).  This is a great method for validating that the tool is working as expected. Thanks Annie Swafford!

“But wait a second Jockers, what about those rascally ringing artifacts you promised.”

Yes, yes, there can indeed be ringing artifacts.  Let’s go get some. . . .

Annie follows her previous analysis with what seems like an even more extreme example.  She neutralizes everything in Joyce’s Portrait except for the middle 20 sentences of the novel.[3] When the resulting graph looks a lot like the original man-in-hole graph, she says, in essence: “Busted! there is your ringing artifact Dr. J!”  Figure 7 is the graphic from her blog.


Figure 7: Only 20 (sic) sentences of Portrait

Busted indeed!  Those positive valence humps, peaking at 25 and 75 on the x-axis are dead ringers for ringers.  We know from constructing the experiment in this manner, that everything from 0 to ~49 and everything from ~51 to 100 on the x-axis is perfectly neutral, and yet the tool, the visualization, is revealing two positive humps before and after the middle section: horrible, happy, phantom humps that do not exist in the story!

But wait. . .

With all smoothing methods some degree of imprecision is to be expected.  Remember what Vonnegut points out: this is “an exercise in relativity.”  Relatively speaking, even the extreme example in figure 7 is, in my opinion, not too bad.  Just imagine a hypothetical protagonist cruising along in a hypothetical novel such as the one Annie has written with her neutral values.  This protagonist is feeling pretty good about all that neutrality; she ain’t feeling great, but she’s better than bad.  Then she hits that negative section . . . as Vonnegut would say, “oh, God damn it.”[4]  But then things get better, or more precisely, things get comparatively better.  So, the blue line is not a great representation of the narrative, but it’s not a bad approximation either.

But look, I understand my colleague’s concern for more precision, and I don’t want it to appear that I’m taking this ringing business too lightly.  Figure 8 (below) was produced using precisely the same data that Annie used in her two-sentence example; everything is neutralized except for those two sentences from the exact middle of the novel.  This time, however,  I have used a low pass filter set at 100.  Voila!  The new shape (blue) is nothing at all like the original (orange), and the new shape also provides the deep level of detail–and lack of ringing–that some users may desire.[5]  Unfortunately, using such a high, low-pass filter does not usually produce easily interpretable graphs such as seen in figure 8.


Figure 8: Original shape with neutralized “Swafford Shape” using 100 components

In this very simple example, turning the low-pass filter up to 100 produces a graph that is very easy to read/interpret.   When we begin looking at real novels, however, a low-pass of 100 does not result in shapes that are very easy to visually interpret, and it becomes necessary to smooth them.  I think that is what visualization is all about, which is to say, simplifying the complex so that we can get the gist.  One way to simplify these emotional trajectories is to use a low, low pass filter.  Given that going low may cause more ringing, you need to decide just how low you can go.  Another option, that I demonstrated in my previous post, is to use a high value for the low pass filter (to avoid potential ringing) and then apply a lowess smoother (or your own favorite smoother) in order to reveal the “simple shape” (see figure 1 of http://www.matthewjockers.net/2015/03/09/is-that-your-syuzhet-ringing/).

In a future post, I’ll explore something I mentioned to Annie in our email exchange (prior to her public critique): an ad hoc method I’ve been working on that seeks to identify an “ideal” number of components for the low pass filter.


Figure 9: Dorian Gray behaving exactly as we would expect with last third neutralized

FOOTNOTES:

[1] Annie does not actually explain that the low-pass filter is a user controlled parameter or that what she is actually testing is the efficacy of the default value.  Users of the tool are welcome to experiment with different values for the low pass filter as I have done here: Is that your Syuzhet Ringing.

[2] I’ve been calling these simple shapes “emotional trajectories” and “plot.” Plot is potentially controversial here, so if folks would like to argue that point, I’m sympathetic.  For the first year of this research, I never used the word “plot,” choosing instead “emotional trajectory” or “simple shape,” which is Vonnegut’s term.  I realize plot is a loaded and nuanced word, but “emotional trajectory” and “simple shape” are just not really part of our nomenclature, so plot is my default term.

[3] There is a small discrepancy between Annie’s blog and her code.  Correction: Annie writes about and includes a graph showing the middle “20” sentences, but then provides code for retaining both the middle 2 and the middle 20 sentences.  Either way the point is the same.

[4] The two negative valence sentences from the middle of Portrait are as follows: “Nay, things which are good in themselves become evil in hell. Company, elsewhere a source of comfort to the afflicted, will be there a continual torment: knowledge, so much longed for as the chief good of the intellect, will there be hated worse than ignorance: light, so much coveted by all creatures from the lord of creation down to the humblest plant in the forest, will be loathed intensely.”

[5]  Annie has written that “Syuzhet computes foundation shapes by discarding all but the lowest terms of the Fourier transform.” That is a rather misleading comment. The low-pass filter is set to 3 by default, but it is a user-tunable parameter.  I explained my reasons for choosing 3 as the default in my email exchange with Annie prior to her critique.   It is unclear to me why Annie does not mention my explanation, so here it is from our email exchange:

“. . . The short and perhaps unsatisfying answer is that I selected 3 based on a good deal of trial and error and several attempts to employ some standard filters that seek to identify a cutoff / threshold by examining the frequencies (ideal, butterworth, and several others that I don’t remember any more).  The trouble with these, and why I selected 3 as the default, is that once you go higher than 3 the resulting plots get rather more complicated, and the goal, of course, is to do the opposite, which is to say that I seek to reduce the plot to a simple base form (along the lines of what Vonnegut is suggesting).  Three isn’t magic, but it does seem to work well at rooting out the foundational shape of the story.  Does it miss some of the subtleties, yep, but again, that is the point, in part.  The longer answer is that this is something I’m still experimenting with.  I have one idea that I’m working with now…”

Is that Your Syuzhet Ringing?

Over the weekend, Annie Swafford published another installment in her ongoing critique of Syuzhet, the R package that I released in early February. In her recent blog post, an interesting approach for testing the get_transformed_values function is proposed[1].

Previously Annie had noted how using the default values for the low-pass filter may result in too much information loss, to which I replied that that is the point.  (Readers hung up on this point are advised to go back and watch the Vonnegut video again.) With any kind of smoothing, there is going to be information loss.  The function is designed to allow the user to tune the low pass filter for greater or lesser degrees of noise (an important point that I shall return to in a moment).

In the new post, Annie explores the efficacy of leaving the low pass filter at its default value of 3; she demonstrates how this value appears to produce a ringing artifact.  This is something that the two of us had discussed at some length in an email correspondence prior to this blogging frenzy.  In that correspondence, I promised to explore adding a gaussian filter to the package, a filter she believes would be more appropriate. Based on her advice, I have explored that option, and will do so further, but for now I remain unconvinced that there is a problem for Gauss to solve.[2]

As I said in my previous post, I believe the true test of the method lies in assessing whether or not the shapes produced by the transformation are a good approximation of the shape of the story. But remember too, that the primary point of the transformation function is to solve the problem of length; it is hard to compare the plot shape of a long novel to a short one.  The low-pass argument is essentially a visualization and noise-reduction parameter.   Users who want a closer, scene-by-scene or sentence-by-sentence representation of the sentiment data will likely gravitate to the get_percentage_values function (and a very large number of bins) as, for example, Lincoln Mullen has done on Rpubs.[3]

The downside to that approach, of course, is that you cannot compare two sentiment arcs mathematically; you can only do so by eye.  You cannot compare them mathematically because the amount of text inside each percentage segment will be quite different if the novels are of different lengths, and that would not be a fair comparison.  The transformation function is my attempt at solving this time domain conundrum.  While I believe that it solves the problem well, I’m certainly open to other options.  If we decide that the transformation function is no good, that it produces too much ringing, etc. then we should look for a more attractive alternative.  Until an alternative is found and demonstrated, I’m not going to allow the perfect to become the enemy of the good.

But, alas, here we are once again on the question of defining what is “good” and what is “good enough.”  So let us turn now to that question and this matter of ringing artifacts.

The problem of ringing artifacts is well understood in the signal processing literature if a bit less so in the narratological literature:-)  Annie has done a fine job of explicating the nature of this problem, and I can’t help thinking that this is a very clever idea of hers.  In fact, I wrote to Annie acknowledging this and noting how I wish I had thought of it myself.

But after repeating her experiment a number of times, with greater and lesser degrees of success, I decided that this exercise is ultimately a bit of a red herring.  Among other things, there are no books in which an entire third of the sentences carry nothing but neutral values, but more importantly the exercise has more to do with the setting of a particular user parameter than it does with the package.

I’d like to now offer a bit of cake and eat it too.  This most recent criticism has focused on the default values for the low-pass filter that I set for the function. There is, of course, nothing preventing adjustment of that parameter by those with a taste for adventure.  The higher the number, the greater the number of components that are retained; the more components we retain, the less ringing and the closer we get to reproducing the original signal.

So let us assume for a moment that the sentiment detection methods all work perfectly. We know as a matter of fact that they don’t work perfectly (you know, like human beings), but this matter of imprecision is something we have already covered in a previous post where I showed that the three dictionary based methods tend to agree with each other and with the more sophisticated Stanford method.  So even though we know we are not getting every sentence’s sentiment just right, let’s pretend that we are, if only for a moment.

With that assumed, let us now recall the primary rationale for the Fourier transformation: to normalize the length of the x-axis.  As it happens, we can do that normalization (the cake) and also retain a great many more components than the 3 default components (eating it).  Figure 1 shows Joyce’s Portrait of the Artist transformed using a low pass filter size of 100.

This produces a graph with a lot more noise, but we have effectively eliminated any objectionable ringing.  With the addition of a smoothing line (lowess function in R), what we see once again (ta da) is a beautiful, if rather less dramatic, example of Vonnegut’s Man in Hole!  And this is precisely the goal, to reveal the plot shape latent in the noise.  The smaller low-pass filter accentuates this effect; the higher low-pass filter provides more information: both show the same essential shape.
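Here is a sketch of that cake-and-eat-it-too recipe: a high value for the low-pass filter, then a lowess smoothing line layered on top. The file path is a placeholder for a plain-text Portrait, and the argument names follow the function documentation.

library(syuzhet)
raw_vals  <- get_sentiment(get_sentences(get_text_as_string("portrait.txt")))   # placeholder path
shape_100 <- get_transformed_values(raw_vals, low_pass_size = 100, x_reverse_len = 100)
plot(shape_100, type = "l", xlab = "narrative time", ylab = "sentiment")
lines(lowess(seq_along(shape_100), shape_100, f = 1/3), lwd = 2)   # the smoothing line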


Figure 1: Portrait with low pass at 100


Figure 2: Portrait with low pass at 3


Figure 3: Portrait with low pass at 20

In the course of this research, I have hand examined the transformed shapes for several dozen novels.  The number of novels I have examined corresponds to the number that I feel I know well enough to assess (and also happen to possess in digital form).  These include such old and new favorites as:

  • Portrait of the Artist
  • Picture of Dorian Gray
  • Ulysses
  • Blood Meridian
  • Gone Girl
  • Finnegans Wake (nah, just kidding)
  • . . .
  • And many more.

As I noted in my previous post, the only way to determine the efficacy of this model is to see if it approximates reality.  We have now plotted Portrait of the Artist six ways to Sunday, and every time we have seen a version of the same man in hole shape.  I’ve read this book 20 times, I have taught this book a dozen times.  It is a man in hole plot.

In my (admittedly) anecdotal evaluations, I have continued to see convincing graphs, such as the one above (and the one below in figure 4).  I have found a few special books that don’t do very well, but that is a story you will have to wait for (spoiler alert, they are not works of satire or dark humor, but they are multi-plot novels involving parallel stories).

Still, I am open to the possibility of some confirmation bias here.  And this is why I wanted to release the package in the first place.  I had hoped that putting the code on GitHub would entice others toward innovation within the code, but the unexpected criticism has certainly been healthy too, and this conversation has made me think of ways that the functions could be improved.

In retrospect, it may have been better to wait until the full paper was complete before distributing the code.  Most of the things we have covered in the last few weeks on this blog are things that get discussed in finer detail in the paper. Despite more details to come, I believe, as Dryden might say, that the last (plot) line is now sufficiently explicated.

Bonus Images:


Figure 4

In terms of basic shape, Figure 4 is remarkably similar to the more dramatized version seen in figure 5 below.  If you can’t see it, you aren’t reading enough Vonnegut.


Figure 5

[1] How's that for some awkward passive voice? A few on Twitter have expressed some thoughts on my use of Annie's first name in my earlier response.  Regular readers of this blog will know that I am consistent in referring to people by their full names upon first mention and by their first names thereafter.  Previous victims of my "house style" have included David Mimno, David; Dana Mackenzie, Dana; Ben Schmidt, Ben; Franco Moretti, Franco; and Julia Flanders, Julia.  There are probably others.

[2] Anyone losing sleep over this Gaussian filter business is welcome to grab the code and give it a whirl.

[3] In the essay I am writing about this work, I address a number of the nuances that I have skipped over in these blog posts.  One of the nuances I discuss is an automated process for the selection of a low-pass filter size.

Some thoughts on Annie’s thoughts . . . about Syuzhet

Annie Swafford has raised a couple of interesting points about how the syuzhet package works to estimate the emotional trajectory in a novel, a trajectory which I have suggested serves as a handy proxy for plot (in the spirit of Kurt Vonnegut).

Annie expresses some concern about the level of precision the tool provides and suggests that dictionary based methods (such as the three I include as options in syuzhet) are not reliable. She writes, "Sentiment analysis based solely on word-by-word lexicon lookups is really not state-of-the-art at all." That's fair, I suppose, but those three lexicons are benchmarks of some importance, and they deserve to be included in the package if for no other reason than for comparison.  Frankly, I don't think any of the current sentiment detection methods are especially reliable. The Stanford tagger has a reputation for being the main contender for the title of "best in the open source market," but even it hovers around 80–83% accuracy.  My own tests have shown that performance depends a good deal on genre/register.

But Annie seems especially concerned about the three dictionary methods in the package. She writes, "sentiment analysis as it is implemented in the syuzhet package does not correctly identify the sentiment of sentences." Given that sentiment is a subtle and nuanced thing, I'm not sure that "correct" is the right word here. I'm not convinced there is a "correct" answer when it comes to this question of valence. I do agree, however, that some answers are more correct than others and that, to be useful, we need to be on the closer side. The question to address, then, is whether we are close enough, and that's a hard one. We would probably find a good deal of human agreement when it comes to the extremes of sentiment, but there are a lot of tricky cases, grey areas where I'm not sure we would all agree.  We certainly cannot expect the tool to perform better than a person, so we need some flexibility in our definition of "correct."

Take, for example, the sentence "I studied at Leland Stanford Junior University." The state-of-the-art Stanford sentiment parser scores this sentence as "negative." I think that is incorrect (you are welcome to disagree;-). The "bing" method, which I have implemented as the default in syuzhet, scores this sentence as neutral, as does the "afinn" method (also in the package). The NRC method scores it as slightly positive. So, which one is correct? We could go all Derrida on this sentence and deconstruct each word, unpack what "junior" really means. We could probably even "problematize" it! . . . But let's not.

What Annie writes about dictionary based methods not being the most state-of-the-art is true from a technical standpoint, but sophisticated methods and complexity do not necessarily correlate with better results.  Annie suggests that "getting the Stanford package to work consistently would go a long way towards addressing some of these issues," but as we saw with the sentence above, simple beat sophisticated, hands down.[1]

Consider another sentence: "Syuzhet is not beautiful." All four methods score this sentence as positive; even the Stanford tool, which tends to do a better job with negation, says "positive."

It is easy to find opposite cases where sophisticated wins the day. Consider this more complex sentence: "He was not the sort of man that one would describe as especially handsome." Both NRC and Afinn score this sentence as neutral; Bing scores it slightly positive, and Stanford scores it slightly negative. When it comes to negation, the Stanford tool tends to perform a bit better, but not always. The very similar sentence "She was not the sort of woman that one would describe as beautiful" is scored slightly positive by all four methods.

What I have found in my testing is that these four methods usually agree with each other, not exactly but close enough. Because the Stanford parser is very computationally expensive and requires special installation, I focused the examples in the Syuzhet Package Vignette on the three dictionary based methods. All three are lightning fast by comparison, and all three have the benefit of simplicity.

But, are they good enough compared to the more sophisticated Stanford parser?

Below are two graphics showing how the methods stack up over a longer piece of text. The first image shows sentiment using percentage based segmentation as implemented in the get_percentage_values() function.


Four Methods Compared using Percentage Segmentation

The three dictionary methods appear to be a bit closer, but all four methods do create the same basic shape.  The next image shows the same data after normalization using the get_transformed_values() function.  Here the similarity is even more pronounced.


Four Methods Compared Using Transformed Values

While we could legitimately argue about the accuracy of one sentence here or one sentence there, as Annie has done, that is not the point. The point is to reveal a latent emotional trajectory that represents the general sense of the novel’s plot. In this example, all four methods make it pretty clear what that shape is: This is what Vonnegut called “Man in Hole.”

The sentence level precision that Annie wants is probably not possible, at least not right now.  While I am sympathetic to the position, I would argue that for this particular use case, it really does not matter.  The tool simply has to be good enough, not perfect.  If the overall shape mirrors our sense of the novel’s plot, then the tool is working, and this is the area where I think there is still a lot of validation work to do.  Part of the impetus for releasing the package was to allow other people to experiment and report results.  I’ve looked at a lot of graphs, but there is a limit to the number of books that I know well enough to be able to make an objective comparison between the Syuzhet graph and my close reading of the book.

This is another place where Annie raises some red flags.  Annie calls attention to these two images (below) from my earlier post and complains that the transformed graph is not a good representation of the noisy raw data.  She writes:

The full trajectory opens with a largely flat stretch and a strong negative spike around x=1100 that then rises back to be neutral by about x=1500. The foundation shape, on the other hand, opens with a rise, and in fact peaks in positivity right around where the original signal peaks in negativity. In other words, the foundation shape for the first part of the book is not merely inaccurate, but in fact exactly opposite the actual shape of the original graph.

Annie's reading of the graphs, though, is inconsistent with the overall plot of the novel, whereas the transformed plot is perfectly consistent with the novel. What Annie calls a "strong negative spike" is the scene in which Stephen is pandied by Father Dolan.  It is an important negative moment, to be sure, but not nearly as important, or as negative, as the major dip that occurs midway through the novel–when Stephen experiences Hell. The pandying scene is a minor blip compared to the pages and pages of hell and the pages and pages of anguish Stephen experiences before his confession.

Figures: the raw (noisy) sentiment values and the transformed foundation shape

Annie is absolutely correct in noting that there is information loss, but wrong in arguing that the graph fails to represent the novel.  The tool has done what it was designed to do: it successfully reveals the overall shape of the narrative.  The first third of the novel and the last third of the novel are considerably more positive than the middle section.  But this is not meant to say or imply that the beginning and end are without negative moments.

It is perfectly reasonable to want to see more of the page-to-page, or scene-by-scene, fluctuations in sentiment, and that can be easily achieved by using the percentage segmentation method or by altering the low-pass filter size.  Changing the filter size to retain five components instead of three results in the graph below.  This new graph captures that "strong negative spike" (not so "strong" compared to hell) and reveals more of the novel's ups and downs.  This graph also provides more detail about the end of the novel, where Stephen comes down off his bird-girl high and moves toward a more sober perspective on his future.


Portrait with Five Components

Of course, the other reason for releasing the code is so that I can get suggestions for improvements. Annie (and a few others) have already prompted me to tweak several functions.  Annie found (and reported on her blog) some legitimate flaws in the openNLP sentence parser. When it comes to passages with dialog, the openNLP parser falls down on the job. I ran a few dialog tests (including Annie's example) and was able to fix the great majority of the sentence parsing errors by simply stripping out the quotation marks in advance. Based on Annie's feedback, I've added a "quote stripping" parameter to the get_sentences() function. It's all freshly baked and updated on GitHub.

But finally, I want to comment on Annie’s suggestion that

some texts use irony and dark humor for more extended periods than you [that’s me] suggest in that footnote—an assumption that can be tested by comparing human-annotated texts with the Syuzhet package.

I think that would be a great test, and I hope that Annie will consider working with me, or in parallel, to test it.  If anyone has any human annotated novels, please send them my/our way!

Things like irony, metaphor, and dark humor are the monsters under the bed that keep me up at night. Still, I would not have released this code without doing a little bit of testing:-). These monsters can indeed wreak a bit of havoc, but usually they are all shadow and no teeth. Take the colloquial expression “That’s some bad R code, man.” This sentence is supposed to mean the opposite, as in “That is a fine bit of R coding, sir.”  This is a sentence the tool is not likely to get right; but, then again, this sentence also messes up my young daughter, and it tends to confuse English language learners. I have yet to find any sustained examples of this sort of construction in typical prose fiction, and I have made a fairly careful study of the emotional outliers in my corpus.

Satire, extended satire in particular, is probably a more serious monster.  Still, I would argue that the sentiment tools perform exactly as expected; they just don't understand what they are "reading" in the way that we do.  Then again, and this is no fabrication, I have had some (as in too many) college students over the years who haven't understood what they were reading and thought that Swift was being serious about eating succulent little babies in his Modest Proposal (those kooky Irish)!

So, some human beings interpret the sentiment in Modest Proposal exactly as the sentiment parser does, which is to say, literally! (Check out the special bonus material at the bottom of this post for a graph of Modest Proposal.) I’d love to have a tool that could detect satire, irony, dark humor and the like, but such a tool is still a good ways off.  In the meantime, we can take comfort in incremental progress.

Special thanks to Annie Swafford for prompting a stimulating discussion.  Here is all the code necessary to repeat the experiments discussed above. . .

library(syuzhet)
path_to_a_text_file <- system.file("extdata", "portrait.txt",
package = "syuzhet")
joyces_portrait <- get_text_as_string(path_to_a_text_file)
poa_v <- get_sentences(joyces_portrait)

# Get the four sentiment vectors
stanford_sent <- get_sentiment(poa_v, method="stanford", "/Applications/stanford-corenlp-full-2014-01-04")
bing_sent <- get_sentiment(poa_v, method="bing")
afinn_sent <- get_sentiment(poa_v, method="afinn")
nrc_sent <- get_sentiment(poa_v, method="nrc")

######################################################
# Plot them using percentage segmentation
######################################################
plot(
  scale(get_percentage_values(stanford_sent, 10)), 
  type = "l", 
  main = "Joyce's Portrait Using All Four Methods\n and Percentage Based Segmentation", 
  xlab = "Narrative Time", 
  ylab = "Emotional Valence",
  ylim = c(-3, 3)
)
lines(
  scale(get_percentage_values(bing_sent, 10)),
  col = "red", 
  lwd = 2
)
lines(
  scale(get_percentage_values(afinn_sent, 10)),
  col = "blue", 
  lwd = 2
)
lines(
  scale(get_percentage_values(nrc_sent, 10)),
  col = "green", 
  lwd = 2
)
legend('topleft', c("Stanford", "Bing", "Afinn", "NRC"), lty=1, col=c('black', 'red', 'blue', 'green'), bty='n', cex=.75)

######################################################
# Transform the Sentiments
######################################################
stan_trans <- get_transformed_values(
  stanford_sent, 
  low_pass_size = 3, 
  x_reverse_len = 100,
  scale_vals = TRUE,
  scale_range = FALSE
)
bing_trans <- get_transformed_values(
  bing_sent, 
  low_pass_size = 3, 
  x_reverse_len = 100,
  scale_vals = TRUE,
  scale_range = FALSE
)
afinn_trans <- get_transformed_values(
  afinn_sent, 
  low_pass_size = 3, 
  x_reverse_len = 100,
  scale_vals = TRUE,
  scale_range = FALSE
)

nrc_trans <- get_transformed_values(
  nrc_sent, 
  low_pass_size = 3, 
  x_reverse_len = 100,
  scale_vals = TRUE,
  scale_range = FALSE
)

######################################################
# Plot them all
######################################################
plot(
  stan_trans, 
  type = "l", 
  main = "Joyce's Portrait Using All Four Methods", 
  xlab = "Narrative Time", 
  ylab = "Emotional Valence",
  ylim = c(-2, 2)
)

lines(
  bing_trans,
  col = "red", 
  lwd = 2
)
lines(
  afinn_trans,
  col = "blue", 
  lwd = 2
)
lines(
  nrc_trans,
  col = "green", 
  lwd = 2
)
legend('topleft', c("Stanford", "Bing", "Afinn", "NRC"), lty=1, col=c('black', 'red', 'blue', 'green'), bty='n', cex=.75)


######################################################
# Sentence Parsing Annie's Example
######################################################
annies_sentences_w_quotes <- '"Mrs. Rachael, I needn’t inform you who were acquainted with the late Miss Barbary’s affairs, that her means die with her and that this young lady, now her aunt is dead–" "My aunt, sir!" "It is really of no use carrying on a deception when no object is to be gained by it," said Mr. Kenge smoothly, "Aunt in fact, though not in law."'

# Strip out the quotation marks
annies_sentences_no_quotes <- gsub("\"", "", annies_sentences_w_quotes)

# With quotes, Not Very Good:
s_v <- get_sentences(annies_sentences_w_quotes)
s_v

# Without quotes, Better:
s_v_nq <- get_sentences(annies_sentences_no_quotes)
s_v_nq

######################################################
# Some Sentence Comparisons
######################################################
# Test one
test <- "He was not the sort of man that one would describe as especially handsome."
stanford_sent <- get_sentiment(test, method="stanford", "/Applications/stanford-corenlp-full-2014-01-04")
bing_sent <- get_sentiment(test, method="bing")
nrc_sent <- get_sentiment(test, method="nrc")
afinn_sent <- get_sentiment(test, method="afinn")
stanford_sent; bing_sent; nrc_sent; afinn_sent

# test 2
test <- "She was not the sort of woman that one would describe as beautiful."
stanford_sent <- get_sentiment(test, method="stanford", "/Applications/stanford-corenlp-full-2014-01-04")
bing_sent <- get_sentiment(test, method="bing")
nrc_sent <- get_sentiment(test, method="nrc")
afinn_sent <- get_sentiment(test, method="afinn")
stanford_sent; bing_sent; nrc_sent; afinn_sent

# test 3
test <- "That's some bad R code, man."
stanford_sent <- get_sentiment(test, method="stanford", "/Applications/stanford-corenlp-full-2014-01-04")
bing_sent <- get_sentiment(test, method="bing")
nrc_sent <- get_sentiment(test, method="nrc")
afinn_sent <- get_sentiment(test, method="afinn")
stanford_sent; bing_sent; nrc_sent; afinn_sent

SPECIAL BONUS MATERIAL

Swift’s classic satire presents some sentiment challenges.  There is disagreement between the Stanford method and the other three in segment four where the sentiments move in opposite directions.

Figure: A Modest Proposal, four methods compared using percentage-based segmentation

FOOTNOTE

[1] By the way, I’m not sure if Annie was suggesting that the Stanford parser was not working because she could not get it to work (the NAs) or because there was something wrong in the syuzhet package code. The code, as written, works just fine on the two machines I have available for testing. I’d appreciate hearing from others who are having problems; my implementation definitely qualifies as a first class hack.

The Rest of the Story

My blog post on February 2, about the Syuzhet package I developed for R (now available on CRAN), generated some nice press that I was not expecting: Motherboard, then The Paris Review, and several R blogs (Revolutions, R-Bloggers, inside-R) all featured the work.  The press was nice, but I was not at all prepared for the focus to be placed on the one piece of the story that I had yet to explain, namely, how I used the Syuzhet code and some unsupervised machine clustering to identify what seem to be six, or possibly seven, archetypal plot shapes.  So, here now is the rest of the story. . .

In brief: (A Plot Modeling Recipe)

  1. Apply functions available in the Syuzhet package to generate a generalized plot shape for every book in a corpus of 41,383 novels.[1]
  2. Employ Euclidean distance to build a large distance matrix by computing the similarity between every pair of novels.
  3. Use unsupervised hierarchical clustering to group books based on the similarity of their plot shape.
  4. Examine the resulting clusters with furrowed brow and say “hmmmm.”
  5. Test several methods of cluster identification (silhouette, gap statistic, elbow).
  6. Develop ad-hoc cluster identification algorithm.
  7. Observe that there are six, or maybe seven, fundamental plot shapes.
  8. Repeat everything over and over again for 12 months while worrying a lot about observing six or seven plots.

Caveats:

Before I reveal the six/seven plots (scroll down if you can’t wait), it’s important to point out that what I offer here is the result of two particular methods of analysis.  If you don’t like the plot shapes that these methods reveal, then you’ll be free to take issue with the methods and try a different approach.  You could, for example,

  1. Read 41,383 novels and sketch the plots of each using Vonnegut's chalkboard. You could then spend a few decades organizing and classifying them into some sort of taxonomy.  You could then work on clustering them into a finite set of foundational shapes.  This is more or less the method Vonnegut employed, except, of course, that he probably only read a few hundred stories and sketched out only a few dozen on his chalkboard.
  2. You could use another method, such as the one that Benjamin Schmidt has proposed over at his Sapping Attention blog.

Background:

In my previous post, I explained how I developed some software (named "Syuzhet" in homage to Propp) to extract plot shapes from novels based on sentiment analysis.  In order to understand how I derive the six/seven plot archetypes, we need to understand a little bit about Euclidean distance and hierarchical clustering.  The former provides a mathematical way of computing the similarity or distance between two points in space.  When that space is two dimensional, it's pretty easy to visualize what is going on: we plot two points on an x-y grid and then measure the distance between them.  When the space is three dimensional, it gets a bit harder, but you can still imagine measuring the distance between some point about three feet off the floor in your kitchen and some point about five feet off the floor in your living room.  Once we go beyond the third dimension, things get downright tricky, and we have to rely on the mathematics of the Euclidean metric. Regardless of the dimensions, though, the fundamental idea is the same: we are measuring the distance between points, and the shorter that distance, the more similar the points are.  In this case the points are books, and the feature that determines their position in space is their "plot shape" as derived from Syuzhet.

Once the distances between all the points are measured, we construct a "distance matrix."  This distance matrix is just a big spreadsheet where we can look up the distance from any one point to any other point.  It might look something like Figure 1.  According to this matrix, the distance between Book 1 and Book 3 is "0.5," whereas the distance between Book 2 and Book 3 is "0.25."


Figure 1: A Distance Matrix
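
In base R, this step amounts to little more than a call to dist(). Here is a toy sketch; the shapes matrix is a stand-in, and in the real case each row would hold the 100 transformed values for one novel:

# one row per book; each row is a (toy) plot shape
shapes <- rbind(
  book_1 = c(0.1, 0.5, 0.9, 0.5, 0.1),
  book_2 = c(0.2, 0.4, 0.8, 0.4, 0.2),
  book_3 = c(0.9, 0.5, 0.1, 0.5, 0.9)
)
d <- dist(shapes, method = "euclidean")  # pairwise Euclidean distances
round(as.matrix(d), 2)                   # view the result as a spreadsheet-like matrix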

Hierarchical clustering methods use this distance matrix as a foundation upon which to build a hierarchy of similarities. This hierarchy is often visualized as a dendrogram such as seen in Figure 2.


Figure 2: Dendrogram

Figure 2 is a bit like a tree (upside down); it has branches.  At any vertical point, we can cut this tree, and the result is to separate it into two or more branches, or clusters.  For example, cutting the tree in Figure 2 at a height of 225 would result in four primary clusters.  The trick with this sort of tree cutting is identifying an "ideal" vertical position to insert the saw.  Before I get to that, though, we need to step back for a moment to those plots created with the Syuzhet software.

The Plot Thickens

In my previous post, I showed what the plots of Joyce's Portrait and Wilde's Dorian Gray look like when graphed using Syuzhet.  Underneath each plot graph is a sequence of 100 numbers from which the shape of the plot is derived.  I have collected these sequences for 41,383 novels, and when I average them, I get the "super average plot archetype" seen in Figure 3.


Figure 3: The Super Average Plot

That is kind of interesting, but things get a lot more interesting after a bit of tree cutting. If you look at the dendrogram in Figure 2 again, you see that cutting the tree just below 250 will result in two primary clusters.  After cutting the tree at that point, it is then possible to calculate a mean shape for all the books in each cluster. The result is seen in Figure 4.


Figure 4: Two Primary Plots

In homage to Vonnegut, I have titled the shape on the left "man in hole." 46% of the books in this corpus fall into this cluster.  The remaining 54% are more similar to the plot on the right, which I have named "man on hill."  At this point, I'd encourage you to take a quick peek at Maya Eilam's very nice visualization of Vonnegut's archetypal plot shapes.  The plots I'll show here are not going to look quite the same, but there will be some resonance.
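
For those who want to see the mechanics of that cut-and-average step, here is a minimal sketch. It assumes a shapes matrix and distance matrix d like the toy ones above; which cluster ends up labeled MIH and which MOH is a matter of inspection, not something the code decides:

hc       <- hclust(d)          # hierarchical clustering on the distance matrix
clusters <- cutree(hc, k = 2)  # cut the tree into the two primary clusters
# mean plot shape for each cluster; the MIH/MOH labels come from looking at the shapes
mih <- colMeans(shapes[clusters == 1, , drop = FALSE])
moh <- colMeans(shapes[clusters == 2, , drop = FALSE])
plot(mih, type = "l", ylim = range(c(mih, moh)),
  xlab = "Narrative Time", ylab = "Emotional Valence")
lines(moh, col = "red")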

Looking again at the dendrogram, you can see that the two primary clusters (MOH and MIH) can be split fairly easily into a set of four clusters.  When the tree is cut in this manner, the two plots shown in Figure 4 split into four.


Figure 5: MIH Types I and II

Figure 5 shows the derivatives of the man in hole plot shape.  The man in hole plot splits into one shape (“Type I”) that looks a lot like classical tragedy and another (“Type II”) that looks more like comedy.  Whatever the case, one has a much happier ending than the other.  Figure 6 shows the derivatives of the man on hill.


Figure 6: Man on Hill Types I and II

Here again, one plot leads us to a happy ending and the other to a rather dark conclusion.

Cutting the tree beyond these four shapes gets trickier.  It is difficult to know where precisely to stop and cut.  Move the cut point just a little bit, and we could go from having 10 clusters to 20; it is possible, in fact, to keep moving the cut point further and further down the tree until every book is its own cluster!  Doing that, however, would be rather silly (see "Caveats" item 1 above).  So the objective is to find an "ideal" place to cut the tree such that the resulting clusters have the greatest amount of internal homogeneity while simultaneously being as different from each other as possible.

My solution to this problem involves iterating through a series of possible cut points and then taking two measures after each cutting.  The first is a measure of cluster homogeneity; the second is a measure of cluster dissimilarity.  This process is more easily described in pseudocode:

Let K be a number of possible clusters from 2 to 50.

for(K in 2:50){
- cut the tree such that there are K clusters
- calculate the amount of in-cluster homogeneity
- calculate the dissimilarity between the K clusters
}
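
To make the pseudocode concrete, here is one way it might be fleshed out in base R. This is a sketch only: the two measures used here (total within-cluster sum of squares for homogeneity and mean distance between cluster centroids for dissimilarity) are reasonable stand-ins for illustration, not necessarily the measures described in the longer paper. It assumes the shapes matrix and the hc clustering from the sketches above.

# assumes `shapes` (one row per book) and `hc` (the hclust result) from above
k_values <- 2:min(50, nrow(shapes) - 1)
results  <- data.frame(K = k_values, homogeneity = NA, dissimilarity = NA)
for (i in seq_along(k_values)) {
  K         <- k_values[i]
  clusters  <- cutree(hc, k = K)
  groups    <- split.data.frame(shapes, clusters)  # the rows belonging to each cluster
  centroids <- t(sapply(groups, colMeans))         # mean shape of each cluster
  # in-cluster homogeneity: total within-cluster sum of squares (lower = tighter clusters)
  results$homogeneity[i] <- sum(sapply(groups, function(m) sum(scale(m, scale = FALSE)^2)))
  # between-cluster dissimilarity: mean distance between the cluster centroids
  results$dissimilarity[i] <- mean(dist(centroids))
}
results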

With each iteration, I store the resulting values so that I can compare them and identify a value of K that best fulfills the objectives described above.  In order to make this test more robust, I opted to randomly select a subset of one half of the books in the corpus (roughly 20K) and run this test over and over again (each time with a new random sample).  When I did this, I found that the method identified six as the ideal number of clusters about 90% of the time.  The other 10% of the time, it said that seven or eight was a better choice.[2]

In addition to this mathematical approach, I also employed good old subjective evaluation.  The tool suggested six or seven, but this number (six, seven) would be rather useless if the resulting shapes did not make any sense to those of us who actually read the books.  So, I looked at a lot of plots; everything from two to twenty.  After twenty, I figure there is not much point, because the shapes get so similar to each other that it would be rather hard to make the case that plot 19 is really all that different from plot 20.  With six and with seven, however, there remains a good deal of variation.

We saw above how MIH and MOH both split into sub-types.  These I labeled as MIH Type I, MIH Type II, MOH Type I, and MOH Type II.  At the cut point that results in six plots, MIH Type I and MOH Type II stay as we saw them above in Figures 5 and 6, but MIH II and MOH I both split, resulting in the shapes seen in Figure 7.


Figure 7: Level Six

Already we can begin to see some shape repetition.  The variant of MIH seen in the lower right is ultimately a steeper, or more extreme, version of the basic MIH.  The other three, though, appear rather more distinct.

At level seven, MOH II splits in two, resulting in the shapes shown in Figure 8. After seven, we begin to see a lot more shape repetition, and though each of these shapes is unique in terms of its precise placement on the y-axis (i.e., some are happier, others darker), the arcs are generally similar.

Obviously, there is a great deal more interpretive work to be done here.  Many of these shapes, I think, can be further classified according to their "affects" and "effects." What, for example, is the overall impression one gets from a book that takes a character to great heights (MOH) and then plunges him/her into a pit of despair from which there is no exit (as is seen in Figure 8, left)?


Figure 8: Seven Plots

But perhaps even more interesting than any of this is the possibility for movement between scales.  Scale hopping is something I advocate in Macroanalysis.  The great power of big(ish) data is that it allows us to contextualize our small reading.  Joyce’s Portrait of the Artist (Figure 9) is a type of MIH.  What other books are MIHs?  Are they popular books?  Are they classics?  Best sellers?  Can we find another telling of the same story?  This is the work that I am doing now, moving from the large to the small and back again. Figures 10-15 (below) present six popular/well-known novels and their corresponding plot types for consideration.

[Update March 2: Annie Swafford offers an interesting critique of this work on her blog.  Her post includes some comments from me in response.]


Figure 9: Joyce’s Portrait

 

Figure 10

Figure 11

Figure 12

Figure 13

Figure 14

Figure 15

Footnotes:

[1] The Syuzhet package performs a certain type of text analysis, and I'm claiming that the results of this analysis may serve as a pretty darn good proxy for plot.  That said, I've been working on this problem for two years, and I know some specific places where it fails.  The most spectacular example of failure was discovered by my son. He'd just finished reading one of the books in my corpus, and I showed him the plot shape from the book and asked him if it made sense. He said, "Well, yes, mostly.  But this spike here is all wrong."  It was a spike in good fortune, positive valence, at precisely the place in the novel where the villains had scored a major victory.  The positive valence was associated with a several-page-long section in which the bad guys were having a very good time. Readers, of course, would see this as a negative moment in the text; Syuzhet does not.  Nor does Syuzhet understand irony and dark humor and so on.  On the whole, however, Syuzhet gets it right, and that's because most books are not sustained satire or sustained irony.  Most books end up using emotional markers in a fairly consistent and conventional way.  Indeed, even for an experimental novel such as Joyce's Ulysses, Syuzhet produces a plot shape that I consider to be a good match to the ebbs and flows of the text.

[2] In a longer, less blog friendly version of this research that is to appear in a collection of essays on digital literary studies, I explain the mathematics in precise detail.

Revealing Sentiment and Plot Arcs with the Syuzhet Package

Introduction

This post is a followup to A Novel Method for Detecting Plot posted June 15, 2014.

For the past few years, I have been exploring the relationship between sentiment and plot shape in fiction. Earlier today I posted an R package titled "syuzhet" to GitHub. The package is designed to extract sentiment and plot information from prose. Methods for text import, sentiment extraction, and plot arc modeling are described in the documentation and in the package vignette. What follows below is a blog-friendly version of a longer academic paper describing how I employed this package to study plot in a corpus of ~50,000 novels.

Figure: a noisy, sentence-level sentiment trajectory

Background

When I began the research that led to this package, my goal was to study positive and negative emotions in literature across time, much in the same way that I had studied style and theme in Macroanalysis. Along the way, however, I discovered that fluctuations in sentiment can serve as a rather natural proxy for fluctuations in plot movement. Studying plot shifts via sentiment analysis turned out to be a far more interesting project than the simple study of sentiment, and my research got a huge boost when I stumbled upon a video of Kurt Vonnegut describing plot in precisely these terms.

After seeing the video and hearing Vonnegut’s opening challenge (“There’s no reason why the simple shapes of stories can’t be fed into computers”), I set out to develop a systematic way of extracting plot arcs from fiction. I felt this might help me to better understand and visualize how narrative is constructed. The fundamental idea, of course, was nothing new. What I was after is what the Russian formalist Vladimir Propp had defined as the narrative’s syuzhet (the organization of the narrative) as opposed to its fabula (raw elements of the story).

Syuzhet is concerned with the linear progression of narrative from beginning (first page) to the end (last page), whereas fabula is concerned with the specific events of a story, events which may or may not be related in chronological order. When we study fabula, which is what we typically do in literature courses, we mentally reconstruct the events into chronological order. We hope that this reconstruction of the fabula will help us understand the experience of the characters, the core story, etc. When we study the syuzhet, we are not so much concerned with the order of the fictional events but specifically interested in the manner in which the author presents those events to readers.

Consider the technique that radio personality Paul Harvey used in his iconic radio show “The Rest of the Story.” In each story, Harvey would hold back certain key elements until the very end of the program. The narrative would appear to have reached its conclusion, and then Harvey would say, “and now, the rest of the story.” At this point, he would reveal the held back information and the listener would reconstruct the entire fabula. The effect (and affect) of Harvey’s technique, the syuzhet, was usually stunning and pleasantly surprising. Had the story been told in simple chronological order, it would have been bland, perhaps even boring. What gave Harvey’s show power was his narrative technique.

This power was largely derived from the organization of the narrative elements and the manner in which Harvey offered listeners clues and then used narrative and language to evoke both curiosity and emotional response. What Harvey said, and how he said it, were critical elements in the overall effect of the story. Harvey's success was in finding and mastering a particular style of plot, a plot that has much in common with those found in mystery and detective fiction. A series of clues is presented alongside a series of misdirections, and the mystery is ultimately resolved in some grand reveal that defies expectations.

A Finite Number of Plots

But this Harvey method is just one among many possible plots. Countless scholars and non-scholars have pontificated about the possibility of a finite set of fundamental or archetypal plot shapes.

One of the more recent and famous/infamous of these scholars is Christopher Booker, whose 2004 book, titled The Seven Basic Plots: Why We Tell Stories, argues for a Jungian inspired understanding of plot in terms of seven basic archetypes. Booker’s work appears to be strongly influenced by prior work describing plot in terms of conflict. These core conflicts will be familiar to students of literature: such constructions were once taught to us under the headings of “man vs. man,” “man against nature,” “man vs. society,” and so on.

Other scholars have offered other numbers. William Foster-Harris has argued in favor of three basic patterns of plot (The Basic Patterns of Plot. University of Oklahoma Press, 1959); Ronald B. Tobias has argued for twenty (20 Master Plots. Cincinnati: Writer's Digest Books, 1993); and Georges Polti claims that there are thirty-six (The Thirty-Six Dramatic Situations. Trans. Lucille Ray). So the story goes.

All of these discussions about plot typically involve some discussion of a story’s central conflict. But discussions of conflict are more appropriately classified as fabula. Nevertheless, many of these same discussions also explore the flow, or trajectory, of the narrative, and these I consider to be appropriately categorized as syuzhet. Often these discussions of plot engage visualization in order to convey the “movement” of the narrative. Perhaps the best example of this is the one offered by Vonnegut.


A Significant Problem

Still, all of these explanations of plot suffer from a significant problem: a lack of data. Each of these proposed taxonomies suffers from anecdotalism. Vonnegut draws the plot of Cinderella for us on his chalk board, and we can imagine a handful of similar plot shapes. He describes another plot and names it “man in hole,” and we can imagine a few similar stories. But our imaginations are limited.

This limitation led me to think hard about the problem of how to compare, mathematically and computationally, the shape of one story to another. Assuming I could use computers and some NLP magic to extract plot shape from narrative (see A Novel Method for Detecting Plot), it would still be impossible to compare one shape to another because of the simple fact that stories are not the same length. Vonnegut solved this problem by creating an x-axis that runs from B to E, that is, from beginning to end. What Vonnegut did not solve, however, was the real computational problem of text length.

It was tempting to consider simply breaking each book into ten or one-hundred equally sized pieces and then taking measurements of the mean emotional valence in each chunk.


Unfortunately, some of the books would have much larger chunks and with larger chunks would come the possibility of more and more diverse valence markers. What happens, in fact, is that larger chunks of text tend to have a preponderance of both positive and negative valence markers. The end result is that all the means end up very close to neutral on the y-axis of emotional valence. Indeed, books as a whole tend to have a mean valence close to zero on a scale of -1 to 1. (I tested this by calculating the mean valence for 3500 novels in my nineteenth century novels corpus and then plotting the results as a histogram. The distribution showed a clustering around zero with very few books on the extremes.)
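
To make the chunking idea concrete, here is a quick base R sketch of that naive approach (this is not the package's get_percentage_values() function, and the sentiment vector here is simulated):

# split a sentence-level sentiment vector into equal bins and take the mean of each bin
chunk_means <- function(sentiments, bins = 100) {
  bin <- cut(seq_along(sentiments), breaks = bins, labels = FALSE)
  tapply(sentiments, bin, mean)
}
# a simulated "book": 5,000 sentences of very noisy sentiment values
fake_sentiments <- sin(seq(0, 2 * pi, length.out = 5000)) + rnorm(5000, sd = 2)
plot(chunk_means(fake_sentiments, 100), type = "l",
  xlab = "Narrative Time (percent)", ylab = "Mean Emotional Valence")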

So, I needed a way to deal with length. I needed a way to compare the shapes of the stories regardless of the length of the novels. Luckily, since coming to UNL, I’ve become acquainted with a physicist who is one of the team of scientists who recently discovered the Higgs Boson at CERN. Over coffee one afternoon, this physicist, Aaron Dominguez, helped me figure out how to travel through narrative time.

A Solution

Aaron introduced me to a mathematical formula from signal processing called the Fourier transformation. The Fourier transformation provides a way of decomposing a time-based signal and reconstituting it in the frequency domain. A complex signal (such as the one seen above in the first figure in this post) can be decomposed into a series of symmetrical waves of varying frequencies. And one of the magical things about the Fourier equation is that these decomposed component sine waves can be added back together (summed) in order to reproduce the original wave form–this is called a backward or reverse transformation. Fourier provides a way of transforming the sentiment-based plot trajectories into an equivalent data form that is independent of the length of the trajectory from beginning to end. The frequency domain begins to solve the book-length problem.

It turns out that not all of these sine waves in the frequency domain are created equal; some play a bigger role in the construction of the original signal. In signal processing, a low-pass filter can be used to remove the background “hiss” in an audio recording, and a similar approach can be used to filter out the extremes in the sentiment trajectories. When a low-pass filter is applied to the sentiment data, it’s possible to filter and thereby smooth out a great deal of the affectual noise.

The filtered data from the frequency domain can then be reconstituted back into the time domain using the reverse transformation. At the same time, the x-axis can be normalized and the foundation shape of the story revealed.

Figure: the foundation shape of Joyce's Portrait

Above you can see the core shape of Joyce’s Portrait revealed using the “bing” method of the get_sentiment function in the syuzhet package. (Check the package documentation and vignette for details on the various options and methods.)
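
For readers who want to see the mechanics, here is a bare-bones illustration of the low-pass idea in base R. This is not the package's implementation (get_transformed_values() handles the details, including normalizing narrative time to a fixed length); it simply shows the decompose, filter, and recompose steps on a simulated signal.

low_pass <- function(values, components = 3) {
  n    <- length(values)
  ft   <- fft(values)                                    # decompose into the frequency domain
  keep <- c(1:(components + 1), (n - components + 1):n)  # DC term, low frequencies, and their mirrors
  ft[-keep] <- 0 + 0i                                    # filter out the high-frequency "hiss"
  Re(fft(ft, inverse = TRUE)) / n                        # recompose in the time domain
}
noisy_signal <- sin(seq(0, 2 * pi, length.out = 500)) + rnorm(500, sd = 1)
plot(noisy_signal, type = "l", col = "grey")
lines(low_pass(noisy_signal, components = 3), col = "red", lwd = 2)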

Once a book's plot trajectory is converted into this normalized space, we no longer have the problem of comparing books of different lengths. Compare the foundation shape of Joyce's Portrait (above) to Wilde's Picture of Dorian Gray (below).

Figure: the foundation shape of Wilde's Picture of Dorian Gray

The models reflect the key narrative movements in both of these plots. Young Stephen reaches a low point during and just after the sermon on hell which occurs midway through the narrative. Dorian’s life takes a dark turn as the reality of the portrait becomes apparent. But the full power of these transformed plots does not sit simply in visualization. The values that inform these visualizations can now be compared. In a follow up post, I’ll discuss how I measured and compared 40,000+ plot shapes and then clustered the resulting data in order to reveal six common, perhaps archetypal, plot shapes. . .

Plot Arcs (Schmidt Style)

A few weeks ago Ben Schmidt posted a provocative blog entry titled “Typical TV episodes: visualizing topics in screen time.” It’s worth a careful read. . .

Ben began by topic modeling the closed captioning data from a series of popular TV series and then visualizing the ten most common topics over the time span of each episode. In other words, the x-axis is time, and the y-axis is a measure of topical presence. The end result is something that begins to look a bit like what we could call plot.

Ben followed this post with an even more provocative one on 12/16/14, "Fundamental plot arcs, seen through multidimensional analysis of thousands of TV and movie scripts." This post led a number of us (Underwood, Mimno, Cherny, etc.) to question what the approach might reveal if applied to novels . . .

In my own recent work, I have been attempting to model plot movement in narrative fiction by analyzing the rise and fall of emotional valence across narrative time. It has been clear to me, however, that my method is somewhat impoverished by a lack of context for the emotions I am measuring; Ben's topic-based approach to plot structure might be just the context I'm missing, and some correlation analysis might be just the right recipe . . . as usual, Ben has given us a lot to think about—i.e. Happy Holidays!

After following the discussion on Twitter and on Ben’s blog, David Mimno wrote to me about whipping up some of these topical plot lines based on the 500 Topic model that I had built for Macroanalysis. Needless to say, I thought this was a great idea. (David and I had previously revisited my topical data for an article in Poetics.) Within a few hours, David had run the entire collection of 500 topics and produced 500 graphs showing the general behavior of each topic across all of the 3,500 texts in my corpus. You will find the output of David’s work here: http://mimno.infosci.cornell.edu/novels/plot.html

In David’s short introductory paragraph, he calls our attention to two specific topic graphs, one for the topic labeled “school” and another labeled “punishment.” You can find my graphs for these two topics here (school) and here (punishment). In referencing these two plots, David calls our attention to one topic (school) that appears prominently at the beginnings of novels in this corpus (think Bildungsroman, perhaps?) and another topic (punishment) that tends to be prominent at the end of novels (think Newgate novels or Oliver Twist, perhaps?).

Like the data from Ben, this data David has mined from my 19th century novels topic model is incredibly rich and demands deeper inspection. I've only begun to digest it in bits, but I do observe that a lot of topics carrying negative valence seem to rise over the course of narrative time. This makes intuitive sense if we believe that the central conflict of a novel must grow more intense as the novel progresses. The exciting thing to do next is to move from the macro to the micro scale and look at the individual novels within this larger context. Perhaps we'll be able to identify archetypal patterns and then observe which novels stick to the archetypes and which digress. . . what a feast!

Luckily we have a whole new year to indulge!

NHC Summer Institutes in Digital Humanities

I’m pleased to announce that Willard McCarty and I are leading a two-year set of summer institutes in digital humanities at the National Humanities Center. Here is the official announcement:

“The first of the National Humanities Center’s summer institutes in digital humanities, devoted to digital textual studies, will convene for two one-week sessions, first in June 2015 and again in 2016. The objective of the Institute in Digital Textual Studies is to develop participants’ technological and scholarly imaginations and to combine them into a powerful investigative instrument. Led by Willard McCarty (King’s College London and University of Western Sydney) and Matthew Jockers (University of Nebraska), the Institute aims to further the development of individual as well as collaborative projects in literary and textual studies. The Institute will take place in Chapel Hill, North Carolina, in 2015 and at the National Humanities Center in Research Triangle Park, North Carolina, in 2016.”

The first workshop will take place June 8 – 12. Applications are now open. See http://nationalhumanitiescenter.org/digital-humanities/application.html

NHC Flyer

Reading Macroanalysis: The Hard Way!

This past November, Judge Denny Chin ruled to dismiss the Authors Guild’s case against Google; the Guild vowed they would appeal the decision and two months ago their appeal was submitted. I’ll leave it to my legal colleagues to discuss the merit (or lack) in the Guild’s various arguments, but one thing I found curious was the Guild’s assertion that 78% of every book is available, for free, to visitors to the Google Books pages.

According to the Guild’s appeal:

Since 2005, Google has displayed verbatim text from copyrighted books on these pages. . . Google generally divides each page image into eighths, which it calls “snippets.”. . . Once a user retrieves a book through her initial search, she can enter any other search terms she chooses, and the author’s verbatim words will be displayed in three snippets for each search. Although Google has stated that any given search by a user “only” displays three snippets of each book, a single user can view far more than three snippets from a Library Project book by performing multiple searches using different terms, including terms suggested by Google. . . Even minor variations in search terms will yield different displays of text. . . Google displays snippets from each book, except that it withholds display of 10% of the pages in each book and of one snippet per page. . .Thus, Google makes the vast majority of the text of these books—in all, 78% of each work—available for display to its users.

I decided to test the Guild’s assertion, and what better book to use than my own: Macroanalysis: Digital Methods and Literary History.

In the “Preview,” Google displays the front matter (table of contents, acknowledgements, etc) followed by the first 16 pages of my text. I consider this tempting pabulum for would be readers and within the bounds of fair use, not to mention free advertising for me. The last sentence in the displayed preview is cut off; it ends as follows: “We have not yet seen the scaling of our scholarly questions in accordance with the massive scaling of digital content that is now. . . ” Thus ends page 16 and thus ends Google’s preview.

According to the author’s guild, however, a visitor to this book page can access much more of the book by using a clever method of keyword searching. What the Guild does not tell us, however, is just how impractical and ridiculous such searching is. But that is my conclusion and I’m getting ahead of myself here. . .

To test the guild’s assertion, I decided to read my book for free via Google books. I began by reading the material just described above, the front matter and the first 16 pages (very exciting stuff, BTW). At the end of this last sentence, it is pretty easy to figure out what the next word would be; surely any reader of English could guess that the next word, after “. . .scaling of digital content that is now. . . ” would be the word “available.”

Just to be sure, though, I double-checked that I was guessing correctly by consulting the print copy of the book. Crap! The next word was not “available.” The full sentence reads as follows: “We have not yet seen the scaling of our scholarly questions in accordance with the massive scaling of digital content that is now held in twenty-first-century digital libraries.”

Now why is this mistake of mine important to note? Reading 78% of my book online, as the Guild asserts, requires that the reader anticipate what words will appear in the concealed sections of the book. When I entered the word “available” into the search field, I was hoping to get a snippet of text from the next page, a snippet that would allow me to read the rest of the sentence. But because I guessed wrong, I in fact got non-contiguous snippets from pages 77, 174, 72, 9, 56, 15, 37, 162, 8, 4, 80, 120, 154, 46, 133, 79, 27, 97, 147, and 17, in that order. These are all the pages in the book where I use the word “available” but none include the rest of the sentence I want to read. Ugh.

Fortunately, I have a copy of the full text on my desk. So I turn to page 17 and read the sentence. Aha! I now conduct a search for the word “held.” This search results in eight snippets; the last of these, as it happens, is the snippet I want from page 17. This new snippet contains the next 42 words. The snippet is in fact just the end of the incomplete sentence from page 16 followed by another incomplete sentence ending with the words: “but we have not yet fully articulated or explored the ways in which. . . ”

So here I have to admit that I’m the author of this book, and I have no idea what follows. I go back to my hard copy to find that the sentence ends as follows: “. . . these massive corpora offer new avenues for research and new ways of thinking about our literary subject.”

Without the full text by my side, I’d be hard pressed to come up with the right search terms to get the next snippet. Luckily I have the original text, so I enter the word “massive” hoping to get the next contiguous snippet. Six snippets are revealed, the last of these includes the sentence I was hoping to find and read. After the word “which,” I am rewarded with “these massive corpora offer new avenues for” and then the snippet ends! Crap, I really want to read this book for free!

So I think to myself, "What if, instead of trying to guess a keyword from the next sentence, I just use a keyword from the last part of the snippet?" "Avenues" seems like a good candidate, so I plug it in. Crap! The same snippet is shown again. Looks like I'm going to have to keep guessing. . .

Let's see, "new avenues for. . . " perhaps new avenues for "research"? (Ok, I'm cheating again by going back to the hard copy on my desk, but I think a savvy user determined to read this book for free might guess the word "research"). I plug it in. . . 38 snippets are returned! I scroll through them and find the one from page 17. The key snippet now includes the end of the sentence: "research and new ways of thinking about our literary subject."

Now I’m making progress. Unfortunately, I have no idea what comes next. Not only is this the end of a sentence, but it looks like it might be the end of a paragraph. How to read the next sentence? I try the word “subject” and Google simply returns the same snippet again (along with assorted others from elsewhere in the book). So I cheat again and look at my copy of the book. I enter the word “extent” which appears in the next sentence. My cheating is rewarded and I get most of the next sentence: “To some extent, our thus-far limited use of digital content is a result of a disciplinary habit of thinking small: the traditionally minded scholar recognizes value in digital texts because they are individually searchable, but this same scholar, as a. . . ”

Thank goodness I have tenure and nothing better to do!

The next word is surely the word "result," which I now dutifully enter into the search field. Among the 32 snippets that the search returns, I find my target snippet. I am rewarded with a copy of the exact same snippet I just saw with no additional words. Crap! I'm going to have to be even more clever if I'm going to game this system.

Back to my copy of the book I turn. The sentence continues “as a result of a traditional training,” so I enter the word “traditional,” and I’m rewarded with . . . the same damn passage again! I have already seen it twice, now thrice. My search for the term “traditional” returns a hit for “traditionally” in the passage I have already seen and, importantly, no hit for the instance of “traditional” that I know (from reading the copy of the book on my desk) appears in the next line. How about “training,” I wonder. Nothing! Clearly Google is on to me now. I get results for other instances of the word “training” but not for the one that I know appears in the continuation of the sentence I have already seen.

Well, this certainly is reading Macroanalysis the hard way. I’ve now spent 30 minutes to gain access to exactly 100 words beyond what was offered in the initial preview. And, of course, my method involved having access to the full text! Without the full text, I don’t think such a process of searching and reading is possible, and if it is possible, it is certainly not feasible!

But let’s assume that a super savvy text pirate, with extensive training in English language syntax could guess the right words to search and then perform at least as well as I did using a full text version of my book as a crutch. My book contains, roughly, 80,000 words. Not counting the ~5k offered in the preview, that leaves 75,000 words to steal. At a rate of 200 words per hour, it would take this super savvy text pirate 375 hours to reconstruct my book. That’s about 47 days of full-time, eight-hour work.

I get it. Times are tough and some folks simply need to steal books from snippet view because they can't afford to buy them. I'm sympathetic to these folks; they need to satisfy their intense passion for reading and knowledge and who could blame them? Then again, if we consider the opportunity cost at $7.25 per hour (the current minimum wage), then stealing this book from snippet view would cost a savvy text pirate $2,718.75 in lost wages. The eBook version of my text, linked to from the Google Books web page, sells for $14.95. Hmmm?

A Novel Method for Detecting Plot

While studying anthropology at the University of Chicago, Kurt Vonnegut proposed writing a master’s thesis on the shape of narratives. He argued that “the fundamental idea is that stories have shapes which can be drawn on graph paper, and that the shape of a given society’s stories is at least as interesting as the shape of its pots or spearheads.” The idea was rejected.

In 2011, Open Culture featured a video in which Vonnegut expanded on this idea and suggested that computers might someday be able to model the shape of stories, that is, the movement of the narratives, the plots. The video is about four minutes long; it’s worth watching.

About the same time that I discovered this video, I was working on a project in which I was applying the tools and techniques of sentiment analysis to works of fiction.[1] Initially I was interested in tracing the evolution of emotional content in novels over the course of the 19th century. By accident I discovered that the sentiment I was detecting and measuring in the fiction could be used as a highly accurate proxy for plot movement.

Joyce’s Portrait of the Artist as a Young Man is a story that I know fairly well. Once upon a time a moo cow came down along the road. . .and so on . . .

Here is the shape of Portrait of the Artist as a Young Man that my computer drew based on an analysis of the sentiment markers in the text:

Figure: the computer-drawn plot shape of Portrait of the Artist

If you are familiar with the plot, you’ll readily see that the computer’s version of the story is accurate. As it happens, I was teaching Portrait last fall, so I projected this image onto the white board and asked my students to annotate it. Here are a few of the high (and low) points that we identified.

[Figure: the same plot shape, annotated with the high and low points identified in class]

Because the x-axis represents the progress of the narrative as a percentage, it is easy to move from the graph to the actual pages in the text, regardless of the edition one happens to be using. That’s precisely what we did in the class. We matched our human reading of the book with the points on the graph on a page-by-page basis.

Here is a graph from another Irish novel that you might know; this is Wilde’s Picture of Dorian Gray.

[Figure: plot shape of The Picture of Dorian Gray]

If you remember the story, you’ll see how well this plot line models the movement of the story. Discovering the accuracy of these graphs was quite thrilling.

This next image shows Dan Brown’s blockbuster novel The Da Vinci Code. Notice how much more regular the fluctuations are. This is the profile of a page turner. Notice too how the more generalized blue trend line hovers above neutral in terms of its emotional valence. Dan Brown never lets the plot become too troubled or too much of a downer. He baits us and teases us with fluctuating emotion.

[Figure: plot shape of The Da Vinci Code, with smoothed trend line]

Now compare Da Vinci Code to one of my favorite contemporary novels, Cormac McCarthy’s Blood Meridian. Blood Meridian is a dark book and the more generalized blue trend line lingers in the realms of negative emotion throughout the text; it is a very different book from The Da Vinci Code.[2]

[Figure: plot shape of Blood Meridian, with smoothed trend line]

I won’t get into the precise details of how I am measuring emotional valence in these books here.[3] It’s a bit too complicated for an already too long blog post. I will note, however, that the process involves two major components: a controlled vocabulary of positive and negative sentiment markers collected by Bing Liu of the University of Illinois at Chicago and a machine model that I trained to identify and score passages as positive or negative.
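
For readers who would like to tinker, here is a minimal, lexicon-only sketch in R of how such a plot shape can be computed. To be clear, this is not the exact pipeline described above (the trained machine model is omitted entirely), and the three file names are placeholders: the two word lists stand in for a Bing Liu–style opinion lexicon, and the text file is any plain-text copy of the novel.

    # Lexicon-only sketch of a sentiment-based plot shape (not the trained-model pipeline)
    pos <- tolower(readLines("positive-words.txt"))   # placeholder word lists
    neg <- tolower(readLines("negative-words.txt"))

    text  <- paste(readLines("portrait.txt"), collapse = " ")
    sents <- unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE))  # crude sentence split

    score <- function(s) {
      w <- unlist(strsplit(gsub("[^a-z' ]", " ", tolower(s)), "\\s+"))
      sum(w %in% pos) - sum(w %in% neg)               # net valence of one sentence
    }
    raw <- vapply(sents, score, numeric(1))

    # Express narrative time as a percentage so every edition maps to the same x-axis
    bins  <- cut(100 * seq_along(raw) / length(raw), breaks = 0:100)
    shape <- tapply(raw, bins, mean)                  # assumes well over 100 sentences

    plot(1:100, shape, type = "l",
         xlab = "narrative time (%)", ylab = "mean sentence valence")
    lines(lowess(1:100, shape, f = 1/3), lwd = 2)     # smoothed trend line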

In a follow-up post, I’ll describe how I normalized the plot shapes in 40,000 novels in order to compare the shapes and discover what appear to be six archetypal plots!

NOTES:
[1] In the field of natural language processing there is an area of research known as sentiment analysis or, sometimes, opinion mining. And when our colleagues engage in this kind of work, they very often focus their study on a highly stylized genre of non-fiction: the review, specifically movie reviews and product reviews. The idea behind this work is to develop computational methods for detecting what we literary folk might call mood, or tone, or sentiment, or perhaps even refer to as affect. The psychologists prefer the word valence, and valence seems most appropriate to this research of mine because the psychologists also like to measure degrees of positive and negative valence. I am not aware of anyone working in sentiment analysis who is specifically interested in modeling emotional valence in fiction. In fact, the great majority of work in this field is so far removed from what we care about in literary studies that I spent about six months simply wondering whether or not the methods developed by folks trying to gauge opinions in movie reviews could even be usefully employed in studies of literature.
[2] I gained access to some of these novels through a data transfer agreement made between the University of Nebraska and a private company that is no longer in business. See Unfolding the Novel.
[3] I’m working on a longer and more formal version of this research report for publication. The longer version will include all the details of the methodology. Stay Tuned:-)

So What?

Over the past few days, several people have written to ask what I thought about the article by Adam Kirsch in the New Republic (“Technology Is Taking Over English Departments: The False Promise of the Digital Humanities”). In short, I think it lacks insight and new knowledge. But, of course, that is precisely the complaint that Kirsch levels against the digital humanities. . .

Several months ago, I was interviewed for a story about topic modeling to appear in the web publication Nautilus. The journalist, Dana Mackenzie, wanted to dive into the “so what” question and ask how my quantitative and empirical methods were being received by literary scholars and other humanists. He asked the question bluntly because he’d read the Stanley Fish blog in the NYT and knew already that there was some push back from the more traditional among us. But honestly, this is not a question I spend much time thinking about, so I referred Dana to my UNL colleague Steve Ramsay and to Matthew Kirschenbaum at the University of Maryland. They have each addressed this question formally and are far more eloquent on the subject than I am.

What matters to me, and I think what should matter to most of us, is the work itself, and I believe, perhaps naively, that the value of the work is, or should be, self-evident. The answer to the question of “so what?” should be obvious. Unfortunately, it is not always obvious, especially to readers like Kirsch who are not working in the subfields of this massive big tent we have come to call “digital humanities” (and for the record, I do despise that term for its lack of specificity). Kirsch and others inevitably gravitate to the most easily accessible and generalized resources, often avoiding or missing some of the best work in the field.

“So what?” is, of course, the more informal and less tactful way of asking what one sometimes hears (or might wish to hear) asked after an academic paper given at the Digital Humanities conference, e.g. “I was struck by your use of latent Dirichlet allocation, but where is the new knowledge gained from your analysis?”

But questions such as this are not specific to digital humanities (I was struck by your use of Derrida, but where is the new knowledge gained from your analysis?). In a famous essay, Claude Levi-Strauss asked “so what” after reading Vladimir Propp’s Morphology of the Folktale. If I understand Levi-Strauss correctly, the beef with Propp is that he never gets beyond the model; Propp fails to answer the “so what” question. To his credit, Levi-Strauss gives Propp props for revealing the formal model of the folktale when he writes: “Before the epoch of formalism we were indeed unaware of what these tales had in common.”

But then, in the very next sentence, Levi-Strauss complains that Propp’s model fails to account for content and context, and so we are “deprived of any means of understanding how they differ.”

“The error of formalism,” Levi-Strauss writes, is “the belief that grammar can be tackled at once and vocabulary later.” In short, the critique of Propp is simply that Propp did not move beyond observation of what is and into interpretation of what that thing that is, means (Propp 1984).

To be fair, I think that Levi-Strauss gave Propp some credit and took Propp’s work as a foundation upon which to build more nuanced layers of meaning. Propp identified a finite set of 31 functions that recur across narratives; Levi-Strauss wished to say something about narratives within their cultural and historical context. . .

This is, I suppose, the difference between discovering DNA and making DNA useful. But bear in mind that the one ever depends upon the other. Leslie Pray writes about the history of DNA in a Nature article from 2008:

Many people believe that American biologist James Watson and English physicist Francis Crick discovered DNA in the 1950s. In reality, this is not the case. Rather, DNA was first identified in the late 1860s by a Swiss chemist. . . and other scientists . . . carried out . . . research . . . that revealed additional details about the DNA molecule . . . Without the scientific foundation provided by these pioneers, Watson and Crick may never have reached their groundbreaking conclusion of 1953.

(Pray 2008)

I suppose I take exception to the idea that the kind of work I am engaged in, because it is quantitative and methodological, because it seeks first to define what is, and only then to describe why that which is matters, must meet some additional criteria of relevance.

There is often a double standard at work here. The use of numbers (computers, quantification, etc.) in literary studies often triggers a knee jerk reaction. When the numbers come out, the gloves come off.

When discussing my work, I am sometimes asked whether the methods and approaches I advocate and employ succeed in bringing new knowledge to our study of literature. My answer is a firm and resounding “yes.” At the same time, I need to emphasize that computational work in the humanities can be simply about testing, rejecting, or reconfirming what we think we already know. And I think that is a good thing!

During a lecture about macro-patterns of literary style in the 19th century novel, I used the example of Moby Dick. I reported how, in terms of style and theme, Moby Dick is a statistical mutant among a corpus of 1000 other 19th century American novels. A colleague raised his hand and pointed out that literary scholars already know that Moby Dick is an aberration. Why bother computing a new answer to a question for which we already have an answer?

My colleague’s question says something about our scholarly traditions in the humanities. It is not the sort of question that one would ask a physicist after a lecture confirming the existence of the Higgs Boson! It is, at the same time, an ironic question; we humanists have tended to favor a notion that literary arguments are never closed!

In other words, do we really know that Moby Dick is an aberration? Could a skillful scholar/humanist/rhetorician argue the counterpoint? I think that the answer to the first question is “no” and the second is “yes.” Maybe Moby Dick is only an outlier in comparison to the other twenty or thirty American novels that we have traditionally studied alongside Moby Dick?

My point in using Moby Dick was not to pretend that I had discovered something new about the position of the novel in the American literary tradition, but rather to bring new evidence and a new perspective to the matter and, in this case, to fortify the existing hypothesis.

If quantitative evidence happens to confirm what we have come to believe using far more qualitative methods, I think that new evidence should be viewed as a good thing. If the latest Mars rover returns more evidence that the planet could have once supported life, that new evidence would be important and welcomed. True, it would not be as shocking or exciting as the first discovery of microbes on Mars, or the first discovery of ice on Mars, but it would be viewed as important evidence nevertheless, and it would add one more piece to a larger puzzle. Why should a discussion of Moby Dick’s place in literary history be any different?

In short, computational approaches to literary study can provide complementary evidence, and I think that is a good thing.

Computational approaches can also provide contradictory evidence, evidence that challenges our traditional, impressionistic, or anecdotal theories.

In 1990 my dissertation adviser, Charles Fanning, published an excellent book titled The Irish Voice in America. It remains the definitive text in the field. In that book he argued for what he called a “lost generation” of Irish-American writers in the period from 1900 to 1930. His research suggested that Irish-American writing in this period declined, and so he formed a theory about this lost generation and argued that adverse social forces led Irish-Americans away from writing about the Irish experience.

In 2004, I presented new evidence about this period in Irish-American literary history. It was quantitative evidence showing not just why Fanning had observed what he had observed but also why his generalizations from those observations were problematic. Charlie was in the audience that day and after my lecture he came up to say hello. It was an awkward moment, but to my delight, Charlie smiled and said, “it was just an idea.” His social theory was his best guess given the evidence available in 1990, and he understood that.

My point is to say that in this case, computational and quantitative methods provided an opportunity for falsification. But just because such methods can provide contradiction or falsification, we must not get caught up in a numbers game where we only value the testable ideas. Some problems lend themselves to computational or quantitative testing; others do not, and I think that is a fine thing. There is a lot of room under the big tent we call the humanities.

And finally, these methods I find useful to employ can lead to genuinely new discoveries. Computational text analysis has a way of bringing into our field of view certain details and qualities of texts that we would miss with just the naked eye (as John Burrows and Julia Flanders have made clear). I like to think that the “Analysis” section of Macroanalysis offers a few such discoveries, but maybe Mr. Kirsch already knew all that? For a much simpler example, consider Patrick Juola’s recent discovery that J. K. Rowling was the author of The Cuckoo’s Calling, a book Rowling wrote under the pseudonym Robert Galbraith. I think Juola’s discovery is a very good thing, and it is not something that we already knew. I could cite a number of similar examples from research in stylometry, but this example happens to be accessible and appealing to a wide range of non-specialists: just the sort of simple folk I assume Kirsch is attempting to persuade in his polemic against the digital humanities.

Works Cited:

Propp, Vladimir. Theory and History of the Folktale. Trans. Ariadna Y. Martin and Richard Martin. Ed. Anatoly Liberman. University of Minnesota Press, 1984. p. 180.

Pray, Leslie. “Discovery of DNA Structure and Function: Watson and Crick.” Nature, 2008.

Characterization in Literature and the Macroanalysis Lab

I have just posted the syllabus for my spring macroanalysis class focusing on Characterization in Literature. The class is experimental in many senses of the word. We will be experimenting in the class and the class will be an experiment. If all goes according to plan, the only thing about this class that will be different from a research lab is the grade I have to assign at the end—that is the one remaining bit about collaborative learning that still kicks me . . .

To be successful, everyone is going to have to be high-performing and self-motivated, me included. For me, at least, the motivation comes from what I think is a really tough nut to crack: algorithmic detection and analysis of character and character types. So far the work in this area has been largely about character networks: how is Hamlet related to Gertrude, etc. That’s good work, but it depends heavily upon the human coding of character metadata before processing. That is precisely why our early experiments at the Stanford Literary Lab focused on drama . . . the character names are already explicit in the speaker markup. Beyond drama, there have been some important steps taken in the direction of auto-detection of character in fiction, such as those by Graham Sack and Elson et al., but I think we still have a lot more stepping to do, a whole lot more.

The work I envision for the course will include leveraging obvious tools such as those for named entity recognition (a rough sketch of that first step appears just after the list below) and then thinking through and dealing with the more complicated problems of pronoun disambiguation. But my deeper interest here goes far beyond simple detection of entities. The holy grail that I see here lies not in detecting the presence or absence of individual characters but in detecting and tracking character archetypes on a grand macroscale. What if we could begin to answer questions such as these:

  • Are there different classes of villains in the 19th century novel?
  • Do we see a rise in the number of minor characters over the 20th century?
  • What are the qualities that define heroines?
  • How, if at all, do those qualities change/evolve over time? (think Jane Austen’s Emma vs. Stieg Larsson’s Lisbeth).
  • Etc.
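
As a very rough illustration of the entity-detection starting point mentioned above, the sketch below uses the off-the-shelf openNLP tools in R to pull candidate person names out of a text and count them. It assumes the NLP, openNLP, and openNLPmodels.en packages are installed, the file name is a placeholder, and it does nothing at all about pronouns, aliases, or coreference, which is where the real work begins.

    # Rough first cut at character detection: off-the-shelf person-entity tagging
    # Assumes the NLP, openNLP, and openNLPmodels.en packages are installed
    library(NLP)
    library(openNLP)

    text <- as.String(paste(readLines("some_novel.txt"), collapse = " "))

    base_anns  <- annotate(text, list(Maxent_Sent_Token_Annotator(),
                                      Maxent_Word_Token_Annotator()))
    person_ann <- Maxent_Entity_Annotator(kind = "person")
    persons    <- text[person_ann(text, base_anns)]  # strings tagged as person names

    head(sort(table(persons), decreasing = TRUE))    # most frequently named characters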

We may get nowhere; we may fail miserably. (Of course if I did not already have a couple of pretty good ideas for how to get at these questions I would not be bothering. . . but that, for now, is the secret sauce 😉 )

At the more practical, “skills” level, I’m requiring students to learn and submit all their work using LaTeX! (This may prove to be controversial or crazy–I only learned LaTeX six months ago.) For that they will also be learning how to use the knitr package for R in order to embed R code directly into the LaTeX, and all of this work will take place inside the (awesome) R IDE, RStudio. Hold on to your hats; it’s going to be a wild ride!
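
For anyone curious about the mechanics, a knitr document is just a LaTeX file with embedded R “chunks.” The toy example below (the file name and the computation are invented purely for illustration) is the kind of thing students will be compiling to PDF from inside RStudio:

    \documentclass{article}
    \begin{document}

    <<emma-sentences, echo=TRUE>>=
    # the file name is a placeholder for a plain-text copy of the novel
    text  <- paste(readLines("emma.txt"), collapse = " ")
    sents <- unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE))
    mean_len <- mean(lengths(strsplit(sents, "\\s+")))
    @

    The mean sentence length in \emph{Emma} is \Sexpr{round(mean_len, 1)} words.

    \end{document}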

Obi Wan McCarty

[Below is the text of my introduction of Willard McCarty, winner of the 2013 Busa Award.]

As the chair of the awards committee that selected Prof. McCarty for this award, it is my pleasure to offer a few words of introduction.

I’m going to go out on a limb this afternoon and assume that you already know that Willard McCarty is Professor of Humanities Computing and Director of the Doctoral Program in the Department of Digital Humanities at King’s College London, and that he is Professor in the Digital Humanities Research Group, University of Western Sydney, and that he is a Fellow of the Royal Anthropological Institute (London). I’ll assume that you already know that he is Editor of the British journal Interdisciplinary Science Reviews and that he’s founding Editor of the online seminar Humanist. And I am sure you know that Willard is recipient of the Canadian Award for Outstanding Achievement in Computing in the Arts and Humanities, and of the prestigious Richard W. Lyman Award of the National Humanities Center. You have probably already read his 2005 book titled Humanities Computing, and you know of his many, many other writings and musings.

So I’m not going to talk about any of that stuff.

And since I’m sure that everyone here knows that the Roberto Busa Award was established in 1998, I’m not going to explain how the Busa award was set up to recognize outstanding lifetime achievement in the application of information and communications technologies to humanities research.

No I’m not going to say anything about that either.

Instead, I wish to say a few words about this fellow here. [Image: Obi-Wan McCarty]

This is Obi-Wan McCarty. Long before I met him in person, he had become a virtual friend, model, and mentor.

I began computing in the humanities in 1993, and like so many of us in those early days I was a young maverick with little or no idea what had been done before. Those were the days before the rebellion, when the dark forces of the Empire were still quite strong. It was a time when an English major with a laptop was considered a dangerous rebel. At times I was scared, and I felt alone in a dark side of a galaxy far, far away.

And then somewhere between 1993 and 2001 I began to sense a force in the galaxy.

One day, in early 2001, I was walking with my friend Glen Worthey, and I mentioned how I had recently discovered the Humanist list and how there had been this message posted by Willard McCarty with the cryptic subject line “14.”

“Ah yes,” Glen said, “Obi-Wan McCarty. The force is strong with him.”

Message 14 from Obi-Wan was a birthday message. Humanist was 14 that day and Willard began his message with a reflection on “repetition” and how frequently newcomers to the list would ask questions that had already been asked. Rather than chastise those newbies, and tell them to go STFA (search the freakin’ archive), Willard encouraged them. He wrote in that message of how “repetition is a means of maintaining group memory.” I was encouraged by those words and by Willard’s ongoing and relentless commitment not simply to deep, thoughtful, and challenging scholarship, but to nurturing, teaching, welcoming, and mentoring each new generation.

So Willard, thank you for your personal mentorship, thank you for continuing to demonstrate that scholarly excellence and generosity are kindred spirits. Congratulations on this award. May the force be with you.

“A Matter of Scale”

Back in November, Julia Flanders and I were invited to stage a debate on the matter of “scale” in digital humanities research for the “Boston Area Days of DH” conference keynote: Julia was to represent the micro scale and I the macro.

Julia and I met up during the MLA conference in January and began sketching out how the talk might go. The first thing we discovered, of course, is that we did not in fact have a real difference of opinion on this matter of scale. Big data, small data, close reading and distant . . . these things matter much less than what a scholar actually decides to do and say. In other words, we were both ultimately interested in new knowledge and not too much concerned with the level of scale necessary to derive that new knowledge.

In other words, it’s a false and probably irrelevant debate. And while we agreed on this point in general terms, we discovered in the course of composing and editing the script for our mock debate that there were legitimate nuances that deserved to be put into the light of day. The script from our “debate” and all of the slides are now available via UNL’s open access repository as “A Matter of Scale.”

Julia has posted a few comments on the experience of co-authoring this presentation with me on her blog. Check it out at http://juliaflanders.wordpress.com/2013/03/28/a-matter-of-scale/.

Thoughts on a Literary Lab

[For the “Theories and Practices of the Literary Lab” roundtable at MLA yesterday, panelists were asked to speak for 5 minutes about their vision of a literary lab. Here are my remarks from that session–#147]

I take the descriptor “literary lab” literally, and to help explain my vision of a literary lab I want to describe how the Stanford Literary Lab that I founded with Franco Moretti came into being.

The Stanford Lab was born out of a class that I taught in the fall of 2009. In that course I assigned 1200 novels and challenged students to explore ways of reading, interpreting, and understanding literature at the macro-scale, as an aggregate system. Writing about the course and the lab that evolved from the course, Chronicle of Higher Ed reporter Marc Parry described it as being based on: “a controversial vision for changing a field still steeped in individual readers’ careful analyses of texts.” That may be how it looks from the outside, but there was no radical agenda then and no radical agenda today.

In the class, I asked the students to form into two research teams and to construct research projects around this corpus of 1200 novels. One group chose to investigate whether novel serialization in the 19th century had a detectable/measurable effect upon novelistic style. The other group pursued a project dealing with lexical change over the century, and they wrote a program called “the correlator” that was used to observe and measure semantic change.

After the class ended, two students, one from each group asked to continue their work as independent study; I agreed. Over the Christmas holiday, word spread to the other students from the seminar and by the New Year 13 of the original 14 in the seminar wanted to keep working. Instead of 13 independent studies, we formed an ad-hoc seminar group, and I found an empty office on the 4th floor where we began meeting, sometimes for several hours a day. We began calling this ugly, windowless room, the lab.

Several of the students in my fall class were also in a class with Franco Moretti, and the crossover in terms of subject matter and methodology was fairly obvious. As the research deepened and became more nuanced, Franco began joining us for lab sessions and over the next few months other faculty and grad students were sucked into this evolving vortex. It was a very exciting time.

At some point, Franco and I (and perhaps a few of the students) began having conversations about formalizing this notion of a literary lab. I think at the time our motivation had more to do with the need to lobby for space and resources than anything else. As the projects grew and gained more steam, the room got smaller and smaller.

I mention all of this because I do not believe in the “if we build it they will come” notion of digital humanities labs. While it is true that they may come if we build them, it is also true, and I have seen this first hand, that they may come with absolutely no idea of what to do.

First and foremost a lab needs a real and specific research agenda. “Enabling Digital Humanities projects” is not a research agenda for a lab. Advancing or enabling digital humanities oriented research is an appropriate mission for a Center, such as our Center for Digital Humanities Research at Nebraska, but it is not the function of a lab, at least not in the limited literal sense that I imagine it. For me, a lab is not specifically an idea generator; a lab is a place in which ideas move from birth to maturation.

It would be incredible hyperbole to say that we formally articulated any of this in advance. Our lab was the opposite of premeditated. We did, however, have a loosely expressed set of core principles. We agreed that:

1. Our work would be narrowly focused on literary research of a quantitative nature.
2. All research would be collaborative, even when the outcome ends up having a single author.
3. All research would take the form of “experiments,” and we would be open to the possibilities of failure; indeed, we would see failure as new knowledge.
4. The lab would be open to students and faculty at all levels–and, on a more ad hoc basis, to students and faculty from other institutions.
5. In internal and external presentation and publication, we would favor the narrative genre of “lab reports” and attempt to show not only where we arrived, but how we got there.

I continue to believe that these were and are the right principles for a lab even while they conflict with much about the way Universities are organized.

In our lab we discovered that to focus, to really focus on the work, we had to resist and even reject some of the established standards of pedagogy, of academic hierarchy, and of publishing convention. We discovered that we needed to remove instructional barriers both internal and external in order to find and attract the right people and the right expertise. We did not do any of this in order to make a statement. We were not academic radicals bent on defying the establishment.

Nor should I leave you with the impression that we figured anything out. The lab remains an organic entity unified by what some might characterize as a monomaniacal focus on literary research. If there was any genius to what we did, it was in the decision to never compromise our focus, to do whatever was necessary to keep our focus on the literature.

Some Advice for DH Newbies

In preparation for a panel session at DH Commons today, I was asked to consider the question: “What one step would you recommend a newcomer to DH take in order to join current conversations in the field?” and then speak for 3 – 4 minutes. Below is the 5 minute version of my answer. . .

With all the folks assembled here today, I figured we’d get some pretty good advice about what constitutes DH and how to get started, so I decided that I ought to say something different from what I’d expect others to say. I have two specific bits of advice, and I suppose that the second bit will be a little more controversial.

But let me foreground that by going back to 2011 when my colleague Glen Worthey and I organized the annual Digital Humanities conference at Stanford around a big tent, summer of love theme. We flung open the flaps on the Big Tent and said come on in . . . We believed, and we continue to believe, that there is a wide range of very good and very interesting work being done in “digital humanities.” We felt that we needed a big tent to enclose all that good work. But let’s face it, inside the big tent it’s a freakin’ three ring circus. Some folks like clowns and others want to see the jugglers. The DH conference is not like a conference on Victorian Literature. And that, of course, is the charm and the curse.

While it probably makes sense for a newcomer to poke around and gain some sense of the “disciplinary” history of the “field,” I think the best advice I can give is to spend very little time thinking about what DH is and spend as much time as possible doing DH.

It doesn’t really matter if the world looks at your research and says of it: “Ahhhh, that’s some good Digital Humanities, man.” What matters, of course, is if the world looks at it and says, “Holy cow, I never thought of Jane Austen in those terms” or “Wow, this is really strong evidence that the development of Roman road networks was entirely dependent upon seasonal shifts.” The bottom line is that it is the work you do that is important, not how it gets defined.

So I suppose that is a bit of advice for newcomers, but let me answer the question more concretely and more controversially by speaking as someone who hangs out in one particular ring of the DH Big Tent.

If you understand what I have said thus far, then you know that it is impossible to speak for the Digital Humanities as a group, so, for some, what I am going to say is going to sound controversial. And if I hear that one of you newcomers ran out at the end of this session yelling “Jockers thinks I need to learn a programming language to be a digital humanist,” then I’m going to have to kick your butt right out of the big tent!

Learning a programming language, though, is precisely what I am going to recommend. I’m even going to go a bit further and suggest a specific language called R.

By recommending that you learn R, I am also advocating learning some statistics. R is primarily a language used for statistical computing, which is more or less the flavor of Digital Humanities that I practice. If you want to be able to read and understand the work that we do in this particular ring of the big tent you will need some understanding of statistics; if you want to be able to replicate and expand upon this kind of work, you are going to need to know a programming language, so I recommend learning some R and killing two birds with one stone.

And for those of you who don’t get turned on by p-values, for loops, and latent Dirichlet allocation, I think learning a programming language is still in your best interests. Even if you never write a single line of code, knowing a programming language will allow you to talk to the natives, that is, you will be able to converse with the non-humanities programmers and web masters and DBAs and systems administrators with whom we so often collaborate as digital humanists. Whether or not you program yourself, you will need to translate your humanistic questions into terms that a non-specialist in the humanities will understand. You may never write poetry in Italian, but if you are going to travel in Rome, you should at least know how to ask for directions to the coliseum.
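
To make the recommendation concrete, here is the sort of five-minute first exercise I have in mind. The file name is a placeholder for any plain-text novel you happen to have on hand:

    # A first R exercise: which words appear most often in a novel?
    text  <- tolower(paste(readLines("moby_dick.txt"), collapse = " "))
    words <- unlist(strsplit(text, "[^a-z']+"))
    words <- words[words != ""]
    freqs <- sort(table(words), decreasing = TRUE)
    head(freqs, 10)                          # the most frequent words
    plot(as.numeric(freqs[1:100]), type = "l",
         xlab = "rank", ylab = "count")      # a first glimpse of Zipf's law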

DH2012 and the 2013 Busa Award

I could not make it to the DH conference in Hamburg this year (though I did manage to appear virtually). As chair of the Busa Award committee I had the pleasure of announcing that Willard McCarty had won the award. Willard will accept the award in 2013 when DH meets at the University of Nebraska. Here is the text of my announcement which was read today in Hamburg:

I was very pleased to serve as the Chair of the Busa Award committee this cycle, and though I am disappointed that I was unable to travel to Hamburg this year to make this announcement in person, I’m delighted with the end result. I am also delighted that the award will be given at the 2013 conference hosted by the University of Nebraska. Having recently joined the faculty there, I’m quite certain I will be attending next year’s meeting!

The winner of the 2013 Busa Award is a man of legendary kindness and generosity. His contributions to the growth and prominence of Digital Humanities will be familiar to us all. He is a gentleman, a scholar, a philosopher, and a long time fighter for the cause. He is, by one colleague’s accounting, the “Obi-Wan Kenobi” of Digital Humanities. And I must concur that “the force” is strong with this one. Please join me in congratulating Willard McCarty on his selection for the 2013 Busa Award.

Amicus Brief Filed

In the last chapter of my forthcoming book, I write about the challenges of copyright law and how many a digital humanist is destined to become a 19th-centuryist if the law isn’t reformed to specifically allow for and recognize the importance of “non-expressive” use of digitized content.*

This week the amicus brief that I co-authored with Matthew Sag and Jason Schultz was submitted. The brief (see Brief of Digital Humanities and Law Scholars as Amici Curiae in Authors Guild, Inc. et al. v. HathiTrust et al.) includes official endorsement from the Association for Computers and the Humanities as well as the support and signatures of many individual scholars working in the field.

* “Non-expressive use” is Matthew Sag’s far more pleasing formulation of what many have come to call “non-consumptive use.”

On Distant Reading and Macroanalysis

Earlier this week Kathryn Schulz of the New York Times published a rather provocative, challenging, and in my opinion under-researched and over-sensationalized article about my colleague Franco Moretti’s work theorizing a mode of literary analysis that he has termed “distant-reading.” Others have already pointed out some of the errors Schulz made, and I’m fairly certain Moretti would be happy to clarify any confusion Schulz may have about his work if she were to actually interview him (i.e. before paraphrasing him). My interest here is to offer some specific thoughts and some background on “distant-reading” or what I have preferred to call “macroanalysis.”[1]

The approach to the study of literature that I call macroanalysis, instead of distant-reading (for reasons explained below), is in general ways akin to the social-science of economics or, more specifically, macroeconomics. Before the 20th century there wasn’t a defined field of “Macroeconomics.” There was, however, microeconomics, which studies the economic behavior of individual consumers and individual businesses. As such, microeconomics can be seen as analogous to the study of individual texts via “close-readings” of the material. Macroeconomics, however, is about the study of the entire economy. It tends toward enumeration and quantification and is in this sense similar to literary inquiries that are not highly theorized: bibliographic studies, biographical studies, literary history, philology, and the enumerative analysis that is the foundation of humanities computing.

By way of an analogy, we might think about interpretive close-readings as corresponding to microeconomics while quantitative macroanalysis corresponds to macroeconomics. Consider, then, that in many ways the study of literary genres or literary periods is a type of macro approach to literature. Say, for example, a scholar specializes in early 20th century poetry. Presumably, this scholar could be called upon to provide sound generalizations, or “distant-readings” about 20th century poetry based on a broad reading of individual works within that period. This would be a sort of “macro-, or distant-, reading” of the period. But this parallel falls short of approximating for literature what macroeconomics is to economics, and it is in this context that I prefer the term macroanalysis over distant-reading. The former term places the emphasis on the quantifiable methodology over the more interpretive practice of “reading.” Broad attempts to generalize about a period or about a genre are frequently just another sort of micro-analysis, in which multiple “cases” or “close-readings” of individual texts are digested before generalizations about them are drawn in very qualitative ways. Macroeconomics, on the other hand, is a more number-based discipline, one grounded in quantitative analysis not qualitative assessments. Moreover, macroeconomics employs a number of quantitative benchmarks for assessing, scrutinizing, and even forecasting the macro-economy. While there is an inherent need for understanding the economy at the micro level, in order to contextualize the macro-results, macroeconomics does not directly involve itself in the specific cases, choosing instead to see the cases in the aggregate, looking to those elements of the specific cases that can be generalized, aggregated, and quantified.

Micro-oriented approaches to literature, highly interpretive readings of literature, remain fundamentally important, just as microeconomics offers important perspectives on the economy. It is the exact interplay between the macro and micro scale that promises a new, enhanced, and perhaps even better understanding of the literary record. The two approaches work in tandem and inform each other. Human interpretation of the “data,” whether it be mined at the macro or micro level, remains essential. While the methods of enquiry, of evidence gathering, are different, they are not antithetical, and they share the same ultimate goal of informing our understanding of the literary record, be it writ large or small. The most fundamental and important difference in the two approaches is that the macroanalytic approach reveals details about texts that are for all intents and purposes unavailable to close-readers of the texts. Writing of John Burrows’s study of Jane Austen’s oeuvre, Julia Flanders points out how Burrows’s computational study brings the most common words such as “the” and “of” into our field of view.

Flanders writes: “His [Burrows] effort, in other words, is to prove the stylistic and semantic significance of these words, to restore them to our field of view. Their absence from our field of view, their non-existence as facts for us, is precisely because they are so much there, so ubiquitous that they seem to make no difference.” (Flanders 2005)

At its most basic, the macroanalytic approach I’m advocating is simply another method of gathering information about texts, of accessing the details. The information is different from what is derived via close reading, but it is not of lesser or greater value to scholars for being such.

Flanders goes on: “Burrows’ approach, although it wears its statistics prominently, foreshadows a subtle shift in the way the computer’s role vis-à-vis the detail is imagined. It foregrounds the computer not as a factual substantiator whose observations are different in kind from our own—because more trustworthy and objective—but as a device that extends the range of our perceptions to phenomena too minutely disseminated for our ordinary reading.” (Flanders 2005)
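
For readers who want to see what attention to these “invisible” words looks like in practice, here is a crude sketch in R. The file names are placeholders, and this is only the raw counting that such work begins with, not Burrows’s actual method:

    # Rates per 1,000 words of a few very common function words in two novels
    word_rates <- function(path, targets) {
      w <- unlist(strsplit(tolower(paste(readLines(path), collapse = " ")), "[^a-z']+"))
      w <- w[w != ""]
      sapply(targets, function(t) 1000 * sum(w == t) / length(w))
    }
    targets <- c("the", "of", "and", "to", "a", "in")
    round(rbind(emma      = word_rates("emma.txt", targets),
                moby_dick = word_rates("moby_dick.txt", targets)), 2)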

A macroanalytic approach not only helps us to see and understand the larger “literary economy” but, by means of its scope, to better see and understand the degree to which literature and the individual authors who manufacture the literature respond to or react against literary and cultural trends within their realm of experience. If authors are inevitably influenced by their predecessors, then we may even be able to chart and understand “anxieties of influence” in concrete, quantitative ways.

For historical and stylistic questions in particular, the macroanalytic approach has distinct advantages over the more traditional practice of studying literary periods and genres by means of a close study of “representative” texts. Speaking of his own efforts to provide a more encompassing view of literary history, Franco Moretti writes that “a field this large cannot be understood by stitching together separate bits of knowledge about individual cases, because it isn’t a sum of individual cases: it’s a collective system, that should be grasped as a whole . . .” (2005). To generalize about a “period” of literature based on a study of a relatively small number of books is to take a significant leap. It is less problematic, though, to consider how a macroanalytic study of several thousand texts might lead us to a better understanding of the individual texts. Until recently, we have not had the opportunity to even consider this latter option, and it seems reasonable to imagine that we might, through the application of both approaches, reach a new and better informed understanding of our primary materials. This is what Juri Tynjanov imagined in 1927: “Strictly speaking,” writes Tynjanov, “one cannot study literary phenomena outside of their interrelationships.” Fortunately for me and for scholars such as Moretti, the multitude of interrelationships that overwhelmed and eluded Tynjanov and pushed the limits of close-reading can now be explored with the aid of computation, statistics, and huge digital libraries.

My book on this subject, Literary Studies, the Digital Library, and the Inevitability of Influence, is now under contract. [Update: the book will be published in 2013 as Macroanalysis: Digital Methods and Literary History by the University of Illinois Press.]

[1] I began using the term macroanalysis in late 2003. At the time, Moretti and I were putting together plans for a co-taught course titled “Electronic Data and Literary Theory.” The course we imagined would be a research seminar in the full sense of the word and in our syllabus (dated 11/3/2003) we wrote: “the main purpose of this seminar is methodological rather than historical: learning how to use electronic search systems to analyze large quantities of data — and hence get a new, better understanding of literary and cultural history.” During the course I began work developing a text analysis toolkit that I later called CATools (for Corpus Analysis Tools). In terms of methodology, I was learning a lot at the time from work in corpus linguistics but also discovering that we (literary folks) have an entirely different set of questions. So it made sense to do at least a bit of wheel reinvention. My first experiments with the macroanalytic methodology were constructed around a corpus of Irish-American novels that I had been building since my dissertation research. I presented the first results of this work in Liverpool, at the 2004 meeting of the American Conference for Irish Studies. My paper, titled “Making and Mining a Digital Archive: the Case of the Irish-American West Project,” was part how-to and part results–I’d made one non-trivial discovery about Irish-American literary history based on this new methodology. In the spring of 2005, I offered a more detailed methodological overview of the toolkit at the inaugural meeting of the Text Analysis Developer’s Alliance. An overview of my project was documented on the TADA blog. Later that summer (2005), I presented a more generalized methodological paper titled “A Macro-Economic Model for Literary Research” at the joint meeting of the ACH and ALLC in Victoria, BC. It was there that I first articulated the economic analogy that I have come to find most useful for explaining Moretti’s idea of “distant-reading.” In 2006, while I was in residence as Research Scholar in the Digital Humanities at the Stanford Humanities Center, I spent a good deal of time thinking about macro-scale approaches to literature and then writing corpus analysis code . By the summer of 2007, I had developed a whole new toolkit and presented the first significant findings in a paper titled “Macro-Analysis (2.0)” which I delivered at the 2007 Digital Humanities meeting in Illinois. Coincidentally, this was the same conference at which Moretti presented the opening keynote lecture, a paper exploring a corpus of 19th century novel titles, which would eventually be published in Critical Inquiry. That research utilized software that I had developed in the CATools package.

Kansas Irish Reprint

Rowfont Press of Wichita, Kansas has just published a newly illustrated edition of Charles Driscoll’s memoir Kansas Irish (with my Critical Introduction). The book is available at Amazon. Kansas Irish and the two sequels that follow provide the most complete and authentic rendering of Irish life on the American prairie in the 19th Century.

On Pamphleteering and Pamphlet One

Several months ago, a group of us from the Stanford Literary Lab wrote and sent out for review the article that now appears in Pamphlet 1 of the Lab. The article, titled “Quantitative Formalism: an Experiment” was submitted, peer-reviewed, and approved for publication in a prestigious literary journal. There was, however, a catch. The editors of the journal asked that we trim the number of charts in the article and that we alter the tone and character of the article to make it less of a narrative. In other words, the article was of a style and content that the editors found to be too foreign to their traditions.

Rather than revise the article in ways we felt would misrepresent its function and intent, we turned to that most traditional of literary forms, the pamphlet. Considering the largely quantitative and digital methodology employed in our research, a pamphlet was a seemingly ironic choice. The most obvious venue would seem to have been the web. After all, aren’t the blog and the web site, the pamphlet forms of the digital age (see, for example, Pamphleteers and Web Sites)? We certainly considered an electronic format, and we have posted a pdf version of the essay on our web site, but we decided to print.

Why print the pamphlet? As a literary form, the pamphlet has a long tradition of going against the grain; it’s an alternative form that is malleable. In the pamphlet, George Orwell wrote, “one has complete freedom of expression, including, if one chooses, the freedom to be scurrilous, abusive, and seditious; or, on the other hand, to be more detailed, serious and ‘high-brow’ than is ever possible in a newspaper or in most kinds of periodicals.” It has been used in campaigning and marketing, but most famously, the pamphlet has been employed by political, religious, and social “provocateurs” and “radicals.” Many of these objects have been lost, but the best have been preserved, such as those crafted by talented satirists (Swift), sober thinkers (Paine), and social critics (Voltaire). And they are preserved because they are historical objects, “ephemera” as the librarians say. The pamphlet is an ephemeral object, an object “lasting only for a day.” As “an experiment” our “quantitative formalism” pamphlet is a middle point, a hash mark on the line of time, not an end point or destination, not even a beginning.

It’s interesting to consider how the preferred citation style of literary scholarship, the MLA Style, places emphasis on page reference over time, over moment of publication. With some obvious exceptions, the basic logic here is that what someone says about Shakespeare today has the same validity, the same scholarly purchase, as something said fifty years ago. As a discipline, our “results” tend to be qualitative and interpretive, not bound in time or subject to revision based on the introduction of new evidence. They are, of course, subject to reinterpretation, but that is fundamentally different from the changes wrought by new discoveries. The date-based citation style, on the other hand, places emphasis on points in time and acknowledges the fast-paced and ephemeral nature of certain fields of research. This is most obvious in the sciences, in medicine, for example, where new experiments and new discoveries are constantly and quickly changing the field. One need only eavesdrop on the Digital Humanities twittersphere for a few hours to note the similarities; the pace of change is rapid.

So why not a pamphlet? Why not recognize in form, title, and narrative style that ours is an experiment? It is a bit of research that is useful for today but also something we entirely expect to change. Indeed, we are already working on the next iteration(s), the next experiments. Unlike the neatly closed arguments of our traditional work, these experiments of our Literary Lab open as many doors as they close. In fact, in the course of our research on novel genres, it became apparent that we could, must, go on forever. Each test led us to some new idea, some new direction to explore. There are some discoveries to be sure, and some of our results will likely, hopefully, stand the test of time. But my co-authors and I understand, or perhaps simply “believe,” that there is still much, much more work to be done. If it reads more like a lab report than a traditional essay, it’s because it is a lab report and self-consciously so, intentionally so.

Readers wishing to experience the full pleasure of a touchable paper pamphlet may contact me with their name and address. No charge, while supplies last:-)

Unigrams, and bigrams, and trigrams, oh my

I’ve been watching the ngrams flurry online, on Twitter, and on various email lists over the last couple of days. Though I think there is great stuff to be learned from Google’s ngram viewer, I’m advising colleagues to exercise restraint and caution. First, we still have a lot to learn about what can and cannot be said, reliably, with this kind of data–especially in terms of “culture.” And second, the eye candy charts can be deceiving, especially if the data is not analyzed in terms of statistical significance.

It’s not my intention here to be a “nay-sayer” or a wet blanket, as I said, there is much to learn from the google data, and I too have had fun playing with the ngram viewer. That said, here are a few things that concern me.

  1. We have no metadata about the texts that are being queried. This is a huge problem. Take the “English Fiction” corpus, for example. What kinds of texts does it contain? Poetry, drama, novels, short stories, etc.? From what countries do these works originate? Is there an even distribution of author genders? Is the sample biased toward a particular genre? What is the distribution of texts over time–at least this last one we can get from downloading the Google data.
  2. There are lots of “forces” at work on patterns of ngram usage, and without access to the metadata, it will be hard to draw meaningful conclusions about what any of these charts actually mean. To call these charts representations of “culture” is, I think, a dangerous move. Even at this scale, the corpus is not representative of culture–it may be, but we just don’t know. More than likely the corpus is something quite other than representative of culture. It probably represents the collection practices of major research libraries. Again, without the metadata to tell us what these texts are and where they are from, we must be awfully careful about drawing conclusions that reach beyond the scope of the corpus. The leap from corpus to culture is a big one.
  3. And then there is the problem of “linguistic drift”, a phenomenon mathematically analogous to genetic drift in evolution. In simple terms, some share of the change observed in ngram frequency over time is probably the result of what can be thought of as random mutations. An excellent article about this process can be found here–>“Words as alleles: connecting language evolution with Bayesian learners to models of genetic drift”.
  4. Data noise and bad OCR. Ted Underwood has done a fantastic job of identifying some problems related to the 18th century long s. It’s a big problem, especially if users aren’t ready to deal with it by substitution of f’s for s’s. But the long s problem is fairly easy to deal with compared to other types of OCR problems–especially cases where the erroneous OCR’ed word spells another word that is correct: e.g. “fame” and “same”. But even these we can live with at some level. I have made the argument over and over again that at a certain scale these errors become less important, but not unimportant. That is, of course, if the errors are only short term aberrations, “blips,” and not long term consistencies. Having spent a good many years looking at bad OCR, I thought it might be interesting to type in a few random character sequences and see what the n-gram viewer would show. The first graph below plots the usage of “asdf” over time. Wow, how do we account for the spike in usage of “asdf” in the 1920s and again in the late 1990s? And what about the seemingly cyclical pattern of rising and falling over time? (HINT: Check the y-axis).

    [Chart: frequency of “asdf” over time]

    And here’s another chart comparing the usage of “asdf” to “qwer.”

    [Chart: “asdf” compared with “qwer”]
    And there are any number of these random character sequences. At my request, my three-year-old made up and typed in “asdv”, “mlik”, “puas”, “puase”, “pux”–all of these “ngrams” showed up in the data, and some of them had tantalizing patterns of usage. My daughter’s typing away on my laptop reminded me of Borges’s Library of Babel as well as the old story about how a dozen monkeys typing at random will eventually write all of the great works of literature. It would seem that at least a few of the non-canonical primate masterpieces found their way into Google’s Library of Babel.

  5. And then there is the legitimate data in the data that we don’t really care about–title pages and library book plates, for example. After running a named entity extraction algorithm over 2600 novels from the Internet Archive’s 19th century fiction collection, I was surprised to see the popularity of “Illinois.” It was one of the most common place names. Turns out that is because all these books came from the University of Illinois and all contained this information in the first page of the scans. It was not because 19th century authors were all writing about the Land of Lincoln. Follow this link to get a sense of the role that the partner libraries may be playing in the ngram data: Libraries in the Google Data

    In other words, it is possible that a lot of the words in the data are not words we actually want in the data. Would it be fair, for example, to say that this chart of the word “Library” in fiction is a fair representation of the interest in libraries in our literary culture? Certainly not. Nor is this chart for the word University an accurate representation of the importance of Universities in our literary culture.

So, these are some problems; some are big and some are small.

Still, I’m all for moving ahead and “playing” with the google data. But we must not be seduced by the graphs or by the notion that this data is quantitative and therefore accurate, precise, objective, representative, etc. What Google has given us with the ngram viewer is a very sophisticated toy, and we must be cautious in using the toy as a tool. The graphs are incredibly seductive, but peaks and valleys must be understood both in terms of the corpus from which they are harvested and in terms of statistical significance (and those light-grey percentages listed on the y-axis).
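
On the significance point, one simple discipline is to go back to the raw counts behind the pretty curves and ask whether an apparent rise or fall could plausibly be noise. The numbers in the sketch below are invented purely for illustration; the point is the habit of testing, not the particular values:

    # Did the relative frequency of an ngram really change between two years?
    # Both the match counts and the yearly word totals are made-up numbers
    matches <- c(year_a = 120,  year_b = 180)    # occurrences of the ngram
    totals  <- c(year_a = 95e6, year_b = 110e6)  # all words in the corpus that year
    prop.test(matches, totals)  # two-sample test for equality of proportions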

SEASR Grant

This month a group of researchers at Stanford, University of Illinois, University of Maryland, and George Mason were awarded a $790,000 grant from the Mellon Foundation to advance the prior work of the SEASR project. I’ll be serving as the overall Project Director and as one of the researchers in the Stanford component of the grant. In this phase of the SEASR project, we will focus on leveraging the existing SEASR infrastructure in support of four “use cases.” But “use case” hardly describes the research intensive nature of the proposed work, nor does it capture the strongly humanistic bias of the work proposed. Each partner has committed to a specific research project and each has the expressed goal of advancing humanities research and publishing their results. I’d like to emphasize this point about advancing humanities research.

This grant represents an important step beyond the tool building, QA and UI testing stages of software development. All too often, it seems, our digital humanities projects devote a great deal of time, money, and labor to infrastructure and prototyping and then all too frequently the results languish in the great sea of hammers without a nail. Sure, a few journeymen carpenters stick these tools in their belts and hammer away, but all too often it seems that more effort goes into building the tools and then the resources sit around gathering dust while humanities research marches on in the time-tested modes with which we are most familiar.

Of course, I don’t mean this to be a criticism of the tool builders or the tools built. The TAPOR project, for example, offers many useful text analysis widgets, and I frequently send my colleagues and students there for quick and dirty text analysis. And just last month I had occasion to use and cite Stefan Sinclair’s Voyeur application. I was thrilled to have Voyeur at my fingertips; it provided a quick and easy way to do exactly what I wanted.

But often, the analytic tasks involved in our projects are multifaceted and cannot be addressed by any one tool. Instead, these projects involve “flows” in which our “humanistic” data travels through a series of analytic “filters” and comes out on the other end in some altered form. The TAPOR project attempts to be a virtual text analysis “workbench” in which the craftsman can slide a project around the bench from one tool to the next. This model works well for smallish projects but is not robust enough for large-scale projects and, despite some significant interface improvements over the years, remains, for me at least, a bit clunky. I find it great for quick tasks with one or two texts, but inefficient for processing multiple texts or multiple processes. Part of the TAPOR mission was to develop a suite of tools that could be used by the average, ordinary humanist: which is to say, the humanist without any real technical chops. It succeeds on that front to be sure.

SEASR offers an alternative approach: what it gains in processing power and computational elegance it gives up in ease of use and transparency. The SEASR “interface” is one that involves constructing modular “workflows” in which each module corresponds to some computational task. These modules are linked together such that one process feeds into the next, and the business of “sliding” a project around from one tool to another on the virtual workbench is taken over by the workflow manager.

In this grant we have specifically deemphasized UI development in favor of output, in favor of “results” in the humanities sense of the word. As we write in the proposal, “The main emphasis of the project will be on developing, coordinating, and investigating the research questions posed by the participating humanities scholars.” The scholars in this project include Franco Moretti and me at Stanford, Dan Cohen at GMU, Tanya Clement at the University of Maryland, and Ted Underwood and John Unsworth, both of UIUC. On the technical end, we have Michael Welge and Loretta Auvil of the Automated Learning Group at the National Center for Supercomputing Applications.

As the project gets rolling, I will have more to post about the specific research questions we are each addressing and the ongoing results of our work. . .

On Collaboration

I’ve been hearing a lot about “collaboration,” especially in the digital humanities. Lisa Spiro at Rice University has written a very informative post about Collaborative Authorship in the Humanities as well as another post providing Examples of Collaborative Digital Humanities Projects. Both of these posts are worth reading, and Spiro offers some well-thought-out and well-researched perspectives.

My own experiences with collaboration include both research and authorship. I have seen firsthand how fruitful collaboration, especially interdisciplinary collaboration, can be. It is safe to say that I’m a believer. In fact, the course I have been teaching for the last two years, Literary Studies and the Digital Library, is designed entirely around collaborative research projects. And yet I have to say that I am entirely suspicious of the current rage for “collaboration.”

No doubt the current popularity of collaboration, at least in the humanities, is a natural extension of the movement toward interdisciplinary studies. Though collaboration with people outside our individual disciplines has led to fruitful work, there seems to be an unnatural desire on the part of some administrators and even some colleagues to “foster collaboration,” as if collaboration were something that occurs in a petri dish, something that needs only to be “fostered” in order to evolve.

But collaboration does not arise out of a petri dish, it arises out of need. Sure, there are serendipitous collaborations that arise out of proximity: X bumps into Y at the water cooler and they get to talking . . . but more often successful collaboration arises out of need: X wishes to investigate a topic but requires the skills of Y in order to do a good job.

Failed collaborations, on the other hand, are all too often the result of well-intentioned but overly forced attempts to bring people together. I attended a seminar a couple of years ago on the subject of “fostering collaboration in the humanities.” The organizers of the meeting certainly understood the promise of new knowledge that might be derived through interaction, but they entirely miscalculated when it came to individual motivation to collaborate. It’s a classic case of putting the cart before the horse. In my experience, fruitful collaboration evolves organically and is motivated by the underlying research questions, questions that are always too big and too complex to be addressed by a single researcher.

Auto Converting Project Gutenberg Text to TEI

Those who do corpus-level computational text analysis are always hungry for more and more texts to analyze. Though we’ve become adept at locating texts from a wide range of sources (our own institutional repositories as well as a number of other places including Google Books, the Internet Archive, and Project Gutenberg), we still face a number of preprocessing tasks to bring those various files into some standard format. The texts found at these resources are not always in a format friendly to the tools we use for processing those texts. For example, I’ve developed lots of processing scripts that are designed to leverage the metadata that is frequently encoded into TEI-based XML. A text from Project Gutenberg, however, is not only plain text; it also has a lot of boilerplate at the beginning and end of each file that needs to be removed prior to text analysis.

I’m currently building a corpus of 19th century novels and discovered that many of the texts I would like to include have already been digitized by Project Gutenberg. This, of course, was great news. But the system I have developed for ingesting texts into my corpus assumes that the texts will all be in TEI-XML with markup indicating such important things as “author,” “title,” and “date” of publication. I downloaded about 100 novels and was about to begin opening them up one by one and adding the metadata. . .eek! I quickly realized the mundanity of the task and thought, “hmm, I bet someone has written a nice regex script for doing this sort of thing.” A quick trolling of the web led me to the web page of Michiel Overtoom, who had developed some Python scripts for downloading and cleaning up (“beautifying” in his language) Dutch Gutenberg texts for his eBook reader. Overtoom’s process is mainly designed to strip out the boilerplate and then rename the files with naming conventions that reflect the author and title of the books.

With Overtoom’s script as a base, I reengineered the code to convert a Gutenberg text into a minimally encoded, TEI-compliant XML file. The script builds a teiHeader that includes the author and title of the work (unfortunately, Project Gutenberg texts do not include publication dates, why?) and then adds “text”, “body”, “div”, and all the “p” tags. The final result is a document that meets basic TEI requirements. The script is copied below, but since the all-important Python spacing may be destroyed by this posting, it’s better to download it here and then change the file extension from .txt to .py. Enjoy!

# gutenbergToTei.py
#
# Reformats and renames etexts downloaded from Project Gutenberg.
#
# Software adapted from Michiel Overtoom, motoom@xs4all.nl, july 2009.
# 
# Modified by Matthew Jockers August 17, 2010 to encode result into TEI based XML
#
# October 26, 2015: Peeter Tinits notes that for non-latin characters ', encoding="utf8"' could be added to both "open" functions.

import os
import re
import shutil

remove = ["Produced by","End of the Project Gutenberg","End of Project Gutenberg"]

def beautify(fn, outputDir, filename):
    ''' Reads a raw Project Gutenberg etext, reformat paragraphs,
    and removes fluff.  Determines the title of the book and uses it
    as a filename to write the resulting output text. '''
    lines = [line.strip() for line in open(fn)]
    collect = False
    lookforsubtitle = False
    outlines = []
    startseen = endseen = False
    title = ""
    # Guard against source files that lack "Author:" / "Title:" lines.
    authorTemp = titleTemp = False
    one="<?xml version=\"1.0\" encoding=\"utf-8\"?><TEI xmlns=\"http://www.tei-c.org/ns/1.0\" version=\"5.0\"><teiHeader><fileDesc><titleStmt>"
    two = "</titleStmt><publicationStmt><publisher></publisher><pubPlace></pubPlace><availability status=\"free\"><p>Project Gutenberg</p></availability></publicationStmt><seriesStmt><title>Project Gutenberg Full-Text Database</title></seriesStmt><sourceDesc default=\"false\"><biblFull default=\"false\"><titleStmt>"
    three = "</titleStmt><extent></extent><publicationStmt><publisher></publisher><pubPlace></pubPlace><date></date></publicationStmt></biblFull></sourceDesc></fileDesc><encodingDesc><editorialDecl default=\"false\"><p>Preliminaries omitted.</p></editorialDecl></encodingDesc></teiHeader><text><body><div>"
    for line in lines:
        if line.startswith("Author: "):
            author = line[8:]
            authorTemp = line[8:]
            continue
        if line.startswith("Title: "):
            title = line[7:]
            titleTemp = line[7:]
            lookforsubtitle = True
            continue
        if lookforsubtitle:
            if not line.strip():
                lookforsubtitle = False
            else:
                subtitle = line.strip()
                subtitle = subtitle.strip(".")
                title += ", " + subtitle
        if ("*** START" in line) or ("***START" in line):
            collect = startseen = True
            paragraph = ""
            continue
        if ("*** END" in line) or ("***END" in line):
            endseen = True
            break
        if not collect:
            continue
        if (titleTemp) and (authorTemp):
            outlines.append(one)
            outlines.append("<title>")
            outlines.append(titleTemp)
            outlines.append("</title>")
            outlines.append("<author>")
            outlines.append(authorTemp)
            outlines.append("</author>")
            outlines.append(two)
            outlines.append("<title>")
            outlines.append(titleTemp)
            outlines.append("</title>")
            outlines.append("<author>")
            outlines.append(authorTemp)
            outlines.append("</author>")
            outlines.append(three)
            authorTemp = False
            titleTemp = False
            continue
        if not line:
            paragraph = paragraph.strip()
            for term in remove:
                if paragraph.startswith(term):
                    paragraph = ""
            if paragraph:
                # Escape bare ampersands so the output remains well-formed XML.
                paragraph = paragraph.replace("&", "&amp;")
                outlines.append(paragraph)
                outlines.append("</p>")
            paragraph = "<p>"
        else:
            paragraph += " " + line
    # Compose a filename.  Replace some illegal file name characters with alternatives.
    #ofn = author + title[:150] + ".xml"
    ofn = filename
    ofn = ofn.replace("&", "")
    ofn = ofn.replace("/", "")
    ofn = ofn.replace("\"", "")
    ofn = ofn.replace(":", "")
    ofn = ofn.replace(",,", "")
    ofn = ofn.replace(" ", "")
    ofn = ofn.replace("txt", "xml")
        
    outlines.append("</div></body></text></TEI>")
    text = "\n".join(outlines)
    # Note: re.M must be passed via the flags keyword; as a bare fourth
    # positional argument it would be treated as re.sub's "count" parameter.
    text = re.sub(r"End of the Project Gutenberg .*", "", text, flags=re.M)
    text = re.sub(r"Produced by .*", "", text, flags=re.M)
    text = re.sub(r"<p>\s+</p>", "", text)
    text = re.sub(r"\s+", " ", text)
    f = open(outputDir+ofn, "wt")
    f.write(text)
    f.close()

sourcepattern = re.compile(r".*\.txt$")
sourceDir = "/Path/to/your/ProjectGutenberg/files/"
outputDir = "/Path/to/your/ProjectGutenberg/TEI/Output/files/"

for fn in os.listdir(sourceDir):
    if sourcepattern.match(fn):
        beautify(sourceDir+fn, outputDir, fn)


Panning for Memes

Over in the English Department Literature Lab, we have been experimenting with Topic Modeling as a means of discovering latent themes (aka topics) in a corpus of 19th century novels. Topic Modeling is an unsupervised machine learning process that employs Latent Dirichlet allocation. “It posits that each document is a mixture of a small number of topics and that each word’s creation is attributable to one of the document’s topics.”

We’ve been experimenting using the Java-Based MAchine Learning for LanguagE Toolkit (Mallet) from UMASS Amherst and a corpus of British and American novels from the 19th century. In one experiment we ran the topic modeler over just the British corpus, in another over just the American corpus. But when we combined the two collections and ran the model over the whole corpus, we discovered that certain topics showed up in only one or the other corpus. For example, one solely American topic was composed of words related to slavery and words written in southern dialect. And there was a strictly British topic clearly indicative of the royalty and aristocracy: words such as “lord,” “king”, “duke,” “sir”, “lady.” This was an interesting result and not simply because it provides a quantitative way of distinguishing topics or themes that are distinct to one nation or another, but also because the topics themselves could be read and interpreted in context.
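Lest the comparison above seem like magic, here is a rough Python sketch of how one might flag topics whose probability mass sits almost entirely in one national corpus. The file name, the CSV layout, and the 0.95 threshold are all illustrative assumptions, not a record of the scripts we actually used; MALLET’s own doc-topics output would first need to be exported into this shape.

# Sketch: flag topics that are effectively exclusive to one national corpus.
# Assumes a hypothetical "doc_topics.csv" in which each row is a novel:
# a label ("brit" or "amer") followed by its proportion for each topic.
import csv
from collections import defaultdict

def national_topics(path, threshold=0.95):
    totals = defaultdict(lambda: defaultdict(float))  # topic -> nation -> mass
    with open(path, newline="") as f:
        for row in csv.reader(f):
            nation, props = row[0], [float(x) for x in row[1:]]
            for topic, p in enumerate(props):
                totals[topic][nation] += p
    exclusive = {}
    for topic, masses in totals.items():
        top_nation = max(masses, key=masses.get)
        share = masses[top_nation] / sum(masses.values())
        if share >= threshold:  # nearly all of this topic's mass is in one corpus
            exclusive[topic] = (top_nation, round(share, 3))
    return exclusive

print(national_topics("doc_topics.csv"))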

More interesting for me, however, were two topics that appeared in both corpora. The first, which appeared more in the British corpus, was related to “soldiering.” A second topic, which was more common in the American corpus, had to do with Indian wars. The “soldiering” topic was composed of the following words:

“men,” “general,” “captain,” “colonel,” “army,” “horse,” “sir,” “enemy,” “soldier,” “battle,” “day,” “war,” “officer,” “great,” “country,” “house,” “time,” “head,” “left,” “road,” “british,” “soldiers,” “washington,” “night,” “fire,” “father,” “officers,” “heard,” “moment.”

The Indians topic included:

“indian,” “men,” “indians,” “great,” “time,” “chief,” “river,” “party,” “red,” “white,” “place,” “savages,” “woods,” “day,” “side,” “fire,” “war,” “savage,” “water,” “canoe,” “rifle,” “people,” “warriors,” “returned,” “feet,” “friends,” “tree,” “night,” “distance.”

What was most fascinating, however, was that when the soldiering topic was found in the American corpus it usually had to do with Indians, and when the Indian topic appeared in the British corpus it was almost completely in the context of the Irish! As an Irish-Studies scholar who wrote a thesis on the role of the American West in Irish and Irish-American literature, I found this an incredibly rich discovery. The literature of the Irish and the Irish Diaspora is filled with comparisons between the Irish situation vis-à-vis the British and the Native American situation vis-à-vis what one Irish American author described as the “Tide of Empire.”

Readers wishing to follow this line of comparison in some more contemporary works might want to have a look at Joyce’s short story “An Encounter,” Flann O’Brien’s At Swim-Two-Birds, Paul Muldoon’s Madoc, and Patrick McCabe’s The Butcher Boy.

What is a Literature Lab: Not Grunts and Dullards

Yesterday’s Chronicle of Higher Education ran an article by Marc Parry about the work we are doing here in our new Literature Lab with “big data.” It’s awfully nice to be compared to Lewis and Clark exploring the frontiers of literary scholarship, but I think the article fails to give due credit to the exceptional group of students who have been working with Franco Moretti and me in the Lab. Far from being the lab “grunts” that Parry calls them, these students are the lifeblood of the lab, and the projects we are working on spring from their ideas and their passion for literature.

I understand how such a confusion of roles could happen: in the science lab, students often work *under* a faculty member who has received a grant to pursue a particular line of research. Indeed, the funding of grad students in the sciences is often based on this model; the grant pays for the work they do in the lab.

Our literature lab is nothing at all like this; it is a far more egalitarian enterprise and there are no monetary incentives for the students. Instead, the motivation for the students is pure research, the opportunity to push the envelope and experience the excitement of discovering new territory. Yes, Moretti and I serve as guides, advisors, and mentors in this process, but it is important to emphasize the truly collaborative nature of the enterprise.

Moretti and I do not have the answers, nor do we necessarily make up the questions. In the case of the most recent work, in fact, all the questions have come directly from the students themselves. A recent article by Amanda Chang and Corrie Goldman (“Stanford Students Use Digital Tools to Analyze Classic Texts“) captures the students’ role quite accurately. Our lab is based on the idea that any good question deserves to be pursued. Students or faculty may pose those questions, and teams evolve organically based on interest in the questions. The result, of course, is that we *all* learn a lot.

As a teenager, one of my favorite films was The Magnificent Seven, a remake of the Seven Samurai with gunslingers played by Yul Brynner, Steve McQueen, and other marquee names. The basic idea is that these gunslingers hire out to protect a small town being ravaged by bad guys. The excitement of the film comes as Brynner assembles his team of seven gunslingers, each with a special talent. They then train the local residents to defend themselves, and as they do, the villagers and the gunmen develop a deep sense of respect and admiration for each other.

Working with this group of students in the lab has been a similar experience (without the guns). The work quite literally could not be done without the team and each member of the team has brought a unique talent to the project. One student in the group is an accomplished coder, another has read most every key book in the corpus, another has a penchant for math, another loves research. They are the magnificent seven, and I have never had the pleasure of working with a more talented group: yes, they are students but for me they are already colleagues. I trust their judgements and have a profound respect, and sometimes awe, for what they already know.

Parry’s article contains bits and pieces of an interview he conducted with Yale Professor Katie Trumpener. Speaking of our work and of Moretti’s notion of “distant reading,” Trumpener apparently said the following:

But what happens when his “dullard” descendants take up “distant reading” for their research?

“If the whole field did that, that would be a disaster,” she says, one that could yield a slew of insignificant numbers with “jumped-up claims about what they mean.”

“Dullard”? Really? I do hope that Ms. Trumpener’s comment was somehow taken out of context here and that she will very quickly write to the Chronicle to set the record straight. Otherwise I fear that some less forgiving souls might conclude that Ms. Trumpener is one herself. . .

UPDATE: Over the weekend I received a clarification via email. Ms. Trumpener writes: “I was referring to Moretti’s potential methodological “descendants”–ie. those coming after him, even long after him, not his current team-mates.” Ms. Trumpener notes that when she was interviewed the discussion was not about our Literature Lab at Stanford, but about Moretti’s approaches and general matters of how she and Moretti approach the study of literature. Her comments were made in that context and not in the context of a discussion of the current work of the lab.

Stalker (R) and the journey of the Jockers iPhone

Lots of hoopla in the last few days over the discovery that the iPhone keeps a database of locations it has traveled. It wasn’t long before someone in the R community figured out how to tap into this file, and with a mere two lines of code you can visualize where your phone has been on a map.

The code library comes compliments of Drew Conway over on the r-bloggers page. I installed the app and within a few seconds had several maps of my recent travels. I attach two images below (don’t tell my mom).


Digital Humanities: Methodology and Questions

Students in our new Literature Lab doing what English Majors do!

Folks keep expressing concern about the future of the humanities, and the “need” for a next big thing. In fact, the title of a blog entry in the April 23, 2010 New York Times takes it for granted that the humanities need “saving.” The blog entry is a follow up to an article from March 31, which explores how some literary critics are applying scientific methodologies and approaches to the study of literature. Of course, this isn’t really new. One only needs to read a few back issues of Literary and Linguistic Computing to know that we’ve been doing this kind of work for a long time (and even longer if one wants to consider the approaches suggested by the Russian Formalists). What is new is that the mainstream humanities and the mainstream press are taking notice.

In her response to the article, Blakey Vermeule (full disclosure, her office is just three doors from mine) makes a key point to take away from the discussion. She writes: “The theory wars are long gone and nobody regrets their passing. What has replaced them is just what was there all along: research and scholarship, but with a new openness to scientific ideas and methods” (emphasis mine). Before explaining why this is the key take-away, a little story. . .

Not too long ago, a colleague took me aside and asked in all earnestness, “what do I need to do to break into this field of Digital Humanities?” This struck me as a rather odd question to ask: a very clear putting of the cart before the horse. Digital Humanities (DH) is a wide-stretching umbrella term that attempts to encompass and describe everything from new media theory and gaming to computational text analysis and digital archiving. There is a lot of room under the umbrella, and it really is, therefore, impossible to think of DH as a unified “field” that one can break into. In fact, there are no barriers to entry at all; the doors are wide open, come on in.

But, I’ll go out on a limb and argue that the DH community can be split into two primary groups: Group “A” is composed of researchers who study digital objects; Group “B” is composed of researchers who utilize digital tools to study objects (digital or otherwise). Group A, I would argue, is primarily concerned with theoretical matters and Group B with methodological matters. In reality, of course, the lines are blurry, but this is a workable distinction.

. . . What Vermeule and others are describing in the New York Times business falls most cleanly into Group B. But I would not describe this movement toward empirical methodologies as revolutionary: when interested in certain types of questions, an empirical methodology just makes good common sense. I came to utilize computation in my research not because the siren’s song of revolution was tempting me away from my dusty, tired, and antiquated approaches to literature. Rather, computational tools and statistical methods simply offered a way of asking and exploring the questions that I (and others such as those pictured above) have about the literary field. What has changed is not the object of study but the nature of the questions.

So, the answer to my colleague who asked what is needed to “break into this field of Digital Humanities” is simply this: questions, you need questions.

Who’s Your DH Blog Mate: Match-Making the Day of DH Bloggers with Topic Modeling

Social Networking for digital humanities nerds? Which DH bloggers are you most compatible with? Let’s get the right nerds with the right nerds–match making made in digital humanities heaven.

After seeing Stefan Sinclair’s Voyeuristic analysis of the Day of DH Blog posts, I wrote and asked him how to get access to the “corpus” of posts. He hooked me up, and I pre-processed the data with a few PHP scripts, ran an LDA topic modeling process, and then did some post-processing with R in order to see the most important themes of the day and to cluster the 117 bloggers based on their thematic similarity.

So, here’s the what and then the how. As for the why? Why not?

What:

117 Day of DH Bloggers

10 Unsupervised Topics (10 is arbitrary–I could have picked 100). These topics are generated by an analysis of the words and word sequences in the individual bloggers’ sites. The purpose is to harvest out the most prominent “themes” or topics. These themes are presented as a series of word lists. It is up to the researcher to then “label” the word clusters. I have labeled a few of them (in [brackets] at the beginning of the word lists below–you might use another label–this is the subjective part). Here they are:

  1. [human interaction in DH] work today people time working things email year week days bit good meeting tomorrow
  2. day thing mail dh de image based fact called things change ago encoding house
  3. [Academic Writing–including Grants] day time dh start post blog proposal google write great posts lunch nice articles
  4. [Digital publishing and archives] http talk future collection making online version publishing field morning life traditional daily large
  5. conference university blog morning read internet access couple computers archive involved including great written
  6. [DH Teaching] students dh teaching humanities class technology scholars university lab group library support scholarship student
  7. [DH Projects] digital project humanities work projects room meeting collections office building task database spent st
  8. data project xml working projects web interesting user set spend system ways couple time
  9. digital day humanities media writing post computing twitter english humanist real phd web rest
  10. [reading and text-analysis] book text tools software books today reading literary texts coffee edition search tool textual

Unfortunately, the Day of DH corpus isn’t truly big enough to get the sort of crystal clear topics that I have harvested from much larger collections, but still, the topics above, seen in aggregate, do give us a sense of what’s “hot” to talk about in the field.

But let’s get to the sexy part. . .

In addition to harvesting out the prominent topics, the modeling tool outputs data indicating how much (what proportion) of each blog is about each topic. The resulting matrix is of dimension 117×10 (117 blogs and 10 topics). The data in the cells are percentages for each topic in each author’s blog. The values in each row add up to 100%. With a little massaging in R, I read in the matrix and then use some simple distance and clustering functions to group the bloggers into 10 (again an arbitrary number) groups; groups based on shared themes. Using this data, I then output a matrix showing which authors have the most in common; thus, I do a little subtle match-making in advance of our digital rendezvous in London–birds of a feather blog together?
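For the curious, here is a minimal sketch of that grouping step, written in Python with scipy standing in for the R dist and hclust calls I actually used. The file name, the CSV layout, and the choice of Ward linkage are illustrative assumptions rather than a record of the original workflow.

# Sketch of the grouping step. Assumes a hypothetical "blog_topics.csv":
# one row per blogger, a name followed by 10 topic proportions.
import csv
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

names, rows = [], []
with open("blog_topics.csv", newline="") as f:
    for row in csv.reader(f):
        names.append(row[0])
        rows.append([float(x) for x in row[1:]])

X = np.array(rows)                                 # 117 x 10 matrix of proportions
Z = linkage(X, method="ward")                      # hierarchical clustering
groups = fcluster(Z, t=10, criterion="maxclust")   # cut the tree into 10 groups

for g in sorted(set(groups)):
    members = [n for n, label in zip(names, groups) if label == g]
    print("Group", g, ":", ", ".join(members))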

Here are the groups:

  • Group1
    1. aimeemorrison
    2. ariefwidodo
    3. barbarabordalejo
    4. caraleitch
    5. carlosmartinez
    6. carlwhithaus
    7. clairewarwick
    8. craigharkema
    9. ellimylonas
    10. geoffreyrockwell
    11. glenworthey
    12. guydaarmstrong
    13. henrietterouedcunliffe
    14. ianjohnson
    15. janrybicki
    16. jenterysayers
    17. jonbath
    18. juliaflanders
    19. juliannenyhan
    20. justinerichardson
    21. kai-christianbruhn
    22. kathleenfitzpatrick
    23. keithlawson
    24. lauramandell
    25. lauraweakly
    26. malterehbein
    27. matthewjockers
    28. meganmeredith-lobay
    29. melissaterras
    30. milenaradzikowska
    31. miranhladnik
    32. patricksahle
    33. paulspence
    34. peterrobinson
    35. pouyllau
    36. rafaelalvarado
    37. raysiemens
    38. reneaudet
    39. rogerosborne
    40. rudymcdaniel
    41. stanruecker
    42. stephanieschlitz
    43. susangreenberg
    44. victoriasmith
    45. vikazafrin
    46. williamturkel
  • Group2
    1. alejandrogiacometti
    2. annacaprarelli
    3. danasolomon
    4. ernestopriego
    5. karensmith
    6. leedurbin
    7. matthewcarlos
    8. paolosordi
    9. sarasteger
    10. stephanethibault
    11. yinliu
  • Group3
    1. alialbarran
    2. amandagailey
    3. cyrilbriquet
    4. federicomeschini
    5. ntlab
    6. stefansinclair
    7. torstenschassan
  • Group4
    1. aligrotkowski
    2. ashtonnichols
    3. calenhenry
    4. devonfitzgerald
    5. enricasalvatori
    6. ericforcier
    7. garrywong
    8. jameschartrand
    9. joelyuvienco
    10. johnnewman
    11. peterorganisciak
    12. shannonlucky
    13. silviarussell
    14. simonmahony
    15. sophiahoosein
    16. stevenhayes
    17. taraandrews
    18. violalasmana
    19. willardmccarty
  • Group5
    1. alunedwards
    2. hopegreenberg
    3. lewisulman
  • Group6
    1. amandavisconti
    2. jamessmith
    3. martinholmes
    4. sperberg-mcqueen
    5. waynegraham
  • Group7
    1. bethanynowviskie
    2. josephgilbert
    3. katherineharris
    4. kellyjohnston
    5. kirstenuszkalo
    6. margaretgraham
    7. matthewgold
    8. paulyoungman
  • Group8
    1. charlestravis
    2. craigbellamy
    3. franzfischer
    4. jeremyboggs
    5. johnwall
    6. kathrynbarre
    7. shawnday
    8. teresadobson
  • Group9
    1. jasonboyd
    2. jolanda-pieta
    3. joriszundert
    4. michaelmaguire
    5. thomascrombez
    6. williamallen
  • Group10
    1. louburnard
    2. nevenjovanovic
    3. sharongoetz
    4. stephenramsay

Twitterers @sramsay and @mattwilkens were poking around here today and wondered what the topics would look like if there were only five topics and five clusters instead of 10 and 10. Here are the topics:

  1. data work time text working tools people thing system xml mail software things texts
  2. day time morning lot work bit find web class teaching student days dh real
  3. digital humanities day tomorrow book twitter university blog computing reading books writing tei emails
  4. day dh today time post things write start online writing working computer year hours
  5. project digital work projects students meeting today people humanities dh scholars library year lab

And here are the Blogger-Mates clusters when I set n=5:

  • Group1
    1. aimeemorrison
    2. alejandrogiacometti
    3. alialbarran
    4. amandagailey
    5. annacaprarelli
    6. ashtonnichols
    7. barbarabordalejo
    8. carlosmartinez
    9. carlwhithaus
    10. clairewarwick
    11. craigbellamy
    12. craigharkema
    13. danasolomon
    14. devonfitzgerald
    15. enricasalvatori
    16. ernestopriego
    17. garrywong
    18. glenworthey
    19. guydaarmstrong
    20. henrietterouedcunliffe
    21. ianjohnson
    22. jameschartrand
    23. janrybicki
    24. jenterysayers
    25. joelyuvienco
    26. johnnewman
    27. jonbath
    28. juliannenyhan
    29. justinerichardson
    30. karensmith
    31. kathleenfitzpatrick
    32. keithlawson
    33. leedurbin
    34. lewisulman
    35. malterehbein
    36. matthewgold
    37. matthewjockers
    38. meganmeredith-lobay
    39. melissaterras
    40. michaelmaguire
    41. miranhladnik
    42. nevenjovanovic
    43. patricksahle
    44. peterrobinson
    45. raysiemens
    46. reneaudet
    47. rogerosborne
    48. shannonlucky
    49. silviarussell
    50. simonmahony
    51. sophiahoosein
    52. stefansinclair
    53. stephanieschlitz
    54. susangreenberg
    55. taraandrews
    56. thomascrombez
    57. torstenschassan
    58. vikazafrin
    59. violalasmana
    60. willardmccarty
    61. williamallen
    62. williamturkel
    63. yinliu
  • Group2
    1. aligrotkowski
    2. ariefwidodo
    3. calenhenry
    4. caraleitch
    5. charlestravis
    6. ericforcier
    7. geoffreyrockwell
    8. jolanda-pieta
    9. juliaflanders
    10. lauraweakly
    11. margaretgraham
    12. matthewcarlos
    13. milenaradzikowska
    14. nt2lab
    15. paolosordi
    16. peterorganisciak
    17. rudymcdaniel
    18. sarasteger
    19. sharongoetz
    20. stanruecker
    21. stevenhayes
    22. victoriasmith
  • Group3
    1. alunedwards
    2. hopegreenberg
    3. katherineharris
    4. stephanethibault
    5. teresadobson
  • Group4
    1. amandavisconti
    2. cyrilbriquet
    3. federicomeschini
    4. jamessmith
    5. joriszundert
    6. martinholmes
    7. rafaelalvarado
    8. sperberg-mcqueen
    9. stephenramsay
    10. waynegraham
  • Group5
    1. bethanynowviskie
    2. ellimylonas
    3. franzfischer
    4. jasonboyd
    5. jeremyboggs
    6. johnwall
    7. josephgilbert
    8. kai-christianbruhn
    9. kathrynbarre
    10. kellyjohnston
    11. kirstenuszkalo
    12. lauramandell
    13. louburnard
    14. paulspence
    15. paulyoungman
    16. pouyllau
    17. shawnday

Analyze This (Page)

“TAToo” is a fun Flash widget developed by Peter Organisciak at the University of Alberta. Peter works under the supervision of Digital Humanists Par Excellence and TAPoR Gurus Geoffrey Rockwell and Stan Ruecker. The widget (just some embeddable code) does “layman’s” text analysis on the web pages in which its code is embedded. I’ve added the code to the right sidebar of my blog to take it for a test drive.

The tool offers several “views” of your text. The default view is a word cloud in which words with greater frequency are both larger and more bold. Looking at the word cloud can give you a pretty quick sense of the page’s key terms and concepts.

By clicking on the “Tool:” bar at the top of the widget, you can select other options. The “List Words” view provides a term frequency list. You can then click on any word in the list to see its collocates. Alternatively, there are both a Collocate view and a Concordance view that allow users to enter a specific word and get information about the company that word keeps.

Kudos to Peter and the rest of the TAPoR Tools team for continuing to pursue the fine art of tool making.

65,000 Texts to Mine?

A story in the Feb. 7th issue of the Telegraph reports that the British Library is going to make 65,000 first edition texts available for public download via Amazon’s Kindle. This news is almost as exciting as Google’s decision some years ago to partner with a consortium of big libraries in order to digitize all their books. What makes this project from the British Library particularly exciting is that the texts being offered are all works of 19th century fiction.

Unlike the Google project that is digitizing everything, this offering from the BL is already presorted to include just the kind of content that literary researchers can really use. With Google, I assume, one is going to have to figure out how to sort the legal books from the cook books, the memoirs from the fiction. Here, however, the BL has already done a big part of the work.

It will be interesting to see how this material gets offered and what sort of metadata is included with the individual files. For those of us who are interested in corpus-mining and macroanalysis (as opposed to just reading a single book at a time) the metadata is crucial. If, for example, we have the publication date of each text in an easily extractable format (e.g. TEI XML) we could explore all kinds of chronological investigations.

In prior research, working with a corpus of just 250 19th century British novels, I explored the “theme” of childhood by quantifying the relative frequency of a “cluster” or “semantic field” of words suggestive of “childhood”. In that work, I discovered a proportionally higher incidence of the theme during the Victorian period, a finding that tends to confirm the idea that childhood was an “invention” of the Victorians. But, then again, a corpus of 250 novels doesn’t even scratch the surface.
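For readers who want to try something similar, here is a minimal sketch of that kind of measurement in Python, assuming a folder of plain-text novels whose file names begin with a publication year. The word list, the corpus layout, and the decade grouping are illustrative assumptions, not the actual cluster or corpus used in the study described above.

# Sketch: relative frequency of a "semantic field" across a dated corpus.
# The word list and the layout ("corpus/1847_title.txt") are illustrative.
import os
import re
from collections import defaultdict

childhood = {"child", "children", "boy", "girl", "infant", "nursery", "school"}

def field_frequency(text, field):
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = sum(1 for t in tokens if t in field)
    return hits / len(tokens) if tokens else 0.0

by_decade = defaultdict(list)
for fn in os.listdir("corpus"):
    if not fn.endswith(".txt"):
        continue
    year = int(fn[:4])  # publication year encoded at the start of the file name
    with open(os.path.join("corpus", fn)) as f:
        by_decade[(year // 10) * 10].append(field_frequency(f.read(), childhood))

for decade in sorted(by_decade):
    vals = by_decade[decade]
    print(decade, round(sum(vals) / len(vals), 5))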

I’m not sure just what’s included in the British Library’s 65,000 texts. I assume these are not just British texts, but American, German, etc. Franco Moretti has estimated that there were 8,000 to 10,000 novels published in Great Britain in the 19th century (20-40,000 works of prose fiction). Surely a good many of these are part of the BL’s 65,000. Which brings us back to the metadata question. Will it be possible to generate a list of which texts in the 65,000 are British-authored and British-published *novels*? If the answer is yes, then the game is on.

Get the texts, convert from mobi to pdf, html, or other text format using any number of open source apps and then poof! You’ve got a COUS–Corpus of Unusual Size! Of course, it’d be a lot easier if the BL would make the texts available (for researchers at least) through a channel that doesn’t involve Amazon or one of the eBook formats. I’m investigating that path now and will report on any progress.

Is it the Joyce Industry or the Shakespeare Industry?

At the recent Digital Humanities Conference in Maryland, Matthew Wilkens and I got into a discussion about famous authors and the “industries” of scholarship that their works have inspired (see Matt’s blog post about our discussion and his survey analysis of the MLA bibliography).

The first time I ever heard the term “industry” used in this context was in reference to the scholarship generated by Joyce’s novel Ulysses. As Joyce himself predicted (bragged) the book would keep scholars busy for centuries to come, and, of course, Joyce was right–well maybe not centuries, but you get the idea. But can we really compare the Shakespeare “industry” to the Joyce “industry” given that the Bard had such a significant head start in terms of establishing his scholarly “fan base”?

Using the MLA bibliography, Matthew W. took a stab at this and compiled some rough figures of recent scholarship on the two masters. By Matt’s count, since 1923, Joyce has inspired just 9315 citations to Shakespeare’s massive 35,489.

But there is an obvious problem here: the figures begin in 1923 and Ulysses, the book that really puts Joyce on the map, was only published in 1922. So Joyce is getting into the industry-building business a bit late. Clearly we must do some norming here to account for the Bard’s head start.

Now, since I am pretty sure that I owe Matt a beer if the Bard has a bigger industry, I think some well thought out math is warranted here:-) . . .

Shakespeare dies in 1616 and Joyce dies in 1941. Subtracting each death date from the last year of Matt’s analysis (2008) means that Shakespeare had 392 years to develop his industry and Joyce only 67 years. If we divide the total number of citations Matt found in the MLA bibliography by the total number of industry-building years, then the figures tell a very different story. Joyce averages 139 citations per year whereas Shakespeare manages only a paltry 90.5.

But wait, there’s more. . . Querying the MLA bibliography using the search terms “shakespeare and hamlet” results in 4079 citations. A similar query for “joyce and ulysses” returns 3269. Normed for years of industry-building time these figures tell a sad, sad tale for the man from Stratford. Ulysses inspires 48.8 citations per year and Hamlet a meager 10.4.
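For anyone who wants to check the arithmetic, the normalization above amounts to a few lines of Python using the figures already quoted.

# The per-year normalization, using the counts and dates quoted above.
span_shakespeare = 2008 - 1616   # 392 industry-building years
span_joyce = 2008 - 1941         # 67 industry-building years

print(35489 / span_shakespeare)  # ~90.5 citations per year (Shakespeare)
print(9315 / span_joyce)         # ~139 citations per year (Joyce)
print(4079 / span_shakespeare)   # ~10.4 citations per year (Hamlet)
print(3269 / span_joyce)         # ~48.8 citations per year (Ulysses)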

In this sense, the Bard can be thought of as the steady industrial giant. His stock increases little by little, and he is a generally good investment. For the sake of convenience, let’s call him “GM.”

Joyce, on the other hand, is a relative newcomer to the marketplace. He is more like a Silicon Valley startup and his stock starts off slow and then sky-rockets. For the sake of convenience, we’ll call him “Google.”

Now, getting back to the central question, who’s bigger. . . You’ll find the answer here.

Machine-Classifying Novels and Plays by Genre

In the post that follows here, I describe some recent experiments that I (and others) have conducted. The goal of these experiments was to accurately machine-classify novels and plays (Shakespeare’s) by genre. One of the most interesting results ends up having more to do with feature extraction than with the classification algorithm.

Background

Several weeks ago, Mike Witmore visited the Beyond Search workshop that I organize here at Stanford. In prior work, Witmore and some colleagues utilized a program called Docuscope (developed at Carnegie Mellon) to distinguish between and classify (statistically) Shakespeare’s histories and comedies.

“Equipped with a specialized dictionary, Docuscope is able to divide texts into strings of words that are then sorted into one of eighteen word categories, such as “Inner Thinking” and “Past Events.” The program turns differentiating amongst genres into a statistical task by testing the frequency of occurence of words in each of the categories for each individual genre and recognizing where significant differences occur.”

Docuscope was designed as a tool for analyzing student writing, but Witmore (et al.) discovered that it could also be employed as a specialized sort of feature extraction tool.

To test the efficacy of Docuscope as a tool for detecting and clustering novels by genre, Franco Moretti and I created a full-text corpus that included 36 19th century novels (stripped of title pages and other identifying information). We divided this corpus into three groups and organized them by genre:

  • Group one consisted of 12 texts belonging to 3 different (but fairly similar) genres (gothic, historical tale, and national tale)
  • Group two consisted of 12 texts belonging to 3 different genres that were quite different (industrial, silver-fork, bildungsroman).
  • Group three consisted of 12 texts belonging to 6 different genres that mix 3 genres from those already included in group one or two and 3 new genres (evangelical, newgate, and anti-jacobin).

Witmore was given this corpus in electronic form (each novel in plain text). For identification purposes (since Mike was not privy to the actual genres or titles of the novels), he labeled each of the 12 genre groups with a number 1-12. Witmore’s numberings correspond to genres as follows:

  1. Gothic
  2. Historical Novels
  3. National Tales
  4. Industrial Novels
  5. Silver-Fork Novels
  6. Bildungsroman
  7. Anti-Jacobin
  8. Industrial
  9. Gothic
  10. Evangelical
  11. Newgate
  12. Bildungsroman

Using Docuscope, Witmore ran a series of tests in an attempt to cluster the similar genres together. The experiment was designed to pick the three groups from 7-12 that have genre cognates in 1-6. Witmore’s results for the closest affiliated genres were impressive:

  • 2:9 (Historical with Gothic)
  • 1:9 (Gothic with Gothic) Witmore notes that this 2nd cluster was a close (statistically) second to the above
  • 4:8 (Industrial with Industrial)
  • 6:12 (Bildungsroman with Bildungsroman)

Witmore’s results also suggested an especially close relationship between the Gothic and the Historical; Witmore writes that “groups 1 and 2 looked like they paired with the same candidate group (9).”

Additional Experiments

All of this work Witmore had done and the results he derived got me thinking more completely about the problem of genre classification. In many ways, genre classification is akin to authorship attribution. Generally speaking, though, with authorship problems one attempts to extract a feature set that excludes context-sensitive features from the analysis. (The consensus in most authorship attribution research suggests that a feature set made up primarily of frequent, or closed-class, word features yields the most accurate results.) For genre classification, however, one would intuitively assume that context words would be critical (e.g. Gothic novels often have “castles,” so we would not want to exclude context-sensitive words like “castle”). But my preliminary experiments have suggested just the opposite, namely that a distinct and detectable genre “signal” may be derived from a limited set of high-frequency features.

Using just 42 word and punctuation features, I was able to classify the novels in the corpus described above equally as well as Witmore did using Docuscope (and a far more complex feature set). To derive my feature set, I lowercase the texts, count and convert to relative frequency the various feature types, and then winnow the feature set by choosing only those features that have a mean relative frequency of 3% or greater. This results in the following 42 features (the prefix “p_” indicates a punctuation token instead of a word token):

“a”, “all”, “an”, “and”, “as”, “at”, “be”, “but”, “by”, “for”, “from”, “had”, “have”, “he”, “her”, “his”, “i”, “in”, “is”, “it”, “me”, “my”, “not”, “of”, “on”, “p_apos”, “p_comma”, “p_exlam”, “p_hyphen”, “p_period”, “p_quote”, “p_semi”, “she”, “that”, “the”, “this”, “to”, “was”, “were”, “which”, “with”, “you”
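For readers who want to experiment with this kind of winnowing themselves, here is a rough Python sketch of the step just described; the tokenization, the punctuation mapping, and the threshold value are simplified stand-ins for illustration, not my actual code. Clustering (which I do in R, as noted below) would then operate on the per-text frequencies of the surviving features.

# Sketch of the winnowing step: lowercase each text, compute per-text
# relative frequencies, and keep features whose mean relative frequency
# across the corpus clears a threshold. Details are illustrative.
import re
from collections import Counter

PUNCT = {",": "p_comma", ".": "p_period", ";": "p_semi", "!": "p_exlam",
         "-": "p_hyphen", "'": "p_apos", '"': "p_quote"}

def relative_freqs(text):
    tokens = re.findall(r"""[a-z]+|[,.;!'"-]""", text.lower())
    tokens = [PUNCT.get(t, t) for t in tokens]
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def winnow(texts, threshold):
    freqs = [relative_freqs(t) for t in texts]
    features = set().union(*freqs)
    keep = []
    for feat in features:
        mean = sum(f.get(feat, 0.0) for f in freqs) / len(freqs)
        if mean >= threshold:
            keep.append(feat)
    return sorted(keep), freqs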

Using the “dist” and “hclust” functions in the open-source “R” statistics application, I cluster the texts and output the following dendrogram:

These results were compelling, and after I shared them with Mike Witmore, he suggested testing this methodology on his Shakespeare corpus. Again the results were compelling, and this process accurately clustered the majority of Shakespeare’s plays into appropriate clusters of “tragedy,” “comedy,” and “history.” The dendrogram below shows the results of my Shakespeare experiment using these 37 features:

“a”, “and”, “as”, “be”, “but”, “for”, “have”, “he”, “him”, “his”, “i”, “in”, “is”, “it”, “me”, “my”, “not”, “of”, “p_apos”, “p_colon”, “p_comma”, “p_exlam”, “p_hyphen”, “p_period”, “p_ques”, “p_semi”, “so”, “that”, “the”, “this”, “thou”, “to”, “what”, “will”, “with”, “you”, “your”.

These initial tests raise a number of important questions, not the least of which is the question of how much of a factor genre plays in determining the usage of high frequency word and punctuation tokens. We have plans to conduct a series of more rigorous experiments, and the results of these tests will be forthcoming. In the meantime, my initial tests appear to confirm, again, the significant role that common function words play in defining literary style.

Chronicle of Higher Education Article

This week the Chronicle of Higher Education ran an article written by Jennifer Howard about “literary geospaces.” The article featured some work I have done mapping Irish-American literature using Google Earth (and also profiled the work of Janelle Jenstad who has been mapping early modern London).

Photo of Jockers with Google Earth by Noah Berger

The bit about my Google Earth/Irish-American literature mash-up resulted in several emails from folks wanting to know more about the project and more specifics about my findings. . . beware what you ask for. . .

I began building a bibliographic database of Irish-American literature many years ago when I was working on my dissertation (Jockers, Matthew L. “In search of Tir-Na-Nog: Irish and Irish-American Literature in the West.” Southern Illinois University, 1997). In 2002 I received a grant from the Stanford Humanities Laboratory to fund a web project called “The Irish-American West.” At that point I moved the database into MySql and put the whole thing on line with a search interface. As part of the grant, I also began digitizing and putting on line a number of specific Irish-American novels from the west. All of this work was later moved to the web site of the Western Institute of Irish Studies, a non-profit that I helped establish with then Irish Consul Donal Denham and a few other Bay Area enthusiasts. The archive and the database are alive and well at the Institute, and each year students who take my Introduction to Humanities Computing course help the archive grow by encoding one or two more full texts. (The group projects my students complete each year can be found on my courses page)

Ironically, on St. Patrick’s day in 2007, I was invited to present a paper at the 2007 MLA meeting in Chicago as part of a panel session titled “Literary Geospaces.” The paper I delivered “Beyond Boston: Georeferencing Irish-American Literature” utilized Google Earth to help the audience visualize both the landscape and chronology of Irish-American literary history. I warned the audience at the time not to be seduced by the incredible visual appeal of Google Earth; GE is a stunning application, and I was honestly worried that my audience would lose track of my central thesis about the literary history of Irish-America if they got too caught up in the visualization of the data. I was also worried about the amount of time that went into the preparation of the Google Earth mash-up. The MLA is a meeting of literature and language professors, and I didn’t want to give the impression that putting something like this together was a simple matter (along with the Google Earth app itself, I’d utilized php, xml, xsl, html, and Mysql to build the .kml file that runs the whole show).

The central thesis of the paper was that in order to understand Irish-American literature we need to look not simply to the watershed moments of Irish-American history, but we must look to the very geography of America. As long ago as 1997, my research had shown that the Irish experience in America was largely determined by place. It’s true, of course, that the time of immigration to the U.S. was important in coloring the Irish experience: were these pre-famine immigrants, famine refugees, or the so-called “commuter Irish” of the 1980s? But I discovered that equally important to chronology was place and the business of where the immigrants settled. For my research, I divided the country up into a number of regions (Midwest, Mountain, Southwest, Pacific. . .) and each one of these regions turned out to have a distinct “brand” of Irish-American writing. Generally speaking, though, the further west we go the more likely we are to find writers describing the Irish-American experience in positive terms. And perhaps more importantly, the further west we go the more Irish writing there seems to be, if we view “more” in relative terms, as a percentage of the Irish population.

I suppose one of the most interesting things I discovered along the way involves what was happening in the early part of the 20th century. My colleague Charles Fanning has speculated that in the early 1900s, from around 1900 to 1930, Irish-Americans turned away from writing about their experience in the United States. These were difficult times for Irish-Americans, and Fanning writes in his impressive book The Irish Voice in America how “a number of circumstances–historical, cultural, and political, including the politics of literature–combined to [create] a form of wholesale cultural amnesia (3).”

What I discovered was that Irish writers in the western U.S. were largely undeterred.

And this all made perfectly good sense: Irish writers in the West did not have to face the same prejudice that their counterparts in the East faced. There was no established Anglo-Protestant majority in the West, there was far less competition for good jobs, and generally speaking the Irish who ventured west were better off and typically better educated than their countrymen in the East. Thus they had more means and more opportunity for writing. So if we look at the entire corpus we find not a period of literary recession in the early 1900s, but instead a period of heightened activity. It’s only when we probe that activity that we discover that writers from west of the Mississippi are the ones being active.

Here is a link to a Quicktime video of the Google Earth mash-up. I’m still working on setting up an interactive version that will query my database dynamically and allow visitors to sort and probe the entire collection. . . more on that later.

POS Tagging XML with xGrid and the Stanford Log-linear Part-Of-Speech Tagger

Recently (4/2008) I had reason to Part-Of-Speech tag a whole mess of novels, around 1200. I installed the Stanford Tagger and ran my first job of 250 novels on an old G4 under my desk. Everything worked fine, but the job took six days. After that experience, I figured out how to utilize xGrid for “distributed” tagging, or what I’ll call, according to convention, “Tagging@Home.” At the time that I was working on this tagging project, the folks in Stanford’s NLP group, especially Chris Manning and Anna Rafferty, were improving the tagger and adding some XML functionality to the program. I’m very grateful to Chris and Anna for their work. What follows is a practical guide for those who might wish to employ the tagger for use with XML or who might want to understand how to set up xGrid to help distribute a big tagging job.

First I provide some simple examples showing how to take advantage of the XML functionality added in the May 19, 2008 release of the Stanford Log-linear Part-Of-Speech Tagger. Further down in this page I include information about setting up xGrid to farm out a large tagging job to a network of Macs, useful if you want to POS tag a large corpus. These examples assume that you have installed the tagger and understand how to invoke the tagger from the cmd line. If you are not yet familiar with the tagger, you should consult the ReadMe.txt file and javadoc that come with it. In the javadoc, see specifically the link for “MaxentTagger” where there is a useful “parameter description” table.

Example One: Tagging XML with the Stanford Log-linear Part-Of-Speech Tagger

Many texts are currently available to us in XML format, and in literary circles the most common flavor of XML is TEI. In this example we will POS tag a typical TEI encoded XML file.

Begin by examining the structure of your source XML file to determine which XML elements “contain” the content that you wish to POS tag. More than likely, you don’t want to tag the title page, for example, but are interested primarily in the main text. To complete the exercises below, you may want to download a shortened version of James McHenry’s The Wilderness, which is the text I use in the examples; alternatively, you may use your own .xml file. The example file “wilderness.xml” is marked up according to TEI standards and thus contains two major structural divisions: “teiHeader” and “text.” The “teiHeader” element contains metadata about the file and the “text” element contains the actual text that has been marked up. For purposes of this example, I shortened the text to include only the first two chapters.

For this example, I assume that you wish to POS tag the text portions of the book that are contained in the main “body” of the book, that is to say, you are not interested in POS tagging the title page(s) or any ancillary material that may come before or after the primary text. To separate the main body of the text, TEI uses a “body” element, so we might begin by having the tagger focus only on the text that occurs within the body element.

If we were simply tagging a plain text file, such as what you might find at Project Gutenberg, the usual command (as found in the tagger “readme.txt” file) would be as follows:


java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/bidirectional-wsj-0-18.tagger -textFile sample-input.txt > sample-output.txt

To deal with XML based input, however, we need to add another parameter indicating to the tagger that the source file is XML and more importantly that we only want to tag the contents of a specific XML element (or several specific elements, which is also possible). The additional parameter is “-xmlInput” and the parameter is followed by a space delimited list of XML tags whose content we want the POS tagger to tag. The revised command for tagging just the contents of the “body” element would look like this:


java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/bidirectional-wsj-0-18.tagger -xmlInput body -textFile ./pathToYourDirectory/Wilderness.xml > ./pathToYourDirectory/wilderness-output.xml

When invoked in this manner, the tagger grabs the content of the body tag for processing and ignores (or strips out) any XML markup contained within the selected tag. Running the command above thus has the effect of stripping out all of the rich markup for chapters, paragraphs, etc. In order to preserve more of the structural markup, a slightly better approach is to use the “p” tag instead of the “body” tag, as follows:


java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/bidirectional-wsj-0-18.tagger -xmlInput p -textFile ./pathToYourDirectory/Wilderness.xml > ./pathToYourDirectory/wilderness-output.xml

If you run this command, you’ll be a bit closer, but you might notice that our source file includes a series of poetic epigraphs at the beginning of chapters and a periodic bit of poetry dispersed elsewhere throughout the prose. Using just the “p” tag above, we fail to POS tag the poetic sections. Fortunately, the tagger allows us to specify more than one POS tag in the command, and we can thus POS tag both “p” tags and “l” tags (which contain lines of poetry) *see note*:


java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/bidirectional-wsj-0-18.tagger -xmlInput p\ l -textFile ./pathToYourDirectory/Wilderness.xml > ./pathToYourDirectory/wilderness-output.xml

Complete execution time for the command above using McHenry’s Wilderness on a typical desktop computer is about 30-40 seconds, but the processing time will vary depending on your machine and the speed of your processor. Here is a snippet of what the resulting output will look like:


<l>As/IN slow/JJ our/PRP$ ship/NN her/PRP$ foamy/NN track/NN ,/,</l>
<l>Against/IN the/DT wind/NN was/VBD cleaving/VBG ,/,</l>
<l>Her/PRP$ trembling/VBG pendant/JJ still/RB look/VB 'd/NNP back/RB</l>
<l>To/TO that/DT dear/RB isle/VB 'twas/NNS leaving/VBG ;/:</l>
<l>So/RB loth/NN we/PRP part/VBP from/IN all/DT we/PRP love/VBP ,/,</l>
<l>From/IN all/DT the/DT links/NNS that/IN bind/NN us/PRP ,/,</l>
<l>So/RB turn/VB our/PRP$ hearts/NNS where'er/VBP we/PRP rove/VBP ,/,</l>
<l>To/TO those/DT we/PRP 've/VBP left/VBN behind/IN us/PRP !/. Moore/NNP</l>

<p>Let/VB melancholy/JJ spirits/NNS talk/VBP as/IN they/PRP please/VBP concerning/VBG the/DT
degeneracy/NN and/CC increasing/VBG miseries/NNS of/IN mankind/NN ,/, I/PRP will/MD not/RB
believe/VB them/PRP ./. They/PRP have/VBP been/VBN speaking/VBG ill/JJ of/IN themselves/PRP ,/,
and/CC predicting/VBG worse/JJR of/IN their/PRP$ posterity/NN ,/, from/IN time/NN immemorial/JJ ;/:
and/CC yet/RB ,/, in/IN the/DT present/JJ year/NN ,/, 1823/CD ,/, when/WRB ,/, if/IN the/DT one/CD
hundreth/NN part/NN of/IN their/PRP$ gloomy/JJ forebodings/NNS had/VBD been/VBN realized/VBN ,/,
the/DT earth/NN must/MD have/VB become/VBN a/DT Pandemonium/NN ,/, and/CC men/NNS something/NN
worse/JJR than/IN devils/NNS ,/, -LRB-/-LRB- for/IN devils/NNS they/PRP have/VBP been/VBN long/JJ
ago/RB ,/, in/IN the/DT opinion/NN of/IN these/DT charitable/JJ denunciators/NNS ,/, -RRB-/-RRB-
I/PRP am/VBP free/JJ to/TO assert/VB ,/, that/IN we/PRP have/VBP as/IN many/JJ honest/JJ men/NNS ,/,
pretty/RB women/NNS ,/, healthy/JJ children/NNS ,/, cultivated/VBN fields/NNS ,/, convenient/JJ
houses/NNS ,/, elegant/JJ kinds/NNS of/IN furniture/NN ,/, and/CC comfortable/JJ clothes/NNS ,/,
as/IN any/DT generation/NN of/IN our/PRP$ ancestors/NNS ever/RB possessed/VBN ./.</p>

*Note that after the -xmlInput parameter we include the “p” tag and the “l” tag separated by an *escaped* space character. UNIX chokes on the space character if we don’t escape it with a backslash. Martin Holmes pointed out to me that if you are using Windows, you should put the space-delimited tags inside quotes (like this: “p l”) and disregard the backslash.
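For Windows users, then, the same “p” and “l” command from above would presumably look something like the following (a sketch I have not tested myself; the quoted tag list replaces the escaped space, and the paths remain placeholders as before):


java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/bidirectional-wsj-0-18.tagger -xmlInput "p l" -textFile ./pathToYourDirectory/Wilderness.xml > ./pathToYourDirectory/wilderness-output.xml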

Example Two: POS Tagging an XML source file and outputting results in XML

In addition to being able to “read” XML, the 5-19-2008 tagger release also includes functionality allowing users to output POS tagged results as well-formed XML. The process for doing so is very much the same as in Example One above; to output XML, however, we need to add an additional “-xmlOutput” parameter to the command.


java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/bidirectional-wsj-0-18.tagger -xmlInput p\ l -xmlOutput -textFile ./pathToYourDirectory/Wilderness.xml > ./pathToYourDirectory/wilderness-output.xml

In this case, the resulting output is a far richer XML in which each sentence is wrapped in a “sentence” tag and each word in a “word” tag. The sentence tag includes an “id” attribute indicating its number in the sequence of sentences in the entire document. Likewise, the word element contains an “id” attribute referencing the word’s position in the sentence, followed by a “pos” attribute indicating what part of speech the tagger assigned to the word. Here is a sample of the output:


<p>
<sentence id="17">
<word id="0" pos="VB">Let</word>
<word id="1" pos="JJ">melancholy</word>
<word id="2" pos="NNS">spirits</word>
<word id="3" pos="VBP">talk</word>
<word id="4" pos="IN">as</word>
<word id="5" pos="PRP">they</word>
<word id="6" pos="VBP">please</word>
<word id="7" pos="VBG">concerning</word>
<word id="8" pos="DT">the</word>
<word id="9" pos="NN">degeneracy</word>
<word id="10" pos="CC">and</word>
<word id="11" pos="VBG">increasing</word>
<word id="12" pos="NNS">miseries</word>
<word id="13" pos="IN">of</word>
<word id="14" pos="NN">mankind</word>
<word id="15" pos=",">,</word>
<word id="16" pos="PRP">I</word>
<word id="17" pos="MD">will</word>
<word id="18" pos="RB">not</word>
<word id="19" pos="VB">believe</word>
<word id="20" pos="PRP">them</word>
<word id="21" pos=".">.</word>
</sentence>
</p>
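One handy thing about this richer output is that it is easy to query. For example, a quick-and-dirty way to see which parts of speech dominate the text is a short shell pipeline like the following (just a sketch, and it assumes the “wilderness-output.xml” file produced by the command above):


grep -o 'pos="[^"]*"' ./pathToYourDirectory/wilderness-output.xml | sort | uniq -c | sort -rn | head -20

Each line of the result is a count followed by a pos value, so the most frequent tags float to the top.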

Using the Stanford Log-linear Part-Of-Speech Tagger with xGrid

There is really nothing special about using the tagger on an xGrid, but since it took me six hours to figure out how to do it and how to set everything up, I provide below a basic setup guide that will have you tagging in ten minutes (give or take an hour).

Not being a sys admin, I found the available documentation about xGrid a bit tricky to navigate, and frankly it just didn’t address what I feel to be rather fundamental questions about exactly how xGrid does what it does. More useful than Apple’s own xGrid documentation were Charles Parnot’s xGrid tutorials, available here (the Apple manual even refers to these). Especially useful is Parnot’s “GridStuffer” application (more on that in a minute). The one problem I had with Parnot’s tutorials is that they all assumed that I was using a single machine as Client, Controller, and Agent. In xGrid lingo, the Client is the machine that submits a job to the Controller. The Controller is where xGrid “lives”; it serves as the distribution point for sending a job out to the “Agents.” Agents are the machines that are enlisted to do the heavy lifting, that is, they are all the machines on the network that are signed up to parallel process your job.

Example Three: POS Tagging with xGrid from the Command Line

To make life easy, the first thing I did was to figure out how to submit the job to xGrid from the cmd line. In this case I had ssh access to the server hosting the controller, so I installed the Tagger in a folder I created inside /Users/Shared/. The full path to my installation of the tagger was “/Users/Shared/Tagger/stanford-postagger-full-2008-05-19/”. Once this was done, I cd’ed (changed directory) down into this directory and then entered the following cmd at the prompt (you would substitute “hostname” and “password” with your own information, e.g., -h myserver.mydomain.edu -p myPW):

xgrid -h <hostname> -p <password> -job submit /usr/bin/java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -arch bidirectional -lang english -model models/bidirectional-wsj-0-18.tagger -textFile sample-input.txt

You will notice one thing about the tagger portion of this cmd that is different: there is no redirect of the output. Above, when we were just tagging files without xGrid, the cmd ended like this: “-textFile sample-input.txt > sample-output.txt”, telling the program to print its output to a new file titled “sample-output.txt”. xGrid has its own way of handling “results” or stdout; xGrid saves the results of a job on its own and then allows us to retrieve the results using a job id.

So, after submitting the cmd above, xGrid returns some important information to the terminal, something like this:


{
jobIdentifier = 2822;
}

This tells us that our job is id number 2822, and we’ll need that number in order to “return” the results after the job has finished. But after we submit the job we can also check on its status by entering the following cmd:


xgrid -h <hostname> -p <password> -job attributes -id 2822

That command returns something like this:


{
jobAttributes = {
activeCPUPower = 0;
applicationIdentifier = "com.apple.xgrid.cli";
dateNow = 2008-05-23 15:15:16 -0700;
dateStarted = 2008-05-23 15:14:45 -0700;
dateStopped = 2008-05-23 15:14:49 -0700;
dateSubmitted = 2008-05-23 15:14:41 -0700;
jobStatus = Finished;
name = "/usr/bin/java";
percentDone = 100;
taskCount = 1;
undoneTaskCount = 0;
};
}

jobStatus here tells us whether the job is running, has finished, or has failed. This job has finished, and so I enter the next cmd to retrieve the results:


xgrid -h <hostname> -p <password> -job results -id 2822

This command tells xGrid to print the results of the job to my Terminal window, which isn’t all that great if I’ve just tagged a big file. What I really want is to send the result to a new file, so I can modify the cmd above to redirect the output to a file, like this:


xgrid -h <hostname> -p <password> -job results -id 2822 > sample-output.txt

You’ll now find a nicely POS-tagged file titled “sample-output.txt” in your working directory.

Example Four: POS Tagging with xGrid Using Charles Parnot’s GridStuffer

You can download a copy of Charles Parnot’s GridStuffer at the xGrid@Stanford website. GridStuffer is a slick little application that greatly simplifies the work involved in putting together a large xGrid job. You should read Parnot’s GridStuffer Tutorial to understand what GridStuffer does and how it works. Everything that I do here is more or less exactly what is done in the tutorial... well, not exactly. The real point of difference pertains to the input file that contains all the “commands.” Because the Stanford Tagger is a Java application (and not quite like Parnot’s Fasta program), I had to figure out how to articulate the command lines for the GridStuffer input file. In retrospect it all seems very logical and simple, but trying to move from Parnot’s example to my own case actually proved quite challenging. In fact, it was only by reading through one of the forums that I discovered a key missing piece to my puzzle. Other users had reported problems calling Java applications, and a common thread had to do with paths and file locations...

The real trick (and it’s no trick once you understand how xGrid and GridStuffer work) involves how you define your paths and where you place files. I spent a lot of time trying to figure this all out, and even Parnot admits that understanding which paths are relative and which paths are absolute can get challenging. Let me explain. With GridStuffer, you are not, as I did above, logging into the Controller machine (server) and running an xGrid command locally. Instead, you are running GridStuffer on your own machine, separate from the Controller machine, and thus acting as a true “Client.” This makes one wonder, right from the start, whether you need to be concerned about file paths on your machine, on the controller, or on both. The answer is, “yes.”

Now there are always many ways to skin the cliché, so don’t assume that the way I set things up is the only approach or even the best approach; it is, however, an approach that works. I began by installing the tagger on my local machine (the machine that would serve as the Client and from which I would launch and invoke GridStuffer). I installed the tagger in my “Shared” directory at the following path: /Users/Shared/tagger/stanford-postagger-full-2008-05-19
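To make that layout concrete, here is a rough shell sketch of the setup I describe here and in the paragraph that follows (the paths are simply where I chose to put things; only the relative structure matters):


mkdir -p /Users/Shared/tagger
cd /Users/Shared/tagger
# unpack the stanford-postagger-full-2008-05-19 distribution into this folder, then:
cd stanford-postagger-full-2008-05-19
mkdir input
# copy the TEI XML files you want to tag into "input"; the GridStuffer
# "commands" file will also live in this top-level tagger directory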

Inside of this root folder of the tagger (stanford-postagger-full-2008-05-19), I then added another folder titled “input” into which I copied all of the files that I wanted to tag (in this case several hundred novels marked up in TEI XML). Next I created the “input file” (or “commands”) that GridStuffer requires and, most importantly, I put it into the exact same directory (stanford-postagger-full-2008-05-19). Well, the truth is that I did not actually do this at first and spent a good deal of time trying to figure out why the program was choking. In Parnot’s tutorial, he has you store these files in a folder on your Desktop. This practice apparently works just fine with his Fasta tutorial, but it makes a Java app like the POS Tagger (and others) choke. Moving the input file to the root directory of the Tagger application solved all my problems (well, almost all of them). Anyhow, not only is the placement of this file important, but this is the critical file in the entire shebang; it is the file with *all* the calls to the tagger. Here is a snippet from my file:


/usr/bin/java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -arch bidirectional -lang english -model models/bidirectional-wsj-0-18.tagger -xmlInput p -textFile input/book.3.xml
/usr/bin/java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -arch bidirectional -lang english -model models/bidirectional-wsj-0-18.tagger -xmlInput p -textFile input/book.4.xml
/usr/bin/java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -arch bidirectional -lang english -model models/bidirectional-wsj-0-18.tagger -xmlInput p -textFile input/book.5.xml
/usr/bin/java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -arch bidirectional -lang english -model models/bidirectional-wsj-0-18.tagger -xmlInput p -textFile input/book.6.xml

Now for my big job of several hundred novels, I didn’t actually cut and paste each of these. I wrote a simple script to iterate through a directory of files and then write this “commands” file. I used a little php script run from the command line of the terminal; you could use Perl or Python or some other language for the same purpose. Note in the code above that I am tagging XML files and selecting the contents of the “p” elements just as I did in the tagger examples above.
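For what it’s worth, here is a minimal sketch of the same idea in plain shell rather than PHP (run it from the tagger’s root directory; the “commands.txt” name is just a placeholder for whatever file you point GridStuffer at):


#!/bin/sh
# write one tagger invocation per XML file in "input" into a GridStuffer commands file
for f in input/*.xml; do
  echo "/usr/bin/java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -arch bidirectional -lang english -model models/bidirectional-wsj-0-18.tagger -xmlInput p -textFile $f"
done > commands.txt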

It took me about six bangs of my head on the desk to figure out that I needed to provide the full path to “java” (i.e., /usr/bin/java). That was the only other counterintuitive bit. With this as my input file, I then selected an output directory using the handy GridStuffer interface and voila! I was soon tagging away.