Academic, Commentary, Text-Mining

Revisiting Chapter Nine of Macroanalysis

Back when I was working on Macroanalysis, Gephi was a young and sometimes buggy application. So when it came to the network analysis in Chapter 9, I was limited in terms of the amount of data that could be visualized. For the network graphs, I reduced the number of edges from 5,660,695 down to 167,770 by selecting only those edges where the distances were quite close.

Gephi can now handle one million edges, so I thought it would be interesting to see how/if the results of my original analysis might change if I went from graphing 3% of the edges to 18%.

Readers familiar with my approach will recall that I calculated the similarity between every book in my corpus using euclidean distance. My feature set was a combination of topic data from the topic model discussed in chapter 8 and the stylistic data explored in chapter 6. Basically, every single book was compared to every other single book using the euclidean formula, the output of which is a distance matrix where the number of rows and the number of columns is equal to the number of books in the corpus. The values in the cells of the matrix are the computed euclidean distances.

If you take any single row (or column) in the matrix and sort it from smallest to largest, the smallest value will always be a 0 and that is because the distance from any book to itself is always zero. The next value will be the book that has the most similar composition of topics and style. So if you select the row for Jane Austen’s Pride and Prejudice, you’ll find that Sense and Sensibility and other books by Austen are close by in terms of distance. Austen has a remarkably stable style across her novels and the same topics tend to appear across her books.

For any given book, there are a handful of books that are very similar (short distances) and then a series of books that are fairly similar and then whole bunch of books that have little to no similarity. Consider the case of Pride and Prejudice. Figure 1 shows the sorted distances from Pride and Prejudice to the 35 most similar books in the corpus. You’ll notice there is a “knee” in the line right around the 7th book on the x-axis. Those first seven book are very similar. After that we see books becoming more and more distant along a fairly regular slope. If we were to plot the entire distribution, there would be another “knee” where books become incredibly dissimilar and the line shoots upward.

In chapter 9 of Macroanalysis, I was curious about influence and the relationship between individual books and the other books that were most similar to them. To explore these relationships at scale, I devised an ad hoc approach to culling the number of edges of interest to only those where the distances were comparatively short. In the case of Pride and Prejudice, the most similar books included other works by Austen, but also books stretching into the future as far as 1886. In other words, the most similar books are not necessarily colocated in time.

I admit that this culling process was not very well described in Macroanalysis and there is, I see now, one error of omission and one outright mistake. Neither of these impacted the results described in the book, but it’s definitely worth setting the record straight here. In the book (page 165), I write that I “removed those target books that were more than one standard deviation from the source book.” That’s not clear at all, and it’s probably misleading.

For each book, call it the “base” book, I first excluded all books published in the same year or before the publication year of the base book (i.e. a book could not influence a book published in the same year or before, so these should not be examined). I then calculated the mean distance of the remaining books from the base book. I then kept only those books that were less then 3/4 of a standard deviation below the mean (not one whole standard deviation as suggested in my text). For Pride and Prejudice, this formula meant that I retained the 26 most similar books. For the larger corpus, this is how I got from 5,660,695 edges down to 167,770.

For this blog post, I recreated the entire process. The next two images (figures 2 and 3) show the same results reported in the book. The network shapes look slightly different and the orientations are slightly different, but there is still clear evidence of a chronological signal (figure 2) and there is still a clear differentiation between books authored by males and books authored by females (figure 3).

Figures 4 and 5, below, show the same chronological and gender sorting, but now using 1 million edges instead of the original 167,770.

One might wonder if what’s being graphed here is obvious? After all wouldn’t we expect topics to be time sensitive, faddish, and wouldn’t we expect style to be likewise? Well, I suppose expectations are a matter of personal opinion.

What my data show are that some topics appear and disappear over time (e.g. vampires) in what seem to be faddish ways, others seem to appear with regularity and even predictability (love), and some are just downright odd, appearing and disappearing in no recognizable pattern (animals). Such is also the case with the word frequencies that we often speak of as a proxy for “style.” In the 19th century, for example, use of the word “like” in English fiction was fairly consistent and flat compared to other frequent words that fluctuate more from year to year or decade to decade: e.g. “of” and “it”.

So, I don’t think it is a foregone conclusion that novels published in a particular time period are necessarily similar. It is possible that a particularly popular topic might catch on or that a powerful writer’s style might get imitated. It is equally plausible that in a race to “make it new” writers would intentionally avoid working with popular topics or imitating a typical style.

And when it comes to author gender/sex, I don’t think it is obvious that male writers will write like other males and females like other females. The data reveal that even while the majority (roughly 80%) in each class write more like members of their class, many women (~20%) write more like men and many men (~20%) write more like women. Which is to say, there are central tendencies and there are outliers. When it comes to author gender, study after study indicate that the central tendency is about 80% of writers. Looking at how these distributions evolve over time, seems to me a especially interesting place for ongoing research.

But what we are ultimately dealing with here, in these graphs, are the central tendencies. I continue to believe, as I have argued in Macroanalysis and in The Bestseller Code, that it is only through an understanding of the central tendencies that we can begin to understand and appreciate what it means to be an outlier.

Jan 12 2017

Academic, R-Code, Text-Mining

Resurrecting a Low Pass Filter (well, kind of)

On April 6th, 2015, I posted Requiem for a low pass filter acknowledging that the smoothing filter as I had implemented it in the beta version of Syuzhet was not performing satisfactorily. Ben Schmidt had demonstrated that the filter was artificially distorting the edges of the plots, and prior to Ben’s post, Annie Swafford had argued that the method was producing an unacceptable “ringing” artifact. Within days of posting the “requiem,” I began hearing from people in the signal processing community offering solutions and suggesting I might have given up on the low pass filter too soon.

One good suggestion came from Tommy McGuire (via the Syuzhet GitHub page). Tommy added a “padding factor” argument to the get_transformed_values function in order deal with the periodicity artifacts at the beginnings and ends of a transfomred signal. McGuire’s changes addressed some of the issues and were rolled into the next version of the package.

A second important change was the addition of a function similar to the get_transformed_values function but using a discrete cosine transformation (see get_dct_transform) instead of the FFT. The idea for using DCT was offered by Bradley Riddle, a signal processing engineer who works on time series analysis software for in-air acoustic, SONAR, RADAR and speech data. DCT appears to have satisfactorily solved the problem of periodicity artifacts, but users can judge for themselves (see discussion of simple_plot below).

In the latest release (April 28, 2016), I kept the original get_transformed_values as modified by Tommy McGuire and also added in the new get_dct_transform. The DCT is much better behaved at the edges, and it requires less tweaking (i.e. no padding). Figure 1 shows a plot of Madame Bovary (the text Ben had used in his example of the edge artifacts) with the original plot line (without Tommy McGuire’s update) produced by get_transformed_values (in blue) and a new plot line (in red) produced by the get_dct_transform. The red line is a more accurate representation of the (tragic) plot as we know it.

As in the past, I have graphed several dozen novels that I (and my students and colleagues) know well in order to validate that the shapes being produced by the DCT method are accurate representations of the shifting emotions in the novels. I have also worked with a handful of creative writing colleagues here at UNL and elsewhere, graphing their novels and getting feedback from them about whether the shapes match their sense of their own books. In every case, the response has been “yes.” (Though that does not guarantee you won’t find an exception–please let me know if you do!)
bovary_plot

Figure 1

Like the original get_transformed_values, the new get_dct_transform implements a low-pass filter to handle the smoothing. For those following the larger discussion, note that there is nothing unusual or strange about using low-pass filters for smoothing data. Indeed, the well known moving average is an example of a low-pass filter and a simple Google search will turn up countless articles about smoothing data with FFT and DCT. The trick with any smoother is determining how much smoothing you want to do. With a moving average, the witchcraft comes in setting the size of the moving widow to determine how much noise to remove. With the get_dct_transform it is about setting the number of low frequency components to retain. In any such smoothing you have to accept/assume that the important (desired) information is contained in the lower frequency variation and not in the higher frequency noise.

To help users visualize how two very common filters smooth data in comparison to the get_dct_transform, I added a function called “simple_plot.” With simple_plot it is easy to see how the three different smoothing methods represent the data. Figure 2 shows the output of calling simple_plot for Madame Bovary using the function’s default values. The top panel shows three lines produced by: 1) a Loess smoother, 2) a rolling mean, and 3) the get_dct_transform. (Note that with a rolling mean, you lose data at both beginning and the end of the series.) The bottom image shows a flatter DCT line (i.e. produced by retaining fewer low frequency components and, therefore, having less noise). The bottom image also uses the reverse transform process as a way to normalize the x-axis to 100 units. (In the new release, there is now another function, rescale_x_2, that can be used as an alternative way to normalize both the x and y axis.)

simple_plot_bovary

Figure 2

The other change in the latest release is the addition of a custom sentiment dictionary compiled with help from the students in my lab. I have documented the creation and testing of that dictionary in two previous blog posts: “That Sentimental Feeling” (12/20/2015) and “More Syuzhet Validation” (August 11, 2016). In these posts, human coded sentiment data is compared to machine derived data in eleven well known novels. We still have more work to do in terms of tweaking and validating the dictionary, but so far it is performing as well as the other dictionaries and in some case better.

Also worth mention here is yet another smoothing method suggested by Jianbo Gao, who has developed an innovative adaptive approach to smoothing time series data. Jianbo and I met at the Institute for Applied Mathematics last summer and, with John Laudun and Timothy Tangherlini, we wrote a paper titled “A Multiscale Theory for the Dynamical Evolution of Sentiment in Novels” that was delivered at the Conference on Behavioral, Economic, and Socio-Cultural Computing last November. I have not found the time to implement this adaptive smoother into the Syuzhet package, but it is on the todo list.

Also on the todo list for a future release is adding the ability to work with languages other than English. Thanks to Denis Roussel, GitHub contributor “denrou”, this work is now progressing nicely.

Over the past few years, a number of people have contributed to this work, either directly or indirectly. I think we are making good progress, and I want to acknowledge the following people in particular: Aaron Dominguez, Andrew Piper, Annie Swafford, Ben Schmidt, Bradley Riddle, Chris Stubben, David Bamman, Denis Roussel, Drue Marr, Ellie Wilke, Faith Aberle, Felix Peckitt, Gabi Kirilloff, Jianbo Gao, Julius Fredrick, Lincoln Mullen, Marti Hearst, Michael Hoffman, Natalie Mackley, Nissanka Wickremasinghe, Oliver Keyes, Peter Organisciak, Roz Thalken, Sarah Cohen, Scott Enderle, Tasha Saathoff, Ted Underwood, Timothy Schaffert, Tommy McGuire, and Walter Jacob. (If I forgot you, I’m sorry, please let me know).

Dec 20 2015

Academic, Text-Mining

That Sentimental Feeling

Eight months ago I began a series of blog posts about my experiments using sentiment analysis as a proxy for plot movement. At the time, I had done a fair bit of anecdotal analysis of how well the sentiments detected by a machine matched my own sense of the sentiments in a series of familiar novels. In addition to the anecdotal spot-checking, I had also hand-coded every sentence of James Joyce’s novel Portrait of the Artist as a Young Man and then compared the various machine methods to my own human coded values. The similarities (seen in figure 1) were striking.

Figure 1

Soon after my first post about this work, David Bamann hired five Mechanical Turks to code the sentiment in each scene of Shakespeare’s Romeo and Juliet. David posted his results online and then Ted Underwood compared the trajectory produced by David’s turks to the machine values produced by the Syuzhet R package I had developed. Even though David’s Turks had coded scenes and Syuzhet had coded sentences, the human and machine trajectories that resulted were very similar. Figure 2 shows the two graphs, first from David’s blog and then from Ted’s.

Figure 2

Before releasing the package, I was fairly confident that the machine was doing a good job of approximating what human beings would think. I hoped that others, like David and Ted, would provide further validation. Many folks posted results online and many more emailed me saying the tool was producing trajectories that matched their sense of the novels they applied it to, but no one conducted anything beyond anecdotal spot checking. After returning to UNL in late August (I had been on leave for a year), I hired four students to code the sentiment of every sentence in six contemporary novels: All the Light We Cannot See by Anthony Doerr, The Da Vinci Code by Dan Brown, Gone Girl by Gillian Flynn, The Secret Life of Bees by Sue Monk Kidd, The Lovely Bones by Alice Sebold, and The Notebook by Nicholas Sparks.

These novels were selected to cover several major contemporary genres. They span a period from 2003 to 2014. None are experimental in the way that Portrait of the Artist is, but they do cover a range of styles between what we might call “low-brow” to “high-brow.” Each sentence of each novel was sentiment coded by three human raters. The precise details of this study, including statistics about inter-rater agreement and machine-to-human agreement, are part of a larger analysis I am conducting with Aaron Dominguez.

What follows are six graphs showing moving averages of the human coded sentiment along side moving averages from two of the sentiment detection methods implemented in the Syuzhet R package. The similarity of the shapes derived from the the human and machine data is quite striking.

light code girl bees bones notebook

Apr 28 2015

Academic, Text-Mining

Cumulative Sentiments

This morning Andrew N. Jackson posted an interesting alternative to the smoothing of sentiment trajectories. Instead of smoothing the trajectories with a moving average, lowess, or, dare I say it, low-pass filter, Andrew suggests cumulative summing as a “simple but potentially powerful way of re-plotting” the sentiment data. I spent a little time exploring and thinking about his approach this morning, and I’m posting below a series of “plot plots” from five novels.[1]

I’m not at all sure about how we could/should/would go about interpreting these cumulative sum graphs, but the lack of information loss is certainly appealing. Looking at these graphs requires something of a mind shift away from the way that I/we have been thinking about emotional trajectories in narrative. Making that shift requires reframing plot movement as an aggregation of emotional valence over time, a reframing that seems to be modeling something like the “cumulative effect on the reader” as Andrew writes, or perhaps it’s the cumulative effect on the characters? Whatever the case, it’s a fascinating idea that while not fully in line with Vonnegut’s conception of plot shape does have some resonance with Vonnegut’s notion of relativity. The cumulative shapes seen below in Portrait and Gone Girl are especially intriguing . . . to me.

portrait_sum

bovary_sum

[1] All of these plots use sentiment values extracted with the AFinn method, which is what Andrew implemented in Python. Andrew’s iPython notebook, by the way, is worth a close read; it provides a lot of detail that is not in his blog post, including some deeper thinking around the entire business of modeling narrative in this way.

Apr 01 2015

Academic, Commentary, R-Code, Text-Mining

My Sentiments (Exactly?)

While developing the Syuzhet package–a tool for tracking relative shifts in narrative sentiment–I spent a fair amount of time gut-checking whether the sentiment values returned by the machine methods were a good match for my own sense of the narrative sentiment. Between 70% and 80% of the time, they were what I considered to be good sentence level matches. . . but sentences were not my primary unit of interest.

Rather, I wanted a way to assess whether the story shapes that the tool produced by tracking changes in sentiment were a good approximation of central shifts in the “emotional trajectory” of a narrative. This emotional trajectory was something that Kurt Vonnegut had described in a lecture about the simple shapes of stories. On a chalkboard, Vonnegut graphed stories of good fortune and ill fortune in a demonstration that he calls “an exercise in relativity.” He was not interested in the precise high and lows in a given book, but instead with the highs and lows of the book relative to each other.

Blood Meridian and The Devil Wears Prada are two very different books. The former is way, way more negative. What Vonnegut was interested in understanding was not whether McCarthy’s book was more wholly negative than Weisberger’s, he was interested in understanding the internal dynamics of shifting sentiment: where in a book would we find the lowest low relative to the highest high. Implied in Vonnegut’s lecture was the idea that this tracking of relative high and lows could serve as a proxy for something like “plot structure” or “syuzhet.”

This was an interesting idea, and sentiment analysis offered a possible way forward. Unfortunately, the best work in sentiment analysis has been in very different domains. Could sentiment analysis tools and dictionaries that were designed to assess sentiment in movie reviews also detect subtle shifts in the language of prose fiction? Could these methods handle irony, metaphor, and so forth? Some people, especially if they looked only at the results of a few sentences, might reject the whole idea out of hand. Movie reviews and fiction, hogwash! Instead of rejecting the idea, I sat down and human coded the sentiment of every sentence in Joyce’s Portrait of the Artist. I then developed Syuzhet so that I could apply and compare four different sentiment detection techniques to my own human codings.

This human coding business is nuanced. Some sentences are tricky. But it’s not the sarcasm or the irony or the metaphor that is tricky. The really hard sentences are the ones that are equal parts positive and negative sentiment. Consider this contrived example:

“I hated the way he looked at me that morning, and I was glad that he had become my friend.”

Is that a positive or negative sentence? Given the coordinating “and” perhaps the second half is more important than the first part? I coded sentences such as this as neutral, and thankfully these were the outliers and not the norm. Most of the time–even in a complex novel like Portrait where the style and complexity of the sentences are both evolving with the maturation of the protagonist–it was fairly easy to make a determination of positive, negative, or neutral.

It turns out that when you do this sort of close reading you learn a lot about the way that authors write/express/manipulate “sentiment.” One thing I learned was that tricky sentences, such as the one above, are usually surrounded by other sentences that are less tricky. In fact, in many random passages that I examined from other books, and in the entirety of Portrait, tricky sentences were usually followed or preceded by other simple sentences that would clarify the sentiment of the larger passage. This is an important observation because at the level of an individual sentence, we know that the various machine methods are not super effective.[1] That said, I was pretty surprised by the amount of sentence level agreement in my ad hoc test. On a sentence by sentence basis, here is how the four methods in the package performed:[2]

Bing 84% agreement
Afinn 80% agreement
Stanford 60% agreement
NRC 50% agreement

These results surprised me. I was shocked that the more awesome Stanford method did not outperform the others. I was so shocked, in fact, that I figured I must have done something wrong. The Stanford sentiment tagger, for example, thinks that the following sentence from Joyces Portrait is negative.

“Once upon a time and a very good time it was there was a moocow coming down along the road and this moocow that was coming down along the road met a nicens little boy named baby tuckoo.”

It was a “very good time.” How could that be negative? I think “a very good time” is positive and so do the other methods. The Stanford tagger also indicated that the sentence “He sang that song” is slightly negative. All of the other methods scored it as neutral, and so did I.

I’m a huge fan of the Stanford tagger; I’ve been impressed by the way that it handles negation, but perhaps when all is said and done it is simply not well-suited to literary prose where the syntactical constructions can be far more complicated than typical utilitarian prose? I need more time to study how the Stanford tagger behaved on this problem, so I’m just going to exclude it from the rest of this report. My hypothesis, however, is that it is far more sensitive to register/genre than the dictionary based methods.

So, as I was saying, what happens with sentiment in actual prose fiction is usually achieved over a series of sentences. That simile, that bit of irony, that negated sentence is typically followed and/or preceded by a series of more direct sentences expressing the sentiment of the passage. For example,

“She was not ugly. She was exceedingly beautiful.”
“I watched him with disgust. He ate like a pig.”

Prose, at least the prose that I studied in this experiment, is rarely composed of sustained irony, sustained negation, sustained metaphor, etc. Usually authors provide us with lots of clues about the sentiment we are meant to experience, and over the course of several sentences, a paragraph, or a page, the sentiment tends to become less ambiguous.

So instead of just testing the machine methods against my human sentiments on a sentence by sentence basis, I split Joyce’s portrait into 20 equally sized chunks, and calculated the mean sentiment of each. I then compared those means to the means of my own human coded sentiments. These were the results:

Bing 80% agreement
Afinn 85% agreement
NRC 90% agreement

Not bad. But of course any time we apply a chunking method like this we risk breaking the text right in the middle of a key passage. And, as we increase the number of chunks and effectively decrease the size of each passage, the values tend to decrease. I ran the same test using 100 segments and saw this:

Bing 73% agreement
Afinn 77% agreement
NRC 58% agreement (ouch)

Figure 1 graphs how the AFinn method (with 77% agreement over 100 segments) tracked the sentiment compared to my human sentiments.

Figure 1

Next I transformed all of the sentiment vectors (machine and human) using the get_transformed_values function. I then calculated the amount of agreement. With the low pass filter set to the default of 3, I observed the following agreement:

Bing 73% agreement
Afinn 74% agreement
NRC 86% agreement

With the low pass filter set to 5, I observed the following agreement:

Bing 87% agreement
Afinn 93% agreement
NRC 90% agreement

Figure 2 graphs how the transformed AFinn method tracked narrative changes in sentiment compared to my human sentiments.[3]

Figure 2

As I have said elsewhere, my primary reason for open-sourcing this code was so that others could plot some narratives of their own and see if the shapes track well with their human sense of the emotional trajectories. If you do that, and you have successes or failure, I’d be very interested in hearing from you (please send me an email).

Given all of the above, I suppose my current working benchmark for human to machine accuracy is something like ~80%. Frankly, though, I’m more interested in the big picture and whether or not the overall shapes produced by this method map well onto our human sense of a book’s emotional trajectory. They certainly do seem to map well with my sense of Portrait of the Artist, and with many other books in my library, but what about your favorite novel?

FOOTNOTES:
[1] For what it is worth, the same can probably be said about us, the human beings. Given a single sentence with no context, we could probably argue about its positive or negativeness.
[2] Each method uses a slightly different value range, so when I write of “agreement,” I mean only that the machine method agreed with the human (me) that a given sentence was positively or negatively charged. My rating scale consisted of three values: 1, 0, -1 (positive, neutral, negative). I did not test the extent of the positiveness or the degree of negativeness.
[3] I explored low-pass values in increments of 5 all the way to 100. The percentages of agreement were consistently between 70 and 90.

Mar 24 2015

Academic, Commentary, R-Code, Text-Mining

A Ringing Endorsement of Smoothing

On March 7, Annie Swafford posted an interesting critique of the transformation method implemented in Syuzhet. Her basic argument is that setting the low-pass filter too low may result in misleading ringing artifacts.[1] This post takes up the issue of ringing artifacts more directly and explains how Annie’s clever method of neutralizing values actually demonstrates just how effective the Syuzhet tool is in doing what it was designed to do! But lest we begin chasing any red herring, let me be very clear about the objectives of the software.

The tool is meant to reveal the simple (and latent) shape of stories, not the complex shape of stories, not the perfect shape of stories, not the absolute shape of stories, just the simple foundational shapes.[2] This was the challenge that Vonnegut put forth when he said “There is no reason why the simple shape of stories cannot be fed into computers.”
The tool uses sentiment, as detected by four possible methods, as a proxy for “plot.” This is in keeping with Vonnegut’s conception of “plot” as a movement between what he called “good fortune” and “ill fortune.” The gamble Syuzhet makes is that the sentiment detection methods are both “good enough” and also may serve as a satisfying proxy for the “good” and “ill” fortune Vonnegut describes in his essay and lecture.
Despite some complex mathematics, there is an interpretive dimension to this work. I suppose this is why folks call it “digital humanities” instead of physics. Syuzhet was designed to estimate and smooth the emotional highs and lows of a narrative; it was not designed to provide a perfect mapping of emotional valence. I don’t think such perfect mapping is computationally possible; if you want/need that kind of experience, then you need to read the book (some of ’em are even worth it). I’m interested in detecting/revealing the simple shape of stories by approximating the fundamental highs and lows of emotional valence. I believe that this is what Vonnegut had in mind.
Finally, when examining the shapes produced by graphing the Syuzhet values, we must remember what Vonnegut said: “This is an exercise in relativity, really. It is the shape of the curves that matters and not their origins.” When Vonnegut speaks of the shapes, he speaks of them as “simple” shapes.

In her critique of the software, Annie expresses concern over the potential for ringing artifacts when using a Fourier transformation and a very low, low-pass filter. She introduces an innovative method for detecting this possible ringing. To demonstrate the efficacy of her method, she “neutralizes” one third of the sentiment values in Joyce’s Portrait of the Artist as a Young Man and then retransforms and graphs the new neutralized shape against the original foundation shape of the story.

Annie posits that if the Syuzhet tool is working as she thinks it should, then the last third of the foundational shape should change in reaction to this neutralization. In Annie’s example, however, no significant change is observed, and she concludes that this must be due to a ringing artifact. Figure 1 (below) is the evidence she presents on her blog.

: Figure 1: last third neutralized

For what it is worth, we do see some minor differences between the blue and the orange lines, but really, these look like the same “Man in Hole” plot shapes. Ouch, this does look like a bad ringing artifact. But could there be another explanation?

There may, indeed, be some ringing here, but it’s not nearly so extreme as Figure 1 suggests. An alternative conclusion is that the similarity we observe in the two lines is due to a similarity between the actual values and the neutralized values. As it happens, the last third of the novel is already pretty neutral compared to the rest of the novel. In fact, the mean valence for the entire last third of the novel is -0.05. So all we have really achieved in this test is to replace a section of relatively neutral valence with another segment of totally neutral valence.

This is not, therefore, a very good book in which to test for the presence of a ringing artifacts using this particular method of neutralization. What we see here is a case of the right result but the wrong conclusion. Which is not to say that there is not some ringing present; I’ll get to that in a moment. But first another experiment.

If, instead of resetting those values to zero, we set them to 3 (making Portrait end on a very happy note indeed), we get a much different shape (blue line in figure 3). The earlier parts of the novel are now represented as comparatively less negative and the end of the novel is mucho positive.

Figure 3: Portrait with artificial positive ending

And, naturally, we can also set those values very negative and produce the graph seen in figure 4. Oh, poor Stephen.

Figure 4: Portrait with artificial negative ending

“But wait, Jockers, you can’t deny that there is still an artificial “hump” there at the end of figure 3 and an artificial trough at the end of figure 4.” Nope, I won’t deny it, there really can be ringing artifacts. Let’s see if we can find some that actually matter . . .

First let’s test the beginning of the novel using the Swafford method. We neutralize the beginning third of the novel and graph it against the original shape (figure 5). Hmm, again it seems that the foundation shape is pretty close to the original. Is this a ringing artifact?

Figure 5: first third neutralized

Could be, but in this case it is probably just another false ringer. Guess what, the beginning of Joyce’s novel is also comparatively neutral. This is why the Swafford method results in something similar when we neutralize the first third of the book. Do note that the first third is a little bit less neutral than the last third. This is why we see a slightly larger difference between the blue and orange lines in figure 5 compared to figure 1.

But what about the middle section?

If we set the middle third of the novel to neutral, what we get is a very different shape (and a very different novel)! Figure 6 uses the Swafford method to remove the central crisis of the novel. This is no longer a “man in hole” story, and the resulting shape is precisely what we would expect. Make no mistake, that hump of happiness is not a ringing artifact. That hump in the middle is now the most sustained non-negative moment in the book. We have replaced hell with limbo (not heaven because these are neutral values), and in comparison to the other parts of the book, limbo looks pretty good! Keep in mind Vonnegut’s message from #4 above: “This is an exercise in relativity.” Also keep in mind that there is some scaling going on over the y-axis; in other words, we should not get too hung up on the precise position on the y-axis at the expense of seeing the simple shape.

In the new graph, the deepest trough has now shifted to the early part of the novel, which is now the location of the greatest negative valence in the story (it’s the section where Stephen gets sick and is then beaten by father Dolan). The end of the book now looks relatively darker since we no longer have the depths of hell from the midsection for comparison, but the end third of Portrait is definitely not as negative as the beginning third and this is reflected nicely in figure 6. (This more positive ending is also evident, by the way, in the original shape–orange line–where the hump in the last third is slightly higher than the early hump.)

Figure 6: Portrait with Swaffordized Middle

So, the Swafford method proves to be a very useful tool for testing and confirming our expectations. If we remove the most negative section of the novel, then we should see the nadir of the simple shape shift to the next most negative section. That is precisely what we see. I have tested this over a series of other novels, and the effect is the same (see figure 9 below, for example). This is a great method for validating that the tool is working as expected. Thanks Annie Swafford!

“But wait a second Jockers, what about those rascally ringing artifacts you promised.”

Yes, yes, there can indeed be ringing artifacts. Let’s go get some. . . .

Annie follows her previous analysis with what seems like an even more extreme example. She neutralizes everything in Joyce’s Portrait except for the middle 20 sentences of the novel.[3] When the resulting graph looks a lot like the original man-in-hole graph, she says, in essence: “Busted! there is your ringing artifact Dr. J!” Figure 7 is the graphic from her blog.

Figure 7: Only 20 (sic) sentences of Portrait

Busted indeed! Those positive valence humps, peaking at 25 and 75 on the x-axis are dead ringers for ringers. We know from constructing the experiment in this manner, that everything from 0 to ~49 and everything from ~51 to 100 on the x-axis is perfectly neutral, and yet the tool, the visualization, is revealing two positive humps before and after the middle section: horrible, happy, phantom humps that do not exist in the story!

But wait. . .

With all smoothing methods some degree of imprecision is to be expected. Remember what Vonnegut points out: this is “an exercise in relativity.” Relatively speaking, even the extreme example in figure 7 is, in my opinion, not too bad. Just imagine a hypothetical protagonist cruising along in a hypothetical novel such as the one Annie has written with her neutral values. This protagonist is feeling pretty good about all that neutrality; she ain’t feeling great, but she’s better than bad. Then she hits that negative section . . . as Vonnegut would say, “oh, God damn it.”[4] But then things get better, or more precisely, things get comparatively better. So, the blue line is not a great representation of the narrative, but it’s not a bad approximation either.

But look, I understand my colleague’s concern for more precision, and I don’t want it to appear that I’m taking this ringing business too lightly. Figure 8 (below) was produced using precisely the same data that Annie used in her two-sentence example; everything is neutralized except for those two sentences from the exact middle of the novel. This time, however, I have used a low pass filter set at 100. Voila! The new shape (blue) is nothing at all like the original (orange), and the new shape also provides the deep level of detail–and lack of ringing–that some users may desire.[5] Unfortunately, using such a high, low-pass filter does not usually produce easily interpretable graphs such as seen in figure 8.

Figure 8: Original shape with neutralized “Swafford Shape” using 100 components

In this very simple example, turning the low-pass filter up to 100 produces a graph that is very easy to read/interpret. When we begin looking at real novels, however, a low-pass of 100 does not result in shapes that are very easy to visually interpret, and it becomes necessary to smooth them. I think that is what visualization is all about, which is to say, simplifying the complex so that we can get the gist. One way to simplify these emotional trajectories is to use a low, low pass filter. Given that going low may cause more ringing, you need to decide just how low you can go. Another option, that I demonstrated in my previous post, is to use a high value for the low pass filter (to avoid potential ringing) and then apply a lowess smoother (or your own favorite smoother) in order to reveal the “simple shape” (see figure 1 of http://www.matthewjockers.net/2015/03/09/is-that-your-syuzhet-ringing/).

In a future post, I’ll explore something I mentioned to Annie in our email exchange (prior to her public critique): an ad hoc method I’ve been working on that seeks to identify an “ideal” number of components for the low pass filter.

Figure 9: Dorian Gray behaving exactly as we would expect with last third neutralized

FOOTNOTES:

[1] Annie does not actually explain that the low-pass filter is a user controlled parameter or that what she is actually testing is the efficacy of the default value. Users of the tool are welcome to experiment with different values for the low pass filter as I have done here: Is that your Syuzhet Ringing.

[2] I’ve been calling these simple shapes “emotional trajectories” and “plot.” Plot is potentially controversial here, so if folks would like to argue that point, I’m sympathetic. For the first year of this research, I never used the word “plot,” choosing instead “emotional trajectory” or “simple shape,” which is Vonnegut’s term. I realize plot is a loaded and nuanced word, but “emotional trajectory” and “simple shape” are just not really part of our nomenclature, so plot is my default term.

[3] ~~There is a small discrepancy between Annie’s blog and her code~~. Correction: Annie writes about and includes a graph showing the middle “20” sentences, but then provides code for retaining both the middle 2 and the middle 20 sentences. Either way the point is the same.

[4] The two negative valence sentences from the middle of Portrait are as follows: “Nay, things which are good in themselves become evil in hell. Company, elsewhere a source of comfort to the afflicted, will be there a continual torment: knowledge, so much longed for as the chief good of the intellect, will there be hated worse than ignorance: light, so much coveted by all creatures from the lord of creation down to the humblest plant in the forest, will be loathed intensely.”

[5] Annie has written that “Syuzhet computes foundation shapes by discarding all but the lowest terms of the Fourier transform.” That is a rather misleading comment. The low-pass-filter is set to 3 by default, but it is a user tunable parameter. I explained my reasons for choosing 3 as the default in my email exchange with Annie prior to her critique. It is unclear to me why Annie does not to mention my explanation, so here it is from our email exchange:

“. . . The short and perhaps unsatisfying answer is that I selected 3 based on a good deal of trial and error and several attempts to employ some standard filters that seek to identify a cutoff / threshold by examining the frequencies (ideal, butterworth, and several others that I don’t remember any more). The trouble with these, and why I selected 3 as the default, is that once you go higher than 3 the resulting plots get rather more complicated, and the goal, of course, is to do the opposite, which is to say that I seek to reduce the plot to a simple base form (along the lines of what Vonnegut is suggesting). Three isn’t magic, but it does seem to work well at rooting out the foundational shape of the story. Does it miss some of the subtitles, yep, but again, that is the point, in part. The longer answer is that is that this is something I’m still experimenting with. I have one idea that I’m working with now…”

Mar 09 2015

Academic, Commentary, R-Code, Text-Mining

Is that Your Syuzhet Ringing?

Over the weekend, Annie Swafford published another installment in her ongoing critique of Syuzhet, the R package that I released in early February. In her recent blog post, an interesting approach for testing the get_transformed_values function is proposed[1].

Previously Annie had noted how using the default values for the low-pass filter may result in too much information loss, to which I replied that that is the point. (Readers hung up on this point are advised to go back and watch the Vonnegut video again.) With any kind of smoothing, there is going to be information loss. The function is designed to allow the user to tune the low pass filter for greater or lesser degrees of noise (an important point that I shall return to in a moment).

In the new post, Annie explores the efficacy of leaving the low pass filter at its default value of 3; she demonstrates how this value appears to produce a ringing artifact. This is something that the two of us had discussed at some length in an email correspondence prior to this blogging frenzy. In that correspondence, I promised to explore adding a gaussian filter to the package, a filter she believes would be more appropriate. Based on her advice, I have explored that option, and will do so further, but for now I remain unconvinced that there is a problem for Gauss to solve.[2]

As I said in my previous post, I believe the true test of the method lies in assessing whether or not the shapes produced by the transformation are a good approximation of the shape of the story. But remember too, that the primary point of the transformation function is to solve the problem of length; it is hard to compare the plot shape of a long novel to a short one. The low-pass argument is essentially a visualization and noise reduction parameter. Users who want a closer, scene by scene or sentence by sentence representation of the sentiment data, will likely gravitate to the get_percentage_values function (and a very large number of bins) as, for example, Lincoln Mullen has done on Rpubs.[3]

The downside to that approach, of course, is that you cannot compare two sentiment arcs mathematically; you can only do so by eye. You cannot compare them mathematically because the amount of text inside each percentage segment will be quite different if the novels are of different lengths, and that would not be a fair comparison. The transformation function is my attempt at solving this time domain conundrum. While I believe that it solves the problem well, I’m certainly open to other options. If we decide that the transformation function is no good, that it produces too much ringing, etc. then we should look for a more attractive alternative. Until an alternative is found and demonstrated, I’m not going to allow the perfect to become the enemy of the good.

But, alas, here we are once again on the question of defining what is “good” and what is “good enough.” So let us turn now to that question and this matter of ringing artifacts.

The problem of ringing artifacts is well understood in the signal processing literature if a bit less so in the narratological literature:-) Annie has done a fine job of explicating the nature of this problem, and I can’t help thinking that this is a very clever idea of hers. In fact, I wrote to Annie acknowledging this and noting how I wish I had thought of it myself.

But after repeating her experiment a number of times, with greater and lesser degrees of success, I decided that this exercise is ultimately a bit of a red herring. Among other things, there are no books with zero neutral values for an entire third, but more importantly the exercise has more to do with the setting of a particular user parameter than it does with the package.

I’d like to now offer a bit of cake and eat it too. This most recent criticism has focused on the default values for the low-pass filter that I set for the function. There is, of course, nothing preventing adjustment of that parameter by those with a taste for adventure. The higher the number, the greater the number of components that are retained; the more components we retain, the less ringing and the closer we get to reproducing the original signal.

So let us assume for a moment that the sentiment detection methods all work perfectly. We know as a matter of fact that they don’t work perfectly (you know, like human beings), but this matter of imprecision is something we have already covered in a previous post where I showed that the three dictionary based methods tend to agree with each other and with the more sophisticated Stanford method. So even though we know we are not getting every sentence’s sentiment just right, let’s pretend that we are, if only for a moment.

With that assumed, let us now recall the primary rationale for the Fourier transformation: to normalize the length of the x-axis. As it happens, we can do that normalization (the cake) and also retain a great many more components than the 3 default components (eating it). Figure 1 shows Joyce’s Portrait of the Artist transformed using a low pass filter size of 100.

This produces a graph with a lot more noise, but we have effectively eliminated any objectionable ringing. With the addition of a smoothing line (lowess function in R), what we see once again (ta da) is a beautiful, if rather less dramatic, example of Vonnegut’s Man in Hole! And this is precisely the goal, to reveal the plot shape latent in the noise. The smaller low-pass filter accentuates this effect, the higher low-pass filter provides more information: both show the same essential shape.

Figure 1: Portrait with low pass at 100

Figure 2: Portrait with low pass at 3

Figure 3: Portrait with low pass at 20

In the course of this research, I have hand examined the transformed shapes for several dozen novels. The number of novels I have examined corresponds to the number that I feel I know well enough to assess (and also happen to possess in digital form). These include such old and new favorites as:

Portrait of the Artist
Picture of Dorian Grey
Ulysses
Blood Meridian
Gone Girl
Finnegans Wake (nah, just kidding)
. . .
And many more.

As I noted in my previous post, the only way to determine the efficacy of this model is to see if it approximates reality. We have now plotted Portrait of the Artist six ways to Sunday, and every time we have seen a version of the same man in hole shape. I’ve read this book 20 times, I have taught this book a dozen times. It is a man in hole plot.

In my (admittedly) anecdotal evaluations, I have continued to see convincing graphs, such as the one above (and the one below in figure 4). I have found a few special books that don’t do very well, but that is a story you will have to wait for (spoiler alert, they are not works of satire or dark humor, but they are multi-plot novels involving parallel stories).

Still, I am open to the possibility that there is some confirmation bias possible here. And this is why I wanted to release the package in the first place. I had hoped that putting the code on gitHub would entice others toward innovation within the code, but the unexpected criticism has certainly been healthy too, and this conversation has certainly made me think of ways that the functions could be improved.

In retrospect, it may have been better to wait until the full paper was complete before distributing the code. Most of the things we have covered in the last few weeks on this blog are things that get discussed in finer detail in the paper. Despite more details to come, I believe, as Dryden might say, that the last (plot) line is now sufficiently explicated.

Bonus Images:

Figure 4

In terms of basic shape, Figure 4 is remarkably similar to the more dramatized version seen in figure 5 below. If you can’t see it, you aren’t reading enough Vonnegut.

Figure 5

[1] How’s that for some awkward passive voice? A few on Twitter have expressed some thoughts on my use of Annie’s first name in my earlier response. Regular readers of this blog will know that I am consistent in referring to people by their full names upon first mention and by their first names thereafter. Previous victims of my “house style” have included David Mimno, David; Dana Mackenzie, Dana; Ben Schmidt, Ben; Franco Moretti, Franco, and Julia Flanders, Julia. There are probably others.

[2] Anyone losing sleep over this gaussian filter business is welcome to grab the code and give it a whirl.

[3] In the essay I am writing about this work, I address a number of the nuances that I have skipped over in these blog posts. One of the nuances I discuss is an automated process for the selection of a low-pass filter size.

Mar 04 2015

Academic, Commentary, R-Code, Text-Mining

Some thoughts on Annie’s thoughts . . . about Syuzhet

Annie Swafford has raised a couple of interesting points about how the syuzhet package works to estimate the emotional trajectory in a novel, a trajectory which I have suggested serves as a handy proxy for plot (in the spirit of Kurt Vonnegut).

Annie expresses some concern about the level of precision the tool provides and suggest that dictionary based methods (such as the three I include as options in syuzhet) are not reliable. She writes “Sentiment analysis based solely on word-by-word lexicon lookups is really not state-of-the-art at all.” That’s fair, I suppose, but those three lexicons are benchmarks of some importance, and they deserve to be included in the package if for no other reason than for comparison. Frankly, I don’t think any of the current sentiment detection methods are especially reliable. The Stanford tagger has a reputation for being the main contender for the title of “best in the open source market,” but even it hovers around 80 – 83% accuracy. My own tests have shown that performance depends a good deal on genre/register.

But Annie seems especially concerned about the three dictionary methods in the package. She writes “sentiment analysis as it is implemented in the syuzhet package does not correctly identify the sentiment of sentences.” Given that sentiment is a subtle and nuanced thing, I’m not sure that “correct” is the right word here. I’m not convinced there is a “correct” answer when it comes to this question of valence. I do agree, however, that some answers are more or less correct than others and that to be useful we need to be on the closer side. The question to address, then, is whether we are close enough, and that’s a hard one. We would probably find a good deal of human agreement when it comes to the extremes of sentiment, but there are a lot of tricky cases, grey areas where I’m not sure we would all agree. We certainly cannot expect the tool to perform better than a person, so we need some flexibility in our definition of “correct.”

Take, for example, the sentence “I studied at Leland Stanford Junior University.” The state-of-the-art Stanford sentiment parser scores this sentence as “negative.” I think that is incorrect (you are welcome to disagree;-). The “bing” method, that I have implemented as the default in syuzhet, scores this sentence as neutral, as does the “afinn” method (also in the package). The NRC method scores it as slightly positive. So, which one is correct? We could go all Derrida on this sentence and deconstruct each word, unpack what “junior” really means. We could probably even “problematize” it! . . . But let’s not.

What Annie writes about dictionary based methods not being the most state-of-the-art is true from a technical standpoint but sophisticated methods and complexity do not necessarily correlate with results. Annie suggest that “getting the Stanford package to work consistently would go a long way towards addressing some of these issues,” but as we saw with the sentence above, simple beat sophisticated, hands down[1].

Consider another sentence: “Syuzhet is not beautiful.” All four methods score this sentence as positive, even the Stanford tool, which tends to do a better job with negation, says “positive.”

It is easy to find opposite cases where sophisticated wins the day. Consider this more complex sentence: “He was not the sort of man that one would describe as especially handsome.” Both NRC and Afinn score this sentence as neutral, Bing scores it slightly positive and Stanford scores it slightly negative. When it comes to negation, the Stanford tool tends to perform a bit better, but not always. The very similar sentence “She was not the sort of woman that one would describe as beautiful” is scored slightly positive by all four methods.

What I have found in my testing is that these four methods usually agree with each other, not exactly but close enough. Because the Stanford parser is very computationally expensive and requires special installation, I focused the examples in the Syuzhet Package Vignette on the three dictionary based methods. All three are lightning fast by comparison, and all three have the benefit of simplicity.

But, are they good enough compared to the more sophisticated Stanford parser?

Below are two graphics showing how the methods stack up over a longer piece of text. The first image shows sentiment using percentage based segmentation as implemented in the get_percentage_values() function.

Four Methods Compared using Percentage Segmentation

The three dictionary methods appear to be a bit closer, but all four methods do create the same basic shape. The next image shows the same data after normalization using the get_transformed_values() function. Here the similarity is even more pronounced.

Four Methods Compared Using Transformed Values

While we could legitimately argue about the accuracy of one sentence here or one sentence there, as Annie has done, that is not the point. The point is to reveal a latent emotional trajectory that represents the general sense of the novel’s plot. In this example, all four methods make it pretty clear what that shape is: This is what Vonnegut called “Man in Hole.”

The sentence level precision that Annie wants is probably not possible, at least not right now. While I am sympathetic to the position, I would argue that for this particular use case, it really does not matter. The tool simply has to be good enough, not perfect. If the overall shape mirrors our sense of the novel’s plot, then the tool is working, and this is the area where I think there is still a lot of validation work to do. Part of the impetus for releasing the package was to allow other people to experiment and report results. I’ve looked at a lot of graphs, but there is a limit to the number of books that I know well enough to be able to make an objective comparison between the Syuzhet graph and my close reading of the book.

This is another place where Annie raises some red flags. Annie calls attention to these two images (below) from my earlier post and complains that the transformed graph is not a good representation of the noisy raw data. She writes:

The full trajectory opens with a largely flat stretch and a strong negative spike around x=1100 that then rises back to be neutral by about x=1500. The foundation shape, on the other hand, opens with a rise, and in fact peaks in positivity right around where the original signal peaks in negativity. In other words, the foundation shape for the first part of the book is not merely inaccurate, but in fact exactly opposite the actual shape of the original graph.

Annie’s reading of the graphs, though, is inconsistent with the overall plot of the novel, whereas the transformed plot is perfectly consistent with the novel. What Annie calls a “strong negative spike” is the scene in which Stephen is pandied by Father Arnell. It is an important negative moment, to be sure, but not nearly as important, or as negative, as the major dip that occurs midway through the novel–when Stephen experiences Hell. The scene with Arnell is a minor blip compared to the pages and pages of hell and the pages and pages of anguish Stephen experiences before his confession.

Annie is absolutely correct in noting that there is information loss, but wrong in arguing that the graph fails to represent the novel. The tool has done what it was designed to do: it successfully reveals the overall shape of the narrative. The first third of the novel and the last third of the novel are considerably more positive than the middle section. But this is not meant to say or imply that the beginning and end are without negative moments.

It is perfectly reasonable to want to see more of the page to page, or scene by scene fluctuations in sentiment, and that can be easily achieved by using the percentage segmentation method or by altering the low-pass filter size. Changing the filter size to retain five components instead of three results in the graph below. This new graph captures that “strong negative spike” (not so “strong” compared to hell) and reveals more of the novel’s ups and downs. This graph also provides more detail about the end of the novel where Stephen comes down off his bird-girl high and moves toward a more sober perspective for his future.

Portrait with Five Components

Of course, the other reason for releasing the code is so that I can get suggestions for improvements. Annie (and a few others) have already propelled me to tweak several functions. Annie found (and reported on her blog) some legitimate flaws in the openNLP sentence parser. When it comes to passages with dialog, the openNLP parser falls down on the job. I ran a few dialog tests (including Annie’s example) and was able to fix the great majority of the sentence parsing errors by simply stripping out the quotation marks in advance. Based on Annie’s feedback, I’ve added a “quote stripping” parameter to the get_sentences() function. It’s all freshly baked and updated on github.

But finally, I want to comment on Annie’s suggestion that

some texts use irony and dark humor for more extended periods than you [that’s me] suggest in that footnote—an assumption that can be tested by comparing human-annotated texts with the Syuzhet package.

I think that would be a great test, and I hope that Annie will consider working with me, or in parallel, to test it. If anyone has any human annotated novels, please send them my/our way!

Things like irony, metaphor, and dark humor are the monsters under the bed that keep me up at night. Still, I would not have released this code without doing a little bit of testing:-). These monsters can indeed wreak a bit of havoc, but usually they are all shadow and no teeth. Take the colloquial expression “That’s some bad R code, man.” This sentence is supposed to mean the opposite, as in “That is a fine bit of R coding, sir.” This is a sentence the tool is not likely to get right; but, then again, this sentence also messes up my young daughter, and it tends to confuse English language learners. I have yet to find any sustained examples of this sort of construction in typical prose fiction, and I have made a fairly careful study of the emotional outliers in my corpus.

Satire, extended satire in particular, is probably a more serious monster. Still, I would argue that the sentiment tools performs exactly as expected; they just don’t understand what they are “reading” in the way that we do. Then again, and this is no fabrication, I have had some (as in too many) college students over the years who haven’t understood what they were reading and thought that Swift was being serious about eating succulent little babies in his Modest Proposal (those kooky Irish)!

So, some human beings interpret the sentiment in Modest Proposal exactly as the sentiment parser does, which is to say, literally! (Check out the special bonus material at the bottom of this post for a graph of Modest Proposal.) I’d love to have a tool that could detect satire, irony, dark humor and the like, but such a tool is still a good ways off. In the meantime, we can take comfort in incremental progress.

Special thanks to Annie Swafford for prompting a stimulating discussion. Here is all the code necessary to repeat the experiments discussed above. . .

library(syuzhet)
path_to_a_text_file <- system.file("extdata", "portrait.txt",
package = "syuzhet")
joyces_portrait <- get_text_as_string(path_to_a_text_file)
poa_v <- get_sentences(joyces_portrait)

# Get the four sentiment vectors
stanford_sent <- get_sentiment(poa_v, method="stanford", "/Applications/stanford-corenlp-full-2014-01-04")
bing_sent <- get_sentiment(poa_v, method="bing")
afinn_sent <- get_sentiment(poa_v, method="afinn")
nrc_sent <- get_sentiment(poa_v, method="nrc")

######################################################
# Plot them using percentage segmentation
######################################################
plot(
  scale(get_percentage_values(stanford_sent, 10)), 
  type = "l", 
  main = "Joyce's Portrait Using All Four Methods\n and Percentage Based Segmentation", 
  xlab = "Narrative Time", 
  ylab = "Emotional Valence",
  ylim = c(-3, 3)
)
lines(
  scale(get_percentage_values(bing_sent, 10)),
  col = "red", 
  lwd = 2
)
lines(
  scale(get_percentage_values(afinn_sent, 10)),
  col = "blue", 
  lwd = 2
)
lines(
  scale(get_percentage_values(nrc_sent, 10)),
  col = "green", 
  lwd = 2
)
legend('topleft', c("Stanford", "Bing", "Afinn", "NRC"), lty=1, col=c('black', 'red', 'blue',' green'), bty='n', cex=.75)

######################################################
# Transform the Sentiments
######################################################
stan_trans <- get_transformed_values(
  stanford_sent, 
  low_pass_size = 3, 
  x_reverse_len = 100,
  scale_vals = TRUE,
  scale_range = FALSE
)
bing_trans <- get_transformed_values(
  bing_sent, 
  low_pass_size = 3, 
  x_reverse_len = 100,
  scale_vals = TRUE,
  scale_range = FALSE
)
afinn_trans <- get_transformed_values(
  afinn_sent, 
  low_pass_size = 3, 
  x_reverse_len = 100,
  scale_vals = TRUE,
  scale_range = FALSE
)

nrc_trans <- get_transformed_values(
  nrc_sent, 
  low_pass_size = 3, 
  x_reverse_len = 100,
  scale_vals = TRUE,
  scale_range = FALSE
)

######################################################
# Plot them all
######################################################
plot(
  stan_trans, 
  type = "l", 
  main = "Joyce's Portrait Using All Four Methods", 
  xlab = "Narrative Time", 
  ylab = "Emotional Valence",
  ylim = c(-2, 2)
)

lines(
  bing_trans,
  col = "red", 
  lwd = 2
)
lines(
  afinn_trans,
  col = "blue", 
  lwd = 2
)
lines(
  nrc_trans,
  col = "green", 
  lwd = 2
)
legend('topleft', c("Stanford", "Bing", "Afinn", "NRC"), lty=1, col=c('black', 'red', 'blue',' green'), bty='n', cex=.75)


######################################################
# Sentence Parsing Annie's Example
######################################################
annies_sentences_w_quotes <- '"Mrs. Rachael, I needn’t inform you who were acquainted with the late Miss Barbary’s affairs, that her means die with her and that this young lady, now her aunt is dead–" "My aunt, sir!" "It is really of no use carrying on a deception when no object is to be gained by it," said Mr. Kenge smoothly, "Aunt in fact, though not in law."'

# Strip out the quotation marks
annies_sentences_no_quotes <- gsub("\"", "", annies_sentences)

# With quotes, Not Very Good:
s_v <- get_sentences(annies_sentences_w_quotes)
s_v

# Without quotes, Better:
s_v_nq <- get_sentences(annies_sentences_no_quotes)
s_v_nq

######################################################
# Some Sentence Comparisons
######################################################
# Test one
test <- "He was not the sort of man that one would describe as especially handsome."
stanford_sent <- get_sentiment(test, method="stanford", "/Applications/stanford-corenlp-full-2014-01-04")
bing_sent <- get_sentiment(test, method="bing")
nrc_sent <- get_sentiment(test, method="nrc")
afinn_sent <- get_sentiment(test, method="afinn")
stanford_sent; bing_sent; nrc_sent; afinn_sent

# test 2
test <- "She was not the sort of woman that one would describe as beautiful."
stanford_sent <- get_sentiment(test, method="stanford", "/Applications/stanford-corenlp-full-2014-01-04")
bing_sent <- get_sentiment(test, method="bing")
nrc_sent <- get_sentiment(test, method="nrc")
afinn_sent <- get_sentiment(test, method="afinn")
stanford_sent; bing_sent; nrc_sent; afinn_sent

# test 3
test <- "That's some bad R code, man."
stanford_sent <- get_sentiment(test, method="stanford", "/Applications/stanford-corenlp-full-2014-01-04")
bing_sent <- get_sentiment(test, method="bing")
nrc_sent <- get_sentiment(test, method="nrc")
afinn_sent <- get_sentiment(test, method="afinn")
stanford_sent; bing_sent; nrc_sent; afinn_sent

SPECIAL BONUS MATERIAL

Swift’s classic satire presents some sentiment challenges. There is disagreement between the Stanford method and the other three in segment four where the sentiments move in opposite directions.

FOOTNOTE

[1] By the way, I’m not sure if Annie was suggesting that the Stanford parser was not working because she could not get it to work (the NAs) or because there was something wrong in the syuzhet package code. The code, as written, works just fine on the two machines I have available for testing. I’d appreciate hearing from others who are having problems; my implementation definitely qualifies as a first class hack.

Feb 25 2015

Academic, Commentary, Text-Mining

The Rest of the Story

My blog on February 2, about the Syuzhet package I developed for R (now available on CRAN), generated some nice press that I was not expecting: Motherboard, then The Paris Review, and several R blogs (Revolutions, R-Bloggers, inside-R) all featured the work. The press was nice, but I was not at all prepared for the focus to be placed on the one piece of the story that I had yet to explain, namely, how I used the Syuzhet code and some unsupervised machine clustering to identify what seem to be six, or possibly seven, archetypal plot shapes. So, here now is the rest of the story. . .

In brief: (A Plot Modeling Recipe)

Apply functions available in the Syuzhet package, to generate a generalized a plot shape for every book in a corpus of 41,383 novels.[1]
Employ euclidean distance to build a large distance matrix by computing the similarity between every pair of novels.
Use unsupervised hierarchical clustering to group books based on the similarity of their plot shape.
Examine the resulting clusters with furrowed brow and say “hmmmm.”
Test several methods of cluster identification (silhouette, gap statistic, elbow).
Develop ad-hoc cluster identification algorithm.
Observe that there are six, or maybe seven, fundamental plot shapes.
Repeat everything over and over again for 12 months while worrying a lot about observing six or seven plots.

Caveats:

Before I reveal the six/seven plots (scroll down if you can’t wait), it’s important to point out that what I offer here is the result of two particular methods of analysis. If you don’t like the plot shapes that these methods reveal, then you’ll be free to take issue with the methods and try a different approach. You could, for example,

Read 41,383 novels and sketch the plots of each using Vonnegut’s chalkboard. You could then spend a few decades organizing and classifying them into some sort of taxonomy. You could then work on clustering them into a finite set of foundational shapes. This is more or less the method Vonnegut employed, excepting, of course, that he probably only read a few hundred stories and probably only sketched out a few dozen on his chalk board.
You could use another method, such as the one that Benjamin Schmidt has proposed over at his Sapping Attention blog.

Background:

In my previous post, I explained how I developed some software (named “Suyzhet” in homage to Propp) to extract plot shapes from novels based on sentiment analysis. In order to understand how I derive the six/seven plot archetypes, we need to understand a little bit about Euclidean distance and hierarchical clustering. The former provides a mathematical way of computing the similarity or distance between two points in space. When that space is two dimensional, it’s pretty easy to visualize what is going on: we plot two points on an x-y grid and then measure the distance between them. When the space is three dimensional, it gets a bit harder, but you can still imagine measuring the distance between some point about three feet off the floor in your kitchen and some point about five feet off the floor in your living room. Once we go beyond the third dimension things get downright tricky, and we have to rely on the mathematics of the Euclidean metric. Regardless of the dimensions, though, the fundamental idea is the same: we are measuring the distance between points and the shorter that distance the more similar the points are. In this case the points are books, and the feature that determines their point in space is their “plot shape” as derived from Syuzhet.

Once the distances between all the points are measured, we construct a “distance matrix.” This distance matrix is just a big spread sheet where we can look up the distance from any one point to any other point. It might look something like Figure 1. According to this matrix, the distance between Book 1 and Book 3 is “0.5” whereas the distance between Book 2 and Book 3 is “0.25.”

Figure 1: A Distance Matrix

Hierarchical clustering methods use this distance matrix as a foundation upon which to build a hierarchy of similarities. This hierarchy is often visualized as a dendrogram such as seen in Figure 2.

Figure 2: Dendrogram

Figure 2 is a bit like a tree (upside down); it has branches. At any vertical point, we can cut this tree and the result would be to separate it into two or more branches, or clusters. For example, cutting the tree in Figure 2 at a height of 225, would result in four primary clusters. The trick with this sort of tree cutting, is identifying an “ideal” vertical position to insert the saw. Before I get to that, though, we need to step back for a moment to those plots created with the Syuzhet software.

The Plot Thickens

In my previous post, I showed what the plots of Joyce’s Portrait and Wilde’s Dorian Grey look like when graphed using Suyzhet. Underneath each plot graph, is a sequence of 100 numbers from which the shape of the plot is derived. I have collected these sequences for 41,383 novels, and when I average them, I get the “super average plot archetype” seen in Figure 3.

Figure 3: The Super Average Plot

That is kind of interesting, but things get a lot more interesting after a bit of tree cutting. If you look at the dendrogram in Figure 2 again, you see that cutting the tree just below 250 will result in two primary clusters. After cutting the tree at that point, it is then possible to calculate a mean shape for all the books in each cluster. The result is seen in Figure 4.

Figure 4: Two Primary Plots

In homage to Vonnegut, I have titled the shape on the left “man in hole.” 46% of the books in this corpus fall into this cluster. The remaining 54% are more similar to the plot on the right, which I have named “man on hill.” At this point, I’d encourage you to take a quick peek Maya Eilam’s very nice visualization of Vonnegut’s archetypal plot shapes. The plots I’ll show here are not going to look quite the same, but there will be some resonance.

Looking again at the dendrogram, you can see that the two primary clusters (MOH and MIH), can be split fairly easily into a set of four clusters. When the tree is cut in this manner, the two plots shown in Figure 4, split into four.

Figure 5: MIH Types I and II

Figure 5 shows the derivatives of the man in hole plot shape. The man in hole plot splits into one shape (“Type I”) that looks a lot like classical tragedy and another (“Type II”) that looks more like comedy. Whatever the case, one has a much happier ending than the other. Figure 6 shows the derivatives of the man on hill.

Figure 6: Man on Hill Types I and II

Here again, one plot leads us to a happy ending and the other to a rather dark conclusion.

Cutting the tree beyond these four shapes gets trickier. It is difficult to know where precisely to stop and cut. Move the cut point just a little bit, and we could go from having 10 clusters to 20; it is possible, in fact, to keep moving the the cut point further and further down the tree until a point at which every book is its own cluster! Doing that, however, would be rather silly (see “Caveats” item 1 above). So the objective is to find an “ideal” place to cut the tree such that the resulting clusters have the greatest amount of internal homogeneity while simultaneously being as different from each other as possible.

My solution to this problem involves iterating through a series of possible cut points and then taking two measures after each cutting. The first is a measure of cluster homogeneity the second is a measure of cluster dissimilarity. This process is more easily described in pseudocode:

Let K be a number of possible clusters from 2 to 50.

for(K in 2:50){
- cut the tree such that there are K clusters
- calculate the amount of in-cluster homogeneity
- calculate the dissimilarity between the K clusters
}

With each iteration, I store the resulting values so that I can compare them and identify a value of K that best fulfills the objectives described above. In order to make this test more robust, I opted to randomly select a subset of one half of the books in the corpus (roughly 20K) and run this test over and over again (each time with a new random sample). When I did this, I found that the method identified six as the ideal number of clusters about 90% of the time. The other 10% of the time, it said that seven or eight was a better choice.[2]

In addition to this mathematical approach, I also employed good old subjective evaluation. The tool suggested six or seven, but this number (six, seven) would be rather useless if the resulting shapes did not make any sense to those of us who actually read the books. So, I looked at a lot of plots; everything from two to twenty. After twenty, I figure there is not much point because the shapes get so similar to each other that it would be rather hard to make the case that plot 19 is really all that different from plot 20. With six and with seven, however, there remains good deal of variation.

We saw above how MIH and MOH both split into sub types. These I labeled as MIH Type I, MIH Type II, MOH Type I, and MOH Type II. At the cut point that results in six plots, MIH Type I and MOH Type II stay as we saw them above in figures 5 and 6, but MIH II and MOH I both split resulting in the shapes seen in Figure 7.

Figure 7: Level Six

Already we can begin to see some shape repetition. The variant of MIH seen in the lower right, is ultimately a steeper, or more extreme, version of the basic MIH. The other three, though, appear rather more distinct.

At level seven, MOH II splits in two resulting in the shapes shown in Figure 8. After seven, we begin to see a lot more shape repetition, and though each of these shapes is unique in terms of its precise placement on the y axis, i.e. some are more happy others more dark, the arcs are generally similar.

Obviously, there is a great deal more interpretive work to be done here. Many of these shapes, I think, can be further classified according to their “affects” and “effects.” What, for example, is the overall impression one gets from a book that takes a character to great heights (MOH) and then plunges him/her into a pit of despair from which there is no exit (as is seen in Figure 8 left).

Figure 8: Seven Plots

But perhaps even more interesting than any of this is the possibility for movement between scales. Scale hopping is something I advocate in Macroanalysis. The great power of big(ish) data is that it allows us to contextualize our small reading. Joyce’s Portrait of the Artist (Figure 9) is a type of MIH. What other books are MIHs? Are they popular books? Are they classics? Best sellers? Can we find another telling of the same story? This is the work that I am doing now, moving from the large to the small and back again. Figures 10-15 (below) present six popular/well-known novels and their corresponding plot types for consideration.

[Update March 2: Annie Swafford offers an interesting critique of this work on her blog. Her post includes some comments from me in response.]

Figure 9: Joyce’s Portrait

Figure 10

Figure 11

Figure 12

Figure 13

Figure 14

Figure 15

Footnotes:

[1] The Suyzhet package performs a certain type of text analysis, and I’m claiming that the results of this analysis may serve as a pretty darn good proxy for plot. That said, I’ve been working on this problem for two years, and I know some specific places where it fails. The most spectacular example of failure was discovered by my son. He’d just finished reading one of the books in my corpus, and I showed him the plot shape from the book and asked him it it made sense. He said, “well, yes, mostly. But this spike here is all wrong.” It was a spike in good fortune, positive valence, at precisely the place in the novel where the villains had scored a major victory. The positive valence was associated with a several page long section in which the bad guys were having a very good time. Readers, of course, would see this as a negative moment in the text, Suyzhet does not. Nor does Suyzhet understand irony and dark humor and so on. On a whole, however, Suyzhet gets it right, and that’s because most books are not sustained satire, or sustained irony. Most books end up using emotional markers in a fairly consistent and conventional way. Indeed, even for an experimental novel such as Joyce’s Ulysses, Suyzhet produces a plot shape that I consider to be a good match to the ebbs and flows of the text.

[2] In a longer, less blog friendly version of this research that is to appear in a collection of essays on digital literary studies, I explain the mathematics in precise detail.

Jun 12 2014

Academic, Commentary, Text-Mining

Reading Macroanalysis: The Hard Way!

This past November, Judge Denny Chin ruled to dismiss the Authors Guild’s case against Google; the Guild vowed they would appeal the decision and two months ago their appeal was submitted. I’ll leave it to my legal colleagues to discuss the merit (or lack) in the Guild’s various arguments, but one thing I found curious was the Guild’s assertion that 78% of every book is available, for free, to visitors to the Google Books pages.

According to the Guild’s appeal:

Since 2005, Google has displayed verbatim text from copyrighted books on these pages. . . Google generally divides each page image into eighths, which it calls “snippets.”. . . Once a user retrieves a book through her initial search, she can enter any other search terms she chooses, and the author’s verbatim words will be displayed in three snippets for each search. Although Google has stated that any given search by a user “only” displays three snippets of each book, a single user can view far more than three snippets from a Library Project book by performing multiple searches using different terms, including terms suggested by Google. . . Even minor variations in search terms will yield different displays of text. . . Google displays snippets from each book, except that it withholds display of 10% of the pages in each book and of one snippet per page. . .Thus, Google makes the vast majority of the text of these books—in all, 78% of each work—available for display to its users.

I decided to test the Guild’s assertion, and what better book to use than my own: Macroanalysis: Digital Methods and Literary History.

In the “Preview,” Google displays the front matter (table of contents, acknowledgements, etc) followed by the first 16 pages of my text. I consider this tempting pabulum for would be readers and within the bounds of fair use, not to mention free advertising for me. The last sentence in the displayed preview is cut off; it ends as follows: “We have not yet seen the scaling of our scholarly questions in accordance with the massive scaling of digital content that is now. . . ” Thus ends page 16 and thus ends Google’s preview.

According to the author’s guild, however, a visitor to this book page can access much more of the book by using a clever method of keyword searching. What the Guild does not tell us, however, is just how impractical and ridiculous such searching is. But that is my conclusion and I’m getting ahead of myself here. . .

To test the guild’s assertion, I decided to read my book for free via Google books. I began by reading the material just described above, the front matter and the first 16 pages (very exciting stuff, BTW). At the end of this last sentence, it is pretty easy to figure out what the next word would be; surely any reader of English could guess that the next word, after “. . .scaling of digital content that is now. . . ” would be the word “available.”

Just to be sure, though, I double-checked that I was guessing correctly by consulting the print copy of the book. Crap! The next word was not “available.” The full sentence reads as follows: “We have not yet seen the scaling of our scholarly questions in accordance with the massive scaling of digital content that is now held in twenty-first-century digital libraries.”

Now why is this mistake of mine important to note? Reading 78% of my book online, as the Guild asserts, requires that the reader anticipate what words will appear in the concealed sections of the book. When I entered the word “available” into the search field, I was hoping to get a snippet of text from the next page, a snippet that would allow me to read the rest of the sentence. But because I guessed wrong, I in fact got non-contiguous snippets from pages 77, 174, 72, 9, 56, 15, 37, 162, 8, 4, 80, 120, 154, 46, 133, 79, 27, 97, 147, and 17, in that order. These are all the pages in the book where I use the word “available” but none include the rest of the sentence I want to read. Ugh.

Fortunately, I have a copy of the full text on my desk. So I turn to page 17 and read the sentence. Aha! I now conduct a search for the word “held.” This search results in eight snippets; the last of these, as it happens, is the snippet I want from page 17. This new snippet contains the next 42 words. The snippet is in fact just the end of the incomplete sentence from page 16 followed by another incomplete sentence ending with the words: “but we have not yet fully articulated or explored the ways in which. . . ”

So here I have to admit that I’m the author of this book, and I have no idea what follows. I go back to my hard copy to find that the sentence ends as follows: “. . . these massive corpora offer new avenues for research and new ways of thinking about our literary subject.”

Without the full text by my side, I’d be hard pressed to come up with the right search terms to get the next snippet. Luckily I have the original text, so I enter the word “massive” hoping to get the next contiguous snippet. Six snippets are revealed, the last of these includes the sentence I was hoping to find and read. After the word “which,” I am rewarded with “these massive corpora offer new avenues for” and then the snippet ends! Crap, I really want to read this book for free!

So I think to myself, “what if instead of trying to guess a keyword from the next sentence, I just use a keyword from the last part of the snippet. “avenues” seems like a good candidate, so I plug it in. Crap! The same snippet is show again. Looks like I’m going to have to keep guessing. . .

Let’s see, “new avenues for. . . ” perhaps new avenues for “research”? (Ok, I’m cheating again by going back to the hard copy on my desk, but I think a savvy user determined to read this book for free might guess the word “research”). I plug it in. . . 38 snippets are returned! I scroll though them and find the one from page 17. The key snippet now includes the end of the sentence: “research and new ways of thinking about our literary subject.”

Now I’m making progress. Unfortunately, I have no idea what comes next. Not only is this the end of a sentence, but it looks like it might be the end of a paragraph. How to read the next sentence? I try the word “subject” and Google simply returns the same snippet again (along with assorted others from elsewhere in the book). So I cheat again and look at my copy of the book. I enter the word “extent” which appears in the next sentence. My cheating is rewarded and I get most of the next sentence: “To some extent, our thus-far limited use of digital content is a result of a disciplinary habit of thinking small: the traditionally minded scholar recognizes value in digital texts because they are individually searchable, but this same scholar, as a. . . ”

Thank goodness I have tenure and nothing better to do!

The next word is surely the word “result,” which I now dutifully enter into the search field. Among the 32 snippets that the search returns, I find my target snippet. I am rewarded with a copy of the exact same snippet I just saw with no additional words. Crap! I’m going to have to be even more cleaver if I’m going to game this system.

Back to my copy of the book I turn. The sentence continues “as a result of a traditional training,” so I enter the word “traditional,” and I’m rewarded with . . . the same damn passage again! I have already seen it twice, now thrice. My search for the term “traditional” returns a hit for “traditionally” in the passage I have already seen and, importantly, no hit for the instance of “traditional” that I know (from reading the copy of the book on my desk) appears in the next line. How about “training,” I wonder. Nothing! Clearly Google is on to me now. I get results for other instances of the word “training” but not for the one that I know appears in the continuation of the sentence I have already seen.

Well, this certainly is reading Macroanalysis the hard way. I’ve now spent 30 minutes to gain access to exactly 100 words beyond what was offered in the initial preview. And, of course, my method involved having access to the full text! Without the full text, I don’t think such a process of searching and reading is possible, and if it is possible, it is certainly not feasible!

But let’s assume that a super savvy text pirate, with extensive training in English language syntax could guess the right words to search and then perform at least as well as I did using a full text version of my book as a crutch. My book contains, roughly, 80,000 words. Not counting the ~5k offered in the preview, that leaves 75,000 words to steal. At a rate of 200 words per hour, it would take this super savvy text pirate 375 hours to reconstruct my book. That’s about 47 days of full-time, eight-hour work.

I get it. Times are tough and some folks simply need to steal books from snippet view because they can’t afford to buy them. I’m sympathetic to these folks; they need to satisfy their intense passion for reading and knowledge and who could blame them? Then again, if we consider the opportunity cost at $7.25 per hour (the current minimum wage), then stealing this book from snippet view would cost a savvy text pirate $2,218.75 in lost wages. The eBook version of my text, linked to from the Google Books web page, sells for $14.95. Hmmm?

Jun 05 2014

Academic, Commentary, Text-Mining

A Novel Method for Detecting Plot

While studying anthropology at the University of Chicago, Kurt Vonnegut proposed writing a master’s thesis on the shape of narratives. He argued that “the fundamental idea is that stories have shapes which can be drawn on graph paper, and that the shape of a given society’s stories is at least as interesting as the shape of its pots or spearheads.” The idea was rejected.

In 2011, Open Culture featured a video in which Vonnegut expanded on this idea and suggested that computers might someday be able to model the shape of stories, that is, the movement of the narratives, the plots. The video is about four minutes long; it’s worth watching.

About the same time that I discovered this video, I was working on a project in which I was applying the tools and techniques of sentiment analysis to works of fiction.^[1] Initially I was interested in tracing the evolution of emotional content in novels over the course of the 19th century. By accident I discovered that the sentiment I was detecting and measuring in the fiction could be used as a highly accurate proxy for plot movement.

Joyce’s Portrait of the Artist as a Young Man is a story that I know fairly well. Once upon a time a moo cow came down along the road. . .and so on . . .

Here is the shape of Portrait of the Artist as a Young Man that my computer drew based on an analysis of the sentiment markers in the text:

If you are familiar with the plot, you’ll readily see that the computer’s version of the story is accurate. As it happens, I was teaching Portrait last fall, so I projected this image onto the white board and asked my students to annotate it. Here are a few of the high (and low) points that we identified.

Because the x-axis represents the progress of the narrative as a percentage, it is easy to move from the graph to the actual pages in the text, regardless of the edition one happens to be using. That’s precisely what we did in the class. We matched our human reading of the book with the points on the graph on a page-by-page basis.

Here is a graph from another Irish novel that you might know; this is Wilde’s Picture of Dorian Gray.

If you remember the story, you’ll see how well this plot line models the movement of the story. Discovering the accuracy of these graphs was quite thrilling.

This next image shows Dan Brown’s blockbuster novel The Da Vinci Code. Notice how much more regular the fluctuations are. This is the profile of a page turner. Notice too how the more generalized blue trend line hovers above neutral in terms of its emotional valence. Dan Brown never lets the plot become too troubled or too much of a downer. He baits us and teases us with fluctuating emotion.

Now compare Da Vinci Code to one of my favorite contemporary novels, Cormac McCarthy’s Blood Meridian. Blood Meridian is a dark book and the more generalized blue trend line lingers in the realms of negative emotion throughout the text; it is a very different book from The Da Vinci Code.^[2]

I won’t get into the precise details of how I am measuring emotional valence in these books here.^[3] It’s a bit too complicated for an already too long blog post. I will note, however, that the process involves two major components: a controlled vocabulary of positive and negative sentiment markers collected by Bing Liu of the University of Illinois at Chicago and a machine model that I trained to identify and score passages as positive or negative.

In a follow-up post, I’ll describe how I normalized the plot shapes in 40,000 novels in order to compare the shapes and discover what appear to be six archetypal plots!

NOTES:
[1] In the field natural language processing there is an area of research known as sentiment analysis or, sometimes, opinion mining. And when our colleagues engage in this kind of work, they very often focus their study on a highly stylized genre of non-fiction: the review, specifically movie reviews and product reviews. The idea behind this work is to develop computational methods for detecting what we, literary folk, might call mood, or tone, or sentiment, or perhaps even refer to as affect. The psychologists prefer the word valence, and valence seems most appropriate to this research of mine because the psychologists also like to measure degrees of positive and negative valence. I am not aware of anyone working in sentiment analysis who is specifically interested in modeling emotional valence in fiction. In fact, the great majority of work in this field is so far removed from what we care about in literary studies that I spent about six months simply wondering whether or not the methods developed by folks trying to gauge opinions in movie reviews could even be usefully employed in studies of literature.
[2] I gained access to some of these novels through a data transfer agreement made between the University of Nebraska and a private company that is no longer in business. See Unfolding the Novel.
[3] I’m working on a longer and more formal version of this research report for publication. The longer version will include all the details of the methodology. Stay Tuned:-)

Apr 21 2014

Academic, R-Code, Text-Mining, Tips and Code

Text Analysis with R . . . coming soon.

My new book, Text Analysis with R for Students of Literature is due from Springer sometime in May. I got the cover proofs this week (below). Looking good:-)

Apr 06 2014

Academic, R-Code, Text-Mining

Simple Point of View Detection

[Note 4/6/14 @ 2:24 CST: oops, had a small error in the code and corrected it: the second if statement should have been “< 1.5" which made me think of a still simpler way to code the function as edited.] [Note 4/6/14 @ 2:52 CST: After getting some feedback from Jonathan Goodwin about Ford's The Good Soldier, I added a slightly modified version of the function to the bottom of this post. The new function makes it easier to experiment by dialing in/out a threshold value for determining what the function labels as “first” vs. “third.”]

In my Macroanalysis class this term, my students are studying character and characterization. As you might expect, the manner in which we/they analyze character varies depending upon whether the story is being told in the first or third person. Since we are working with a corpus of about 3500 books, it was not practical (at least in the time span of a single class) to hand code each book for its point of view (POV). So instead of opening each text and skimming it to determine its POV, I whipped up a super simple “POV detector” function in R.*

Below you will find the function and then three examples showing how to call the function using links to the Project Gutenberg versions of Feodor Dostoevsky’s first person novel, Notes from the Underground, Stoker’s epistolary Dracula, and Joyce’s third person narrative Portrait of the Artist as a Young Man.**

We have not done anything close to a careful analysis of the results of these predictions, but we have checked a semi-random selection of 30 novels from the full corpus of 3500. At least for these 30, the predictions are spot on. If you take this function for a test drive and discover texts that don’t get correctly assigned, please let me know. This is a very “low-hanging fruit” approach and the key variable (called “first.third.ratio”) can easily be tuned. Of course, we might also consider a more sophisticated machine classification approach, but until we learn otherwise, this functions seems to be doing quite well. Please test it out and let us know what you discover…

predict.pov.from.plain<-function(path.to.file){
  first.person<-c("i", "me", "my", "myself", "we", "us")
  third.person<-c("he", "she", "his", "hers", "him", "her", "himself", "herself")
  text.lines <- scan(path.to.file, what="character", sep="\n")
  text.v <- paste(text.lines, collapse=" ")
  text.lower.v <- tolower(text.v)
  text.words.l <- strsplit(text.lower.v, "\\W")
  text.word.v <- unlist(text.words.l)
  not.blanks.v  <-  which(text.word.v!="")
  text.word.v <-  text.word.v[not.blanks.v]
  text.freqs.t <- table(text.word.v)/length(text.word.v)
  first.sum<-sum(text.freqs.t[first.person], na.rm = TRUE)
  third.sum<-sum(text.freqs.t[third.person], na.rm = TRUE)
  first.third.ratio<-third.sum/first.sum
  if(first.third.ratio >= 1.5){
    pov<-"third"
  } else {
    pov<-"first"
  }
  return(pov)
}

# Now Some Examples: 
# Notes from Underground, First Person POV
predict.pov.from.plain("http://www.gutenberg.org/cache/epub/600/pg600.txt") 

# Dracula Epistolary, First Person POV
predict.pov.from.plain("http://www.gutenberg.org/cache/epub/345/pg345.txt")

# Portrait of the Artist as a young man, Third Person POV
predict.pov.from.plain("http://www.gutenberg.org/cache/epub/4217/pg4217.txt")

Here is a slightly revised version of the function that allows you to set a different “threshold” when calling the function. Johnathan Goodwin reported on Twitter that Ford’s The Good Soldier was being reported as “third” person (which is wrong). Using this new version of the function, you can dial up or down the threshold until you find the a sweet spot for a particular text (such as The Good Soldier) and then use that threshold to test on other texts.

predict.pov.from.plain.w.thresh<-function(path.to.file, threshold=1.5){
  first.person<-c("i", "me", "my", "myself", "we", "us")
  third.person<-c("he", "she", "his", "hers", "him", "her", "himself", "herself")
  text.lines <- scan(path.to.file, what="character", sep="\n")
  text.v <- paste(text.lines, collapse=" ")
  text.lower.v <- tolower(text.v)
  text.words.l <- strsplit(text.lower.v, "\\W")
  text.word.v <- unlist(text.words.l)
  not.blanks.v  <-  which(text.word.v!="")
  text.word.v <-  text.word.v[not.blanks.v]
  text.freqs.t <- table(text.word.v)/length(text.word.v)
  first.sum<-sum(text.freqs.t[first.person], na.rm = TRUE)
  third.sum<-sum(text.freqs.t[third.person], na.rm = TRUE)
  first.third.ratio<-third.sum/first.sum 
  if(first.third.ratio >= threshold){
    pov<-"third"
  } else {
    pov<-"first"
  } 
  return(pov)
}
# The Good Soldier is first Person POV but if we set the threshold too low, it gets assigned third person. . . This version of the function allows you to tune the threshold for experimenting with different texts.
# using a threshold of 2.1, this book is assigned a proper POV
predict.pov.from.plain.w.thresh("http://www.gutenberg.org/cache/epub/4217/pg4217.txt", threshold=2.1)

*[Actually, I whipped up two functions: the one seen here and another one that takes Part of Speech (POS) tagged text as input. Both versions seem to work equally well but this one that takes plain text as input is easier to implement.]

**[Note that none of the Gutenberg boilerplate text is removed in this process. In the implementation I am using with my students, all metadata has already been removed.]

Feb 25 2014

1 Comment

Academic, R-Code, Text-Mining, Tips and Code

Experimenting with “gender” package in R

Yesterday afternoon, Lincoln Mullen and Cameron Blevins released a new R package that is designed to guess (infer) the gender of a name. In my class on literary characterization at the macroscale, students are working on a project that involves a computational study of character genders. . . needless to say, the ‘gender‘ package couldn’t have come at a better time. I’ve only begun to experiment with the package this morning, but so far it looks very promising.

It doesn’t do everything that we need, but it’s a great addition to our workflow. I’ve copied below some R code that uses the gender package in combination with some named entity recognition in order to try and extract character names and genders in a small sample of prose from Twain’s Tom Sawyer. I tried a few other text samples and discovered some significant challenges (e.g. Mrs. Dashwood), but these have more to do with last names and the difficult problems of accurate NER than anything to do with the gender package.

Anyhow, I’ve just begun to experiment, so no big conclusions here, just some starter code to get folks thinking. Hopefully others will take this idea and run with it!

library(gender)
library(openNLP)
require(NLP)

sent_token_annotator <- Maxent_Sent_Token_Annotator()
word_token_annotator <- Maxent_Word_Token_Annotator()

s <- as.String("Tom did play hookey, and he had a very good time. He got back home barely in season to help Jim, the small colored boy, saw next-day's wood and split the kindlings before supper—at least he was there in time to tell his adventures to Jim while Jim did three-fourths of the work. Tom's younger brother (or rather half-brother) Sid was already through with his part of the work (picking up chips), for he was a quiet boy, and had no adventurous, trouble-some ways. While Tom was eating his supper, and stealing sugar as opportunity offered, Aunt Polly asked him questions that were full of guile, and very deep—for she wanted to trap him into damaging revealments. Like many other simple-hearted souls, it was her pet vanity to believe she was endowed with a talent for dark and mysterious diplomacy, and she loved to contemplate her most transparent devices as marvels of low cunning.")

a2 <- annotate(s, list(sent_token_annotator, word_token_annotator))
entity_annotator <- Maxent_Entity_Annotator()
named.ents<-s[entity_annotator(s, a2)]
named.ents.l <- strsplit(named.ents, "\\W")
named.ents.v <- unlist(named.ents.l)
not.blanks.v  <-  which(named.ents.v!="")
named.ents.v <-  named.ents.v[not.blanks.v]
gender(tolower(named.ents.v))

And here is the output:

   name gender proportion_male proportion_female
1   tom   male          0.9971            0.0029
2   jim   male          0.9968            0.0032
3   tom   male          0.9971            0.0029
4   tom   male          0.9971            0.0029
5  aunt                 NA                NA
6 polly female          0.0000            1.0000

Sep 03 2013

Academic, Text-Mining, Tips and Code

Text Analysis with R for Students of Literature

[Update (9/3/13 8:15 CST): Contributors list now active at the main Text Analysis with R for Students of Literature Resource Page]

Below this post you will find a link where you can download a draft of Text Analysis with R for Students of Literature. The book is under review with Springer as part of a new series titled “Quantitative Methods in the Humanities and Social Sciences.”

Springer agreed to let me post the draft manuscript here (thank you, Springer), and my hope is that you will download the manuscript, take it for a test drive, and then send me your thoughts. I’m especially interested to hear about areas of confusion, places where you get lost, or where you feel your students might get lost. I’m also interested in hearing about the errors (hopefully not too many), and, naturally, I’ll be delighted to hear about anything you like.

I’m open to suggestions for new sections, but before you suggest that I include another chapter on “your favorite topic,” please read the Preface where I lay out the scope of the book. It’s a beginner’s book, and helping “literary folk” get started with R is my primary goal. This is not the place to get into debates or details about hyper parameter optimization or the relative merits of p-values.*

Please also read the Acknowledgements. It is there that I hint at the spirit and intent behind the book and behind this call for feedback. I did not learn R without help, and there is still a lot about R that I have to learn. I want to acknowledge both of these facts directly and specifically. Those who offer feedback will be added to a list of contributors to be included in the print and online editions of the final text. Feedback of a substantial nature will be acknowledged directly and specifically.

Book is now in production and draft has been removed.~~That’s it. Download Text Analysis with R for Students of Literature (1.3MB .pdf)~~

* Besides, that ground has been well-covered by Scott Weingart

Apr 12 2013

Academic, Text-Mining, Tips and Code

“Secret” Recipe for Topic Modeling Themes

The recently (yesterday) published issue of JDH is all about topic modeling. It’s a great issue, and it got me thinking about some of the lessons I have learned over seven or eight years of modeling literary corpora. One of the important things I have learned is that the quality of the final model (which is to say the coherence and usefulness of the topics) is largely dependent upon preprocessing. I know, I know: “that’s not much fun.”

Fun or no, it is the reality, and there’s no getting around it. One of the first things you discover when you begin modeling literary materials is that books have a lot of characters. And here I don’t mean the letters “A, B, C,” but actual literary characters as in “Ahab, Beowulf, and Copperfield.” These characters can cause no end of headaches in topic modeling. Let me explain. . .

As I write this blog post, I am running a smallish topic modeling job over a corpus of 50 novels that I have selected for use in a topic modeling workshop I am teaching next week in Milwaukee. Without any preprocessing I get topics that look like these two:

There is nothing wrong with these topics except that one is obviously a “Moby Dick” topic and the other a “Dracula” topic. A big part of the reason these topics formed in this way is because of the power of the character names (yes, “whale” is a character). The presence of the character names tends to bias the model and make it collect collocates that cluster around character names. Instead of getting a topic having to do with “seafaring” (a theme, by the way, that appears in both Moby Dick and Dracula) we get these broad novel-specific topics instead.

That is not what we want.

To deal with this character “problem,” I begin by expanding the usual topic modeling “stop list” from the 100 or so high frequency, closed class words (such as “the, of, a, and. . .”) to include about 5,600 common names, or “named entities.” I posted this “expanded stoplist” to my blog some months ago as ancillary material for my book; feel free to copy it for your own work. I built my expanded stop list through a combination of named entity recognition and the scraping of baby name web sites:-)

Using the exact same model parameters that produced the two topics above, but now with the expanded stop list, I get topics that are much less about individual novels and much more about themes that cross novels. Here are two examples.

A topic of words relating to Native Americans, but mostly from *Last of the Mohicans*?

The first topic cloud seems pretty good. In the previous run of the model, without the expanded stop list, there was no such topic. The second one; however, is still problematic, largely because my expanded stopwords list, even at 5,631 words, is still imperfect. “Heyward” is a character from Last of the Mohicans whose name is not in my stop list.

But in addition to this imperfection, I would argue that there are other problems as well, at least if our objective is to harvest topics of a thematic nature. Notice, for example, the word “continued” just to the left of “heyward” and then notice “demanded” near the bottom of the cloud. These words do not contribute very much at all to the thematic sense of the topic, so ideally they too should be stopped out.

As a next step in preprocessing, therefore, I employ Part-of-Speech tagging or “POS-Tagging” in order to identify and ultimately “stop out” all of the words that are not nouns! Since I can already hear my friend Ted Underwood screaming about “discourses,” let me justify this suggestion with a small but important caveat: I think this is a good way to capture thematic information; it certainly does not capture such things as affect (i.e. attitudes towards the theme) or other nuances that may be very important to literary analysis and interpretation.

POS tagging is well documented, so I’m not going to foreground it here other than to say that it’s an imperfect method. It does make mistakes, but the best taggers (such as the Stanford Tagger that I usually use) have very (+97%) accuracy (see, for example Manning 2011).

After running a POS tagger, I have a simple little script that uses a simple little regular expression to change the following tagged sentences:

The/DT family/NN of/IN Dashwood/NNP had/VBD been/VBN long/RB settled/VBN in/IN Sussex./NNP Their/PRP$ estate/NN was/VBD large,/RB and/CC their/PRP$ residence/NN was/VBD at/IN Norland/NNP Park,/NNP in/IN the/DT centre/NN of/IN their/PRP$ property,/NN where,/, for/IN many/JJ generations,/NNS they/PRP had/VBD lived/VBN in/IN so/RB respectable/JJ a/DT manner,/JJ as/IN to/TO engage/VB the/DT general/JJ good/JJ opinion/NN of/IN their/PRP$ surrounding/VBG acquaintance./NN

into

family estate residence centre property generations opinion acquaintance

Just with this transformation to nouns alone, you can begin to see how a theme of “property” or “family estates” might eventually evolve from these words during the topic modeling process. But there is still one more preprocessing step before we can run the LDA. The next step (which can really be the first step) is text chunking or segmentation.

Topic models like to have lots of texts; or more precisely they like to have lots of bags of words. Topic models such as LDA do not take into account word order, they assume that each text or document is a bag of words. Novels are very big bags, and if we don’t chunk them up into smaller pieces we end up getting topics of a very general nature. By chunking each novel into smaller pieces, we allow the model to discover themes that occur only in specific places within novels and not just across entire novels. Consider the theme of death, for example. While there may be entire novels about death, more than likely death is going to pop up once or twice in every novel. In order for the topic model to detect or find a death topic, however, it needs to encounter bags of words that are largely about death. If the whole novel is a single bag of words, then death might not be prominent enough to rise to the level of “topicdom.”

I have found through lots and lots of experimentation that 500-1000 word chunks are pretty darn good when modeling novels. It might help to think in terms of pages: 500-1000 words is roughly 2-4 pages. The argument for this length goes something like this: a good death scene takes several pages to develop. . . etc.

Exactly what chunk length I choose is a matter of some witchcraft and alchemy; it is similar to the witchcraft and tarot involved in choosing the number of topics. I’ll not unpack either of those here, but you can read more in chapter 8 of my book (plug). Here the point is to simply say that some chunking needs to happen if you are working with big documents.

So here, finally, is my “secret” recipe in pseudo code:

for each novel as novel {
POS tag novel
split tagged novel into 1000 word chunks
for each chunk as chunk {
remove non-nouns from chunk
lowercase everything
remove stop list words from chunk
}
}
run LDA over chunks
analyze data

Of course, there is a lot more to it than this: you need to keep track of which chunks go with which novels and so on. But this is the general recipe.* Here are two topics derived from the same corpus of novels, now without character names and without non-nouns.

* The word “Secret” in my title is in quotes because there is nothing secret about the ingredients in this particular recipe. The idea of combining POS tagging, text chunking, and LDA is well established in various papers, including, for example, “TagLDA: Bringing document structure knowledge into topic models” (2006) and Reading Tea Leaves: How Humans Interpret Topic Models (2009).

Feb 22 2013

1 Comment

Academic, Text-Mining

Pronouns in 19th Century Fiction

Some folks I follow on Twitter (@scott_bot, @benmschmidt, @rayncordell, @foxyfolklorist, and others) were engaged in a conversation this week about the frequency of gendered pronouns in a corpus of 233 fairy tales from @foxyfolklorist’s dissertation. For a bit of literary contextualization, I tweeted a bar graph showing the frequency of 13 pronouns in a corpus of ~3,500 19th century novels. The bar graph (seen again here) breaks down pronoun usage by author gender (M, F, and U).

It is natural to wonder, as David Mimno (@dmimno) did this morning, if there is any significance to the gender results: is gender really correlated to these observed means or are the observed means just an artifact of messy data. One way to explore the extent to which these observed means really are an entailment of gender is to ask what the means would look like if gender were not a factor. In other words, what would happen if all the data about author gender were shuffled and the means then recalculated?*

If we do this shuffling and recalculating a whole bunch of times, say 100 times, we can then plot all the fake “genderless” permutations along side the actual observed means and thereby see whether the observed means are outside or inside what we would expect if gender were not a factor influencing pronoun use.**

Below are the plots for the 13 pronouns from my original bar graph (above). What you’ll see below is that for certain pronouns, such as “him,” “I,” “me,” “my” and “your”, the observed (“real”) means are within the range of “expected” values if gender were not a consideration. For other pronouns, however, such as “he,” “her,” “she” and “we,” the observed values are outside the values in the randomized “fake” data generated by taking gender out of the equation.

Another fascinating element of these graphs is found in the third “U” column. These are authors of unknown gender. It is hard not too look at these observed values and wonder about the most likely genders of those anonymous writers. . .

* [As it happens, this is precisely the approach that David Mimno suggested we take in some other work (under review) in which we assess the significance of topic use (rather than pronoun use) by male and female authors.]

** [Naturally, it could be that the determining factor here is not really gender at all. It could be that “we” (readers, editors, publishers, etc) have selected for books authored by men that express one set of linguistic qualities and books by women that express another set. In other words, these graphs don’t prove that women and men necessarily use pronouns differently, only that they do so (or don’t depending on the pronoun in question) in this particular corpus of 19th century fiction.]

Feb 20 2013

3 Comments

Academic, Text-Mining

Unfolding the Novel

I’m excited to announce a new research project dubbed “Unfolding the Novel” (which is a play on both “paper” and “protein” folding). In collaboration with colleagues from the Stanford Literary Lab and Arizona State University and in partnership with researchers of the Book Genome project of BookLamp.com we have begun work that traces stylistic and thematic change across 300 years of fiction, from 1700-2000! Today UNL posted a news release announcing the partnership and some of our goals.

The primary goal of the project is to map major stylistic and thematic trends over 300 years of creative literature. To facilitate this work, BookLamp is providing access to a large store of metadata pertaining to mostly 20th and 21st century works of fiction. This data will be combined with similar data we have already compiled from the 19th century and new data we are curating now from the 18th century. The research team will not access the actual books but will explore at the macroscale in ways that are similar to what one can do with the data provided to researchers at the Google Ngrams project. A major difference, however, is that the data in the “Unfolding” project is highly curated, limited to fiction in English, and enriched with additional metadata including information about both gender and genre distribution.

Our initial data set consists of token frequency information that has been aggregated across one or more global metadata facets including but not limited to publication year, author gender, and book genre. Such data includes, for example, a table containing the year-to-year mean relative frequencies of the most common words in the corpus (e.g the relative frequencies of the words “the, a, an, of, and” etc).

I’ll be reporting on the project here as things progress, but for now, it’s back to the drudgery of the text mines. . . 😉

Jul 20 2012

Academic, Text-Mining

Computing and Visualizing the 19th-Century Literary Genome

I was unable to attend the DH 2012 meeting in Hamburg, but I recorded my paper as a screen cast, and my ever faithful colleague Glen Worthey kindly delivered it on my behalf. The full presentation can be viewed here as a QuickTime movie.

Picture of my (empty) seat in Hamburg (I assume that is a glass of Riesling). Image Credit: Jan Rybicki.

May 28 2012

1 Comment

Academic, Text-Mining

Macroanalysis

In preparation for the publication of my book (Macroanalysis: Digital Methods and Literary History, UIUC Press, 2013), I’ve begun posting some graphs and other data to my (new) website. To get the ball rolling, I have created an interactive “theme viewer” where visitors will find a drop down menu of the 500 themes I harvested from a corpus of 3,346 19-century British, Irish and American novels using “topic modeling” and a series of pre-processing routines that I detail in the book. Each theme is accompanied by a word cloud showing the relative importance of each term to the topic, and each cloud is followed by four graphs showing the distribution of the topic/theme over time and across author genders and author nationalities. Here is a sample of a theme I have labeled “FACTORY AND WORKHOUSE LABOR.” You can click on the thumbnails below for larger images, but the real fun is over at the theme viewer.

Sep 29 2011

Academic, Text-Mining

The LDA Buffet is Now Open; or, Latent Dirichlet Allocation for English Majors

For my forthcoming book, which includes a chapter on the uses of topic modeling in literary studies, I wrote the following vignette. It is my imperfect attempt at making the mathematical magic of LDA palatable to the average humanist. Imperfect, but hopefully more fun than plate notation. . .

. . . imagine a quaint town, somewhere in New England perhaps. The town is a writer’s retreat, a place they come in the summer months to seek inspiration. Melville is there, Hemingway, Joyce, and Jane Austen just fresh from across the pond. In this mythical town there is spot popular among the inhabitants; it is a little place called the “LDA Buffet.” Sooner or later all the writers go there to find themes for their novels. . .

One afternoon Herman Melville bumps into Jane Austen at the bocce ball court, and they get to talking.

“You know,” says Austen, “I have not written a thing in weeks.”

“Arrrrgh,” Melville replies, “me neither.”

So hand in hand they stroll down Gibbs Lane to the LDA Buffet. Now, down at the LDA Buffet no one gets fat. The buffet only serves light (leit?) motifs, themes, topics, and tropes (seasonal). Melville hands a plate to Austen, grabs another for himself, and they begin walking down the buffet line. Austen is finicky; she spoons a dainty helping of words out of the bucket marked “dancing.” A slightly larger spoonful of words, she takes from the “gossip” bucket and then a good ladle’s worth of “courtship.”

Melville makes a bee line for the “whaling” trough, and after piling on an Ahab-sized handful of whaling words, he takes a smaller spoonful of “seafaring” and then just a smidgen of “cetological jargon.”

The two companions find a table where they sit and begin putting all the words from their plates into sentences, paragraphs, and chapters.

At one point, Austen interrupts this business: “Oh Herman, you must try a bit of this courtship.”

He takes a couple of words but is not really fond of the topic. Then Austen, to her credit, asks permission before reaching across the table and sticking her fork in Melville’s pile of seafaring words, “just a taste,” she says. This work goes on for a little while; they order a few drinks and after a few hours, voila! Moby Dick and Persuasion are written . . .

[Now, dear reader, our story thus far provides an approximation of the first assumption made in LDA. We assume that documents are constructed out of some finite set of available topics. It is in the next part that things become a little complicated, but fear not, for you shall sample themes both grand and beautiful.]

. . . Filled with a sense of deep satisfaction, the two begin walking back to the lodging house. Along the way, they bump into a blurry-eyed Hemingway, who is just then stumbling out of the Rising Sun Saloon.

Having taken on a bit too much cargo, Hemingway stops on the sidewalk in front of the two literati. Holding out a shaky pointer finger, and then feigning an English accent, Hemingway says: “Stand and Deliver!”

To this, Austen replies, “Oh come now, Mr. Hemingway, must we do this every season?”

More gentlemanly then, Hemingway replies, “My dear Jane, isn’t it pretty to think so. Now if you could please be so kind as to tell me what’s in the offing down at the LDA Buffet.”

Austen turns to Melville and the two writers frown at each other. Hemingway was recently banned from the LDA Buffet. Then Austen turns toward Hemingway and holds up six fingers, the sixth in front of her now pursed lips.

“Six topics!” Hemingway says with surprise, “but what are today’s themes?”

“Now wouldn’t you like to know that you old sot.” Says Melville.

The thousand injuries of Melville, Hemingway had borne as best he could, but when Melville ventured upon insult he vowed revenge. Grabbing their recently completed manuscripts, Hemingway turned and ran toward the South. Just before disappearing down an alleyway, he calls back to the dumbfounded writers: “All my life I’ve looked at words as though I were seeing them for the first time. . . tonight I will do so again! . . . ”

[Hemingway has thus overcome the first challenge of topic modeling. He has a corpus and a set number of topics to extract from it. In reality determining the number of topics to extract from a corpus is a bit trickier. If only we could ask the authors, as Hemingway has done here, things would be so much easier.]

. . . Armed with the manuscripts and the knowledge that there were six topics on the buffet, Hemingway goes to work.

After making backup copies of the manuscripts, he then pours all the words from the originals into a giant Italian-leather attache. He shakes the bag vigorously and then begins dividing its contents into six smaller ceramic bowls, one for each topic. When each of the six bowls is full, Hemingway gets a first glimpse of the topics that the authors might have found at the LDA Buffet. Regrettably, these topics are not very good at all; in fact, they are terrible, a jumble of random unrelated words . . .

[And now for the magic that is Gibbs Sampling.]

. . . Hemingway knows that the two manuscripts were written based on some mixture of topics available at the LDA Buffet. So to improve on this random assignment of words to topic bowls, he goes through the copied manuscripts that he kept as back ups. One at a time, he picks a manuscript and pulls out a word. He examines the word in the context of the other words that are distributed throughout each of the six bowls and in the context of the manuscript from which it was taken. The first word he selects is “heaven,” and at this word he pauses, and asks himself two questions:

“How much of ‘Topic A,’ as it is presently represented in bowl A, is present in the current document?”
“Which topic, of all of the topics, has the most ‘heaven’ in it?” . . .

[Here again dear reader, you must take with me a small leap of faith and engage in a bit of further make believe. There are some occult statistics here accessible only to the initiated. Nevertheless, the assumptions of Hemingway and of the topic model are not so far-fetched or hard to understand. A writer goes to his or her imaginary buffet of themes and pulls them out in different proportions. The writer then blends these themes together into a work of art. That we might now be able to discover the original themes by reading the book is not at all amazing. In fact we do it all the time–every time we say that such and such a book is about “whaling” or “courtship.” The manner in which the computer (or dear Hemingway) does this is perhaps less elegant and involves a good degree of mathematical magic. Like all magic tricks, however, the explanation for the surprise at the end is actually quite simple: in this case our magician simply repeats the process 10 billion times! NOTE: The real magician behind this LDA story is David Mimno. I sent David a draft, and along with other constructive feedback, he supplied this beautiful line about computational magic.]

. . . As Hemingway examines each word in its turn, he decides based on the calculated probabilities whether that word would be more appropriately moved into one of the other topic bowls. So, if he were examining the word “whale” at a particular moment, he would assume that all of the words in the six bowls except for “whale” were correctly distributed. He’d now consider the words in each of those bowls and in the original manuscripts, and he would choose to move a certain number of occurrences of “whale” to one bowl or another.

Fortunately, Hemingway has by now bumped into James Joyce who arrives bearing a cup of coffee on which a spoon and napkin lay crossed. Joyce, no stranger to bags-of-words, asks with compassion: “Is this going to be a long night.”

“Yes,” Hemingway said, “yes it will, yes.”

Hemingway must now run through this whole process over and over again many times. Ultimately, his topic bowls reach a steady state where words are no longer needing to be being reassigned to other bowls; the words have found their proper context.

After pausing for a well-deserved smoke, Hemingway dumps out the contents of the first bowl and finds that it contains the following words:

“whale sea men ship whales penfon air side life bounty night oil natives shark seas beard sailors hands harpoon mast top feet arms teeth length voyage eye heart leviathan islanders flask soul ships fishery sailor sharks company. . . “

He peers into another bowl that looks more like this:

“marriage happiness daughter union fortune heart wife consent affection wishes life attachment lover family promise choice proposal hopes duty alliance affections feelings engagement conduct sacrifice passion parents bride misery reason fate letter mind resolution rank suit event object time wealth ceremony opposition age refusal result determination proposals. . .”

After consulting the contents of each bowl, Hemingway immediately knows what topics were on the menu at the LDA Buffet. And, not only this, Hemingway knows exactly what Melville and Austen selected from the Buffet and in what quantities. He discovers that Moby Dick is composed of 40% whaling, 18% seafaring and 2% gossip (from that little taste he got from Jane) and so on . . .

[Thus ends the fable.]

For the rest of the (LDA) story, see David Mimno’s Topic Modeling Bibliography

Sep 18 2011

Academic, Text-Mining

Aberrant Adjectives in 19th Century Novels

I created the visualization below using Many Eyes and a data set derived from part-of-speech tagged novels from 19th century Britain. Found here are the 100 most “aberrant adjectives.” Aberrant here is determined by selecting those words that have the greatest amount of usage deviation (measured by relative frequency) over a 13 decade time period. To qualify a word must also appear in every decade.

Dec 22 2010

1 Comment

Academic, Commentary, Text-Mining

Unigrams, and bigrams, and trigrams, oh my

I’ve been watching the ngrams flurry online, in twitter, and on various email lists over the last couple of days. Though I think there is great stuff to be leaned from Google’s ngram viewer, I’m advising colleagues to exercise restraint and caution. First, we still have a lot to learn about what can and cannot be said, reliably, with this kind of data–especially in terms of “culture.” And second, the eye candy charts can be deceiving, especially if the data is not analyzed in terms of statistical significance.

It’s not my intention here to be a “nay-sayer” or a wet blanket, as I said, there is much to learn from the google data, and I too have had fun playing with the ngram viewer. That said, here are a few things that concern me.

We have no metadata about the texts that are being queried. This is a huge problem. Take the “English Fiction” corpus, for example. What kinds of texts does it contain? Poetry, Drama, Novels, Short Stories. etc? From what countries do these works originate? Is there an even distribution of author genders? Is the sample biased toward a particular genre? What is the distribution of texts over time–at least this last one we can get from downloading the Google data.
There are lots of “forces” at work on patterns of ngram usage, and without access to the metadata, it will be hard to draw meaningful conclusions about what any of these charts actually mean. To call these charts representations of “culture” is, I think, a dangerous move. Even at this scale, the corpus is not representative of culture–it may be, but we just don’t know. More than likely the corpus is something quite other than representative of culture. It probably represents the collection practices of major research libraries. Again, without the metadata to tell us what these texts are and where they are from, we must be awfully careful about drawing conclusions that reach beyond the scope of the corpus. The leap from corpus to culture is a big one.
And then there is the problem of “linguistic drift”, a phenomenon mathematically analogous to genetic drift in evolution. In simple terms, some share of the change observed in ngram frequency over time is probably the result of what can be thought of as random mutations. An excellent article about this process can be found here–>“Words as alleles: connecting language evolution with Bayesian learners to models of genetic drift”.
Data noise and bad OCR. Ted Underwood has done a fantastic job of identifying some problems related to the 18th century long s. It’s a big problem, especially if users aren’t ready to deal with it by substitution of f’s for s’s. But the long s problem is fairly easy to deal with compared to other types of OCR problems–especially cases where the erroneous OCR’ed word spells another word that is correct: e.g. “fame” and “same”. But even these we can live with at some level. I have made the argument over and over again that at a certain scale these errors become less important, but not unimportant. That is, of course, if the errors are only short term aberrations, “blips,” and not long term consistencies. Having spent a good many years looking at bad OCR, I thought it might be interesting to type in a few random character sequences and see what the n-gram viewer would show. The first graph below plots the usage of “asdf” over time. Wow, how do we account for the spike in usage of “asdf” in 1920s and again in the late 1990s? And what about the seemingly cyclical pattern of rising and falling over time. (HINT: Check the y-axis).

And here’s another chart comparing the usage of “asdf” to “qwer.”

And there are any number of these random character sequences. At my request, my three year old made up and typed in “asdv”, “mlik”, “puas”, “puase”, “pux”–all of these “ngrams” showed up in the data, and some of them had tantalizing patterns of usage. My daughter’s typing away on my laptop reminded me of Borges Library of Babel as well as the old story about how a dozen monkeys typing at random will eventually write all of the great works of literature. It would seem that at least a few of the non-canonical primate masterpieces found their way into Google’s Library of Babel.
And then there is the legitimate data in the data that we don’t really care about–title pages and library book plates, for example. After running an Named Entity Extraction algorithm over 2600 novels from the Internet Archive’s 19th century fiction collection, I was surprised to see the popularity of “Illinois.” It was one of the most common place names. Turns out that is because all these books came from the University of Illinois and all contained this information in the first page of the scans. It was not because 19th century authors were all writing about the Land of Lincoln. Follow this link to get a sense of the role that the partner libraries may be playing in the ngram data: Libraries in the Google Data
In other words, it is possible that a lot of the words in the data are not words we actually want in the data. Would it be fair, for example, to say that this chart of the word “Library” in fiction is a fair representation of the interest in libraries in our literary culture? Certainly not. Nor is this chart for the word University an accurate representation of the importance of Universities in our literary culture.

So, these are some problems; some are big and some are small.

Still, I’m all for moving ahead and “playing” with the google data. But we must not be seduced by the graphs or by the notion that this data is quantitative and therefore accurate, precise, objective, representative, etc. What Google has given us with the ngram viewer is a very sophisticated toy, and we must be cautious in using the toy as a tool. The graphs are incredibly seductive, but peaks and valleys must be understood both in terms of the corpus from which they are harvested and in terms of statistical significance (and those light-grey percentages listed on the y-axis).

Nov 02 2010

Academic, Commentary, Text-Mining

SEASR Grant

This month a group of researchers at Stanford, University of Illinois, University of Maryland, and George Mason were awarded a $790,000 grant from the Mellon Foundation to advance the prior work of the SEASR project. I’ll be serving as the overall Project Director and as one of the researchers in the Stanford component of the grant. In this phase of the SEASR project, we will focus on leveraging the existing SEASR infrastructure in support of four “use cases.” But “use case” hardly describes the research intensive nature of the proposed work, nor does it capture the strongly humanistic bias of the work proposed. Each partner has committed to a specific research project and each has the expressed goal of advancing humanities research and publishing their results. I’d like to emphasize this point about advancing humanities research.

This grant represents an important step beyond the tool building, QA and UI testing stages of software development. All too often, it seems, our digital humanities projects devote a great deal of time, money, and labor to infrastructure and prototyping and then all too frequently the results languish in the great sea of hammers without a nail. Sure, a few journeymen carpenters stick these tools in their belts and hammer away, but all too often it seems that more effort goes into building the tools and then the resources sit around gathering dust while humanities research marches on in the time-tested modes with which we are most familiar.

Of course, I don’t mean this to be a criticism of the tool builders or the tools built. The TAPOR project, for example, offers many useful text analysis widgets, and I frequency send my colleagues and students there for quick and dirty text-analysis. And just last month I had occasion to use and cite Stefan Sinclair’s Voyeur application. I was thrilled to have Voyeur at my finger tips; it provided a quick and easy way to do exactly what I wanted.

But often, the analytic tasks involved in our projects are multifaceted and cannot be addressed by any one tool. Instead, these projects involve “flows” in which our “humanistic” data travels though a series of analytic “filters” and comes out on the other end in some altered form. The TAPOR project attempts to be a virtual text analysis “workbench” in which the craftsman can slide a project around the bench from one tool to the next. This model works well for smallish projects but is not robust enough for large scale projects and, despite some significant interface improvements over the years, remains, for me at least, a bit clunky. I find it great for quick tasks with one or two texts, but inefficient for processing multiple texts or multiple processes. Part of the TAPOR mission was to develop a suite of tools that could be used by the average, ordinary humanist: which is to say, the humanist without any real technical chops. It succeeds on that front to be sure.

SEASR offers an alternative approach and what it provides in terms of processing power and computational elegance it gives up in terms of ease of use and transparency. The SEASR “interface” is one that involves constructing modular “workflows” in which each module corresponds to some computational task. These modules are linked together such that one process feeds into the next and the business of “sliding” a project around from one tool to another on the virtual workbench is taken over by the workflow manager.

In this grant we have specifically deemphasized UI development in favor of output, in favor of “results” in the humanities sense of the word. As we write in the proposal, “The main emphasis of the project will be on developing, coordinating, and investigating the research questions posed by the participating humanities scholars.” The scholars in this project include myself and Franco Moretti at Stanford, Dan Cohen at GMU, Tanya Clement at University of Maryland, Ted Underwood and John Unsworth both of UIUC. On the technical end, we have Michael Welge and Loretta Auvil of the Automated Learning Group, of the National Center for Supercomputing Applications.

As the project gets rolling, I will have more to post about the specific research questions we are each addressing and the ongoing results of our work. . .

May 30 2010

Academic, Commentary, Text-Mining

Panning for Memes

Over in the English Department Literature Lab, we have been experimenting with Topic Modeling as a means of discovering latent themes (aka topics) in a corpus of 19th century novels. Topic Modeling is an unsupervised machine learning process that employs Latent Dirichlet allocation. “It posits that each document is a mixture of a small number of topics and that each word’s creation is attributable to one of the document’s topics.”

We’ve been experimenting using the Java-Based MAchine Learning for LanguagE Toolkit (Mallet) from UMASS Amherst and a corpus of British and American novels from the 19th century. In one experiment we ran the topic modeler over just the British corpus, in another over just the American corpus. But when we combined the two collections and ran the model over the whole corpus, we discovered that certain topics showed up in only one or the other corpus. For example, one solely American topic was composed of words related to slavery and words written in southern dialect. And there was a strictly British topic clearly indicative of the royalty and aristocracy: words such as “lord,” “king”, “duke,” “sir”, “lady.” This was an interesting result and not simply because it provides a quantitative way of distinguishing topics or themes that are distinct to one nation or another, but also because the topics themselves could be read and interpreted in context.

More interesting for me, however, were two topics that appeared in both corpora. The first, which appeared more in the British corpus was related to “soldiering.” A second topic, which was more common in the American corpus, has to do with Indian wars. The “soldiering” topic was composed of the following words:

“men,” “general,” “captain,” “colonel,” “army,” “horse,” “sir,” “enemy,” “soldier,” “battle,” “day,” “war,” “officer,” “great,” “country,” “house,” “time,” “head,” “left,” “road,” “british,” “soldiers,” “washington,” “night,” “fire,” “father,” “officers,” “heard,” “moment.”

The Indians topic included:

“indian,” “men,” “indians,” “great,” “time,” “chief,” “river,” “party,” “red,” “white,” “place,” “savages,” “woods,” “day,” “side,” “fire,” “war,” “savage,” “water,” “canoe,” “rifle,” “people,” “warriors,” “returned,” “feet,” “friends,” “tree,” “night,” “distance.”

What was most fascinating, however, was that when the soldiering topic was found in the American corpus it usually had to do with Indians, and when the Indian topic appeared in the British corpus it was almost completely in the context of the Irish! As an Irish-Studies scholar, who wrote a theses on the role of the American West in Irish and Irish-American literature, this was an incredibly rich discovery. The literature of the Irish and the Irish Diaspora is filled with comparisons between the Irish situation vis-à-vis the British and the Native American situation vis-à-vis what one Irish American author described as the “Tide of Empire.”

Reader’s wishing to follow this line of comparison in some more contemporary works might want to have a look at Joyce’s short story “An Encounter,” Flann O’Brien’s book At Swim Two Birds, Paul Muldoon’s Madoc and Patrick McCabe’sThe Butcher Boy.

Mar 19 2010

Academic, Commentary, Text-Mining

Who’s Your DH Blog Mate: Match-Making the Day of DH Bloggers with Topic Modeling

Social Networking for digital humanities nerds? Which DH bloggers are you most compatible with? Let’s get the right nerds with the right nerds–match making made in digital humanities heaven.

After seeing Stefan Sinclair’s Voyeuristic analysis of the Day of DH Blog posts, I wrote and asked him how to get access to the “corpus” of posts. He hooked me up, and I pre-processed the data with a few php scripts, then ran an LDA topic modeling process and then some more post processing with R in order to see the most important themes of the day and also to cluster the 117 bloggers based on their thematic similarity.

So, here’s the what and then the how. As for the why? Why not?

What:

117 Day of DH Bloggers

10 Unsupervised Topics (10 is arbitrary–I could have picked 100). These topics are generated by an analysis of the words and word sequences in the individual blogger’s sites. The purpose is to harvest out the most prominent “themes” or topics. These themes are presented in series of word lists. it is up to the researcher to then “label” the word clusters. I have labeled a few of them (in [brackets] at the beginning of the word lists below–you might use another label–this is the subjective part). Here they are:

[human interaction in DH] work today people time working things email year week days bit good meeting tomorrow

day thing mail dh de image based fact called things change ago encoding house

[Academic Writing–including Grants] day time dh start post blog proposal google write great posts lunch nice articles

[Digital publishing and archives] http talk future collection making online version publishing field morning life traditional daily large

conference university blog morning read internet access couple computers archive involved including great written

[DH Teaching] students dh teaching humanities class technology scholars university lab group library support scholarship student

[DH Projects] digital project humanities work projects room meeting collections office building task database spent st

data project xml working projects web interesting user set spend system ways couple time

digital day humanities media writing post computing twitter english humanist real phd web rest

[reading and text-analysis] book text tools software books today reading literary texts coffee edition search tool textual

Unfortunately, the Day of DH corpus isn’t truly big enough to get the sort of crystal clear topics that I have harvested from much larger collections, but still, the topics above, seen in aggregate, do give us a sense of what’s “hot” to talk about in the field.

But let’s get to the sexy part. . .

In addition to harvesting out the prominent topics, the modeling tool outputs data indicating how much (what proportion) of each blog is about each topic. The resulting matrix is of dimension 117×10 (117 blogs and 10 topics). The data in the cells are percentages for each topic in each author’s blog. The values in each row add up to 100%. With a little massaging in R, I read in the matrix and then use some simple distance and clustering functions to group the bloggers into 10 (again an arbitrary number) groups; groups based on shared themes. Using this data, I then output a matrix showing which author’s have the most in common; thus, I do a little subtle match-making in advance of our digital rendezvous in London–birds of a feather blog together?

Here are the groups:

Group1

aimeemorrison
ariefwidodo
barbarabordalejo
caraleitch
carlosmartinez
carlwhithaus
clairewarwick
craigharkema
ellimylonas
geoffreyrockwell
glenworthey
guydaarmstrong
henrietterouedcunliffe
ianjohnson
janrybicki
jenterysayers
jonbath
juliaflanders
juliannenyhan
justinerichardson
kai-christianbruhn
kathleenfitzpatrick
keithlawson
lauramandell
lauraweakly
malterehbein
matthewjockers
meganmeredith-lobay
melissaterras
milenaradzikowska
miranhladnik
patricksahle
paulspence
peterrobinson
pouyllau
rafaelalvarado
raysiemens
reneaudet
rogerosborne
rudymcdaniel
stanruecker
stephanieschlitz
susangreenberg
victoriasmith
vikazafrin
williamturkel

Group2

alejandrogiacometti
annacaprarelli
danasolomon
ernestopriego
karensmith
leedurbin
matthewcarlos
paolosordi
sarasteger
stephanethibault
yinliu

Group3

alialbarran
amandagailey
cyrilbriquet
federicomeschini
ntlab
stefansinclair
torstenschassan

Group4

aligrotkowski
ashtonnichols
calenhenry
devonfitzgerald
enricasalvatori
ericforcier
garrywong
jameschartrand
joelyuvienco
johnnewman
peterorganisciak
shannonlucky
silviarussell
simonmahony
sophiahoosein
stevenhayes
taraandrews
violalasmana
willardmccarty

Group5

alunedwards
hopegreenberg
lewisulman

Group6

amandavisconti
jamessmith
martinholmes
sperberg-mcqueen
waynegraham

Group7

bethanynowviskie
josephgilbert
katherineharris
kellyjohnston
kirstenuszkalo
margaretgraham
matthewgold
paulyoungman

Group8

charlestravis
craigbellamy
franzfischer
jeremyboggs
johnwall
kathrynbarre
shawnday
teresadobson

Group9

jasonboyd
jolanda-pieta
joriszundert
michaelmaguire
thomascrombez
williamallen

Group10

louburnard
nevenjovanovic
sharongoetz
stephenramsay

Twitterers @sramsay and @mattwilkens were poking around here today and wondered what the topics would look like if there were only five topics and five clusters instead of 10 and 10. Here are the topics:

data work time text working tools people thing system xml mail software things texts
day time morning lot work bit find web class teaching student days dh real
digital humanities day tomorrow book twitter university blog computing reading books writing tei emails
day dh today time post things write start online writing working computer year hours
project digital work projects students meeting today people humanities dh scholars library year lab

And here are the Blogger-Mates clusters when I set n=5:

Group1

aimeemorrison
alejandrogiacometti
alialbarran
amandagailey
annacaprarelli
ashtonnichols
barbarabordalejo
carlosmartinez
carlwhithaus
clairewarwick
craigbellamy
craigharkema
danasolomon
devonfitzgerald
enricasalvatori
ernestopriego
garrywong
glenworthey
guydaarmstrong
henrietterouedcunliffe
ianjohnson
jameschartrand
janrybicki
jenterysayers
joelyuvienco
johnnewman
jonbath
juliannenyhan
justinerichardson
karensmith
kathleenfitzpatrick
keithlawson
leedurbin
lewisulman
malterehbein
matthewgold
matthewjockers
meganmeredith-lobay
melissaterras
michaelmaguire
miranhladnik
nevenjovanovic
patricksahle
peterrobinson
raysiemens
reneaudet
rogerosborne
shannonlucky
silviarussell
simonmahony
sophiahoosein
stefansinclair
stephanieschlitz
susangreenberg
taraandrews
thomascrombez
torstenschassan
vikazafrin
violalasmana
willardmccarty
williamallen
williamturkel
yinliu

Group2

aligrotkowski
ariefwidodo
calenhenry
caraleitch
charlestravis
ericforcier
geoffreyrockwell
jolanda-pieta
juliaflanders
lauraweakly
margaretgraham
matthewcarlos
milenaradzikowska
nt2lab
paolosordi
peterorganisciak
rudymcdaniel
sarasteger
sharongoetz
stanruecker
stevenhayes
victoriasmith

Group3

alunedwards
hopegreenberg
katherineharris
stephanethibault
teresadobson

Group4

amandavisconti
cyrilbriquet
federicomeschini
jamessmith
joriszundert
martinholmes
rafaelalvarado
sperberg-mcqueen
stephenramsay
waynegraham

Group5

bethanynowviskie
ellimylonas
franzfischer
jasonboyd
jeremyboggs
johnwall
josephgilbert
kai-christianbruhn
kathrynbarre
kellyjohnston
kirstenuszkalo
lauramandell
louburnard
paulspence
paulyoungman
pouyllau
shawnday

Mar 12 2010

Academic, Commentary, Text-Mining

Analyze This (Page)

“TAToo” is a fun Flash widget developed by Peter Organisciak at the University of Alberta. Peter works under the supervision of Digital Humanists Par Excellence and TAPoR Gurus Geoffrey Rockwell and Stan Ruecker. The widget (just some embed-able code) does “layman’s” text analysis on the web pages in which its code is embedded. I’ve added the code to the right sidebar of my blog to take it for a test drive.

The tool offers several “views” of your text. The default view is a word cloud in which words with greater frequency are both larger and more bold. Looking at the word cloud can give you a pretty quick sense of the page’s key terms and concepts.

By clicking on the “Tool:” bar at the top of the widget, you can select other options. The “List Words” view provides a term frequency list. You can then click on any word in the list to see its collocates. Alternatively, there are both a Collocate view and a Concordance view that allow users to enter a specific word and get information about the company that word keeps.

Kudos to Peter and the rest of the TAPoR Tools team for continuing to pursue the fine art of tool making.

Feb 10 2010

Academic, Commentary, Text-Mining

65,000 Texts to Mine?

A story in the Feb. 7th issue of the Telegraph reports that the British Library is going to make 65,000 first edition texts available for public download via Amazon’s Kindle. This news is almost as exciting as Google’s decision some years ago to partner with a consortium of big libraries in order to digitize all their books. What makes this project from the British Library particularly exciting is that the texts being offered are all works of 19th century fiction.

Unlike the Google project that is digitizing everything, this offering from the BL is already presorted to include just the kind of content that literary researchers can really use. With Google, I assume, one is going to have to figure out how to sort the legal books from the cook books, the memoirs from the fiction. Here, however, the BL has already done a big part of the work.

It will be interesting to see how this material gets offered and what sort of metadata is included with the individual files. For those of us who are interested in corpus-mining and macroanalysis (as opposed to just reading a single book at a time) the metadata is crucial. If, for example, we have the publication date of each text in an easily extractable format (e.g. TEI XML) we could explore all kinds of chronological investigations.

In prior research, working with a corpus of just 250 19th century British novels, I explored the “theme” of childhood by quantifying the relative frequency of a “cluster” or “semantic field” of words suggestive of “childhood”. In that work, I discovered a proportionally higher incidence of the theme during the Victorian period, a finding that tends to confirm the idea that childhood was an “invention” of the Victorians. But, then again, a corpus of 250 novels doesn’t even scratch the surface.

I’m not sure just what’s included in the British Library’s 65,000 texts. I assume these are not just British texts, but American, German, etc. Franco Moretti has estimated that there were 8,000 to 10,000 novels published in the Great Britain in the 19th century (20-40,000 works of prose fiction). Surely a good many of these are part of the BL’s 65,000. Which brings us back to the metadata question. Will it be possible to generate a list of which texts in the 65,000 are British-authored and British-published *novels*? If the answer is yes, then the game is on.

Get the texts, convert from mobi to pdf, html, or other text format using any number of open source apps and then poof! You’ve got a COUS–Corpus of Unusual Size! Of course, it’d be a lot easier if the BL would make the texts available (for researchers at least) through a channel that doesn’t involve Amazon or one of the eBook formats. I’m investigating that path now and will report on any progress.

Feb 13 2009

Academic, Commentary, Text-Mining

Machine-Classifying Novels and Plays by Genre

In the post that follows here, I describe some recent experiments that I (and others) have conducted. The goal of these experiments was to accurately machine-classify novels and plays (Shakespeare’s) by genre. One of the most interesting results ends up having more to do with feature extraction than classification algorithm

Background

Several weeks ago, Mike Witmore visited the Beyond Search workshop that I organize here at Stanford. In prior work, Witmore and some colleagues utilized a program called Docuscope (Developed at Carnegie Mellon) to distinguish between and classify (statistically) Shakespeare’s histories and comedies.

“Equipped with a specialized dictionary, Docuscope is able to divide texts into strings of words that are then sorted into one of eighteen word categories, such as “Inner Thinking” and “Past Events.” The program turns differentiating amongst genres into a statistical task by testing the frequency of occurence of words in each of the categories for each individual genre and recognizing where significant differences occur.”

Docuscope was designed as a tool for analyzing student writing, but Witmore (et. al.) discovered that it could also be employed as a specialized sort of feature extraction tool.

To test the efficacy of Docuscope as a tool for detecting and clustering novels by genre, Franco Moretti and I created a full text corpus that included 36 19th century novels (striped of title page and other identifying information). We divided this corpus into three groups and organized them by genre:

Group one consisted of 12 texts belonging to 3 different (but fairly similar) genres (gothic, historical tale, and national tale)
Group two consisted of 12 texts belonging to 3 different genres that were quite different (industrial, silver-fork, bildungsroman).
Group three consisted of 12 texts belonging to 6 different genres that mix 3 genres from those already included in group one or two and 3 new genres (evangelical, newgate, and anti-jacobin).

Witmore was given this corpus in electronic form (each novel in plain text). For identification purposes (since Mike was not privy to the actual genres or titles of the novels), he labeled each of the 12 genre groups with a number 1-12. Witmore’s numberings correspond to genres as follows:

Gothic
Historical Novels
National Tales
Industrial Novels
Silver-Fork Novels
Bildungsroman
Anti-Jacobin
Industrial
Gothic
Evangelical
Newgate
Bildungsroman

Using Docuscope, Witmore ran a series of tests in attempt to cluster the similar genres together. The experiment was designed to pick the three groups from 7-12 that have genre cognates in 1-6. Witmore’s results for the closest affiliated genres were impressive:

2:9 (Historical with Gothic)
1:9 (Gothic with Gothic) Witmore notes that this 2nd cluster was a close (statistically) second to the above
4:8 (Industrial with Industrial)
6:12 (Bildungsroman with Bildungsroman)

Witmore’s results also suggested an especially close relationship between the Gothic and Historical, Witmore writes that “groups 1 and 2 looked like they paired with the same candidate group (9).”

Additional Experiments

All of this work Witmore had done and the results he derived got me thinking more completely about the problem of genre classification. In many ways, genre classification is akin to authorship attribution. Generally speaking though, with authorship problems one attempts to extract a feature set that excludes context sensitive features from the analysis. (The consensus in most authorship attribution research suggests that a feature set made up primarily of frequent, or closed-class, word features yields the most accurate results) For genre classification, however, one would intuitively assume that context words would be critical (e.g. Gothic novels often have “castles” so we would not want to exclude context sensitive words like “castle.”) But my preliminary experiments have suggested just the opposite, namely that a distinct and detectable genre “signal” may be derived from a limited set of high-frequency features

Using just 42 word and punctuation features, I was able to classify the novels in the corpus described above equally as well as Witmore did using Docuscope (and a far more complex feature set). To derive my feature set, I lowercase the texts, count and convert to relative frequency the various features types, and then winnow the feature set by choosing only those features that have a mean relative frequency of 3% or greater. This results in the following 42 features (The prefix “p_” indicates a punctuation token instead of a word token.):

“a”, “all”, “an”, “and”, “as”, “at”, “be”, “but”, “by”, “for”, “from”, “had”, “have”, “he”, “her”, “his”, “i”, “in”, “is”, “it”, “me”, “my”, “not”, “of”, “on”, “p_apos”, “p_comma”, “p_exlam”, “p_hyphen”, “p_period”, “p_quote”, “p_semi”, “she”, “that”, “the”, “this”, “to”, “was”, “were”, “which”, “with”, “you”

Using the “dist” and “hclust” functions in the open-source “R” statistics application, I cluster the texts and output the following dendrogram:

These results were compelling, and after I shared them with Mike Witmore, he suggested testing this methodology on his Shakespeare corpus. Again the results were compelling and this process accurately clustered the majority of Shakespeare’s plays into appropriate clusters of “tragedy,” “comedy,” and “history”. The dendrogram below shows the results of my Shakespeare experiment using these 37 features

“a”, “and”, “as”, “be”, “but”, “for”, “have”, “he”, “him”, “his”, “i”, “in”, “is”, “it”, “me”, “my”, “not”, “of”, “p_apos”, “p_colon”, “p_comma”, “p_exlam”, “p_hyphen”, “p_period”, “p_ques”, “p_semi”, “so”, “that”, “the”, “this”, “thou”, “to”, “what”, “will”, “with”, “you”, “your”.

These initial tests raise a number of important questions, not the least of which is the question of how much of a factor genre plays in determining the usage of high frequency word and punctuation tokens. We have plans to conduct a series of more rigorous experiments, and the results of these tests will be forthcoming. In the meantime, my initial tests appear to confirm, again, the significant role that common function words play in defining literary style .

Matthew L. Jockers

"Everything . . . in nature's vast workshop from the extinction of some remote sun to the blossoming of one of the countless flowers which beautify our public parks is subject to a law of numeration as yet unascertained.” (Joyce, Ulysses, 1922)

Category Archives: Text-Mining

Revisiting Chapter Nine of Macroanalysis

Resurrecting a Low Pass Filter (well, kind of)

That Sentimental Feeling

Cumulative Sentiments

My Sentiments (Exactly?)

A Ringing Endorsement of Smoothing

Is that Your Syuzhet Ringing?

Some thoughts on Annie’s thoughts . . . about Syuzhet

The Rest of the Story

Reading Macroanalysis: The Hard Way!

A Novel Method for Detecting Plot

Text Analysis with R . . . coming soon.

Simple Point of View Detection

Experimenting with “gender” package in R

Text Analysis with R for Students of Literature

“Secret” Recipe for Topic Modeling Themes

Pronouns in 19th Century Fiction

Unfolding the Novel

Computing and Visualizing the 19th-Century Literary Genome

Macroanalysis

The LDA Buffet is Now Open; or, Latent Dirichlet Allocation for English Majors

Aberrant Adjectives in 19th Century Novels

Unigrams, and bigrams, and trigrams, oh my

SEASR Grant

Panning for Memes

Who’s Your DH Blog Mate: Match-Making the Day of DH Bloggers with Topic Modeling

Analyze This (Page)

65,000 Texts to Mine?

Machine-Classifying Novels and Plays by Genre

Background

Additional Experiments