
Matthew L. Jockers

Monthly Archives: April 2015

Cumulative Sentiments

Tuesday, 28 April 2015

Posted by Matthew Jockers in Text-Mining


This morning Andrew N. Jackson posted an interesting alternative to the smoothing of sentiment trajectories.  Instead of smoothing the trajectories with a moving average, lowess, or, dare I say it, a low-pass filter, Andrew suggests cumulative summing as a “simple but potentially powerful way of re-plotting” the sentiment data.  I spent a little time exploring and thinking about his approach, and I’m posting below a series of “plot plots” from five novels.[1]
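For anyone who wants to try this at home, the cumulative-sum idea is only a few lines of R on top of the syuzhet package. Here is a minimal sketch, assuming a plain-text copy of the novel at a placeholder path:

library(syuzhet)
# "portrait.txt" is a placeholder path to a plain-text copy of the novel.
portrait <- get_text_as_string("portrait.txt")
sentences <- get_sentences(portrait)
afinn_values <- get_sentiment(sentences, method = "afinn")
# Andrew's move: plot the running total instead of a smoothed version.
plot(cumsum(afinn_values), type = "l",
     main = "Portrait of the Artist: Cumulative Sentiment",
     xlab = "Narrative Time (sentences)",
     ylab = "Cumulative Sentiment")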

I’m not at all sure how we could/should/would go about interpreting these cumulative sum graphs, but the lack of information loss is certainly appealing.  Looking at these graphs requires something of a mind shift away from the way that I/we have been thinking about emotional trajectories in narrative.  Making that shift requires reframing plot movement as an aggregation of emotional valence over time, a reframing that seems to be modeling something like the “cumulative effect on the reader,” as Andrew writes, or perhaps the cumulative effect on the characters?  Whatever the case, it’s a fascinating idea that, while not fully in line with Vonnegut’s conception of plot shape, does have some resonance with Vonnegut’s notion of relativity.  The cumulative shapes seen below in Portrait and Gone Girl are especially intriguing . . . to me.

[Figures: cumulative-sum sentiment plots for Portrait of the Artist, The Picture of Dorian Gray, Madame Bovary, Inferno, and Gone Girl]

[1] All of these plots use sentiment values extracted with the AFINN method, which is what Andrew implemented in Python.  Andrew’s IPython notebook, by the way, is worth a close read; it provides a lot of detail that is not in his blog post, including some deeper thinking around the entire business of modeling narrative in this way.

Requiem for a low pass filter

Monday, 6 April 2015

Posted by Matthew Jockers in Commentary


Ben Schmidt’s and Scott Enderle’s recent entries into the syuzhet discussion have beaten the last of the low pass filter out of me. I’m not entirely ready to concede that Fourier is useless for the larger problem, but they have convinced me that a better solution than the low pass is possible and probably warranted. What that better solution is remains an open question, but Ben has given us some things to consider.

In a nutshell, there were two essential elements of Vonnegut’s challenge that the low-pass method seemed to be solving.  According to Vonnegut, this business of story shape “is an exercise in relativity” in which “it is the shape of the curves that matter and not their point of origin.”  Vonnegut imagined a system of plot in which the highs and lows of good fortune and ill fortune are internally relative.  In this way, a very negative book such as Blood Meridian will have an absolute high and an absolute low that can be compared to another book that, though more positive on the whole, will also have an absolute high and an absolute low. The object of analysis is not the degree of positive or negative valence but the location of the spikes and troughs of that valence relative to the beginning and end of the book.  When conceived of in these terms, the ringing artifacts of the low-pass filter seem rather trivial because the objective was not to perfectly represent the valence but to dramatize the shifts in valence.

As Ben has pointed out, however, the edges of the Fourier method present a different sort of problem; they assume that story plots are periodic, repeating signals.  The problem, as Ben puts it, is that the method “imposes an assumption that the start of [a] plot lines up with the end of a plot.”

Over the weekend, Ben and I exchanged a few emails, and I acknowledged that I had been overlooking these edge distortions in favor of a big picture perspective of the general shape.  Some amount of distortion, after all, must be tolerated if we want to produce a smooth shape.  As Israel Arroyo pointed out in a tweet, “endpoints are problematic in most smoothers and filters.”  With a simple rolling window, for example, the averaging can’t start until we are already half the distance of the window into the sequence.  Figure 1, which shows four options for smoothing Portrait of the Artist, highlights the moving average problem in blue.[1]

[Figure 1: four smoothing options applied to Portrait of the Artist; the simple moving average is the blue line]
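The rolling-window edge problem is easy to see in code. A minimal sketch, assuming afinn_values is a sentence-level sentiment vector of the sort syuzhet returns:

# A centered moving average leaves undefined (NA) values at both edges.
# afinn_values is assumed: one sentiment value per sentence.
window <- 101  # an illustrative (odd) window size, not a recommendation
rolling <- stats::filter(afinn_values, rep(1 / window, window), sides = 2)
sum(is.na(rolling))  # 100: the first and last 50 sentences get no value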

Looking only at figure 1, it would be hard to argue against Fourier as a beautiful representation of the plot shape.  Figure 2 shows the same four methods applied to Dorian Gray.  Here again, the Fourier method seems to provide a fair representation.  In this case, however, we begin to see a problem forming at the end of the book.  The red lowess line is trending down while the green Fourier line is reaching up in order to complete its cycle.  The beginning still looks good, and perhaps the distortion at the end can be tolerated, but it’s certainly not ideal.

[Figure 2: the same four smoothing options applied to The Picture of Dorian Gray]

Unfortunately, some sentiment trajectories appear to create a far more pronounced problem.  At Ben’s suggestion, I ran the same experiments with Madame Bovary.  The resulting plot is shown in figure 3.  I’ve not read Bovary in many years, so I can’t recall too many details about the plot, but I do remember that it does not end well for anyone.  The shape of the green Fourier line at the end of figure 3, however, suggests some sort of uptick in positive sentiment that I suspect is not present in the text. The start of the shape, on the left, also looks problematic compared to the other smoothers.

[Figure 3: the same four smoothing options applied to Madame Bovary]

With the first two figures, I think a case can be made that the Fourier line offers a fair representation of the emotional trajectory.  Making such a case for Bovary is not inconceivable if we ignore the edges, but it is clearly a stretch, and there is no denying that the lowess smoother does a better job.

In our email exchange about these different options, Ben included a graphic showing how various methods model four different books.  At least in these examples, loess (fifth row of figure 4) appears to be the top contender if we seek a representation that is both maximally smooth and maximally approximate.

[Figure 4: Ben Schmidt’s comparison of several smoothing methods on four books; loess is the fifth row]
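For those who want to experiment, a loess alternative is itself only a couple of lines of R. A sketch, with the caveat that the span value here is an illustrative guess rather than a tuned choice:

# Loess smoothing of sentence-level values as an alternative smoother.
# afinn_values is assumed, as in the sketches above.
x <- seq_along(afinn_values)
loess_fit <- loess(afinn_values ~ x, span = 0.25)  # span chosen for illustration
plot(x, predict(loess_fit), type = "l",
     xlab = "Narrative Time (sentences)",
     ylab = "Smoothed Sentiment")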

In order to fully solve Vonnegut’s challenge, an alternative to percentage chunking is still necessary.  Longer segments in longer books will tend toward a neutral valence, since averaging over more and more sentences pulls the positive and negative values toward zero.  Figuring that out is work for the future.  For now, the Bovary example provides precisely the sort of validation/invalidation I was hoping to elicit by putting the package online.

RIP low-pass filter.[2]

FOOTNOTES:

[1] There are more elegant ways to fill in the flat edges, but I’m keeping things simple here for illustration.

[2] I’m grateful to everyone who has engaged in this discussion, especially Annie Swafford, Daniel Lepage, Ted Underwood, Andrew Piper, David Bamman, Scott Enderle, and Ben Schmidt.  It has been a very engaging couple of weeks, and along the way I could not help but think of what this discussion might have looked like in print: it would have taken years to unfold!  Despite some emotional highs and lows of its own, this has been a productive exercise and a great example of how valuable open code and the digital commons can be for progress.

My Sentiments (Exactly?)

Wednesday, 1 April 2015

Posted by Matthew Jockers in Commentary, R-Code, Text-Mining


While developing the Syuzhet package, a tool for tracking relative shifts in narrative sentiment, I spent a fair amount of time gut-checking whether the sentiment values returned by the machine methods were a good match for my own sense of the narrative sentiment.  Between 70% and 80% of the time, they were what I considered to be good sentence-level matches . . . but sentences were not my primary unit of interest.

Rather, I wanted a way to assess whether the story shapes that the tool produced by tracking changes in sentiment were a good approximation of central shifts in the “emotional trajectory” of a narrative.  This emotional trajectory was something that Kurt Vonnegut had described in a lecture about the simple shapes of stories.  On a chalkboard, Vonnegut graphed stories of good fortune and ill fortune in a demonstration that he called “an exercise in relativity.”  He was not interested in the precise highs and lows of a given book, but in the highs and lows of the book relative to each other.

Blood Meridian and The Devil Wears Prada are two very different books. The former is way, way more negative.  What Vonnegut was interested in understanding was not whether McCarthy’s book was more wholly negative than Weisberger’s; he was interested in the internal dynamics of shifting sentiment: where in a book we would find the lowest low relative to the highest high. Implied in Vonnegut’s lecture was the idea that this tracking of relative highs and lows could serve as a proxy for something like “plot structure” or “syuzhet.”

This was an interesting idea, and sentiment analysis offered a possible way forward.  Unfortunately, the best work in sentiment analysis has been done in very different domains.  Could sentiment analysis tools and dictionaries that were designed to assess sentiment in movie reviews also detect subtle shifts in the language of prose fiction? Could these methods handle irony, metaphor, and so forth?  Some people, especially if they looked only at the results of a few sentences, might reject the whole idea out of hand. Movie reviews and fiction, hogwash!  Instead of rejecting the idea, I sat down and human-coded the sentiment of every sentence in Joyce’s Portrait of the Artist. I then developed Syuzhet so that I could apply four different sentiment detection techniques and compare their results to my own human codings.

This human coding business is nuanced.  Some sentences are tricky.  But it’s not the sarcasm or the irony or the metaphor that is tricky. The really hard sentences are the ones that are equal parts positive and negative sentiment. Consider this contrived example:

“I hated the way he looked at me that morning, and I was glad that he had become my friend.”

Is that a positive or negative sentence?  Given the coordinating “and,” perhaps the second half is more important than the first?  I coded sentences such as this as neutral, and thankfully these were the outliers and not the norm. Most of the time, even in a complex novel like Portrait, where the style and complexity of the sentences are both evolving with the maturation of the protagonist, it was fairly easy to make a determination of positive, negative, or neutral.

It turns out that when you do this sort of close reading you learn a lot about the way that authors write/express/manipulate “sentiment.”  One thing I learned was that tricky sentences, such as the one above, are usually surrounded by other sentences that are less tricky.  In fact, in many random passages that I examined from other books, and in the entirety of Portrait, tricky sentences were usually followed or preceded by simpler sentences that would clarify the sentiment of the larger passage.  This is an important observation because, at the level of an individual sentence, we know that the various machine methods are not super effective.[1]  That said, I was pretty surprised by the amount of sentence-level agreement in my ad hoc test.  On a sentence-by-sentence basis, here is how the four methods in the package performed:[2]

Bing 84% agreement
AFINN 80% agreement
Stanford 60% agreement
NRC 50% agreement
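These percentages come from a simple polarity comparison; something like the following sketch captures the idea (human_values is my hand-coded vector of 1, 0, and -1 values, one per sentence; see note 2 for what I mean by “agreement”):

# Percent of sentences where the machine's polarity matches the human coding.
# human_values is assumed: one value per sentence from {1, 0, -1};
# sentences is assumed: the novel split up with get_sentences(), as above.
bing_values <- get_sentiment(sentences, method = "bing")
mean(sign(bing_values) == human_values) * 100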

These results surprised me.  I was shocked that the more awesome Stanford method did not outperform the others. I was so shocked, in fact, that I figured I must have done something wrong.  The Stanford sentiment tagger, for example, thinks that the following sentence from Joyce’s Portrait is negative.

“Once upon a time and a very good time it was there was a moocow coming down along the road and this moocow that was coming down along the road met a nicens little boy named baby tuckoo.”

It was a “very good time.” How could that be negative?  I think “a very good time” is positive and so do the other methods. The Stanford tagger also indicated that the sentence “He sang that song” is slightly negative.  All of the other methods scored it as neutral, and so did I.

I’m a huge fan of the Stanford tagger; I’ve been impressed by the way that it handles negation, but perhaps, when all is said and done, it is simply not well suited to literary prose, where the syntactical constructions can be far more complicated than in typical utilitarian prose. I need more time to study how the Stanford tagger behaved on this problem, so I’m just going to exclude it from the rest of this report.  My hypothesis, however, is that it is far more sensitive to register/genre than the dictionary-based methods.

So, as I was saying, sentiment in actual prose fiction is usually expressed over a series of sentences. That simile, that bit of irony, that negated sentence is typically followed and/or preceded by a series of more direct sentences expressing the sentiment of the passage.  For example:

“She was not ugly.  She was exceedingly beautiful.”
“I watched him with disgust. He ate like a pig.”

Prose, at least the prose that I studied in this experiment, is rarely composed of sustained irony, sustained negation, sustained metaphor, etc.  Usually authors provide us with lots of clues about the sentiment we are meant to experience, and over the course of several sentences, a paragraph, or a page, the sentiment tends to become less ambiguous.

So instead of just testing the machine methods against my human sentiments on a sentence-by-sentence basis, I split Joyce’s Portrait into 20 equally sized chunks and calculated the mean sentiment of each.  I then compared those means to the means of my own human-coded sentiments.  These were the results:

Bing 80% agreement
AFINN 85% agreement
NRC 90% agreement
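The chunking step can be reproduced with syuzhet’s get_percentage_values function (its bins argument controls the number of chunks); a minimal sketch, again treating matching polarity of the chunk means as agreement:

# Mean sentiment in 20 equal-sized chunks, machine vs. human codings.
machine_means <- get_percentage_values(afinn_values, bins = 20)
human_means <- get_percentage_values(human_values, bins = 20)
mean(sign(machine_means) == sign(human_means)) * 100  # percent agreement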

Not bad.  But of course any time we apply a chunking method like this we risk breaking the text right in the middle of a key passage.  And, as we increase the number of chunks and effectively decrease the size of each passage, the agreement values tend to decrease. I ran the same test using 100 segments and saw this:

Bing 73% agreement
AFINN 77% agreement
NRC 58% agreement (ouch)

Figure 1 graphs how the AFINN method (with 77% agreement over 100 segments) tracked the sentiment compared to my human sentiments.

[Figure 1: AFINN means over 100 segments plotted against the human-coded means]

Next I transformed all of the sentiment vectors (machine and human) using the get_transformed_values function.  I then calculated the amount of agreement. With the low pass filter set to the default of 3, I observed the following agreement:

Bing 73% agreement
AFINN 74% agreement
NRC 86% agreement

With the low pass filter set to 5, I observed the following agreement:

Bing 87% agreement
AFINN 93% agreement
NRC 90% agreement
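Both of the runs above can be reproduced with something like the following sketch; the sign comparison is, again, my simplified notion of agreement:

# Fourier-transformed, low-pass-filtered values for machine and human vectors.
# low_pass_size = 5 corresponds to the second run; the package default is 3.
machine_trans <- get_transformed_values(afinn_values, low_pass_size = 5)
human_trans <- get_transformed_values(human_values, low_pass_size = 5)
mean(sign(machine_trans) == sign(human_trans)) * 100  # percent agreement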

Figure 2 graphs how the transformed AFinn method tracked narrative changes in sentiment compared to my human sentiments.[3]

[Figure 2: transformed AFINN values plotted against the transformed human codings]

As I have said elsewhere, my primary reason for open-sourcing this code was so that others could plot some narratives of their own and see if the shapes track well with their human sense of the emotional trajectories.  If you do that, and you have successes or failures, I’d be very interested in hearing from you (please send me an email).

Given all of the above, I suppose my current working benchmark for human-to-machine accuracy is something like ~80%.  Frankly, though, I’m more interested in the big picture and whether or not the overall shapes produced by this method map well onto our human sense of a book’s emotional trajectory.  They certainly do seem to map well with my sense of Portrait of the Artist, and with many other books in my library, but what about your favorite novel?

FOOTNOTES:
[1] For what it is worth, the same can probably be said about us, the human beings.  Given a single sentence with no context, we could probably argue about its positiveness or negativeness.
[2] Each method uses a slightly different value range, so when I write of “agreement,” I mean only that the machine method agreed with the human (me) that a given sentence was positively or negatively charged.  My rating scale consisted of three values: 1, 0, -1 (positive, neutral, negative). I did not test the degree of positiveness or negativeness.
[3] I explored low-pass values in increments of 5 all the way to 100.  The percentages of agreement were consistently between 70 and 90.
