While developing the Syuzhet package, a tool for tracking relative shifts in narrative sentiment, I spent a fair amount of time gut-checking whether the sentiment values returned by the machine methods were a good match for my own sense of the narrative sentiment. Between 70% and 80% of the time, they were what I considered to be good sentence-level matches, but sentences were not my primary unit of interest.

Rather, I wanted a way to assess whether the story shapes that the tool produced by tracking changes in sentiment were a good approximation of central shifts in the “emotional trajectory” of a narrative. This emotional trajectory was something that Kurt Vonnegut had described in a lecture about the simple shapes of stories. On a chalkboard, Vonnegut graphed stories of good fortune and ill fortune in a demonstration that he called “an exercise in relativity.” He was not interested in the precise highs and lows in a given book, but in the highs and lows of the book relative to each other.

Blood Meridian and The Devil Wears Prada are two very different books. The former is way, way more negative. What Vonnegut was interested in understanding was not whether McCarthy’s book was more wholly negative than Weisberger’s; he was interested in understanding the internal dynamics of shifting sentiment: where in a book we would find the lowest low relative to the highest high. Implied in Vonnegut’s lecture was the idea that this tracking of relative highs and lows could serve as a proxy for something like “plot structure” or “syuzhet.”

This was an interesting idea, and sentiment analysis offered a possible way forward. Unfortunately, the best work in sentiment analysis has been in very different domains. Could sentiment analysis tools and dictionaries that were designed to assess sentiment in movie reviews also detect subtle shifts in the language of prose fiction? Could these methods handle irony, metaphor, and so forth? Some people, especially if they looked only at the results of a few sentences, might reject the whole idea out of hand. Movie reviews and fiction, hogwash! Instead of rejecting the idea, I sat down and human-coded the sentiment of every sentence in Joyce’s Portrait of the Artist. I then developed Syuzhet so that I could apply four different sentiment detection techniques and compare their results to my own human codings.

This human coding business is nuanced.  Some sentences are tricky.  But it’s not the sarcasm or the irony or the metaphor that is tricky. The really hard sentences are the ones that are equal parts positive and negative sentiment. Consider this contrived example:

“I hated the way he looked at me that morning, and I was glad that he had become my friend.”

Is that a positive or negative sentence? Given the coordinating “and,” perhaps the second half is more important than the first? I coded sentences such as this as neutral, and thankfully these were the outliers and not the norm. Most of the time, even in a complex novel like Portrait where the style and complexity of the sentences are both evolving with the maturation of the protagonist, it was fairly easy to make a determination of positive, negative, or neutral.

It turns out that when you do this sort of close reading you learn a lot about the way that authors write/express/manipulate “sentiment.” One thing I learned was that tricky sentences, such as the one above, are usually surrounded by other sentences that are less tricky. In fact, in many random passages that I examined from other books, and in the entirety of Portrait, tricky sentences were usually followed or preceded by simpler sentences that would clarify the sentiment of the larger passage. This is an important observation because at the level of an individual sentence, we know that the various machine methods are not super effective.[1] That said, I was pretty surprised by the amount of sentence-level agreement in my ad hoc test. On a sentence-by-sentence basis, here is how the four methods in the package performed:[2]

Bing 84% agreement
Afinn 80% agreement
Stanford 60% agreement
NRC 50% agreement
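
For readers who want to run a similar check, here is a minimal sketch of the agreement measure as footnote 2 describes it: collapse each score to its sign and count the matches. It is written in Python purely for illustration and is not code from the package. The names `sign` and `percent_agreement` are hypothetical, the sample scores are made up, and treating a neutral human code as matching only a machine score of exactly zero is an assumption.

```python
def sign(value):
    """Collapse a raw sentiment score to 1, 0, or -1."""
    return (value > 0) - (value < 0)


def percent_agreement(machine_scores, human_codes):
    """Percentage of items where machine and human share the same sign."""
    if len(machine_scores) != len(human_codes):
        raise ValueError("score lists must be the same length")
    if not machine_scores:
        raise ValueError("score lists must not be empty")
    matches = sum(
        sign(m) == sign(h) for m, h in zip(machine_scores, human_codes)
    )
    return 100.0 * matches / len(machine_scores)


# Made-up scores for five sentences; not data from the experiment.
machine = [2.5, -1.0, 0.0, 3.0, -0.5]
human = [1, -1, 0, -1, 1]
print(percent_agreement(machine, human))  # prints 60.0
```

Because only the sign is compared, the different value ranges used by the different methods do not matter here.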

These results surprised me. I was shocked that the more awesome Stanford method did not outperform the others. I was so shocked, in fact, that I figured I must have done something wrong. The Stanford sentiment tagger, for example, thinks that the following sentence from Joyce’s Portrait is negative:

“Once upon a time and a very good time it was there was a moocow coming down along the road and this moocow that was coming down along the road met a nicens little boy named baby tuckoo.”

It was a “very good time.” How could that be negative?  I think “a very good time” is positive and so do the other methods. The Stanford tagger also indicated that the sentence “He sang that song” is slightly negative.  All of the other methods scored it as neutral, and so did I.

I’m a huge fan of the Stanford tagger; I’ve been impressed by the way that it handles negation, but perhaps when all is said and done it is simply not well-suited to literary prose, where the syntactical constructions can be far more complicated than in typical utilitarian prose. I need more time to study how the Stanford tagger behaved on this problem, so I’m just going to exclude it from the rest of this report. My hypothesis, however, is that it is far more sensitive to register/genre than the dictionary-based methods.

So, as I was saying, what happens with sentiment in actual prose fiction is usually achieved over a series of sentences. That simile, that bit of irony, that negated sentence is typically followed and/or preceded by a series of more direct sentences expressing the sentiment of the passage.  For example,

“She was not ugly.  She was exceedingly beautiful.”
“I watched him with disgust. He ate like a pig.”

Prose, at least the prose that I studied in this experiment, is rarely composed of sustained irony, sustained negation, sustained metaphor, etc.  Usually authors provide us with lots of clues about the sentiment we are meant to experience, and over the course of several sentences, a paragraph, or a page, the sentiment tends to become less ambiguous.

So instead of just testing the machine methods against my human sentiments on a sentence-by-sentence basis, I split Joyce’s Portrait into 20 equally sized chunks and calculated the mean sentiment of each. I then compared those means to the means of my own human-coded sentiments. These were the results:

Bing 80% agreement
Afinn 85% agreement
NRC 90% agreement
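
The chunk-level comparison can be sketched the same way. This is an illustration under stated assumptions, not the code behind the numbers above: the post does not spell out how chunk-level agreement was scored, so the sketch assumes the same sign rule used for sentences. The helper names `chunk_means` and `chunk_agreement` are hypothetical and the scores are made up.

```python
def sign(value):
    """Collapse a raw sentiment score to 1, 0, or -1."""
    return (value > 0) - (value < 0)


def chunk_means(values, n_chunks):
    """Split values into n_chunks nearly equal runs and average each run."""
    if n_chunks < 1 or n_chunks > len(values):
        raise ValueError("n_chunks must be between 1 and len(values)")
    means = []
    for i in range(n_chunks):
        # Integer arithmetic spreads any remainder across the chunks.
        start = i * len(values) // n_chunks
        end = (i + 1) * len(values) // n_chunks
        chunk = values[start:end]
        means.append(sum(chunk) / len(chunk))
    return means


def chunk_agreement(machine_scores, human_codes, n_chunks):
    """Percentage of chunks whose mean sentiment has the same sign."""
    machine_means = chunk_means(machine_scores, n_chunks)
    human_means = chunk_means(human_codes, n_chunks)
    matches = sum(
        sign(m) == sign(h) for m, h in zip(machine_means, human_means)
    )
    return 100.0 * matches / n_chunks


# Made-up scores for eight sentences; not data from the experiment.
machine = [1, 2, -1, -3, 0, 4, -2, -2]
human = [1, 1, -1, -1, 1, 1, -1, 0]
print(chunk_means(machine, 4))             # prints [1.5, -2.0, 2.0, -2.0]
print(chunk_agreement(machine, human, 4))  # prints 100.0
```

In this toy data the machine and the human disagree on two of the eight sentences, yet every pair of chunk means shares a sign: the neighboring sentences pull each tricky one back toward the sentiment of its passage.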

Not bad. But of course any time we apply a chunking method like this we risk breaking the text right in the middle of a key passage. And, as we increase the number of chunks and effectively decrease the size of each passage, the agreement values tend to decrease. I ran the same test using 100 segments and saw this:

Bing 73% agreement
Afinn 77% agreement
NRC 58% agreement (ouch)

Figure 1 graphs how the Afinn method (with 77% agreement over 100 segments) tracked the sentiment compared to my human sentiments.


Figure 1

Next I transformed all of the sentiment vectors (machine and human) using the get_transformed_values function. I then calculated the amount of agreement. With the low-pass filter set to the default of 3, I observed the following agreement:

Bing 73% agreement
Afinn 74% agreement
NRC 86% agreement

With the low-pass filter set to 5, I observed the following agreement:

Bing 87% agreement
Afinn 93% agreement
NRC 90% agreement
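
To give a feel for what a low-pass filter does to a sentiment vector, here is a minimal pure-Python sketch of the general technique: take a discrete Fourier transform, zero the high-frequency components, and transform back. It is not a port of get_transformed_values, which may differ in its details. The function name `low_pass_filter` and its `keep` parameter are hypothetical, and the sample values are made up.

```python
import cmath


def low_pass_filter(values, keep):
    """Keep only the `keep` lowest frequencies (0 .. keep - 1) of a sequence."""
    n = len(values)
    if n == 0 or keep < 1:
        raise ValueError("need a non-empty sequence and keep >= 1")
    # Forward discrete Fourier transform (naive O(n^2) version).
    spectrum = [
        sum(values[t] * cmath.exp(-2j * cmath.pi * k * t / n)
            for t in range(n))
        for k in range(n)
    ]
    # For real input, index k and index n - k carry the same frequency,
    # so both are tested together; anything at `keep` or above is zeroed.
    for k in range(n):
        if min(k, n - k) >= keep:
            spectrum[k] = 0
    # Inverse transform; the imaginary parts are only rounding error.
    return [
        (sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
             for k in range(n)) / n).real
        for t in range(n)
    ]


# Made-up sentence scores: a slow rise and fall plus sentence-level jitter.
jagged = [0.5, -0.2, 1.4, 0.6, 2.1, 1.0, 0.9, -0.4, 0.3, -1.2, -0.1, -1.5]
smoothed = low_pass_filter(jagged, keep=3)
print([round(v, 2) for v in smoothed])
```

With `keep=1` only the mean survives; larger values let more of the local ups and downs back in. Agreement between a smoothed machine vector and a smoothed human vector could then be scored by comparing signs position by position, which is again an assumption about the scoring rather than a description of it.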

Figure 2 graphs how the transformed Afinn method tracked narrative changes in sentiment compared to my human sentiments.[3]


Figure 2

As I have said elsewhere, my primary reason for open-sourcing this code was so that others could plot some narratives of their own and see if the shapes track well with their human sense of the emotional trajectories. If you do that, and you have successes or failures, I’d be very interested in hearing from you (please send me an email).

Given all of the above, I suppose my current working benchmark for human-to-machine accuracy is something like 80%. Frankly, though, I’m more interested in the big picture and whether or not the overall shapes produced by this method map well onto our human sense of a book’s emotional trajectory. They certainly do seem to map well with my sense of Portrait of the Artist, and with many other books in my library, but what about your favorite novel?

[1] For what it is worth, the same can probably be said about us, the human beings. Given a single sentence with no context, we could probably argue about its positiveness or negativeness.
[2] Each method uses a slightly different value range, so when I write of “agreement,”  I mean only that the machine method agreed with the human (me) that a given sentence was positively or negatively charged.  My rating scale consisted of three values: 1, 0, -1 (positive, neutral, negative). I did not test the extent of the positiveness or the degree of negativeness.
[3] I explored low-pass values in increments of 5 all the way to 100.  The percentages of agreement were consistently between 70 and 90.