Annie Swafford has raised a couple of interesting points about how the syuzhet package works to estimate the emotional trajectory in a novel, a trajectory which I have suggested serves as a handy proxy for plot (in the spirit of Kurt Vonnegut).
Annie expresses some concern about the level of precision the tool provides and suggests that dictionary based methods (such as the three I include as options in syuzhet) are not reliable. She writes “Sentiment analysis based solely on word-by-word lexicon lookups is really not state-of-the-art at all.” That’s fair, I suppose, but those three lexicons are benchmarks of some importance, and they deserve to be included in the package if for no other reason than for comparison. Frankly, I don’t think any of the current sentiment detection methods are especially reliable. The Stanford tagger has a reputation for being the main contender for the title of “best in the open source market,” but even it hovers around 80–83% accuracy. My own tests have shown that performance depends a good deal on genre/register.
But Annie seems especially concerned about the three dictionary methods in the package. She writes “sentiment analysis as it is implemented in the syuzhet package does not correctly identify the sentiment of sentences.” Given that sentiment is a subtle and nuanced thing, I’m not sure that “correct” is the right word here. I’m not convinced there is a “correct” answer when it comes to this question of valence. I do agree, however, that some answers are more or less correct than others and that to be useful we need to be on the closer side. The question to address, then, is whether we are close enough, and that’s a hard one. We would probably find a good deal of human agreement when it comes to the extremes of sentiment, but there are a lot of tricky cases, grey areas where I’m not sure we would all agree. We certainly cannot expect the tool to perform better than a person, so we need some flexibility in our definition of “correct.”
Take, for example, the sentence “I studied at Leland Stanford Junior University.” The state-of-the-art Stanford sentiment parser scores this sentence as “negative.” I think that is incorrect (you are welcome to disagree;-). The “bing” method, that I have implemented as the default in syuzhet, scores this sentence as neutral, as does the “afinn” method (also in the package). The NRC method scores it as slightly positive. So, which one is correct? We could go all Derrida on this sentence and deconstruct each word, unpack what “junior” really means. We could probably even “problematize” it! . . . But let’s not.
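For anyone who wants to try this at home, here is a minimal sketch (assuming the syuzhet package is installed) that scores that sentence with the three dictionary methods mentioned above:

library(syuzhet)
test_sentence <- "I studied at Leland Stanford Junior University."
get_sentiment(test_sentence, method = "bing")   # Bing lexicon lookup
get_sentiment(test_sentence, method = "afinn")  # AFINN lexicon lookup
get_sentiment(test_sentence, method = "nrc")    # NRC lexicon lookup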
What Annie writes about dictionary based methods not being the most state-of-the-art is true from a technical standpoint, but sophisticated methods and complexity do not necessarily correlate with better results. Annie suggests that “getting the Stanford package to work consistently would go a long way towards addressing some of these issues,” but as we saw with the sentence above, simple beat sophisticated, hands down.[1]
Consider another sentence: “Syuzhet is not beautiful.” All four methods score this sentence as positive; even the Stanford tool, which tends to do a better job with negation, says “positive.”
It is easy to find opposite cases where sophisticated wins the day. Consider this more complex sentence: “He was not the sort of man that one would describe as especially handsome.” NRC and Afinn both score this sentence as neutral, Bing scores it as slightly positive, and Stanford scores it as slightly negative. When it comes to negation, the Stanford tool tends to perform a bit better, but not always. The very similar sentence “She was not the sort of woman that one would describe as beautiful” is scored slightly positive by all four methods.
What I have found in my testing is that these four methods usually agree with each other, not exactly but close enough. Because the Stanford parser is very computationally expensive and requires special installation, I focused the examples in the Syuzhet Package Vignette on the three dictionary based methods. All three are lightning fast by comparison, and all three have the benefit of simplicity.
But, are they good enough compared to the more sophisticated Stanford parser?
Below are two graphics showing how the methods stack up over a longer piece of text. The first image shows sentiment using percentage based segmentation as implemented in the get_percentage_values() function.

Four Methods Compared using Percentage Segmentation
The three dictionary methods appear to be a bit closer to one another, but all four methods do create the same basic shape. The next image shows the same data after normalization using the get_transformed_values() function. Here the similarity is even more pronounced.

Four Methods Compared Using Transformed Values
While we could legitimately argue about the accuracy of one sentence here or one sentence there, as Annie has done, that is not the point. The point is to reveal a latent emotional trajectory that represents the general sense of the novel’s plot. In this example, all four methods make it pretty clear what that shape is: This is what Vonnegut called “Man in Hole.”
The sentence level precision that Annie wants is probably not possible, at least not right now. While I am sympathetic to the position, I would argue that for this particular use case, it really does not matter. The tool simply has to be good enough, not perfect. If the overall shape mirrors our sense of the novel’s plot, then the tool is working, and this is the area where I think there is still a lot of validation work to do. Part of the impetus for releasing the package was to allow other people to experiment and report results. I’ve looked at a lot of graphs, but there is a limit to the number of books that I know well enough to be able to make an objective comparison between the Syuzhet graph and my close reading of the book.
This is another place where Annie raises some red flags. Annie calls attention to these two images (below) from my earlier post and complains that the transformed graph is not a good representation of the noisy raw data. She writes:
The full trajectory opens with a largely flat stretch and a strong negative spike around x=1100 that then rises back to be neutral by about x=1500. The foundation shape, on the other hand, opens with a rise, and in fact peaks in positivity right around where the original signal peaks in negativity. In other words, the foundation shape for the first part of the book is not merely inaccurate, but in fact exactly opposite the actual shape of the original graph.
Annie’s reading of the graphs, though, is inconsistent with the overall plot of the novel, whereas the transformed plot is perfectly consistent with it. What Annie calls a “strong negative spike” is the scene in which Stephen is pandied by Father Dolan. It is an important negative moment, to be sure, but not nearly as important, or as negative, as the major dip that occurs midway through the novel, when Stephen experiences Hell. The scene with Dolan is a minor blip compared to the pages and pages of hell and the pages and pages of anguish Stephen experiences before his confession.

Annie is absolutely correct in noting that there is information loss, but wrong in arguing that the graph fails to represent the novel. The tool has done what it was designed to do: it successfully reveals the overall shape of the narrative. The first third of the novel and the last third of the novel are considerably more positive than the middle section. But this is not meant to say or imply that the beginning and end are without negative moments.
It is perfectly reasonable to want to see more of the page-to-page, or scene-by-scene, fluctuations in sentiment, and that can be easily achieved by using the percentage segmentation method or by altering the low-pass filter size. Changing the filter size to retain five components instead of three results in the graph below. This new graph captures that “strong negative spike” (not so “strong” compared to hell) and reveals more of the novel’s ups and downs. This graph also provides more detail about the end of the novel, where Stephen comes down off his bird-girl high and moves toward a more sober perspective on his future.

Portrait with Five Components
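For those who want to reproduce a five-component graph of their own, here is a minimal sketch using the Bing values (the full four-method version simply repeats the pattern from the code at the end of this post); it assumes bing_sent has already been computed from the sentence-level text as shown there:

bing_trans_5 <- get_transformed_values(
  bing_sent,
  low_pass_size = 5,     # retain five components instead of the default three
  x_reverse_len = 100,
  scale_vals = TRUE,
  scale_range = FALSE
)
plot(
  bing_trans_5,
  type = "l",
  main = "Portrait, Bing Method with Five Low-Pass Components",
  xlab = "Narrative Time",
  ylab = "Emotional Valence"
)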
Of course, the other reason for releasing the code is so that I can get suggestions for improvements. Annie (and a few others) have already prompted me to tweak several functions. Annie found (and reported on her blog) some legitimate flaws in the openNLP sentence parser. When it comes to passages with dialog, the openNLP parser falls down on the job. I ran a few dialog tests (including Annie’s example) and was able to fix the great majority of the sentence parsing errors by simply stripping out the quotation marks in advance. Based on Annie’s feedback, I’ve added a “quote stripping” parameter to the get_sentences() function. It’s all freshly baked and updated on GitHub.
But finally, I want to comment on Annie’s suggestion that
some texts use irony and dark humor for more extended periods than you [that’s me] suggest in that footnote—an assumption that can be tested by comparing human-annotated texts with the Syuzhet package.
I think that would be a great test, and I hope that Annie will consider working with me, or in parallel, to test it. If anyone has any human annotated novels, please send them my/our way!
Things like irony, metaphor, and dark humor are the monsters under the bed that keep me up at night. Still, I would not have released this code without doing a little bit of testing:-). These monsters can indeed wreak a bit of havoc, but usually they are all shadow and no teeth. Take the colloquial expression “That’s some bad R code, man.” This sentence is supposed to mean the opposite, as in “That is a fine bit of R coding, sir.” This is a sentence the tool is not likely to get right; but, then again, this sentence also messes up my young daughter, and it tends to confuse English language learners. I have yet to find any sustained examples of this sort of construction in typical prose fiction, and I have made a fairly careful study of the emotional outliers in my corpus.
Satire, extended satire in particular, is probably a more serious monster. Still, I would argue that the sentiment tools perform exactly as expected; they just don’t understand what they are “reading” in the way that we do. Then again, and this is no fabrication, I have had some (as in too many) college students over the years who haven’t understood what they were reading and thought that Swift was being serious about eating succulent little babies in his Modest Proposal (those kooky Irish)!
So, some human beings interpret the sentiment in Modest Proposal exactly as the sentiment parser does, which is to say, literally! (Check out the special bonus material at the bottom of this post for a graph of Modest Proposal.) I’d love to have a tool that could detect satire, irony, dark humor and the like, but such a tool is still a good ways off. In the meantime, we can take comfort in incremental progress.
Special thanks to Annie Swafford for prompting a stimulating discussion. Here is all the code necessary to repeat the experiments discussed above. . .
library(syuzhet)
path_to_a_text_file <- system.file("extdata", "portrait.txt",
                                   package = "syuzhet")
joyces_portrait <- get_text_as_string(path_to_a_text_file)
poa_v <- get_sentences(joyces_portrait)
# Get the four sentiment vectors
stanford_sent <- get_sentiment(poa_v, method="stanford", "/Applications/stanford-corenlp-full-2014-01-04")
bing_sent <- get_sentiment(poa_v, method="bing")
afinn_sent <- get_sentiment(poa_v, method="afinn")
nrc_sent <- get_sentiment(poa_v, method="nrc")
######################################################
# Plot them using percentage segmentation
######################################################
plot(
  scale(get_percentage_values(stanford_sent, 10)),
  type = "l",
  main = "Joyce's Portrait Using All Four Methods\n and Percentage Based Segmentation",
  xlab = "Narrative Time",
  ylab = "Emotional Valence",
  ylim = c(-3, 3)
)
lines(
  scale(get_percentage_values(bing_sent, 10)),
  col = "red",
  lwd = 2
)
lines(
  scale(get_percentage_values(afinn_sent, 10)),
  col = "blue",
  lwd = 2
)
lines(
  scale(get_percentage_values(nrc_sent, 10)),
  col = "green",
  lwd = 2
)
legend('topleft', c("Stanford", "Bing", "Afinn", "NRC"), lty=1, col=c('black', 'red', 'blue', 'green'), bty='n', cex=.75)
######################################################
# Transform the Sentiments
######################################################
stan_trans <- get_transformed_values(
  stanford_sent,
  low_pass_size = 3,
  x_reverse_len = 100,
  scale_vals = TRUE,
  scale_range = FALSE
)
bing_trans <- get_transformed_values(
  bing_sent,
  low_pass_size = 3,
  x_reverse_len = 100,
  scale_vals = TRUE,
  scale_range = FALSE
)
afinn_trans <- get_transformed_values(
  afinn_sent,
  low_pass_size = 3,
  x_reverse_len = 100,
  scale_vals = TRUE,
  scale_range = FALSE
)
nrc_trans <- get_transformed_values(
  nrc_sent,
  low_pass_size = 3,
  x_reverse_len = 100,
  scale_vals = TRUE,
  scale_range = FALSE
)
######################################################
# Plot them all
######################################################
plot(
  stan_trans,
  type = "l",
  main = "Joyce's Portrait Using All Four Methods",
  xlab = "Narrative Time",
  ylab = "Emotional Valence",
  ylim = c(-2, 2)
)
lines(
  bing_trans,
  col = "red",
  lwd = 2
)
lines(
  afinn_trans,
  col = "blue",
  lwd = 2
)
lines(
  nrc_trans,
  col = "green",
  lwd = 2
)
legend('topleft', c("Stanford", "Bing", "Afinn", "NRC"), lty=1, col=c('black', 'red', 'blue', 'green'), bty='n', cex=.75)
######################################################
# Sentence Parsing Annie's Example
######################################################
annies_sentences_w_quotes <- '"Mrs. Rachael, I needn’t inform you who were acquainted with the late Miss Barbary’s affairs, that her means die with her and that this young lady, now her aunt is dead–" "My aunt, sir!" "It is really of no use carrying on a deception when no object is to be gained by it," said Mr. Kenge smoothly, "Aunt in fact, though not in law."'
# Strip out the quotation marks
annies_sentences_no_quotes <- gsub("\"", "", annies_sentences_w_quotes)
# With quotes, Not Very Good:
s_v <- get_sentences(annies_sentences_w_quotes)
s_v
# Without quotes, Better:
s_v_nq <- get_sentences(annies_sentences_no_quotes)
s_v_nq
######################################################
# Some Sentence Comparisons
######################################################
# Test one
test <- "He was not the sort of man that one would describe as especially handsome."
stanford_sent <- get_sentiment(test, method="stanford", "/Applications/stanford-corenlp-full-2014-01-04")
bing_sent <- get_sentiment(test, method="bing")
nrc_sent <- get_sentiment(test, method="nrc")
afinn_sent <- get_sentiment(test, method="afinn")
stanford_sent; bing_sent; nrc_sent; afinn_sent
# test 2
test <- "She was not the sort of woman that one would describe as beautiful."
stanford_sent <- get_sentiment(test, method="stanford", "/Applications/stanford-corenlp-full-2014-01-04")
bing_sent <- get_sentiment(test, method="bing")
nrc_sent <- get_sentiment(test, method="nrc")
afinn_sent <- get_sentiment(test, method="afinn")
stanford_sent; bing_sent; nrc_sent; afinn_sent
# test 3
test <- "That's some bad R code, man."
stanford_sent <- get_sentiment(test, method="stanford", "/Applications/stanford-corenlp-full-2014-01-04")
bing_sent <- get_sentiment(test, method="bing")
nrc_sent <- get_sentiment(test, method="nrc")
afinn_sent <- get_sentiment(test, method="afinn")
stanford_sent; bing_sent; nrc_sent; afinn_sent
SPECIAL BONUS MATERIAL
Swift’s classic satire presents some sentiment challenges. There is disagreement between the Stanford method and the other three in segment four where the sentiments move in opposite directions.

FOOTNOTE
[1] By the way, I’m not sure if Annie was suggesting that the Stanford parser was not working because she could not get it to work (the NAs) or because there was something wrong in the syuzhet package code. The code, as written, works just fine on the two machines I have available for testing. I’d appreciate hearing from others who are having problems; my implementation definitely qualifies as a first class hack.
My Sentiments (Exactly?)
While developing the Syuzhet package–a tool for tracking relative shifts in narrative sentiment–I spent a fair amount of time gut-checking whether the sentiment values returned by the machine methods were a good match for my own sense of the narrative sentiment. Between 70% and 80% of the time, they were what I considered to be good sentence level matches. . . but sentences were not my primary unit of interest.
Rather, I wanted a way to assess whether the story shapes that the tool produced by tracking changes in sentiment were a good approximation of central shifts in the “emotional trajectory” of a narrative. This emotional trajectory was something that Kurt Vonnegut had described in a lecture about the simple shapes of stories. On a chalkboard, Vonnegut graphed stories of good fortune and ill fortune in a demonstration he called “an exercise in relativity.” He was not interested in the precise highs and lows in a given book, but rather in the highs and lows of the book relative to each other.
Blood Meridian and The Devil Wears Prada are two very different books. The former is way, way more negative. What Vonnegut was interested in understanding was not whether McCarthy’s book was more negative overall than Weisberger’s; he was interested in understanding the internal dynamics of shifting sentiment: where in a book we would find the lowest low relative to the highest high. Implied in Vonnegut’s lecture was the idea that this tracking of relative highs and lows could serve as a proxy for something like “plot structure” or “syuzhet.”
This was an interesting idea, and sentiment analysis offered a possible way forward. Unfortunately, the best work in sentiment analysis has been in very different domains. Could sentiment analysis tools and dictionaries that were designed to assess sentiment in movie reviews also detect subtle shifts in the language of prose fiction? Could these methods handle irony, metaphor, and so forth? Some people, especially if they looked only at the results of a few sentences, might reject the whole idea out of hand. Movie reviews and fiction, hogwash! Instead of rejecting the idea, I sat down and human coded the sentiment of every sentence in Joyce’s Portrait of the Artist. I then developed Syuzhet so that I could apply and compare four different sentiment detection techniques to my own human codings.
This human coding business is nuanced. Some sentences are tricky. But it’s not the sarcasm or the irony or the metaphor that is tricky. The really hard sentences are the ones that are equal parts positive and negative sentiment. Consider this contrived example:
“I hated the way he looked at me that morning, and I was glad that he had become my friend.”
Is that a positive or negative sentence? Given the coordinating “and,” perhaps the second half is more important than the first? I coded sentences such as this as neutral, and thankfully these were the outliers and not the norm. Most of the time–even in a complex novel like Portrait, where the style and complexity of the sentences are both evolving with the maturation of the protagonist–it was fairly easy to make a determination of positive, negative, or neutral.
It turns out that when you do this sort of close reading you learn a lot about the way that authors write/express/manipulate “sentiment.” One thing I learned was that tricky sentences, such as the one above, are usually surrounded by other sentences that are less tricky. In fact, in many random passages that I examined from other books, and in the entirety of Portrait, tricky sentences were usually followed or preceded by other simple sentences that would clarify the sentiment of the larger passage. This is an important observation because at the level of an individual sentence, we know that the various machine methods are not super effective.[1] That said, I was pretty surprised by the amount of sentence level agreement in my ad hoc test. On a sentence by sentence basis, here is how the four methods in the package performed:[2]
Bing 84% agreement
Afinn 80% agreement
Stanford 60% agreement
NRC 50% agreement
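For what it is worth, the agreement calculation itself is nothing fancy; here is a rough sketch of the sort of comparison I mean. My human codings are not distributed with the package, so human_coded below is a hypothetical stand-in for my vector of 1/0/-1 values, one per sentence, and poa_v is the vector of Portrait sentences created with get_sentences() as in the code earlier in this post:

machine_sent_bing <- get_sentiment(poa_v, method = "bing")
# proportion of sentences where the machine's sign (positive/neutral/negative)
# matches the human coding
mean(sign(machine_sent_bing) == sign(human_coded))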
These results surprised me. I was shocked that the more awesome Stanford method did not outperform the others. I was so shocked, in fact, that I figured I must have done something wrong. The Stanford sentiment tagger, for example, thinks that the following sentence from Joyce’s Portrait is negative.
“Once upon a time and a very good time it was there was a moocow coming down along the road and this moocow that was coming down along the road met a nicens little boy named baby tuckoo.”
It was a “very good time.” How could that be negative? I think “a very good time” is positive and so do the other methods. The Stanford tagger also indicated that the sentence “He sang that song” is slightly negative. All of the other methods scored it as neutral, and so did I.
I’m a huge fan of the Stanford tagger; I’ve been impressed by the way that it handles negation, but perhaps, when all is said and done, it is simply not well-suited to literary prose, where the syntactical constructions can be far more complicated than those of typical utilitarian prose? I need more time to study how the Stanford tagger behaved on this problem, so I’m just going to exclude it from the rest of this report. My hypothesis, however, is that it is far more sensitive to register/genre than the dictionary based methods.
So, as I was saying, sentiment in actual prose fiction is usually established over a series of sentences. That simile, that bit of irony, that negated sentence is typically followed and/or preceded by a series of more direct sentences expressing the sentiment of the passage. For example,
“She was not ugly. She was exceedingly beautiful.”
“I watched him with disgust. He ate like a pig.”
Prose, at least the prose that I studied in this experiment, is rarely composed of sustained irony, sustained negation, sustained metaphor, etc. Usually authors provide us with lots of clues about the sentiment we are meant to experience, and over the course of several sentences, a paragraph, or a page, the sentiment tends to become less ambiguous.
So instead of just testing the machine methods against my human sentiments on a sentence by sentence basis, I split Joyce’s Portrait into 20 equally sized chunks and calculated the mean sentiment of each. I then compared those means to the means of my own human coded sentiments. These were the results:
Bing 80% agreement
Afinn 85% agreement
NRC 90% agreement
Not bad. But of course any time we apply a chunking method like this, we risk breaking the text right in the middle of a key passage. And, as we increase the number of chunks and effectively decrease the size of each passage, the agreement values tend to decrease. I ran the same test using 100 segments and saw this:
Bing 73% agreement
Afinn 77% agreement
NRC 58% agreement (ouch)
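For the curious, here is a rough sketch of one way such a chunked comparison could be run; again, human_coded is a hypothetical stand-in for my sentence-level 1/0/-1 codings, and the chunking could just as easily be done with the package’s own get_percentage_values():

chunk_agreement <- function(machine_v, human_v, n_chunks = 20) {
  # split both sentence-level vectors into n_chunks equal-sized pieces,
  # take the mean of each piece, and compare the signs of those means
  breaks <- cut(seq_along(machine_v), breaks = n_chunks, labels = FALSE)
  machine_means <- tapply(machine_v, breaks, mean)
  human_means <- tapply(human_v, breaks, mean)
  mean(sign(machine_means) == sign(human_means))
}
chunk_agreement(get_sentiment(poa_v, method = "afinn"), human_coded, n_chunks = 20)
chunk_agreement(get_sentiment(poa_v, method = "afinn"), human_coded, n_chunks = 100)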
Figure 1 graphs how the Afinn method (with 77% agreement over 100 segments) tracked the sentiment compared to my human sentiments.
Figure 1
Next I transformed all of the sentiment vectors (machine and human) using the get_transformed_values function. I then calculated the amount of agreement. With the low pass filter set to the default of 3, I observed the following agreement:
Bing 73% agreement
Afinn 74% agreement
NRC 86% agreement
With the low pass filter set to 5, I observed the following agreement:
Bing 87% agreement
Afinn 93% agreement
NRC 90% agreement
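The transformed comparison works the same way, except that both the machine values and the (hypothetical) human codings are first run through get_transformed_values(). A sketch with the low pass filter set to 5:

afinn_trans_cmp <- get_transformed_values(
  get_sentiment(poa_v, method = "afinn"),
  low_pass_size = 5,
  x_reverse_len = 100,
  scale_vals = TRUE,
  scale_range = FALSE
)
human_trans <- get_transformed_values(
  human_coded,
  low_pass_size = 5,
  x_reverse_len = 100,
  scale_vals = TRUE,
  scale_range = FALSE
)
# proportion of the 100 transformed points where the signs agree
mean(sign(afinn_trans_cmp) == sign(human_trans))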
Figure 2 graphs how the transformed Afinn method tracked narrative changes in sentiment compared to my human sentiments.[3]
Figure 2
As I have said elsewhere, my primary reason for open-sourcing this code was so that others could plot some narratives of their own and see if the shapes track well with their human sense of the emotional trajectories. If you do that, and you have successes or failures, I’d be very interested in hearing from you (please send me an email).
Given all of the above, I suppose my current working benchmark for human to machine accuracy is something like ~80%. Frankly, though, I’m more interested in the big picture and whether or not the overall shapes produced by this method map well onto our human sense of a book’s emotional trajectory. They certainly do seem to map well with my sense of Portrait of the Artist, and with many other books in my library, but what about your favorite novel?
FOOTNOTES:
[1] For what it is worth, the same can probably be said about us, the human beings. Given a single sentence with no context, we could probably argue about its positiveness or negativeness.
[2] Each method uses a slightly different value range, so when I write of “agreement,” I mean only that the machine method agreed with the human (me) that a given sentence was positively or negatively charged. My rating scale consisted of three values: 1, 0, -1 (positive, neutral, negative). I did not test the extent of the positiveness or the degree of negativeness.
[3] I explored low-pass values in increments of 5 all the way to 100. The percentages of agreement were consistently between 70 and 90.