This post is a followup to A Novel Method for Detecting Plot posted June 15, 2014.

For the past few years, I have been exploring the relationship between sentiment and plot shape in fiction. Earlier today I posted an R package titled “syuzhet” to GitHub. The package is designed to extract sentiment and plot information from prose. Methods for text import, sentiment extraction, and plot arc modeling are described in the documentation and in the package vignette. What follows below is a blog-friendly version of a longer academic paper describing how I employed this package to study plot in a corpus of ~50,000 novels.



When I began the research that led to this package, my goal was to study positive and negative emotions in literature across time, in much the same way that I had studied style and theme in Macroanalysis. Along the way, however, I discovered that fluctuations in sentiment can serve as a rather natural proxy for fluctuations in plot movement. Studying plot shifts via sentiment analysis turned out to be a far more interesting project than the simple study of sentiment, and my research got a huge boost when I stumbled upon a video of Kurt Vonnegut describing plot in precisely these terms.

After seeing the video and hearing Vonnegut’s opening challenge (“There’s no reason why the simple shapes of stories can’t be fed into computers”), I set out to develop a systematic way of extracting plot arcs from fiction. I felt this might help me to better understand and visualize how narrative is constructed. The fundamental idea, of course, was nothing new. What I was after is what the Russian formalist Vladimir Propp had defined as the narrative’s syuzhet (the organization of the narrative) as opposed to its fabula (raw elements of the story).

Syuzhet is concerned with the linear progression of narrative from beginning (first page) to the end (last page), whereas fabula is concerned with the specific events of a story, events which may or may not be related in chronological order. When we study fabula, which is what we typically do in literature courses, we mentally reconstruct the events into chronological order. We hope that this reconstruction of the fabula will help us understand the experience of the characters, the core story, etc. When we study the syuzhet, we are not so much concerned with the order of the fictional events but specifically interested in the manner in which the author presents those events to readers.

Consider the technique that radio personality Paul Harvey used in his iconic radio show “The Rest of the Story.” In each story, Harvey would hold back certain key elements until the very end of the program. The narrative would appear to have reached its conclusion, and then Harvey would say, “and now, the rest of the story.” At this point, he would reveal the held-back information and the listener would reconstruct the entire fabula. The effect (and affect) of Harvey’s technique, the syuzhet, was usually stunning and pleasantly surprising. Had the story been told in simple chronological order, it would have been bland, perhaps even boring. What gave Harvey’s show power was his narrative technique.

This power was largely derived from the organization of the narrative elements and the manner in which Harvey offered listeners clues and then used narrative and language to evoke both curiosity and emotional response. What Harvey said and how he said it were critical elements of the overall effect of the story. Harvey’s success was in finding and mastering a particular style of plot, a plot that has much in common with those found in mystery and detective fiction. A series of clues is presented alongside a series of misdirections, and the mystery is ultimately resolved in some grand reveal that defies expectations.

A Finite Number of Plots

But this Harvey method is just one among many possible plots. Countless scholars and non-scholars have pontificated about the possibility of a finite set of fundamental or archetypal plot shapes.

One of the more recent and famous/infamous of these scholars is Christopher Booker, whose 2004 book, The Seven Basic Plots: Why We Tell Stories, argues for a Jungian-inspired understanding of plot in terms of seven basic archetypes. Booker’s work appears to be strongly influenced by prior work describing plot in terms of conflict. These core conflicts will be familiar to students of literature: such constructions were once taught to us under the headings of “man vs. man,” “man against nature,” “man vs. society,” and so on.

Other scholars have offered other numbers. William Foster-Harris has argued in favor of three basic patterns of plot (Foster-Harris. The Basic Patterns of Plot. University of Oklahoma Press, 1959.); Ronald B. Tobias has argued for twenty (Tobias, Ronald B. 20 Master Plots. Cincinnati: Writer’s Digest Books, 1993.); and Georges Polti claims that there are thirty-six (The Thirty-Six Dramatic Situations. trans. Lucille Ray). So the story goes.

All of these discussions about plot typically involve some discussion of a story’s central conflict. But discussions of conflict are more appropriately classified as fabula. Nevertheless, many of these same discussions also explore the flow, or trajectory, of the narrative, and these I consider to be appropriately categorized as syuzhet. Often these discussions of plot engage visualization in order to convey the “movement” of the narrative. Perhaps the best example of this is the one offered by Vonnegut.


A Significant Problem

Still, all of these explanations of plot suffer from a significant problem: a lack of data. Each of these proposed taxonomies suffers from anecdotalism. Vonnegut draws the plot of Cinderella for us on his chalkboard, and we can imagine a handful of similar plot shapes. He describes another plot and names it “man in hole,” and we can imagine a few similar stories. But our imaginations are limited.

This limitation led me to think hard about the problem of how to compare, mathematically and computationally, the shape of one story to another. Assuming I could use computers and some NLP magic to extract plot shape from narrative (see A Novel Method for Detecting Plot), it would still be impossible to compare one shape to another because of the simple fact that stories are not the same length. Vonnegut solved this problem by creating an x-axis that runs from B to E, that is, from beginning to end. What Vonnegut did not solve, however, was the real computational problem of text length.

It was tempting to consider simply breaking each book into ten or one-hundred equally sized pieces and then taking measurements of the mean emotional valence in each chunk.


Unfortunately, the longer books would have much larger chunks, and with larger chunks would come the possibility of more, and more diverse, valence markers. What happens, in fact, is that larger chunks of text tend to contain plenty of both positive and negative valence markers, which cancel one another out when averaged. The end result is that all the means end up very close to neutral on the y-axis of emotional valence. Indeed, books as a whole tend to have a mean valence close to zero on a scale of -1 to 1. (I tested this by calculating the mean valence for 3,500 novels in my nineteenth-century novels corpus and then plotting the results as a histogram. The distribution showed a clustering around zero with very few books on the extremes.)
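To make the chunking problem concrete, here is a minimal sketch in Python. It is not code from the syuzhet package: the function name chunk_means and the toy valence scores are assumptions made purely for illustration. It splits a series of valence scores into equally sized pieces and averages each piece.

```python
from statistics import mean


def chunk_means(values, n_chunks):
    """Split valence scores into n_chunks nearly equal pieces; return each piece's mean.

    Hypothetical helper for illustration only, not part of the syuzhet package.
    """
    size = len(values)
    if not 1 <= n_chunks <= size:
        raise ValueError("n_chunks must be between 1 and len(values)")
    # Integer boundaries guarantee that every chunk holds at least one value.
    bounds = [i * size // n_chunks for i in range(n_chunks + 1)]
    return [mean(values[bounds[i]:bounds[i + 1]]) for i in range(n_chunks)]


# Toy per-sentence valence scores (invented, not taken from any novel).
valence = [0.5, 0.5, -0.5, -0.5, 1.0, 1.0, -1.0, -1.0]

print(chunk_means(valence, 4))  # small chunks keep the swings: [0.5, -0.5, 1.0, -1.0]
print(chunk_means(valence, 2))  # large chunks cancel out:      [0.0, 0.0]
```

With four small chunks the rise and fall is still visible; with two large chunks the positive and negative scores cancel and every mean collapses to zero, which is the flattening described above.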

So, I needed a way to deal with length. I needed a way to compare the shapes of the stories regardless of the length of the novels. Luckily, since coming to UNL, I’ve become acquainted with a physicist who is one of the team of scientists who recently discovered the Higgs boson at CERN. Over coffee one afternoon, this physicist, Aaron Dominguez, helped me figure out how to travel through narrative time.

A Solution

Aaron introduced me to a mathematical formula from signal processing called the Fourier transformation. The Fourier transformation provides a way of decomposing a time-based signal and reconstituting it in the frequency domain. A complex signal (such as the one seen above in the first figure in this post) can be decomposed into a series of symmetrical waves of varying frequencies. And one of the magical things about the Fourier equation is that these decomposed component sine waves can be added back together (summed) in order to reproduce the original waveform; this is called a backward or reverse (more formally, inverse) transformation. Fourier provides a way of transforming the sentiment-based plot trajectories into an equivalent data form that is independent of the length of the trajectory from beginning to end. The frequency domain begins to solve the book length problem.
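For readers who want to see the round trip in action, the following is a small sketch of a discrete Fourier transformation and its reverse, written in plain Python rather than in the package's R. The function names dft and inverse_dft and the sample scores are assumptions for illustration; a real analysis would use an optimized FFT routine instead of this slow, direct summation.

```python
import cmath


def dft(signal):
    """Decompose a real-valued series into complex frequency components."""
    n = len(signal)
    return [
        sum(value * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, value in enumerate(signal))
        for k in range(n)
    ]


def inverse_dft(coeffs):
    """Sum the component waves back together to rebuild the original series."""
    n = len(coeffs)
    return [
        sum(c * cmath.exp(2j * cmath.pi * k * t / n)
            for k, c in enumerate(coeffs)).real / n
        for t in range(n)
    ]


# Invented sentiment scores, one per sentence, for demonstration only.
scores = [0.2, -0.4, 0.1, 0.9, -0.3]
rebuilt = inverse_dft(dft(scores))

# The rebuilt series matches the original up to floating-point rounding.
print(all(abs(a - b) < 1e-9 for a, b in zip(scores, rebuilt)))
```

Nothing is lost in the trip to the frequency domain and back, which is what makes it safe to do the filtering there.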

It turns out that not all of these sine waves in the frequency domain are created equal; some play a bigger role in the construction of the original signal. In signal processing, a low-pass filter can be used to remove the background “hiss” in an audio recording, and a similar approach can be used to filter out the extremes in the sentiment trajectories. When a low-pass filter is applied to the sentiment data, it’s possible to filter and thereby smooth out a great deal of the affectual noise.

The filtered data from the frequency domain can then be reconstituted back into the time domain using the reverse transformation. At the same time, the x-axis can be normalized and the foundation shape of the story revealed.
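The sketch below puts these steps together in plain Python: transform, keep only the lowest frequencies, and rebuild the curve on a fixed-length x-axis. It is an illustration under stated assumptions, not the package's implementation; the function name foundation_shape, the parameters keep and n_points, and the two synthetic "books" are all hypothetical.

```python
import cmath
import math


def foundation_shape(values, keep=3, n_points=100):
    """Low-pass filter a sentiment series and resample it onto n_points.

    Only the mean (frequency 0) and the `keep` lowest frequencies survive;
    the curve is then evaluated at n_points evenly spaced positions between
    the beginning and the end of the text. Hypothetical helper, not syuzhet code.
    """
    n = len(values)
    if not 0 <= keep < n / 2:
        raise ValueError("keep must be non-negative and below half the series length")
    # Forward transformation, computed for the low frequencies only.
    coeffs = [
        sum(v * cmath.exp(-2j * cmath.pi * k * t / n) for t, v in enumerate(values))
        for k in range(keep + 1)
    ]
    shape = []
    for i in range(n_points):
        s = i / n_points  # normalized narrative time in [0, 1)
        total = coeffs[0].real
        for k in range(1, keep + 1):
            # A real series has mirrored components, hence the factor of 2.
            total += 2 * (coeffs[k] * cmath.exp(2j * cmath.pi * k * s)).real
        shape.append(total / n)
    return shape


# Two synthetic "books" of different lengths that share one slow arc;
# the longer one also carries fast, noisy swings.
short_book = [math.sin(2 * math.pi * t / 50) for t in range(50)]
long_book = [
    math.sin(2 * math.pi * t / 400) + 0.5 * math.sin(2 * math.pi * 40 * t / 400)
    for t in range(400)
]

shape_short = foundation_shape(short_book)
shape_long = foundation_shape(long_book)

# Both books now have 100 comparable values describing the same arc.
print(max(abs(a - b) for a, b in zip(shape_short, shape_long)) < 1e-9)
```

In this toy case the fast swings in the longer series are filtered out entirely, and a 50-value series and a 400-value series end up as the same 100-point shape. How many frequencies to keep is a judgment call: keep too few and real turns in the plot are smoothed away, keep too many and the noise returns.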


Above you can see the core shape of Joyce’s Portrait revealed using the “bing” method of the get_sentiment function in the syuzhet package. (Check the package documentation and vignette for details on the various options and methods.)

Once a book’s plot trajectory is converted into this normalized space, we no longer have the problem of comparing books of different lengths. Compare the foundation shape of Joyce’s Portrait (above) to Wilde’s Picture of Dorian Gray (below).


The models reflect the key narrative movements in both of these plots. Young Stephen reaches a low point during and just after the sermon on hell, which occurs midway through the narrative. Dorian’s life takes a dark turn as the reality of the portrait becomes apparent. But the full power of these transformed plots does not lie simply in visualization. The values that inform these visualizations can now be compared. In a follow-up post, I’ll discuss how I measured and compared 40,000+ plot shapes and then clustered the resulting data in order to reveal six common, perhaps archetypal, plot shapes. . .