A Ringing Endorsement of Smoothing

On March 7, Annie Swafford posted an interesting critique of the transformation method implemented in Syuzhet.  Her basic argument is that setting the low-pass filter too low may result in misleading ringing artifacts.[1]  This post takes up the issue of ringing artifacts more directly and explains how Annie’s clever method of neutralizing values actually demonstrates just how effective the Syuzhet tool is in doing what it was designed to do!   But lest we begin chasing any red herring, let me be very clear about the objectives of the software.

  1. The tool is meant to reveal the simple (and latent) shape of stories, not the complex shape of stories, not the perfect shape of stories, not the absolute shape of stories, just the simple foundational shapes.[2]  This was the challenge that Vonnegut put forth when he said “There is no reason why the simple shape of stories cannot be fed into computers.”
  2. The tool uses sentiment, as detected by four possible methods, as a proxy for “plot.”  This is in keeping with Vonnegut’s conception of “plot” as a movement between what he called “good fortune” and “ill fortune.”  The gamble Syuzhet makes is that the sentiment detection methods are both “good enough” and also may serve as a satisfying proxy for the “good” and “ill” fortune Vonnegut describes in his essay and lecture.
  3. Despite some complex mathematics, there is an interpretive dimension to this work. I suppose this is why folks call it “digital humanities” instead of physics. Syuzhet was designed to estimate and smooth the emotional highs and lows of a narrative; it was not designed to provide a perfect mapping of emotional valence. I don’t think such perfect mapping is computationally possible; if you want/need that kind of experience, then you need to read the book (some of ‘em are even worth it).  I’m interested in detecting/revealing the simple shape of stories by approximating the fundamental highs and lows of emotional valence. I believe that this is what Vonnegut had in mind.
  4. Finally, when examining the shapes produced by graphing the Syuzhet values, we must remember what Vonnegut said: “This is an exercise in relativity, really. It is the shape of the curves that matters and not their origins.”  When Vonnegut speaks of the shapes, he speaks of them as “simple” shapes.

In her critique of the software, Annie expresses concern over the potential for ringing artifacts when using a Fourier transformation and a very low, low-pass filter.  She introduces an innovative method for detecting this possible ringing.  To demonstrate the efficacy of her method, she “neutralizes” one third of the sentiment values in Joyce’s Portrait of the Artist as a Young Man and then retransforms and graphs the new neutralized shape against the original foundation shape of the story.

Annie posits that if the Syuzhet tool is working as she thinks it should, then the last third of the foundational shape should change in reaction to this neutralization.  In Annie’s example, however, no significant change is observed, and she concludes that this must be due to a ringing artifact.  Figure 1 (below) is the evidence she presents on her blog.

portrait_no_last_third1
Figure 1: last third neutralized

For what it is worth, we do see some minor differences between the blue and the orange lines, but really, these look like the same “Man in Hole” plot shapes.  Ouch, this does look like a bad ringing artifact.  But could there be another explanation?

There may, indeed, be some ringing here, but it’s not nearly so extreme as Figure 1 suggests.  An alternative conclusion is that the similarity we observe in the two lines is due to a similarity between the actual values and the neutralized values.  As it happens, the last third of the novel is already pretty neutral compared to the rest of the novel.  In fact, the mean valence for the entire last third of the novel is -0.05.  So all we have really achieved in this test is to replace a section of relatively neutral valence with another segment of totally neutral valence.
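For readers who want to see the mechanics of this neutralization test, here is a minimal Python sketch. It is not Annie's code and not code from the Syuzhet package (which is written in R); the function name `neutralize_third` and the nine sentiment scores are made up for illustration. The sketch only shows that replacing an already near-neutral segment with zeros barely changes the numbers that go into the transformation.

```python
from statistics import mean

def neutralize_third(values, third, fill=0.0):
    """Return a copy of `values` with one third (0, 1, or 2) set to `fill`."""
    n = len(values)
    start, end = n * third // 3, n * (third + 1) // 3
    return values[:start] + [fill] * (end - start) + values[end:]

# Made-up sentence-level sentiment scores (NOT data from Portrait):
# a mildly positive opening, a strongly negative middle, a near-neutral ending.
scores = [0.5, 0.3, 0.4, -2.0, -3.0, -2.5, 0.1, -0.2, 0.0]

last_third = scores[6:]
neutral_end = neutralize_third(scores, third=2)
neutral_middle = neutralize_third(scores, third=1)

print(mean(last_third))                      # already close to zero
print(mean(scores) - mean(neutral_end))      # tiny shift in the overall mean
print(mean(scores) - mean(neutral_middle))   # much larger shift
```

Neutralizing the hypothetical ending moves the overall mean very little, while neutralizing the hypothetical middle moves it a great deal. Passing a different `fill` value corresponds to the artificially happy or unhappy endings tried later in this post.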

This is not, therefore, a very good book in which to test for the presence of ringing artifacts using this particular method of neutralization.  What we see here is a case of the right result but the wrong conclusion.  Which is not to say that there is not some ringing present; I’ll get to that in a moment.  But first another experiment.

If, instead of resetting those values to zero, we set them to 3 (making Portrait end on a very happy note indeed), we get a much different shape (blue line in figure 3).  The earlier parts of the novel are now represented as comparatively less negative and the end of the novel is mucho positive.

last_third_happy

Figure 3: Portrait with artificial positive ending

And, naturally, we can also set those values very negative and produce the graph seen in figure 4.  Oh, poor Stephen.

very_negative

Figure 4: Portrait with artificial negative ending

“But wait, Jockers, you can’t deny that there is still an artificial “hump” there at the end of figure 3 and an artificial trough at the end of figure 4.”   Nope, I won’t deny it, there really can be ringing artifacts.  Let’s see if we can find some that actually matter . . .

First let’s test the beginning of the novel using the Swafford method.  We neutralize the beginning third of the novel and graph it against the original shape (figure 5).  Hmm, again it seems that the foundation shape is pretty close to the original.  Is this a ringing artifact?

first_third

Figure 5: first third neutralized

Could be, but in this case it is probably just another false ringer.  Guess what, the beginning of Joyce’s novel is also comparatively neutral.  This is why the Swafford method results in something similar when we neutralize the first third of the book.  Do note that the first third is a little bit less neutral than the last third.  This is why we see a slightly larger difference between the blue and orange lines in figure 5 compared to figure 1.

But what about the middle section?

If we set the middle third of the novel to neutral, what we get is a very different shape (and a very different novel)!  Figure 6 uses the Swafford method to remove the central crisis of the novel. This is no longer a “man in hole” story, and the resulting shape is precisely what we would expect.  Make no mistake, that hump of happiness is not a ringing artifact.  That hump in the middle is now the most sustained non-negative moment in the book.  We have replaced hell with limbo (not heaven because these are neutral values), and in comparison to the other parts of the book, limbo looks pretty good!  Keep in mind Vonnegut’s message from #4 above: “This is an exercise in relativity.”  Also keep in mind that there is some scaling going on over the y-axis; in other words, we should not get too hung up on the precise position on the y-axis at the expense of seeing the simple shape.

In the new graph, the deepest trough has now shifted to the early part of the novel, which is now the location of the greatest negative valence in the story (it’s the section where Stephen gets sick and is then beaten by Father Dolan). The end of the book now looks relatively darker since we no longer have the depths of hell from the midsection for comparison, but the end third of Portrait is definitely not as negative as the beginning third and this is reflected nicely in figure 6.  (This more positive ending is also evident, by the way, in the original shape–orange line–where the hump in the last third is slightly higher than the early hump.)

neutral_middle

Figure 6: Portrait with Swaffordized Middle

So, the Swafford method proves to be a very useful tool for testing and confirming our expectations.  If we remove the most negative section of the novel, then we should see the nadir of the simple shape shift to the next most negative section.  That is precisely what we see.  I have tested this over a series of other novels, and the effect is the same (see figure 9 below, for example).  This is a great method for validating that the tool is working as expected. Thanks Annie Swafford!

“But wait a second, Jockers, what about those rascally ringing artifacts you promised?”

Yes, yes, there can indeed be ringing artifacts.  Let’s go get some. . . .

Annie follows her previous analysis with what seems like an even more extreme example.  She neutralizes everything in Joyce’s Portrait except for the middle 20 sentences of the novel.[3] When the resulting graph looks a lot like the original man-in-hole graph, she says, in essence: “Busted! there is your ringing artifact Dr. J!”  Figure 7 is the graphic from her blog.

portrait_middle_201

Figure 7: Only 20 (sic) sentences of Portrait

Busted indeed!  Those positive valence humps, peaking at 25 and 75 on the x-axis are dead ringers for ringers.  We know from constructing the experiment in this manner, that everything from 0 to ~49 and everything from ~51 to 100 on the x-axis is perfectly neutral, and yet the tool, the visualization, is revealing two positive humps before and after the middle section: horrible, happy, phantom humps that do not exist in the story!
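The phantom humps can be reproduced with a toy signal. The following Python sketch is an illustration under stated assumptions, not the Syuzhet implementation: `low_pass` is a hypothetical helper that keeps the `keep` lowest Fourier frequencies (counting the mean as one of them) along with their mirror images, and the 100-point signal is invented so that everything is neutral except two negative values in the exact middle.

```python
import cmath

def low_pass(values, keep):
    """Zero every Fourier component except the `keep` lowest frequencies
    (the mean counts as one), then transform back."""
    n = len(values)
    # Forward discrete Fourier transform, written out directly.
    spectrum = [
        sum(v * cmath.exp(-2j * cmath.pi * k * t / n) for t, v in enumerate(values))
        for k in range(n)
    ]
    # Frequency k and its mirror n - k describe the same oscillation,
    # so both are kept or dropped together.
    kept = [c if min(k, n - k) < keep else 0 for k, c in enumerate(spectrum)]
    # Inverse transform; for real input the imaginary parts cancel.
    return [
        (sum(c * cmath.exp(2j * cmath.pi * k * t / n) for k, c in enumerate(kept)) / n).real
        for t in range(n)
    ]

# 100 "sentences", all neutral except two negative ones in the exact middle.
signal = [0.0] * 100
signal[49] = signal[50] = -1.0

smoothed = low_pass(signal, keep=3)   # very low cutoff
exact = low_pass(signal, keep=51)     # keeps every component of a 100-point series

print(smoothed[25], smoothed[50], smoothed[75])
print(max(abs(a - b) for a, b in zip(exact, signal)))
```

With only three components, the values at positions 25 and 75 come out positive even though the input there is exactly zero, and the middle comes out negative. With every component retained, the output matches the input up to floating-point error. How the `keep` argument here maps onto the package's own low-pass parameter is an assumption of this sketch.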

But wait. . .

With all smoothing methods some degree of imprecision is to be expected.  Remember what Vonnegut points out: this is “an exercise in relativity.”  Relatively speaking, even the extreme example in figure 7 is, in my opinion, not too bad.  Just imagine a hypothetical protagonist cruising along in a hypothetical novel such as the one Annie has written with her neutral values.  This protagonist is feeling pretty good about all that neutrality; she ain’t feeling great, but she’s better than bad.  Then she hits that negative section . . . as Vonnegut would say, “oh, God damn it.”[4]  But then things get better, or more precisely, things get comparatively better.  So, the blue line is not a great representation of the narrative, but it’s not a bad approximation either.

But look, I understand my colleague’s concern for more precision, and I don’t want it to appear that I’m taking this ringing business too lightly.  Figure 8 (below) was produced using precisely the same data that Annie used in her two-sentence example; everything is neutralized except for those two sentences from the exact middle of the novel.  This time, however,  I have used a low pass filter set at 100.  Voila!  The new shape (blue) is nothing at all like the original (orange), and the new shape also provides the deep level of detail–and lack of ringing–that some users may desire.[5]  Unfortunately, using such a high, low-pass filter does not usually produce easily interpretable graphs such as seen in figure 8.

low_pass_100

Figure 8: Original shape with neutralized “Swafford Shape” using 100 components

In this very simple example, turning the low-pass filter up to 100 produces a graph that is very easy to read/interpret.   When we begin looking at real novels, however, a low-pass of 100 does not result in shapes that are very easy to visually interpret, and it becomes necessary to smooth them.  I think that is what visualization is all about, which is to say, simplifying the complex so that we can get the gist.  One way to simplify these emotional trajectories is to use a low, low pass filter.  Given that going low may cause more ringing, you need to decide just how low you can go.  Another option, that I demonstrated in my previous post, is to use a high value for the low pass filter (to avoid potential ringing) and then apply a lowess smoother (or your own favorite smoother) in order to reveal the “simple shape” (see figure 1 of http://www.matthewjockers.net/2015/03/09/is-that-your-syuzhet-ringing/).
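As a rough illustration of that second option, here is a Python sketch of smoothing a noisy series after the fact. It uses a plain centered moving average as a stand-in for lowess (it is not lowess), and it is applied directly to a synthetic noisy arc rather than to the output of a high-cutoff transform. The helper names, the window size, and the data are all assumptions made for the example.

```python
import math
import random

def moving_average(values, window):
    """Centered moving average; the window shrinks near the edges."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def roughness(values):
    """Total up-and-down movement between neighboring points."""
    return sum(abs(b - a) for a, b in zip(values, values[1:]))

random.seed(0)
# Synthetic noisy arc: high at both ends, low in the middle, plus random noise.
noisy = [math.cos(2 * math.pi * t / 100) + random.gauss(0, 0.5) for t in range(100)]
smooth = moving_average(noisy, window=21)

print(roughness(noisy), roughness(smooth))
print(smooth[10], smooth[50])
```

The smoothed series is far less jagged than the noisy one, yet it still sits high near the start and low in the middle. A wider window gives a simpler shape at the cost of more detail, which is the same trade-off as choosing how low to set the low-pass filter.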

In a future post, I’ll explore something I mentioned to Annie in our email exchange (prior to her public critique): an ad hoc method I’ve been working on that seeks to identify an “ideal” number of components for the low pass filter.

dorian_neutral

Figure 9: Dorian Gray behaving exactly as we would expect with last third neutralized

FOOTNOTES:

[1] Annie does not actually explain that the low-pass filter is a user controlled parameter or that what she is actually testing is the efficacy of the default value.  Users of the tool are welcome to experiment with different values for the low pass filter as I have done here: Is that your Syuzhet Ringing.

[2] I’ve been calling these simple shapes “emotional trajectories” and “plot.” Plot is potentially controversial here, so if folks would like to argue that point, I’m sympathetic.  For the first year of this research, I never used the word “plot,” choosing instead “emotional trajectory” or “simple shape,” which is Vonnegut’s term.  I realize plot is a loaded and nuanced word, but “emotional trajectory” and “simple shape” are just not really part of our nomenclature, so plot is my default term.

[3] There is a small discrepancy between Annie’s blog and her code.  Correction: Annie writes about and includes a graph showing the middle “20” sentences, but then provides code for retaining both the middle 2 and the middle 20 sentences.  Either way the point is the same.

[4] The two negative valence sentences from the middle of Portrait are as follows: “Nay, things which are good in themselves become evil in hell. Company, elsewhere a source of comfort to the afflicted, will be there a continual torment: knowledge, so much longed for as the chief good of the intellect, will there be hated worse than ignorance: light, so much coveted by all creatures from the lord of creation down to the humblest plant in the forest, will be loathed intensely.”

[5]  Annie has written that “Syuzhet computes foundation shapes by discarding all but the lowest terms of the Fourier transform.” That is a rather misleading comment. The low-pass filter is set to 3 by default, but it is a user-tunable parameter.  I explained my reasons for choosing 3 as the default in my email exchange with Annie prior to her critique.   It is unclear to me why Annie does not mention my explanation, so here it is from our email exchange:

“. . . The short and perhaps unsatisfying answer is that I selected 3 based on a good deal of trial and error and several attempts to employ some standard filters that seek to identify a cutoff / threshold by examining the frequencies (ideal, butterworth, and several others that I don’t remember any more).  The trouble with these, and why I selected 3 as the default, is that once you go higher than 3 the resulting plots get rather more complicated, and the goal, of course, is to do the opposite, which is to say that I seek to reduce the plot to a simple base form (along the lines of what Vonnegut is suggesting).  Three isn’t magic, but it does seem to work well at rooting out the foundational shape of the story.  Does it miss some of the subtleties, yep, but again, that is the point, in part.  The longer answer is that this is something I’m still experimenting with.  I have one idea that I’m working with now…”

Is that Your Syuzhet Ringing?

Over the weekend, Annie Swafford published another installment in her ongoing critique of Syuzhet, the R package that I released in early February. In her recent blog post, an interesting approach for testing the get_transformed_values function is proposed[1].

Previously Annie had noted how using the default values for the low-pass filter may result in too much information loss, to which I replied that that is the point.  (Readers hung up on this point are advised to go back and watch the Vonnegut video again.) With any kind of smoothing, there is going to be information loss.  The function is designed to allow the user to tune the low pass filter for greater or lesser degrees of noise (an important point that I shall return to in a moment).

In the new post, Annie explores the efficacy of leaving the low pass filter at its default value of 3; she demonstrates how this value appears to produce a ringing artifact.  This is something that the two of us had discussed at some length in an email correspondence prior to this blogging frenzy.  In that correspondence, I promised to explore adding a gaussian filter to the package, a filter she believes would be more appropriate. Based on her advice, I have explored that option, and will do so further, but for now I remain unconvinced that there is a problem for Gauss to solve.[2]

As I said in my previous post, I believe the true test of the method lies in assessing whether or not the shapes produced by the transformation are a good approximation of the shape of the story. But remember too, that the primary point of the transformation function is to solve the problem of length; it is hard to compare the plot shape of a long novel to a short one.  The low-pass argument is essentially a visualization and noise reduction parameter.   Users who want a closer, scene by scene or sentence by sentence representation of the sentiment data, will likely gravitate to the get_percentage_values function (and a very large number of bins) as, for example, Lincoln Mullen has done on Rpubs.[3]
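To make the percentage-based alternative concrete, here is a small Python sketch of the general idea: cut the sentence-level values into a fixed number of consecutive chunks and average each chunk. This is an illustration only; `percentage_means` is a hypothetical helper, not the package's get_percentage_values function, and the two "novels" are invented.

```python
def percentage_means(values, bins=100):
    """Split `values` into `bins` consecutive chunks of near-equal size
    and return the mean of each chunk."""
    n = len(values)
    if n < bins:
        raise ValueError("need at least one value per bin")
    means = []
    for b in range(bins):
        start, end = b * n // bins, (b + 1) * n // bins
        chunk = values[start:end]
        means.append(sum(chunk) / len(chunk))
    return means

# Two invented "novels" of very different lengths.
short_novel = [0.5, -0.5] * 100      # 200 sentences
long_novel = [0.5, -0.5] * 500       # 1000 sentences

short_arc = percentage_means(short_novel)
long_arc = percentage_means(long_novel)

print(len(short_arc), len(long_arc))                 # same number of points
print(len(short_novel) // 100, len(long_novel) // 100)  # sentences per bin differ
```

Both arcs end up with 100 points, but each point summarizes 2 sentences in the short text and 10 in the long one.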

The downside to that approach, of course, is that you cannot compare two sentiment arcs mathematically; you can only do so by eye.  You cannot compare them mathematically because the amount of text inside each percentage segment will be quite different if the novels are of different lengths, and that would not be a fair comparison.  The transformation function is my attempt at solving this time domain conundrum.  While I believe that it solves the problem well, I’m certainly open to other options.  If we decide that the transformation function is no good, that it produces too much ringing, etc. then we should look for a more attractive alternative.  Until an alternative is found and demonstrated, I’m not going to allow the perfect to become the enemy of the good.
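Here is a minimal Python sketch of how a Fourier-based approach can put texts of different lengths onto a common x-axis. It is written under stated assumptions and is not the package's code: `fourier_resample` is a hypothetical helper that computes a handful of low-frequency Fourier coefficients and then evaluates the resulting truncated series at a fixed number of evenly spaced points. The two input series are synthetic samples of the same underlying curve at different lengths.

```python
import cmath
import math

def fourier_resample(values, keep=3, out_len=100):
    """Describe `values` by its `keep` lowest Fourier coefficients (mean included)
    and evaluate that description at `out_len` evenly spaced points."""
    n = len(values)
    if keep - 1 >= n / 2:
        raise ValueError("`keep` is too large for a series of this length")
    # Normalized coefficients for frequencies 0 .. keep - 1.
    coeffs = [
        sum(v * cmath.exp(-2j * cmath.pi * k * t / n) for t, v in enumerate(values)) / n
        for k in range(keep)
    ]
    out = []
    for j in range(out_len):
        y = coeffs[0].real  # the mean of the series
        for k in range(1, keep):
            # Factor 2 accounts for the mirrored negative frequency of a real signal.
            y += 2 * (coeffs[k] * cmath.exp(2j * cmath.pi * k * j / out_len)).real
        out.append(y)
    return out

# The same underlying shape sampled as a short and a long "novel".
short_novel = [math.cos(2 * math.pi * t / 250) for t in range(250)]
long_novel = [math.cos(2 * math.pi * t / 1000) for t in range(1000)]

a = fourier_resample(short_novel)
b = fourier_resample(long_novel)
max_gap = max(abs(x - y) for x, y in zip(a, b))
print(len(a), len(b), max_gap)
```

Because both outputs have the same length and are built from the same kind of components, they can be compared point by point; here the two resampled curves agree up to floating-point error.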

But, alas, here we are once again on the question of defining what is “good” and what is “good enough.”  So let us turn now to that question and this matter of ringing artifacts.

The problem of ringing artifacts is well understood in the signal processing literature if a bit less so in the narratological literature:-)  Annie has done a fine job of explicating the nature of this problem, and I can’t help thinking that this is a very clever idea of hers.  In fact, I wrote to Annie acknowledging this and noting how I wish I had thought of it myself.

But after repeating her experiment a number of times, with greater and lesser degrees of success, I decided that this exercise is ultimately a bit of a red herring.  Among other things, there are no books in which an entire third consists of nothing but neutral (zero) values, but more importantly the exercise has more to do with the setting of a particular user parameter than it does with the package.

I’d like to now offer a bit of cake and eat it too.  This most recent criticism has focused on the default values for the low-pass filter that I set for the function. There is, of course, nothing preventing adjustment of that parameter by those with a taste for adventure.  The higher the number, the greater the number of components that are retained; the more components we retain, the less ringing and the closer we get to reproducing the original signal.

So let us assume for a moment that the sentiment detection methods all work perfectly. We know as a matter of fact that they don’t work perfectly (you know, like human beings), but this matter of imprecision is something we have already covered in a previous post where I showed that the three dictionary based methods tend to agree with each other and with the more sophisticated Stanford method.  So even though we know we are not getting every sentence’s sentiment just right, let’s pretend that we are, if only for a moment.

With that assumed, let us now recall the primary rationale for the Fourier transformation: to normalize the length of the x-axis.  As it happens, we can do that normalization (the cake) and also retain a great many more components than the 3 default components (eating it).  Figure 1 shows Joyce’s Portrait of the Artist transformed using a low pass filter size of 100.

This produces a graph with a lot more noise, but we have effectively eliminated any objectionable ringing.  With the addition of a smoothing line (lowess function in R), what we see once again (ta da) is a beautiful, if rather less dramatic, example of Vonnegut’s Man in Hole!  And this is precisely the goal, to reveal the plot shape latent in the noise.  The smaller low-pass filter accentuates this effect, the higher low-pass filter provides more information: both show the same essential shape.

Figure 1: Portrait with low pass at 100

foundation

Figure 2: Portrait with low pass at 3

low_pass_20

Figure 3: Portrait with low pass at 20

In the course of this research, I have hand examined the transformed shapes for several dozen novels.  The number of novels I have examined corresponds to the number that I feel I know well enough to assess (and also happen to possess in digital form).  These include such old and new favorites as:

  • Portrait of the Artist
  • Picture of Dorian Gray
  • Ulysses
  • Blood Meridian
  • Gone Girl
  • Finnegans Wake (nah, just kidding)
  • . . .
  • And many more.

As I noted in my previous post, the only way to determine the efficacy of this model is to see if it approximates reality.  We have now plotted Portrait of the Artist six ways to Sunday, and every time we have seen a version of the same man in hole shape.  I’ve read this book 20 times, I have taught this book a dozen times.  It is a man in hole plot.

In my (admittedly) anecdotal evaluations, I have continued to see convincing graphs, such as the one above (and the one below in figure 4).  I have found a few special books that don’t do very well, but that is a story you will have to wait for (spoiler alert, they are not works of satire or dark humor, but they are multi-plot novels involving parallel stories).

Still, I am open to the possibility that there is some confirmation bias possible here.  And this is why I wanted to release the package in the first place.  I had hoped that putting the code on GitHub would entice others toward innovation within the code, but the unexpected criticism has been healthy too, and this conversation has certainly made me think of ways that the functions could be improved.

In retrospect, it may have been better to wait until the full paper was complete before distributing the code.  Most of the things we have covered in the last few weeks on this blog are things that get discussed in finer detail in the paper. Despite more details to come, I believe, as Dryden might say, that the last (plot) line is now sufficiently explicated.

Bonus Images:

dorian_100

Figure 4

In terms of basic shape, Figure 4 is remarkably similar to the more dramatized version seen in figure 5 below.  If you can’t see it, you aren’t reading enough Vonnegut.

dorian_3

Figure 5

[1] How’s that for some awkward passive voice? A few on Twitter have expressed some thoughts on my use of Annie’s first name in my earlier response.  Regular readers of this blog will know that I am consistent in referring to people by their full names upon first mention and by their first names thereafter.  Previous victims of my “house style” have included David Mimno, David; Dana Mackenzie, Dana; Ben Schmidt, Ben; Franco Moretti, Franco; and Julia Flanders, Julia.  There are probably others.

[2] Anyone losing sleep over this gaussian filter business is welcome to grab the code and give it a whirl.

[3] In the essay I am writing about this work, I address a number of the nuances that I have skipped over in these blog posts.  One of the nuances I discuss is an automated process for the selection of a low-pass filter size.

Some thoughts on Annie’s thoughts . . . about Syuzhet

Annie Swafford has raised a couple of interesting points about how the syuzhet package works to estimate the emotional trajectory in a novel, a trajectory which I have suggested serves as a handy proxy for plot (in the spirit of Kurt Vonnegut).

Annie expresses some concern about the level of precision the tool provides and suggests that dictionary based methods (such as the three I include as options in syuzhet) are not reliable. She writes “Sentiment analysis based solely on word-by-word lexicon lookups is really not state-of-the-art at all.” That’s fair, I suppose, but those three lexicons are benchmarks of some importance, and they deserve to be included in the package if for no other reason than for comparison.  Frankly, I don’t think any of the current sentiment detection methods are especially reliable. The Stanford tagger has a reputation for being the main contender for the title of “best in the open source market,” but even it hovers around 80 – 83% accuracy.  My own tests have shown that performance depends a good deal on genre/register.

But Annie seems especially concerned about the three dictionary methods in the package. She writes “sentiment analysis as it is implemented in the syuzhet package does not correctly identify the sentiment of sentences.” Given that sentiment is a subtle and nuanced thing, I’m not sure that “correct” is the right word here. I’m not convinced there is a “correct” answer when it comes to this question of valence. I do agree, however, that some answers are more or less correct than others and that to be useful we need to be on the closer side. The question to address, then, is whether we are close enough, and that’s a hard one. We would probably find a good deal of human agreement when it comes to the extremes of sentiment, but there are a lot of tricky cases, grey areas where I’m not sure we would all agree.  We certainly cannot expect the tool to perform better than a person, so we need some flexibility in our definition of “correct.”

Take, for example, the sentence “I studied at Leland Stanford Junior University.” The state-of-the-art Stanford sentiment parser scores this sentence as “negative.” I think that is incorrect (you are welcome to disagree;-). The “bing” method, that I have implemented as the default in syuzhet, scores this sentence as neutral, as does the “afinn” method (also in the package). The NRC method scores it as slightly positive. So, which one is correct? We could go all Derrida on this sentence and deconstruct each word, unpack what “junior” really means. We could probably even “problematize” it! . . . But let’s not.

What Annie writes about dictionary based methods not being the most state-of-the-art is true from a technical standpoint, but sophisticated methods and complexity do not necessarily correlate with results.  Annie suggests that “getting the Stanford package to work consistently would go a long way towards addressing some of these issues,” but as we saw with the sentence above, simple beat sophisticated, hands down[1].

Consider another sentence: “Syuzhet is not beautiful.” All four methods score this sentence as positive; even the Stanford tool, which tends to do a better job with negation, says “positive.”

It is easy to find opposite cases where sophisticated wins the day. Consider this more complex sentence: “He was not the sort of man that one would describe as especially handsome.” Both NRC and Afinn score this sentence as neutral, Bing scores it slightly positive and Stanford scores it slightly negative. When it comes to negation, the Stanford tool tends to perform a bit better, but not always. The very similar sentence “She was not the sort of woman that one would describe as beautiful” is scored slightly positive by all four methods.
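The reason a word-by-word lookup stumbles on negation is easy to see in a toy version. The Python sketch below is an assumption-laden illustration: the five-word lexicon is invented for the example and is not the bing, afinn, or NRC word list, and `lexicon_score` is a hypothetical helper rather than anything in the package.

```python
import re

# Toy lexicon, invented for illustration; NOT the bing, afinn, or NRC word lists.
TOY_LEXICON = {"beautiful": 1, "handsome": 1, "good": 1, "evil": -1, "torment": -1}

def lexicon_score(sentence, lexicon=TOY_LEXICON):
    """Sum the lexicon values of the words in a sentence; unknown words count as 0."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return sum(lexicon.get(word, 0) for word in words)

print(lexicon_score("Syuzhet is not beautiful."))                        # 1
print(lexicon_score("I studied at Leland Stanford Junior University."))  # 0
```

Because each word is scored in isolation, "not" contributes nothing and "beautiful" still counts as positive. A sentence with no lexicon words at all simply comes out neutral.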

What I have found in my testing is that these four methods usually agree with each other, not exactly but close enough. Because the Stanford parser is very computationally expensive and requires special installation, I focused the examples in the Syuzhet Package Vignette on the three dictionary based methods. All three are lightning fast by comparison, and all three have the benefit of simplicity.

But, are they good enough compared to the more sophisticated Stanford parser?

Below are two graphics showing how the methods stack up over a longer piece of text. The first image shows sentiment using percentage based segmentation as implemented in the get_percentage_values() function.

percent_based

Four Methods Compared using Percentage Segmentation

The three dictionary methods appear to be a bit closer, but all four methods do create the same basic shape.  The next image shows the same data after normalization using the get_transformed_values() function.  Here the similarity is even more pronounced.

four_methods

Four Methods Compared Using Transformed Values

While we could legitimately argue about the accuracy of one sentence here or one sentence there, as Annie has done, that is not the point. The point is to reveal a latent emotional trajectory that represents the general sense of the novel’s plot. In this example, all four methods make it pretty clear what that shape is: This is what Vonnegut called “Man in Hole.”

The sentence level precision that Annie wants is probably not possible, at least not right now.  While I am sympathetic to the position, I would argue that for this particular use case, it really does not matter.  The tool simply has to be good enough, not perfect.  If the overall shape mirrors our sense of the novel’s plot, then the tool is working, and this is the area where I think there is still a lot of validation work to do.  Part of the impetus for releasing the package was to allow other people to experiment and report results.  I’ve looked at a lot of graphs, but there is a limit to the number of books that I know well enough to be able to make an objective comparison between the Syuzhet graph and my close reading of the book.

This is another place where Annie raises some red flags.  Annie calls attention to these two images (below) from my earlier post and complains that the transformed graph is not a good representation of the noisy raw data.  She writes:

The full trajectory opens with a largely flat stretch and a strong negative spike around x=1100 that then rises back to be neutral by about x=1500. The foundation shape, on the other hand, opens with a rise, and in fact peaks in positivity right around where the original signal peaks in negativity. In other words, the foundation shape for the first part of the book is not merely inaccurate, but in fact exactly opposite the actual shape of the original graph.

Annie’s reading of the graphs, though, is inconsistent with the overall plot of the novel, whereas the transformed plot is perfectly consistent with the novel. What Annie calls a “strong negative spike” is the scene in which Stephen is pandied by Father Dolan.  It is an important negative moment, to be sure, but not nearly as important, or as negative, as the major dip that occurs midway through the novel–when Stephen experiences Hell. The scene with Dolan is a minor blip compared to the pages and pages of hell and the pages and pages of anguish Stephen experiences before his confession.

noisy foundation

Annie is absolutely correct in noting that there is information loss, but wrong in arguing that the graph fails to represent the novel.  The tool has done what it was designed to do: it successfully reveals the overall shape of the narrative.  The first third of the novel and the last third of the novel are considerably more positive than the middle section.  But this is not meant to say or imply that the beginning and end are without negative moments.

It is perfectly reasonable to want to see more of the page to page, or scene by scene fluctuations in sentiment, and that can be easily achieved by using the percentage segmentation method or by altering the low-pass filter size.  Changing the filter size to retain five components instead of three results in the graph below.  This new graph captures that “strong negative spike” (not so “strong” compared to hell) and reveals more of the novel’s ups and downs.  This graph also provides more detail about the end of the novel where Stephen comes down off his bird-girl high and moves toward a more sober perspective for his future.

Portrait with Five Components

Of course, the other reason for releasing the code is so that I can get suggestions for improvements. Annie (and a few others) have already propelled me to tweak several functions.  Annie found (and reported on her blog) some legitimate flaws in the openNLP sentence parser. When it comes to passages with dialog, the openNLP parser falls down on the job. I ran a few dialog tests (including Annie’s example) and was able to fix the great majority of the sentence parsing errors by simply stripping out the quotation marks in advance. Based on Annie’s feedback, I’ve added a “quote stripping” parameter to the get_sentences() function. It’s all freshly baked and updated on github.
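For anyone curious why stripping quotation marks helps, here is a rough sketch in Python (not the R code in the package, and not the openNLP parser). It assumes a deliberately naive splitter that breaks only after a period, exclamation point, or question mark followed by whitespace; the function names are made up for illustration. A closing quotation mark sitting between the period and the space hides the sentence boundary, and removing the marks first exposes it.

```python
import re

# Straight and curly double quotation marks. Single quotes are left alone
# so that apostrophes in contractions survive.
QUOTE_CHARS = '"\u201c\u201d'

def strip_quotes(text):
    """Remove double quotation marks before sentence splitting."""
    return text.translate({ord(ch): None for ch in QUOTE_CHARS})

def naive_sentences(text):
    """Toy splitter: break after ., ! or ? when followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

dialog = '"I will not go." She left. "Come back," he said.'
raw = naive_sentences(dialog)
cleaned = naive_sentences(strip_quotes(dialog))
print(len(raw), raw)
print(len(cleaned), cleaned)
```

With the quotation marks in place the toy splitter finds two sentences; with them stripped it finds three. A real parser is far more sophisticated, so this only illustrates the kind of boundary that quotation marks can obscure.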

But finally, I want to comment on Annie’s suggestion that

some texts use irony and dark humor for more extended periods than you [that’s me] suggest in that footnote—an assumption that can be tested by comparing human-annotated texts with the Syuzhet package.

I think that would be a great test, and I hope that Annie will consider working with me, or in parallel, to test it.  If anyone has any human-annotated novels, please send them my/our way!

Things like irony, metaphor, and dark humor are the monsters under the bed that keep me up at night. Still, I would not have released this code without doing a little bit of testing:-). These monsters can indeed wreak a bit of havoc, but usually they are all shadow and no teeth. Take the colloquial expression “That’s some bad R code, man.” This sentence is supposed to mean the opposite, as in “That is a fine bit of R coding, sir.”  This is a sentence the tool is not likely to get right; but, then again, this sentence also messes up my young daughter, and it tends to confuse English language learners. I have yet to find any sustained examples of this sort of construction in typical prose fiction, and I have made a fairly careful study of the emotional outliers in my corpus.

Satire, extended satire in particular, is probably a more serious monster.  Still, I would argue that the sentiment tools perform exactly as expected; they just don’t understand what they are “reading” in the way that we do.  Then again, and this is no fabrication, I have had some (as in too many) college students over the years who haven’t understood what they were reading and thought that Swift was being serious about eating succulent little babies in his Modest Proposal (those kooky Irish)!

So, some human beings interpret the sentiment in Modest Proposal exactly as the sentiment parser does, which is to say, literally! (Check out the special bonus material at the bottom of this post for a graph of Modest Proposal.) I’d love to have a tool that could detect satire, irony, dark humor and the like, but such a tool is still a good ways off.  In the meantime, we can take comfort in incremental progress.

Special thanks to Annie Swafford for prompting a stimulating discussion.  Here is all the code necessary to repeat the experiments discussed above. . .

SPECIAL BONUS MATERIAL

Swift’s classic satire presents some sentiment challenges.  There is disagreement between the Stanford method and the other three in segment four where the sentiments move in opposite directions.

modest_percent

FOOTNOTE

[1] By the way, I’m not sure if Annie was suggesting that the Stanford parser was not working because she could not get it to work (the NAs) or because there was something wrong in the syuzhet package code. The code, as written, works just fine on the two machines I have available for testing. I’d appreciate hearing from others who are having problems; my implementation definitely qualifies as a first class hack.

The Rest of the Story

My blog post on February 2, about the Syuzhet package I developed for R (now available on CRAN), generated some nice press that I was not expecting: Motherboard, then The Paris Review, and several R blogs (Revolutions, R-Bloggers, inside-R) all featured the work.  The press was nice, but I was not at all prepared for the focus to be placed on the one piece of the story that I had yet to explain, namely, how I used the Syuzhet code and some unsupervised machine clustering to identify what seem to be six, or possibly seven, archetypal plot shapes.  So, here now is the rest of the story. . .

In brief: (A Plot Modeling Recipe)

  1. Apply functions available in the Syuzhet package to generate a generalized plot shape for every book in a corpus of 41,383 novels.[1]
  2. Employ Euclidean distance to build a large distance matrix by computing the distance between every pair of novels.
  3. Use unsupervised hierarchical clustering to group books based on the similarity of their plot shape.
  4. Examine the resulting clusters with furrowed brow and say “hmmmm.”
  5. Test several methods of cluster identification (silhouette, gap statistic, elbow).
  6. Develop an ad hoc cluster identification algorithm.
  7. Observe that there are six, or maybe seven, fundamental plot shapes.
  8. Repeat everything over and over again for 12 months while worrying a lot about observing six or seven plots.

Caveats:

Before I reveal the six/seven plots (scroll down if you can’t wait), it’s important to point out that what I offer here is the result of two particular methods of analysis.  If you don’t like the plot shapes that these methods reveal, then you’ll be free to take issue with the methods and try a different approach.  You could, for example,

  1. Read 41,383 novels and sketch the plots of each using Vonnegut’s chalkboard. You could then spend a few decades organizing and classifying them into some sort of taxonomy.  You could then work on clustering them into a finite set of foundational shapes.  This is more or less the method Vonnegut employed, excepting, of course, that he probably only read a few hundred stories and probably only sketched out a few dozen on his chalkboard.
  2. You could use another method, such as the one that Benjamin Schmidt has proposed over at his Sapping Attention blog.

Background:

In my previous post, I explained how I developed some software (named “Syuzhet” in homage to Propp) to extract plot shapes from novels based on sentiment analysis.  In order to understand how I derive the six/seven plot archetypes, we need to understand a little bit about Euclidean distance and hierarchical clustering.  The former provides a mathematical way of computing the similarity or distance between two points in space.  When that space is two dimensional, it’s pretty easy to visualize what is going on: we plot two points on an x-y grid and then measure the distance between them.  When the space is three dimensional, it gets a bit harder, but you can still imagine measuring the distance between some point about three feet off the floor in your kitchen and some point about five feet off the floor in your living room.  Once we go beyond the third dimension things get downright tricky, and we have to rely on the mathematics of the Euclidean metric. Regardless of the dimensions, though, the fundamental idea is the same: we are measuring the distance between points, and the shorter that distance, the more similar the points are.  In this case the points are books, and the feature that determines their point in space is their “plot shape” as derived from Syuzhet.

Once the distances between all the points are measured, we construct a “distance matrix.”  This distance matrix is just a big spreadsheet where we can look up the distance from any one point to any other point.  It might look something like Figure 1.  According to this matrix, the distance between Book 1 and Book 3 is “0.5” whereas the distance between Book 2 and Book 3 is “0.25.”

Figure 1: A Distance Matrix
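To make the idea concrete, here is a minimal sketch in Python (for illustration only; my own work was done in R). It assumes each book has already been reduced to a list of numbers of the same length. The three “books” are invented and have only four values each, so the distances will not match the ones in Figure 1.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("plot shapes must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_matrix(shapes):
    """Symmetric matrix of pairwise distances with zeros on the diagonal."""
    n = len(shapes)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = euclidean(shapes[i], shapes[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix

# Three made-up "plot shapes" of four values each.
books = [
    [0.0, 0.0, 0.0, 0.0],
    [3.0, 4.0, 0.0, 0.0],
    [3.0, 4.0, 0.0, 12.0],
]
matrix = distance_matrix(books)
for row in matrix:
    print(row)
```

The lookup works exactly like the spreadsheet: `matrix[0][2]` is the distance between the first and third book, and it equals `matrix[2][0]`.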

Hierarchical clustering methods use this distance matrix as a foundation upon which to build a hierarchy of similarities. This hierarchy is often visualized as a dendrogram such as seen in Figure 2.

Figure 2: Dendrogram

Figure 2 is a bit like a tree (upside down); it has branches.   At any vertical point, we can cut this tree and the result would be to separate it into two or more branches, or clusters.  For example, cutting the tree in Figure 2 at a height of 225 would result in four primary clusters.  The trick with this sort of tree cutting is identifying an “ideal” vertical position to insert the saw.  Before I get to that, though, we need to step back for a moment to those plots created with the Syuzhet software.
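For readers who want to see tree building and tree cutting in miniature, here is a toy Python sketch. It is not the code behind Figure 2. It assumes “complete linkage” (the distance between two clusters is the largest distance between any of their members), which is only one of several possible linkage rules, and it uses five invented one-dimensional points in place of books.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(points):
    """Build the tree bottom-up; return (height, members_a, members_b) merges."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: farthest pair of members.
                d = max(euclidean(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((d, clusters[i], clusters[j]))
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

def cut_tree(n_points, merges, height):
    """Clusters that remain when only merges at or below `height` are kept."""
    clusters = [[i] for i in range(n_points)]
    for d, a, b in merges:
        if d > height:
            break  # merge heights never decrease under complete linkage
        clusters = [c for c in clusters if c != a and c != b]
        clusters.append(sorted(a + b))
    return sorted(clusters)

points = [[0.0], [1.0], [10.0], [11.0], [30.0]]
merges = agglomerate(points)
print(cut_tree(len(points), merges, 15))  # high cut: few, large clusters
print(cut_tree(len(points), merges, 5))   # lower cut: more, smaller clusters
```

Moving the saw down the tree can only increase the number of clusters, which is why choosing the height matters so much.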

The Plot Thickens

In my previous post, I showed what the plots of Joyce’s Portrait and Wilde’s Dorian Gray look like when graphed using Syuzhet.  Underneath each plot graph is a sequence of 100 numbers from which the shape of the plot is derived.  I have collected these sequences for 41,383 novels, and when I average them, I get the “super average plot archetype” seen in Figure 3.

Figure 3: The Super Average Plot

That is kind of interesting, but things get a lot more interesting after a bit of tree cutting. If you look at the dendrogram in Figure 2 again, you see that cutting the tree just below 250 will result in two primary clusters.  After cutting the tree at that point, it is then possible to calculate a mean shape for all the books in each cluster. The result is seen in Figure 4.
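Calculating a mean shape is nothing more than averaging position by position. Here is a small Python sketch under the assumption that every book is a list of the same length and that each book already carries a cluster label from the tree cutting; the data and the labels “A” and “B” are invented.

```python
def mean_shape(shapes):
    """Element-wise mean of equal-length sequences."""
    n = len(shapes)
    return [sum(values) / n for values in zip(*shapes)]

def cluster_means(shapes, labels):
    """Map each cluster label to the mean shape of its members."""
    groups = {}
    for shape, label in zip(shapes, labels):
        groups.setdefault(label, []).append(shape)
    return {label: mean_shape(members) for label, members in groups.items()}

shapes = [
    [1.0, -1.0, 1.0],    # dips in the middle
    [3.0, -3.0, 3.0],    # dips in the middle, more steeply
    [-2.0, 2.0, -2.0],   # peaks in the middle
]
labels = ["A", "A", "B"]
means = cluster_means(shapes, labels)
print(means)
```

Passing every book to `mean_shape` at once, ignoring the labels, gives the single average shape for the whole collection.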

Figure 4: Two Primary Plots

In homage to Vonnegut, I have titled the shape on the left “man in hole.” 46% of the books in this corpus fall into this cluster.  The remaining 54% are more similar to the plot on the right, which I have named “man on hill.”  At this point, I’d encourage you to take a quick peek at Maya Eilam’s very nice visualization of Vonnegut’s archetypal plot shapes.  The plots I’ll show here are not going to look quite the same, but there will be some resonance.

Looking again at the dendrogram, you can see that the two primary clusters (MOH and MIH) can be split fairly easily into a set of four clusters.  When the tree is cut in this manner, the two plots shown in Figure 4 split into four.

Figure 5: MIH Types I and II

Figure 5 shows the derivatives of the man in hole plot shape.  The man in hole plot splits into one shape (“Type I”) that looks a lot like classical tragedy and another (“Type II”) that looks more like comedy.  Whatever the case, one has a much happier ending than the other.  Figure 6 shows the derivatives of the man on hill.

Figure 6: Man on Hill Types I and II

Here again, one plot leads us to a happy ending and the other to a rather dark conclusion.

Cutting the tree beyond these four shapes gets trickier.  It is difficult to know where precisely to stop and cut.  Move the cut point just a little bit, and we could go from having 10 clusters to 20; it is possible, in fact, to keep moving the cut point further and further down the tree until a point at which every book is its own cluster!  Doing that, however, would be rather silly (see “Caveats” item 1 above).  So the objective is to find an “ideal” place to cut the tree such that the resulting clusters have the greatest amount of internal homogeneity while simultaneously being as different from each other as possible.

My solution to this problem involves iterating through a series of possible cut points and then taking two measures after each cutting.  The first is a measure of cluster homogeneity; the second is a measure of cluster dissimilarity.  This process is more easily described in pseudocode:

Let K be a number of possible clusters from 2 to 50.

With each iteration, I store the resulting values so that I can compare them and identify a value of K that best fulfills the objectives described above.  In order to make this test more robust, I opted to randomly select a subset of one half of the books in the corpus (roughly 20K) and run this test over and over again (each time with a new random sample).  When I did this, I found that the method identified six as the ideal number of clusters about 90% of the time.  The other 10% of the time, it said that seven or eight was a better choice.[2]
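The general idea of taking two measures per candidate K can be sketched in Python. To be clear, this is not my actual algorithm: the two measures below (mean distance of each book to its own cluster centroid, and mean distance between cluster centroids) are stand-ins chosen for simplicity, the data are invented, and the candidate clusterings are written out by hand where a real run would get them from cutting the tree.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(members):
    return [sum(v) / len(members) for v in zip(*members)]

def score(shapes, labels):
    """Return (within, between) for one clustering.

    within:  mean distance from each shape to its own cluster centroid
             (smaller means more homogeneous clusters).
    between: mean distance between pairs of cluster centroids
             (larger means more dissimilar clusters).
    """
    groups = {}
    for shape, label in zip(shapes, labels):
        groups.setdefault(label, []).append(shape)
    if len(groups) < 2:
        raise ValueError("need at least two clusters to compare")
    centroids = {k: centroid(m) for k, m in groups.items()}
    within = sum(euclidean(s, centroids[l])
                 for s, l in zip(shapes, labels)) / len(shapes)
    keys = sorted(centroids)
    pairs = [(a, b) for i, a in enumerate(keys) for b in keys[i + 1:]]
    between = sum(euclidean(centroids[a], centroids[b])
                  for a, b in pairs) / len(pairs)
    return within, between

shapes = [[0.0], [1.0], [10.0], [11.0]]
# Hypothetical cluster labels for K = 2 and K = 3.
candidates = {2: [0, 0, 1, 1], 3: [0, 0, 1, 2]}
results = {k: score(shapes, labels) for k, labels in candidates.items()}
for k, (within, between) in sorted(results.items()):
    print(k, within, between)
```

In this toy case, going from two clusters to three makes the clusters tighter but also puts two of them very close together, which is the tension the two measures are meant to expose. How the two numbers are weighed against each other is a separate decision that this sketch does not make.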

In addition to this mathematical approach, I also employed good old subjective evaluation.  The tool suggested six or seven, but this number (six, seven) would be rather useless if the resulting shapes did not make any sense to those of us who actually read the books.  So, I looked at a lot of plots; everything from two to twenty.  After twenty, I figure there is not much point because the shapes get so similar to each other that it would be rather hard to make the case that plot 19 is really all that different from plot 20.  With six and with seven, however, there remains a good deal of variation.

We saw above how MIH and MOH both split into sub types.  These I labeled as MIH Type I, MIH Type II, MOH Type I, and MOH Type II.  At the cut point that results in six plots, MIH Type I and MOH Type II stay as we saw them above in Figures 5 and 6, but MIH II and MOH I both split, resulting in the shapes seen in Figure 7.

Figure 7: Level Six

Already we can begin to see some shape repetition.  The variant of MIH seen in the lower right is ultimately a steeper, or more extreme, version of the basic MIH.  The other three, though, appear rather more distinct.

At level seven, MOH II splits in two resulting in the shapes shown in Figure 8. After seven, we begin to see a lot more shape repetition, and though each of these shapes is unique in terms of its precise placement on the y axis, i.e. some are more happy others more dark, the arcs are generally similar.

Obviously, there is a great deal more interpretive work to be done here.  Many of these shapes, I think, can be further classified according to their “affects” and “effects.” What, for example, is the overall impression one gets from a book that takes a character to great heights (MOH) and then plunges him/her into a pit of despair from which there is no exit (as is seen in Figure 8 left)?

Figure 8: Seven Plots

But perhaps even more interesting than any of this is the possibility for movement between scales.  Scale hopping is something I advocate in Macroanalysis.  The great power of big(ish) data is that it allows us to contextualize our small reading.  Joyce’s Portrait of the Artist (Figure 9) is a type of MIH.  What other books are MIHs?  Are they popular books?  Are they classics?  Best sellers?  Can we find another telling of the same story?  This is the work that I am doing now, moving from the large to the small and back again. Figures 10-15 (below) present six popular/well-known novels and their corresponding plot types for consideration.

[Update March 2: Annie Swafford offers an interesting critique of this work on her blog.  Her post includes some comments from me in response.]

Figure 9: Joyce’s Portrait

Figure 10

Figure 11

Figure 12

Figure 13

Figure 14

Figure 15

Footnotes:

[1] The Syuzhet package performs a certain type of text analysis, and I’m claiming that the results of this analysis may serve as a pretty darn good proxy for plot.  That said, I’ve been working on this problem for two years, and I know some specific places where it fails.  The most spectacular example of failure was discovered by my son. He’d just finished reading one of the books in my corpus, and I showed him the plot shape from the book and asked him if it made sense. He said, “well, yes, mostly.  But this spike here is all wrong.”  It was a spike in good fortune, positive valence, at precisely the place in the novel where the villains had scored a major victory.  The positive valence was associated with a several-page-long section in which the bad guys were having a very good time. Readers, of course, would see this as a negative moment in the text; Syuzhet does not.  Nor does Syuzhet understand irony and dark humor and so on.  On the whole, however, Syuzhet gets it right, and that’s because most books are not sustained satire, or sustained irony.  Most books end up using emotional markers in a fairly consistent and conventional way.  Indeed, even for an experimental novel such as Joyce’s Ulysses, Syuzhet produces a plot shape that I consider to be a good match to the ebbs and flows of the text.

[2] In a longer, less blog friendly version of this research that is to appear in a collection of essays on digital literary studies, I explain the mathematics in precise detail.

Revealing Sentiment and Plot Arcs with the Syuzhet Package

Introduction

This post is a followup to A Novel Method for Detecting Plot posted June 15, 2014.

For the past few years, I have been exploring the relationship between sentiment and plot shape in fiction. Earlier today I posted an R package titled “syuzhet” to github. The package is designed to extract sentiment and plot information from prose. Methods for text import, sentiment extraction, and plot arc modeling are described in the documentation and in the package vignette. What follows below is a blog-friendly version of a longer academic paper describing how I employed this package to study plot in a corpus of ~50,000 novels.

noisy

Background

When I began the research that led to this package, my goal was to study positive and negative emotions in literature across time, much in the same way that I had studied style and theme in Macroanalysis. Along the way, however, I discovered that fluctuations in sentiment can serve as a rather natural proxy for fluctuations in plot movement. Studying plot shifts via sentiment analysis turned out to be a far more interesting project than the simple study of sentiment, and my research got a huge boost when I stumbled upon a video of Kurt Vonnegut describing plot in precisely these terms.

After seeing the video and hearing Vonnegut’s opening challenge (“There’s no reason why the simple shapes of stories can’t be fed into computers”), I set out to develop a systematic way of extracting plot arcs from fiction. I felt this might help me to better understand and visualize how narrative is constructed. The fundamental idea, of course, was nothing new. What I was after is what the Russian formalist Vladimir Propp had defined as the narrative’s syuzhet (the organization of the narrative) as opposed to its fabula (raw elements of the story).

Syuzhet is concerned with the linear progression of narrative from beginning (first page) to the end (last page), whereas fabula is concerned with the specific events of a story, events which may or may not be related in chronological order. When we study fabula, which is what we typically do in literature courses, we mentally reconstruct the events into chronological order. We hope that this reconstruction of the fabula will help us understand the experience of the characters, the core story, etc. When we study the syuzhet, we are not so much concerned with the order of the fictional events but specifically interested in the manner in which the author presents those events to readers.

Consider the technique that radio personality Paul Harvey used in his iconic radio show “The Rest of the Story.” In each story, Harvey would hold back certain key elements until the very end of the program. The narrative would appear to have reached its conclusion, and then Harvey would say, “and now, the rest of the story.” At this point, he would reveal the held back information and the listener would reconstruct the entire fabula. The effect (and affect) of Harvey’s technique, the syuzhet, was usually stunning and pleasantly surprising. Had the story been told in simple chronological order, it would have been bland, perhaps even boring. What gave Harvey’s show power was his narrative technique.

This power was largely derived from the organization of the narrative elements and the manner in which Harvey offered listeners clues and then used narrative and language to evoke both curiosity and emotional response. What Harvey said and how he said it were critical elements to the overall effect of the story. Harvey’s success was in finding and mastering a particular style of plot, a plot that has much in common with those found in mystery and detective fiction. A series of clues is presented alongside a series of misdirections and the mystery is ultimately resolved in some grand reveal that defies expectations.

A Finite Number of Plots

But this Harvey method is just one among many possible plots. Countless scholars and non-scholars have pontificated about the possibility of a finite set of fundamental or archetypal plot shapes.

One of the more recent and famous/infamous of these scholars is Christopher Booker, whose 2004 book, titled The Seven Basic Plots: Why We Tell Stories, argues for a Jungian inspired understanding of plot in terms of seven basic archetypes. Booker’s work appears to be strongly influenced by prior work describing plot in terms of conflict. These core conflicts will be familiar to students of literature: such constructions were once taught to us under the headings of “man vs. man,” “man against nature,” “man vs. society,” and so on.

Other scholars have offered other numbers. William Foster-Harris has argued in favor of three basic patterns of plot (Foster-Harris, William. The Basic Patterns of Plot. University of Oklahoma Press, 1959.); Ronald B. Tobias has argued for twenty (Tobias, Ronald B. 20 Master Plots. Cincinnati: Writer’s Digest Books, 1993.); and Georges Polti claims that there are thirty-six (The Thirty-Six Dramatic Situations. Trans. Lucille Ray). So the story goes.

All of these discussions about plot typically involve some discussion of a story’s central conflict. But discussions of conflict are more appropriately classified as fabula. Nevertheless, many of these same discussions also explore the flow, or trajectory, of the narrative, and these I consider to be appropriately categorized as syuzhet. Often these discussions of plot engage visualization in order to convey the “movement” of the narrative. Perhaps the best example of this is the one offered by Vonnegut.

poa

A Significant Problem

Still, all of these explanations of plot suffer from a significant problem: a lack of data. Each of these proposed taxonomies suffers from anecdotalism. Vonnegut draws the plot of Cinderella for us on his chalkboard, and we can imagine a handful of similar plot shapes. He describes another plot and names it “man in hole,” and we can imagine a few similar stories. But our imaginations are limited.

This limitation led me to think hard about the problem of how to compare, mathematically and computationally, the shape of one story to another. Assuming I could use computers and some NLP magic to extract plot shape from narrative (see A Novel Method for Detecting Plot), it would still be impossible to compare one shape to another because of the simple fact that stories are not the same length. Vonnegut solved this problem by creating an x-axis that runs from B to E, that is, from beginning to end. What Vonnegut did not solve, however, was the real computational problem of text length.

It was tempting to consider simply breaking each book into ten or one-hundred equally sized pieces and then taking measurements of the mean emotional valence in each chunk.

poa.percent

Unfortunately, some of the books would have much larger chunks and with larger chunks would come the possibility of more and more diverse valence markers. What happens, in fact, is that larger chunks of text tend to have a preponderance of both positive and negative valence markers. The end result is that all the means end up very close to neutral on the y-axis of emotional valence. Indeed, books as a whole tend to have a mean valence close to zero on a scale of -1 to 1. (I tested this by calculating the mean valence for 3500 novels in my nineteenth century novels corpus and then plotting the results as a histogram. The distribution showed a clustering around zero with very few books on the extremes.)
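A toy example shows the flattening effect. The Python sketch below (illustrative only, with invented sentence scores and a made-up function name) splits a sequence of sentiment values into equal-sized chunks and averages each chunk. With small chunks some contrast survives; with large chunks the positives and negatives cancel.

```python
def chunk_means(values, n_chunks):
    """Split `values` into n_chunks nearly equal runs and average each."""
    if not 1 <= n_chunks <= len(values):
        raise ValueError("n_chunks must be between 1 and len(values)")
    means = []
    for i in range(n_chunks):
        start = i * len(values) // n_chunks
        end = (i + 1) * len(values) // n_chunks
        chunk = values[start:end]
        means.append(sum(chunk) / len(chunk))
    return means

# Made-up sentence-level sentiment scores for a very short "book".
sentiment = [1, -1, 1, 1, -1, -1, 1, -1, -1, -1, 1, 1]
small_chunks = chunk_means(sentiment, 6)
large_chunks = chunk_means(sentiment, 2)
print(small_chunks)
print(large_chunks)
```

In this contrived case the two large chunks both average out to exactly zero, even though the underlying sentences swing between positive and negative throughout.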

So, I needed a way to deal with length. I needed a way to compare the shapes of the stories regardless of the length of the novels. Luckily, since coming to UNL, I’ve become acquainted with a physicist who is one of the team of scientists who recently discovered the Higgs Boson at CERN. Over coffee one afternoon, this physicist, Aaron Dominguez, helped me figure out how to travel through narrative time.

A Solution

Aaron introduced me to a mathematical formula from signal processing called the Fourier transformation. The Fourier transformation provides a way of decomposing a time based signal and reconstituting it in the frequency domain. A complex signal (such as the one seen above in the first figure in this post) can be decomposed into a series of symmetrical waves of varying frequencies. And one of the magical things about the Fourier equation is that these decomposed component sine waves can be added back together (summed) in order to reproduce the original wave form–this is called a backward or reverse transformation. Fourier provides a way of transforming the sentiment-based plot trajectories into an equivalent data form that is independent of the length of the trajectory from beginning to end. The frequency domain begins to solve the book length problem.

It turns out that not all of these sine waves in the frequency domain are created equal; some play a bigger role in the construction of the original signal. In signal processing, a low-pass filter can be used to remove the background “hiss” in an audio recording, and a similar approach can be used to filter out the extremes in the sentiment trajectories. When a low-pass filter is applied to the sentiment data, it’s possible to filter and thereby smooth out a great deal of the affectual noise.

The filtered data from the frequency domain can then be reconstituted back into the time domain using the reverse transformation. At the same time, the x-axis can be normalized and the foundation shape of the story revealed.
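The three steps (transform, filter, reverse transform onto a normalized axis) can be sketched in a few lines of Python. This is a simplified illustration and not the code in the syuzhet package: it uses a slow textbook Fourier transform, it assumes that “keeping three components” means the mean plus the two lowest frequencies (the package may count differently), and the two “books” are synthetic signals made of one slow arc plus a fast wiggle.

```python
import cmath
import math

def dft(values):
    """Naive discrete Fourier transform (slow, but fine for a sketch)."""
    n = len(values)
    return [sum(v * cmath.exp(-2j * math.pi * k * t / n)
                for t, v in enumerate(values))
            for k in range(n)]

def low_pass_shape(values, keep=3, out_len=100):
    """Keep the `keep` lowest frequencies and resample to `out_len` points."""
    n = len(values)
    if keep < 1 or 2 * (keep - 1) >= n:
        raise ValueError("keep is too large for a signal of this length")
    coeffs = dft(values)
    shape = []
    for i in range(out_len):
        x = i / out_len  # position in normalized narrative time
        total = coeffs[0].real  # frequency zero carries the overall mean
        for k in range(1, keep):
            # For real input each low frequency has a mirror-image partner,
            # so its contribution is twice the real part.
            total += 2 * (coeffs[k] * cmath.exp(2j * math.pi * k * x)).real
        shape.append(total / n)
    return shape

def toy_book(n):
    """Synthetic sentiment: a slow arc plus a fast wiggle, n sentences long."""
    return [2 + math.cos(2 * math.pi * t / n)
            + 0.5 * math.cos(2 * math.pi * 10 * t / n) for t in range(n)]

short_book = low_pass_shape(toy_book(40), keep=3, out_len=10)
long_book = low_pass_shape(toy_book(64), keep=3, out_len=10)
print([round(v, 3) for v in short_book])
print([round(v, 3) for v in long_book])
```

The two synthetic books differ in length (40 and 64 “sentences”), yet after filtering and resampling they land on the same ten points, because the fast wiggle has been discarded and only the slow arc remains.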

foundation

Above you can see the core shape of Joyce’s Portrait revealed using the “bing” method of the get_sentiment function in the syuzhet package. (Check the package documentation and vignette for details on the various options and methods.)

Once a book’s plot trajectory is converted into this normalized space, we no longer have the problem of comparing books of different lengths. Compare the foundation shape of Joyce’s Portrait (above) to Wilde’s Picture of Dorian Gray (below).

wilde

The models reflect the key narrative movements in both of these plots. Young Stephen reaches a low point during and just after the sermon on hell which occurs midway through the narrative. Dorian’s life takes a dark turn as the reality of the portrait becomes apparent. But the full power of these transformed plots does not sit simply in visualization. The values that inform these visualizations can now be compared. In a follow up post, I’ll discuss how I measured and compared 40,000+ plot shapes and then clustered the resulting data in order to reveal six common, perhaps archetypal, plot shapes. . .

Plot Arcs (Schmidt Style)

A few weeks ago Ben Schmidt posted a provocative blog entry titled “Typical TV episodes: visualizing topics in screen time.” It’s worth a careful read. . .

Ben began by topic modeling the closed captioning data from a series of popular TV series and then visualizing the ten most common topics over the time span of each episode. In other words, the x-axis is time, and the y-axis is a measure of topical presence. The end result is something that begins to look a bit like what we could call plot.

Ben followed this post with an even more provocative one on 12/16/14 “Fundamental plot arcs, seen through multidimensional analysis of thousands of TV and movie scripts“. This post led a number of us (Underwood, Mimno, Cherny, etc.) to question what the approach might reveal if applied to novels . . .

In my own recent work, I have been attempting to model plot movement in narrative fiction by analyzing the rise and fall of emotional valence across narrative time. It has been clear to me, however, that my method is somewhat impoverished by a lack of context for the emotions I am measuring; Ben’s topic-based approach to plot structure might be just the context I’m missing, and some correlation analysis might be just the right recipe . . . as usual, Ben has given us a lot to think about—i.e. Happy Holidays!

After following the discussion on Twitter and on Ben’s blog, David Mimno wrote to me about whipping up some of these topical plot lines based on the 500 Topic model that I had built for Macroanalysis. Needless to say, I thought this was a great idea. (David and I had previously revisited my topical data for an article in Poetics.) Within a few hours, David had run the entire collection of 500 topics and produced 500 graphs showing the general behavior of each topic across all of the 3,500 texts in my corpus. You will find the output of David’s work here: http://mimno.infosci.cornell.edu/novels/plot.html

In David’s short introductory paragraph, he calls our attention to two specific topic graphs, one for the topic labeled “school” and another labeled “punishment.” You can find my graphs for these two topics here (school) and here (punishment). In referencing these two plots, David calls our attention to one topic (school) that appears prominently at the beginnings of novels in this corpus (think Bildungsroman, perhaps?) and another topic (punishment) that tends to be prominent at the end of novels (think Newgate novels or Oliver Twist, perhaps?).

Like the data from Ben, this data David has mined from my 19th century novels topic model is incredibly rich and demands deeper inspection. I’ve only begun to digest it in bits, but I do observe that a lot of topics carrying negative valence seem to rise over the course of narrative time. This makes intuitive sense if we believe that the central conflict of a novel must grow more intense as the novel progresses. The exciting thing to do next is to move from the macro to the micro scale and look at the individual novels within this larger context. Perhaps we’ll be able to identify archetypal patterns and then observe which novels stick to the archetypes and which digress. . . what a feast!

Luckily we have a whole new year to indulge!

NHC Summer Institutes in Digital Humanities

I’m pleased to announce that Willard McCarty and I are leading a two-year set of summer institutes in digital humanities at the National Humanities Center. Here is the official announcement:

“The first of the National Humanities Center’s summer institutes in digital humanities, devoted to digital textual studies, will convene for two one-week sessions, first in June 2015 and again in 2016. The objective of the Institute in Digital Textual Studies is to develop participants’ technological and scholarly imaginations and to combine them into a powerful investigative instrument. Led by Willard McCarty (King’s College London and University of Western Sydney) and Matthew Jockers (University of Nebraska), the Institute aims to further the development of individual as well as collaborative projects in literary and textual studies. The Institute will take place in Chapel Hill, North Carolina, in 2015 and at the National Humanities Center in Research Triangle Park, North Carolina, in 2016.”

The first workshop will take place June 8 – 12. Applications are now open. See http://nationalhumanitiescenter.org/digital-humanities/application.html

NHC Flyer

Reading Macroanalysis: The Hard Way!

This past November, Judge Denny Chin ruled to dismiss the Authors Guild’s case against Google; the Guild vowed they would appeal the decision and two months ago their appeal was submitted. I’ll leave it to my legal colleagues to discuss the merit (or lack) in the Guild’s various arguments, but one thing I found curious was the Guild’s assertion that 78% of every book is available, for free, to visitors to the Google Books pages.

According to the Guild’s appeal:

Since 2005, Google has displayed verbatim text from copyrighted books on these pages. . . Google generally divides each page image into eighths, which it calls “snippets.”. . . Once a user retrieves a book through her initial search, she can enter any other search terms she chooses, and the author’s verbatim words will be displayed in three snippets for each search. Although Google has stated that any given search by a user “only” displays three snippets of each book, a single user can view far more than three snippets from a Library Project book by performing multiple searches using different terms, including terms suggested by Google. . . Even minor variations in search terms will yield different displays of text. . . Google displays snippets from each book, except that it withholds display of 10% of the pages in each book and of one snippet per page. . .Thus, Google makes the vast majority of the text of these books—in all, 78% of each work—available for display to its users.

I decided to test the Guild’s assertion, and what better book to use than my own: Macroanalysis: Digital Methods and Literary History.

In the “Preview,” Google displays the front matter (table of contents, acknowledgements, etc.) followed by the first 16 pages of my text. I consider this tempting pabulum for would-be readers and within the bounds of fair use, not to mention free advertising for me. The last sentence in the displayed preview is cut off; it ends as follows: “We have not yet seen the scaling of our scholarly questions in accordance with the massive scaling of digital content that is now. . . ” Thus ends page 16 and thus ends Google’s preview.

According to the Authors Guild, however, a visitor to this book page can access much more of the book by using a clever method of keyword searching. What the Guild does not tell us, however, is just how impractical and ridiculous such searching is. But that is my conclusion and I’m getting ahead of myself here. . .

To test the Guild’s assertion, I decided to read my book for free via Google Books. I began by reading the material just described above, the front matter and the first 16 pages (very exciting stuff, BTW). At the end of this last sentence, it is pretty easy to figure out what the next word would be; surely any reader of English could guess that the next word, after “. . .scaling of digital content that is now. . . ” would be the word “available.”

Just to be sure, though, I double-checked that I was guessing correctly by consulting the print copy of the book. Crap! The next word was not “available.” The full sentence reads as follows: “We have not yet seen the scaling of our scholarly questions in accordance with the massive scaling of digital content that is now held in twenty-first-century digital libraries.”

Now why is this mistake of mine important to note? Reading 78% of my book online, as the Guild asserts, requires that the reader anticipate what words will appear in the concealed sections of the book. When I entered the word “available” into the search field, I was hoping to get a snippet of text from the next page, a snippet that would allow me to read the rest of the sentence. But because I guessed wrong, I in fact got non-contiguous snippets from pages 77, 174, 72, 9, 56, 15, 37, 162, 8, 4, 80, 120, 154, 46, 133, 79, 27, 97, 147, and 17, in that order. These are all the pages in the book where I use the word “available” but none include the rest of the sentence I want to read. Ugh.

Fortunately, I have a copy of the full text on my desk. So I turn to page 17 and read the sentence. Aha! I now conduct a search for the word “held.” This search results in eight snippets; the last of these, as it happens, is the snippet I want from page 17. This new snippet contains the next 42 words. The snippet is in fact just the end of the incomplete sentence from page 16 followed by another incomplete sentence ending with the words: “but we have not yet fully articulated or explored the ways in which. . . ”

So here I have to admit that I’m the author of this book, and I have no idea what follows. I go back to my hard copy to find that the sentence ends as follows: “. . . these massive corpora offer new avenues for research and new ways of thinking about our literary subject.”

Without the full text by my side, I’d be hard pressed to come up with the right search terms to get the next snippet. Luckily I have the original text, so I enter the word “massive” hoping to get the next contiguous snippet. Six snippets are revealed; the last of these includes the sentence I was hoping to find and read. After the word “which,” I am rewarded with “these massive corpora offer new avenues for” and then the snippet ends! Crap, I really want to read this book for free!

So I think to myself, “what if instead of trying to guess a keyword from the next sentence, I just use a keyword from the last part of the snippet?” “Avenues” seems like a good candidate, so I plug it in. Crap! The same snippet is shown again. Looks like I’m going to have to keep guessing. . .

Let’s see, “new avenues for. . . ” perhaps new avenues for “research”? (Ok, I’m cheating again by going back to the hard copy on my desk, but I think a savvy user determined to read this book for free might guess the word “research”). I plug it in. . . 38 snippets are returned! I scroll through them and find the one from page 17. The key snippet now includes the end of the sentence: “research and new ways of thinking about our literary subject.”

Now I’m making progress. Unfortunately, I have no idea what comes next. Not only is this the end of a sentence, but it looks like it might be the end of a paragraph. How to read the next sentence? I try the word “subject” and Google simply returns the same snippet again (along with assorted others from elsewhere in the book). So I cheat again and look at my copy of the book. I enter the word “extent” which appears in the next sentence. My cheating is rewarded and I get most of the next sentence: “To some extent, our thus-far limited use of digital content is a result of a disciplinary habit of thinking small: the traditionally minded scholar recognizes value in digital texts because they are individually searchable, but this same scholar, as a. . . ”

Thank goodness I have tenure and nothing better to do!

The next word is surely the word “result,” which I now dutifully enter into the search field. Among the 32 snippets that the search returns, I find my target snippet. I am rewarded with a copy of the exact same snippet I just saw with no additional words. Crap! I’m going to have to be even more clever if I’m going to game this system.

Back to my copy of the book I turn. The sentence continues “as a result of a traditional training,” so I enter the word “traditional,” and I’m rewarded with . . . the same damn passage again! I have already seen it twice, now thrice. My search for the term “traditional” returns a hit for “traditionally” in the passage I have already seen and, importantly, no hit for the instance of “traditional” that I know (from reading the copy of the book on my desk) appears in the next line. How about “training,” I wonder. Nothing! Clearly Google is on to me now. I get results for other instances of the word “training” but not for the one that I know appears in the continuation of the sentence I have already seen.

Well, this certainly is reading Macroanalysis the hard way. I’ve now spent 30 minutes to gain access to exactly 100 words beyond what was offered in the initial preview. And, of course, my method involved having access to the full text! Without the full text, I don’t think such a process of searching and reading is possible, and if it is possible, it is certainly not feasible!

But let’s assume that a super savvy text pirate, with extensive training in English language syntax, could guess the right words to search and then perform at least as well as I did using a full text version of my book as a crutch. My book contains, roughly, 80,000 words. Not counting the ~5k offered in the preview, that leaves 75,000 words to steal. At a rate of 200 words per hour, it would take this super savvy text pirate 375 hours to reconstruct my book. That’s about 47 days of full-time, eight-hour work.

I get it. Times are tough and some folks simply need to steal books from snippet view because they can’t afford to buy them. I’m sympathetic to these folks; they need to satisfy their intense passion for reading and knowledge and who could blame them? Then again, if we consider the opportunity cost at $7.25 per hour (the current minimum wage), then stealing this book from snippet view would cost a savvy text pirate $2,718.75 in lost wages (375 hours at $7.25 per hour). The eBook version of my text, linked to from the Google Books web page, sells for $14.95. Hmmm?
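For anyone who wants to check the back-of-the-envelope arithmetic, here is a minimal sketch in Python. The variable and function names are mine, and the inputs are just the rough figures quoted above (about 80,000 words, about 5,000 in the preview, 200 words per hour, an eight-hour day, $7.25 per hour).

```python
# Rough figures quoted in the post; all names here are illustrative only.
TOTAL_WORDS = 80_000      # approximate length of the book
PREVIEW_WORDS = 5_000     # approximate length of the free preview
WORDS_PER_HOUR = 200      # 100 words per 30 minutes of snippet hunting
HOURS_PER_DAY = 8         # a full-time working day
MIN_WAGE = 7.25           # dollars per hour
EBOOK_PRICE = 14.95       # dollars


def piracy_cost(total_words, preview_words, words_per_hour, hours_per_day, wage):
    """Return (hours, working days, lost wages) for reading via snippets."""
    words_left = total_words - preview_words
    hours = words_left / words_per_hour
    days = hours / hours_per_day
    lost_wages = hours * wage
    return hours, days, lost_wages


hours, days, lost_wages = piracy_cost(
    TOTAL_WORDS, PREVIEW_WORDS, WORDS_PER_HOUR, HOURS_PER_DAY, MIN_WAGE
)
print(f"{hours:.0f} hours, about {days:.0f} working days")
print(f"${lost_wages:,.2f} in lost wages versus ${EBOOK_PRICE:.2f} for the eBook")
```

Changing any one of the inputs (a faster pirate, a shorter book) shows how far the rate would have to rise before snippet reading beats simply buying the book.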

A Novel Method for Detecting Plot

While studying anthropology at the University of Chicago, Kurt Vonnegut proposed writing a master’s thesis on the shape of narratives. He argued that “the fundamental idea is that stories have shapes which can be drawn on graph paper, and that the shape of a given society’s stories is at least as interesting as the shape of its pots or spearheads.” The idea was rejected.

In 2011, Open Culture featured a video in which Vonnegut expanded on this idea and suggested that computers might someday be able to model the shape of stories, that is, the movement of the narratives, the plots. The video is about four minutes long; it’s worth watching.

About the same time that I discovered this video, I was working on a project in which I was applying the tools and techniques of sentiment analysis to works of fiction.[1] Initially I was interested in tracing the evolution of emotional content in novels over the course of the 19th century. By accident I discovered that the sentiment I was detecting and measuring in the fiction could be used as a highly accurate proxy for plot movement.

Joyce’s Portrait of the Artist as a Young Man is a story that I know fairly well. Once upon a time a moo cow came down along the road. . .and so on . . .

Here is the shape of Portrait of the Artist as a Young Man that my computer drew based on an analysis of the sentiment markers in the text:

poa1

If you are familiar with the plot, you’ll readily see that the computer’s version of the story is accurate. As it happens, I was teaching Portrait last fall, so I projected this image onto the white board and asked my students to annotate it. Here are a few of the high (and low) points that we identified.

poa2

Because the x-axis represents the progress of the narrative as a percentage, it is easy to move from the graph to the actual pages in the text, regardless of the edition one happens to be using. That’s precisely what we did in the class. We matched our human reading of the book with the points on the graph on a page-by-page basis.
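Moving from the percentage axis to a page is simple arithmetic. The sketch below is only an illustration of the idea, not code from my own workflow; the function name and the page counts in the example are hypothetical, and it assumes the story is spread evenly across the numbered pages of whatever edition you happen to have.

```python
import math


def percent_to_page(percent, total_pages):
    """Map a point on the 0-100 narrative-time axis to a 1-based page number.

    Assumes the text is spread evenly over `total_pages` pages.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    if total_pages < 1:
        raise ValueError("total_pages must be at least 1")
    # ceil() so that any position inside a page maps to that page;
    # max() so that position 0 maps to page 1 rather than page 0.
    return max(1, math.ceil(percent / 100 * total_pages))


# The same point in the story lands on different pages in different editions.
for pages in (250, 300):  # hypothetical page counts
    print(pages, "page edition:", "page", percent_to_page(50, pages))
```

Front matter, notes, and differing type sizes mean the result is an approximate starting point for flipping through the book, not an exact citation.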

Here is a graph from another Irish novel that you might know; this is Wilde’s Picture of Dorian Gray.

dorian1

If you remember the story, you’ll see how well this plot line models the movement of the story. Discovering the accuracy of these graphs was quite thrilling.

This next image shows Dan Brown’s blockbuster novel The Da Vinci Code. Notice how much more regular the fluctuations are. This is the profile of a page turner. Notice too how the more generalized blue trend line hovers above neutral in terms of its emotional valence. Dan Brown never lets the plot become too troubled or too much of a downer. He baits us and teases us with fluctuating emotion.

brown1

Now compare Da Vinci Code to one of my favorite contemporary novels, Cormac McCarthy’s Blood Meridian. Blood Meridian is a dark book and the more generalized blue trend line lingers in the realms of negative emotion throughout the text; it is a very different book from The Da Vinci Code.[2]

mccarthy1

I won’t get into the precise details of how I am measuring emotional valence in these books here.[3] It’s a bit too complicated for an already too long blog post. I will note, however, that the process involves two major components: a controlled vocabulary of positive and negative sentiment markers collected by Bing Liu of the University of Illinois at Chicago and a machine model that I trained to identify and score passages as positive or negative.
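To give a flavor of the first component only, here is a deliberately simplified sketch of dictionary-based scoring. It is not the method used to draw the graphs above: the two tiny word sets are made up for illustration and are not Bing Liu's actual lists, the function names are hypothetical, and the trained model and any smoothing are left out entirely.

```python
import re

# Tiny made-up lexicon for illustration; NOT Bing Liu's actual word lists.
POSITIVE = {"good", "happy", "love", "joy"}
NEGATIVE = {"bad", "sad", "hate", "fear"}


def sentence_valence(sentence):
    """Count positive markers minus negative markers in one sentence."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)


def valence_trajectory(text):
    """Return (percent of narrative time, valence) pairs, one per sentence."""
    # Naive sentence splitting on end punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    n = len(sentences)
    return [
        (100 * (i + 1) / n, sentence_valence(s))
        for i, s in enumerate(sentences)
    ]


sample = "I love this good day. Then fear and hate arrived. It was fine."
for percent, valence in valence_trajectory(sample):
    print(f"{percent:6.1f}%  {valence:+d}")
```

Raw sentence scores like these are very noisy; a shape only becomes visible once they are averaged or smoothed over longer stretches of text.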

In a follow-up post, I’ll describe how I normalized the plot shapes in 40,000 novels in order to compare the shapes and discover what appear to be six archetypal plots!

NOTES:
[1] In the field of natural language processing there is an area of research known as sentiment analysis or, sometimes, opinion mining. And when our colleagues engage in this kind of work, they very often focus their study on a highly stylized genre of non-fiction: the review, specifically movie reviews and product reviews. The idea behind this work is to develop computational methods for detecting what we, literary folk, might call mood, or tone, or sentiment, or perhaps even refer to as affect. The psychologists prefer the word valence, and valence seems most appropriate to this research of mine because the psychologists also like to measure degrees of positive and negative valence. I am not aware of anyone working in sentiment analysis who is specifically interested in modeling emotional valence in fiction. In fact, the great majority of work in this field is so far removed from what we care about in literary studies that I spent about six months simply wondering whether or not the methods developed by folks trying to gauge opinions in movie reviews could even be usefully employed in studies of literature.
[2] I gained access to some of these novels through a data transfer agreement made between the University of Nebraska and a private company that is no longer in business. See Unfolding the Novel.
[3] I’m working on a longer and more formal version of this research report for publication. The longer version will include all the details of the methodology. Stay Tuned:-)

So What?

Over the past few days, several people have written to ask what I thought about the article by Adam Kirsch in New Republic (“Technology Is Taking Over English Departments: The false promise of the digital humanities”). In short, I think it lacks insight and new knowledge. But, of course, that is precisely the complaint that Kirsch levels against the digital humanities. . .

Several months ago, I was interviewed for a story about topic modeling to appear in the web publication Nautilus. The journalist, Dana Mackenzie, wanted to dive into the “so what” question and ask how my quantitative and empirical methods were being received by literary scholars and other humanists. He asked the question bluntly because he’d read the Stanley Fish blog in the NYT and knew already that there was some pushback from the more traditional among us. But honestly, this is not a question I spend much time thinking about, so I referred Dana to my UNL colleague Steve Ramsay and to Matthew Kirschenbaum at the University of Maryland. They have each addressed this question formally and are far more eloquent on the subject than I am.

What matters to me, and I think what should matter to most of us, is the work itself, and I believe, perhaps naively, that the value of the work is, or should be, self-evident. The answer to the question of “so what?” should be obvious. Unfortunately, it is not always obvious, especially to readers like Kirsch who are not working in the subfields of this massive big tent we have come to call “digital humanities” (and for the record, I do despise that term for its lack of specificity). Kirsch and others inevitably gravitate to the most easily accessible and generalized resources, often avoiding or missing some of the best work in the field.

“So what?” is, of course, the more informal and less tactful way of asking what one sometimes hears (or might wish to hear) asked after an academic paper given at the Digital Humanities conference, e.g. “I was struck by your use of latent Dirichlet allocation, but where is the new knowledge gained from your analysis?”

But questions such as this are not specific to digital humanities (“I was struck by your use of Derrida, but where is the new knowledge gained from your analysis?”). In a famous essay, Claude Levi-Strauss asked “so what” after reading Vladimir Propp’s Morphology of the Folktale. If I understand Levi-Strauss correctly, the beef with Propp is that he never gets beyond the model; Propp fails to answer the “so what” question. To his credit, Levi-Strauss gives Propp props for revealing the formal model of the folktale when he writes that: “Before the epoch of formalism we were indeed unaware of what these tales had in common.”

But then, in the very next sentence, Levi-Strauss complains that Propp’s model fails to account for content and context, and so we are “deprived of any means of understanding how they differ.”

“The error of formalism,” Levi-Strauss writes, is “the belief that grammar can be tackled at once and vocabulary later.” In short, the critique is simply that Propp did not move beyond observing what is and into interpreting what that thing means (Propp 1984).

To be fair, I think that Levi-Strauss gave Propp some credit and took Propp’s work as a foundation upon which to build more nuanced layers of meaning. Propp identified a finite set of 31 functions that could be identified across narratives; Levi-Strauss wished to say something about narratives within their cultural and historical context. . .

This is, I suppose, the difference between discovering DNA and making DNA useful. But bear in mind that the one ever depends upon the other. Leslie Pray writes about the history of DNA in a Nature article from 2008:

Many people believe that American biologist James Watson and English physicist Francis Crick discovered DNA in the 1950s. In reality, this is not the case. Rather, DNA was first identified in the late 1860s by a Swiss chemist. . . and other scientists . . . carried out . . . research . . . that revealed additional details about the DNA molecule . . . Without the scientific foundation provided by these pioneers, Watson and Crick may never have reached their groundbreaking conclusion of 1953.

(Pray 2008)

I suppose I take exception to the idea that the kind of work I am engaged in, because it is quantitative and methodological, because it seeks first to define what is, and only then to describe why that which is matters, must meet some additional criteria of relevance.

There is often a double standard at work here. The use of numbers (computers, quantification, etc.) in literary studies often triggers a knee-jerk reaction. When the numbers come out, the gloves come off.

When discussing my work, I am sometimes asked whether the methods and approaches I advocate and employ succeed in bringing new knowledge to our study of literature. My answer is a firm and resounding “yes.” At the same time, I need to emphasize that computational work in the humanities can be simply about testing, rejecting, or reconfirming what we think we already know. And I think that is a good thing!

During a lecture about macro-patterns of literary style in the 19th century novel, I used the example of Moby Dick. I reported how in terms of style and theme Moby Dick is a statistical mutant among a corpus of 1000 other 19th century American novels. A colleague raised his hand and pointed out that literary scholars already know that Moby Dick is an aberration. Why bother computing a new answer to a question for which we already have an answer?

My colleague’s question says something about our scholarly traditions in the humanities. It is not the sort of question that one would ask a physicist after a lecture confirming the existence of the Higgs Boson! It is, at the same time, an ironic question; we humanists have tended to favor a notion that literary arguments are never closed!

In other words, do we really know that Moby Dick is an aberration? Could a skillful scholar/humanist/rhetorician argue the counterpoint? I think that the answer to the first question is “no” and the second is “yes.” Maybe Moby Dick is only an outlier in comparison to the other twenty or thirty American novels that we have traditionally studied alongside Moby Dick?

My point in using Moby Dick was not to pretend that I had discovered something new about the position of the novel in the American literary tradition, but rather to bring new evidence and a new perspective to the matter and in this case fortify the existing hypothesis.

If quantitative evidence happens to confirm what we have come to believe using far more qualitative methods, I think that new evidence should be viewed as a good thing. If the latest Mars rover returns more evidence that the planet could have once supported life, that new evidence would be important and welcomed. True, it would not be as shocking or exciting as the first discovery of microbes on Mars, or the first discovery of ice on Mars, but it would be viewed as important evidence nevertheless, and it would add one more piece to a larger puzzle. Why should a discussion of Moby Dick’s place in literary history be any different?

In short, computational approaches to literary study can provide complementary evidence, and I think that is a good thing.

Computational approaches can also provide contradictory evidence, evidence that challenges our traditional, impressionistic, or anecdotal theories.

In 1990 my dissertation adviser, Charles Fanning, published an excellent book titled The Irish Voice in America. It remains the definitive text in the field. In that book he argued for what he called a “lost generation” of Irish-American writers in the period from 1900 to 1930. His research suggested that Irish-American writing in this period declined, and so he formed a theory about this lost generation and argued that adverse social forces led Irish-Americans away from writing about the Irish experience.

In 2004, I presented new evidence about this period in Irish-American literary history. It was quantitative evidence showing not just why Fanning had observed what he had observed but also why his generalizations from those observations were problematic. Charlie was in the audience that day and after my lecture he came up to say hello. It was an awkward moment, but to my delight, Charlie smiled and said, “it was just an idea.” His social theory was his best guess given the evidence available in 1990, and he understood that.

My point is to say that in this case, computational and quantitative methods provided an opportunity for falsification. But just because such methods can provide contradiction or falsification, we must not get caught up in a numbers game where we only value the testable ideas. Some problems lend themselves to computational or quantitative testing; others do not, and I think that is a fine thing. There is a lot of room under the big tent we call the humanities.

And finally, these methods I find useful to employ can lead to genuinely new discoveries. Computational text analysis has a way of bringing into our field of view certain details and qualities of texts that we would miss with just the naked eye (as John Burrows and Julia Flanders have made clear). I like to think that the “Analysis” section of Macroanalysis offers a few such discoveries, but maybe Mr. Kirsch already knew all that? For a much simpler example, consider Patrick Juola’s recent discovery that J. K. Rowling was the author of The Cuckoo’s Calling, a book Rowling wrote under the pseudonym Robert Galbraith. I think Juola’s discovery is a very good thing, and it is not something that we already knew. I could cite a number of similar examples from research in stylometry, but this example happens to be accessible and appealing to a wide range of non-specialists: just the sort of simple folk I assume Kirsch is attempting to persuade in his polemic against the digital humanities.

Works Cited:

Propp, Vladimir. Theory and History of the Folktale. Trans. Ariadna Y. Martin and Richard Martin. Edited by Anatoly Liberman. University of Minnesota Press, 1984. 180

Pray, L. (2008) Discovery of DNA structure and function: Watson and Crick. Nature