Annie Swafford has raised a couple of interesting points about how the syuzhet package works to estimate the emotional trajectory in a novel, a trajectory which I have suggested serves as a handy proxy for plot (in the spirit of Kurt Vonnegut).
Annie expresses some concern about the level of precision the tool provides and suggests that dictionary based methods (such as the three I include as options in syuzhet) are not reliable. She writes “Sentiment analysis based solely on word-by-word lexicon lookups is really not state-of-the-art at all.” That’s fair, I suppose, but those three lexicons are benchmarks of some importance, and they deserve to be included in the package if for no other reason than for comparison. Frankly, I don’t think any of the current sentiment detection methods are especially reliable. The Stanford tagger has a reputation for being the main contender for the title of “best in the open source market,” but even it hovers around 80–83% accuracy. My own tests have shown that performance depends a good deal on genre/register.
But Annie seems especially concerned about the three dictionary methods in the package. She writes “sentiment analysis as it is implemented in the syuzhet package does not correctly identify the sentiment of sentences.” Given that sentiment is a subtle and nuanced thing, I’m not sure that “correct” is the right word here. I’m not convinced there is a “correct” answer when it comes to this question of valence. I do agree, however, that some answers are more or less correct than others and that to be useful we need to be on the closer side. The question to address, then, is whether we are close enough, and that’s a hard one. We would probably find a good deal of human agreement when it comes to the extremes of sentiment, but there are a lot of tricky cases, grey areas where I’m not sure we would all agree. We certainly cannot expect the tool to perform better than a person, so we need some flexibility in our definition of “correct.”
Take, for example, the sentence “I studied at Leland Stanford Junior University.” The state-of-the-art Stanford sentiment parser scores this sentence as “negative.” I think that is incorrect (you are welcome to disagree;-). The “bing” method, that I have implemented as the default in syuzhet, scores this sentence as neutral, as does the “afinn” method (also in the package). The NRC method scores it as slightly positive. So, which one is correct? We could go all Derrida on this sentence and deconstruct each word, unpack what “junior” really means. We could probably even “problematize” it! . . . But let’s not.
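For anyone who wants to try this at home, here is a minimal sketch (assuming the syuzhet package is installed) that scores that sentence with the three dictionary methods mentioned above:

library(syuzhet)
test_sentence <- "I studied at Leland Stanford Junior University."
get_sentiment(test_sentence, method = "bing")   # Bing lexicon lookup
get_sentiment(test_sentence, method = "afinn")  # AFINN lexicon lookup
get_sentiment(test_sentence, method = "nrc")    # NRC lexicon lookup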
What Annie writes about dictionary based methods not being the most state-of-the-art is true from a technical standpoint, but sophisticated methods and complexity do not necessarily correlate with better results. Annie suggests that “getting the Stanford package to work consistently would go a long way towards addressing some of these issues,” but as we saw with the sentence above, simple beat sophisticated, hands down.[1]
Consider another sentence: “Syuzhet is not beautiful.” All four methods score this sentence as positive; even the Stanford tool, which tends to do a better job with negation, says “positive.”
It is easy to find opposite cases where sophisticated wins the day. Consider this more complex sentence: “He was not the sort of man that one would describe as especially handsome.” NRC and Afinn both score this sentence as neutral, Bing scores it as slightly positive, and Stanford scores it as slightly negative. When it comes to negation, the Stanford tool tends to perform a bit better, but not always. The very similar sentence “She was not the sort of woman that one would describe as beautiful” is scored slightly positive by all four methods.
What I have found in my testing is that these four methods usually agree with each other, not exactly but close enough. Because the Stanford parser is very computationally expensive and requires special installation, I focused the examples in the Syuzhet Package Vignette on the three dictionary based methods. All three are lightning fast by comparison, and all three have the benefit of simplicity.
But, are they good enough compared to the more sophisticated Stanford parser?
Below are two graphics showing how the methods stack up over a longer piece of text. The first image shows sentiment using percentage based segmentation as implemented in the get_percentage_values() function.

Four Methods Compared using Percentage Segmentation
The three dictionary methods appear to be a bit closer to one another, but all four methods do create the same basic shape. The next image shows the same data after normalization using the get_transformed_values() function. Here the similarity is even more pronounced.

Four Methods Compared Using Transformed Values
While we could legitimately argue about the accuracy of one sentence here or one sentence there, as Annie has done, that is not the point. The point is to reveal a latent emotional trajectory that represents the general sense of the novel’s plot. In this example, all four methods make it pretty clear what that shape is: This is what Vonnegut called “Man in Hole.”
The sentence level precision that Annie wants is probably not possible, at least not right now. While I am sympathetic to the position, I would argue that for this particular use case, it really does not matter. The tool simply has to be good enough, not perfect. If the overall shape mirrors our sense of the novel’s plot, then the tool is working, and this is the area where I think there is still a lot of validation work to do. Part of the impetus for releasing the package was to allow other people to experiment and report results. I’ve looked at a lot of graphs, but there is a limit to the number of books that I know well enough to be able to make an objective comparison between the Syuzhet graph and my close reading of the book.
This is another place where Annie raises some red flags. Annie calls attention to these two images (below) from my earlier post and complains that the transformed graph is not a good representation of the noisy raw data. She writes:
The full trajectory opens with a largely flat stretch and a strong negative spike around x=1100 that then rises back to be neutral by about x=1500. The foundation shape, on the other hand, opens with a rise, and in fact peaks in positivity right around where the original signal peaks in negativity. In other words, the foundation shape for the first part of the book is not merely inaccurate, but in fact exactly opposite the actual shape of the original graph.
Annie’s reading of the graphs, though, is inconsistent with the overall plot of the novel, whereas the transformed plot is perfectly consistent with it. What Annie calls a “strong negative spike” is the scene in which Stephen is pandied by Father Dolan. It is an important negative moment, to be sure, but not nearly as important, or as negative, as the major dip that occurs midway through the novel, when Stephen experiences Hell. The scene with Dolan is a minor blip compared to the pages and pages of hell and the pages and pages of anguish Stephen experiences before his confession.

Annie is absolutely correct in noting that there is information loss, but wrong in arguing that the graph fails to represent the novel. The tool has done what it was designed to do: it successfully reveals the overall shape of the narrative. The first third of the novel and the last third of the novel are considerably more positive than the middle section. But this is not meant to say or imply that the beginning and end are without negative moments.
It is perfectly reasonable to want to see more of the page-to-page, or scene-by-scene, fluctuations in sentiment, and that can be easily achieved by using the percentage segmentation method or by altering the low-pass filter size. Changing the filter size to retain five components instead of three results in the graph below. This new graph captures that “strong negative spike” (not so “strong” compared to hell) and reveals more of the novel’s ups and downs. This graph also provides more detail about the end of the novel, where Stephen comes down off his bird-girl high and moves toward a more sober perspective on his future.

Portrait with Five Components
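For those who want to reproduce a five-component graph of their own, here is a minimal sketch using the Bing values (the full four-method version simply repeats the pattern from the code at the end of this post); it assumes bing_sent has already been computed from the sentence-level text as shown there:

bing_trans_5 <- get_transformed_values(
  bing_sent,
  low_pass_size = 5,     # retain five components instead of the default three
  x_reverse_len = 100,
  scale_vals = TRUE,
  scale_range = FALSE
)
plot(
  bing_trans_5,
  type = "l",
  main = "Portrait, Bing Method with Five Low-Pass Components",
  xlab = "Narrative Time",
  ylab = "Emotional Valence"
)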
Of course, the other reason for releasing the code is so that I can get suggestions for improvements. Annie (and a few others) have already prompted me to tweak several functions. Annie found (and reported on her blog) some legitimate flaws in the openNLP sentence parser. When it comes to passages with dialog, the openNLP parser falls down on the job. I ran a few dialog tests (including Annie’s example) and was able to fix the great majority of the sentence parsing errors by simply stripping out the quotation marks in advance. Based on Annie’s feedback, I’ve added a “quote stripping” parameter to the get_sentences() function. It’s all freshly baked and updated on GitHub.
But finally, I want to comment on Annie’s suggestion that
some texts use irony and dark humor for more extended periods than you [that’s me] suggest in that footnote—an assumption that can be tested by comparing human-annotated texts with the Syuzhet package.
I think that would be a great test, and I hope that Annie will consider working with me, or in parallel, to test it. If anyone has any human annotated novels, please send them my/our way!
Things like irony, metaphor, and dark humor are the monsters under the bed that keep me up at night. Still, I would not have released this code without doing a little bit of testing:-). These monsters can indeed wreak a bit of havoc, but usually they are all shadow and no teeth. Take the colloquial expression “That’s some bad R code, man.” This sentence is supposed to mean the opposite, as in “That is a fine bit of R coding, sir.” This is a sentence the tool is not likely to get right; but, then again, this sentence also messes up my young daughter, and it tends to confuse English language learners. I have yet to find any sustained examples of this sort of construction in typical prose fiction, and I have made a fairly careful study of the emotional outliers in my corpus.
Satire, extended satire in particular, is probably a more serious monster. Still, I would argue that the sentiment tools perform exactly as expected; they just don’t understand what they are “reading” in the way that we do. Then again, and this is no fabrication, I have had some (as in too many) college students over the years who haven’t understood what they were reading and thought that Swift was being serious about eating succulent little babies in his Modest Proposal (those kooky Irish)!
So, some human beings interpret the sentiment in Modest Proposal exactly as the sentiment parser does, which is to say, literally! (Check out the special bonus material at the bottom of this post for a graph of Modest Proposal.) I’d love to have a tool that could detect satire, irony, dark humor and the like, but such a tool is still a good ways off. In the meantime, we can take comfort in incremental progress.
Special thanks to Annie Swafford for prompting a stimulating discussion. Here is all the code necessary to repeat the experiments discussed above. . .
library(syuzhet)
path_to_a_text_file <- system.file("extdata", "portrait.txt",
                                   package = "syuzhet")
joyces_portrait <- get_text_as_string(path_to_a_text_file)
poa_v <- get_sentences(joyces_portrait)
# Get the four sentiment vectors
stanford_sent <- get_sentiment(poa_v, method="stanford", "/Applications/stanford-corenlp-full-2014-01-04")
bing_sent <- get_sentiment(poa_v, method="bing")
afinn_sent <- get_sentiment(poa_v, method="afinn")
nrc_sent <- get_sentiment(poa_v, method="nrc")
######################################################
# Plot them using percentage segmentation
######################################################
plot(
  scale(get_percentage_values(stanford_sent, 10)),
  type = "l",
  main = "Joyce's Portrait Using All Four Methods\n and Percentage Based Segmentation",
  xlab = "Narrative Time",
  ylab = "Emotional Valence",
  ylim = c(-3, 3)
)
lines(
  scale(get_percentage_values(bing_sent, 10)),
  col = "red",
  lwd = 2
)
lines(
  scale(get_percentage_values(afinn_sent, 10)),
  col = "blue",
  lwd = 2
)
lines(
  scale(get_percentage_values(nrc_sent, 10)),
  col = "green",
  lwd = 2
)
legend('topleft', c("Stanford", "Bing", "Afinn", "NRC"), lty=1, col=c('black', 'red', 'blue', 'green'), bty='n', cex=.75)
######################################################
# Transform the Sentiments
######################################################
stan_trans <- get_transformed_values(
  stanford_sent,
  low_pass_size = 3,
  x_reverse_len = 100,
  scale_vals = TRUE,
  scale_range = FALSE
)
bing_trans <- get_transformed_values(
  bing_sent,
  low_pass_size = 3,
  x_reverse_len = 100,
  scale_vals = TRUE,
  scale_range = FALSE
)
afinn_trans <- get_transformed_values(
  afinn_sent,
  low_pass_size = 3,
  x_reverse_len = 100,
  scale_vals = TRUE,
  scale_range = FALSE
)
nrc_trans <- get_transformed_values(
  nrc_sent,
  low_pass_size = 3,
  x_reverse_len = 100,
  scale_vals = TRUE,
  scale_range = FALSE
)
######################################################
# Plot them all
######################################################
plot(
  stan_trans,
  type = "l",
  main = "Joyce's Portrait Using All Four Methods",
  xlab = "Narrative Time",
  ylab = "Emotional Valence",
  ylim = c(-2, 2)
)
lines(
  bing_trans,
  col = "red",
  lwd = 2
)
lines(
  afinn_trans,
  col = "blue",
  lwd = 2
)
lines(
  nrc_trans,
  col = "green",
  lwd = 2
)
legend('topleft', c("Stanford", "Bing", "Afinn", "NRC"), lty=1, col=c('black', 'red', 'blue', 'green'), bty='n', cex=.75)
######################################################
# Sentence Parsing Annie's Example
######################################################
annies_sentences_w_quotes <- '"Mrs. Rachael, I needn’t inform you who were acquainted with the late Miss Barbary’s affairs, that her means die with her and that this young lady, now her aunt is dead–" "My aunt, sir!" "It is really of no use carrying on a deception when no object is to be gained by it," said Mr. Kenge smoothly, "Aunt in fact, though not in law."'
# Strip out the quotation marks
annies_sentences_no_quotes <- gsub("\"", "", annies_sentences_w_quotes)
# With quotes, Not Very Good:
s_v <- get_sentences(annies_sentences_w_quotes)
s_v
# Without quotes, Better:
s_v_nq <- get_sentences(annies_sentences_no_quotes)
s_v_nq
######################################################
# Some Sentence Comparisons
######################################################
# Test one
test <- "He was not the sort of man that one would describe as especially handsome."
stanford_sent <- get_sentiment(test, method="stanford", "/Applications/stanford-corenlp-full-2014-01-04")
bing_sent <- get_sentiment(test, method="bing")
nrc_sent <- get_sentiment(test, method="nrc")
afinn_sent <- get_sentiment(test, method="afinn")
stanford_sent; bing_sent; nrc_sent; afinn_sent
# test 2
test <- "She was not the sort of woman that one would describe as beautiful."
stanford_sent <- get_sentiment(test, method="stanford", "/Applications/stanford-corenlp-full-2014-01-04")
bing_sent <- get_sentiment(test, method="bing")
nrc_sent <- get_sentiment(test, method="nrc")
afinn_sent <- get_sentiment(test, method="afinn")
stanford_sent; bing_sent; nrc_sent; afinn_sent
# test 3
test <- "That's some bad R code, man."
stanford_sent <- get_sentiment(test, method="stanford", "/Applications/stanford-corenlp-full-2014-01-04")
bing_sent <- get_sentiment(test, method="bing")
nrc_sent <- get_sentiment(test, method="nrc")
afinn_sent <- get_sentiment(test, method="afinn")
stanford_sent; bing_sent; nrc_sent; afinn_sent
SPECIAL BONUS MATERIAL
Swift’s classic satire presents some sentiment challenges. There is disagreement between the Stanford method and the other three in segment four where the sentiments move in opposite directions.

FOOTNOTE
[1] By the way, I’m not sure if Annie was suggesting that the Stanford parser was not working because she could not get it to work (the NAs) or because there was something wrong in the syuzhet package code. The code, as written, works just fine on the two machines I have available for testing. I’d appreciate hearing from others who are having problems; my implementation definitely qualifies as a first class hack.
My Sentiments (Exactly?)
While developing the Syuzhet package–a tool for tracking relative shifts in narrative sentiment–I spent a fair amount of time gut-checking whether the sentiment values returned by the machine methods were a good match for my own sense of the narrative sentiment. Between 70% and 80% of the time, they were what I considered to be good sentence level matches. . . but sentences were not my primary unit of interest.
Rather, I wanted a way to assess whether the story shapes that the tool produced by tracking changes in sentiment were a good approximation of central shifts in the “emotional trajectory” of a narrative. This emotional trajectory was something that Kurt Vonnegut had described in a lecture about the simple shapes of stories. On a chalkboard, Vonnegut graphed stories of good fortune and ill fortune in a demonstration he called “an exercise in relativity.” He was not interested in the precise highs and lows in a given book, but rather in the highs and lows of the book relative to each other.
Blood Meridian and The Devil Wears Prada are two very different books. The former is way, way more negative. What Vonnegut was interested in understanding was not whether McCarthy’s book was more negative overall than Weisberger’s; he was interested in understanding the internal dynamics of shifting sentiment: where in a book we would find the lowest low relative to the highest high. Implied in Vonnegut’s lecture was the idea that this tracking of relative highs and lows could serve as a proxy for something like “plot structure” or “syuzhet.”
This was an interesting idea, and sentiment analysis offered a possible way forward. Unfortunately, the best work in sentiment analysis has been in very different domains. Could sentiment analysis tools and dictionaries that were designed to assess sentiment in movie reviews also detect subtle shifts in the language of prose fiction? Could these methods handle irony, metaphor, and so forth? Some people, especially if they looked only at the results of a few sentences, might reject the whole idea out of hand. Movie reviews and fiction, hogwash! Instead of rejecting the idea, I sat down and human coded the sentiment of every sentence in Joyce’s Portrait of the Artist. I then developed Syuzhet so that I could apply and compare four different sentiment detection techniques to my own human codings.
This human coding business is nuanced. Some sentences are tricky. But it’s not the sarcasm or the irony or the metaphor that is tricky. The really hard sentences are the ones that are equal parts positive and negative sentiment. Consider this contrived example:
“I hated the way he looked at me that morning, and I was glad that he had become my friend.”
Is that a positive or negative sentence? Given the coordinating “and,” perhaps the second half is more important than the first? I coded sentences such as this as neutral, and thankfully these were the outliers and not the norm. Most of the time–even in a complex novel like Portrait, where the style and complexity of the sentences are both evolving with the maturation of the protagonist–it was fairly easy to make a determination of positive, negative, or neutral.
It turns out that when you do this sort of close reading you learn a lot about the way that authors write/express/manipulate “sentiment.” One thing I learned was that tricky sentences, such as the one above, are usually surrounded by other sentences that are less tricky. In fact, in many random passages that I examined from other books, and in the entirety of Portrait, tricky sentences were usually followed or preceded by other simple sentences that would clarify the sentiment of the larger passage. This is an important observation because at the level of an individual sentence, we know that the various machine methods are not super effective.[1] That said, I was pretty surprised by the amount of sentence level agreement in my ad hoc test. On a sentence by sentence basis, here is how the four methods in the package performed:[2]
Bing 84% agreement
Afinn 80% agreement
Stanford 60% agreement
NRC 50% agreement
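For what it is worth, the agreement calculation itself is nothing fancy; here is a rough sketch of the sort of comparison I mean. My human codings are not distributed with the package, so human_coded below is a hypothetical stand-in for my vector of 1/0/-1 values, one per sentence, and poa_v is the vector of Portrait sentences created with get_sentences() as in the code earlier in this post:

machine_sent_bing <- get_sentiment(poa_v, method = "bing")
# proportion of sentences where the machine's sign (positive/neutral/negative)
# matches the human coding
mean(sign(machine_sent_bing) == sign(human_coded))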
These results surprised me. I was shocked that the more awesome Stanford method did not outperform the others. I was so shocked, in fact, that I figured I must have done something wrong. The Stanford sentiment tagger, for example, thinks that the following sentence from Joyce’s Portrait is negative.
“Once upon a time and a very good time it was there was a moocow coming down along the road and this moocow that was coming down along the road met a nicens little boy named baby tuckoo.”
It was a “very good time.” How could that be negative? I think “a very good time” is positive and so do the other methods. The Stanford tagger also indicated that the sentence “He sang that song” is slightly negative. All of the other methods scored it as neutral, and so did I.
I’m a huge fan of the Stanford tagger; I’ve been impressed by the way that it handles negation, but perhaps, when all is said and done, it is simply not well-suited to literary prose, where the syntactical constructions can be far more complicated than those of typical utilitarian prose? I need more time to study how the Stanford tagger behaved on this problem, so I’m just going to exclude it from the rest of this report. My hypothesis, however, is that it is far more sensitive to register/genre than the dictionary based methods.
So, as I was saying, sentiment in actual prose fiction is usually established over a series of sentences. That simile, that bit of irony, that negated sentence is typically followed and/or preceded by a series of more direct sentences expressing the sentiment of the passage. For example,
“She was not ugly. She was exceedingly beautiful.”
“I watched him with disgust. He ate like a pig.”
Prose, at least the prose that I studied in this experiment, is rarely composed of sustained irony, sustained negation, sustained metaphor, etc. Usually authors provide us with lots of clues about the sentiment we are meant to experience, and over the course of several sentences, a paragraph, or a page, the sentiment tends to become less ambiguous.
So instead of just testing the machine methods against my human sentiments on a sentence by sentence basis, I split Joyce’s Portrait into 20 equally sized chunks and calculated the mean sentiment of each. I then compared those means to the means of my own human coded sentiments. These were the results:
Bing 80% agreement
Afinn 85% agreement
NRC 90% agreement
Not bad. But of course any time we apply a chunking method like this, we risk breaking the text right in the middle of a key passage. And, as we increase the number of chunks and effectively decrease the size of each passage, the agreement values tend to decrease. I ran the same test using 100 segments and saw this:
Bing 73% agreement
Afinn 77% agreement
NRC 58% agreement (ouch)
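For the curious, here is a rough sketch of one way such a chunked comparison could be run; again, human_coded is a hypothetical stand-in for my sentence-level 1/0/-1 codings, and the chunking could just as easily be done with the package’s own get_percentage_values():

chunk_agreement <- function(machine_v, human_v, n_chunks = 20) {
  # split both sentence-level vectors into n_chunks equal-sized pieces,
  # take the mean of each piece, and compare the signs of those means
  breaks <- cut(seq_along(machine_v), breaks = n_chunks, labels = FALSE)
  machine_means <- tapply(machine_v, breaks, mean)
  human_means <- tapply(human_v, breaks, mean)
  mean(sign(machine_means) == sign(human_means))
}
chunk_agreement(get_sentiment(poa_v, method = "afinn"), human_coded, n_chunks = 20)
chunk_agreement(get_sentiment(poa_v, method = "afinn"), human_coded, n_chunks = 100)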
Figure 1 graphs how the Afinn method (with 77% agreement over 100 segments) tracked the sentiment compared to my human sentiments.
Figure 1
Next I transformed all of the sentiment vectors (machine and human) using the get_transformed_values function. I then calculated the amount of agreement. With the low pass filter set to the default of 3, I observed the following agreement:
Bing 73% agreement
Afinn 74% agreement
NRC 86% agreement
With the low pass filter set to 5, I observed the following agreement:
Bing 87% agreement
Afinn 93% agreement
NRC 90% agreement
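The transformed comparison works the same way, except that both the machine values and the (hypothetical) human codings are first run through get_transformed_values(). A sketch with the low pass filter set to 5:

afinn_trans_cmp <- get_transformed_values(
  get_sentiment(poa_v, method = "afinn"),
  low_pass_size = 5,
  x_reverse_len = 100,
  scale_vals = TRUE,
  scale_range = FALSE
)
human_trans <- get_transformed_values(
  human_coded,
  low_pass_size = 5,
  x_reverse_len = 100,
  scale_vals = TRUE,
  scale_range = FALSE
)
# proportion of the 100 transformed points where the signs agree
mean(sign(afinn_trans_cmp) == sign(human_trans))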
Figure 2 graphs how the transformed Afinn method tracked narrative changes in sentiment compared to my human sentiments.[3]
Figure 2
As I have said elsewhere, my primary reason for open-sourcing this code was so that others could plot some narratives of their own and see if the shapes track well with their human sense of the emotional trajectories. If you do that, and you have successes or failures, I’d be very interested in hearing from you (please send me an email).
Given all of the above, I suppose my current working benchmark for human to machine accuracy is something like ~80%. Frankly, though, I’m more interested in the big picture and whether or not the overall shapes produced by this method map well onto our human sense of a book’s emotional trajectory. They certainly do seem to map well with my sense of Portrait of the Artist, and with many other books in my library, but what about your favorite novel?
FOOTNOTES:
[1] For what it is worth, the same can probably be said about us, the human beings. Given a single sentence with no context, we could probably argue about its positiveness or negativeness.
[2] Each method uses a slightly different value range, so when I write of “agreement,” I mean only that the machine method agreed with the human (me) that a given sentence was positively or negatively charged. My rating scale consisted of three values: 1, 0, -1 (positive, neutral, negative). I did not test the extent of the positiveness or the degree of negativeness.
[3] I explored low-pass values in increments of 5 all the way to 100. The percentages of agreement were consistently between 70 and 90.