» Blog Posts Matthew L. Jockers

Revisiting Chapter Nine of Macroanalysis

Back when I was working on Macroanalysis, Gephi was a young and sometimes buggy application. So when it came to the network analysis in Chapter 9, I was limited in terms of the amount of data that could be visualized. For the network graphs, I reduced the number of edges from 5,660,695 down to 167,770 by selecting only those edges where the distances were quite close.

Gephi can now handle one million edges, so I thought it would be interesting to see how/if the results of my original analysis might change if I went from graphing 3% of the edges to 18%.

Readers familiar with my approach will recall that I calculated the similarity between every book in my corpus using euclidean distance. My feature set was a combination of topic data from the topic model discussed in chapter 8 and the stylistic data explored in chapter 6. Basically, every single book was compared to every other single book using the euclidean formula, the output of which is a distance matrix where the number of rows and the number of columns is equal to the number of books in the corpus. The values in the cells of the matrix are the computed euclidean distances.

If you take any single row (or column) in the matrix and sort it from smallest to largest, the smallest value will always be a 0 and that is because the distance from any book to itself is always zero. The next value will be the book that has the most similar composition of topics and style. So if you select the row for Jane Austen’s Pride and Prejudice, you’ll find that Sense and Sensibility and other books by Austen are close by in terms of distance. Austen has a remarkably stable style across her novels and the same topics tend to appear across her books.

For any given book, there are a handful of books that are very similar (short distances) and then a series of books that are fairly similar and then whole bunch of books that have little to no similarity. Consider the case of Pride and Prejudice. Figure 1 shows the sorted distances from Pride and Prejudice to the 35 most similar books in the corpus. You’ll notice there is a “knee” in the line right around the 7th book on the x-axis. Those first seven book are very similar. After that we see books becoming more and more distant along a fairly regular slope. If we were to plot the entire distribution, there would be another “knee” where books become incredibly dissimilar and the line shoots upward.

In chapter 9 of Macroanalysis, I was curious about influence and the relationship between individual books and the other books that were most similar to them. To explore these relationships at scale, I devised an ad hoc approach to culling the number of edges of interest to only those where the distances were comparatively short. In the case of Pride and Prejudice, the most similar books included other works by Austen, but also books stretching into the future as far as 1886. In other words, the most similar books are not necessarily colocated in time.

I admit that this culling process was not very well described in Macroanalysis and there is, I see now, one error of omission and one outright mistake. Neither of these impacted the results described in the book, but it’s definitely worth setting the record straight here. In the book (page 165), I write that I “removed those target books that were more than one standard deviation from the source book.” That’s not clear at all, and it’s probably misleading.

For each book, call it the “base” book, I first excluded all books published in the same year or before the publication year of the base book (i.e. a book could not influence a book published in the same year or before, so these should not be examined). I then calculated the mean distance of the remaining books from the base book. I then kept only those books that were less then 3/4 of a standard deviation below the mean (not one whole standard deviation as suggested in my text). For Pride and Prejudice, this formula meant that I retained the 26 most similar books. For the larger corpus, this is how I got from 5,660,695 edges down to 167,770.

For this blog post, I recreated the entire process. The next two images (figures 2 and 3) show the same results reported in the book. The network shapes look slightly different and the orientations are slightly different, but there is still clear evidence of a chronological signal (figure 2) and there is still a clear differentiation between books authored by males and books authored by females (figure 3).

Figure 2: Using 167,770 Edges

Figure 3: Using 167,770 Edges

Figures 4 and 5, below, show the same chronological and gender sorting, but now using 1 million edges instead of the original 167,770.

Figure 4: Using 1,000,000 Edges

Figure 5: Using 1,000,000 Edges

One might wonder if what’s being graphed here is obvious? After all wouldn’t we expect topics to be time sensitive, faddish, and wouldn’t we expect style to be likewise? Well, I suppose expectations are a matter of personal opinion.

What my data show are that some topics appear and disappear over time (e.g. vampires) in what seem to be faddish ways, others seem to appear with regularity and even predictability (love), and some are just downright odd, appearing and disappearing in no recognizable pattern (animals). Such is also the case with the word frequencies that we often speak of as a proxy for “style.” In the 19th century, for example, use of the word “like” in English fiction was fairly consistent and flat compared to other frequent words that fluctuate more from year to year or decade to decade: e.g. “of” and “it”.

So, I don’t think it is a foregone conclusion that novels published in a particular time period are necessarily similar. It is possible that a particularly popular topic might catch on or that a powerful writer’s style might get imitated. It is equally plausible that in a race to “make it new” writers would intentionally avoid working with popular topics or imitating a typical style.

And when it comes to author gender/sex, I don’t think it is obvious that male writers will write like other males and females like other females. The data reveal that even while the majority (roughly 80%) in each class write more like members of their class, many women (~20%) write more like men and many men (~20%) write more like women. Which is to say, there are central tendencies and there are outliers. When it comes to author gender, study after study indicate that the central tendency is about 80% of writers. Looking at how these distributions evolve over time, seems to me a especially interesting place for ongoing research.

But what we are ultimately dealing with here, in these graphs, are the central tendencies. I continue to believe, as I have argued in Macroanalysis and in The Bestseller Code, that it is only through an understanding of the central tendencies that we can begin to understand and appreciate what it means to be an outlier.
Syuzhet 1.0.4 now on CRAN
On Friday I posted an updated version of Syuzhet (1.0.4) to CRAN. This version has been available over on GitHub for a while now. In version 1.0.4, support for sentiment detection in several languages was added by using the expanded NRC lexicon from Saif Mohammed. The lexicon includes sentiment values for 13,901 words in each of the following languages: Arabic, Basque, Bengali, Catalan, Chinese_simplified, Chinese_traditional, Danish, Dutch, English, Esperanto, Finnish, French, German, Greek, Gujarati, Hebrew, Hindi, Irish, Italian, Japanese, Latin, Marathi, Persian, Portuguese, Romanian, Russian, Somali, Spanish, Sudanese, Swahili, Swedish, Tamil, Telugu, Thai, Turkish, Ukranian, Urdu, Vietnamese, Welsh, Yiddish, Zulu. At the time of this release, however, Syuzhet will only work with languages that use Latin character sets. This effectively means that “Arabic”, “Bengali”, “Chinese_simplified”, “Chinese_traditional”, “Greek”, “Gujarati”, “Hebrew”, “Hindi”, “Japanese”, “Marathi”, “Persian”, “Russian”, “Tamil”, “Telugu”, “Thai”, “Ukranian”, “Urdu”, “Yiddish” are not supported even though these languages are part of the extended NRC dictionary and can be accessed via the get_sentiment_dictionary() function. I have heard from several of my non-English native speaking students and a few others on Twitter that the German, French, and Spanish results seem to be good. Your mileage may vary. For details on the lexicon, see NRC Emotion Lexicon. Also in this release is support for user created lexicons. To work, users create their own custom lexicon as a data frame with at least two columns named “word” and “value.” Here is a simplified example: [crayon-66a3e1cd6f16c778247015/] With contributions from Philip Bulsink, support for parallel processing was added so that one can call get_sentiment() and provide cluster information from parallel::makeCluster() to achieve results quicker on systems with multiple cores. Thanks also to Jennifer Isasi, Tyler Rinker, “amrrs,” and Oliver Keyes for recent suggestions/contributions/QA. Examples of how to use these new functions and languages are in the updated vignette.
Resurrecting a Low Pass Filter (well, kind of)
On April 6th, 2015, I posted Requiem for a low pass filter acknowledging that the smoothing filter as I had implemented it in the beta version of Syuzhet was not performing satisfactorily. Ben Schmidt had demonstrated that the filter was artificially distorting the edges of the plots, and prior to Ben’s post, Annie Swafford had argued that the method was producing an unacceptable “ringing” artifact. Within days of posting the “requiem,” I began hearing from people in the signal processing community offering solutions and suggesting I might have given up on the low pass filter too soon. One good suggestion came from Tommy McGuire (via the Syuzhet GitHub page). Tommy added a “padding factor” argument to the get_transformed_values function in order deal with the periodicity artifacts at the beginnings and ends of a transfomred signal. McGuire’s changes addressed some of the issues and were rolled into the next version of the package. A second important change was the addition of a function similar to the get_transformed_values function but using a discrete cosine transformation (see get_dct_transform) instead of the FFT. The idea for using DCT was offered by Bradley Riddle, a signal processing engineer who works on time series analysis software for in-air acoustic, SONAR, RADAR and speech data. DCT appears to have satisfactorily solved the problem of periodicity artifacts, but users can judge for themselves (see discussion of simple_plot below). In the latest release (April 28, 2016), I kept the original get_transformed_values as modified by Tommy McGuire and also added in the new get_dct_transform. The DCT is much better behaved at the edges, and it requires less tweaking (i.e. no padding). Figure 1 shows a plot of Madame Bovary (the text Ben had used in his example of the edge artifacts) with the original plot line (without Tommy McGuire’s update) produced by get_transformed_values (in blue) and a new plot line (in red) produced by the get_dct_transform. The red line is a more accurate representation of the (tragic) plot as we know it. As in the past, I have graphed several dozen novels that I (and my students and colleagues) know well in order to validate that the shapes being produced by the DCT method are accurate representations of the shifting emotions in the novels. I have also worked with a handful of creative writing colleagues here at UNL and elsewhere, graphing their novels and getting feedback from them about whether the shapes match their sense of their own books. In every case, the response has been “yes.” (Though that does not guarantee you won’t find an exception–please let me know if you do!) Figure 1 Like the original get_transformed_values, the new get_dct_transform implements a low-pass filter to handle the smoothing. For those following the larger discussion, note that there is nothing unusual or strange about using low-pass filters for smoothing data. Indeed, the well known moving average is an example of a low-pass filter and a simple Google search will turn up countless articles about smoothing data with FFT and DCT. The trick with any smoother is determining how much smoothing you want to do. With a moving average, the witchcraft comes in setting the size of the moving widow to determine how much noise to remove. With the get_dct_transform it is about setting the number of low frequency components to retain. In any such smoothing you have to accept/assume that the important (desired) information is contained in the lower frequency variation and not in the higher frequency noise. To help users visualize how two very common filters smooth data in comparison to the get_dct_transform, I added a function called “simple_plot.” With simple_plot it is easy to see how the three different smoothing methods represent the data. Figure 2 shows the output of calling simple_plot for Madame Bovary using the function’s default values. The top panel shows three lines produced by: 1) a Loess smoother, 2) a rolling mean, and 3) the get_dct_transform. (Note that with a rolling mean, you lose data at both beginning and the end of the series.) The bottom image shows a flatter DCT line (i.e. produced by retaining fewer low frequency components and, therefore, having less noise). The bottom image also uses the reverse transform process as a way to normalize the x-axis to 100 units. (In the new release, there is now another function, rescale_x_2, that can be used as an alternative way to normalize both the x and y axis.) Figure 2 The other change in the latest release is the addition of a custom sentiment dictionary compiled with help from the students in my lab. I have documented the creation and testing of that dictionary in two previous blog posts: “That Sentimental Feeling” (12/20/2015) and “More Syuzhet Validation” (August 11, 2016). In these posts, human coded sentiment data is compared to machine derived data in eleven well known novels. We still have more work to do in terms of tweaking and validating the dictionary, but so far it is performing as well as the other dictionaries and in some case better. Also worth mention here is yet another smoothing method suggested by Jianbo Gao, who has developed an innovative adaptive approach to smoothing time series data. Jianbo and I met at the Institute for Applied Mathematics last summer and, with John Laudun and Timothy Tangherlini, we wrote a paper titled “A Multiscale Theory for the Dynamical Evolution of Sentiment in Novels” that was delivered at the Conference on Behavioral, Economic, and Socio-Cultural Computing last November. I have not found the time to implement this adaptive smoother into the Syuzhet package, but it is on the todo list. Also on the todo list for a future release is adding the ability to work with languages other than English. Thanks to Denis Roussel, GitHub contributor “denrou”, this work is now progressing nicely. Over the past few years, a number of people have contributed to this work, either directly or indirectly. I think we are making good progress, and I want to acknowledge the following people in particular: Aaron Dominguez, Andrew Piper, Annie Swafford, Ben Schmidt, Bradley Riddle, Chris Stubben, David Bamman, Denis Roussel, Drue Marr, Ellie Wilke, Faith Aberle, Felix Peckitt, Gabi Kirilloff, Jianbo Gao, Julius Fredrick, Lincoln Mullen, Marti Hearst, Michael Hoffman, Natalie Mackley, Nissanka Wickremasinghe, Oliver Keyes, Peter Organisciak, Roz Thalken, Sarah Cohen, Scott Enderle, Tasha Saathoff, Ted Underwood, Timothy Schaffert, Tommy McGuire, and Walter Jacob. (If I forgot you, I’m sorry, please let me know).
More Syuzhet Validation
Back in December I posted results from a human validation experiment in which machine extracted sentiment values were compared to human coded values. The results were encouraging. In the spring, we mined the human coded sentences to help create a new sentiment dictionary that would, in theory, be more sensitive to the sort of sentiment words common to fiction (whereas existing sentiment dictionaries tend to be derived from movie and/or product review corpora). This dictionary was implemented as the default in the latest release of the Syuzhet R package (2016-04-28). Over the summer, a new group of six human-coders was hired to read novels and score the sentiment of every sentence. Each novel was read by three human-coders. In the graphs that follow below, a simple moving average is used to plot the mean sentiment of the three students (black line) along side the values derived from the new “Syuzhet” dictionary (red line). Each graph reports the Pearson product-moment correlation coefficient. This fall we will continue gathering human data by reading additional books. Once we have a few more books read, we’ll post a more detailed report, including data about inter-coder agreement and which machine methods produced results closest to the humans.
That Sentimental Feeling
Eight months ago I began a series of blog posts about my experiments using sentiment analysis as a proxy for plot movement. At the time, I had done a fair bit of anecdotal analysis of how well the sentiments detected by a machine matched my own sense of the sentiments in a series of familiar novels. In addition to the anecdotal spot-checking, I had also hand-coded every sentence of James Joyce’s novel Portrait of the Artist as a Young Man and then compared the various machine methods to my own human coded values. The similarities (seen in figure 1) were striking.
Figure 1
Soon after my first post about this work, David Bamann hired five Mechanical Turks to code the sentiment in each scene of Shakespeare’s Romeo and Juliet. David posted his results online and then Ted Underwood compared the trajectory produced by David’s turks to the machine values produced by the Syuzhet R package I had developed. Even though David’s Turks had coded scenes and Syuzhet had coded sentences, the human and machine trajectories that resulted were very similar. Figure 2 shows the two graphs, first from David’s blog and then from Ted’s.
Figure 2
Before releasing the package, I was fairly confident that the machine was doing a good job of approximating what human beings would think. I hoped that others, like David and Ted, would provide further validation. Many folks posted results online and many more emailed me saying the tool was producing trajectories that matched their sense of the novels they applied it to, but no one conducted anything beyond anecdotal spot checking. After returning to UNL in late August (I had been on leave for a year), I hired four students to code the sentiment of every sentence in six contemporary novels: All the Light We Cannot See by Anthony Doerr, The Da Vinci Code by Dan Brown, Gone Girl by Gillian Flynn, The Secret Life of Bees by Sue Monk Kidd, The Lovely Bones by Alice Sebold, and The Notebook by Nicholas Sparks. These novels were selected to cover several major contemporary genres. They span a period from 2003 to 2014. None are experimental in the way that Portrait of the Artist is, but they do cover a range of styles between what we might call “low-brow” to “high-brow.” Each sentence of each novel was sentiment coded by three human raters. The precise details of this study, including statistics about inter-rater agreement and machine-to-human agreement, are part of a larger analysis I am conducting with Aaron Dominguez. What follows are six graphs showing moving averages of the human coded sentiment along side moving averages from two of the sentiment detection methods implemented in the Syuzhet R package. The similarity of the shapes derived from the the human and machine data is quite striking.