On April 6th, 2015, I posted “Requiem for a low pass filter,” acknowledging that the smoothing filter as I had implemented it in the beta version of Syuzhet was not performing satisfactorily. Ben Schmidt had demonstrated that the filter was artificially distorting the edges of the plots, and prior to Ben’s post, Annie Swafford had argued that the method was producing an unacceptable “ringing” artifact. Within days of posting the “requiem,” I began hearing from people in the signal processing community offering solutions and suggesting I might have given up on the low pass filter too soon.
One good suggestion came from Tommy McGuire (via the Syuzhet GitHub page). Tommy added a “padding factor” argument to the get_transformed_values function in order to deal with the periodicity artifacts at the beginnings and ends of a transformed signal. McGuire’s changes addressed some of the issues and were rolled into the next version of the package.
A second important change was the addition of a function similar to the get_transformed_values function but using a discrete cosine transformation (see get_dct_transform) instead of the FFT. The idea for using DCT was offered by Bradley Riddle, a signal processing engineer who works on time series analysis software for in-air acoustic, SONAR, RADAR and speech data. DCT appears to have satisfactorily solved the problem of periodicity artifacts, but users can judge for themselves (see discussion of simple_plot below).
In the latest release (April 28, 2016), I kept the original get_transformed_values as modified by Tommy McGuire and also added in the new get_dct_transform. The DCT is much better behaved at the edges, and it requires less tweaking (i.e. no padding). Figure 1 shows a plot of Madame Bovary (the text Ben had used in his example of the edge artifacts) with the original plot line (without Tommy McGuire’s update) produced by get_transformed_values (in blue) and a new plot line (in red) produced by the get_dct_transform. The red line is a more accurate representation of the (tragic) plot as we know it.
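The difference in edge behavior can be seen on a toy signal. What follows is a minimal sketch in Python, not Syuzhet's R code: the function names fft_low_pass and dct_low_pass are hypothetical, the transforms are written out naively using only the standard library, and the "plot" is just a steadily rising ramp. The Fourier version treats the series as periodic, so the jump from the last value back to the first pulls both ends toward the middle; the cosine version implicitly mirrors the series at each end, so there is no such jump.

```python
import cmath
import math

def fft_low_pass(values, keep):
    """Low-pass via a naive discrete Fourier transform.

    Keeps the mean (bin 0) plus the keep - 1 lowest frequency pairs.
    """
    n = len(values)
    coefs = [sum(v * cmath.exp(-2j * math.pi * k * t / n)
                 for t, v in enumerate(values))
             for k in range(n)]
    # Bin k and bin n - k carry the same frequency; zero the high ones.
    kept = [c if min(k, n - k) < keep else 0 for k, c in enumerate(coefs)]
    return [(sum(c * cmath.exp(2j * math.pi * k * t / n)
                 for k, c in enumerate(kept)) / n).real
            for t in range(n)]

def dct_low_pass(values, keep):
    """Low-pass via a naive DCT-II, keeping the first `keep` components."""
    n = len(values)
    coefs = [sum(v * math.cos(math.pi * (t + 0.5) * k / n)
                 for t, v in enumerate(values))
             for k in range(keep)]
    # Inverse transform using only the retained components.
    return [coefs[0] / n
            + (2 / n) * sum(coefs[k] * math.cos(math.pi * (t + 0.5) * k / n)
                            for k in range(1, keep))
            for t in range(n)]

# A toy "plot" that rises steadily from start to finish.
ramp = list(range(32))
fft_smooth = fft_low_pass(ramp, 3)
dct_smooth = dct_low_pass(ramp, 3)

# How far each smoothed line strays from the data at the two edges.
fft_edge_error = (abs(fft_smooth[0] - ramp[0]), abs(fft_smooth[-1] - ramp[-1]))
dct_edge_error = (abs(dct_smooth[0] - ramp[0]), abs(dct_smooth[-1] - ramp[-1]))
print("FFT edge error:", fft_edge_error)
print("DCT edge error:", dct_edge_error)
```

On this ramp the Fourier-smoothed line lands much farther from the true first and last values than the cosine-smoothed line does, which is the same kind of edge distortion described above.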
As in the past, I have graphed several dozen novels that I (and my students and colleagues) know well in order to validate that the shapes being produced by the DCT method are accurate representations of the shifting emotions in the novels. I have also worked with a handful of creative writing colleagues here at UNL and elsewhere, graphing their novels and getting feedback from them about whether the shapes match their sense of their own books. In every case, the response has been “yes.” (Though that does not guarantee you won’t find an exception; please let me know if you do!)
Like the original get_transformed_values, the new get_dct_transform implements a low-pass filter to handle the smoothing. For those following the larger discussion, note that there is nothing unusual or strange about using low-pass filters for smoothing data. Indeed, the well-known moving average is an example of a low-pass filter, and a simple Google search will turn up countless articles about smoothing data with FFT and DCT. The trick with any smoother is determining how much smoothing you want to do. With a moving average, the witchcraft comes in setting the size of the moving window to determine how much noise to remove. With get_dct_transform, it is about setting the number of low-frequency components to retain. In any such smoothing you have to accept/assume that the important (desired) information is contained in the lower-frequency variation and not in the higher-frequency noise.
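To make the window-size trade-off concrete, here is a small Python sketch of a plain moving average. It is an illustration only: moving_average is a hypothetical helper, not a Syuzhet function, and the numbers are made-up values rather than sentiment scores from any novel.

```python
def moving_average(values, window):
    """Mean of every full window of `window` consecutive values.

    Only full windows are used, so the result is shorter than the
    input by window - 1 points.
    """
    if not 1 <= window <= len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

# Made-up, noisy values standing in for sentence-level sentiment.
noisy = [1, 5, 2, 8, 3, 4, 9, 2, 6, 7, 1, 5]

light = moving_average(noisy, 2)   # small window: little smoothing
heavy = moving_average(noisy, 4)   # larger window: more smoothing

print(len(noisy), len(light), len(heavy))
print("spread, raw:  ", max(noisy) - min(noisy))
print("spread, light:", max(light) - min(light))
print("spread, heavy:", max(heavy) - min(heavy))
```

The wider window flattens more of the high-frequency wiggle, and it also returns fewer points, because a full window is not available near the ends of the series.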
To help users visualize how two very common filters smooth data in comparison to the get_dct_transform, I added a function called “simple_plot.” With simple_plot it is easy to see how the three different smoothing methods represent the data. Figure 2 shows the output of calling simple_plot for Madame Bovary using the function’s default values. The top panel shows three lines produced by: 1) a Loess smoother, 2) a rolling mean, and 3) the get_dct_transform. (Note that with a rolling mean, you lose data at both the beginning and the end of the series.) The bottom image shows a flatter DCT line (i.e. produced by retaining fewer low-frequency components and, therefore, having less noise). The bottom image also uses the reverse transform process as a way to normalize the x-axis to 100 units. (In the new release, there is now another function, rescale_x_2, that can be used as an alternative way to normalize both the x and y axes.)
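One way to picture the x-axis normalization is the following Python sketch. It is an assumption-laden illustration rather than the package's actual R implementation: dct_resample is a hypothetical name, the transform is written out naively, and the two input series are synthetic. The idea is that once only a few low-frequency cosine components are retained, the reverse transform can be evaluated at any number of evenly spaced points, so books of different lengths end up on the same 100-unit axis.

```python
import math

def dct_resample(values, keep, out_len=100):
    """Smooth with the first `keep` DCT-II components, then evaluate
    the retained cosine terms at `out_len` evenly spaced points."""
    n = len(values)
    coefs = [sum(v * math.cos(math.pi * (t + 0.5) * k / n)
                 for t, v in enumerate(values))
             for k in range(keep)]
    # Same inverse as usual, but sampled at the midpoints of out_len
    # equal bins instead of the original n positions.
    return [coefs[0] / n
            + (2 / n) * sum(coefs[k] * math.cos(math.pi * (j + 0.5) * k / out_len)
                            for k in range(1, keep))
            for j in range(out_len)]

# Two synthetic "novels" of different lengths with a similar arc.
short_book = [math.sin(2 * math.pi * t / 37) for t in range(37)]
long_book = [math.sin(2 * math.pi * t / 250) for t in range(250)]

short_shape = dct_resample(short_book, keep=5)
long_shape = dct_resample(long_book, keep=5)

print(len(short_shape), len(long_shape))
```

Both results have 100 points regardless of the input length, which is what makes it possible to lay one plot shape over another.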
The other change in the latest release is the addition of a custom sentiment dictionary compiled with help from the students in my lab. I have documented the creation and testing of that dictionary in two previous blog posts: “That Sentimental Feeling” (December 20, 2015) and “More Syuzhet Validation” (August 11, 2016). In these posts, human-coded sentiment data is compared to machine-derived data in eleven well-known novels. We still have more work to do in terms of tweaking and validating the dictionary, but so far it is performing as well as the other dictionaries and in some cases better.
Also worth mentioning here is yet another smoothing method suggested by Jianbo Gao, who has developed an innovative adaptive approach to smoothing time series data. Jianbo and I met at the Institute for Applied Mathematics last summer and, with John Laudun and Timothy Tangherlini, we wrote a paper titled “A Multiscale Theory for the Dynamical Evolution of Sentiment in Novels” that was delivered at the Conference on Behavioral, Economic, and Socio-Cultural Computing last November. I have not found the time to implement this adaptive smoother in the Syuzhet package, but it is on the todo list.
Also on the todo list for a future release is adding the ability to work with languages other than English. Thanks to Denis Roussel, GitHub contributor “denrou”, this work is now progressing nicely.
Over the past few years, a number of people have contributed to this work, either directly or indirectly. I think we are making good progress, and I want to acknowledge the following people in particular: Aaron Dominguez, Andrew Piper, Annie Swafford, Ben Schmidt, Bradley Riddle, Chris Stubben, David Bamman, Denis Roussel, Drue Marr, Ellie Wilke, Faith Aberle, Felix Peckitt, Gabi Kirilloff, Jianbo Gao, Julius Fredrick, Lincoln Mullen, Marti Hearst, Michael Hoffman, Natalie Mackley, Nissanka Wickremasinghe, Oliver Keyes, Peter Organisciak, Roz Thalken, Sarah Cohen, Scott Enderle, Tasha Saathoff, Ted Underwood, Timothy Schaffert, Tommy McGuire, and Walter Jacob. (If I forgot you, I’m sorry; please let me know.)