Back when I was working on Macroanalysis, Gephi was a young and sometimes buggy application. So when it came to the network analysis in Chapter 9, I was limited in terms of the amount of data that could be visualized. For the network graphs, I reduced the number of edges from 5,660,695 down to 167,770 by selecting only those edges where the distances were quite close.

Gephi can now handle one million edges, so I thought it would be interesting to see how/if the results of my original analysis might change if I went from graphing 3% of the edges to 18%.

Readers familiar with my approach will recall that I calculated the similarity between every book in my corpus using euclidean distance. My feature set was a combination of topic data from the topic model discussed in chapter 8 and the stylistic data explored in chapter 6. Basically, every single book was compared to every other single book using the euclidean formula, the output of which is a distance matrix where the number of rows and the number of columns is equal to the number of books in the corpus. The values in the cells of the matrix are the computed euclidean distances.

If you take any single row (or column) in the matrix and sort it from smallest to largest, the smallest value will always be a 0 and that is because the distance from any book to itself is always zero. The next value will be the book that has the most similar composition of topics and style. So if you select the row for Jane Austen’s Pride and Prejudice, you’ll find that Sense and Sensibility and other books by Austen are close by in terms of distance. Austen has a remarkably stable style across her novels and the same topics tend to appear across her books.

For any given book, there are a handful of books that are very similar (short distances) and then a series of books that are fairly similar and then whole bunch of books that have little to no similarity. Consider the case of Pride and Prejudice. Figure 1 shows the sorted distances from Pride and Prejudice to the 35 most similar books in the corpus. You’ll notice there is a “knee” in the line right around the 7th book on the x-axis. Those first seven book are very similar. After that we see books becoming more and more distant along a fairly regular slope. If we were to plot the entire distribution, there would be another “knee” where books become incredibly dissimilar and the line shoots upward.

In chapter 9 of Macroanalysis, I was curious about influence and the relationship between individual books and the other books that were most similar to them. To explore these relationships at scale, I devised an ad hoc approach to culling the number of edges of interest to only those where the distances were comparatively short. In the case of Pride and Prejudice, the most similar books included other works by Austen, but also books stretching into the future as far as 1886. In other words, the most similar books are not necessarily colocated in time.

I admit that this culling process was not very well described in Macroanalysis and there is, I see now, one error of omission and one outright mistake. Neither of these impacted the results described in the book, but it’s definitely worth setting the record straight here. In the book (page 165), I write that I “removed those target books that were more than one standard deviation from the source book.” That’s not clear at all, and it’s probably misleading.

For each book, call it the “base” book, I first excluded all books published in the same year or before the publication year of the base book (i.e. a book could not influence a book published in the same year or before, so these should not be examined). I then calculated the mean distance of the remaining books from the base book. I then kept only those books that were less then 3/4 of a standard deviation below the mean (not one whole standard deviation as suggested in my text). For Pride and Prejudice, this formula meant that I retained the 26 most similar books. For the larger corpus, this is how I got from 5,660,695 edges down to 167,770.

For this blog post, I recreated the entire process. The next two images (figures 2 and 3) show the same results reported in the book. The network shapes look slightly different and the orientations are slightly different, but there is still clear evidence of a chronological signal (figure 2) and there is still a clear differentiation between books authored by males and books authored by females (figure 3).

Figure 2: Using 167,770 Edges
Figure 3: Using 167,770 Edges

Figures 4 and 5, below, show the same chronological and gender sorting, but now using 1 million edges instead of the original 167,770.

Figure 4: Using 1,000,000 Edges
Figure 5: Using 1,000,000 Edges

One might wonder if what’s being graphed here is obvious? After all wouldn’t we expect topics to be time sensitive, faddish, and wouldn’t we expect style to be likewise? Well, I suppose expectations are a matter of personal opinion.

What my data show are that some topics appear and disappear over time (e.g. vampires) in what seem to be faddish ways, others seem to appear with regularity and even predictability (love), and some are just downright odd, appearing and disappearing in no recognizable pattern (animals). Such is also the case with the word frequencies that we often speak of as a proxy for “style.” In the 19th century, for example, use of the word “like” in English fiction was fairly consistent and flat compared to other frequent words that fluctuate more from year to year or decade to decade: e.g. “of” and “it”.

So, I don’t think it is a foregone conclusion that novels published in a particular time period are necessarily similar. It is possible that a particularly popular topic might catch on or that a powerful writer’s style might get imitated. It is equally plausible that in a race to “make it new” writers would intentionally avoid working with popular topics or imitating a typical style.

And when it comes to author gender/sex, I don’t think it is obvious that male writers will write like other males and females like other females. The data reveal that even while the majority (roughly 80%) in each class write more like members of their class, many women (~20%) write more like men and many men (~20%) write more like women. Which is to say, there are central tendencies and there are outliers. When it comes to author gender, study after study indicate that the central tendency is about 80% of writers. Looking at how these distributions evolve over time, seems to me a especially interesting place for ongoing research.

But what we are ultimately dealing with here, in these graphs, are the central tendencies. I continue to believe, as I have argued in Macroanalysis and in The Bestseller Code, that it is only through an understanding of the central tendencies that we can begin to understand and appreciate what it means to be an outlier.