Back when I was working on Macroanalysis, Gephi was a young and sometimes buggy application. So when it came to the network analysis in Chapter 9, I was limited in terms of the amount of data that could be visualized. For the network graphs, I reduced the number of edges from 5,660,695 down to 167,770 by selecting only those edges where the distances were quite close.

Gephi can now handle one million edges, so I thought it would be interesting to see how/if the results of my original analysis might change if I went from graphing 3% of the edges to 18%.

Readers familiar with my approach will recall that I calculated the similarity between every book in my corpus using euclidean distance. My feature set was a combination of topic data from the topic model discussed in chapter 8 and the stylistic data explored in chapter 6. Basically, every single book was compared to every other single book using the euclidean formula, the output of which is a distance matrix where the number of rows and the number of columns is equal to the number of books in the corpus. The values in the cells of the matrix are the computed euclidean distances.

If you take any single row (or column) in the matrix and sort it from smallest to largest, the smallest value will always be a 0 and that is because the distance from any book to itself is always zero. The next value will be the book that has the most similar composition of topics and style. So if you select the row for Jane Austen’s Pride and Prejudice, you’ll find that Sense and Sensibility and other books by Austen are close by in terms of distance. Austen has a remarkably stable style across her novels and the same topics tend to appear across her books.

For any given book, there are a handful of books that are very similar (short distances) and then a series of books that are fairly similar and then whole bunch of books that have little to no similarity. Consider the case of Pride and Prejudice. Figure 1 shows the sorted distances from Pride and Prejudice to the 35 most similar books in the corpus. You’ll notice there is a “knee” in the line right around the 7th book on the x-axis. Those first seven book are very similar. After that we see books becoming more and more distant along a fairly regular slope. If we were to plot the entire distribution, there would be another “knee” where books become incredibly dissimilar and the line shoots upward.

In chapter 9 of Macroanalysis, I was curious about influence and the relationship between individual books and the other books that were most similar to them. To explore these relationships at scale, I devised an ad hoc approach to culling the number of edges of interest to only those where the distances were comparatively short. In the case of Pride and Prejudice, the most similar books included other works by Austen, but also books stretching into the future as far as 1886. In other words, the most similar books are not necessarily colocated in time.

I admit that this culling process was not very well described in Macroanalysis and there is, I see now, one error of omission and one outright mistake. Neither of these impacted the results described in the book, but it’s definitely worth setting the record straight here. In the book (page 165), I write that I “removed those target books that were more than one standard deviation from the source book.” That’s not clear at all, and it’s probably misleading.

For each book, call it the “base” book, I first excluded all books published in the same year or before the publication year of the base book (i.e. a book could not influence a book published in the same year or before, so these should not be examined). I then calculated the mean distance of the remaining books from the base book. I then kept only those books that were less then 3/4 of a standard deviation below the mean (not one whole standard deviation as suggested in my text). For Pride and Prejudice, this formula meant that I retained the 26 most similar books. For the larger corpus, this is how I got from 5,660,695 edges down to 167,770.

For this blog post, I recreated the entire process. The next two images (figures 2 and 3) show the same results reported in the book. The network shapes look slightly different and the orientations are slightly different, but there is still clear evidence of a chronological signal (figure 2) and there is still a clear differentiation between books authored by males and books authored by females (figure 3).

Figures 4 and 5, below, show the same chronological and gender sorting, but now using 1 million edges instead of the original 167,770.