"Everything . . . in nature's vast workshop from the extinction of some remote sun to the blossoming of one of the countless flowers which beautify our public parks is subject to a law of numeration as yet unascertained.” (Joyce, Ulysses, 1922)
Revisiting Chapter Nine of Macroanalysis
Back when I was working on Macroanalysis, Gephi was a young and sometimes buggy application. So when it came to the network analysis in Chapter 9, I was limited in how much data I could visualize. For the network graphs, I reduced the number of edges from 5,660,695 down to 167,770 by selecting only those edges where the distances were quite short.
Gephi can now handle one million edges, so I thought it would be interesting to see whether, and how, the results of my original analysis might change if I went from graphing 3% of the edges to 18%.
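As a quick check on those percentages, here is a minimal sketch that uses only the three edge counts quoted above; the variable names are illustrative.

```python
# Edge counts quoted in the post.
total_edges = 5_660_695
original_edges = 167_770
new_edges = 1_000_000

# Share of all possible edges that each graph draws.
original_share = original_edges / total_edges
new_share = new_edges / total_edges

print(f"original graph: {original_share:.1%} of all edges")
print(f"new graph:      {new_share:.1%} of all edges")
```

Rounded to whole percentages, the two shares come out at the 3% and 18% mentioned above.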
Readers familiar with my approach will recall that I calculated the similarity between every pair of books in my corpus using Euclidean distance. My feature set was a combination of topic data from the topic model discussed in Chapter 8 and the stylistic data explored in Chapter 6. Basically, every single book was compared to every other book using the Euclidean formula, the output of which is a distance matrix where the number of rows and the number of columns are each equal to the number of books in the corpus. The values in the cells of the matrix are the computed Euclidean distances.
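To make the shape of that matrix concrete, here is a minimal sketch in plain Python. The feature values are made-up stand-ins, not the topic and style measurements from the book, and the function name euclidean_distance_matrix is just a label chosen for this illustration.

```python
import math

def euclidean_distance_matrix(features):
    """Return an n x n list of lists holding the Euclidean distance
    between every pair of rows in `features`."""
    return [[math.dist(a, b) for b in features] for a in features]

# Toy stand-in data: four "books", each described by three invented
# feature values (think: two topic proportions and one word frequency).
toy_features = [
    [0.0, 0.0, 0.0],
    [3.0, 4.0, 0.0],
    [0.0, 0.0, 1.0],
    [3.0, 4.0, 1.0],
]

toy_matrix = euclidean_distance_matrix(toy_features)

for row in toy_matrix:
    print([round(value, 3) for value in row])
```

The result is square (one row and one column per book), symmetric, and has zeros on the diagonal, which is why only the pairs on one side of the diagonal are distinct edges.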
If you take any single row (or column) in the matrix and sort it from smallest to largest, the smallest value will always be 0, because the distance from any book to itself is always zero. The next value belongs to the book that has the most similar composition of topics and style. So if you select the row for Jane Austen’s Pride and Prejudice, you’ll find that Sense and Sensibility and other books by Austen are close by in terms of distance. Austen has a remarkably stable style across her novels and the same topics tend to appear across her books.
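Sorting a row to find a book's nearest neighbors can be sketched as follows. The distances and the labels (book_a through book_d) are invented for illustration, and nearest_neighbors is a hypothetical helper, not code from the original analysis.

```python
def nearest_neighbors(distances, labels, base_index, k):
    """Return the k (label, distance) pairs closest to the base book,
    skipping the base book itself (whose distance is always 0)."""
    row = distances[base_index]
    # Sort column indices by their distance from the base book.
    ranked = sorted(range(len(row)), key=lambda j: row[j])
    ranked = [j for j in ranked if j != base_index]
    return [(labels[j], row[j]) for j in ranked[:k]]

# Invented symmetric distance matrix for four books.
toy_labels = ["book_a", "book_b", "book_c", "book_d"]
toy_distances = [
    [0.0, 0.2, 0.9, 0.5],
    [0.2, 0.0, 0.8, 0.6],
    [0.9, 0.8, 0.0, 0.7],
    [0.5, 0.6, 0.7, 0.0],
]

closest = nearest_neighbors(toy_distances, toy_labels, base_index=0, k=2)
print(closest)
```

With real data, the row for Pride and Prejudice would play the role of book_a here.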
For any given book, there are a handful of books that are very similar (short distances), then a series of books that are fairly similar, and then a whole bunch of books that have little to no similarity. Consider the case of Pride and Prejudice. Figure 1 shows the sorted distances from Pride and Prejudice to the 35 most similar books in the corpus. You’ll notice there is a “knee” in the line right around the 7th book on the x-axis. Those first seven books are very similar. After that we see books becoming more and more distant along a fairly regular slope. If we were to plot the entire distribution, there would be another “knee” where books become incredibly dissimilar and the line shoots upward.
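The knee in Figure 1 was spotted by eye. For readers who want to flag such a break automatically, one crude heuristic is to look for the largest jump between consecutive sorted distances. This is a swapped-in technique for illustration, not the method used in the book, and the distances below are invented to mimic a curve with a break after the seventh book.

```python
def books_before_largest_gap(sorted_distances):
    """Given distances sorted from smallest to largest, return how many
    books come before the single largest jump between neighbors."""
    gaps = [later - earlier
            for earlier, later in zip(sorted_distances, sorted_distances[1:])]
    return gaps.index(max(gaps)) + 1

# Invented distances: seven tightly clustered books, then a steadier climb.
toy_sorted = [0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16,
              0.40, 0.45, 0.50, 0.55]

cutoff = books_before_largest_gap(toy_sorted)
print(f"largest jump comes after book {cutoff}")
```

A heuristic like this only finds one break and is sensitive to noise, so on real data it is worth checking the result against the plotted curve.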
In Chapter 9 of Macroanalysis, I was curious about influence and the relationship between individual books and the other books that were most similar to them. To explore these relationships at scale, I devised an ad hoc approach to culling the edges of interest down to only those where the distances were comparatively short. In the case of Pride and Prejudice, the most similar books included other works by Austen, but also books stretching into the future as far as 1886. In other words, the most similar books are not necessarily colocated in time.
I admit that this culling process was not very well described in Macroanalysis and there is, I see now, one error of omission and one outright mistake. Neither of these impacted the results described in the book, but it’s definitely worth setting the record straight here. In the book (page 165), I write that I “removed those target books that were more than one standard deviation from the source book.” That’s not clear at all, and it’s probably misleading.
For each book, call it the “base” book, I first excluded all books published in the same year or before the publication year of the base book (i.e. a book could not influence a book published in the same year or before, so these should not be examined). I then calculated the mean distance of the remaining books from the base book. I then kept only those books that were less than 3/4 of a standard deviation below the mean (not one whole standard deviation as suggested in my text). For Pride and Prejudice, this formula meant that I retained the 26 most similar books. For the larger corpus, this is how I got from 5,660,695 edges down to 167,770.
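The culling steps just described can be sketched roughly as below. Several things here are assumptions rather than details from the book: the cutoff is read as "distance less than the mean minus 3/4 of a standard deviation", the sample standard deviation is used, and the titles, years, and distances are invented. The function name cull_candidates and its parameters are hypothetical.

```python
import statistics

def cull_candidates(base_year, candidates, sd_fraction=0.75):
    """Keep only later-published books that sit unusually close to the base.

    `candidates` is a list of (label, publication_year, distance_from_base).
    """
    # Step 1: drop books published in the same year as, or before, the base.
    later = [c for c in candidates if c[1] > base_year]
    if len(later) < 2:
        return []  # a standard deviation needs at least two values

    # Step 2: mean and standard deviation of the remaining distances.
    dists = [c[2] for c in later]
    mean = statistics.mean(dists)
    sd = statistics.stdev(dists)  # assumption: sample standard deviation

    # Step 3: keep books whose distance falls below mean - 0.75 * sd.
    threshold = mean - sd_fraction * sd
    kept = [c for c in later if c[2] < threshold]
    return sorted(kept, key=lambda c: c[2])

# Invented data for a base book published in 1813.
toy_candidates = [
    ("a", 1811, 0.1),  # earlier: excluded
    ("b", 1813, 0.1),  # same year: excluded
    ("c", 1814, 0.2),
    ("d", 1820, 1.0),
    ("e", 1830, 1.1),
    ("f", 1850, 1.2),
    ("g", 1886, 0.3),
]

kept = cull_candidates(1813, toy_candidates)
print([label for label, _, _ in kept])
```

Each kept book becomes one edge from the base book, and repeating the procedure for every book in the corpus produces the reduced edge list.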
For this blog post, I recreated the entire process. The next two images (figures 2 and 3) show the same results reported in the book. The network shapes look slightly different and the orientations are slightly different, but there is still clear evidence of a chronological signal (figure 2) and there is still a clear differentiation between books authored by males and books authored by females (figure 3).
Figure 2: Using 167,770 Edges
Figure 3: Using 167,770 Edges
Figures 4 and 5, below, show the same chronological and gender sorting, but now using 1 million edges instead of the original 167,770.
Figure 4: Using 1,000,000 Edges
Figure 5: Using 1,000,000 Edges
One might wonder if what’s being graphed here is obvious. After all, wouldn’t we expect topics to be time sensitive, faddish, and wouldn’t we expect style to be likewise? Well, I suppose expectations are a matter of personal opinion.
What my data show is that some topics appear and disappear over time (e.g. vampires) in what seem to be faddish ways, others seem to appear with regularity and even predictability (love), and some are just downright odd, appearing and disappearing in no recognizable pattern (animals). Such is also the case with the word frequencies that we often speak of as a proxy for “style.” In the 19th century, for example, use of the word “like” in English fiction was fairly consistent and flat compared to other frequent words that fluctuate more from year to year or decade to decade: e.g. “of” and “it”.
So, I don’t think it is a foregone conclusion that novels published in a particular time period are necessarily similar. It is possible that a particularly popular topic might catch on or that a powerful writer’s style might get imitated. It is equally plausible that in a race to “make it new” writers would intentionally avoid working with popular topics or imitating a typical style.
And when it comes to author gender/sex, I don’t think it is obvious that male writers will write like other males and females like other females. The data reveal that even while the majority (roughly 80%) in each class write more like members of their class, many women (~20%) write more like men and many men (~20%) write more like women. Which is to say, there are central tendencies and there are outliers. When it comes to author gender, study after study indicates that the central tendency holds for about 80% of writers. Looking at how these distributions evolve over time seems to me an especially interesting place for ongoing research.
But what we are ultimately dealing with here, in these graphs, are the central tendencies. I continue to believe, as I have argued in Macroanalysis and in The Bestseller Code, that it is only through an understanding of the central tendencies that we can begin to understand and appreciate what it means to be an outlier.