I’ve been watching the ngrams flurry online, on Twitter, and on various email lists over the last couple of days. Though I think there is great stuff to be learned from Google’s ngram viewer, I’m advising colleagues to exercise restraint and caution. First, we still have a lot to learn about what can and cannot be said, reliably, with this kind of data–especially in terms of “culture.” And second, the eye-candy charts can be deceiving, especially if the data is not analyzed in terms of statistical significance.

It’s not my intention here to be a “nay-sayer” or a wet blanket. As I said, there is much to learn from the Google data, and I too have had fun playing with the ngram viewer. That said, here are a few things that concern me.

  1. We have no metadata about the texts that are being queried. This is a huge problem. Take the “English Fiction” corpus, for example. What kinds of texts does it contain? Poetry, Drama, Novels, Short Stories, etc.? From what countries do these works originate? Is there an even distribution of author genders? Is the sample biased toward a particular genre? What is the distribution of texts over time? At least this last one we can get from downloading the Google data.
  2. There are lots of “forces” at work on patterns of ngram usage, and without access to the metadata, it will be hard to draw meaningful conclusions about what any of these charts actually mean. To call these charts representations of “culture” is, I think, a dangerous move. Even at this scale, we cannot assume the corpus is representative of culture–it may be, but we just don’t know. More than likely the corpus represents something quite different: the collection practices of major research libraries. Again, without the metadata to tell us what these texts are and where they are from, we must be awfully careful about drawing conclusions that reach beyond the scope of the corpus. The leap from corpus to culture is a big one.
  3. And then there is the problem of “linguistic drift,” a phenomenon mathematically analogous to genetic drift in evolution. In simple terms, some share of the change observed in ngram frequency over time is probably the result of what can be thought of as random mutations rather than meaningful cultural shifts (a toy simulation of this sort of drift appears just after this list). An excellent article about this process can be found here: “Words as alleles: connecting language evolution with Bayesian learners to models of genetic drift.”
  4. Data noise and bad OCR. Ted Underwood has done a fantastic job of identifying some problems related to the 18th-century long s. It’s a big problem, especially if users aren’t ready to deal with it by substituting f’s for s’s in their queries (a small sketch of that workaround also appears just after this list). But the long s problem is fairly easy to deal with compared to other types of OCR problems–especially cases where the erroneously OCR’ed word spells another word that is correct: e.g. “fame” and “same.” But even these we can live with at some level. I have made the argument over and over again that at a certain scale these errors become less important, but not unimportant. That is, of course, if the errors are only short-term aberrations, “blips,” and not long-term consistencies. Having spent a good many years looking at bad OCR, I thought it might be interesting to type in a few random character sequences and see what the ngram viewer would show. The first graph below plots the usage of “asdf” over time. Wow, how do we account for the spike in usage of “asdf” in the 1920s and again in the late 1990s? And what about the seemingly cyclical pattern of rising and falling over time? (HINT: Check the y-axis.)

    [chart: “asdf” usage over time]

    And here’s another chart comparing the usage of “asdf” to “qwer.”

    [chart: “asdf” vs. “qwer” usage over time]
    And there are any number of these random character sequences. At my request, my three-year-old made up and typed in “asdv”, “mlik”, “puas”, “puase”, “pux”–all of these “ngrams” showed up in the data, and some of them had tantalizing patterns of usage. My daughter’s typing away on my laptop reminded me of Borges’s Library of Babel as well as the old story about how a dozen monkeys typing at random will eventually write all of the great works of literature. It would seem that at least a few of the non-canonical primate masterpieces found their way into Google’s Library of Babel.

  5. And then there is the legitimate data in the data that we don’t really care about–title pages and library book plates, for example. After running a Named Entity Extraction algorithm over 2,600 novels from the Internet Archive’s 19th-century fiction collection, I was surprised to see the popularity of “Illinois.” It was one of the most common place names. It turns out that is because all these books came from the University of Illinois and all contained this information on the first page of the scans. It was not because 19th-century authors were all writing about the Land of Lincoln. Follow this link to get a sense of the role that the partner libraries may be playing in the ngram data: Libraries in the Google Data

    In other words, it is possible that a lot of the words in the data are not words we actually want in the data. Would it be fair, for example, to say that this chart of the word “Library” in fiction is an accurate representation of the interest in libraries in our literary culture? Certainly not. Nor is this chart for the word “University” an accurate representation of the importance of universities in our literary culture.
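
To make point 3 a bit more concrete, here is a minimal, hypothetical sketch of neutral drift: a Wright–Fisher-style resampling in which a word variant’s relative frequency changes from one “generation” of usage to the next by chance alone. Nothing in the simulation prefers one variant over another, yet the trajectory still wanders up and down in ways that look temptingly like cultural signal. The population size, starting frequency, and number of generations below are arbitrary choices of mine, not values from the paper linked above.

```python
# Toy illustration of neutral drift: no selection, no "cultural force,"
# yet repeated sampling alone produces runs and trends in relative frequency.
import random

def simulate_drift(start_freq=0.01, population=5000, generations=200, seed=42):
    """Wright-Fisher-style resampling of a word variant's relative frequency."""
    random.seed(seed)
    freq = start_freq
    trajectory = [freq]
    for _ in range(generations):
        # Each "generation" of usage is a fresh sample drawn from the last one.
        successes = sum(1 for _ in range(population) if random.random() < freq)
        freq = successes / population
        trajectory.append(freq)
    return trajectory

if __name__ == "__main__":
    traj = simulate_drift()
    print("start: %.4f  end: %.4f  peak: %.4f" % (traj[0], traj[-1], max(traj)))
```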
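
And to make the long s problem in point 4 concrete: because the 18th-century long s (ſ) is routinely OCR’ed as an “f,” a word like “success” tends to show up in the data as “fuccefs.” One common workaround is to generate the f-spelling of a query term and sum its counts with the modern spelling. The sketch below is only that workaround in miniature; the function names and the toy count table are my own inventions for illustration, not part of the Google data or any Google API, and the spelling rule is a simplification of actual long-s typography.

```python
import re

def long_s_variant(word):
    """Replace every non-final 's' with 'f', approximating how OCR renders
    the long s (which was generally not used at the end of a word)."""
    return re.sub(r"s(?!$)", "f", word)

def combined_counts(word, counts_by_year):
    """Sum yearly counts for a word and its long-s spelling.
    `counts_by_year` is a hypothetical dict: {ngram: {year: count}}."""
    variant = long_s_variant(word)
    years = set(counts_by_year.get(word, {})) | set(counts_by_year.get(variant, {}))
    return {y: counts_by_year.get(word, {}).get(y, 0)
               + counts_by_year.get(variant, {}).get(y, 0)
            for y in sorted(years)}

# Invented counts: "success" mis-OCR'ed as "fuccefs" in 18th-century scans.
toy = {"success": {1750: 30, 1850: 480}, "fuccefs": {1750: 410, 1850: 5}}
print(long_s_variant("success"))        # -> "fuccefs"
print(combined_counts("success", toy))  # -> {1750: 440, 1850: 485}
```

Of course, as the “fame”/“same” example shows, this kind of folding cuts both ways: sum “fame” and “same” and you conflate two legitimate words, which is exactly why the deeper OCR problems are not so easy to script away.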

So, these are some problems; some are big and some are small.

Still, I’m all for moving ahead and “playing” with the Google data. But we must not be seduced by the graphs or by the notion that because this data is quantitative it is therefore accurate, precise, objective, representative, etc. What Google has given us with the ngram viewer is a very sophisticated toy, and we must be cautious in using the toy as a tool. The graphs are incredibly seductive, but peaks and valleys must be understood both in terms of the corpus from which they are harvested and in terms of statistical significance (and those light-grey percentages listed on the y-axis).
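
To give a rough sense of what “statistical significance” could mean for those peaks and valleys, here is a minimal sketch that treats an ngram’s counts in two different years as two binomial samples and runs a plain two-proportion z-test. The counts below are invented for illustration; with the downloadable Google data you would plug in the ngram’s match count and the corpus’s total token count for each year. This is only one blunt way of asking whether an apparent spike is distinguishable from noise, not a full analysis.

```python
import math

def two_proportion_z(count1, total1, count2, total2):
    """Two-proportion z-test: is the relative frequency of an ngram in year 1
    significantly different from its relative frequency in year 2?"""
    p1, p2 = count1 / total1, count2 / total2
    pooled = (count1 + count2) / (total1 + total2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total1 + 1 / total2))
    z = (p1 - p2) / se
    # Two-sided p-value from the normal approximation.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Invented numbers: occurrences of some ngram vs. total tokens in two years.
z, p = two_proportion_z(count1=120, total1=80_000_000,
                        count2=95, total2=60_000_000)
print("z = %.2f, p = %.3f" % (z, p))  # a spike that looks big may not be significant
```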