I was unable to attend the DH 2012 meeting in Hamburg, but I recorded my paper as a screencast, and my ever-faithful colleague Glen Worthey kindly delivered it on my behalf. The full presentation can be viewed here as a QuickTime movie.
I could not make it to the DH conference in Hamburg this year (though I did manage to appear virtually). As chair of the Busa Award committee I had the pleasure of announcing that Willard McCarty had won the award. Willard will accept the award in 2013 when DH meets at the University of Nebraska. Here is the text of my announcement which was read today in Hamburg:
I was very pleased to serve as the Chair of the Busa Award committee this cycle, and though I am disappointed that I was unable to travel to Hamburg this year to make this announcement in person, I’m delighted with the end result. I am also delighted that the award will be given at the 2013 conference hosted by the University of Nebraska. Having recently joined the faculty there, I’m quite certain I will be attending next year’s meeting!
The winner of the 2013 Busa Award is a man of legendary kindness and generosity. His contributions to the growth and prominence of Digital Humanities will be familiar to us all. He is a gentleman, a scholar, a philosopher, and a long time fighter for the cause. He is, by one colleague’s accounting, the “Obi-Wan Kenobi” of Digital Humanities. And I must concur that “the force” is strong with this one. Please join me in congratulating Willard McCarty on his selection for the 2013 Busa Award.
In the last chapter of my forthcoming book, I write about the challenges of copyright law and how many a digital humanist is destined to become a 19th-centuryist if the law isn’t reformed to specifically allow for and recognize the importance of “non-expressive” use of digitized content.*
This week the Amicus Brief that I co-authored with Matthew Sag and Jason Schultz was submitted. The brief (see Brief of Digital Humanities and Law Scholars as Amici Curiae in Authors Guild, Inc. Et Al V. Hathitrust Et Al.) includes official endorsement from the Association for Computers and the Humanities as well as the support and signatures of many individual scholars working in the field.
* “Non-expressive use” is Matthew Sag’s far more pleasing formulation of what many have come to call “non-consumptive use.”
In preparation for the publication of my book (Macroanalysis: Digital Methods and Literary History, UIUC Press, 2013), I’ve begun posting some graphs and other data to my (new) website. To get the ball rolling, I have created an interactive “theme viewer” where visitors will find a drop-down menu of the 500 themes I harvested from a corpus of 3,346 19th-century British, Irish, and American novels using “topic modeling” and a series of pre-processing routines that I detail in the book. Each theme is accompanied by a word cloud showing the relative importance of each term to the topic, and each cloud is followed by four graphs showing the distribution of the topic/theme over time and across author genders and author nationalities. Here is a sample of a theme I have labeled “FACTORY AND WORKHOUSE LABOR.” You can click on the thumbnails below for larger images, but the real fun is over at the theme viewer.
For my forthcoming book, which includes a chapter on the uses of topic modeling in literary studies, I wrote the following vignette. It is my imperfect attempt at making the mathematical magic of LDA palatable to the average humanist. Imperfect, but hopefully more fun than plate notation. . .
. . . imagine a quaint town, somewhere in New England perhaps. The town is a writer’s retreat, a place where writers come in the summer months to seek inspiration. Melville is there, Hemingway, Joyce, and Jane Austen just fresh from across the pond. In this mythical town there is a spot popular among the inhabitants; it is a little place called the “LDA Buffet.” Sooner or later all the writers go there to find themes for their novels. . .
One afternoon Herman Melville bumps into Jane Austen at the bocce ball court, and they get to talking.
“You know,” says Austen, “I have not written a thing in weeks.”
“Arrrrgh,” Melville replies, “me neither.”
So hand in hand they stroll down Gibbs Lane to the LDA Buffet. Now, down at the LDA Buffet no one gets fat. The buffet only serves light (leit?) motifs, themes, topics, and tropes (seasonal). Melville hands a plate to Austen, grabs another for himself, and they begin walking down the buffet line. Austen is finicky; she spoons a dainty helping of words out of the bucket marked “dancing.” A slightly larger spoonful of words, she takes from the “gossip” bucket and then a good ladle’s worth of “courtship.”
Melville makes a beeline for the “whaling” trough, and after piling on an Ahab-sized handful of whaling words, he takes a smaller spoonful of “seafaring” and then just a smidgen of “cetological jargon.”
The two companions find a table where they sit and begin putting all the words from their plates into sentences, paragraphs, and chapters.
At one point, Austen interrupts this business: “Oh Herman, you must try a bit of this courtship.”
He takes a couple of words but is not really fond of the topic. Then Austen, to her credit, asks permission before reaching across the table and sticking her fork in Melville’s pile of seafaring words, “just a taste,” she says. This work goes on for a little while; they order a few drinks and after a few hours, voila! Moby Dick and Persuasion are written . . .
[Now, dear reader, our story thus far provides an approximation of the first assumption made in LDA. We assume that documents are constructed out of some finite set of available topics. It is in the next part that things become a little complicated, but fear not, for you shall sample themes both grand and beautiful.]
. . . Filled with a sense of deep satisfaction, the two begin walking back to the lodging house. Along the way, they bump into a blurry-eyed Hemingway, who is just then stumbling out of the Rising Sun Saloon.
Having taken on a bit too much cargo, Hemingway stops on the sidewalk in front of the two literati. Holding out a shaky pointer finger, and then feigning an English accent, Hemingway says: “Stand and Deliver!”
To this, Austen replies, “Oh come now, Mr. Hemingway, must we do this every season?”
More gentlemanly then, Hemingway replies, “My dear Jane, isn’t it pretty to think so. Now if you could please be so kind as to tell me what’s in the offing down at the LDA Buffet.”
Austen turns to Melville and the two writers frown at each other. Hemingway was recently banned from the LDA Buffet. Then Austen turns toward Hemingway and holds up six fingers, the sixth in front of her now pursed lips.
“Six topics!” Hemingway says with surprise, “but what are today’s themes?”
“Now wouldn’t you like to know that, you old sot,” says Melville.
The thousand injuries of Melville, Hemingway had borne as best he could, but when Melville ventured upon insult he vowed revenge. Grabbing their recently completed manuscripts, Hemingway turns and runs toward the south. Just before disappearing down an alleyway, he calls back to the dumbfounded writers: “All my life I’ve looked at words as though I were seeing them for the first time. . . tonight I will do so again! . . . ”
[Hemingway has thus overcome the first challenge of topic modeling. He has a corpus and a set number of topics to extract from it. In reality determining the number of topics to extract from a corpus is a bit trickier. If only we could ask the authors, as Hemingway has done here, things would be so much easier.]
. . . Armed with the manuscripts and the knowledge that there were six topics on the buffet, Hemingway goes to work.
After making backup copies of the manuscripts, he pours all the words from the originals into a giant Italian-leather attaché. He shakes the bag vigorously and then begins dividing its contents into six smaller ceramic bowls, one for each topic. When each of the six bowls is full, Hemingway gets a first glimpse of the topics that the authors might have found at the LDA Buffet. Regrettably, these topics are not very good at all; in fact, they are terrible, a jumble of random unrelated words . . .
[And now for the magic that is Gibbs Sampling.]
. . . Hemingway knows that the two manuscripts were written based on some mixture of topics available at the LDA Buffet. So to improve on this random assignment of words to topic bowls, he goes through the copied manuscripts that he kept as back ups. One at a time, he picks a manuscript and pulls out a word. He examines the word in the context of the other words that are distributed throughout each of the six bowls and in the context of the manuscript from which it was taken. The first word he selects is “heaven,” and at this word he pauses, and asks himself two questions:
- “How much of ‘Topic A,’ as it is presently represented in bowl A, is present in the current document?”
- “Which topic, of all of the topics, has the most ‘heaven’ in it?” . . .
[Here again dear reader, you must take with me a small leap of faith and engage in a bit of further make believe. There are some occult statistics here accessible only to the initiated. Nevertheless, the assumptions of Hemingway and of the topic model are not so far-fetched or hard to understand. A writer goes to his or her imaginary buffet of themes and pulls them out in different proportions. The writer then blends these themes together into a work of art. That we might now be able to discover the original themes by reading the book is not at all amazing. In fact we do it all the time--every time we say that such and such a book is about "whaling" or "courtship." The manner in which the computer (or dear Hemingway) does this is perhaps less elegant and involves a good degree of mathematical magic. Like all magic tricks, however, the explanation for the surprise at the end is actually quite simple: in this case our magician simply repeats the process 10 billion times! NOTE: The real magician behind this LDA story is David Mimno. I sent David a draft, and along with other constructive feedback, he supplied this beautiful line about computational magic.]
. . . As Hemingway examines each word in its turn, he decides based on the calculated probabilities whether that word would be more appropriately moved into one of the other topic bowls. So, if he were examining the word “whale” at a particular moment, he would assume that all of the words in the six bowls except for “whale” were correctly distributed. He’d now consider the words in each of those bowls and in the original manuscripts, and he would choose to move a certain number of occurrences of “whale” to one bowl or another.
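Hemingway’s two questions are, in fact, the two factors of the collapsed Gibbs sampling update: a topic’s weight for a given word is proportional to how prevalent the topic already is in the document, times how strongly the topic already claims that word. Here is a minimal sketch of that single-word update in Python; the count structures, the hyperparameters alpha and beta, and the toy numbers are all illustrative assumptions, not anything specified in the fable:

```python
import random

def sample_topic(word, doc, n_dk, n_kw, n_k, K, V, alpha=0.1, beta=0.01):
    """Draw a new topic for one occurrence of `word` in `doc`.

    n_dk[d][k] -- words in document d currently assigned to topic k
    n_kw[k][w] -- occurrences of word w currently assigned to topic k
    n_k[k]     -- total words currently assigned to topic k
    K, V       -- number of topics and size of the vocabulary
    """
    weights = []
    for k in range(K):
        # "How much of topic k, as presently represented, is in this document?"
        doc_part = n_dk[doc][k] + alpha
        # "Which topic has the most of this word in it?"
        word_part = (n_kw[k][word] + beta) / (n_k[k] + V * beta)
        weights.append(doc_part * word_part)
    # move the word to a bowl with probability proportional to its weight
    return random.choices(range(K), weights=weights)[0]

# toy state: 2 topics, 2 word types; topic 0 dominates both the
# document and the word, so it will almost always be chosen
new_topic = sample_topic(word=0, doc=0,
                         n_dk=[[10, 0]],
                         n_kw=[[10, 0], [0, 10]],
                         n_k=[10, 10],
                         K=2, V=2)
```

(In a full sampler the word’s own current assignment is subtracted from the counts before the draw; that bookkeeping is omitted here for clarity.)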
Fortunately, Hemingway has by now bumped into James Joyce, who arrives bearing a cup of coffee on which a spoon and napkin lay crossed. Joyce, no stranger to bags-of-words, asks with compassion: “Is this going to be a long night?”
“Yes,” Hemingway said, “yes it will, yes.”
Hemingway must now run through this whole process over and over again, many times. Ultimately, his topic bowls reach a steady state in which words no longer need to be reassigned to other bowls; the words have found their proper context.
After pausing for a well-deserved smoke, Hemingway dumps out the contents of the first bowl and finds that it contains the following words:
“whale sea men ship whales penfon air side life bounty night oil natives shark seas beard sailors hands harpoon mast top feet arms teeth length voyage eye heart leviathan islanders flask soul ships fishery sailor sharks company. . . “
He peers into another bowl that looks more like this:
“marriage happiness daughter union fortune heart wife consent affection wishes life attachment lover family promise choice proposal hopes duty alliance affections feelings engagement conduct sacrifice passion parents bride misery reason fate letter mind resolution rank suit event object time wealth ceremony opposition age refusal result determination proposals. . .”
After consulting the contents of each bowl, Hemingway immediately knows what topics were on the menu at the LDA Buffet. And, not only this, Hemingway knows exactly what Melville and Austen selected from the Buffet and in what quantities. He discovers that Moby Dick is composed of 40% whaling, 18% seafaring and 2% gossip (from that little taste he got from Jane) and so on . . .
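Those percentages are nothing more than each manuscript’s final topic counts, normalized (with a little Dirichlet smoothing). A quick sketch; the counts and the value of alpha below are invented for illustration:

```python
def topic_mixture(doc_topic_counts, alpha=0.1):
    """Smoothed topic proportions for one document, from its final bowl counts."""
    K = len(doc_topic_counts)
    total = sum(doc_topic_counts) + K * alpha
    return [(n + alpha) / total for n in doc_topic_counts]

# hypothetical final counts for Moby Dick across the six topics:
# whaling, seafaring, gossip, and three others
mix = topic_mixture([4000, 1800, 200, 2500, 1000, 500])
# mix[0] is roughly 0.40 (whaling), mix[1] roughly 0.18, mix[2] roughly 0.02
```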
[Thus ends the fable.]
For the rest of the (LDA) story, see David Mimno’s Topic Modeling Bibliography.
I created the visualization below using Many Eyes and a data set derived from part-of-speech tagged novels from 19th-century Britain. Found here are the 100 most “aberrant adjectives.” Aberrant here is determined by selecting those words that have the greatest amount of usage deviation (measured by relative frequency) over a 13-decade time period. To qualify, a word must also appear in every decade.
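The measure is easy to reproduce: compute each word’s relative frequency per decade, keep only words attested in every decade, and rank by the spread of those frequencies. A sketch with invented counts (the actual corpus and counts are not reproduced here, and only three decades are shown):

```python
from statistics import pstdev

decades = ["1800s", "1810s", "1820s"]          # the real study spans 13 decades
totals = {"1800s": 100_000, "1810s": 100_000, "1820s": 100_000}  # tokens per decade
counts = {
    "dark":  {"1800s": 500, "1810s": 50,  "1820s": 600},  # volatile usage
    "good":  {"1800s": 400, "1810s": 420, "1820s": 410},  # stable usage
    "weird": {"1800s": 0,   "1810s": 90,  "1820s": 70},   # absent from a decade
}

def deviation(word):
    """Spread of a word's relative frequency across the decades."""
    freqs = [counts[word][d] / totals[d] for d in decades]
    return pstdev(freqs)

# a word must appear in every decade to qualify
qualified = [w for w in counts if all(counts[w][d] > 0 for d in decades)]
most_aberrant = sorted(qualified, key=deviation, reverse=True)
```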
Earlier this week Kathryn Schulz of the New York Times published a rather provocative, challenging, and in my opinion under-researched and over-sensationalized article about my colleague Franco Moretti’s work theorizing a mode of literary analysis that he has termed “distant-reading.” Others have already pointed out some of the errors Schulz made, and I’m fairly certain Moretti would be happy to clarify any confusion Schulz may have about his work if she were to actually interview him (i.e. before paraphrasing him). My interest here is to offer some specific thoughts and some background on “distant-reading,” or what I have preferred to call “macroanalysis.”
The approach to the study of literature that I call macroanalysis, instead of distant-reading (for reasons explained below), is in general ways akin to the social science of economics or, more specifically, macroeconomics. Before the 20th century there wasn’t a defined field of “macroeconomics.” There was, however, microeconomics, which studies the economic behavior of individual consumers and individual businesses. As such, microeconomics can be seen as analogous to the study of individual texts via “close-readings” of the material. Macroeconomics, however, is about the study of the entire economy. It tends toward enumeration and quantification and is in this sense similar to literary inquiries that are not highly theorized: bibliographic studies, biographical studies, literary history, philology, and the enumerative analysis that is the foundation of humanities computing.
By way of an analogy, we might think about interpretive close-readings as corresponding to microeconomics while quantitative macroanalysis corresponds to macroeconomics. Consider, then, that in many ways the study of literary genres or literary periods is a type of macro approach to literature. Say, for example, a scholar specializes in early 20th century poetry. Presumably, this scholar could be called upon to provide sound generalizations, or “distant-readings” about 20th century poetry based on a broad reading of individual works within that period. This would be a sort of “macro-, or distant-, reading” of the period. But this parallel falls short of approximating for literature what macroeconomics is to economics, and it is in this context that I prefer the term macroanalysis over distant-reading. The former term places the emphasis on the quantifiable methodology over the more interpretive practice of “reading.” Broad attempts to generalize about a period or about a genre are frequently just another sort of micro-analysis, in which multiple “cases” or “close-readings” of individual texts are digested before generalizations about them are drawn in very qualitative ways. Macroeconomics, on the other hand, is a more number-based discipline, one grounded in quantitative analysis not qualitative assessments. Moreover, macroeconomics employs a number of quantitative benchmarks for assessing, scrutinizing, and even forecasting the macro-economy. While there is an inherent need for understanding the economy at the micro level, in order to contextualize the macro-results, macroeconomics does not directly involve itself in the specific cases, choosing instead to see the cases in the aggregate, looking to those elements of the specific cases that can be generalized, aggregated, and quantified.
Micro-oriented approaches to literature, highly interpretive readings of literature, remain fundamentally important, just as microeconomics offers important perspectives on the economy. It is the exact interplay between the macro and micro scale that promises a new, enhanced, and perhaps even better understanding of the literary record. The two approaches work in tandem and inform each other. Human interpretation of the “data,” whether it be mined at the macro or micro level, remains essential. While the methods of enquiry, of evidence gathering, are different, they are not antithetical, and they share the same ultimate goal of informing our understanding of the literary record, be it writ large or small. The most fundamental and important difference in the two approaches is that the macroanalytic approach reveals details about texts that are for all intents and purposes unavailable to close-readers of the texts. Writing of John Burrows’s study of Jane Austen’s oeuvre, Julia Flanders points out how Burrows’s computational study brings the most common words such as “the” and “of” into our field of view.
Flanders writes: “His [Burrows] effort, in other words, is to prove the stylistic and semantic significance of these words, to restore them to our field of view. Their absence from our field of view, their non-existence as facts for us, is precisely because they are so much there, so ubiquitous that they seem to make no difference.” (Flanders 2005)
At its most basic, the macroanalytic approach I’m advocating is simply another method of gathering information about texts, of accessing the details. The information is different from what is derived via close reading, but it is not of lesser or greater value to scholars for being such.
Flanders goes on: “Burrows’ approach, although it wears its statistics prominently, foreshadows a subtle shift in the way the computer’s role vis-à-vis the detail is imagined. It foregrounds the computer not as a factual substantiator whose observations are different in kind from our own—because more trustworthy and objective—but as a device that extends the range of our perceptions to phenomena too minutely disseminated for our ordinary reading.” (Flanders 2005)
A macroanalytic approach not only helps us to see and understand the larger “literary economy” but, by means of its scope, to better see and understand the degree to which literature and the individual authors who manufacture the literature respond to or react against literary and cultural trends within their realm of experience. If authors are inevitably influenced by their predecessors, then we may even be able to chart and understand “anxieties of influence” in concrete, quantitative ways.
For historical and stylistic questions in particular, the macroanalytic approach has distinct advantages over the more traditional practice of studying literary periods and genres by means of a close study of “representative” texts. Speaking of his own efforts to provide a more encompassing view of literary history, Franco Moretti writes that “a field this large cannot be understood by stitching together separate bits of knowledge about individual cases, because it isn’t a sum of individual cases: it’s a collective system, that should be grasped as a whole . . .” (2005). To generalize about a “period” of literature based on a study of a relatively small number of books is to take a significant leap. It is less problematic, though, to consider how a macroanalytic study of several thousand texts might lead us to a better understanding of the individual texts. Until recently, we have not had the opportunity to even consider this latter option, and it seems reasonable to imagine that we might, through the application of both approaches, reach a new and better informed understanding of our primary materials. This is what Juri Tynjanov imagined in 1927: “Strictly speaking,” writes Tynjanov, “one cannot study literary phenomena outside of their interrelationships.” Fortunately for me and for scholars such as Moretti, the multitude of interrelationships that overwhelmed and eluded Tynjanov and pushed the limits of close-reading can now be explored with the aid of computation, statistics, and huge digital libraries.
My book on this subject, Literary Studies, the Digital Library, and the Inevitability of Influence, is now under contract. [Update: it will be published in 2013 as Macroanalysis: Digital Methods and Literary History by the University of Illinois Press.]
 I began using the term macroanalysis in late 2003. At the time, Moretti and I were putting together plans for a co-taught course titled “Electronic Data and Literary Theory.” The course we imagined would be a research seminar in the full sense of the word and in our syllabus (dated 11/3/2003) we wrote: “the main purpose of this seminar is methodological rather than historical: learning how to use electronic search systems to analyze large quantities of data — and hence get a new, better understanding of literary and cultural history.” During the course I began work developing a text analysis toolkit that I later called CATools (for Corpus Analysis Tools). In terms of methodology, I was learning a lot at the time from work in corpus linguistics but also discovering that we (literary folks) have an entirely different set of questions. So it made sense to do at least a bit of wheel reinvention. My first experiments with the macroanalytic methodology were constructed around a corpus of Irish-American novels that I had been building since my dissertation research. I presented the first results of this work in Liverpool, at the 2004 meeting of the American Conference for Irish Studies. My paper, titled “Making and Mining a Digital Archive: the Case of the Irish-American West Project,” was part how-to and part results–I’d made one non-trivial discovery about Irish-American literary history based on this new methodology. In the spring of 2005, I offered a more detailed methodological overview of the toolkit at the inaugural meeting of the Text Analysis Developer’s Alliance. An overview of my project was documented on the TADA blog. Later that summer (2005), I presented a more generalized methodological paper titled “A Macro-Economic Model for Literary Research” at the joint meeting of the ACH and ALLC in Victoria, BC. 
It was there that I first articulated the economic analogy that I have come to find most useful for explaining Moretti’s idea of “distant-reading.” In 2006, while I was in residence as Research Scholar in the Digital Humanities at the Stanford Humanities Center, I spent a good deal of time thinking about macro-scale approaches to literature and then writing corpus analysis code. By the summer of 2007, I had developed a whole new toolkit and presented the first significant findings in a paper titled “Macro-Analysis (2.0),” which I delivered at the 2007 Digital Humanities meeting in Illinois. Coincidentally, this was the same conference at which Moretti presented the opening keynote lecture, a paper exploring a corpus of 19th-century novel titles, which would eventually be published in Critical Inquiry. That research utilized software that I had developed in the CATools package.
Rowfont Press of Wichita, Kansas has just published a newly illustrated edition of Charles Driscoll’s memoir Kansas Irish (with my Critical Introduction). The book is available at Amazon. Kansas Irish and the two sequels that follow provide the most complete and authentic rendering of Irish life on the American prairie in the 19th Century.
Several months ago, a group of us from the Stanford Literary Lab wrote and sent out for review the article that now appears in Pamphlet 1 of the Lab. The article, titled “Quantitative Formalism: an Experiment” was submitted, peer-reviewed, and approved for publication in a prestigious literary journal. There was, however, a catch. The editors of the journal asked that we trim the number of charts in the article and that we alter the tone and character of the article to make it less of a narrative. In other words, the article was of a style and content that the editors found to be too foreign to their traditions.
Rather than revise the article in ways we felt would misrepresent its function and intent, we turned to that most traditional of literary forms, the pamphlet. Considering the largely quantitative and digital methodology employed in our research, a pamphlet was a seemingly ironic choice. The most obvious venue would seem to have been the web. After all, aren’t the blog and the web site the pamphlet forms of the digital age (see, for example, Pamphleteers and Web Sites)? We certainly considered an electronic format, and we have posted a PDF version of the essay on our web site, but we decided to print.
Why print the pamphlet? As a literary form, the pamphlet has a long tradition of going against the grain; it’s an alternative form that is malleable. In the pamphlet, George Orwell wrote, “one has complete freedom of expression, including, if one chooses, the freedom to be scurrilous, abusive, and seditious; or, on the other hand, to be more detailed, serious and ‘high-brow’ than is ever possible in a newspaper or in most kinds of periodicals.” It has been used in campaigning and marketing, but most famously, the pamphlet has been employed by political, religious, and social “provocateurs” and “radicals.” Many of these objects have been lost, but the best have been preserved, such as those crafted by talented satirists (Swift), sober thinkers (Paine), and social critics (Voltaire). And they are preserved because they are historical objects, “ephemera” as the librarians say. The pamphlet is an ephemeral object, an object “lasting only for a day.” As “an experiment” our “quantitative formalism” pamphlet is a middle point, a hash mark on the line of time, not an end point or destination, not even a beginning.
It’s interesting to consider how the preferred citation style of literary scholarship, the MLA Style, places emphasis on page reference over time, over moment of publication. With some obvious exceptions, the basic logic here is that what someone says about Shakespeare today has the same validity, the same scholarly purchase, as something said fifty years ago. As a discipline, our “results” tend to be qualitative and interpretive, not bound in time or subject to revision based on the introduction of new evidence. They are, of course, subject to reinterpretation, but that is fundamentally different from the changes wrought by new discoveries. The date-based citation style, on the other hand, places emphasis on points in time and acknowledges the fast-paced and ephemeral nature of certain fields of research. This is most obvious in the sciences, in medicine, for example, where new experiments and new discoveries are constantly and quickly changing the field. One need only eavesdrop on the Digital Humanities twittersphere for a few hours to note the similarities; the pace of change is rapid.
So why not a pamphlet? Why not recognize in form, title, and narrative style that ours is an experiment? It is a bit of research that is useful for today but also something we entirely expect to change. Indeed, we are already working on the next iteration(s), the next experiments. Unlike the neatly closed arguments of our traditional work, these experiments of our Literary Lab open as many doors as they close. In fact, in the course of our research on novel genres, it became apparent that we could, must, go on forever. Each test led us to some new idea, some new direction to explore. There are some discoveries to be sure, and some of our results will likely, hopefully, stand the test of time. But my co-authors and I understand, or perhaps simply “believe,” that there is still much, much more work to be done. If it reads more like a lab report than a traditional essay, it’s because it is a lab report and self-consciously so, intentionally so.
Readers wishing to experience the full pleasure of a touchable paper pamphlet may contact me with their name and address. No charge, while supplies last:-)
I’ve been watching the ngrams flurry online, in twitter, and on various email lists over the last couple of days. Though I think there is great stuff to be learned from Google’s ngram viewer, I’m advising colleagues to exercise restraint and caution. First, we still have a lot to learn about what can and cannot be said, reliably, with this kind of data–especially in terms of “culture.” And second, the eye-candy charts can be deceiving, especially if the data is not analyzed in terms of statistical significance.
It’s not my intention here to be a “nay-sayer” or a wet blanket; as I said, there is much to learn from the Google data, and I too have had fun playing with the ngram viewer. That said, here are a few things that concern me.
- We have no metadata about the texts that are being queried. This is a huge problem. Take the “English Fiction” corpus, for example. What kinds of texts does it contain? Poetry, drama, novels, short stories, etc.? From what countries do these works originate? Is there an even distribution of author genders? Is the sample biased toward a particular genre? What is the distribution of texts over time–at least this last one we can get by downloading the Google data.
- There are lots of “forces” at work on patterns of ngram usage, and without access to the metadata, it will be hard to draw meaningful conclusions about what any of these charts actually mean. To call these charts representations of “culture” is, I think, a dangerous move. Even at this scale, the corpus is not representative of culture–it may be, but we just don’t know. More than likely the corpus is something quite other than representative of culture. It probably represents the collection practices of major research libraries. Again, without the metadata to tell us what these texts are and where they are from, we must be awfully careful about drawing conclusions that reach beyond the scope of the corpus. The leap from corpus to culture is a big one.
- And then there is the problem of “linguistic drift,” a phenomenon mathematically analogous to genetic drift in evolution. In simple terms, some share of the change observed in ngram frequency over time is probably the result of what can be thought of as random mutations. An excellent article about this process can be found here: “Words as alleles: connecting language evolution with Bayesian learners to models of genetic drift.”
- Data noise and bad OCR. Ted Underwood has done a fantastic job of identifying some problems related to the 18th-century long s. It’s a big problem, especially if users aren’t ready to deal with it by substituting s’s for the erroneously OCR’ed f’s. But the long s problem is fairly easy to deal with compared to other types of OCR problems–especially cases where the erroneously OCR’ed word spells another word that is correct: e.g. “fame” and “same”. But even these we can live with at some level. I have made the argument over and over again that at a certain scale these errors become less important, but not unimportant. That is, of course, if the errors are only short-term aberrations, “blips,” and not long-term consistencies. Having spent a good many years looking at bad OCR, I thought it might be interesting to type in a few random character sequences and see what the n-gram viewer would show. The first graph below plots the usage of “asdf” over time. Wow, how do we account for the spike in usage of “asdf” in the 1920s and again in the late 1990s? And what about the seemingly cyclical pattern of rising and falling over time? (HINT: Check the y-axis.)
And here’s another chart comparing the usage of “asdf” to “qwer.”
- And there are any number of these random character sequences. At my request, my three-year-old made up and typed in “asdv”, “mlik”, “puas”, “puase”, “pux”–all of these “ngrams” showed up in the data, and some of them had tantalizing patterns of usage. My daughter’s typing away on my laptop reminded me of Borges’s Library of Babel as well as the old story about how a dozen monkeys typing at random will eventually write all of the great works of literature. It would seem that at least a few of the non-canonical primate masterpieces found their way into Google’s Library of Babel.
- And then there is the legitimate data in the data that we don’t really care about–title pages and library book plates, for example. After running a named-entity extraction algorithm over 2,600 novels from the Internet Archive’s 19th-century fiction collection, I was surprised to see the popularity of “Illinois.” It was one of the most common place names. Turns out that is because all these books came from the University of Illinois and all contained this information in the first page of the scans. It was not because 19th-century authors were all writing about the Land of Lincoln. Follow this link to get a sense of the role that the partner libraries may be playing in the ngram data: Libraries in the Google Data
In other words, it is possible that a lot of the words in the data are not words we actually want in the data. Would it be fair, for example, to say that this chart of the word “Library” in fiction is a fair representation of the interest in libraries in our literary culture? Certainly not. Nor is this chart for the word University an accurate representation of the importance of Universities in our literary culture.
So, these are some problems; some are big and some are small.
Still, I’m all for moving ahead and “playing” with the google data. But we must not be seduced by the graphs or by the notion that this data is quantitative and therefore accurate, precise, objective, representative, etc. What Google has given us with the ngram viewer is a very sophisticated toy, and we must be cautious in using the toy as a tool. The graphs are incredibly seductive, but peaks and valleys must be understood both in terms of the corpus from which they are harvested and in terms of statistical significance (and those light-grey percentages listed on the y-axis).
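To make the statistical-significance point concrete: before reading anything into a rise or fall between two periods, one can at least run a two-proportion z-test on the raw counts. A minimal sketch; the counts below are hypothetical, not drawn from the Google data:

```python
from math import sqrt, erf

def two_proportion_z(k1, n1, k2, n2):
    """z statistic for H0: the ngram's true frequency is the same in both corpora.

    k1, k2 -- occurrences of the ngram; n1, n2 -- total tokens in each corpus.
    """
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

def p_value(z):
    """Two-sided p-value from the standard normal CDF."""
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# hypothetical: 120 hits in a million tokens one decade, 180 the next
z = two_proportion_z(120, 1_000_000, 180, 1_000_000)
# |z| > 1.96 would reject "no change" at the 5% level
```

Even a test like this only tells us whether the counts differ beyond chance; it says nothing about the corpus-to-culture leap discussed above.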