Matthew L. Jockers

Category Archives: Commentary

“A Matter of Scale”

28 Thursday Mar 2013

Posted by Matthew Jockers in Commentary

≈ 1 Comment

Back in November, Julia Flanders and I were invited to stage a debate on the matter of “scale” in digital humanities research for the “Boston Area Days of DH” conference keynote: Julia was to represent the micro scale and I the macro.

Julia and I met up during the MLA conference in January and began sketching out how the talk might go. The first thing we discovered, of course, is that we did not in fact have a real difference of opinion on this matter of scale. Big data, small data, close reading and distant . . . these things matter much less than what a scholar actually decides to do and say. In other words, we were both ultimately interested in new knowledge and not too much concerned with the level of scale necessary to derive that new knowledge.

It is, in short, a false and probably irrelevant debate. And while we agreed on this point in general terms, we discovered in the course of composing and editing the script for our mock debate that there were legitimate nuances that deserved to be put into the light of day. The script from our “debate” and all of the slides are now available via UNL’s open access repository as “A Matter of Scale.”

Julia has posted a few comments on the experience of co-authoring this presentation with me on her blog. Check it out at http://juliaflanders.wordpress.com/2013/03/28/a-matter-of-scale/.

Thoughts on a Literary Lab

04 Friday Jan 2013

Posted by Matthew Jockers in Commentary

≈ 2 Comments

[For the “Theories and Practices of the Literary Lab” roundtable at MLA yesterday, panelists were asked to speak for 5 minutes about their vision of a literary lab. Here are my remarks from that session--#147]

I take the descriptor “literary lab” literally, and to help explain my vision of a literary lab I want to describe how the Stanford Literary Lab that I founded with Franco Moretti came into being.

The Stanford Lab was born out of a class that I taught in the fall of 2009. In that course I assigned 1200 novels and challenged students to explore ways of reading, interpreting, and understanding literature at the macro-scale, as an aggregate system. Writing about the course and the lab that evolved from the course, Chronicle of Higher Ed reporter Marc Parry described it as being based on: “a controversial vision for changing a field still steeped in individual readers’ careful analyses of texts.” That may be how it looks from the outside, but there was no radical agenda then and there is no radical agenda today.

In the class, I asked the students to form into two research teams and to construct research projects around this corpus of 1200 novels. One group chose to investigate whether novel serialization in the 19th century had a detectable/measurable effect upon novelistic style. The other group pursued a project dealing with lexical change over the century, and they wrote a program called “the correlator” that was used to observe and measure semantic change.

After the class ended, two students, one from each group, asked to continue their work as independent study; I agreed. Over the Christmas holiday, word spread to the other students from the seminar, and by the New Year 13 of the original 14 in the seminar wanted to keep working. Instead of 13 independent studies, we formed an ad-hoc seminar group, and I found an empty office on the 4th floor where we began meeting, sometimes for several hours a day. We began calling this ugly, windowless room “the lab.”

Several of the students in my fall class were also in a class with Franco Moretti and the crossover in terms of subject matter and methodology was fairly obvious. As the research deepened and became more nuanced, Franco began joining us for lab sessions and over the next few months other faculty and grad students were sucked into this evolving vortext. It was a very exciting time.

At some point, Franco and I (and perhaps a few of the students) began having conversations about formalizing this notion of a literary lab. I think at the time our motivation had more to do with the need to lobby for space and resources than anything else. As the projects grew and gained more steam, the room got smaller and smaller.

I mention all of this because I do not believe in the “if we build it they will come” notion of digital humanities labs. While it is true that they may come if we build them, it is also true, and I have seen this first hand, that they may come with absolutely no idea of what to do.

First and foremost a lab needs a real and specific research agenda. “Enabling Digital Humanities projects” is not a research agenda for a lab. Advancing or enabling digital humanities oriented research is an appropriate mission for a Center, such as our Center for Digital Humanities Research at Nebraska, but it is not the function of a lab, at least not in the limited literal sense that I imagine it. For me, a lab is not specifically an idea generator; a lab is a place in which ideas move from birth to maturation.

It would be incredible hyperbole to say that we formally articulated any of this in advance. Our lab was the opposite of premeditated. We did, however, have a loosely expressed set of core principles. We agreed that:

1. Our work would be narrowly focused on literary research of a quantitative nature.
2. All research would be collaborative, even when the outcome ends up having a single author.
3. All research would take the form of “experiments,” and we would be open to the possibilities of failure; indeed, we would see failure as new knowledge.
4. The lab would be open to students and faculty at all levels–and, on a more ad hoc basis, to students and faculty from other institutions.
5. In internal and external presentation and publication, we would favor the narrative genre of “lab reports” and attempt to show not only where we arrived, but how we got there.

I continue to believe that these were and are the right principles for a lab even while they conflict with much about the way Universities are organized.

In our lab we discovered that to focus, to really focus on the work, we had to resist and even reject some of the established standards of pedagogy, of academic hierarchy, and of publishing convention. We discovered that we needed to remove instructional barriers both internal and external in order to find and attract the right people and the right expertise. We did not do any of this in order to make a statement. We were not academic radicals bent on defying the establishment.

Nor should I leave you with the impression that we figured anything out. The lab remains an organic entity unified by what some might characterize as a monomaniacal focus on literary research. If there was any genius to what we did, it was in the decision to never compromise our focus, to do whatever was necessary to keep our focus on the literature.

Some Advice for DH Newbies

03 Thursday Jan 2013

Posted by Matthew Jockers in Commentary

≈ 1 Comment

In preparation for a panel session at DH Commons today, I was asked to consider the question: “What one step would you recommend a newcomer to DH take in order to join current conversations in the field?” and then speak for 3 – 4 minutes. Below is the 5 minute version of my answer. . .

With all the folks assembled here today, I figured we’d get some pretty good advice about what constitutes DH and how to get started, so I decided that I ought to say something different from what I’d expect others to say. I have two specific bits of advice, and I suppose that the second bit will be a little more controversial.

But let me foreground that by going back to 2011 when my colleague Glen Worthey and I organized the annual Digital Humanities conference at Stanford around a big tent, summer of love theme. We flung open the flaps on the Big Tent and said come on in . . . We believed, and we continue to believe, that there is a wide range of very good and very interesting work being done in “digital humanities.” We felt that we needed a big tent to enclose all that good work. But let’s face it, inside the big tent it’s a freakin’ three ring circus. Some folks like clowns and others want to see the jugglers. The DH conference is not like a conference on Victorian Literature. And that, of course, is the charm and the curse.

While it probably makes sense for a newcomer to poke around and gain some sense of the “disciplinary” history of the “field,” I think the best advice I can give a newcomer is to spend very little time thinking about what DH is and spend as much time as possible doing DH.

It doesn’t really matter if the world looks at your research and says of it: “Ahhhh, that’s some good Digital Humanities, man.” What matters, of course, is if the world looks at it and says, “Holy cow, I never thought of Jane Austen in those terms” or “Wow, this is really strong evidence that the development of Roman road networks was entirely dependent upon seasonal shifts.” The bottom line is that it is the work you do that is important, not how it gets defined.

So I suppose that is a bit of advice for newcomers, but let me answer the question more concretely and more controversially by speaking as someone who hangs out in one particular ring of the DH Big Tent.

If you understand what I have said thus far, then you know that it is impossible to speak for the Digital Humanities as a group, so, for some, what I am going to say is going to sound controversial. And if I hear that one of you newcomers ran out at the end of this session yelling “Jockers thinks I need to learn a programming language to be a digital humanist,” then I’m going to have to kick your butt right out of the big tent!

Learning a programming language, though, is precisely what I am going to recommend. I’m even going to go a bit further and suggest a specific language called R.

By recommending that you learn R, I am also advocating learning some statistics. R is primarily a language used for statistical computing, which is more or less the flavor of Digital Humanities that I practice. If you want to be able to read and understand the work that we do in this particular ring of the big tent you will need some understanding of statistics; if you want to be able to replicate and expand upon this kind of work, you are going to need to know a programming language, so I recommend learning some R and killing two birds with one stone.

And for those of you who don’t get turned on by p-values, for loops, and latent Dirichlet allocation, I think learning a programming language is still in your best interests. Even if you never write a single line of code, knowing a programming language will allow you to talk to the natives, that is, you will be able to converse with the non-humanities programmers and web masters and DBAs and systems administrators, who we so often collaborate with as digital humanists. Whether or not you program yourself, you will need to translate your humanistic questions into terms that a non-specialist in the humanities will understand. You may never write poetry in Italian, but if you are going to travel in Rome, you should at least know how to ask for directions to the Colosseum.

DH2012 and the 2013 Busa Award

20 Friday Jul 2012

Posted by Matthew Jockers in Commentary

I could not make it to the DH conference in Hamburg this year (though I did manage to appear virtually). As chair of the Busa Award committee I had the pleasure of announcing that Willard McCarty had won the award. Willard will accept the award in 2013 when DH meets at the University of Nebraska. Here is the text of my announcement which was read today in Hamburg:

I was very pleased to serve as the Chair of the Busa Award committee this cycle, and though I am disappointed that I was unable to travel to Hamburg this year to make this announcement in person, I’m delighted with the end result. I am also delighted that the award will be given at the 2013 conference hosted by the University of Nebraska. Having recently joined the faculty there, I’m quite certain I will be attending next year’s meeting!

The winner of the 2013 Busa Award is a man of legendary kindness and generosity. His contributions to the growth and prominence of Digital Humanities will be familiar to us all. He is a gentleman, a scholar, a philosopher, and a long time fighter for the cause. He is, by one colleague’s accounting, the “Obi-Wan Kenobi” of Digital Humanities. And I must concur that “the force” is strong with this one. Please join me in congratulating Willard McCarty on his selection for the 2013 Busa Award.

Amicus Brief Filed

09 Monday Jul 2012

Posted by Matthew Jockers in Commentary

In the last chapter of my forthcoming book, I write about the challenges of copyright law and how many a digital humanist is destined to become a 19th-centuryist if the law isn’t reformed to specifically allow for and recognize the importance of “non-expressive” use of digitized content.*

This week the Amicus Brief that I co-authored with Matthew Sag and Jason Schultz was submitted. The brief (see Brief of Digital Humanities and Law Scholars as Amici Curiae in Authors Guild, Inc. Et Al V. Hathitrust Et Al.) includes official endorsement from the Association for Computers and the Humanities as well as the support and signature of many individual scholars working in the field.

* “Non-expressive use” is Matthew Sag’s far more pleasing formulation of what many have come to call “non-consumptive use.”

On Distant Reading and Macroanalysis

01 Friday Jul 2011

Posted by Matthew Jockers in Commentary

≈ 6 Comments

Earlier this week Kathryn Schulz of the New York Times published a rather provocative, challenging, and in my opinion under-researched and over-sensationalized article about my colleague Franco Moretti’s work theorizing a mode of literary analysis that he has termed “distant-reading.” Others have already pointed out some of the errors Schulz made, and I’m fairly certain Moretti would be happy to clarify any confusion Schulz may have about his work if she were to actually interview him (i.e. before paraphrasing him). My interest here is to offer some specific thoughts and some background on “distant-reading” or what I have preferred to call “macroanalysis.”[1]

The approach to the study of literature that I call macroanalysis, instead of distant-reading (for reasons explained below), is in general ways akin to the social-science of economics or, more specifically, macroeconomics. Before the 20th century there wasn’t a defined field of “Macroeconomics.” There was, however, microeconomics, which studies the economic behavior of individual consumers and individual businesses. As such, microeconomics can be seen as analogous to the study of individual texts via “close-readings” of the material. Macroeconomics, however, is about the study of the entire economy. It tends toward enumeration and quantification and is in this sense similar to literary inquiries that are not highly theorized: bibliographic studies, biographical studies, literary history, philology, and the enumerative analysis that is the foundation of humanities computing.

By way of an analogy, we might think about interpretive close-readings as corresponding to microeconomics while quantitative macroanalysis corresponds to macroeconomics. Consider, then, that in many ways the study of literary genres or literary periods is a type of macro approach to literature. Say, for example, a scholar specializes in early 20th century poetry. Presumably, this scholar could be called upon to provide sound generalizations, or “distant-readings” about 20th century poetry based on a broad reading of individual works within that period. This would be a sort of “macro-, or distant-, reading” of the period. But this parallel falls short of approximating for literature what macroeconomics is to economics, and it is in this context that I prefer the term macroanalysis over distant-reading. The former term places the emphasis on the quantifiable methodology over the more interpretive practice of “reading.” Broad attempts to generalize about a period or about a genre are frequently just another sort of micro-analysis, in which multiple “cases” or “close-readings” of individual texts are digested before generalizations about them are drawn in very qualitative ways. Macroeconomics, on the other hand, is a more number-based discipline, one grounded in quantitative analysis not qualitative assessments. Moreover, macroeconomics employs a number of quantitative benchmarks for assessing, scrutinizing, and even forecasting the macro-economy. While there is an inherent need for understanding the economy at the micro level, in order to contextualize the macro-results, macroeconomics does not directly involve itself in the specific cases, choosing instead to see the cases in the aggregate, looking to those elements of the specific cases that can be generalized, aggregated, and quantified.

Micro-oriented approaches to literature, highly interpretive readings of literature, remain fundamentally important, just as microeconomics offers important perspectives on the economy. It is the exact interplay between the macro and micro scale that promises a new, enhanced, and perhaps even better understanding of the literary record. The two approaches work in tandem and inform each other. Human interpretation of the “data,” whether it be mined at the macro or micro level, remains essential. While the methods of enquiry, of evidence gathering, are different, they are not antithetical, and they share the same ultimate goal of informing our understanding of the literary record, be it writ large or small. The most fundamental and important difference in the two approaches is that the macroanalytic approach reveals details about texts that are for all intents and purposes unavailable to close-readers of the texts. Writing of John Burrows’s study of Jane Austen’s oeuvre, Julia Flanders points out how Burrows’s computational study brings the most common words such as “the” and “of” into our field of view.

Flanders writes: “His [Burrows] effort, in other words, is to prove the stylistic and semantic significance of these words, to restore them to our field of view. Their absence from our field of view, their non-existence as facts for us, is precisely because they are so much there, so ubiquitous that they seem to make no difference.” (Flanders 2005)

At its most basic, the macroanalytic approach I’m advocating is simply another method of gathering information about texts, of accessing the details. The information is different from what is derived via close reading, but it is not of lesser or greater value to scholars for being such.

Flanders goes on: “Burrows’ approach, although it wears its statistics prominently, foreshadows a subtle shift in the way the computer’s role vis-à-vis the detail is imagined. It foregrounds the computer not as a factual substantiator whose observations are different in kind from our own—because more trustworthy and objective—but as a device that extends the range of our perceptions to phenomena too minutely disseminated for our ordinary reading.” (Flanders 2005)

A macroanalytic approach not only helps us to see and understand the larger “literary economy” but, by means of its scope, to better see and understand the degree to which literature and the individual authors who manufacture the literature respond to or react against literary and cultural trends within their realm of experience. If authors are inevitably influenced by their predecessors, then we may even be able to chart and understand “anxieties of influence” in concrete, quantitative ways.

For historical and stylistic questions in particular, the macroanalytic approach has distinct advantages over the more traditional practice of studying literary periods and genres by means of a close study of “representative” texts. Speaking of his own efforts to provide a more encompassing view of literary history, Franco Moretti writes that “a field this large cannot be understood by stitching together separate bits of knowledge about individual cases, because it isn’t a sum of individual cases: it’s a collective system, that should be grasped as a whole . . .” (2005). To generalize about a “period” of literature based on a study of a relatively small number of books is to take a significant leap. It is less problematic, though, to consider how a macroanalytic study of several thousand texts might lead us to a better understanding of the individual texts. Until recently, we have not had the opportunity to even consider this latter option, and it seems reasonable to imagine that we might, through the application of both approaches, reach a new and better informed understanding of our primary materials. This is what Juri Tynjanov imagined in 1927: “Strictly speaking,” writes Tynjanov, “one cannot study literary phenomena outside of their interrelationships.” Fortunately for me and for scholars such as Moretti, the multitude of interrelationships that overwhelmed and eluded Tynjanov and pushed the limits of close-reading can now be explored with the aid of computation, statistics and huge digital libraries.

My book on this subject, Literary Studies, the Digital Library, and the Inevitability of Influence, is now under contract [Update: will be published in 2013 as Macroanalysis: Digital Methods and Literary History by University of Illinois Press].

[1] I began using the term macroanalysis in late 2003. At the time, Moretti and I were putting together plans for a co-taught course titled “Electronic Data and Literary Theory.” The course we imagined would be a research seminar in the full sense of the word and in our syllabus (dated 11/3/2003) we wrote: “the main purpose of this seminar is methodological rather than historical: learning how to use electronic search systems to analyze large quantities of data — and hence get a new, better understanding of literary and cultural history.” During the course I began work developing a text analysis toolkit that I later called CATools (for Corpus Analysis Tools). In terms of methodology, I was learning a lot at the time from work in corpus linguistics but also discovering that we (literary folks) have an entirely different set of questions. So it made sense to do at least a bit of wheel reinvention. My first experiments with the macroanalytic methodology were constructed around a corpus of Irish-American novels that I had been building since my dissertation research. I presented the first results of this work in Liverpool, at the 2004 meeting of the American Conference for Irish Studies. My paper, titled “Making and Mining a Digital Archive: the Case of the Irish-American West Project,” was part how-to and part results–I’d made one non-trivial discovery about Irish-American literary history based on this new methodology. In the spring of 2005, I offered a more detailed methodological overview of the toolkit at the inaugural meeting of the Text Analysis Developer’s Alliance. An overview of my project was documented on the TADA blog. Later that summer (2005), I presented a more generalized methodological paper titled “A Macro-Economic Model for Literary Research” at the joint meeting of the ACH and ALLC in Victoria, BC. 
It was there that I first articulated the economic analogy that I have come to find most useful for explaining Moretti’s idea of “distant-reading.” In 2006, while I was in residence as Research Scholar in the Digital Humanities at the Stanford Humanities Center, I spent a good deal of time thinking about macro-scale approaches to literature and then writing corpus analysis code. By the summer of 2007, I had developed a whole new toolkit and presented the first significant findings in a paper titled “Macro-Analysis (2.0)” which I delivered at the 2007 Digital Humanities meeting in Illinois. Coincidentally, this was the same conference at which Moretti presented the opening keynote lecture, a paper exploring a corpus of 19th century novel titles, which would eventually be published in Critical Inquiry. That research utilized software that I had developed in the CATools package.

Kansas Irish Reprint

10 Thursday Mar 2011

Posted by Matthew Jockers in Commentary

Rowfont Press of Wichita, Kansas has just published a newly illustrated edition of Charles Driscoll’s memoir Kansas Irish (with my Critical Introduction). The book is available at Amazon. Kansas Irish and the two sequels that follow provide the most complete and authentic rendering of Irish life on the American prairie in the 19th Century.

On Pamphleteering and Pamphlet One

03 Thursday Feb 2011

Posted by Matthew Jockers in Commentary

Several months ago, a group of us from the Stanford Literary Lab wrote and sent out for review the article that now appears in Pamphlet 1 of the Lab. The article, titled “Quantitative Formalism: an Experiment,” was submitted, peer-reviewed, and approved for publication in a prestigious literary journal. There was, however, a catch. The editors of the journal asked that we trim the number of charts in the article and that we alter the tone and character of the article to make it less of a narrative. In other words, the article was of a style and content that the editors found to be too foreign to their traditions.

Rather than revise the article in ways we felt would misrepresent its function and intent, we turned to that most traditional of literary forms, the pamphlet. Considering the largely quantitative and digital methodology employed in our research, a pamphlet was a seemingly ironic choice. The most obvious venue would seem to have been the web. After all, aren’t the blog and the web site the pamphlet forms of the digital age (see, for example, Pamphleteers and Web Sites)? We certainly considered an electronic format, and we have posted a pdf version of the essay on our web site, but we decided to print.

Why print the pamphlet? As a literary form, the pamphlet has a long tradition of going against the grain; it’s an alternative form that is malleable. In the pamphlet, George Orwell wrote, “one has complete freedom of expression, including, if one chooses, the freedom to be scurrilous, abusive, and seditious; or, on the other hand, to be more detailed, serious and ‘high-brow’ than is ever possible in a newspaper or in most kinds of periodicals.” It has been used in campaigning and marketing, but most famously, the pamphlet has been employed by political, religious, and social “provocateurs” and “radicals.” Many of these objects have been lost, but the best have been preserved, such as those crafted by talented satirists (Swift), sober thinkers (Paine), and social critics (Voltaire). And they are preserved because they are historical objects, “ephemera” as the librarians say. The pamphlet is an ephemeral object, an object “lasting only for a day.” As “an experiment” our “quantitative formalism” pamphlet is a middle point, a hash mark on the line of time, not an end point or destination, not even a beginning.

It’s interesting to consider how the preferred citation style of literary scholarship, the MLA Style, places emphasis on page reference over time, over moment of publication. With some obvious exceptions, the basic logic here is that what someone says about Shakespeare today has the same validity, the same scholarly purchase, as something said fifty years ago. As a discipline, our “results” tend to be qualitative and interpretive, not bound in time or subject to revision based on the introduction of new evidence. They are, of course, subject to reinterpretation, but that is fundamentally different from the changes wrought by new discoveries. The date-based citation style, on the other hand, places emphasis on points in time and acknowledges the fast-paced and ephemeral nature of certain fields of research. This is most obvious in the sciences, in medicine, for example, where new experiments and new discoveries are constantly and quickly changing the field. One need only eavesdrop on the Digital Humanities twittersphere for a few hours to note the similarities; the pace of change is rapid.

So why not a pamphlet? Why not recognize in both form, title, and narrative style that ours is an experiment? It is a bit of research that is useful for today but also something we entirely expect to change. Indeed, we are already working on the next iteration(s), the next experiments. Unlike the neatly closed arguments of our traditional work, these experiments of our Literary Lab open as many doors as they close. In fact, in the course of our research on novel genres, it became apparent that we could, must, go on forever. Each test led us to some new idea, some new direction to explore. There are some discoveries to be sure, and some of our results will likely, hopefully, stand the test of time. But my co-authors and I understand, or perhaps simply “believe,” that there is still much, much more work to be done. If it reads more like a lab report than a traditional essay, it’s because it is a lab report and self-consciously so, intentionally so.

Readers wishing to experience the full pleasure of a touchable paper pamphlet may contact me with their name and address. No charge, while supplies last:-)

Unigrams, and bigrams, and trigrams, oh my

22 Wednesday Dec 2010

Posted by Matthew Jockers in Commentary, Text-Mining

≈ 1 Comment

I’ve been watching the ngrams flurry online, in twitter, and on various email lists over the last couple of days. Though I think there is great stuff to be learned from Google’s ngram viewer, I’m advising colleagues to exercise restraint and caution. First, we still have a lot to learn about what can and cannot be said, reliably, with this kind of data–especially in terms of “culture.” And second, the eye candy charts can be deceiving, especially if the data is not analyzed in terms of statistical significance.

It’s not my intention here to be a “nay-sayer” or a wet blanket. As I said, there is much to learn from the Google data, and I too have had fun playing with the ngram viewer. That said, here are a few things that concern me.

  1. We have no metadata about the texts that are being queried. This is a huge problem. Take the “English Fiction” corpus, for example. What kinds of texts does it contain? Poetry, drama, novels, short stories, etc.? From what countries do these works originate? Is there an even distribution of author genders? Is the sample biased toward a particular genre? What is the distribution of texts over time? At least this last one we can get from downloading the Google data.
  2. There are lots of “forces” at work on patterns of ngram usage, and without access to the metadata, it will be hard to draw meaningful conclusions about what any of these charts actually mean. To call these charts representations of “culture” is, I think, a dangerous move. Even at this scale, the corpus is not representative of culture–it may be, but we just don’t know. More than likely the corpus is something quite other than representative of culture. It probably represents the collection practices of major research libraries. Again, without the metadata to tell us what these texts are and where they are from, we must be awfully careful about drawing conclusions that reach beyond the scope of the corpus. The leap from corpus to culture is a big one.
  3. And then there is the problem of “linguistic drift”, a phenomenon mathematically analogous to genetic drift in evolution. In simple terms, some share of the change observed in ngram frequency over time is probably the result of what can be thought of as random mutations. An excellent article about this process can be found here–>“Words as alleles: connecting language evolution with Bayesian learners to models of genetic drift”.
  4. Data noise and bad OCR. Ted Underwood has done a fantastic job of identifying some problems related to the 18th-century long s. It’s a big problem, especially if users aren’t ready to deal with it by substituting f’s for s’s. But the long s problem is fairly easy to deal with compared to other types of OCR problems, especially cases where the erroneous OCR’ed word spells another word that is correct: e.g. “fame” and “same”. But even these we can live with at some level. I have made the argument over and over again that at a certain scale these errors become less important, but not unimportant. That is, of course, if the errors are only short-term aberrations, “blips,” and not long-term consistencies. Having spent a good many years looking at bad OCR, I thought it might be interesting to type in a few random character sequences and see what the ngram viewer would show. The first graph below plots the usage of “asdf” over time. Wow, how do we account for the spike in usage of “asdf” in the 1920s and again in the late 1990s? And what about the seemingly cyclical pattern of rising and falling over time? (HINT: Check the y-axis.)

    [Chart: ngram viewer plot of “asdf” over time]

    And here’s another chart comparing the usage of “asdf” to “qwer.”

    [Chart: ngram viewer plot comparing “asdf” and “qwer”]

    And there are any number of these random character sequences. At my request, my three-year-old made up and typed in “asdv”, “mlik”, “puas”, “puase”, “pux”; all of these “ngrams” showed up in the data, and some of them had tantalizing patterns of usage. My daughter’s typing away on my laptop reminded me of Borges’s “Library of Babel” as well as the old story about how a dozen monkeys typing at random will eventually write all of the great works of literature. It would seem that at least a few of the non-canonical primate masterpieces found their way into Google’s Library of Babel.

  5. And then there is the legitimate data in the data that we don’t really care about: title pages and library book plates, for example. After running a named entity extraction algorithm over 2600 novels from the Internet Archive’s 19th-century fiction collection, I was surprised to see the popularity of “Illinois.” It was one of the most common place names. It turns out that all these books came from the University of Illinois and all contained this information on the first page of the scans. It was not because 19th-century authors were all writing about the Land of Lincoln. Follow this link to get a sense of the role that the partner libraries may be playing in the ngram data: Libraries in the Google Data

    In other words, it is possible that a lot of the words in the data are not words we actually want in the data. Would it be fair, for example, to say that this chart of the word “Library” in fiction accurately represents the interest in libraries in our literary culture? Certainly not. Nor is this chart for the word “University” an accurate representation of the importance of universities in our literary culture.
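
The substitution mentioned in item 4 (searching for the f-spelling alongside the s-spelling) can be sketched in a few lines of Python. This is only a rough illustration, not a description of how anyone actually cleaned the Google data. It assumes that the long s was printed in initial and medial positions but not at the end of a word, which glosses over real typographic exceptions, and the function names are made up for this example.

```python
def long_s_misreading(word: str) -> str:
    """Return the spelling OCR tends to produce when a long s is read as 'f'.

    Assumption: only a word-final 's' was printed as a round s, so every
    other lowercase 's' is swapped for 'f'. Capital 'S' is left alone.
    """
    if not word:
        return word
    body, last = word[:-1], word[-1]
    return body.replace("s", "f") + last


def query_variants(word: str) -> list[str]:
    """List the spellings worth querying: the word itself plus its misreading."""
    misread = long_s_misreading(word)
    return [word] if misread == word else [word, misread]


if __name__ == "__main__":
    for term in ["same", "presumption", "the"]:
        print(term, "->", query_variants(term))
```

Note that the first example lands on exactly the hard case described above: the misreading of “same” is “fame,” a perfectly good word, so summing the two spellings would mix real uses of “fame” in with the OCR errors.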

So, these are some problems; some are big and some are small.

Still, I’m all for moving ahead and “playing” with the Google data. But we must not be seduced by the graphs or by the notion that this data is quantitative and therefore accurate, precise, objective, representative, etc. What Google has given us with the ngram viewer is a very sophisticated toy, and we must be cautious in using the toy as a tool. The graphs are incredibly seductive, but peaks and valleys must be understood both in terms of the corpus from which they are harvested and in terms of statistical significance (and those light-grey percentages listed on the y-axis).
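
To make the point about the y-axis concrete, here is a small Python sketch of the arithmetic behind a “spike.” The yearly counts are toy numbers invented purely for illustration; they are not taken from the Google data, and the function names are my own. The sketch only shows how a peak that looks dramatic on a chart can correspond to a handful of occurrences in a very large corpus.

```python
# Toy numbers invented for illustration only; NOT taken from the Google data.
# Each year maps to (occurrences of the ngram, total tokens that year).
yearly_counts = {
    1918: (2, 900_000_000),
    1919: (3, 950_000_000),
    1920: (9, 1_000_000_000),
    1921: (4, 1_050_000_000),
}


def percent_of_corpus(matches: int, total_tokens: int) -> float:
    """Share of the year's tokens taken up by the ngram, as a percentage."""
    return 100.0 * matches / total_tokens


def peak_year(counts: dict[int, tuple[int, int]]) -> int:
    """Year in which the ngram's percentage of the corpus is highest."""
    return max(counts, key=lambda year: percent_of_corpus(*counts[year]))


if __name__ == "__main__":
    for year, (matches, total) in sorted(yearly_counts.items()):
        share = percent_of_corpus(matches, total)
        print(f"{year}: {matches} occurrences, {share:.10f}% of tokens")
    print("peak year:", peak_year(yearly_counts))
```

In this made-up series the line would roughly triple between 1919 and 1920, yet the peak rests on nine occurrences. Whether a rise like that means anything is a question for a significance test, not for the eye.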

SEASR Grant

02 Tuesday Nov 2010

Posted by Matthew Jockers in Commentary, Text-Mining

≈ Leave a Comment

This month a group of researchers at Stanford, the University of Illinois, the University of Maryland, and George Mason were awarded a $790,000 grant from the Mellon Foundation to advance the prior work of the SEASR project. I’ll be serving as the overall Project Director and as one of the researchers in the Stanford component of the grant. In this phase of the SEASR project, we will focus on leveraging the existing SEASR infrastructure in support of four “use cases.” But “use case” hardly describes the research-intensive nature of the proposed work, nor does it capture its strongly humanistic bias. Each partner has committed to a specific research project and each has the expressed goal of advancing humanities research and publishing their results. I’d like to emphasize this point about advancing humanities research.

This grant represents an important step beyond the tool building, QA, and UI testing stages of software development. All too often, it seems, our digital humanities projects devote a great deal of time, money, and labor to infrastructure and prototyping, and then the results languish in the great sea of hammers without nails. Sure, a few journeymen carpenters stick these tools in their belts and hammer away, but more effort seems to go into building the tools than into using them, and the resources sit around gathering dust while humanities research marches on in the time-tested modes with which we are most familiar.

Of course, I don’t mean this to be a criticism of the tool builders or the tools built. The TAPOR project, for example, offers many useful text analysis widgets, and I frequently send my colleagues and students there for quick and dirty text analysis. And just last month I had occasion to use and cite Stéfan Sinclair’s Voyeur application. I was thrilled to have Voyeur at my fingertips; it provided a quick and easy way to do exactly what I wanted.

But often, the analytic tasks involved in our projects are multifaceted and cannot be addressed by any one tool. Instead, these projects involve “flows” in which our “humanistic” data travels through a series of analytic “filters” and comes out on the other end in some altered form. The TAPOR project attempts to be a virtual text analysis “workbench” in which the craftsman can slide a project around the bench from one tool to the next. This model works well for smallish projects but is not robust enough for large-scale projects and, despite some significant interface improvements over the years, remains, for me at least, a bit clunky. I find it great for quick tasks with one or two texts, but inefficient for processing multiple texts or multiple processes. Part of the TAPOR mission was to develop a suite of tools that could be used by the average, ordinary humanist: which is to say, the humanist without any real technical chops. It succeeds on that front to be sure.

SEASR offers an alternative approach, and what it provides in terms of processing power and computational elegance it gives up in terms of ease of use and transparency. The SEASR “interface” is one that involves constructing modular “workflows” in which each module corresponds to some computational task. These modules are linked together such that one process feeds into the next, and the business of “sliding” a project around from one tool to another on the virtual workbench is taken over by the workflow manager.
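
For readers who have not seen this style of processing, the general idea of a modular workflow can be sketched in Python. To be clear, this is not SEASR code and does not use SEASR’s actual components or interface; the module names and the `run_workflow` helper are hypothetical, chosen only to show how the output of one step becomes the input of the next.

```python
from collections import Counter
from functools import reduce
from typing import Any, Callable

# A "module" here is just a function that takes one input and returns one output.
Module = Callable[[Any], Any]


def lowercase(text: str) -> str:
    """Normalize case so that 'The' and 'the' are counted together."""
    return text.lower()


def tokenize(text: str) -> list[str]:
    """Split the text into tokens on whitespace."""
    return text.split()


def count_tokens(tokens: list[str]) -> Counter:
    """Tally how often each token occurs."""
    return Counter(tokens)


def run_workflow(modules: list[Module], data: Any) -> Any:
    """Feed the data through each module in order, passing each result along."""
    return reduce(lambda value, module: module(value), modules, data)


# Hypothetical three-step workflow: normalize, tokenize, count.
workflow = [lowercase, tokenize, count_tokens]
counts = run_workflow(workflow, "The whale and the sea")

if __name__ == "__main__":
    print(counts.most_common(3))
```

Here `run_workflow` plays the part of the workflow manager: swapping a module in or out means editing the list, not carrying the data by hand from one tool to the next.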

In this grant we have specifically deemphasized UI development in favor of output, in favor of “results” in the humanities sense of the word. As we write in the proposal, “The main emphasis of the project will be on developing, coordinating, and investigating the research questions posed by the participating humanities scholars.” The scholars in this project include myself and Franco Moretti at Stanford, Dan Cohen at GMU, Tanya Clement at the University of Maryland, and Ted Underwood and John Unsworth, both of UIUC. On the technical end, we have Michael Welge and Loretta Auvil of the Automated Learning Group at the National Center for Supercomputing Applications.

As the project gets rolling, I will have more to post about the specific research questions we are each addressing and the ongoing results of our work. . .

♣ Contact

Matthew L. Jockers
Department of English
University of Nebraska-Lincoln
Twitter: @mljockers

This work by Matthew Jockers is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
