Unfolding the Novel

I’m excited to announce a new research project dubbed “Unfolding the Novel” (which is a play on both “paper” and “protein” folding). In collaboration with colleagues from the Stanford Literary Lab and Arizona State University, and in partnership with researchers from the Book Genome project of BookLamp.com, we have begun work that traces stylistic and thematic change across 300 years of fiction, from 1700 to 2000! Today UNL posted a news release announcing the partnership and some of our goals.

The primary goal of the project is to map major stylistic and thematic trends over 300 years of creative literature. To facilitate this work, BookLamp is providing access to a large store of metadata pertaining to mostly 20th and 21st century works of fiction. This data will be combined with similar data we have already compiled from the 19th century and new data we are curating now from the 18th century. The research team will not access the actual books but will explore at the macroscale in ways that are similar to what one can do with the data provided to researchers at the Google Ngrams project. A major difference, however, is that the data in the “Unfolding” project is highly curated, limited to fiction in English, and enriched with additional metadata including information about both gender and genre distribution.

Our initial data set consists of token frequency information that has been aggregated across one or more global metadata facets, including but not limited to publication year, author gender, and book genre. Such data includes, for example, a table containing the year-to-year mean relative frequencies of the most common words in the corpus (e.g., the relative frequencies of the words “the,” “a,” “an,” “of,” “and,” etc.).
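For the curious, the sort of aggregation behind such a table can be sketched in a few lines of Python (the corpus here is an invented toy stand-in, not the project data):

```python
from collections import Counter

# Toy stand-in for the corpus: (publication_year, text) pairs.
corpus = [
    (1801, "the whale and the sea and a ship"),
    (1801, "a storm of the north"),
    (1802, "the heart of a man and the sea"),
]

def relative_frequencies(text):
    """Relative frequency of each token within a single book."""
    tokens = text.split()
    return {w: c / len(tokens) for w, c in Counter(tokens).items()}

def mean_relative_frequency(word, year):
    """Mean relative frequency of `word` across all books published in `year`."""
    freqs = [relative_frequencies(t).get(word, 0.0)
             for y, t in corpus if y == year]
    return sum(freqs) / len(freqs)

# A miniature version of the year-to-year table for a few common words.
table = {year: {w: mean_relative_frequency(w, year) for w in ("the", "a", "of")}
         for year in (1801, 1802)}
```

The real data set is, of course, vastly larger and carries the additional gender and genre facets, but the shape of the table is the same.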

I’ll be reporting on the project here as things progress, but for now, it’s back to the drudgery of the text mines. . . ;-)

Thoughts on a Literary Lab

[For the “Theories and Practices of the Literary Lab” roundtable at MLA yesterday (session #147), panelists were asked to speak for 5 minutes about their vision of a literary lab. Here are my remarks from that session.]

I take the descriptor “literary lab” literally, and to help explain my vision of a literary lab I want to describe how the Stanford Literary Lab that I founded with Franco Moretti came into being.

The Stanford Lab was born out of a class that I taught in the fall of 2009. In that course I assigned 1200 novels and challenged students to explore ways of reading, interpreting, and understanding literature at the macro-scale, as an aggregate system. Writing about the course and the lab that evolved from it, Chronicle of Higher Ed reporter Marc Parry described it as being based on “a controversial vision for changing a field still steeped in individual readers’ careful analyses of texts.” That may be how it looks from the outside, but there was no radical agenda then, and there is no radical agenda today.

In the class, I asked the students to form into two research teams and to construct research projects around this corpus of 1200 novels. One group chose to investigate whether novel serialization in the 19th century had a detectable/measurable effect upon novelistic style. The other group pursued a project dealing with lexical change over the century, and they wrote a program called “the correlator” that was used to observe and measure semantic change.

After the class ended, two students, one from each group, asked to continue their work as independent study; I agreed. Over the Christmas holiday, word spread to the other students from the seminar, and by the New Year 13 of the original 14 wanted to keep working. Instead of 13 independent studies, we formed an ad-hoc seminar group, and I found an empty office on the 4th floor where we began meeting, sometimes for several hours a day. We began calling this ugly, windowless room “the lab.”

Several of the students in my fall class were also in a class with Franco Moretti, and the crossover in terms of subject matter and methodology was fairly obvious. As the research deepened and became more nuanced, Franco began joining us for lab sessions, and over the next few months other faculty and grad students were sucked into this evolving vortex. It was a very exciting time.

At some point, Franco and I (and perhaps a few of the students) began having conversations about formalizing this notion of a literary lab. I think at the time our motivation had more to do with the need to lobby for space and resources than anything else. As the projects grew and gained more steam, the room got smaller and smaller.

I mention all of this because I do not believe in the “if we build it, they will come” notion of digital humanities labs. While it is true that they may come if we build them, it is also true, and I have seen this firsthand, that they may come with absolutely no idea of what to do.

First and foremost, a lab needs a real and specific research agenda. “Enabling Digital Humanities projects” is not a research agenda for a lab. Advancing or enabling digital humanities oriented research is an appropriate mission for a Center, such as our Center for Digital Research in the Humanities at Nebraska, but it is not the function of a lab, at least not in the limited literal sense that I imagine it. For me, a lab is not specifically an idea generator; a lab is a place in which ideas move from birth to maturation.

It would be incredible hyperbole to say that we formally articulated any of this in advance. Our lab was the opposite of premeditated. We did, however, have a loosely expressed set of core principles. We agreed that:

1. Our work would be narrowly focused on literary research of a quantitative nature.
2. All research would be collaborative, even when the outcome ends up having a single author.
3. All research would take the form of “experiments,” and we would be open to the possibilities of failure; indeed, we would see failure as new knowledge.
4. The lab would be open to students and faculty at all levels–and, on a more ad hoc basis, to students and faculty from other institutions.
5. In internal and external presentation and publication, we would favor the narrative genre of “lab reports” and attempt to show not only where we arrived, but how we got there.

I continue to believe that these were and are the right principles for a lab even while they conflict with much about the way Universities are organized.

In our lab we discovered that to focus, to really focus on the work, we had to resist and even reject some of the established standards of pedagogy, of academic hierarchy, and of publishing convention. We discovered that we needed to remove institutional barriers, both internal and external, in order to find and attract the right people and the right expertise. We did not do any of this in order to make a statement. We were not academic radicals bent on defying the establishment.

Nor should I leave you with the impression that we figured anything out. The lab remains an organic entity unified by what some might characterize as a monomaniacal focus on literary research. If there was any genius to what we did, it was in the decision to never compromise our focus, to do whatever was necessary to keep our focus on the literature.

Some Advice for DH Newbies

In preparation for a panel session at DH Commons today, I was asked to consider the question: “What one step would you recommend a newcomer to DH take in order to join current conversations in the field?” and then speak for 3–4 minutes. Below is the 5-minute version of my answer. . .

With all the folks assembled here today, I figured we’d get some pretty good advice about what constitutes DH and how to get started, so I decided that I ought to say something different from what I’d expect others to say. I have two specific bits of advice, and I suppose that the second bit will be a little more controversial.

But let me preface that by going back to 2011, when my colleague Glen Worthey and I organized the annual Digital Humanities conference at Stanford around a “big tent, summer of love” theme. We flung open the flaps on the Big Tent and said come on in . . . We believed, and we continue to believe, that there is a wide range of very good and very interesting work being done in “digital humanities.” We felt that we needed a big tent to enclose all that good work. But let’s face it, inside the big tent it’s a freakin’ three ring circus. Some folks like clowns and others want to see the jugglers. The DH conference is not like a conference on Victorian Literature. And that, of course, is the charm and the curse.

While it probably makes sense for a newcomer to poke around and gain some sense of the “disciplinary” history of the “field,” I think the best advice I can give a newcomer is to spend very little time thinking about what DH is and as much time as possible doing DH.

It doesn’t really matter if the world looks at your research and says of it: “Ahhhh, that’s some good Digital Humanities, man.” What matters, of course, is if the world looks at it and says, “Holy cow, I never thought of Jane Austen in those terms” or “Wow, this is really strong evidence that the development of Roman road networks was entirely dependent upon seasonal shifts.” The bottom line is that it is the work you do that is important, not how it gets defined.

So I suppose that is a bit of advice for newcomers, but let me answer the question more concretely and more controversially by speaking as someone who hangs out in one particular ring of the DH Big Tent.

If you understand what I have said thus far, then you know that it is impossible to speak for the Digital Humanities as a group, so, for some, what I am going to say is going to sound controversial. And if I hear that one of you newcomers ran out at the end of this session yelling “Jockers thinks I need to learn a programming language to be a digital humanist,” then I’m going to have to kick your butt right out of the big tent!

Learning a programming language, though, is precisely what I am going to recommend. I’m even going to go a bit further and suggest a specific language called R.

By recommending that you learn R, I am also advocating learning some statistics. R is primarily a language used for statistical computing, which is more or less the flavor of Digital Humanities that I practice. If you want to be able to read and understand the work that we do in this particular ring of the big tent you will need some understanding of statistics; if you want to be able to replicate and expand upon this kind of work, you are going to need to know a programming language, so I recommend learning some R and killing two birds with one stone.

And for those of you who don’t get turned on by p-values, for loops, and latent Dirichlet allocation, I think learning a programming language is still in your best interests. Even if you never write a single line of code, knowing a programming language will allow you to talk to the natives; that is, you will be able to converse with the non-humanities programmers and webmasters and DBAs and systems administrators with whom we so often collaborate as digital humanists. Whether or not you program yourself, you will need to translate your humanistic questions into terms that a non-specialist in the humanities will understand. You may never write poetry in Italian, but if you are going to travel in Rome, you should at least know how to ask for directions to the Colosseum.

DH2012 and the 2013 Busa Award

I could not make it to the DH conference in Hamburg this year (though I did manage to appear virtually). As chair of the Busa Award committee I had the pleasure of announcing that Willard McCarty had won the award. Willard will accept the award in 2013 when DH meets at the University of Nebraska. Here is the text of my announcement which was read today in Hamburg:

I was very pleased to serve as the Chair of the Busa Award committee this cycle, and though I am disappointed that I was unable to travel to Hamburg this year to make this announcement in person, I’m delighted with the end result. I am also delighted that the award will be given at the 2013 conference hosted by the University of Nebraska. Having recently joined the faculty there, I’m quite certain I will be attending next year’s meeting!

The winner of the 2013 Busa Award is a man of legendary kindness and generosity. His contributions to the growth and prominence of Digital Humanities will be familiar to us all. He is a gentleman, a scholar, a philosopher, and a longtime fighter for the cause. He is, by one colleague’s accounting, the “Obi-Wan Kenobi” of Digital Humanities. And I must concur that “the force” is strong with this one. Please join me in congratulating Willard McCarty on his selection for the 2013 Busa Award.

Amicus Brief Filed

In the last chapter of my forthcoming book, I write about the challenges of copyright law and how many a digital humanist is destined to become a 19th-centuryist if the law isn’t reformed to specifically allow for and recognize the importance of “non-expressive” use of digitized content.*

This week the Amicus Brief that I co-authored with Matthew Sag and Jason Schultz was submitted. The brief (see Brief of Digital Humanities and Law Scholars as Amici Curiae in Authors Guild, Inc. et al. v. HathiTrust et al.) includes official endorsement from the Association for Computers and the Humanities as well as the support and signatures of many individual scholars working in the field.

* “Non-expressive use” is Matthew Sag’s far more pleasing formulation of what many have come to call “non-consumptive use.”

Macroanalysis

In preparation for the publication of my book (Macroanalysis: Digital Methods and Literary History, UIUC Press, 2013), I’ve begun posting some graphs and other data to my (new) website. To get the ball rolling, I have created an interactive “theme viewer” where visitors will find a drop-down menu of the 500 themes I harvested from a corpus of 3,346 19th-century British, Irish, and American novels using “topic modeling” and a series of pre-processing routines that I detail in the book. Each theme is accompanied by a word cloud showing the relative importance of each term to the topic, and each cloud is followed by four graphs showing the distribution of the topic/theme over time and across author genders and author nationalities. Here is a sample of a theme I have labeled “FACTORY AND WORKHOUSE LABOR.” You can click on the thumbnails below for larger images, but the real fun is over at the theme viewer.

The LDA Buffet is Now Open; or, Latent Dirichlet Allocation for English Majors

For my forthcoming book, which includes a chapter on the uses of topic modeling in literary studies, I wrote the following vignette. It is my imperfect attempt at making the mathematical magic of LDA palatable to the average humanist. Imperfect, but hopefully more fun than plate notation. . .

. . . imagine a quaint town, somewhere in New England perhaps. The town is a writer’s retreat, a place they come to in the summer months to seek inspiration. Melville is there, Hemingway, Joyce, and Jane Austen just fresh from across the pond. In this mythical town there is a spot popular among the inhabitants; it is a little place called the “LDA Buffet.” Sooner or later all the writers go there to find themes for their novels. . .

One afternoon Herman Melville bumps into Jane Austen at the bocce ball court, and they get to talking.

“You know,” says Austen, “I have not written a thing in weeks.”

“Arrrrgh,” Melville replies, “me neither.”

So hand in hand they stroll down Gibbs Lane to the LDA Buffet. Now, down at the LDA Buffet no one gets fat. The buffet only serves light (leit?) motifs, themes, topics, and tropes (seasonal). Melville hands a plate to Austen, grabs another for himself, and they begin walking down the buffet line. Austen is finicky; she spoons a dainty helping of words out of the bucket marked “dancing.” A slightly larger spoonful of words, she takes from the “gossip” bucket and then a good ladle’s worth of “courtship.”

Melville makes a bee line for the “whaling” trough, and after piling on an Ahab-sized handful of whaling words, he takes a smaller spoonful of “seafaring” and then just a smidgen of “cetological jargon.”

The two companions find a table where they sit and begin putting all the words from their plates into sentences, paragraphs, and chapters.

At one point, Austen interrupts this business: “Oh Herman, you must try a bit of this courtship.”

He takes a couple of words but is not really fond of the topic. Then Austen, to her credit, asks permission before reaching across the table and sticking her fork into Melville’s pile of seafaring words. “Just a taste,” she says. This work goes on for a little while; they order a few drinks, and after a few hours, voilà! Moby Dick and Persuasion are written . . .

[Now, dear reader, our story thus far provides an approximation of the first assumption made in LDA. We assume that documents are constructed out of some finite set of available topics. It is in the next part that things become a little complicated, but fear not, for you shall sample themes both grand and beautiful.]

. . . Filled with a sense of deep satisfaction, the two begin walking back to the lodging house. Along the way, they bump into a blurry-eyed Hemingway, who is just then stumbling out of the Rising Sun Saloon.

Having taken on a bit too much cargo, Hemingway stops on the sidewalk in front of the two literati. Holding out a shaky pointer finger, and then feigning an English accent, Hemingway says: “Stand and Deliver!”

To this, Austen replies, “Oh come now, Mr. Hemingway, must we do this every season?”

More gentlemanly then, Hemingway replies, “My dear Jane, isn’t it pretty to think so. Now if you could please be so kind as to tell me what’s in the offing down at the LDA Buffet.”

Austen turns to Melville and the two writers frown at each other. Hemingway was recently banned from the LDA Buffet. Then Austen turns toward Hemingway and holds up six fingers, the sixth in front of her now pursed lips.

“Six topics!” Hemingway says with surprise, “but what are today’s themes?”

“Now wouldn’t you like to know, you old sot,” says Melville.

The thousand injuries of Melville, Hemingway had borne as best he could, but when Melville ventured upon insult, he vowed revenge. Grabbing their recently completed manuscripts, Hemingway turns and runs toward the south. Just before disappearing down an alleyway, he calls back to the dumbfounded writers: “All my life I’ve looked at words as though I were seeing them for the first time. . . tonight I will do so again! . . . ”

[Hemingway has thus overcome the first challenge of topic modeling. He has a corpus and a set number of topics to extract from it. In reality determining the number of topics to extract from a corpus is a bit trickier. If only we could ask the authors, as Hemingway has done here, things would be so much easier.]

. . . Armed with the manuscripts and the knowledge that there were six topics on the buffet, Hemingway goes to work.

After making backup copies of the manuscripts, he then pours all the words from the originals into a giant Italian-leather attaché. He shakes the bag vigorously and then begins dividing its contents into six smaller ceramic bowls, one for each topic. When each of the six bowls is full, Hemingway gets a first glimpse of the topics that the authors might have found at the LDA Buffet. Regrettably, these topics are not very good at all; in fact, they are terrible, a jumble of random unrelated words . . .

[And now for the magic that is Gibbs Sampling.]

. . . Hemingway knows that the two manuscripts were written based on some mixture of topics available at the LDA Buffet. So to improve on this random assignment of words to topic bowls, he goes through the copied manuscripts that he kept as backups. One at a time, he picks a manuscript and pulls out a word. He examines the word in the context of the other words that are distributed throughout each of the six bowls and in the context of the manuscript from which it was taken. The first word he selects is “heaven,” and at this word he pauses and asks himself two questions:

  1. “How much of ‘Topic A,’ as it is presently represented in bowl A, is present in the current document?”
  2. “Which topic, of all of the topics, has the most ‘heaven’ in it?” . . .
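For readers who want a peek behind the curtain, Hemingway’s two questions correspond, roughly, to the two factors in the word-reassignment probability used in Gibbs sampling for LDA. Here is a toy sketch in Python (the counts and the smoothing constants alpha and beta are invented purely for illustration):

```python
def topic_score(doc_topic_count, topic_word_count, topic_total,
                alpha=0.1, beta=0.01, vocab_size=1000):
    """Unnormalized probability that a word belongs to a given topic (bowl).

    doc_topic_count  -- question 1: how much of this topic is in the document?
    topic_word_count -- question 2: how much of this word is in the topic?
    alpha and beta are smoothing constants; vocab_size is the number of
    distinct words in the corpus.
    """
    return ((doc_topic_count + alpha) * (topic_word_count + beta)
            / (topic_total + vocab_size * beta))

# "heaven" scored against two candidate bowls, using made-up counts:
score_a = topic_score(doc_topic_count=12, topic_word_count=30, topic_total=500)
score_b = topic_score(doc_topic_count=3, topic_word_count=2, topic_total=800)
# Bowl A wins on both questions, so "heaven" is more likely to land there.
```

The product of the two answers, suitably normalized across all the bowls, gives the probability of tossing the word into any given bowl.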

[Here again dear reader, you must take with me a small leap of faith and engage in a bit of further make believe. There are some occult statistics here accessible only to the initiated. Nevertheless, the assumptions of Hemingway and of the topic model are not so far-fetched or hard to understand. A writer goes to his or her imaginary buffet of themes and pulls them out in different proportions. The writer then blends these themes together into a work of art. That we might now be able to discover the original themes by reading the book is not at all amazing. In fact we do it all the time--every time we say that such and such a book is about "whaling" or "courtship." The manner in which the computer (or dear Hemingway) does this is perhaps less elegant and involves a good degree of mathematical magic. Like all magic tricks, however, the explanation for the surprise at the end is actually quite simple: in this case our magician simply repeats the process 10 billion times! NOTE: The real magician behind this LDA story is David Mimno. I sent David a draft, and along with other constructive feedback, he supplied this beautiful line about computational magic.]

. . . As Hemingway examines each word in its turn, he decides based on the calculated probabilities whether that word would be more appropriately moved into one of the other topic bowls. So, if he were examining the word “whale” at a particular moment, he would assume that all of the words in the six bowls except for “whale” were correctly distributed. He’d now consider the words in each of those bowls and in the original manuscripts, and he would choose to move a certain number of occurrences of “whale” to one bowl or another.

Fortunately, Hemingway has by now bumped into James Joyce, who arrives bearing a cup of coffee on which a spoon and napkin lie crossed. Joyce, no stranger to bags-of-words, asks with compassion: “Is this going to be a long night?”

“Yes,” Hemingway said, “yes it will, yes.”

Hemingway must now run through this whole process over and over again many times. Ultimately, his topic bowls reach a steady state in which words no longer need to be reassigned to other bowls; the words have found their proper context.
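Taken together, the random initialization and the repeated reassignment that Hemingway performs amount to a miniature collapsed Gibbs sampler. A bare-bones sketch in Python (toy corpus, topic count, and smoothing constants all invented for illustration; real implementations are far more refined):

```python
import random

random.seed(42)

# Toy manuscripts standing in for Moby Dick and Persuasion.
docs = [
    "whale ship sea whale harpoon sea".split(),
    "marriage heart wife courtship heart".split(),
    "whale sea ship marriage heart".split(),
]
K = 2                                   # number of topics (six fingers, scaled down)
vocab = sorted({w for d in docs for w in d})
alpha, beta = 0.1, 0.01                 # smoothing constants

# Random initialization: every word tossed into a random bowl.
assignments = [[random.randrange(K) for _ in d] for d in docs]
doc_topic = [[0] * K for _ in docs]     # topic counts per manuscript
topic_word = [{w: 0 for w in vocab} for _ in range(K)]  # the bowls
topic_total = [0] * K
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = assignments[d][i]
        doc_topic[d][k] += 1
        topic_word[k][w] += 1
        topic_total[k] += 1

for _ in range(200):                    # many passes over the manuscripts
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            # Take the word out of its current bowl before re-scoring it.
            k = assignments[d][i]
            doc_topic[d][k] -= 1
            topic_word[k][w] -= 1
            topic_total[k] -= 1
            # Hemingway's two questions, one score per bowl.
            scores = [(doc_topic[d][t] + alpha) * (topic_word[t][w] + beta)
                      / (topic_total[t] + len(vocab) * beta) for t in range(K)]
            # Draw a new bowl in proportion to the scores.
            r = random.uniform(0, sum(scores))
            k, acc = K - 1, 0.0
            for t in range(K):
                acc += scores[t]
                if r <= acc:
                    k = t
                    break
            assignments[d][i] = k
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1

# Document-topic mixtures: what each author took from the buffet.
mixtures = [[c / sum(row) for c in row] for row in doc_topic]
```

On a corpus this small the result is noisy, but run over thousands of novels the bowls settle into recognizable themes like the “whaling” and “courtship” word lists that follow.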

After pausing for a well-deserved smoke, Hemingway dumps out the contents of the first bowl and finds that it contains the following words:

“whale sea men ship whales penfon air side life bounty night oil natives shark seas beard sailors hands harpoon mast top feet arms teeth length voyage eye heart leviathan islanders flask soul ships fishery sailor sharks company. . . “

He peers into another bowl that looks more like this:

“marriage happiness daughter union fortune heart wife consent affection wishes life attachment lover family promise choice proposal hopes duty alliance affections feelings engagement conduct sacrifice passion parents bride misery reason fate letter mind resolution rank suit event object time wealth ceremony opposition age refusal result determination proposals. . .”

After consulting the contents of each bowl, Hemingway immediately knows what topics were on the menu at the LDA Buffet. And, not only this, Hemingway knows exactly what Melville and Austen selected from the Buffet and in what quantities. He discovers that Moby Dick is composed of 40% whaling, 18% seafaring and 2% gossip (from that little taste he got from Jane) and so on . . .

[Thus ends the fable.]

For the rest of the (LDA) story, see David Mimno’s Topic Modeling Bibliography.

Aberrant Adjectives in 19th Century Novels

I created the visualization below using Many Eyes and a data set derived from part-of-speech tagged novels from 19th-century Britain. Found here are the 100 most “aberrant” adjectives. Aberrance here is determined by selecting those words that show the greatest usage deviation (measured by relative frequency) over a 13-decade period. To qualify, a word must also appear in every decade.
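The selection criterion can be sketched roughly as follows in Python (the decade counts are invented toy numbers, and the deviation here is a plain standard deviation, which may differ in detail from the measure behind the visualization):

```python
from statistics import pstdev

# Toy data: raw counts per decade for each adjective, plus decade token totals.
decade_totals = [10_000, 12_000, 11_000]
counts = {
    "happy":    [50, 48, 52],   # steady usage across decades
    "electric": [1, 20, 90],    # sharply shifting usage -> "aberrant"
    "galvanic": [0, 5, 9],      # missing from a decade -> disqualified
}

def aberrance(word):
    """Deviation of a word's relative frequency across decades.

    Returns None for words that fail the must-appear-in-every-decade rule.
    """
    if any(c == 0 for c in counts[word]):
        return None
    rel = [c / t for c, t in zip(counts[word], decade_totals)]
    return pstdev(rel)

# Rank the qualifying words from most to least aberrant.
ranked = sorted((w for w in counts if aberrance(w) is not None),
                key=aberrance, reverse=True)
```

The top 100 words by this kind of ranking, computed over the full 13 decades, are what the visualization displays.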


On Distant Reading and Macroanalysis

Earlier this week Kathryn Schulz of the New York Times published a rather provocative, challenging, and, in my opinion, under-researched and over-sensationalized article about my colleague Franco Moretti’s work theorizing a mode of literary analysis that he has termed “distant-reading.” Others have already pointed out some of the errors Schulz made, and I’m fairly certain Moretti would be happy to clarify any confusion Schulz may have about his work if she were to actually interview him (i.e., before paraphrasing him). My interest here is to offer some specific thoughts and some background on “distant-reading,” or what I have preferred to call “macroanalysis.”[1]

The approach to the study of literature that I call macroanalysis, instead of distant-reading (for reasons explained below), is in general ways akin to the social science of economics or, more specifically, to macroeconomics. Before the 20th century there wasn’t a defined field of “macroeconomics.” There was, however, microeconomics, which studies the economic behavior of individual consumers and individual businesses. As such, microeconomics can be seen as analogous to the study of individual texts via “close-readings” of the material. Macroeconomics, by contrast, is the study of the entire economy. It tends toward enumeration and quantification and is in this sense similar to literary inquiries that are not highly theorized: bibliographic studies, biographical studies, literary history, philology, and the enumerative analysis that is the foundation of humanities computing.

By way of analogy, we might think of interpretive close-readings as corresponding to microeconomics, while quantitative macroanalysis corresponds to macroeconomics. Consider, then, that in many ways the study of literary genres or literary periods is a type of macro approach to literature. Say, for example, a scholar specializes in early 20th-century poetry. Presumably, this scholar could be called upon to provide sound generalizations, or “distant-readings,” about 20th-century poetry based on a broad reading of individual works within that period. This would be a sort of “macro-,” or “distant-,” reading of the period. But this parallel falls short of approximating for literature what macroeconomics is to economics, and it is in this context that I prefer the term macroanalysis over distant-reading. The former term places the emphasis on the quantifiable methodology over the more interpretive practice of “reading.” Broad attempts to generalize about a period or a genre are frequently just another sort of micro-analysis, in which multiple “cases,” or “close-readings,” of individual texts are digested before generalizations about them are drawn in very qualitative ways. Macroeconomics, on the other hand, is a more number-based discipline, one grounded in quantitative analysis rather than qualitative assessments. Moreover, macroeconomics employs a number of quantitative benchmarks for assessing, scrutinizing, and even forecasting the macro-economy. While there is an inherent need to understand the economy at the micro level in order to contextualize the macro results, macroeconomics does not directly involve itself in the specific cases, choosing instead to see the cases in the aggregate, looking to those elements of the specific cases that can be generalized, aggregated, and quantified.

Micro-oriented approaches to literature, highly interpretive readings of literature, remain fundamentally important, just as microeconomics offers important perspectives on the economy. It is the exact interplay between the macro and micro scale that promises a new, enhanced, and perhaps even better understanding of the literary record. The two approaches work in tandem and inform each other. Human interpretation of the “data,” whether it be mined at the macro or micro level, remains essential. While the methods of enquiry, of evidence gathering, are different, they are not antithetical, and they share the same ultimate goal of informing our understanding of the literary record, be it writ large or small. The most fundamental and important difference in the two approaches is that the macroanalytic approach reveals details about texts that are for all intents and purposes unavailable to close-readers of the texts. Writing of John Burrows’s study of Jane Austen’s oeuvre, Julia Flanders points out how Burrows’s computational study brings the most common words such as “the” and “of” into our field of view.

Flanders writes: “His [Burrows] effort, in other words, is to prove the stylistic and semantic significance of these words, to restore them to our field of view. Their absence from our field of view, their non-existence as facts for us, is precisely because they are so much there, so ubiquitous that they seem to make no difference.” (Flanders 2005)

At its most basic, the macroanalytic approach I’m advocating is simply another method of gathering information about texts, of accessing the details. The information is different from what is derived via close reading, but it is not of lesser or greater value to scholars for being such.

Flanders goes on: “Burrows’ approach, although it wears its statistics prominently, foreshadows a subtle shift in the way the computer’s role vis-à-vis the detail is imagined. It foregrounds the computer not as a factual substantiator whose observations are different in kind from our own—because more trustworthy and objective—but as a device that extends the range of our perceptions to phenomena too minutely disseminated for our ordinary reading.” (Flanders 2005)

A macroanalytic approach not only helps us to see and understand the larger “literary economy” but, by means of its scope, to better see and understand the degree to which literature and the individual authors who manufacture the literature respond to or react against literary and cultural trends within their realm of experience. If authors are inevitably influenced by their predecessors, then we may even be able to chart and understand “anxieties of influence” in concrete, quantitative ways.

For historical and stylistic questions in particular, the macroanalytic approach has distinct advantages over the more traditional practice of studying literary periods and genres by means of a close study of “representative” texts. Speaking of his own efforts to provide a more encompassing view of literary history, Franco Moretti writes that “a field this large cannot be understood by stitching together separate bits of knowledge about individual cases, because it isn’t a sum of individual cases: it’s a collective system, that should be grasped as a whole . . .” (2005). To generalize about a “period” of literature based on a study of a relatively small number of books is to take a significant leap. It is less problematic, though, to consider how a macroanalytic study of several thousand texts might lead us to a better understanding of the individual texts. Until recently, we have not had the opportunity to even consider this latter option, and it seems reasonable to imagine that we might, through the application of both approaches, reach a new and better informed understanding of our primary materials. This is what Juri Tynjanov imagined in 1927. “Strictly speaking,” writes Tynjanov, “one cannot study literary phenomena outside of their interrelationships.” Fortunately for me and for scholars such as Moretti, the multitude of interrelationships that overwhelmed and eluded Tynjanov and pushed the limits of close-reading can now be explored with the aid of computation, statistics, and huge digital libraries.

My book on this subject, Literary Studies, the Digital Library, and the Inevitability of Influence, is now under contract. [Update: It will be published in 2013 as Macroanalysis: Digital Methods and Literary History by the University of Illinois Press.]

[1] I began using the term macroanalysis in late 2003. At the time, Moretti and I were putting together plans for a co-taught course titled “Electronic Data and Literary Theory.” The course we imagined would be a research seminar in the full sense of the word and in our syllabus (dated 11/3/2003) we wrote: “the main purpose of this seminar is methodological rather than historical: learning how to use electronic search systems to analyze large quantities of data — and hence get a new, better understanding of literary and cultural history.” During the course I began work developing a text analysis toolkit that I later called CATools (for Corpus Analysis Tools). In terms of methodology, I was learning a lot at the time from work in corpus linguistics but also discovering that we (literary folks) have an entirely different set of questions. So it made sense to do at least a bit of wheel reinvention. My first experiments with the macroanalytic methodology were constructed around a corpus of Irish-American novels that I had been building since my dissertation research. I presented the first results of this work in Liverpool, at the 2004 meeting of the American Conference for Irish Studies. My paper, titled “Making and Mining a Digital Archive: the Case of the Irish-American West Project,” was part how-to and part results–I’d made one non-trivial discovery about Irish-American literary history based on this new methodology. In the spring of 2005, I offered a more detailed methodological overview of the toolkit at the inaugural meeting of the Text Analysis Developer’s Alliance. An overview of my project was documented on the TADA blog. Later that summer (2005), I presented a more generalized methodological paper titled “A Macro-Economic Model for Literary Research” at the joint meeting of the ACH and ALLC in Victoria, BC. 
It was there that I first articulated the economic analogy that I have come to find most useful for explaining Moretti’s idea of “distant-reading.” In 2006, while I was in residence as Research Scholar in the Digital Humanities at the Stanford Humanities Center, I spent a good deal of time thinking about macro-scale approaches to literature and then writing corpus analysis code. By the summer of 2007, I had developed a whole new toolkit and presented the first significant findings in a paper titled “Macro-Analysis (2.0),” which I delivered at the 2007 Digital Humanities meeting in Illinois. Coincidentally, this was the same conference at which Moretti presented the opening keynote lecture, a paper exploring a corpus of 19th-century novel titles, which would eventually be published in Critical Inquiry. That research utilized software that I had developed in the CATools package.