Matthew L. Jockers

Category Archives: Commentary

Plot Arcs (Schmidt Style)

05 Monday Jan 2015

Posted by Matthew Jockers in Commentary

A few weeks ago Ben Schmidt posted a provocative blog entry titled “Typical TV episodes: visualizing topics in screen time.” It’s worth a careful read. . .

Ben began by topic modeling the closed captioning data from a series of popular TV series and then visualizing the ten most common topics over the time span of each episode. In other words, the x-axis is time, and the y-axis is a measure of topical presence. The end result is something that begins to look a bit like what we could call plot.

Ben followed this post with an even more provocative one on 12/16/14, “Fundamental plot arcs, seen through multidimensional analysis of thousands of TV and movie scripts.” This post led a number of us (Underwood, Mimno, Cherny, etc.) to question what the approach might reveal if applied to novels . . .

In my own recent work, I have been attempting to model plot movement in narrative fiction by analyzing the rise and fall of emotional valence across narrative time. It has been clear to me, however, that my method is somewhat impoverished by a lack of context for the emotions I am measuring; Ben’s topic-based approach to plot structure might be just the context I’m missing, and some correlation analysis might be just the right recipe . . . as usual, Ben has given us a lot to think about, i.e., Happy Holidays!

After following the discussion on Twitter and on Ben’s blog, David Mimno wrote to me about whipping up some of these topical plot lines based on the 500 Topic model that I had built for Macroanalysis. Needless to say, I thought this was a great idea. (David and I had previously revisited my topical data for an article in Poetics.) Within a few hours, David had run the entire collection of 500 topics and produced 500 graphs showing the general behavior of each topic across all of the 3,500 texts in my corpus. You will find the output of David’s work here: http://mimno.infosci.cornell.edu/novels/plot.html

In David’s short introductory paragraph, he calls our attention to two specific topic graphs, one for the topic labeled “school” and another labeled “punishment.” You can find my graphs for these two topics here (school) and here (punishment). In referencing these two plots, David calls our attention to one topic (school) that appears prominently at the beginnings of novels in this corpus (think Bildungsroman, perhaps?) and another topic (punishment) that tends to be prominent at the end of novels (think Newgate novels or Oliver Twist, perhaps?).
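To make the idea of a topic's "presence across narrative time" concrete, here is a minimal sketch in Python. It is not David's or Ben's actual code. It assumes each novel has already been split into chunks in reading order and that a topic model has assigned each chunk a proportion for every topic. The function name `topic_over_time` and the numbers in `chunks` are hypothetical, chosen only to mimic a "school" topic that fades and a "punishment" topic that rises.

```python
def topic_over_time(chunks, topic, bins=10):
    """Average one topic's weight within equal slices of narrative time.

    chunks: list of dicts mapping topic label -> proportion, in reading order.
    Returns one average per bin (None for a bin that received no chunks).
    """
    totals = [0.0] * bins
    counts = [0] * bins
    n = len(chunks)
    for i, weights in enumerate(chunks):
        # Map the chunk's position in the text onto a bin index.
        b = min(bins - 1, i * bins // n)
        totals[b] += weights.get(topic, 0.0)
        counts[b] += 1
    return [t / c if c else None for t, c in zip(totals, counts)]


# Hypothetical topic proportions for a four-chunk "novel" (made-up numbers).
chunks = [
    {"school": 0.4, "punishment": 0.0},
    {"school": 0.2, "punishment": 0.1},
    {"school": 0.1, "punishment": 0.3},
    {"school": 0.0, "punishment": 0.5},
]

for label in ("school", "punishment"):
    halves = topic_over_time(chunks, label, bins=2)
    print(label, [round(v, 2) for v in halves])
```

Averaging the per-bin values across every text in a corpus would give one line per topic, which is roughly the kind of graph described above.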

Like the data from Ben, this data David has mined from my 19th-century novels topic model is incredibly rich and demands deeper inspection. I’ve only begun to digest it in bits, but I do observe that a lot of topics carrying negative valence seem to rise over the course of narrative time. This makes intuitive sense if we believe that the central conflict of a novel must grow more intense as the novel progresses. The exciting thing to do next is to move from the macro to the micro scale and look at the individual novels within this larger context. Perhaps we’ll be able to identify archetypal patterns and then observe which novels stick to the archetypes and which digress. . . what a feast!

Luckily we have a whole new year to indulge!

NHC Summer Institutes in Digital Humanities

09 Tuesday Dec 2014

Posted by Matthew Jockers in Commentary

I’m pleased to announce that Willard McCarty and I are leading a two-year set of summer institutes in digital humanities at the National Humanities Center. Here is the official announcement:

“The first of the National Humanities Center’s summer institutes in digital humanities, devoted to digital textual studies, will convene for two one-week sessions, first in June 2015 and again in 2016. The objective of the Institute in Digital Textual Studies is to develop participants’ technological and scholarly imaginations and to combine them into a powerful investigative instrument. Led by Willard McCarty (King’s College London and University of Western Sydney) and Matthew Jockers (University of Nebraska), the Institute aims to further the development of individual as well as collaborative projects in literary and textual studies. The Institute will take place in Chapel Hill, North Carolina, in 2015 and at the National Humanities Center in Research Triangle Park, North Carolina, in 2016.”

The first workshop will take place June 8 – 12. Applications are now open. See http://nationalhumanitiescenter.org/digital-humanities/application.html

NHC Flyer

Reading Macroanalysis: The Hard Way!

12 Thursday Jun 2014

Posted by Matthew Jockers in Commentary, Text-Mining

This past November, Judge Denny Chin ruled to dismiss the Authors Guild’s case against Google; the Guild vowed they would appeal the decision and two months ago their appeal was submitted. I’ll leave it to my legal colleagues to discuss the merit (or lack) in the Guild’s various arguments, but one thing I found curious was the Guild’s assertion that 78% of every book is available, for free, to visitors to the Google Books pages.

According to the Guild’s appeal:

Since 2005, Google has displayed verbatim text from copyrighted books on these pages. . . Google generally divides each page image into eighths, which it calls “snippets.”. . . Once a user retrieves a book through her initial search, she can enter any other search terms she chooses, and the author’s verbatim words will be displayed in three snippets for each search. Although Google has stated that any given search by a user “only” displays three snippets of each book, a single user can view far more than three snippets from a Library Project book by performing multiple searches using different terms, including terms suggested by Google. . . Even minor variations in search terms will yield different displays of text. . . Google displays snippets from each book, except that it withholds display of 10% of the pages in each book and of one snippet per page. . .Thus, Google makes the vast majority of the text of these books—in all, 78% of each work—available for display to its users.

I decided to test the Guild’s assertion, and what better book to use than my own: Macroanalysis: Digital Methods and Literary History.

In the “Preview,” Google displays the front matter (table of contents, acknowledgements, etc.) followed by the first 16 pages of my text. I consider this tempting pabulum for would-be readers and within the bounds of fair use, not to mention free advertising for me. The last sentence in the displayed preview is cut off; it ends as follows: “We have not yet seen the scaling of our scholarly questions in accordance with the massive scaling of digital content that is now. . . ” Thus ends page 16 and thus ends Google’s preview.

According to the Authors Guild, however, a visitor to this book page can access much more of the book by using a clever method of keyword searching. What the Guild does not tell us, however, is just how impractical and ridiculous such searching is. But that is my conclusion and I’m getting ahead of myself here. . .

To test the Guild’s assertion, I decided to read my book for free via Google Books. I began by reading the material just described above, the front matter and the first 16 pages (very exciting stuff, BTW). At the end of this last sentence, it is pretty easy to figure out what the next word would be; surely any reader of English could guess that the next word, after “. . .scaling of digital content that is now. . . ” would be the word “available.”

Just to be sure, though, I double-checked that I was guessing correctly by consulting the print copy of the book. Crap! The next word was not “available.” The full sentence reads as follows: “We have not yet seen the scaling of our scholarly questions in accordance with the massive scaling of digital content that is now held in twenty-first-century digital libraries.”

Now why is this mistake of mine important to note? Reading 78% of my book online, as the Guild asserts, requires that the reader anticipate what words will appear in the concealed sections of the book. When I entered the word “available” into the search field, I was hoping to get a snippet of text from the next page, a snippet that would allow me to read the rest of the sentence. But because I guessed wrong, I in fact got non-contiguous snippets from pages 77, 174, 72, 9, 56, 15, 37, 162, 8, 4, 80, 120, 154, 46, 133, 79, 27, 97, 147, and 17, in that order. These are all the pages in the book where I use the word “available” but none include the rest of the sentence I want to read. Ugh.

Fortunately, I have a copy of the full text on my desk. So I turn to page 17 and read the sentence. Aha! I now conduct a search for the word “held.” This search results in eight snippets; the last of these, as it happens, is the snippet I want from page 17. This new snippet contains the next 42 words. The snippet is in fact just the end of the incomplete sentence from page 16 followed by another incomplete sentence ending with the words: “but we have not yet fully articulated or explored the ways in which. . . ”

So here I have to admit that I’m the author of this book, and I have no idea what follows. I go back to my hard copy to find that the sentence ends as follows: “. . . these massive corpora offer new avenues for research and new ways of thinking about our literary subject.”

Without the full text by my side, I’d be hard pressed to come up with the right search terms to get the next snippet. Luckily I have the original text, so I enter the word “massive” hoping to get the next contiguous snippet. Six snippets are revealed; the last of these includes the sentence I was hoping to find and read. After the word “which,” I am rewarded with “these massive corpora offer new avenues for” and then the snippet ends! Crap, I really want to read this book for free!

So I think to myself, “what if, instead of trying to guess a keyword from the next sentence, I just use a keyword from the last part of the snippet?” “Avenues” seems like a good candidate, so I plug it in. Crap! The same snippet is shown again. Looks like I’m going to have to keep guessing. . .

Let’s see, “new avenues for. . . ” perhaps new avenues for “research”? (Ok, I’m cheating again by going back to the hard copy on my desk, but I think a savvy user determined to read this book for free might guess the word “research”). I plug it in. . . 38 snippets are returned! I scroll through them and find the one from page 17. The key snippet now includes the end of the sentence: “research and new ways of thinking about our literary subject.”

Now I’m making progress. Unfortunately, I have no idea what comes next. Not only is this the end of a sentence, but it looks like it might be the end of a paragraph. How to read the next sentence? I try the word “subject” and Google simply returns the same snippet again (along with assorted others from elsewhere in the book). So I cheat again and look at my copy of the book. I enter the word “extent” which appears in the next sentence. My cheating is rewarded and I get most of the next sentence: “To some extent, our thus-far limited use of digital content is a result of a disciplinary habit of thinking small: the traditionally minded scholar recognizes value in digital texts because they are individually searchable, but this same scholar, as a. . . ”

Thank goodness I have tenure and nothing better to do!

The next word is surely the word “result,” which I now dutifully enter into the search field. Among the 32 snippets that the search returns, I find my target snippet. I am rewarded with a copy of the exact same snippet I just saw with no additional words. Crap! I’m going to have to be even more clever if I’m going to game this system.

Back to my copy of the book I turn. The sentence continues “as a result of a traditional training,” so I enter the word “traditional,” and I’m rewarded with . . . the same damn passage again! I have already seen it twice, now thrice. My search for the term “traditional” returns a hit for “traditionally” in the passage I have already seen and, importantly, no hit for the instance of “traditional” that I know (from reading the copy of the book on my desk) appears in the next line. How about “training,” I wonder. Nothing! Clearly Google is on to me now. I get results for other instances of the word “training” but not for the one that I know appears in the continuation of the sentence I have already seen.

Well, this certainly is reading Macroanalysis the hard way. I’ve now spent 30 minutes to gain access to exactly 100 words beyond what was offered in the initial preview. And, of course, my method involved having access to the full text! Without the full text, I don’t think such a process of searching and reading is possible, and if it is possible, it is certainly not feasible!

But let’s assume that a super savvy text pirate, with extensive training in English language syntax could guess the right words to search and then perform at least as well as I did using a full text version of my book as a crutch. My book contains, roughly, 80,000 words. Not counting the ~5k offered in the preview, that leaves 75,000 words to steal. At a rate of 200 words per hour, it would take this super savvy text pirate 375 hours to reconstruct my book. That’s about 47 days of full-time, eight-hour work.

I get it. Times are tough and some folks simply need to steal books from snippet view because they can’t afford to buy them. I’m sympathetic to these folks; they need to satisfy their intense passion for reading and knowledge and who could blame them? Then again, if we consider the opportunity cost at $7.25 per hour (the current minimum wage), then stealing this book from snippet view would cost a savvy text pirate $2,718.75 in lost wages (375 hours at $7.25 per hour). The eBook version of my text, linked to from the Google Books web page, sells for $14.95. Hmmm?
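For anyone who wants to check the back-of-the-envelope arithmetic, here is a small sketch. The word counts, reading rate, and wage are the rough figures given in the post; the function name `piracy_cost` and the eight-hour working day default are just for illustration.

```python
def piracy_cost(total_words=80_000, preview_words=5_000,
                words_per_hour=200, hourly_wage=7.25, hours_per_day=8):
    """Return (hours, working days, lost wages) for reconstructing the book."""
    remaining = total_words - preview_words   # words not shown in the preview
    hours = remaining / words_per_hour        # time spent searching snippets
    days = hours / hours_per_day              # full-time working days
    lost_wages = hours * hourly_wage          # opportunity cost at minimum wage
    return hours, days, lost_wages


hours, days, lost_wages = piracy_cost()
print(f"{hours:.0f} hours, about {days:.0f} days, ${lost_wages:,.2f} in lost wages")
# prints: 375 hours, about 47 days, $2,718.75 in lost wages
```

Changing `words_per_hour` shows how sensitive the estimate is to the reading rate: even at ten times the speed, the lost wages would still far exceed the price of the eBook.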

A Novel Method for Detecting Plot

05 Thursday Jun 2014

Posted by Matthew Jockers in Commentary, Text-Mining

While studying anthropology at the University of Chicago, Kurt Vonnegut proposed writing a master’s thesis on the shape of narratives. He argued that “the fundamental idea is that stories have shapes which can be drawn on graph paper, and that the shape of a given society’s stories is at least as interesting as the shape of its pots or spearheads.” The idea was rejected.

In 2011, Open Culture featured a video in which Vonnegut expanded on this idea and suggested that computers might someday be able to model the shape of stories, that is, the movement of the narratives, the plots. The video is about four minutes long; it’s worth watching.

About the same time that I discovered this video, I was working on a project in which I was applying the tools and techniques of sentiment analysis to works of fiction.[1] Initially I was interested in tracing the evolution of emotional content in novels over the course of the 19th century. By accident I discovered that the sentiment I was detecting and measuring in the fiction could be used as a highly accurate proxy for plot movement.

Joyce’s Portrait of the Artist as a Young Man is a story that I know fairly well. Once upon a time a moo cow came down along the road. . .and so on . . .

Here is the shape of Portrait of the Artist as a Young Man that my computer drew based on an analysis of the sentiment markers in the text:

[Figure: computed plot shape of Portrait of the Artist as a Young Man]

If you are familiar with the plot, you’ll readily see that the computer’s version of the story is accurate. As it happens, I was teaching Portrait last fall, so I projected this image onto the white board and asked my students to annotate it. Here are a few of the high (and low) points that we identified.

[Figure: the same plot shape of Portrait, annotated with the high and low points identified in class]

Because the x-axis represents the progress of the narrative as a percentage, it is easy to move from the graph to the actual pages in the text, regardless of the edition one happens to be using. That’s precisely what we did in the class. We matched our human reading of the book with the points on the graph on a page-by-page basis.

Here is a graph from another Irish novel that you might know; this is Wilde’s Picture of Dorian Gray.

[Figure: computed plot shape of The Picture of Dorian Gray]

If you remember the story, you’ll see how well this plot line models the movement of the story. Discovering the accuracy of these graphs was quite thrilling.

This next image shows Dan Brown’s blockbuster novel The Da Vinci Code. Notice how much more regular the fluctuations are. This is the profile of a page turner. Notice too how the more generalized blue trend line hovers above neutral in terms of its emotional valence. Dan Brown never lets the plot become too troubled or too much of a downer. He baits us and teases us with fluctuating emotion.

[Figure: computed plot shape of The Da Vinci Code, with blue trend line]

Now compare Da Vinci Code to one of my favorite contemporary novels, Cormac McCarthy’s Blood Meridian. Blood Meridian is a dark book and the more generalized blue trend line lingers in the realms of negative emotion throughout the text; it is a very different book from The Da Vinci Code.[2]

[Figure: computed plot shape of Blood Meridian, with blue trend line]

I won’t get into the precise details of how I am measuring emotional valence in these books here.[3] It’s a bit too complicated for an already too long blog post. I will note, however, that the process involves two major components: a controlled vocabulary of positive and negative sentiment markers collected by Bing Liu of the University of Illinois at Chicago and a machine model that I trained to identify and score passages as positive or negative.
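To give a rough sense of what a lexicon-based valence curve involves, here is a deliberately simplified sketch. It is not the method used for the graphs above: it has no trained model, the `POSITIVE` and `NEGATIVE` sets are tiny placeholder lists rather than Bing Liu's actual vocabulary, and the moving-average smoothing and the function names are assumptions made purely for illustration. What it does show is the two basic moves: score each sentence by its sentiment markers, then place each score at its position in the narrative expressed as a percentage.

```python
import re

# Toy placeholder lexicon; NOT Bing Liu's actual word lists.
POSITIVE = {"happy", "love", "joy", "good", "delight"}
NEGATIVE = {"sad", "fear", "death", "bad", "pain"}


def sentence_valence(sentence):
    """Count positive markers minus negative markers in one sentence."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def plot_shape(text, window=3):
    """Return (percent_of_narrative, smoothed_valence) pairs, one per sentence."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    scores = [sentence_valence(s) for s in sentences]
    n = len(scores)
    shape = []
    for i in range(n):
        # Centered moving average, truncated at the edges of the text.
        lo = max(0, i - window // 2)
        hi = min(n, i + window // 2 + 1)
        smoothed = sum(scores[lo:hi]) / (hi - lo)
        percent = 100 * (i + 1) / n
        shape.append((percent, smoothed))
    return shape


sample = ("They were happy and full of joy. Then came fear and pain. "
          "Death followed. In the end love was good.")
for percent, valence in plot_shape(sample):
    print(f"{percent:5.1f}%  {valence:+.2f}")
```

Because the x-axis is a percentage rather than a page or sentence number, curves from books of very different lengths can be laid on the same scale.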

In a follow-up post, I’ll describe how I normalized the plot shapes in 40,000 novels in order to compare the shapes and discover what appear to be six archetypal plots!

NOTES:
[1] In the field of natural language processing there is an area of research known as sentiment analysis or, sometimes, opinion mining. And when our colleagues engage in this kind of work, they very often focus their study on a highly stylized genre of non-fiction: the review, specifically movie reviews and product reviews. The idea behind this work is to develop computational methods for detecting what we, literary folk, might call mood, or tone, or sentiment, or perhaps even refer to as affect. The psychologists prefer the word valence, and valence seems most appropriate to this research of mine because the psychologists also like to measure degrees of positive and negative valence. I am not aware of anyone working in sentiment analysis who is specifically interested in modeling emotional valence in fiction. In fact, the great majority of work in this field is so far removed from what we care about in literary studies that I spent about six months simply wondering whether or not the methods developed by folks trying to gauge opinions in movie reviews could even be usefully employed in studies of literature.
[2] I gained access to some of these novels through a data transfer agreement made between the University of Nebraska and a private company that is no longer in business. See Unfolding the Novel.
[3] I’m working on a longer and more formal version of this research report for publication. The longer version will include all the details of the methodology. Stay Tuned:-)

So What?

07 Wednesday May 2014

Posted by Matthew Jockers in Commentary

Over the past few days, several people have written to ask what I thought about the article by Adam Kirsch in the New Republic (“Technology Is Taking Over English Departments: The false promise of the digital humanities”). In short, I think it lacks insight and new knowledge. But, of course, that is precisely the complaint that Kirsch levels against the digital humanities. . .

Several months ago, I was interviewed for a story about topic modeling to appear in the web publication Nautilus. The journalist, Dana Mackenzie, wanted to dive into the “so what” question and ask how my quantitative and empirical methods were being received by literary scholars and other humanists. He asked the question bluntly because he’d read the Stanley Fish blog in the NYT and knew already that there was some pushback from the more traditional among us. But honestly, this is not a question I spend much time thinking about, so I referred Dana to my UNL colleague Steve Ramsay and to Matthew Kirschenbaum at the University of Maryland. They have each addressed this question formally and are far more eloquent on the subject than I am.

What matters to me, and I think what should matter to most of us is the work itself, and I believe, perhaps naively, that the value of the work is, or should be, self-evident. The answer to the question of “so what?” should be obvious. Unfortunately, it is not always obvious, especially to readers like Kirsch who are not working in the sub fields of this massive big tent we have come to call “digital humanities” (and for the record, I do despise that term for its lack of specificity). Kirsch and others inevitably gravitate to the most easily accessible and generalized resources often avoiding or missing some of the best work in the field.

“So what?” is, of course, the more informal and less tactful way of asking what one sometimes hears (or might wish to hear) asked after an academic paper given at the Digital Humanities conference, e.g. “I was struck by your use of latent Dirichlet allocation, but where is the new knowledge gained from your analysis?”

But questions such as this are not specific to digital humanities (“I was struck by your use of Derrida, but where is the new knowledge gained from your analysis?”). In a famous essay, Claude Levi-Strauss asked “so what” after reading Vladimir Propp’s Morphology of the Folktale. If I understand Levi-Strauss correctly, the beef with Propp is that he never gets beyond the model; Propp fails to answer the “so what” question. To his credit, Levi-Strauss gives Propp props for revealing the formal model of the folktale when he writes that: “Before the epoch of formalism we were indeed unaware of what these tales had in common.”

But then, in the very next sentence, Levi-Strauss complains that Propp’s model fails to account for content and context, and so we are “deprived of any means of understanding how they differ.”

“The error of formalism” Levi-Strauss writes, is “the belief that grammar can be tackled at once and vocabulary later.” In short, the critique of Propp is just simply that Propp did not move beyond observation of what is and into interpretation of what that thing that is, means (Propp 1984).

To be fair, I think that Levi-Strauss gave Propp some credit and took Propp’s work as a foundation upon which to build more nuanced layers of meaning. Propp identified a finite set of 31 functions that could be identified across narratives; Levi-Strauss wished to say something about narratives within their cultural and historical context. . .

This is, I suppose, the difference between discovering DNA and making DNA useful. But bear in mind that the one ever depends upon the other. Leslie Pray writes about the history of DNA in a Nature article from 2008:

Many people believe that American biologist James Watson and English physicist Francis Crick discovered DNA in the 1950s. In reality, this is not the case. Rather, DNA was first identified in the late 1860s by a Swiss chemist. . . and other scientists . . . carried out . . . research . . . that revealed additional details about the DNA molecule . . . Without the scientific foundation provided by these pioneers, Watson and Crick may never have reached their groundbreaking conclusion of 1953.

(Pray 2008)

I suppose I take exception to the idea that the kind of work I am engaged in, because it is quantitative and methodological, because it seeks first to define what is, and only then to describe why that which is matters, must meet some additional criteria of relevance.

There is often a double standard at work here. The use of numbers (computers, quantification, etc.) in literary studies often triggers a knee jerk reaction. When the numbers come out, the gloves come off.

When discussing my work, I am sometimes asked whether the methods and approaches I advocate and employ succeed in bringing new knowledge to our study of literature. My answer is a firm and resounding “yes.” At the same time, I need to emphasize that computational work in the humanities can be simply about testing, rejecting, or reconfirming, what we think we already know. And I think that is a good thing!

During a lecture about macro-patterns of literary style in the 19th century novel, I used the example of Moby Dick. I reported how in terms of style and theme Moby Dick is a statistical mutant among a corpus of 1000 other 19th century American novels. A colleague raised his hand and pointed out that literary scholars already know that Moby Dick is an aberration. Why bother computing a new answer to a question for which we already have an answer?

My colleague’s question says something about our scholarly traditions in the humanities. It is not the sort of question that one would ask a physicist after a lecture confirming the existence of the Higgs Boson! It is, at the same time, an ironic question; we humanists have tended to favor a notion that literary arguments are never closed!

In other words, do we really know that Moby Dick is an aberration? Could a skillful scholar/humanist/rhetorician argue the counter point? I think that the answer to the first question is “no” and the second is “yes.” Maybe Moby Dick is only an outlier in comparison to the other twenty or thirty American novels that we have traditionally studied alongside Moby Dick?

My point in using Moby Dick was not to pretend that I had discovered something new about the position of the novel in the American literary tradition, but rather to bring new evidence and a new perspective to the matter and in this case fortify the existing hypothesis.

If quantitative evidence happens to confirm what we have come to believe using far more qualitative methods, I think that new evidence should be viewed as a good thing. If the latest Mars rover returns more evidence that the planet could have once supported life, that new evidence would be important and welcomed. True, it would not be as shocking or exciting as the first discovery of microbes on Mars, or the first discovery of ice on Mars, but it would be viewed as important evidence nevertheless, and it would add one more piece to a larger puzzle. Why should a discussion of Moby Dick’s place in literary history be any different?

In short, computational approaches to literary study can provide complementary evidence, and I think that is a good thing.

Computational approaches can also provide contradictory evidence, evidence that challenges our traditional, impressionistic, or anecdotal theories.

In 1990 my dissertation adviser, Charles Fanning, published an excellent book titled The Irish Voice in America. It remains the definitive text in the field. In that book he argued for what he called a “lost generation” of Irish-American writers in the period from 1900 to 1930. His research suggested that Irish-American writing in this period declined, and so he formed a theory about this lost generation and argued that adverse social forces led Irish-Americans away from writing about the Irish experience.

In 2004, I presented new evidence about this period in Irish-American literary history. It was quantitative evidence showing not just why Fanning had observed what he had observed but also why his generalizations from those observations were problematic. Charlie was in the audience that day and after my lecture he came up to say hello. It was an awkward moment, but to my delight, Charlie smiled and said, “it was just an idea.” His social theory was his best guess given the evidence available in 1990, and he understood that.

My point is to say that in this case, computational and quantitative methods provided an opportunity for falsification. But just because such methods can provide contradiction or falsification, we must not get caught up in a numbers game where we only value the testable ideas. Some problems lend themselves to computational or quantitative testing; others do not, and I think that is a fine thing. There is a lot of room under the big tent we call the humanities.

And finally, these methods I find useful to employ can lead to genuinely new discoveries. Computational text analysis has a way of bringing into our field of view certain details and qualities of texts that we would miss with just the naked eye (as John Burrows and Julia Flanders have made clear). I like to think that the “Analysis” section of Macroanalysis offers a few such discoveries, but maybe Mr. Kirsch already knew all that? For a much simpler example, consider Patrick Juola’s recent discovery that J. K. Rowling was the author of The Cuckoo’s Calling, a book Rowling wrote under the pseudonym Robert Galbraith. I think Juola’s discovery is a very good thing, and it is not something that we already knew. I could cite a number of similar examples from research in stylometry, but this example happens to be accessible and appealing to a wide range of non-specialists: just the sort of simple folk I assume Kirsch is attempting to persuade in his polemic against the digital humanities.

Works Cited:

Propp, Vladimir. Theory and History of the Folktale. Trans. Ariadna Y. Martin and Richard Martin. Edited by Anatoly Liberman. University of Minnesota Press, 1984. 180

Pray, L. (2008) Discovery of DNA structure and function: Watson and Crick. Nature

Characterization in Literature and the Macroanalysis Lab

08 Wednesday Jan 2014

Posted by Matthew Jockers in Commentary

I have just posted the syllabus for my spring macroanalysis class focusing on Characterization in Literature. The class is experimental in many senses of the word. We will be experimenting in the class and the class will be an experiment. If all goes according to plan, the only thing about this class that will be different from a research lab is the grade I have to assign at the end—that is the one remaining bit about collaborative learning that still kicks me . . .

To be successful everyone is going to have to be high-performing and self-motivated, me included. For me, at least, the motivation comes from what I think is a really tough nut to crack: algorithmic detection and analysis of character and character types. So far the work in this area has been largely about character networks: how is Hamlet related to Gertrude, etc. That’s good work, but it depends heavily upon the human coding of character metadata before processing. That is precisely why our early experiments at the Stanford Literary Lab focused on Drama. . . the character names are already explicit in the speaker markup. Beyond drama, there have been some important steps taken in the direction of auto-detection of character in fiction, such as those by Graham Sack and Elson et al., but I think we still have a lot more stepping to do, a whole lot more.

The work I envision for the course will include leveraging obvious tools such as those for named entity recognition and then thinking through and dealing with the more complicated problems of pronoun disambiguation. But my deeper interest here goes far beyond simple detection of entities. The holy grail that I see here lies not in detecting the presence or absence of individual characters but in detecting and tracking character archetypes on a grand macroscale. What if we could begin to answer questions such as these:

  • Are there different classes of villains in the 19th century novel?
  • Do we see a rise in the number of minor characters over the 20th century?
  • What are the qualities that define heroines?
  • How, if at all, do those qualities change/evolve over time? (think Jane Austen’s Emma vs. Stieg Larsson’s Lisbeth).
  • Etc.

We may get nowhere; we may fail miserably. (Of course if I did not already have a couple of pretty good ideas for how to get at these questions I would not be bothering. . . but that, for now, is the secret sauce 😉 )

At the more practical, “skills” level, I’m requiring students to learn and submit all their work using LaTeX! (This may prove to be controversial or crazy–I only learned LaTeX six months ago.) For that they will also be learning how to use the knitr package for R in order to embed R code directly into the LaTeX, and all of this work will take place inside the (awesome) R IDE, RStudio. Hold on to your hats; it’s going to be a wild ride!

Obi Wan McCarty

19 Friday Jul 2013

Posted by Matthew Jockers in Commentary

≈ 1 Comment

[Below is the text of my introduction of Willard McCarty, winner of the 2013 Busa Award.]

As the chair of the awards committee that selected Prof. McCarty for this award it is my pleasure to offer a few words of introduction.

I’m going to go out on a limb this afternoon and assume that you already know that Willard McCarty is Professor of Humanities Computing and Director of the Doctoral Program in the Department of Digital Humanities at King’s College London, and that he is Professor in the Digital Humanities Research Group, University of Western Sydney and that he is a Fellow of the Royal Anthropological Institute (London). I’ll assume that you already know that he is Editor of the British journal, Interdisciplinary Science Reviews and that he’s founding Editor of the online seminar Humanist. And I am sure you know that Willard is recipient of the Canadian Award for Outstanding Achievement in Computing in the Arts and Humanities, and of the prestigious Richard W. Lyman Award of the National Humanities Center. You have probably already read his 2005 book titled Humanities Computing, and you know of his many, many other writings and musings.

So I’m not going to talk about any of that stuff.

And since I’m sure that everyone here knows that the Roberto Busa Award was established in 1998, I’m not going to explain how the Busa award was set up to recognize outstanding lifetime achievement in the application of information and communications technologies to humanities research.

No, I’m not going to say anything about that either.

Instead, I wish to say a few words about this fellow here. [Image: screenshot]

This is Obi-Wan McCarty. Long before I met him in person, he had become a virtual friend, model, and mentor.

I began computing in the humanities in 1993, and like so many of us in those early days I was a young maverick with little or no idea what had been done before. Those were the days before the rebellion, when the dark forces of the Empire were still quite strong. It was a time when an English major with a laptop was considered a dangerous rebel. At times I was scared, and I felt alone in a dark side of a galaxy far, far away.

And then somewhere between 1993 and 2001 I began to sense a force in the galaxy.

One day, in early 2001, I was walking with my friend Glen Worthey, and I mentioned how I had recently discovered the Humanist list and how there had been this message posted by Willard McCarty with the cryptic subject line “14.”

“Ah yes,” Glen said, “Obi-Wan McCarty. The force is strong with him.”

Message 14 from Obi-Wan was a birthday message. Humanist was 14 that day and Willard began his message with a reflection on “repetition” and how frequently newcomers to the list would ask questions that had already been asked. Rather than chastise those newbies, and tell them to go STFA (search the freakin’ archive), Willard encouraged them. He wrote in that message of how “repetition is a means of maintaining group memory.” I was encouraged by those words and by Willard’s ongoing and relentless commitment not simply to deep, thoughtful, and challenging scholarship, but to nurturing, teaching, welcoming, and mentoring each new generation.

So Willard, thank you for your personal mentorship, thank you for continuing to demonstrate that scholarly excellence and generosity are kindred spirits. Congratulations on this award. May the force be with you.

“A Matter of Scale”

28 Thursday Mar 2013

Posted by Matthew Jockers in Commentary

≈ Comments Off on “A Matter of Scale”

Back in November, Julia Flanders and I were invited to stage a debate on the matter of “scale” in digital humanities research for the “Boston Area Days of DH” conference keynote: Julia was to represent the micro scale and I the macro.

Julia and I met up during the MLA conference in January and began sketching out how the talk might go. The first thing we discovered, of course, is that we did not in fact have a real difference of opinion on this matter of scale. Big data, small data, close reading and distant . . . these things matter much less than what a scholar actually decides to do and say. In other words, we were both ultimately interested in new knowledge and not too much concerned with the level of scale necessary to derive that new knowledge.

In other words, it’s a false and probably irrelevant debate. And while we agreed on this point in general terms, we discovered in the course of composing and editing the script for our mock debate that there were legitimate nuances that deserved to be put into the light of day. The script from our “debate” and all of the slides are now available via UNL’s open access repository as “A Matter of Scale.”

Julia has posted a few comments on the experience of co-authoring this presentation with me on her blog. Check it out at http://juliaflanders.wordpress.com/2013/03/28/a-matter-of-scale/.

Thoughts on a Literary Lab

04 Friday Jan 2013

Posted by Matthew Jockers in Commentary

≈ 2 Comments

[For the “Theories and Practices of the Literary Lab” roundtable at MLA yesterday, panelists were asked to speak for 5 minutes about their vision of a literary lab. Here are my remarks from that session–#147]

I take the descriptor “literary lab” literally, and to help explain my vision of a literary lab I want to describe how the Stanford Literary Lab that I founded with Franco Moretti came into being.

The Stanford Lab was born out of a class that I taught in the fall of 2009. In that course I assigned 1200 novels and challenged students to explore ways of reading, interpreting, and understanding literature at the macro-scale, as an aggregate system. Writing about the course and the lab that evolved from the course, Chronicle of Higher Ed reporter Marc Parry described it as being based on: “a controversial vision for changing a field still steeped in individual readers’ careful analyses of texts.” That may be how it looks from the outside, but there was no radical agenda then and no radical agenda today.

In the class, I asked the students to form into two research teams and to construct research projects around this corpus of 1200 novels. One group chose to investigate whether novel serialization in the 19th century had a detectable/measurable effect upon novelistic style. The other group pursued a project dealing with lexical change over the century, and they wrote a program called “the correlator” that was used to observe and measure semantic change.

After the class ended, two students, one from each group, asked to continue their work as independent study; I agreed. Over the Christmas holiday, word spread to the other students from the seminar and by the New Year 13 of the original 14 in the seminar wanted to keep working. Instead of 13 independent studies, we formed an ad-hoc seminar group, and I found an empty office on the 4th floor where we began meeting, sometimes for several hours a day. We began calling this ugly, windowless room the lab.

Several of the students in my fall class were also in a class with Franco Moretti and the crossover in terms of subject matter and methodology was fairly obvious. As the research deepened and became more nuanced, Franco began joining us for lab sessions and over the next few months other faculty and grad students were sucked into this evolving vortext. It was a very exciting time.

At some point, Franco and I (and perhaps a few of the students) began having conversations about formalizing this notion of a literary lab. I think at the time our motivation had more to do with the need to lobby for space and resources than anything else. As the projects grew and gained more steam, the room got smaller and smaller.

I mention all of this because I do not believe in the “if we build it they will come” notion of digital humanities labs. While it is true that they may come if we build them, it is also true, and I have seen this first hand, that they may come with absolutely no idea of what to do.

First and foremost a lab needs a real and specific research agenda. “Enabling Digital Humanities projects” is not a research agenda for a lab. Advancing or enabling digital humanities oriented research is an appropriate mission for a Center, such as our Center for Digital Humanities Research at Nebraska, but it is not the function of a lab, at least not in the limited literal sense that I imagine it. For me, a lab is not specifically an idea generator; a lab is a place in which ideas move from birth to maturation.

It would be incredible hyperbole to say that we formally articulated any of this in advance. Our lab was the opposite of premeditated. We did, however, have a loosely expressed set of core principles. We agreed that:

1. Our work would be narrowly focused on literary research of a quantitative nature.
2. All research would be collaborative, even when the outcome ends up having a single author.
3. All research would take the form of “experiments,” and we would be open to the possibilities of failure; indeed, we would see failure as new knowledge.
4. The lab would be open to students and faculty at all levels–and, on a more ad hoc basis, to students and faculty from other institutions.
5. In internal and external presentation and publication, we would favor the narrative genre of “lab reports” and attempt to show not only where we arrived, but how we got there.

I continue to believe that these were and are the right principles for a lab even while they conflict with much about the way Universities are organized.

In our lab we discovered that to focus, to really focus on the work, we had to resist and even reject some of the established standards of pedagogy, of academic hierarchy, and of publishing convention. We discovered that we needed to remove instructional barriers both internal and external in order to find and attract the right people and the right expertise. We did not do any of this in order to make a statement. We were not academic radicals bent on defying the establishment.

Nor should I leave you with the impression that we figured anything out. The lab remains an organic entity unified by what some might characterize as a monomaniacal focus on literary research. If there was any genius to what we did, it was in the decision to never compromise our focus, to do whatever was necessary to keep our focus on the literature.

Some Advice for DH Newbies

03 Thursday Jan 2013

Posted by Matthew Jockers in Commentary

≈ 1 Comment

In preparation for a panel session at DH Commons today, I was asked to consider the question: “What one step would you recommend a newcomer to DH take in order to join current conversations in the field?” and then speak for 3 – 4 minutes. Below is the 5-minute version of my answer. . .

With all the folks assembled here today, I figured we’d get some pretty good advice about what constitutes DH and how to get started, so I decided that I ought to say something different from what I’d expect others to say. I have two specific bits of advice, and I suppose that the second bit will be a little more controversial.

But let me foreground that by going back to 2011 when my colleague Glen Worthey and I organized the annual Digital Humanities conference at Stanford around a big tent, summer of love theme. We flung open the flaps on the Big Tent and said come on in . . . We believed, and we continue to believe, that there is a wide range of very good and very interesting work being done in “digital humanities.” We felt that we needed a big tent to enclose all that good work. But let’s face it, inside the big tent it’s a freakin’ three ring circus. Some folks like clowns and others want to see the jugglers. The DH conference is not like a conference on Victorian Literature. And that, of course, is the charm and the curse.

While it probably makes sense for a newcomer to poke around and gain some sense of the “disciplinary” history of the “field,” I think the best advice I can give a newcomer is to spend very little time thinking about what DH is and spend as much time as possible doing DH.

It doesn’t really matter if the world looks at your research and says of it: “Ahhhh, that’s some good Digital Humanities, man.” What matters, of course, is if the world looks at it and says, “Holy cow, I never thought of Jane Austen in those terms” or “Wow, this is really strong evidence that the development of Roman road networks was entirely dependent upon seasonal shifts.” The bottom line is that it is the work you do that is important, not how it gets defined.

So I suppose that is a bit of advice for newcomers, but let me answer the question more concretely and more controversially by speaking as someone who hangs out in one particular ring of the DH Big Tent.

If you understand what I have said thus far, then you know that it is impossible to speak for the Digital Humanities as a group, so, for some, what I am going to say is going to sound controversial. And if I hear that one of you newcomers ran out at the end of this session yelling “Jockers thinks I need to learn a programming language to be a digital humanist,” then I’m going to have to kick your butt right out of the big tent!

Learning a programming language, though, is precisely what I am going to recommend. I’m even going to go a bit further and suggest a specific language called R.

By recommending that you learn R, I am also advocating learning some statistics. R is primarily a language used for statistical computing, which is more or less the flavor of Digital Humanities that I practice. If you want to be able to read and understand the work that we do in this particular ring of the big tent you will need some understanding of statistics; if you want to be able to replicate and expand upon this kind of work, you are going to need to know a programming language, so I recommend learning some R and killing two birds with one stone.

And for those of you who don’t get turned on by p-values, for loops, and latent Dirichlet allocation, I think learning a programming language is still in your best interests. Even if you never write a single line of code, knowing a programming language will allow you to talk to the natives, that is, you will be able to converse with the non-humanities programmers and web masters and DBAs and systems administrators, who we so often collaborate with as digital humanists. Whether or not you program yourself, you will need to translate your humanistic questions into terms that a non-specialist in the humanities will understand. You may never write poetry in Italian, but if you are going to travel in Rome, you should at least know how to ask for directions to the Colosseum.


♣ Contact


Matthew L. Jockers

ORCID iD: https://orcid.org/0000-0001-5599-3706

Twitter: @mljockers
Amazon Author Profile
Goodreads Author Profile

♣ On Quantification:

“. . . everything . . . in nature’s vast workshop from the extinction of some remote sun to the blossoming of one of the countless flowers which beautify our public parks is subject to a law of numeration as yet unascertained.” (Joyce, Ulysses, 1922)

This work by Matthew Jockers is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
