
Matthew L. Jockers

Category Archives: Text-Mining

A Novel Method for Detecting Plot

05 Thursday Jun 2014

Posted by Matthew Jockers in Commentary, Text-Mining


While studying anthropology at the University of Chicago, Kurt Vonnegut proposed writing a master’s thesis on the shape of narratives. He argued that “the fundamental idea is that stories have shapes which can be drawn on graph paper, and that the shape of a given society’s stories is at least as interesting as the shape of its pots or spearheads.” The idea was rejected.

In 2011, Open Culture featured a video in which Vonnegut expanded on this idea and suggested that computers might someday be able to model the shape of stories, that is, the movement of the narratives, the plots. The video is about four minutes long; it’s worth watching.

About the same time that I discovered this video, I was working on a project in which I was applying the tools and techniques of sentiment analysis to works of fiction.[1] Initially I was interested in tracing the evolution of emotional content in novels over the course of the 19th century. By accident I discovered that the sentiment I was detecting and measuring in the fiction could be used as a highly accurate proxy for plot movement.

Joyce’s Portrait of the Artist as a Young Man is a story that I know fairly well. Once upon a time a moo cow came down along the road. . .and so on . . .

Here is the shape of Portrait of the Artist as a Young Man that my computer drew based on an analysis of the sentiment markers in the text:

[Figure: the shape of Portrait of the Artist as a Young Man, drawn from sentiment markers]

If you are familiar with the plot, you’ll readily see that the computer’s version of the story is accurate. As it happens, I was teaching Portrait last fall, so I projected this image onto the white board and asked my students to annotate it. Here are a few of the high (and low) points that we identified.

[Figure: the same graph, annotated with the high and low points identified in class]

Because the x-axis represents the progress of the narrative as a percentage, it is easy to move from the graph to the actual pages in the text, regardless of the edition one happens to be using. That’s precisely what we did in the class. We matched our human reading of the book with the points on the graph on a page-by-page basis.

Here is a graph from another Irish novel that you might know; this is Wilde’s Picture of Dorian Gray.

[Figure: sentiment-derived plot shape of The Picture of Dorian Gray]

If you remember the story, you’ll see how well this plot line models the movement of the story. Discovering the accuracy of these graphs was quite thrilling.

This next image shows Dan Brown’s blockbuster novel The Da Vinci Code. Notice how much more regular the fluctuations are. This is the profile of a page turner. Notice too how the more generalized blue trend line hovers above neutral in terms of its emotional valence. Dan Brown never lets the plot become too troubled or too much of a downer. He baits us and teases us with fluctuating emotion.

[Figure: plot shape of The Da Vinci Code, with trend line]

Now compare Da Vinci Code to one of my favorite contemporary novels, Cormac McCarthy’s Blood Meridian. Blood Meridian is a dark book and the more generalized blue trend line lingers in the realms of negative emotion throughout the text; it is a very different book from The Da Vinci Code.[2]

[Figure: plot shape of Blood Meridian, with trend line]

I won’t get into the precise details of how I am measuring emotional valence in these books here.[3] It’s a bit too complicated for an already too long blog post. I will note, however, that the process involves two major components: a controlled vocabulary of positive and negative sentiment markers collected by Bing Liu of the University of Illinois at Chicago and a machine model that I trained to identify and score passages as positive or negative.
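That said, the core intuition can be sketched in a few lines of R. This is only a toy illustration, not the trained model described above: the tiny word lists below stand in for the Liu lexicon, and an invented "novel" is scored in four equal segments.

```r
# Toy positive/negative word lists; stand-ins for Bing Liu's opinion lexicon
positive <- c("good", "happy", "love", "joy")
negative <- c("bad", "sad", "hate", "fear")

# score a word vector in n equal segments: (positives - negatives) / words
sentiment.trajectory <- function(words, n.segments = 10){
  seg <- ceiling(seq_along(words) / (length(words) / n.segments))
  seg[seg > n.segments] <- n.segments
  sapply(split(words, seg), function(w){
    (sum(w %in% positive) - sum(w %in% negative)) / length(w)
  })
}

# a "novel" that starts joyful and ends in fear
toy.novel <- c(rep("love", 50), rep("the", 100), rep("fear", 50))
sentiment.trajectory(toy.novel, n.segments = 4)  # segment scores: 1, 0, 0, -1
```

Plotting segment scores like these against narrative time (as a percentage) produces curves of the kind shown in the figures above.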

In a follow-up post, I’ll describe how I normalized the plot shapes in 40,000 novels in order to compare the shapes and discover what appear to be six archetypal plots!

NOTES:
[1] In the field of natural language processing there is an area of research known as sentiment analysis or, sometimes, opinion mining. When our colleagues engage in this kind of work, they very often focus their study on a highly stylized genre of non-fiction: the review, specifically movie reviews and product reviews. The idea behind this work is to develop computational methods for detecting what we literary folk might call mood, tone, or sentiment, or perhaps even affect. The psychologists prefer the word valence, and valence seems most appropriate to this research of mine because the psychologists also like to measure degrees of positive and negative valence. I am not aware of anyone working in sentiment analysis who is specifically interested in modeling emotional valence in fiction. In fact, the great majority of work in this field is so far removed from what we care about in literary studies that I spent about six months simply wondering whether the methods developed by folks trying to gauge opinions in movie reviews could even be usefully employed in studies of literature.
[2] I gained access to some of these novels through a data transfer agreement made between the University of Nebraska and a private company that is no longer in business. See Unfolding the Novel.
[3] I’m working on a longer and more formal version of this research report for publication. The longer version will include all the details of the methodology. Stay Tuned:-)

Text Analysis with R . . . coming soon.

21 Monday Apr 2014

Posted by Matthew Jockers in R-Code, Text-Mining, Tips and Code


My new book, Text Analysis with R for Students of Literature is due from Springer sometime in May. I got the cover proofs this week (below). Looking good:-)

[Figure: cover proof of Text Analysis with R for Students of Literature]

Simple Point of View Detection

06 Sunday Apr 2014

Posted by Matthew Jockers in R-Code, Text-Mining


[Note 4/6/14 @ 2:24 CST: oops, had a small error in the code and corrected it: the second if statement should have been "< 1.5", which made me think of a still simpler way to code the function as edited.]

[Note 4/6/14 @ 2:52 CST: After getting some feedback from Jonathan Goodwin about Ford's The Good Soldier, I added a slightly modified version of the function to the bottom of this post. The new function makes it easier to experiment by dialing in/out a threshold value for determining what the function labels as "first" vs. "third."]

In my Macroanalysis class this term, my students are studying character and characterization. As you might expect, the manner in which we/they analyze character varies depending upon whether the story is being told in the first or third person. Since we are working with a corpus of about 3500 books, it was not practical (at least in the time span of a single class) to hand code each book for its point of view (POV). So instead of opening each text and skimming it to determine its POV, I whipped up a super simple “POV detector” function in R.*

Below you will find the function and then three examples showing how to call the function using links to the Project Gutenberg versions of Feodor Dostoevsky’s first person novel, Notes from the Underground, Stoker’s epistolary Dracula, and Joyce’s third person narrative Portrait of the Artist as a Young Man.**

We have not done anything close to a careful analysis of the results of these predictions, but we have checked a semi-random selection of 30 novels from the full corpus of 3500. At least for these 30, the predictions are spot on. If you take this function for a test drive and discover texts that don't get correctly assigned, please let me know. This is a very "low-hanging fruit" approach, and the key variable (called "first.third.ratio") can easily be tuned. Of course, we might also consider a more sophisticated machine classification approach, but until we learn otherwise, this function seems to be doing quite well. Please test it out and let us know what you discover…

predict.pov.from.plain <- function(path.to.file){
  first.person <- c("i", "me", "my", "myself", "we", "us")
  third.person <- c("he", "she", "his", "hers", "him", "her", "himself", "herself")
  # read the file and tokenize into a vector of lowercase words
  text.lines <- scan(path.to.file, what = "character", sep = "\n")
  text.v <- paste(text.lines, collapse = " ")
  text.lower.v <- tolower(text.v)
  text.words.l <- strsplit(text.lower.v, "\\W")
  text.word.v <- unlist(text.words.l)
  not.blanks.v <- which(text.word.v != "")
  text.word.v <- text.word.v[not.blanks.v]
  # relative frequency of every word type in the text
  text.freqs.t <- table(text.word.v) / length(text.word.v)
  first.sum <- sum(text.freqs.t[first.person], na.rm = TRUE)
  third.sum <- sum(text.freqs.t[third.person], na.rm = TRUE)
  first.third.ratio <- third.sum / first.sum
  # texts dominated by third-person pronouns get labeled "third"
  if(first.third.ratio >= 1.5){
    pov <- "third"
  } else {
    pov <- "first"
  }
  return(pov)
}
 
# Now Some Examples:
# Notes from Underground, First Person POV
predict.pov.from.plain("http://www.gutenberg.org/cache/epub/600/pg600.txt")
 
# Dracula Epistolary, First Person POV
predict.pov.from.plain("http://www.gutenberg.org/cache/epub/345/pg345.txt")
 
# Portrait of the Artist as a Young Man, Third Person POV
predict.pov.from.plain("http://www.gutenberg.org/cache/epub/4217/pg4217.txt")

Here is a slightly revised version of the function that allows you to set a different "threshold" when calling the function. Jonathan Goodwin reported on Twitter that Ford's The Good Soldier was being reported as "third" person (which is wrong). Using this new version of the function, you can dial the threshold up or down until you find a sweet spot for a particular text (such as The Good Soldier) and then use that threshold to test on other texts.

predict.pov.from.plain.w.thresh<-function(path.to.file, threshold=1.5){
  first.person<-c("i", "me", "my", "myself", "we", "us")
  third.person<-c("he", "she", "his", "hers", "him", "her", "himself", "herself")
  text.lines <- scan(path.to.file, what="character", sep="\n")
  text.v <- paste(text.lines, collapse=" ")
  text.lower.v <- tolower(text.v)
  text.words.l <- strsplit(text.lower.v, "\\W")
  text.word.v <- unlist(text.words.l)
  not.blanks.v  <-  which(text.word.v!="")
  text.word.v <-  text.word.v[not.blanks.v]
  text.freqs.t <- table(text.word.v)/length(text.word.v)
  first.sum<-sum(text.freqs.t[first.person], na.rm = TRUE)
  third.sum<-sum(text.freqs.t[third.person], na.rm = TRUE)
  first.third.ratio<-third.sum/first.sum
  if(first.third.ratio >= threshold){
    pov<-"third"
  } else {
    pov<-"first"
  }
  return(pov)
}
# The Good Soldier is first-person POV, but if we set the threshold too low it gets assigned third person. This version of the function allows you to tune the threshold for experimenting with different texts.
# Using a threshold of 2.1, this book is assigned the proper POV
predict.pov.from.plain.w.thresh("http://www.gutenberg.org/cache/epub/4217/pg4217.txt", threshold=2.1)

*[Actually, I whipped up two functions: the one seen here and another one that takes Part of Speech (POS) tagged text as input. Both versions seem to work equally well but this one that takes plain text as input is easier to implement.]

**[Note that none of the Gutenberg boilerplate text is removed in this process. In the implementation I am using with my students, all metadata has already been removed.]

Experimenting with “gender” package in R

25 Tuesday Feb 2014

Posted by Matthew Jockers in R-Code, Text-Mining, Tips and Code


Yesterday afternoon, Lincoln Mullen and Cameron Blevins released a new R package that is designed to guess (infer) the gender of a name. In my class on literary characterization at the macroscale, students are working on a project that involves a computational study of character genders. . . needless to say, the ‘gender‘ package couldn’t have come at a better time. I’ve only begun to experiment with the package this morning, but so far it looks very promising.

It doesn’t do everything that we need, but it’s a great addition to our workflow. I’ve copied below some R code that uses the gender package in combination with some named entity recognition in order to try and extract character names and genders in a small sample of prose from Twain’s Tom Sawyer. I tried a few other text samples and discovered some significant challenges (e.g. Mrs. Dashwood), but these have more to do with last names and the difficult problems of accurate NER than anything to do with the gender package.

Anyhow, I’ve just begun to experiment, so no big conclusions here, just some starter code to get folks thinking. Hopefully others will take this idea and run with it!

library(gender)
library(openNLP)
require(NLP)
 
sent_token_annotator <- Maxent_Sent_Token_Annotator()
word_token_annotator <- Maxent_Word_Token_Annotator()
 
s <- as.String("Tom did play hookey, and he had a very good time. He got back home barely in season to help Jim, the small colored boy, saw next-day's wood and split the kindlings before supper—at least he was there in time to tell his adventures to Jim while Jim did three-fourths of the work. Tom's younger brother (or rather half-brother) Sid was already through with his part of the work (picking up chips), for he was a quiet boy, and had no adventurous, trouble-some ways. While Tom was eating his supper, and stealing sugar as opportunity offered, Aunt Polly asked him questions that were full of guile, and very deep—for she wanted to trap him into damaging revealments. Like many other simple-hearted souls, it was her pet vanity to believe she was endowed with a talent for dark and mysterious diplomacy, and she loved to contemplate her most transparent devices as marvels of low cunning.")
 
a2 <- annotate(s, list(sent_token_annotator, word_token_annotator))
entity_annotator <- Maxent_Entity_Annotator()
named.ents<-s[entity_annotator(s, a2)]
named.ents.l <- strsplit(named.ents, "\\W")
named.ents.v <- unlist(named.ents.l)
not.blanks.v  <-  which(named.ents.v!="")
named.ents.v <-  named.ents.v[not.blanks.v]
gender(tolower(named.ents.v))

And here is the output:

   name gender proportion_male proportion_female
1   tom   male          0.9971            0.0029
2   jim   male          0.9968            0.0032
3   tom   male          0.9971            0.0029
4   tom   male          0.9971            0.0029
5  aunt   <NA>              NA                NA
6 polly female          0.0000            1.0000

Text Analysis with R for Students of Literature

03 Tuesday Sep 2013

Posted by Matthew Jockers in Text-Mining, Tips and Code


[Update (9/3/13 8:15 CST): Contributors list now active at the main Text Analysis with R for Students of Literature Resource Page]

Below this post you will find a link where you can download a draft of Text Analysis with R for Students of Literature. The book is under review with Springer as part of a new series titled “Quantitative Methods in the Humanities and Social Sciences.”

Springer agreed to let me post the draft manuscript here (thank you, Springer), and my hope is that you will download the manuscript, take it for a test drive, and then send me your thoughts. I’m especially interested to hear about areas of confusion, places where you get lost, or where you feel your students might get lost. I’m also interested in hearing about the errors (hopefully not too many), and, naturally, I’ll be delighted to hear about anything you like.

I’m open to suggestions for new sections, but before you suggest that I include another chapter on “your favorite topic,” please read the Preface where I lay out the scope of the book. It’s a beginner’s book, and helping “literary folk” get started with R is my primary goal. This is not the place to get into debates or details about hyperparameter optimization or the relative merits of p-values.*

Please also read the Acknowledgements. It is there that I hint at the spirit and intent behind the book and behind this call for feedback. I did not learn R without help, and there is still a lot about R that I have to learn. I want to acknowledge both of these facts directly and specifically. Those who offer feedback will be added to a list of contributors to be included in the print and online editions of the final text. Feedback of a substantial nature will be acknowledged directly and specifically.

That’s it. Download Text Analysis with R for Students of Literature (1.3MB .pdf). [Update: the book is now in production and the draft has been removed.]

* Besides, that ground has been well-covered by Scott Weingart

“Secret” Recipe for Topic Modeling Themes

12 Friday Apr 2013

Posted by Matthew Jockers in Text-Mining, Tips and Code


The recently (yesterday) published issue of JDH is all about topic modeling. It’s a great issue, and it got me thinking about some of the lessons I have learned over seven or eight years of modeling literary corpora. One of the important things I have learned is that the quality of the final model (which is to say the coherence and usefulness of the topics) is largely dependent upon preprocessing. I know, I know: “that’s not much fun.”

Fun or no, it is the reality, and there’s no getting around it. One of the first things you discover when you begin modeling literary materials is that books have a lot of characters. And here I don’t mean the letters “A, B, C,” but actual literary characters as in “Ahab, Beowulf, and Copperfield.” These characters can cause no end of headaches in topic modeling. Let me explain. . .

As I write this blog post, I am running a smallish topic modeling job over a corpus of 50 novels that I have selected for use in a topic modeling workshop I am teaching next week in Milwaukee. Without any preprocessing I get topics that look like these two:

[Topic cloud: a topic of words from Moby Dick]

[Topic cloud: a topic of words from Dracula]

There is nothing wrong with these topics except that one is obviously a “Moby Dick” topic and the other a “Dracula” topic. These topics formed this way in large part because of the power of the character names (yes, “whale” is a character). The presence of the character names tends to bias the model and make it collect collocates that cluster around character names. Instead of getting a topic having to do with “seafaring” (a theme, by the way, that appears in both Moby Dick and Dracula), we get these broad novel-specific topics instead.

That is not what we want.

To deal with this character “problem,” I begin by expanding the usual topic modeling “stop list” from the 100 or so high frequency, closed class words (such as “the, of, a, and. . .”) to include about 5,600 common names, or “named entities.” I posted this “expanded stoplist” to my blog some months ago as ancillary material for my book; feel free to copy it for your own work. I built my expanded stop list through a combination of named entity recognition and the scraping of baby name web sites:-)

Using the exact same model parameters that produced the two topics above, but now with the expanded stop list, I get topics that are much less about individual novels and much more about themes that cross novels. Here are two examples.

[Topic cloud: a topic of seafaring words]

[Topic cloud: words relating to Native Americans, but mostly from Last of the Mohicans]

The first topic cloud seems pretty good. In the previous run of the model, without the expanded stop list, there was no such topic. The second, however, is still problematic, largely because my expanded stopwords list, even at 5,631 words, is still imperfect. “Heyward” is a character from Last of the Mohicans whose name is not in my stop list.

But in addition to this imperfection, I would argue that there are other problems as well, at least if our objective is to harvest topics of a thematic nature. Notice, for example, the word “continued” just to the left of “heyward” and then notice “demanded” near the bottom of the cloud. These words do not contribute very much at all to the thematic sense of the topic, so ideally they too should be stopped out.

As a next step in preprocessing, therefore, I employ Part-of-Speech tagging or “POS-Tagging” in order to identify and ultimately “stop out” all of the words that are not nouns! Since I can already hear my friend Ted Underwood screaming about “discourses,” let me justify this suggestion with a small but important caveat: I think this is a good way to capture thematic information; it certainly does not capture such things as affect (i.e. attitudes towards the theme) or other nuances that may be very important to literary analysis and interpretation.

POS tagging is well documented, so I’m not going to foreground it here other than to say that it’s an imperfect method. It does make mistakes, but the best taggers (such as the Stanford Tagger that I usually use) achieve very high (97%+) accuracy (see, for example, Manning 2011).

After running a POS tagger, I have a simple little script that uses a simple little regular expression to change the following tagged sentences:

The/DT family/NN of/IN Dashwood/NNP had/VBD been/VBN long/RB settled/VBN in/IN Sussex./NNP Their/PRP$ estate/NN was/VBD large,/RB and/CC their/PRP$ residence/NN was/VBD at/IN Norland/NNP Park,/NNP in/IN the/DT centre/NN of/IN their/PRP$ property,/NN where,/, for/IN many/JJ generations,/NNS they/PRP had/VBD lived/VBN in/IN so/RB respectable/JJ a/DT manner,/JJ as/IN to/TO engage/VB the/DT general/JJ good/JJ opinion/NN of/IN their/PRP$ surrounding/VBG acquaintance./NN

into

family estate residence centre property generations opinion acquaintance
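A minimal sketch of such a regex pass, using the tagged sentence above (this is just one way to write it, not the exact script): keep the common-noun tags (NN, NNS), drop proper nouns (NNP), then strip the tags and any trailing punctuation.

```r
# tagged text in word/TAG format (the Sense and Sensibility opening above)
tagged <- "The/DT family/NN of/IN Dashwood/NNP had/VBD been/VBN long/RB settled/VBN in/IN Sussex./NNP Their/PRP$ estate/NN was/VBD large,/RB and/CC their/PRP$ residence/NN was/VBD at/IN Norland/NNP Park,/NNP in/IN the/DT centre/NN of/IN their/PRP$ property,/NN where,/, for/IN many/JJ generations,/NNS they/PRP had/VBD lived/VBN in/IN so/RB respectable/JJ a/DT manner,/JJ as/IN to/TO engage/VB the/DT general/JJ good/JJ opinion/NN of/IN their/PRP$ surrounding/VBG acquaintance./NN"

tokens <- unlist(strsplit(tagged, "\\s+"))
# keep common nouns (NN, NNS) but not proper nouns (NNP), then strip the tags
nouns <- tokens[grepl("/NNS?$", tokens)]
nouns <- gsub("/NNS?$", "", nouns)
nouns <- gsub("[[:punct:]]+$", "", nouns)  # drop trailing punctuation
paste(nouns, collapse = " ")
# "family estate residence centre property generations opinion acquaintance"
```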

Just with this transformation to nouns alone, you can begin to see how a theme of “property” or “family estates” might eventually evolve from these words during the topic modeling process. But there is still one more preprocessing step before we can run the LDA. The next step (which can really be the first step) is text chunking or segmentation.

Topic models like to have lots of texts, or more precisely, lots of bags of words. Topic models such as LDA do not take word order into account; they assume that each text or document is a bag of words. Novels are very big bags, and if we don’t chunk them up into smaller pieces we end up getting topics of a very general nature. By chunking each novel into smaller pieces, we allow the model to discover themes that occur only in specific places within novels and not just across entire novels. Consider the theme of death, for example. While there may be entire novels about death, more than likely death is going to pop up once or twice in every novel. In order for the topic model to detect or find a death topic, however, it needs to encounter bags of words that are largely about death. If the whole novel is a single bag of words, then death might not be prominent enough to rise to the level of “topicdom.”

I have found through lots and lots of experimentation that 500-1000 word chunks are pretty darn good when modeling novels. It might help to think in terms of pages: 500-1000 words is roughly 2-4 pages. The argument for this length goes something like this: a good death scene takes several pages to develop. . . etc.
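A sketch of this chunking step, assuming the novel is already a vector of words (the function name and the 1000-word default are illustrative choices, not fixed):

```r
# split a word vector into consecutive chunks of roughly chunk.size words
chunk.text <- function(words, chunk.size = 1000){
  breaks <- ceiling(seq_along(words) / chunk.size)
  split(words, breaks)
}

# a 2,500-"word" toy novel yields two full chunks and one 500-word remainder
chunks <- chunk.text(rep("word", 2500))
sapply(chunks, length)  # chunk lengths: 1000, 1000, 500
```

Each chunk then becomes one "document" in the topic model, with its parent novel tracked in the metadata.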

Exactly what chunk length I choose is a matter of some witchcraft and alchemy; it is similar to the witchcraft and tarot involved in choosing the number of topics. I’ll not unpack either of those here, but you can read more in chapter 8 of my book (plug). Here the point is to simply say that some chunking needs to happen if you are working with big documents.

So here, finally, is my “secret” recipe in pseudo code:

for each novel {
    POS-tag the novel
    split the tagged novel into 1000-word chunks
    for each chunk {
        remove non-nouns from the chunk
        lowercase everything
        remove stop-list words from the chunk
    }
}
run LDA over all chunks
analyze the resulting model

Of course, there is a lot more to it than this: you need to keep track of which chunks go with which novels and so on. But this is the general recipe.* Here are two topics derived from the same corpus of novels, now without character names and without non-nouns.

[Topic cloud: Art and Music]

[Topic cloud: Crime and Justice]

* The word “Secret” in my title is in quotes because there is nothing secret about the ingredients in this particular recipe. The idea of combining POS tagging, text chunking, and LDA is well established in various papers, including, for example, “TagLDA: Bringing document structure knowledge into topic models” (2006) and “Reading Tea Leaves: How Humans Interpret Topic Models” (2009).

Pronouns in 19th Century Fiction

22 Friday Feb 2013

Posted by Matthew Jockers in Text-Mining


Some folks I follow on Twitter (@scott_bot, @benmschmidt, @rayncordell, @foxyfolklorist, and others) were engaged in a conversation this week about the frequency of gendered pronouns in a corpus of 233 fairy tales from @foxyfolklorist’s dissertation. For a bit of literary contextualization, I tweeted a bar graph showing the frequency of 13 pronouns in a corpus of ~3,500 19th century novels. The bar graph (seen again here) breaks down pronoun usage by author gender (M, F, and U).

[Figure: Pronoun Use by Gender in 19th C. Fiction]

It is natural to wonder, as David Mimno (@dmimno) did this morning, whether there is any significance to the gender results: is gender really correlated with these observed means, or are the observed means just an artifact of messy data? One way to explore the extent to which these observed means really are an entailment of gender is to ask what the means would look like if gender were not a factor. In other words, what would happen if all the data about author gender were shuffled and the means then recalculated?*

If we do this shuffling and recalculating a whole bunch of times, say 100 times, we can then plot all the fake “genderless” permutations alongside the actual observed means and thereby see whether the observed means are outside or inside what we would expect if gender were not a factor influencing pronoun use.**
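Here is a toy sketch of that shuffling procedure for a single pronoun (the frequencies and group sizes below are invented for illustration; the real analysis used the 3,500-novel corpus):

```r
set.seed(42)
# toy per-book relative frequencies of one pronoun, by author gender
freqs  <- c(rnorm(50, mean = 0.012, sd = 0.002),  # female-authored books
            rnorm(50, mean = 0.008, sd = 0.002))  # male-authored books
gender <- rep(c("F", "M"), each = 50)

# the observed difference between the two group means
observed <- mean(freqs[gender == "F"]) - mean(freqs[gender == "M"])

# shuffle the gender labels 100 times, recomputing the difference each time
perm.diffs <- replicate(100, {
  shuffled <- sample(gender)  # break the gender/frequency link
  mean(freqs[shuffled == "F"]) - mean(freqs[shuffled == "M"])
})

# an observed difference outside the shuffled range suggests gender matters
observed > max(abs(perm.diffs))
```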

Below are the plots for the 13 pronouns from my original bar graph (above). What you’ll see below is that for certain pronouns, such as “him,” “I,” “me,” “my” and “your”, the observed (“real”) means are within the range of “expected” values if gender were not a consideration. For other pronouns, however, such as “he,” “her,” “she” and “we,” the observed values are outside the values in the randomized “fake” data generated by taking gender out of the equation.

Another fascinating element of these graphs is found in the third “U” column. These are authors of unknown gender. It is hard not to look at these observed values and wonder about the most likely genders of those anonymous writers. . .

[Figures: observed vs. permuted means for each of the 13 pronouns: he, she, him, her, I, me, my, you, your, we, it, mrs, mr]

* [As it happens, this is precisely the approach that David Mimno suggested we take in some other work (under review) in which we assess the significance of topic use (rather than pronoun use) by male and female authors.]

** [Naturally, it could be that the determining factor here is not really gender at all. It could be that “we” (readers, editors, publishers, etc.) have selected for books authored by men that express one set of linguistic qualities and books by women that express another set. In other words, these graphs don’t prove that women and men necessarily use pronouns differently, only that they do so (or don’t, depending on the pronoun in question) in this particular corpus of 19th century fiction.]

Unfolding the Novel

20 Wednesday Feb 2013

Posted by Matthew Jockers in Text-Mining


I’m excited to announce a new research project dubbed “Unfolding the Novel” (which is a play on both “paper” and “protein” folding). In collaboration with colleagues from the Stanford Literary Lab and Arizona State University and in partnership with researchers of the Book Genome project of BookLamp.com we have begun work that traces stylistic and thematic change across 300 years of fiction, from 1700-2000! Today UNL posted a news release announcing the partnership and some of our goals.

The primary goal of the project is to map major stylistic and thematic trends over 300 years of creative literature. To facilitate this work, BookLamp is providing access to a large store of metadata pertaining to mostly 20th and 21st century works of fiction. This data will be combined with similar data we have already compiled from the 19th century and new data we are curating now from the 18th century. The research team will not access the actual books but will explore at the macroscale in ways that are similar to what one can do with the data provided to researchers at the Google Ngrams project. A major difference, however, is that the data in the “Unfolding” project is highly curated, limited to fiction in English, and enriched with additional metadata including information about both gender and genre distribution.

Our initial data set consists of token frequency information that has been aggregated across one or more global metadata facets including but not limited to publication year, author gender, and book genre. Such data includes, for example, a table containing the year-to-year mean relative frequencies of the most common words in the corpus (e.g., the relative frequencies of the words “the, a, an, of, and” etc.).
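A toy illustration of that data shape (the numbers below are invented): per-book relative frequencies with a publication-year facet, collapsed to year-level means.

```r
# toy per-book relative frequencies with a publication-year facet
df <- data.frame(year = c(1850, 1850, 1851),
                 the  = c(0.061, 0.059, 0.063),
                 of   = c(0.030, 0.028, 0.031))

# collapse to year-to-year mean relative frequencies
aggregate(cbind(the, of) ~ year, data = df, FUN = mean)
```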

I’ll be reporting on the project here as things progress, but for now, it’s back to the drudgery of the text mines. . . 😉

Computing and Visualizing the 19th-Century Literary Genome

20 Friday Jul 2012

Posted by Matthew Jockers in Text-Mining


I was unable to attend the DH 2012 meeting in Hamburg, but I recorded my paper as a screen cast, and my ever faithful colleague Glen Worthey kindly delivered it on my behalf. The full presentation can be viewed here as a QuickTime movie.

[Photo: my (empty) seat in Hamburg (I assume that is a glass of Riesling). Image credit: Jan Rybicki.]

Macroanalysis

28 Monday May 2012

Posted by Matthew Jockers in Text-Mining


In preparation for the publication of my book (Macroanalysis: Digital Methods and Literary History, UIUC Press, 2013), I’ve begun posting some graphs and other data to my (new) website. To get the ball rolling, I have created an interactive “theme viewer” where visitors will find a drop-down menu of the 500 themes I harvested from a corpus of 3,346 19th-century British, Irish, and American novels using “topic modeling” and a series of pre-processing routines that I detail in the book. Each theme is accompanied by a word cloud showing the relative importance of each term to the topic, and each cloud is followed by four graphs showing the distribution of the topic/theme over time and across author genders and author nationalities. Here is a sample of a theme I have labeled “FACTORY AND WORKHOUSE LABOR.” You can click on the thumbnails below for larger images, but the real fun is over at the theme viewer.

[Thumbnails: word cloud and distribution graphs for the theme FACTORY AND WORKHOUSE LABOR]

♣ Contact

Matthew L. Jockers
Professor of English and Data Analytics
Washington State University
PO Box 642630 | Pullman, WA 99164-2630
509-335-5540 | matthew.jockers@wsu.edu
ORCID: https://orcid.org/0000-0001-5599-3706

Twitter: @mljockers
Amazon Author Profile
Goodreads Author Profile

Also See: Authors A.I.

This work by Matthew Jockers is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.