The LDA Buffet is Now Open; or, Latent Dirichlet Allocation for English Majors

29 Thursday Sep 2011

Posted by Matthew Jockers in Text-Mining


For my forthcoming book, which includes a chapter on the uses of topic modeling in literary studies, I wrote the following vignette. It is my imperfect attempt at making the mathematical magic of LDA palatable to the average humanist. Imperfect, but hopefully more fun than plate notation. . .

. . . imagine a quaint town, somewhere in New England perhaps. The town is a writer’s retreat, a place they come in the summer months to seek inspiration. Melville is there, Hemingway, Joyce, and Jane Austen just fresh from across the pond. In this mythical town there is a spot popular among the inhabitants; it is a little place called the “LDA Buffet.” Sooner or later all the writers go there to find themes for their novels. . .

One afternoon Herman Melville bumps into Jane Austen at the bocce ball court, and they get to talking.

“You know,” says Austen, “I have not written a thing in weeks.”

“Arrrrgh,” Melville replies, “me neither.”

So hand in hand they stroll down Gibbs Lane to the LDA Buffet. Now, down at the LDA Buffet no one gets fat. The buffet only serves light (leit?) motifs, themes, topics, and tropes (seasonal). Melville hands a plate to Austen, grabs another for himself, and they begin walking down the buffet line. Austen is finicky; she spoons a dainty helping of words out of the bucket marked “dancing.” A slightly larger spoonful of words, she takes from the “gossip” bucket and then a good ladle’s worth of “courtship.”

Melville makes a beeline for the “whaling” trough, and after piling on an Ahab-sized handful of whaling words, he takes a smaller spoonful of “seafaring” and then just a smidgen of “cetological jargon.”

The two companions find a table where they sit and begin putting all the words from their plates into sentences, paragraphs, and chapters.

At one point, Austen interrupts this business: “Oh Herman, you must try a bit of this courtship.”

He takes a couple of words but is not really fond of the topic. Then Austen, to her credit, asks permission before reaching across the table and sticking her fork in Melville’s pile of seafaring words, “just a taste,” she says. This work goes on for a little while; they order a few drinks and after a few hours, voila! Moby Dick and Persuasion are written . . .

[Now, dear reader, our story thus far provides an approximation of the first assumption made in LDA. We assume that documents are constructed out of some finite set of available topics. It is in the next part that things become a little complicated, but fear not, for you shall sample themes both grand and beautiful.]
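For readers who like to see the buffet line in code, here is a minimal sketch of that first assumption: a document is built by repeatedly choosing a topic according to the author's mixture and then drawing a word from that topic. The `BUFFET` contents, the function name `write_document`, and the hand-picked mixture are all illustrative assumptions; full LDA draws each author's mixture from a Dirichlet distribution rather than fixing it by hand.

```python
import random

# Hypothetical buffet: each topic is a small bag of equally likely words.
BUFFET = {
    "whaling": ["whale", "harpoon", "ship", "oil"],
    "seafaring": ["sea", "mast", "voyage", "sailor"],
    "courtship": ["marriage", "proposal", "affection", "lover"],
}

def write_document(topic_mix, n_words, rng):
    """Draw n_words: pick a topic by the author's mixture, then a word from it."""
    topics = list(topic_mix)
    weights = [topic_mix[t] for t in topics]
    words = []
    for _ in range(n_words):
        topic = rng.choices(topics, weights=weights, k=1)[0]
        words.append(rng.choice(BUFFET[topic]))
    return words

rng = random.Random(0)
# An assumed, Melville-like plate: mostly whaling, a taste of courtship.
melville = write_document(
    {"whaling": 0.7, "seafaring": 0.25, "courtship": 0.05}, 20, rng
)
print(" ".join(melville))
```

Topic modeling runs this story backwards: it starts from the finished words and tries to recover the buckets and the proportions.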

. . . Filled with a sense of deep satisfaction, the two begin walking back to the lodging house. Along the way, they bump into a blurry-eyed Hemingway, who is just then stumbling out of the Rising Sun Saloon.

Having taken on a bit too much cargo, Hemingway stops on the sidewalk in front of the two literati. Holding out a shaky pointer finger, and then feigning an English accent, Hemingway says: “Stand and Deliver!”

To this, Austen replies, “Oh come now, Mr. Hemingway, must we do this every season?”

More gentlemanly then, Hemingway replies, “My dear Jane, isn’t it pretty to think so. Now if you could please be so kind as to tell me what’s in the offing down at the LDA Buffet.”

Austen turns to Melville and the two writers frown at each other. Hemingway was recently banned from the LDA Buffet. Then Austen turns toward Hemingway and holds up six fingers, the sixth in front of her now pursed lips.

“Six topics!” Hemingway says with surprise, “but what are today’s themes?”

“Now wouldn’t you like to know that, you old sot,” says Melville.

The thousand injuries of Melville, Hemingway had borne as best he could, but when Melville ventured upon insult he vowed revenge. Grabbing their recently completed manuscripts, Hemingway turned and ran toward the South. Just before disappearing down an alleyway, he calls back to the dumbfounded writers: “All my life I’ve looked at words as though I were seeing them for the first time. . . tonight I will do so again! . . . ”

[Hemingway has thus overcome the first challenge of topic modeling. He has a corpus and a set number of topics to extract from it. In reality determining the number of topics to extract from a corpus is a bit trickier. If only we could ask the authors, as Hemingway has done here, things would be so much easier.]

. . . Armed with the manuscripts and the knowledge that there were six topics on the buffet, Hemingway goes to work.

After making backup copies of the manuscripts, he then pours all the words from the originals into a giant Italian-leather attache. He shakes the bag vigorously and then begins dividing its contents into six smaller ceramic bowls, one for each topic. When each of the six bowls is full, Hemingway gets a first glimpse of the topics that the authors might have found at the LDA Buffet. Regrettably, these topics are not very good at all; in fact, they are terrible, a jumble of random unrelated words . . .

[And now for the magic that is Gibbs Sampling.]

. . . Hemingway knows that the two manuscripts were written based on some mixture of topics available at the LDA Buffet. So to improve on this random assignment of words to topic bowls, he goes through the copied manuscripts that he kept as backups. One at a time, he picks a manuscript and pulls out a word. He examines the word in the context of the other words that are distributed throughout each of the six bowls and in the context of the manuscript from which it was taken. The first word he selects is “heaven,” and at this word he pauses, and asks himself two questions:

“How much of ‘Topic A,’ as it is presently represented in bowl A, is present in the current document?”
“Which topic, of all of the topics, has the most ‘heaven’ in it?” . . .

[Here again dear reader, you must take with me a small leap of faith and engage in a bit of further make believe. There are some occult statistics here accessible only to the initiated. Nevertheless, the assumptions of Hemingway and of the topic model are not so far-fetched or hard to understand. A writer goes to his or her imaginary buffet of themes and pulls them out in different proportions. The writer then blends these themes together into a work of art. That we might now be able to discover the original themes by reading the book is not at all amazing. In fact we do it all the time–every time we say that such and such a book is about “whaling” or “courtship.” The manner in which the computer (or dear Hemingway) does this is perhaps less elegant and involves a good degree of mathematical magic. Like all magic tricks, however, the explanation for the surprise at the end is actually quite simple: in this case our magician simply repeats the process 10 billion times! NOTE: The real magician behind this LDA story is David Mimno. I sent David a draft, and along with other constructive feedback, he supplied this beautiful line about computational magic.]

. . . As Hemingway examines each word in its turn, he decides based on the calculated probabilities whether that word would be more appropriately moved into one of the other topic bowls. So, if he were examining the word “whale” at a particular moment, he would assume that all of the words in the six bowls except for “whale” were correctly distributed. He’d now consider the words in each of those bowls and in the original manuscripts, and he would choose to move a certain number of occurrences of “whale” to one bowl or another.

Fortunately, Hemingway has by now bumped into James Joyce who arrives bearing a cup of coffee on which a spoon and napkin lay crossed. Joyce, no stranger to bags-of-words, asks with compassion: “Is this going to be a long night.”

“Yes,” Hemingway said, “yes it will, yes.”

Hemingway must now run through this whole process over and over again many times. Ultimately, his topic bowls reach a steady state where words no longer need to be reassigned to other bowls; the words have found their proper context.
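Hemingway's long night can be sketched as a small collapsed Gibbs sampler. This is a toy under stated assumptions, not the implementation any particular toolkit uses: the function name `gibbs_lda`, the smoothing values `alpha` and `beta`, and the two tiny "manuscripts" are all made up for illustration. The weight computed for each bowl is the product of his two questions: how much of this topic is in the current document, and how much of this word is in the topic.

```python
import random
from collections import defaultdict

def gibbs_lda(docs, n_topics, n_iter=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized docs (lists of words)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]                # tokens of doc d in bowl k
    topic_word = [defaultdict(int) for _ in range(n_topics)]  # copies of word w in bowl k
    topic_total = [0] * n_topics                              # all tokens in bowl k
    assign = []                                               # current bowl of every token

    def move(d, w, k, step):
        """Add (step=+1) or remove (step=-1) one token of w from bowl k."""
        doc_topic[d][k] += step
        topic_word[k][w] += step
        topic_total[k] += step

    # Shake the bag: every token starts in a random bowl.
    for d, doc in enumerate(docs):
        assign.append([rng.randrange(n_topics) for _ in doc])
        for w, k in zip(doc, assign[d]):
            move(d, w, k, +1)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Lift this one token out; treat all other tokens as correct.
                move(d, w, assign[d][i], -1)
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                # Drop it into a bowl chosen in proportion to those weights.
                assign[d][i] = rng.choices(range(n_topics), weights=weights)[0]
                move(d, w, assign[d][i], +1)
    return doc_topic, topic_word

docs = [
    "whale sea ship whale harpoon sea".split(),
    "marriage proposal affection marriage lover".split(),
]
doc_topic, topic_word = gibbs_lda(docs, n_topics=2)
for d, counts in enumerate(doc_topic):
    print(d, [c / sum(counts) for c in counts])
```

The printed rows are each document's share of each bowl. On a corpus this small the result is not meaningful; the point is only the shape of the loop.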

After pausing for a well-deserved smoke, Hemingway dumps out the contents of the first bowl and finds that it contains the following words:

“whale sea men ship whales penfon air side life bounty night oil natives shark seas beard sailors hands harpoon mast top feet arms teeth length voyage eye heart leviathan islanders flask soul ships fishery sailor sharks company. . .”

He peers into another bowl that looks more like this:

“marriage happiness daughter union fortune heart wife consent affection wishes life attachment lover family promise choice proposal hopes duty alliance affections feelings engagement conduct sacrifice passion parents bride misery reason fate letter mind resolution rank suit event object time wealth ceremony opposition age refusal result determination proposals. . .”

After consulting the contents of each bowl, Hemingway immediately knows what topics were on the menu at the LDA Buffet. And, not only this, Hemingway knows exactly what Melville and Austen selected from the Buffet and in what quantities. He discovers that Moby Dick is composed of 40% whaling, 18% seafaring and 2% gossip (from that little taste he got from Jane) and so on . . .

[Thus ends the fable.]

For the rest of the (LDA) story, see David Mimno’s Topic Modeling Bibliography

Aberrant Adjectives in 19th Century Novels

18 Sunday Sep 2011

Posted by Matthew Jockers in Text-Mining


I created the visualization below using Many Eyes and a data set derived from part-of-speech tagged novels from 19th century Britain. Found here are the 100 most “aberrant adjectives.” Aberrant here is determined by selecting those words that have the greatest amount of usage deviation (measured by relative frequency) over a 13-decade time period. To qualify, a word must also appear in every decade.
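The selection rule can be sketched in a few lines. The exact deviation measure is not specified above, so this sketch assumes the standard deviation of each word's relative frequency across decades; the function name `aberrant_words` and the toy counts are hypothetical.

```python
from statistics import pstdev

def aberrant_words(counts_by_decade, totals_by_decade, top_n=100):
    """Rank words by how much their relative frequency varies across decades.

    counts_by_decade: {word: [raw count in each decade]}
    totals_by_decade: [total adjective tokens in each decade]
    """
    scores = {}
    for word, counts in counts_by_decade.items():
        # Qualification rule: the word must appear in every decade.
        if any(c == 0 for c in counts):
            continue
        rel_freqs = [c / total for c, total in zip(counts, totals_by_decade)]
        # Assumed measure of "usage deviation": population standard deviation.
        scores[word] = pstdev(rel_freqs)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy data over three decades (the real data set spans thirteen).
toy_counts = {"fair": [50, 10, 40], "old": [30, 30, 30], "odd": [5, 0, 9]}
toy_totals = [1000, 1000, 1000]
print(aberrant_words(toy_counts, toy_totals, top_n=2))  # → ['fair', 'old']
```

Here “odd” is excluded because it is missing from the second decade, and “fair” outranks the perfectly steady “old.”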

Unigrams, and bigrams, and trigrams, oh my

22 Wednesday Dec 2010

Posted by Matthew Jockers in Commentary, Text-Mining


I’ve been watching the ngrams flurry online, on Twitter, and on various email lists over the last couple of days. Though I think there is great stuff to be learned from Google’s ngram viewer, I’m advising colleagues to exercise restraint and caution. First, we still have a lot to learn about what can and cannot be said, reliably, with this kind of data–especially in terms of “culture.” And second, the eye candy charts can be deceiving, especially if the data is not analyzed in terms of statistical significance.

It’s not my intention here to be a “nay-sayer” or a wet blanket; as I said, there is much to learn from the Google data, and I too have had fun playing with the ngram viewer. That said, here are a few things that concern me.

We have no metadata about the texts that are being queried. This is a huge problem. Take the “English Fiction” corpus, for example. What kinds of texts does it contain? Poetry, Drama, Novels, Short Stories, etc.? From what countries do these works originate? Is there an even distribution of author genders? Is the sample biased toward a particular genre? What is the distribution of texts over time? At least this last one we can get from downloading the Google data.
There are lots of “forces” at work on patterns of ngram usage, and without access to the metadata, it will be hard to draw meaningful conclusions about what any of these charts actually mean. To call these charts representations of “culture” is, I think, a dangerous move. Even at this scale, the corpus is not representative of culture–it may be, but we just don’t know. More than likely the corpus is something quite other than representative of culture. It probably represents the collection practices of major research libraries. Again, without the metadata to tell us what these texts are and where they are from, we must be awfully careful about drawing conclusions that reach beyond the scope of the corpus. The leap from corpus to culture is a big one.
And then there is the problem of “linguistic drift”, a phenomenon mathematically analogous to genetic drift in evolution. In simple terms, some share of the change observed in ngram frequency over time is probably the result of what can be thought of as random mutations. An excellent article about this process can be found here–>“Words as alleles: connecting language evolution with Bayesian learners to models of genetic drift”.
Data noise and bad OCR. Ted Underwood has done a fantastic job of identifying some problems related to the 18th century long s. It’s a big problem, especially if users aren’t ready to deal with it by substitution of f’s for s’s. But the long s problem is fairly easy to deal with compared to other types of OCR problems–especially cases where the erroneous OCR’ed word spells another word that is correct: e.g. “fame” and “same”. But even these we can live with at some level. I have made the argument over and over again that at a certain scale these errors become less important, but not unimportant. That is, of course, if the errors are only short term aberrations, “blips,” and not long term consistencies. Having spent a good many years looking at bad OCR, I thought it might be interesting to type in a few random character sequences and see what the n-gram viewer would show. The first graph below plots the usage of “asdf” over time. Wow, how do we account for the spike in usage of “asdf” in the 1920s and again in the late 1990s? And what about the seemingly cyclical pattern of rising and falling over time? (HINT: Check the y-axis).

And here’s another chart comparing the usage of “asdf” to “qwer.”

And there are any number of these random character sequences. At my request, my three-year-old made up and typed in “asdv”, “mlik”, “puas”, “puase”, “pux”–all of these “ngrams” showed up in the data, and some of them had tantalizing patterns of usage. My daughter’s typing away on my laptop reminded me of Borges’s Library of Babel as well as the old story about how a dozen monkeys typing at random will eventually write all of the great works of literature. It would seem that at least a few of the non-canonical primate masterpieces found their way into Google’s Library of Babel.
And then there is the legitimate data in the data that we don’t really care about–title pages and library book plates, for example. After running a Named Entity Extraction algorithm over 2600 novels from the Internet Archive’s 19th century fiction collection, I was surprised to see the popularity of “Illinois.” It was one of the most common place names. Turns out that is because all these books came from the University of Illinois and all contained this information in the first page of the scans. It was not because 19th century authors were all writing about the Land of Lincoln. Follow this link to get a sense of the role that the partner libraries may be playing in the ngram data: Libraries in the Google Data
In other words, it is possible that a lot of the words in the data are not words we actually want in the data. Would it be fair, for example, to say that this chart of the word “Library” in fiction is a fair representation of the interest in libraries in our literary culture? Certainly not. Nor is this chart for the word University an accurate representation of the importance of Universities in our literary culture.

So, these are some problems; some are big and some are small.

Still, I’m all for moving ahead and “playing” with the Google data. But we must not be seduced by the graphs or by the notion that this data is quantitative and therefore accurate, precise, objective, representative, etc. What Google has given us with the ngram viewer is a very sophisticated toy, and we must be cautious in using the toy as a tool. The graphs are incredibly seductive, but peaks and valleys must be understood both in terms of the corpus from which they are harvested and in terms of statistical significance (and those light-grey percentages listed on the y-axis).

SEASR Grant

02 Tuesday Nov 2010

Posted by Matthew Jockers in Commentary, Text-Mining


This month a group of researchers at Stanford, University of Illinois, University of Maryland, and George Mason were awarded a $790,000 grant from the Mellon Foundation to advance the prior work of the SEASR project. I’ll be serving as the overall Project Director and as one of the researchers in the Stanford component of the grant. In this phase of the SEASR project, we will focus on leveraging the existing SEASR infrastructure in support of four “use cases.” But “use case” hardly describes the research intensive nature of the proposed work, nor does it capture the strongly humanistic bias of the work proposed. Each partner has committed to a specific research project and each has the expressed goal of advancing humanities research and publishing their results. I’d like to emphasize this point about advancing humanities research.

This grant represents an important step beyond the tool building, QA and UI testing stages of software development. All too often, it seems, our digital humanities projects devote a great deal of time, money, and labor to infrastructure and prototyping and then all too frequently the results languish in the great sea of hammers without a nail. Sure, a few journeymen carpenters stick these tools in their belts and hammer away, but all too often it seems that more effort goes into building the tools and then the resources sit around gathering dust while humanities research marches on in the time-tested modes with which we are most familiar.

Of course, I don’t mean this to be a criticism of the tool builders or the tools built. The TAPOR project, for example, offers many useful text analysis widgets, and I frequently send my colleagues and students there for quick and dirty text-analysis. And just last month I had occasion to use and cite Stefan Sinclair’s Voyeur application. I was thrilled to have Voyeur at my fingertips; it provided a quick and easy way to do exactly what I wanted.

But often, the analytic tasks involved in our projects are multifaceted and cannot be addressed by any one tool. Instead, these projects involve “flows” in which our “humanistic” data travels through a series of analytic “filters” and comes out on the other end in some altered form. The TAPOR project attempts to be a virtual text analysis “workbench” in which the craftsman can slide a project around the bench from one tool to the next. This model works well for smallish projects but is not robust enough for large scale projects and, despite some significant interface improvements over the years, remains, for me at least, a bit clunky. I find it great for quick tasks with one or two texts, but inefficient for processing multiple texts or multiple processes. Part of the TAPOR mission was to develop a suite of tools that could be used by the average, ordinary humanist: which is to say, the humanist without any real technical chops. It succeeds on that front to be sure.

SEASR offers an alternative approach and what it provides in terms of processing power and computational elegance it gives up in terms of ease of use and transparency. The SEASR “interface” is one that involves constructing modular “workflows” in which each module corresponds to some computational task. These modules are linked together such that one process feeds into the next and the business of “sliding” a project around from one tool to another on the virtual workbench is taken over by the workflow manager.
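To make the “workflow” idea concrete, here is a minimal sketch of modules chained so that each one feeds the next. It is emphatically not the SEASR interface or API; the module names (`tokenize`, `drop_stopwords`, `count_tokens`) and the runner `run_workflow` are hypothetical stand-ins for the general pattern.

```python
from collections import Counter
from functools import reduce

STOPWORDS = {"the", "and", "a", "of"}  # tiny illustrative list

def tokenize(text):
    """Module 1: raw text in, lowercase tokens out."""
    return text.lower().split()

def drop_stopwords(tokens):
    """Module 2: tokens in, content-bearing tokens out."""
    return [t for t in tokens if t not in STOPWORDS]

def count_tokens(tokens):
    """Module 3: tokens in, frequency table out."""
    return Counter(tokens)

def run_workflow(modules, data):
    """Pass the data through each module in order, like a workflow manager."""
    return reduce(lambda current, module: module(current), modules, data)

result = run_workflow(
    [tokenize, drop_stopwords, count_tokens],
    "The whale and the sea and the whale",
)
print(result.most_common())  # → [('whale', 2), ('sea', 1)]
```

Swapping a module in or out changes the analysis without touching the others, which is the appeal of the modular design.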

In this grant we have specifically deemphasized UI development in favor of output, in favor of “results” in the humanities sense of the word. As we write in the proposal, “The main emphasis of the project will be on developing, coordinating, and investigating the research questions posed by the participating humanities scholars.” The scholars in this project include myself and Franco Moretti at Stanford, Dan Cohen at GMU, Tanya Clement at University of Maryland, Ted Underwood and John Unsworth both of UIUC. On the technical end, we have Michael Welge and Loretta Auvil of the Automated Learning Group, of the National Center for Supercomputing Applications.

As the project gets rolling, I will have more to post about the specific research questions we are each addressing and the ongoing results of our work. . .

Panning for Memes

30 Sunday May 2010

Posted by Matthew Jockers in Commentary, Text-Mining


Over in the English Department Literature Lab, we have been experimenting with Topic Modeling as a means of discovering latent themes (aka topics) in a corpus of 19th century novels. Topic Modeling is an unsupervised machine learning process that employs Latent Dirichlet allocation. “It posits that each document is a mixture of a small number of topics and that each word’s creation is attributable to one of the document’s topics.”

We’ve been experimenting using the Java-Based MAchine Learning for LanguagE Toolkit (Mallet) from UMASS Amherst and a corpus of British and American novels from the 19th century. In one experiment we ran the topic modeler over just the British corpus, in another over just the American corpus. But when we combined the two collections and ran the model over the whole corpus, we discovered that certain topics showed up in only one or the other corpus. For example, one solely American topic was composed of words related to slavery and words written in southern dialect. And there was a strictly British topic clearly indicative of the royalty and aristocracy: words such as “lord,” “king”, “duke,” “sir”, “lady.” This was an interesting result and not simply because it provides a quantitative way of distinguishing topics or themes that are distinct to one nation or another, but also because the topics themselves could be read and interpreted in context.
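One way to spot such nation-specific topics is to average each topic's share of the documents separately for each corpus and then look for topics that are effectively absent from one side. The sketch below assumes a table of per-document topic proportions and a nation label per document; the function names, the `threshold` value, and the toy numbers are all hypothetical, and this is not necessarily how we identified the topics above.

```python
def mean_topic_share(doc_topics, labels):
    """Average each topic's proportion separately for each corpus label."""
    sums, counts = {}, {}
    for proportions, label in zip(doc_topics, labels):
        running = sums.setdefault(label, [0.0] * len(proportions))
        for k, p in enumerate(proportions):
            running[k] += p
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in running]
            for label, running in sums.items()}

def one_sided_topics(means, threshold=0.01):
    """Map topic index -> the only label whose mean share reaches the threshold."""
    n_topics = len(next(iter(means.values())))
    result = {}
    for k in range(n_topics):
        present = [label for label, shares in means.items() if shares[k] >= threshold]
        if len(present) == 1:
            result[k] = present[0]
    return result

# Toy data: four novels, three topics; rows sum to 1.
doc_topics = [[0.6, 0.4, 0.0], [0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.3, 0.7]]
labels = ["American", "American", "British", "British"]
means = mean_topic_share(doc_topics, labels)
print(one_sided_topics(means))  # → {0: 'American', 2: 'British'}
```

Topic 1 appears in both toy corpora, so it is left out of the result; topics like that are the ones discussed next.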

More interesting for me, however, were two topics that appeared in both corpora. The first, which appeared more in the British corpus was related to “soldiering.” A second topic, which was more common in the American corpus, has to do with Indian wars. The “soldiering” topic was composed of the following words:

“men,” “general,” “captain,” “colonel,” “army,” “horse,” “sir,” “enemy,” “soldier,” “battle,” “day,” “war,” “officer,” “great,” “country,” “house,” “time,” “head,” “left,” “road,” “british,” “soldiers,” “washington,” “night,” “fire,” “father,” “officers,” “heard,” “moment.”

The Indians topic included:

“indian,” “men,” “indians,” “great,” “time,” “chief,” “river,” “party,” “red,” “white,” “place,” “savages,” “woods,” “day,” “side,” “fire,” “war,” “savage,” “water,” “canoe,” “rifle,” “people,” “warriors,” “returned,” “feet,” “friends,” “tree,” “night,” “distance.”

What was most fascinating, however, was that when the soldiering topic was found in the American corpus it usually had to do with Indians, and when the Indian topic appeared in the British corpus it was almost completely in the context of the Irish! As an Irish-Studies scholar who wrote a thesis on the role of the American West in Irish and Irish-American literature, I found this an incredibly rich discovery. The literature of the Irish and the Irish Diaspora is filled with comparisons between the Irish situation vis-à-vis the British and the Native American situation vis-à-vis what one Irish American author described as the “Tide of Empire.”

Readers wishing to follow this line of comparison in some more contemporary works might want to have a look at Joyce’s short story “An Encounter,” Flann O’Brien’s book At Swim Two Birds, Paul Muldoon’s Madoc and Patrick McCabe’s The Butcher Boy.

Who’s Your DH Blog Mate: Match-Making the Day of DH Bloggers with Topic Modeling

19 Friday Mar 2010

Posted by Matthew Jockers in Commentary, Text-Mining


Social Networking for digital humanities nerds? Which DH bloggers are you most compatible with? Let’s get the right nerds with the right nerds–match making made in digital humanities heaven.

After seeing Stefan Sinclair’s Voyeuristic analysis of the Day of DH Blog posts, I wrote and asked him how to get access to the “corpus” of posts. He hooked me up, and I pre-processed the data with a few php scripts, then ran an LDA topic modeling process and then some more post processing with R in order to see the most important themes of the day and also to cluster the 117 bloggers based on their thematic similarity.

So, here’s the what and then the how. As for the why? Why not?

What:

117 Day of DH Bloggers

10 Unsupervised Topics (10 is arbitrary–I could have picked 100). These topics are generated by an analysis of the words and word sequences in the individual bloggers’ sites. The purpose is to harvest out the most prominent “themes” or topics. These themes are presented in a series of word lists. It is up to the researcher to then “label” the word clusters. I have labeled a few of them (in [brackets] at the beginning of the word lists below–you might use another label–this is the subjective part). Here they are:

[human interaction in DH] work today people time working things email year week days bit good meeting tomorrow

day thing mail dh de image based fact called things change ago encoding house

[Academic Writing–including Grants] day time dh start post blog proposal google write great posts lunch nice articles

[Digital publishing and archives] http talk future collection making online version publishing field morning life traditional daily large

conference university blog morning read internet access couple computers archive involved including great written

[DH Teaching] students dh teaching humanities class technology scholars university lab group library support scholarship student

[DH Projects] digital project humanities work projects room meeting collections office building task database spent st

data project xml working projects web interesting user set spend system ways couple time

digital day humanities media writing post computing twitter english humanist real phd web rest

[reading and text-analysis] book text tools software books today reading literary texts coffee edition search tool textual

Unfortunately, the Day of DH corpus isn’t truly big enough to get the sort of crystal clear topics that I have harvested from much larger collections, but still, the topics above, seen in aggregate, do give us a sense of what’s “hot” to talk about in the field.

But let’s get to the sexy part. . .

In addition to harvesting out the prominent topics, the modeling tool outputs data indicating how much (what proportion) of each blog is about each topic. The resulting matrix is of dimension 117×10 (117 blogs and 10 topics). The data in the cells are percentages for each topic in each author’s blog. The values in each row add up to 100%. With a little massaging in R, I read in the matrix and then use some simple distance and clustering functions to group the bloggers into 10 (again an arbitrary number) groups; groups based on shared themes. Using this data, I then output a matrix showing which authors have the most in common; thus, I do a little subtle match-making in advance of our digital rendezvous in London–birds of a feather blog together?
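The match-making step can be approximated in a few lines. The actual work was done with distance and clustering functions in R; the sketch below swaps in Python and a simpler nearest-neighbour lookup (Euclidean distance between rows of the blog-by-topic matrix) rather than reproducing the clustering. The blogger names and proportions are made up.

```python
from math import dist

def best_matches(names, rows):
    """For each blogger, find the other blogger with the closest topic profile."""
    matches = {}
    for i, name in enumerate(names):
        # Euclidean distance from this row to every other row.
        others = [(dist(rows[i], rows[j]), names[j])
                  for j in range(len(names)) if j != i]
        matches[name] = min(others)[1]
    return matches

# Hypothetical 3x3 slice of a blog-by-topic matrix; each row sums to 1.
names = ["blogger_a", "blogger_b", "blogger_c"]
rows = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
]
for name, mate in best_matches(names, rows).items():
    print(name, "->", mate)
```

Note that matching is not symmetric: `blogger_c` is paired with `blogger_b` here even though `blogger_b`'s own closest match is `blogger_a`. Clustering into groups, as in the lists that follow, avoids that asymmetry.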

Here are the groups:

Group1

aimeemorrison
ariefwidodo
barbarabordalejo
caraleitch
carlosmartinez
carlwhithaus
clairewarwick
craigharkema
ellimylonas
geoffreyrockwell
glenworthey
guydaarmstrong
henrietterouedcunliffe
ianjohnson
janrybicki
jenterysayers
jonbath
juliaflanders
juliannenyhan
justinerichardson
kai-christianbruhn
kathleenfitzpatrick
keithlawson
lauramandell
lauraweakly
malterehbein
matthewjockers
meganmeredith-lobay
melissaterras
milenaradzikowska
miranhladnik
patricksahle
paulspence
peterrobinson
pouyllau
rafaelalvarado
raysiemens
reneaudet
rogerosborne
rudymcdaniel
stanruecker
stephanieschlitz
susangreenberg
victoriasmith
vikazafrin
williamturkel

Group2

alejandrogiacometti
annacaprarelli
danasolomon
ernestopriego
karensmith
leedurbin
matthewcarlos
paolosordi
sarasteger
stephanethibault
yinliu

Group3

alialbarran
amandagailey
cyrilbriquet
federicomeschini
ntlab
stefansinclair
torstenschassan

Group4

aligrotkowski
ashtonnichols
calenhenry
devonfitzgerald
enricasalvatori
ericforcier
garrywong
jameschartrand
joelyuvienco
johnnewman
peterorganisciak
shannonlucky
silviarussell
simonmahony
sophiahoosein
stevenhayes
taraandrews
violalasmana
willardmccarty

Group5

alunedwards
hopegreenberg
lewisulman

Group6

amandavisconti
jamessmith
martinholmes
sperberg-mcqueen
waynegraham

Group7

bethanynowviskie
josephgilbert
katherineharris
kellyjohnston
kirstenuszkalo
margaretgraham
matthewgold
paulyoungman

Group8

charlestravis
craigbellamy
franzfischer
jeremyboggs
johnwall
kathrynbarre
shawnday
teresadobson

Group9

jasonboyd
jolanda-pieta
joriszundert
michaelmaguire
thomascrombez
williamallen

Group10

louburnard
nevenjovanovic
sharongoetz
stephenramsay

Twitterers @sramsay and @mattwilkens were poking around here today and wondered what the topics would look like if there were only five topics and five clusters instead of 10 and 10. Here are the topics:

data work time text working tools people thing system xml mail software things texts
day time morning lot work bit find web class teaching student days dh real
digital humanities day tomorrow book twitter university blog computing reading books writing tei emails
day dh today time post things write start online writing working computer year hours
project digital work projects students meeting today people humanities dh scholars library year lab

And here are the Blogger-Mates clusters when I set n=5:

Group1

aimeemorrison
alejandrogiacometti
alialbarran
amandagailey
annacaprarelli
ashtonnichols
barbarabordalejo
carlosmartinez
carlwhithaus
clairewarwick
craigbellamy
craigharkema
danasolomon
devonfitzgerald
enricasalvatori
ernestopriego
garrywong
glenworthey
guydaarmstrong
henrietterouedcunliffe
ianjohnson
jameschartrand
janrybicki
jenterysayers
joelyuvienco
johnnewman
jonbath
juliannenyhan
justinerichardson
karensmith
kathleenfitzpatrick
keithlawson
leedurbin
lewisulman
malterehbein
matthewgold
matthewjockers
meganmeredith-lobay
melissaterras
michaelmaguire
miranhladnik
nevenjovanovic
patricksahle
peterrobinson
raysiemens
reneaudet
rogerosborne
shannonlucky
silviarussell
simonmahony
sophiahoosein
stefansinclair
stephanieschlitz
susangreenberg
taraandrews
thomascrombez
torstenschassan
vikazafrin
violalasmana
willardmccarty
williamallen
williamturkel
yinliu

Group2

aligrotkowski
ariefwidodo
calenhenry
caraleitch
charlestravis
ericforcier
geoffreyrockwell
jolanda-pieta
juliaflanders
lauraweakly
margaretgraham
matthewcarlos
milenaradzikowska
nt2lab
paolosordi
peterorganisciak
rudymcdaniel
sarasteger
sharongoetz
stanruecker
stevenhayes
victoriasmith

Group3

alunedwards
hopegreenberg
katherineharris
stephanethibault
teresadobson

Group4

amandavisconti
cyrilbriquet
federicomeschini
jamessmith
joriszundert
martinholmes
rafaelalvarado
sperberg-mcqueen
stephenramsay
waynegraham

Group5

bethanynowviskie
ellimylonas
franzfischer
jasonboyd
jeremyboggs
johnwall
josephgilbert
kai-christianbruhn
kathrynbarre
kellyjohnston
kirstenuszkalo
lauramandell
louburnard
paulspence
paulyoungman
pouyllau
shawnday

Analyze This (Page)

12 Friday Mar 2010

Posted by Matthew Jockers in Commentary, Text-Mining


“TAToo” is a fun Flash widget developed by Peter Organisciak at the University of Alberta. Peter works under the supervision of Digital Humanists Par Excellence and TAPoR Gurus Geoffrey Rockwell and Stan Ruecker. The widget (just some embed-able code) does “layman’s” text analysis on the web pages in which its code is embedded. I’ve added the code to the right sidebar of my blog to take it for a test drive.

The tool offers several “views” of your text. The default view is a word cloud in which words with greater frequency are both larger and bolder. Looking at the word cloud can give you a pretty quick sense of the page’s key terms and concepts.

By clicking on the “Tool:” bar at the top of the widget, you can select other options. The “List Words” view provides a term frequency list. You can then click on any word in the list to see its collocates. Alternatively, there are both a Collocate view and a Concordance view that allow users to enter a specific word and get information about the company that word keeps.

Kudos to Peter and the rest of the TAPoR Tools team for continuing to pursue the fine art of tool making.

65,000 Texts to Mine?

10 Wednesday Feb 2010

Posted by Matthew Jockers in Commentary, Text-Mining


A story in the Feb. 7th issue of the Telegraph reports that the British Library is going to make 65,000 first edition texts available for public download via Amazon’s Kindle. This news is almost as exciting as Google’s decision some years ago to partner with a consortium of big libraries in order to digitize all their books. What makes this project from the British Library particularly exciting is that the texts being offered are all works of 19th century fiction.

Unlike the Google project that is digitizing everything, this offering from the BL is already presorted to include just the kind of content that literary researchers can really use. With Google, I assume, one is going to have to figure out how to sort the legal books from the cook books, the memoirs from the fiction. Here, however, the BL has already done a big part of the work.

It will be interesting to see how this material gets offered and what sort of metadata is included with the individual files. For those of us who are interested in corpus-mining and macroanalysis (as opposed to just reading a single book at a time) the metadata is crucial. If, for example, we have the publication date of each text in an easily extractable format (e.g. TEI XML) we could explore all kinds of chronological investigations.

In prior research, working with a corpus of just 250 19th century British novels, I explored the “theme” of childhood by quantifying the relative frequency of a “cluster” or “semantic field” of words suggestive of “childhood”. In that work, I discovered a proportionally higher incidence of the theme during the Victorian period, a finding that tends to confirm the idea that childhood was an “invention” of the Victorians. But, then again, a corpus of 250 novels doesn’t even scratch the surface.
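The measurement described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not the code used in that research: the function name `field_frequency` and the short `CHILDHOOD` word list are hypothetical, and the tokenizer simply treats every run of ASCII letters as a word.

```python
import re

# Hypothetical word list, chosen only for illustration; the actual
# "childhood" cluster used in the research is not given in the post.
CHILDHOOD = {"child", "children", "childhood", "infant", "nursery"}

def field_frequency(text, field_words):
    """Return the share of a text's word tokens that belong to a semantic field."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    field = set(field_words)
    hits = sum(1 for token in tokens if token in field)
    return hits / len(tokens)
```

Computing this value for every novel and grouping the results by publication date is one way such a theme could be tracked over time, which is why the metadata question matters.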

I’m not sure just what’s included in the British Library’s 65,000 texts. I assume these are not just British texts, but American, German, etc. Franco Moretti has estimated that there were 8,000 to 10,000 novels published in Great Britain in the 19th century (20-40,000 works of prose fiction). Surely a good many of these are part of the BL’s 65,000. Which brings us back to the metadata question. Will it be possible to generate a list of which texts in the 65,000 are British-authored and British-published *novels*? If the answer is yes, then the game is on.

Get the texts, convert from mobi to pdf, html, or other text format using any number of open source apps and then poof! You’ve got a COUS–Corpus of Unusual Size! Of course, it’d be a lot easier if the BL would make the texts available (for researchers at least) through a channel that doesn’t involve Amazon or one of the eBook formats. I’m investigating that path now and will report on any progress.

Machine-Classifying Novels and Plays by Genre

13 Friday Feb 2009

Posted by Matthew Jockers in Commentary, Text-Mining


In the post that follows, I describe some recent experiments that I (and others) have conducted. The goal of these experiments was to accurately machine-classify novels and plays (Shakespeare’s) by genre. One of the most interesting results ends up having more to do with feature extraction than with the classification algorithm.

Background

Several weeks ago, Mike Witmore visited the Beyond Search workshop that I organize here at Stanford. In prior work, Witmore and some colleagues used a program called Docuscope (developed at Carnegie Mellon) to distinguish between and classify (statistically) Shakespeare’s histories and comedies.

“Equipped with a specialized dictionary, Docuscope is able to divide texts into strings of words that are then sorted into one of eighteen word categories, such as “Inner Thinking” and “Past Events.” The program turns differentiating amongst genres into a statistical task by testing the frequency of occurence of words in each of the categories for each individual genre and recognizing where significant differences occur.”

Docuscope was designed as a tool for analyzing student writing, but Witmore (et al.) discovered that it could also be employed as a specialized sort of feature extraction tool.

To test the efficacy of Docuscope as a tool for detecting and clustering novels by genre, Franco Moretti and I created a full text corpus that included 36 19th century novels (stripped of title pages and other identifying information). We divided this corpus into three groups and organized them by genre:

Group one consisted of 12 texts belonging to 3 different (but fairly similar) genres (gothic, historical tale, and national tale).
Group two consisted of 12 texts belonging to 3 different genres that were quite different (industrial, silver-fork, bildungsroman).
Group three consisted of 12 texts belonging to 6 different genres that mix 3 genres from those already included in group one or two and 3 new genres (evangelical, newgate, and anti-jacobin).

Witmore was given this corpus in electronic form (each novel in plain text). For identification purposes (since Mike was not privy to the actual genres or titles of the novels), he labeled each of the 12 genre groups with a number 1-12. Witmore’s numberings correspond to genres as follows:

1. Gothic
2. Historical Novels
3. National Tales
4. Industrial Novels
5. Silver-Fork Novels
6. Bildungsroman
7. Anti-Jacobin
8. Industrial
9. Gothic
10. Evangelical
11. Newgate
12. Bildungsroman

Using Docuscope, Witmore ran a series of tests in an attempt to cluster the similar genres together. The experiment was designed to pick the three groups from 7-12 that have genre cognates in 1-6. Witmore’s results for the closest affiliated genres were impressive:

2:9 (Historical with Gothic)
1:9 (Gothic with Gothic); Witmore notes that this second cluster was a close (statistically) second to the one above
4:8 (Industrial with Industrial)
6:12 (Bildungsroman with Bildungsroman)

Witmore’s results also suggested an especially close relationship between the Gothic and Historical; Witmore writes that “groups 1 and 2 looked like they paired with the same candidate group (9).”

Additional Experiments

Witmore’s work and the results he derived got me thinking more carefully about the problem of genre classification. In many ways, genre classification is akin to authorship attribution. Generally speaking, though, with authorship problems one attempts to extract a feature set that excludes context-sensitive features from the analysis. (The consensus in most authorship attribution research is that a feature set made up primarily of frequent, or closed-class, word features yields the most accurate results.) For genre classification, however, one would intuitively assume that context words would be critical (e.g., Gothic novels often have “castles,” so we would not want to exclude context-sensitive words like “castle”). But my preliminary experiments have suggested just the opposite, namely that a distinct and detectable genre “signal” may be derived from a limited set of high-frequency features.

Using just 42 word and punctuation features, I was able to classify the novels in the corpus described above as well as Witmore did using Docuscope (and a far more complex feature set). To derive my feature set, I lowercase the texts, count the various feature types and convert the counts to relative frequencies, and then winnow the feature set by choosing only those features that have a mean relative frequency of 3% or greater. This results in the following 42 features (the prefix “p_” indicates a punctuation token instead of a word token):

“a”, “all”, “an”, “and”, “as”, “at”, “be”, “but”, “by”, “for”, “from”, “had”, “have”, “he”, “her”, “his”, “i”, “in”, “is”, “it”, “me”, “my”, “not”, “of”, “on”, “p_apos”, “p_comma”, “p_exlam”, “p_hyphen”, “p_period”, “p_quote”, “p_semi”, “she”, “that”, “the”, “this”, “to”, “was”, “were”, “which”, “with”, “you”
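For readers who want to try something similar, here is a minimal Python sketch of that winnowing procedure: lowercase, tokenize, convert counts to relative frequencies, and keep only the features whose mean relative frequency meets a threshold. It is an illustration under stated assumptions, not the code behind the results above. The function names, the `PUNCT_NAMES` mapping, and the simple tokenizer (runs of ASCII letters plus a handful of straight punctuation marks) are all hypothetical, and the threshold is left as a parameter.

```python
import re
from collections import Counter

# Assumed mapping from punctuation marks to "p_" token names. The names mirror
# the feature lists in the post, but the exact mapping is a guess.
PUNCT_NAMES = {
    "'": "p_apos", ",": "p_comma", "!": "p_exlam", "-": "p_hyphen",
    ".": "p_period", '"': "p_quote", ";": "p_semi", ":": "p_colon",
    "?": "p_ques",
}

# A token is either a run of letters or one of the punctuation marks above.
TOKEN_RE = re.compile("[a-z]+|[" + re.escape("".join(PUNCT_NAMES)) + "]")

def relative_frequencies(text):
    """Lowercase and tokenize a text; return each token's share of all tokens."""
    tokens = [PUNCT_NAMES.get(t, t) for t in TOKEN_RE.findall(text.lower())]
    if not tokens:
        return {}
    counts = Counter(tokens)
    return {token: n / len(tokens) for token, n in counts.items()}

def select_features(freq_tables, threshold):
    """Keep features whose mean relative frequency across all texts meets the threshold."""
    vocab = set().union(*freq_tables)
    n_texts = len(freq_tables)
    means = {
        feature: sum(table.get(feature, 0.0) for table in freq_tables) / n_texts
        for feature in vocab
    }
    return sorted(feature for feature, mean in means.items() if mean >= threshold)

def feature_vector(freq_table, features):
    """Represent one text as its relative frequencies for the selected features."""
    return [freq_table.get(feature, 0.0) for feature in features]
```

Calling `relative_frequencies` on each text, passing the resulting tables to `select_features`, and then building one `feature_vector` per text yields the kind of matrix that a clustering routine expects.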

Using the “dist” and “hclust” functions in the open-source “R” statistics application, I cluster the texts and output the following dendrogram:
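As a rough illustration of the clustering step, the sketch below does in plain Python what R's “dist” and “hclust” do with their default settings: Euclidean distances between feature vectors, followed by agglomerative clustering with complete linkage. Whether those defaults were the settings used for the results here is an assumption, and the function names are hypothetical. The sketch returns the merge history rather than drawing a dendrogram.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def complete_linkage(vectors):
    """Agglomerative clustering with complete linkage.

    Returns the merge history as (members_a, members_b, height) tuples,
    in the order the merges happen. Members are indices into `vectors`.
    """
    n = len(vectors)
    dist = [[euclidean(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    clusters = [(i,) for i in range(n)]  # every text starts as its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance between two clusters is the
                # largest distance between any pair of their members.
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges
```

Each tuple in the returned list corresponds to one join in a dendrogram, with the third element giving the height at which the two branches meet. Texts that merge early, at low heights, are the ones with the most similar feature profiles.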

These results were compelling, and after I shared them with Mike Witmore, he suggested testing this methodology on his Shakespeare corpus. Again the results were compelling: the process accurately clustered the majority of Shakespeare’s plays into appropriate clusters of “tragedy,” “comedy,” and “history.” The dendrogram below shows the results of my Shakespeare experiment using these 37 features:

“a”, “and”, “as”, “be”, “but”, “for”, “have”, “he”, “him”, “his”, “i”, “in”, “is”, “it”, “me”, “my”, “not”, “of”, “p_apos”, “p_colon”, “p_comma”, “p_exlam”, “p_hyphen”, “p_period”, “p_ques”, “p_semi”, “so”, “that”, “the”, “this”, “thou”, “to”, “what”, “will”, “with”, “you”, “your”.

These initial tests raise a number of important questions, not the least of which is how much of a factor genre plays in determining the usage of high-frequency word and punctuation tokens. We have plans to conduct a series of more rigorous experiments, and the results of these tests will be forthcoming. In the meantime, my initial tests appear to confirm, again, the significant role that common function words play in defining literary style.

Matthew L. Jockers
