Over in the English Department Literature Lab, we have been experimenting with topic modeling as a means of discovering latent themes (aka topics) in a corpus of 19th-century novels. Topic modeling, as we are using it, is an unsupervised machine learning process that employs Latent Dirichlet Allocation (LDA). “It posits that each document is a mixture of a small number of topics and that each word’s creation is attributable to one of the document’s topics.”
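For readers who like to see an idea in code, here is a rough toy sketch of that "mixture of topics" assumption. It is not how the modeling software works internally, only an illustration of the generative story: each word in a document is produced by first picking one of the document's topics and then picking a word from that topic. The topic names, the probabilities, and the `generate_document` function are all made up for illustration.

```python
import random

# Hypothetical toy topics: each maps a few words to probabilities.
# The probabilities are invented for illustration, not estimated from any corpus.
TOPICS = {
    "soldiering": {"army": 0.4, "captain": 0.3, "battle": 0.3},
    "frontier": {"river": 0.4, "canoe": 0.3, "woods": 0.3},
}

def generate_document(topic_mixture, n_words, rng):
    """Draw n_words (topic, word) pairs: pick a topic per word, then a word from it."""
    topic_names = list(topic_mixture)
    topic_weights = [topic_mixture[name] for name in topic_names]
    pairs = []
    for _ in range(n_words):
        # Step 1: choose which of the document's topics this word comes from.
        topic = rng.choices(topic_names, weights=topic_weights, k=1)[0]
        # Step 2: choose a word according to that topic's word probabilities.
        vocab = list(TOPICS[topic])
        probs = [TOPICS[topic][word] for word in vocab]
        pairs.append((topic, rng.choices(vocab, weights=probs, k=1)[0]))
    return pairs

# A hypothetical document that is mostly "soldiering" with some "frontier".
rng = random.Random(0)
doc = generate_document({"soldiering": 0.8, "frontier": 0.2}, 10, rng)
print(doc)
```

Fitting a topic model runs this story in reverse: given only the words, the software estimates which topics and which mixtures would most plausibly have produced them.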

We’ve been experimenting using the Java-based MAchine Learning for LanguagE Toolkit (MALLET) from UMass Amherst and a corpus of British and American novels from the 19th century. In one experiment we ran the topic modeler over just the British corpus, in another over just the American corpus. But when we combined the two collections and ran the model over the whole corpus, we discovered that certain topics showed up in only one corpus or the other. For example, one solely American topic was composed of words related to slavery and words written in southern dialect. And there was a strictly British topic clearly indicative of royalty and aristocracy: words such as “lord,” “king,” “duke,” “sir,” “lady.” This was an interesting result, not simply because it provides a quantitative way of identifying topics or themes that are distinct to one nation or another, but also because the topics themselves could be read and interpreted in context.
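One way to make "showed up in only one corpus" concrete is to average each topic's share across the documents of each national corpus and flag the topics that clear some small threshold in exactly one of them. The sketch below assumes you already have per-document topic proportions (the kind of table a topic modeler can export) and a nationality label for each novel. The data, the topic names, the 0.01 threshold, and both function names are hypothetical; this is not the Lab's actual procedure.

```python
from collections import defaultdict

def mean_topic_share(doc_topics, labels):
    """Average each topic's proportion over the documents of each corpus."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for doc_id, proportions in doc_topics.items():
        corpus = labels[doc_id]
        counts[corpus] += 1
        for topic, share in proportions.items():
            sums[corpus][topic] += share
    return {
        corpus: {topic: total / counts[corpus] for topic, total in topics.items()}
        for corpus, topics in sums.items()
    }

def exclusive_topics(shares, threshold=0.01):
    """Return topics whose mean share exceeds the threshold in exactly one corpus."""
    all_topics = {topic for topics in shares.values() for topic in topics}
    result = {}
    for topic in all_topics:
        present = [c for c, topics in shares.items() if topics.get(topic, 0.0) > threshold]
        if len(present) == 1:
            result[topic] = present[0]
    return result

# Made-up proportions for four hypothetical novels (not real results).
# Only three topics are shown; the rest of each document's mass is omitted.
doc_topics = {
    "novel_a": {"aristocracy": 0.30, "soldiering": 0.20, "slavery": 0.00},
    "novel_b": {"aristocracy": 0.25, "soldiering": 0.15, "slavery": 0.00},
    "novel_c": {"aristocracy": 0.00, "soldiering": 0.10, "slavery": 0.35},
    "novel_d": {"aristocracy": 0.00, "soldiering": 0.05, "slavery": 0.20},
}
labels = {
    "novel_a": "British", "novel_b": "British",
    "novel_c": "American", "novel_d": "American",
}

shares = mean_topic_share(doc_topics, labels)
print(exclusive_topics(shares))
```

In this toy data the "soldiering" topic clears the threshold in both corpora, so it is not flagged as exclusive, while its higher mean share on the British side shows how a shared topic can still lean toward one nation.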

More interesting for me, however, were two topics that appeared in both corpora. The first, which appeared more in the British corpus, was related to “soldiering.” The second, which was more common in the American corpus, had to do with Indian wars. The “soldiering” topic was composed of the following words:

“men,” “general,” “captain,” “colonel,” “army,” “horse,” “sir,” “enemy,” “soldier,” “battle,” “day,” “war,” “officer,” “great,” “country,” “house,” “time,” “head,” “left,” “road,” “british,” “soldiers,” “washington,” “night,” “fire,” “father,” “officers,” “heard,” “moment.”

The Indians topic included:

“indian,” “men,” “indians,” “great,” “time,” “chief,” “river,” “party,” “red,” “white,” “place,” “savages,” “woods,” “day,” “side,” “fire,” “war,” “savage,” “water,” “canoe,” “rifle,” “people,” “warriors,” “returned,” “feet,” “friends,” “tree,” “night,” “distance.”

What was most fascinating, however, was that when the soldiering topic was found in the American corpus it usually had to do with Indians, and when the Indian topic appeared in the British corpus it was almost completely in the context of the Irish! As an Irish Studies scholar who wrote a thesis on the role of the American West in Irish and Irish-American literature, I found this an incredibly rich discovery. The literature of the Irish and the Irish Diaspora is filled with comparisons between the Irish situation vis-à-vis the British and the Native American situation vis-à-vis what one Irish American author described as the “Tide of Empire.”

Readers wishing to follow this line of comparison in some more contemporary works might want to have a look at Joyce’s short story “An Encounter,” Flann O’Brien’s book At Swim-Two-Birds, Paul Muldoon’s Madoc, and Patrick McCabe’s The Butcher Boy.