In this post I describe some recent experiments that I (and others) have conducted. The goal of these experiments was to machine-classify novels and Shakespeare’s plays accurately by genre. One of the most interesting results ends up having more to do with feature extraction than with the classification algorithm.
Several weeks ago, Mike Witmore visited the Beyond Search workshop that I organize here at Stanford. In prior work, Witmore and some colleagues used a program called Docuscope (developed at Carnegie Mellon) to distinguish statistically between Shakespeare’s histories and comedies and to classify them.
“Equipped with a specialized dictionary, Docuscope is able to divide texts into strings of words that are then sorted into one of eighteen word categories, such as ‘Inner Thinking’ and ‘Past Events.’ The program turns differentiating amongst genres into a statistical task by testing the frequency of occurrence of words in each of the categories for each individual genre and recognizing where significant differences occur.”
Docuscope was designed as a tool for analyzing student writing, but Witmore et al. discovered that it could also be employed as a specialized sort of feature extraction tool.
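To make the idea of dictionary-based feature extraction concrete, here is a minimal Python sketch. It is not Docuscope: the two category names are the ones quoted above, but the word lists are invented for illustration, and it matches single words only, whereas Docuscope works on strings of words.

```python
import re
from collections import Counter

# Hypothetical toy dictionary. The category names come from the quotation
# above; the word lists are made up and are NOT Docuscope's actual dictionary.
TOY_DICTIONARY = {
    "Inner Thinking": {"think", "believe", "wonder", "suppose"},
    "Past Events": {"was", "were", "had", "went"},
}


def category_frequencies(text, dictionary=TOY_DICTIONARY):
    """Return each category's share of all word tokens in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for category, words in dictionary.items():
            if token in words:
                counts[category] += 1
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    return {category: counts[category] / total for category in dictionary}
```

Each text then becomes a short vector of category frequencies, and those vectors are what a statistical comparison across genres would operate on.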
To test the efficacy of Docuscope as a tool for detecting and clustering novels by genre, Franco Moretti and I created a full-text corpus of 36 19th-century novels (stripped of title pages and other identifying information). We divided this corpus into three groups and organized them by genre:
- Group one consisted of 12 texts belonging to 3 different (but fairly similar) genres (gothic, historical tale, and national tale).
- Group two consisted of 12 texts belonging to 3 different genres that were quite different (industrial, silver-fork, bildungsroman).
- Group three consisted of 12 texts belonging to 6 different genres: 3 already included in group one or two, and 3 new genres (evangelical, newgate, and anti-jacobin).
Witmore was given this corpus in electronic form (each novel in plain text). For identification purposes (since Mike was not privy to the actual genres or titles of the novels), he labeled each of the 12 genre groups with a number from 1 to 12. Witmore’s numberings correspond to genres as follows:
- Historical Novels
- National Tales
- Industrial Novels
- Silver-Fork Novels
Using Docuscope, Witmore ran a series of tests in an attempt to cluster similar genres together. The experiment was designed to pick the three groups from 7-12 that have genre cognates in 1-6. Witmore’s results for the most closely affiliated genres were impressive:
- 2:9 (Historical with Gothic)
- 1:9 (Gothic with Gothic); Witmore notes that this pairing was a close statistical second to the one above
- 4:8 (Industrial with Industrial)
- 6:12 (Bildungsroman with Bildungsroman)
Witmore’s results also suggested an especially close relationship between the Gothic and the Historical. Witmore writes that “groups 1 and 2 looked like they paired with the same candidate group (9).”
Witmore’s work and the results he derived got me thinking more fully about the problem of genre classification. In many ways, genre classification is akin to authorship attribution. Generally speaking, though, with authorship problems one attempts to extract a feature set that excludes context-sensitive features from the analysis. (The consensus in most authorship attribution research is that a feature set made up primarily of frequent, or closed-class, word features yields the most accurate results.) For genre classification, however, one would intuitively assume that context words would be critical (e.g. Gothic novels often have “castles,” so we would not want to exclude context-sensitive words like “castle”). But my preliminary experiments have suggested just the opposite, namely that a distinct and detectable genre “signal” may be derived from a limited set of high-frequency features.
Using just 42 word and punctuation features, I was able to classify the novels in the corpus described above as well as Witmore did using Docuscope (and a far more complex feature set). To derive my feature set, I lowercase the texts, count the various feature types and convert the counts to relative frequencies, and then winnow the feature set by keeping only those features that have a mean relative frequency of 3% or greater. This results in the following 42 features (the prefix “p_” indicates a punctuation token rather than a word token):
“a”, “all”, “an”, “and”, “as”, “at”, “be”, “but”, “by”, “for”, “from”, “had”, “have”, “he”, “her”, “his”, “i”, “in”, “is”, “it”, “me”, “my”, “not”, “of”, “on”, “p_apos”, “p_comma”, “p_exlam”, “p_hyphen”, “p_period”, “p_quote”, “p_semi”, “she”, “that”, “the”, “this”, “to”, “was”, “were”, “which”, “with”, “you”
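The feature-derivation steps just described (lowercase, count, convert to relative frequency, winnow by mean relative frequency) can be sketched in a few lines of Python. This is an illustration under assumptions, not the code used in the experiment: the tokenizer, the punctuation mapping, and the function names are all mine, and apostrophes, hyphens, and quotation marks are simply ignored because the post does not say how they were tokenized.

```python
import re
from collections import Counter

# Assumed mapping from punctuation marks to "p_" feature names. Only a few
# marks are handled; apostrophes, hyphens, and quotes are skipped here.
PUNCT_NAMES = {",": "p_comma", ".": "p_period", ";": "p_semi",
               "!": "p_exlam", "?": "p_ques", ":": "p_colon"}

TOKEN_RE = re.compile(r"[a-z]+|[,.;!?:]")


def relative_frequencies(text):
    """Lowercase, tokenize, and return each feature's share of all tokens."""
    tokens = [PUNCT_NAMES.get(t, t) for t in TOKEN_RE.findall(text.lower())]
    counts = Counter(tokens)
    total = sum(counts.values())
    return {feat: n / total for feat, n in counts.items()} if total else {}


def select_features(profiles, threshold):
    """Keep features whose mean relative frequency across texts >= threshold."""
    vocab = set().union(*profiles)
    n = len(profiles)
    means = {f: sum(p.get(f, 0.0) for p in profiles) / n for f in vocab}
    return sorted(f for f, m in means.items() if m >= threshold)


def feature_matrix(profiles, features):
    """One row per text, one column per selected feature."""
    return [[p.get(f, 0.0) for f in features] for p in profiles]
```

Given a list of texts, one would build `profiles = [relative_frequencies(t) for t in texts]`, call `select_features(profiles, threshold)`, and pass the result to `feature_matrix`. The threshold is a parameter here, expressed as a proportion rather than a percentage.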
Using the “dist” and “hclust” functions in the open-source “R” statistics application, I cluster the texts and output the following dendrogram:
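As a rough illustration of what this clustering step does, here is a small pure-Python stand-in for the two R calls. It assumes R's defaults, Euclidean distance for “dist” and complete linkage for “hclust”; the post does not state which options were actually used, and the function names below are my own.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def complete_linkage(rows, labels):
    """Agglomerative clustering with complete linkage.

    Returns the merge history as (members_a, members_b, height) tuples,
    which is the information a dendrogram draws.
    """
    clusters = [[i] for i in range(len(rows))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: the distance between two clusters is
                # the largest distance between any pair of their members.
                d = max(euclidean(rows[a], rows[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append(([labels[k] for k in clusters[i]],
                       [labels[k] for k in clusters[j]], d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because j > i
    return merges
```

Each row passed in would be one text's vector of relative frequencies over the selected features. Texts that merge at low heights are the ones whose high-frequency word and punctuation profiles are most alike.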
These results were compelling, and after I shared them with Mike Witmore, he suggested testing this methodology on his Shakespeare corpus. Again the results were compelling: the process placed the majority of Shakespeare’s plays into the appropriate clusters of “tragedy,” “comedy,” and “history.” The dendrogram below shows the results of my Shakespeare experiment using these 37 features:
“a”, “and”, “as”, “be”, “but”, “for”, “have”, “he”, “him”, “his”, “i”, “in”, “is”, “it”, “me”, “my”, “not”, “of”, “p_apos”, “p_colon”, “p_comma”, “p_exlam”, “p_hyphen”, “p_period”, “p_ques”, “p_semi”, “so”, “that”, “the”, “this”, “thou”, “to”, “what”, “will”, “with”, “you”, “your”.
These initial tests raise a number of important questions, not the least of which is how much of a factor genre plays in determining the usage of high-frequency word and punctuation tokens. We plan to conduct a series of more rigorous experiments, and the results of these tests will be forthcoming. In the meantime, my initial tests appear to confirm, again, the significant role that common function words play in defining literary style.