Yesterday afternoon, Lincoln Mullen and Cameron Blevins released a new R package designed to guess (infer) the gender of a name. In my class on literary characterization at the macroscale, students are working on a project that involves a computational study of character genders. Needless to say, the gender package couldn’t have come at a better time. I only began experimenting with the package this morning, but so far it looks very promising.
It doesn’t do everything that we need, but it’s a great addition to our workflow. I’ve copied below some R code that combines the gender package with named entity recognition (NER) to try to extract character names and genders from a small sample of prose from Twain’s Tom Sawyer. I tried a few other text samples and discovered some significant challenges (e.g. Mrs. Dashwood), but these have more to do with last names and the difficulty of accurate NER than with the gender package itself.
Anyhow, I’ve just begun to experiment, so no big conclusions here, just some starter code to get folks thinking. Hopefully others will take this idea and run with it!
library(gender)
library(openNLP)
library(NLP)

# Sentence and word annotators must run before entity annotation
sent_token_annotator <- Maxent_Sent_Token_Annotator()
word_token_annotator <- Maxent_Word_Token_Annotator()

s <- as.String("Tom did play hookey, and he had a very good time. He got back home barely in season to help Jim, the small colored boy, saw next-day's wood and split the kindlings before supper—at least he was there in time to tell his adventures to Jim while Jim did three-fourths of the work. Tom's younger brother (or rather half-brother) Sid was already through with his part of the work (picking up chips), for he was a quiet boy, and had no adventurous, trouble-some ways. While Tom was eating his supper, and stealing sugar as opportunity offered, Aunt Polly asked him questions that were full of guile, and very deep—for she wanted to trap him into damaging revealments. Like many other simple-hearted souls, it was her pet vanity to believe she was endowed with a talent for dark and mysterious diplomacy, and she loved to contemplate her most transparent devices as marvels of low cunning.")
a2 <- annotate(s, list(sent_token_annotator, word_token_annotator))

# Person-name entity annotator (the English model ships separately in openNLPmodels.en)
entity_annotator <- Maxent_Entity_Annotator()
named.ents <- s[entity_annotator(s, a2)]

# Split multi-word entities (e.g. "Aunt Polly") into single tokens and drop blanks
named.ents.l <- strsplit(named.ents, "\\W")
named.ents.v <- unlist(named.ents.l)
not.blanks.v <- which(named.ents.v != "")
named.ents.v <- named.ents.v[not.blanks.v]

# Look up each lowercased token in the gender package
gender(tolower(named.ents.v))
And here is the output:
   name gender proportion_male proportion_female
1   tom   male          0.9971            0.0029
2   jim   male          0.9968            0.0032
3   tom   male          0.9971            0.0029
4   tom   male          0.9971            0.0029
5  aunt   <NA>              NA                NA
6 polly female          0.0000            1.0000
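The output above repeats tom three times and includes the title aunt, which has no match. Here is a minimal base-R sketch of one way to tidy the token vector before the lookup. The list of titles (`titles.v`) is a hypothetical, hand-made example and is not part of the gender package.

```r
# Tokens as they appear in the output above
named.ents.v <- c("Tom", "Jim", "Tom", "Tom", "Aunt", "Polly")

# Hypothetical, hand-made list of titles to drop (an assumption; extend as needed)
titles.v <- c("aunt", "uncle", "mr", "mrs", "miss", "dr")

# Lowercase, remove titles, then keep one copy of each remaining name
tokens.v <- tolower(named.ents.v)
names.v <- unique(tokens.v[!tokens.v %in% titles.v])
names.v
```

Here `names.v` holds tom, jim, and polly, and could be passed to `gender()` in place of `tolower(named.ents.v)`. Dropping titles throws away a useful gender cue, though, so for a case like Mrs. Dashwood it may be better to keep them in a separate vector.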
I’m delighted to see that the package is useful. One small suggestion: you might change this line
gender(tolower(named.ents.v))
to this
gender(tolower(named.ents.v), years = c(1880, 1900))
The package calculates the proportion based on a range of years. The default is currently 1932-2012 (my guess at the maximum typical lifespan of an American today). Since you’re using Mark Twain, the end of the nineteenth century would be more appropriate. The SSA data only goes back to 1880. Eventually I’d like to offer the option of using census data to find the most common names.
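To see why the range of years matters, here is a toy sketch in base R of a proportion calculated over a range of years. It is one plausible reading of the description above and not the package's actual implementation. The birth counts are invented for illustration (they are not SSA figures), and `proportion_in_range` is a hypothetical helper.

```r
# Invented birth counts for one name (NOT real SSA data)
births <- data.frame(
  name   = rep("leslie", 4),
  year   = c(1890, 1900, 1990, 2000),
  female = c(10, 20, 900, 800),
  male   = c(90, 80, 100, 200)
)

# Hypothetical helper: pool the counts for a name across the given years
# (inclusive), then divide each gender's total by the combined total
proportion_in_range <- function(df, nm, years) {
  rows <- df[df$name == nm & df$year >= years[1] & df$year <= years[2], ]
  total <- sum(rows$female) + sum(rows$male)
  c(proportion_male   = sum(rows$male) / total,
    proportion_female = sum(rows$female) / total)
}

early <- proportion_in_range(births, "leslie", c(1880, 1900))
late  <- proportion_in_range(births, "leslie", c(1932, 2012))
```

With these made-up counts, the same name comes out mostly male for 1880 to 1900 and mostly female for 1932 to 2012, which is why a nineteenth-century range suits a nineteenth-century novel.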