Some folks I follow on Twitter (@scott_bot, @benmschmidt, @rayncordell, @foxyfolklorist, and others) were engaged in a conversation this week about the frequency of gendered pronouns in a corpus of 233 fairy tales from @foxyfolklorist’s dissertation. For a bit of literary contextualization, I tweeted a bar graph showing the frequency of 13 pronouns in a corpus of ~3,500 19th-century novels. The bar graph (seen again here) breaks down pronoun usage by author gender (M, F, and U).
It is natural to wonder, as David Mimno (@dmimno) did this morning, if there is any significance to the gender results: is gender really correlated with these observed means, or are the observed means just an artifact of messy data? One way to explore the extent to which these observed means really do follow from gender is to ask what the means would look like if gender were not a factor. In other words, what would happen if all the data about author gender were shuffled and the means then recalculated?*
If we do this shuffling and recalculating a whole bunch of times, say 100 times, we can then plot all the fake “genderless” permutations alongside the actual observed means and thereby see whether the observed means are outside or inside what we would expect if gender were not a factor influencing pronoun use.**
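For readers who want to try the idea, here is a minimal Python sketch of the shuffle-and-recalculate procedure. It is not the code behind the plots in this post. The function names (`group_means`, `permuted_means`, `outside_permuted_range`) and the toy data are hypothetical, and I am assuming each novel contributes one relative frequency for a single pronoun plus one author-gender label.

```python
import random
from collections import defaultdict


def group_means(values, labels):
    """Return the mean of `values` within each label group."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, label in zip(values, labels):
        sums[label] += value
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}


def permuted_means(values, labels, n_permutations=100, seed=0):
    """Shuffle the labels n_permutations times, recomputing group means each time."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    shuffled = list(labels)    # copy, so the real labels are left untouched
    results = []
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        results.append(group_means(values, shuffled))
    return results


def outside_permuted_range(observed, permuted):
    """Flag each group whose observed mean lies outside the range of shuffled means."""
    flags = {}
    for label, obs in observed.items():
        fake = [p[label] for p in permuted]
        flags[label] = obs < min(fake) or obs > max(fake)
    return flags


# Hypothetical toy data, invented for illustration: one pronoun frequency
# per novel, for 20 "M" novels followed by 20 "F" novels.
values = [1.0 + 0.01 * i for i in range(20)] + [2.0 + 0.01 * i for i in range(20)]
labels = ["M"] * 20 + ["F"] * 20

observed = group_means(values, labels)
permuted = permuted_means(values, labels, n_permutations=100)
print(observed)
print(outside_permuted_range(observed, permuted))
```

Checking whether an observed mean falls outside the range of 100 shuffles is a rough visual guide rather than a formal p-value. Running more permutations and counting how often a shuffled mean is at least as extreme as the observed one would give a finer estimate.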
Below are the plots for the 13 pronouns from my original bar graph (above). What you’ll see below is that for certain pronouns, such as “him,” “I,” “me,” “my,” and “your,” the observed (“real”) means are within the range of “expected” values if gender were not a consideration. For other pronouns, however, such as “he,” “her,” “she,” and “we,” the observed values are outside the range of values in the randomized “fake” data generated by taking gender out of the equation.
Another fascinating element of these graphs is found in the third “U” column. These are authors of unknown gender. It is hard not to look at these observed values and wonder about the most likely genders of those anonymous writers. . .
* [As it happens, this is precisely the approach that David Mimno suggested we take in some other work (under review) in which we assess the significance of topic use (rather than pronoun use) by male and female authors.]
** [Naturally, it could be that the determining factor here is not really gender at all. It could be that “we” (readers, editors, publishers, etc.) have selected for books authored by men that express one set of linguistic qualities and books by women that express another set. In other words, these graphs don’t prove that women and men necessarily use pronouns differently, only that they do so (or don’t, depending on the pronoun in question) in this particular corpus of 19th-century fiction.]