Who’s Your DH Blog Mate: Match-Making the Day of DH Bloggers with Topic Modeling

Social Networking for digital humanities nerds? Which DH bloggers are you most compatible with? Let’s get the right nerds with the right nerds–match making made in digital humanities heaven.

After seeing Stéfan Sinclair’s Voyeuristic analysis of the Day of DH blog posts, I wrote and asked him how to get access to the “corpus” of posts. He hooked me up. I pre-processed the data with a few PHP scripts, ran an LDA topic modeling process, and then did some more post-processing in R, both to see the most important themes of the day and to cluster the 117 bloggers based on their thematic similarity.

So, here’s the what and then the how. As for the why? Why not?

What:

117 Day of DH Bloggers

10 Unsupervised Topics (10 is arbitrary–I could have picked 100). These topics are generated by an analysis of the words and word sequences in the individual bloggers’ sites. The purpose is to harvest out the most prominent “themes” or topics. These themes are presented as a series of word lists. It is up to the researcher to then “label” the word clusters. I have labeled a few of them (in [brackets] at the beginning of the word lists below; you might use another label, since this is the subjective part). Here they are:

  1. [human interaction in DH] work today people time working things email year week days bit good meeting tomorrow
  2. day thing mail dh de image based fact called things change ago encoding house
  3. [Academic Writing–including Grants] day time dh start post blog proposal google write great posts lunch nice articles
  4. [Digital publishing and archives] http talk future collection making online version publishing field morning life traditional daily large
  5. conference university blog morning read internet access couple computers archive involved including great written
  6. [DH Teaching] students dh teaching humanities class technology scholars university lab group library support scholarship student
  7. [DH Projects] digital project humanities work projects room meeting collections office building task database spent st
  8. data project xml working projects web interesting user set spend system ways couple time
  9. digital day humanities media writing post computing twitter english humanist real phd web rest
  10. [reading and text-analysis] book text tools software books today reading literary texts coffee edition search tool textual
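For readers curious how word lists like these come out of a topic model, here is a minimal Python sketch. It is not the tool used for this post: the topic names, words, and weights are invented, and it assumes only that the model assigns each word a weight within each topic, so that a "topic" can be displayed as its heaviest words.

```python
# Hypothetical per-topic word weights (invented numbers, not from the corpus).
topic_word_weights = {
    "topic_a": {"students": 42, "teaching": 31, "class": 18, "coffee": 3},
    "topic_b": {"xml": 27, "data": 39, "project": 22, "lunch": 2},
}


def top_words(weights, n):
    """Return the n highest-weighted words, heaviest first (ties broken alphabetically)."""
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]


# Build one short word list per topic, as in the numbered lists above.
word_lists = {topic: top_words(w, n=3) for topic, w in topic_word_weights.items()}

for topic, words in word_lists.items():
    print(topic, " ".join(words))
```

The labeling step stays manual: the code only ranks words, and deciding that a list "means" teaching or data work is still the researcher's call.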

Unfortunately, the Day of DH corpus isn’t truly big enough to get the sort of crystal clear topics that I have harvested from much larger collections, but still, the topics above, seen in aggregate, do give us a sense of what’s “hot” to talk about in the field.

But let’s get to the sexy part. . .

In addition to harvesting out the prominent topics, the modeling tool outputs data indicating how much (what proportion) of each blog is about each topic. The resulting matrix is of dimension 117×10 (117 blogs and 10 topics). The data in the cells are percentages for each topic in each author’s blog. The values in each row add up to 100%. With a little massaging in R, I read in the matrix and then used some simple distance and clustering functions to group the bloggers into 10 groups (again an arbitrary number) based on shared themes. Using this data, I then output a matrix showing which authors have the most in common; thus, I do a little subtle match-making in advance of our digital rendezvous in London–birds of a feather blog together?
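The clustering step can be sketched in a few lines. This is a Python illustration, not the R code behind this post: the blogger names and proportions are invented, and the choice of Euclidean distance with complete-linkage agglomerative clustering is an assumption, since the post only says "simple distance and clustering functions."

```python
from math import sqrt

# Hypothetical blog-by-topic matrix: one row per blogger, one column per topic.
# Each row sums to 1.0 (100%). All values are invented for illustration.
doc_topics = {
    "blogger_a": [0.70, 0.20, 0.10],
    "blogger_b": [0.65, 0.25, 0.10],
    "blogger_c": [0.10, 0.80, 0.10],
    "blogger_d": [0.20, 0.70, 0.10],
    "blogger_e": [0.05, 0.15, 0.80],
}


def euclidean(p, q):
    """Straight-line distance between two topic-proportion rows."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cluster(rows, k):
    """Agglomerative clustering (complete linkage), merging until k groups remain."""
    groups = [[name] for name in sorted(rows)]

    def linkage(g1, g2):
        # Complete linkage: distance between the two farthest members.
        return max(euclidean(rows[a], rows[b]) for a in g1 for b in g2)

    while len(groups) > k:
        pairs = ((i, j) for i in range(len(groups)) for j in range(i + 1, len(groups)))
        i, j = min(pairs, key=lambda ij: linkage(groups[ij[0]], groups[ij[1]]))
        groups[i] = groups[i] + groups[j]  # merge the closest pair
        del groups[j]
    return [sorted(g) for g in groups]


def best_match(rows, name):
    """The single most similar other blogger, by smallest distance."""
    return min((n for n in rows if n != name), key=lambda n: euclidean(rows[name], rows[n]))


groups = cluster(doc_topics, k=3)
for number, members in enumerate(groups, start=1):
    print(f"Group{number}: {', '.join(members)}")
print("Best match for blogger_a:", best_match(doc_topics, "blogger_a"))
```

The number of groups `k` is as arbitrary here as it is in the post; a different `k` regroups the same rows, and a different distance or linkage rule can change the groups as well.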

Here are the groups:

  • Group1
    1. aimeemorrison
    2. ariefwidodo
    3. barbarabordalejo
    4. caraleitch
    5. carlosmartinez
    6. carlwhithaus
    7. clairewarwick
    8. craigharkema
    9. ellimylonas
    10. geoffreyrockwell
    11. glenworthey
    12. guydaarmstrong
    13. henrietterouedcunliffe
    14. ianjohnson
    15. janrybicki
    16. jenterysayers
    17. jonbath
    18. juliaflanders
    19. juliannenyhan
    20. justinerichardson
    21. kai-christianbruhn
    22. kathleenfitzpatrick
    23. keithlawson
    24. lauramandell
    25. lauraweakly
    26. malterehbein
    27. matthewjockers
    28. meganmeredith-lobay
    29. melissaterras
    30. milenaradzikowska
    31. miranhladnik
    32. patricksahle
    33. paulspence
    34. peterrobinson
    35. pouyllau
    36. rafaelalvarado
    37. raysiemens
    38. reneaudet
    39. rogerosborne
    40. rudymcdaniel
    41. stanruecker
    42. stephanieschlitz
    43. susangreenberg
    44. victoriasmith
    45. vikazafrin
    46. williamturkel
  • Group2
    1. alejandrogiacometti
    2. annacaprarelli
    3. danasolomon
    4. ernestopriego
    5. karensmith
    6. leedurbin
    7. matthewcarlos
    8. paolosordi
    9. sarasteger
    10. stephanethibault
    11. yinliu
  • Group3
    1. alialbarran
    2. amandagailey
    3. cyrilbriquet
    4. federicomeschini
    5. nt2lab
    6. stefansinclair
    7. torstenschassan
  • Group4
    1. aligrotkowski
    2. ashtonnichols
    3. calenhenry
    4. devonfitzgerald
    5. enricasalvatori
    6. ericforcier
    7. garrywong
    8. jameschartrand
    9. joelyuvienco
    10. johnnewman
    11. peterorganisciak
    12. shannonlucky
    13. silviarussell
    14. simonmahony
    15. sophiahoosein
    16. stevenhayes
    17. taraandrews
    18. violalasmana
    19. willardmccarty
  • Group5
    1. alunedwards
    2. hopegreenberg
    3. lewisulman
  • Group6
    1. amandavisconti
    2. jamessmith
    3. martinholmes
    4. sperberg-mcqueen
    5. waynegraham
  • Group7
    1. bethanynowviskie
    2. josephgilbert
    3. katherineharris
    4. kellyjohnston
    5. kirstenuszkalo
    6. margaretgraham
    7. matthewgold
    8. paulyoungman
  • Group8
    1. charlestravis
    2. craigbellamy
    3. franzfischer
    4. jeremyboggs
    5. johnwall
    6. kathrynbarre
    7. shawnday
    8. teresadobson
  • Group9
    1. jasonboyd
    2. jolanda-pieta
    3. joriszundert
    4. michaelmaguire
    5. thomascrombez
    6. williamallen
  • Group10
    1. louburnard
    2. nevenjovanovic
    3. sharongoetz
    4. stephenramsay

Twitterers @sramsay and @mattwilkens were poking around here today and wondered what the topics would look like if there were only five topics and five clusters instead of 10 and 10. Here are the topics:

  1. data work time text working tools people thing system xml mail software things texts
  2. day time morning lot work bit find web class teaching student days dh real
  3. digital humanities day tomorrow book twitter university blog computing reading books writing tei emails
  4. day dh today time post things write start online writing working computer year hours
  5. project digital work projects students meeting today people humanities dh scholars library year lab

And here are the Blogger-Mates clusters when I set n=5:

  • Group1
    1. aimeemorrison
    2. alejandrogiacometti
    3. alialbarran
    4. amandagailey
    5. annacaprarelli
    6. ashtonnichols
    7. barbarabordalejo
    8. carlosmartinez
    9. carlwhithaus
    10. clairewarwick
    11. craigbellamy
    12. craigharkema
    13. danasolomon
    14. devonfitzgerald
    15. enricasalvatori
    16. ernestopriego
    17. garrywong
    18. glenworthey
    19. guydaarmstrong
    20. henrietterouedcunliffe
    21. ianjohnson
    22. jameschartrand
    23. janrybicki
    24. jenterysayers
    25. joelyuvienco
    26. johnnewman
    27. jonbath
    28. juliannenyhan
    29. justinerichardson
    30. karensmith
    31. kathleenfitzpatrick
    32. keithlawson
    33. leedurbin
    34. lewisulman
    35. malterehbein
    36. matthewgold
    37. matthewjockers
    38. meganmeredith-lobay
    39. melissaterras
    40. michaelmaguire
    41. miranhladnik
    42. nevenjovanovic
    43. patricksahle
    44. peterrobinson
    45. raysiemens
    46. reneaudet
    47. rogerosborne
    48. shannonlucky
    49. silviarussell
    50. simonmahony
    51. sophiahoosein
    52. stefansinclair
    53. stephanieschlitz
    54. susangreenberg
    55. taraandrews
    56. thomascrombez
    57. torstenschassan
    58. vikazafrin
    59. violalasmana
    60. willardmccarty
    61. williamallen
    62. williamturkel
    63. yinliu
  • Group2
    1. aligrotkowski
    2. ariefwidodo
    3. calenhenry
    4. caraleitch
    5. charlestravis
    6. ericforcier
    7. geoffreyrockwell
    8. jolanda-pieta
    9. juliaflanders
    10. lauraweakly
    11. margaretgraham
    12. matthewcarlos
    13. milenaradzikowska
    14. nt2lab
    15. paolosordi
    16. peterorganisciak
    17. rudymcdaniel
    18. sarasteger
    19. sharongoetz
    20. stanruecker
    21. stevenhayes
    22. victoriasmith
  • Group3
    1. alunedwards
    2. hopegreenberg
    3. katherineharris
    4. stephanethibault
    5. teresadobson
  • Group4
    1. amandavisconti
    2. cyrilbriquet
    3. federicomeschini
    4. jamessmith
    5. joriszundert
    6. martinholmes
    7. rafaelalvarado
    8. sperberg-mcqueen
    9. stephenramsay
    10. waynegraham
  • Group5
    1. bethanynowviskie
    2. ellimylonas
    3. franzfischer
    4. jasonboyd
    5. jeremyboggs
    6. johnwall
    7. josephgilbert
    8. kai-christianbruhn
    9. kathrynbarre
    10. kellyjohnston
    11. kirstenuszkalo
    12. lauramandell
    13. louburnard
    14. paulspence
    15. paulyoungman
    16. pouyllau
    17. shawnday