Unigrams, and bigrams, and trigrams, oh my

I’ve been watching the ngram flurry online, on Twitter, and on various email lists over the last couple of days. Though I think there is great stuff to be learned from Google’s ngram viewer, I’m advising colleagues to exercise restraint and caution. First, we still have a lot to learn about what can and cannot be said, reliably, with this kind of data–especially in terms of “culture.” And second, the eye-candy charts can be deceiving, especially if the data is not analyzed in terms of statistical significance.

It’s not my intention here to be a “nay-sayer” or a wet blanket; as I said, there is much to learn from the Google data, and I too have had fun playing with the ngram viewer. That said, here are a few things that concern me.

  1. We have no metadata about the texts that are being queried. This is a huge problem. Take the “English Fiction” corpus, for example. What kinds of texts does it contain? Poetry, Drama, Novels, Short Stories, etc.? From what countries do these works originate? Is there an even distribution of author genders? Is the sample biased toward a particular genre? What is the distribution of texts over time? At least this last one we can get by downloading the Google data.
  2. There are lots of “forces” at work on patterns of ngram usage, and without access to the metadata, it will be hard to draw meaningful conclusions about what any of these charts actually mean. To call these charts representations of “culture” is, I think, a dangerous move. Even at this scale, we cannot assume the corpus is representative of culture–it may be, but we just don’t know. More than likely the corpus is something quite other than representative of culture. It probably represents the collection practices of major research libraries. Again, without the metadata to tell us what these texts are and where they are from, we must be awfully careful about drawing conclusions that reach beyond the scope of the corpus. The leap from corpus to culture is a big one.
  3. And then there is the problem of “linguistic drift,” a phenomenon mathematically analogous to genetic drift in evolution. In simple terms, some share of the change observed in ngram frequency over time is probably the result of what can be thought of as random mutations. An excellent article about this process is “Words as alleles: connecting language evolution with Bayesian learners to models of genetic drift.”
  4. Data noise and bad OCR. Ted Underwood has done a fantastic job of identifying some problems related to the 18th-century long s. It’s a big problem, especially if users aren’t ready to deal with it by substituting s’s for f’s. But the long s problem is fairly easy to deal with compared to other types of OCR problems–especially cases where the erroneously OCR’ed word spells another word that is correct: e.g. “fame” and “same”. But even these we can live with at some level. I have made the argument over and over again that at a certain scale these errors become less important, but not unimportant. That is, of course, if the errors are only short-term aberrations, “blips,” and not long-term consistencies. Having spent a good many years looking at bad OCR, I thought it might be interesting to type in a few random character sequences and see what the ngram viewer would show. The first graph below plots the usage of “asdf” over time. Wow, how do we account for the spike in usage of “asdf” in the 1920s and again in the late 1990s? And what about the seemingly cyclical pattern of rising and falling over time? (HINT: Check the y-axis.)


    And here’s another chart comparing the usage of “asdf” to “qwer.”

    And there are any number of these random character sequences. At my request, my three-year-old made up and typed in “asdv”, “mlik”, “puas”, “puase”, “pux”–all of these “ngrams” showed up in the data, and some of them had tantalizing patterns of usage. My daughter’s typing away on my laptop reminded me of Borges’ Library of Babel as well as the old story about how a dozen monkeys typing at random will eventually write all of the great works of literature. It would seem that at least a few of the non-canonical primate masterpieces found their way into Google’s Library of Babel.

  5. And then there is the legitimate data in the data that we don’t really care about–title pages and library bookplates, for example. After running a Named Entity Extraction algorithm over 2,600 novels from the Internet Archive’s 19th-century fiction collection, I was surprised to see the popularity of “Illinois.” It was one of the most common place names. Turns out that is because all these books came from the University of Illinois and all contained this information on the first page of the scans. It was not because 19th-century authors were all writing about the Land of Lincoln. Follow this link to get a sense of the role that the partner libraries may be playing in the ngram data: Libraries in the Google Data

    In other words, it is possible that a lot of the words in the data are not words we actually want in the data. Would it be fair, for example, to say that this chart of the word “Library” in fiction is an accurate representation of the interest in libraries in our literary culture? Certainly not. Nor is this chart for the word University an accurate representation of the importance of universities in our literary culture.
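The long-s problem from item 4 above is at least mechanically tractable. Here is a minimal Python sketch of the usual dictionary-check repair (my own illustration, not Underwood’s method; the tiny WORDLIST is a toy stand-in for a real dictionary): if an OCR’ed token is unknown, try swapping f’s for s’s and keep the first variant the dictionary recognizes.

```python
# Minimal long-s repair sketch: if a token is not a known word, try every
# combination of f->s substitutions and accept the first dictionary hit.
# WORDLIST is a toy stand-in for a real dictionary (an assumption).
from itertools import combinations

WORDLIST = {"said", "some", "last", "must", "press", "same", "fame"}

def repair_long_s(token):
    """Return a plausible long-s correction for token, or token unchanged."""
    if token in WORDLIST:
        return token
    f_positions = [i for i, ch in enumerate(token) if ch == "f"]
    for r in range(1, len(f_positions) + 1):
        for combo in combinations(f_positions, r):
            chars = list(token)
            for i in combo:
                chars[i] = "s"
            candidate = "".join(chars)
            if candidate in WORDLIST:
                return candidate
    return token

print(repair_long_s("faid"))  # unknown token; "said" is a dictionary word
print(repair_long_s("fame"))  # already a real word, so it passes untouched
```

Note that “fame” sails through unchanged even when “same” was intended: exactly the class of error that no dictionary check can catch.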

So, these are some problems; some are big and some are small.

Still, I’m all for moving ahead and “playing” with the Google data. But we must not be seduced by the graphs or by the notion that this data is quantitative and therefore accurate, precise, objective, representative, etc. What Google has given us with the ngram viewer is a very sophisticated toy, and we must be cautious in using the toy as a tool. The graphs are incredibly seductive, but peaks and valleys must be understood both in terms of the corpus from which they are harvested and in terms of statistical significance (and those light-grey percentages listed on the y-axis).


This month a group of researchers at Stanford, University of Illinois, University of Maryland, and George Mason were awarded a $790,000 grant from the Mellon Foundation to advance the prior work of the SEASR project. I’ll be serving as the overall Project Director and as one of the researchers in the Stanford component of the grant. In this phase of the SEASR project, we will focus on leveraging the existing SEASR infrastructure in support of four “use cases.” But “use case” hardly describes the research-intensive nature of the proposed work, nor does it capture the strongly humanistic bias of the work proposed. Each partner has committed to a specific research project, and each has the express goal of advancing humanities research and publishing the results. I’d like to emphasize this point about advancing humanities research.

This grant represents an important step beyond the tool-building, QA, and UI-testing stages of software development. All too often, our digital humanities projects devote a great deal of time, money, and labor to infrastructure and prototyping, only to have the results languish in the great sea of hammers without a nail. Sure, a few journeymen carpenters stick these tools in their belts and hammer away, but more frequently the resources sit around gathering dust while humanities research marches on in the time-tested modes with which we are most familiar.

Of course, I don’t mean this to be a criticism of the tool builders or the tools built. The TAPOR project, for example, offers many useful text analysis widgets, and I frequently send my colleagues and students there for quick and dirty text analysis. And just last month I had occasion to use and cite Stefan Sinclair’s Voyeur application. I was thrilled to have Voyeur at my fingertips; it provided a quick and easy way to do exactly what I wanted.

But often, the analytic tasks involved in our projects are multifaceted and cannot be addressed by any one tool. Instead, these projects involve “flows” in which our “humanistic” data travels through a series of analytic “filters” and comes out on the other end in some altered form. The TAPOR project attempts to be a virtual text analysis “workbench” on which the craftsman can slide a project around from one tool to the next. This model works well for smallish projects but is not robust enough for large-scale projects and, despite some significant interface improvements over the years, remains, for me at least, a bit clunky. I find it great for quick tasks with one or two texts, but inefficient for processing multiple texts or multiple processes. Part of the TAPOR mission was to develop a suite of tools that could be used by the average, ordinary humanist: which is to say, the humanist without any real technical chops. It succeeds on that front to be sure.

SEASR offers an alternative approach and what it provides in terms of processing power and computational elegance it gives up in terms of ease of use and transparency. The SEASR “interface” is one that involves constructing modular “workflows” in which each module corresponds to some computational task. These modules are linked together such that one process feeds into the next and the business of “sliding” a project around from one tool to another on the virtual workbench is taken over by the workflow manager.

In this grant we have specifically deemphasized UI development in favor of output, in favor of “results” in the humanities sense of the word. As we write in the proposal, “The main emphasis of the project will be on developing, coordinating, and investigating the research questions posed by the participating humanities scholars.” The scholars in this project include Franco Moretti and me at Stanford, Dan Cohen at GMU, Tanya Clement at the University of Maryland, and Ted Underwood and John Unsworth, both of UIUC. On the technical end, we have Michael Welge and Loretta Auvil of the Automated Learning Group at the National Center for Supercomputing Applications.

As the project gets rolling, I will have more to post about the specific research questions we are each addressing and the ongoing results of our work. . .

On Collaboration

I’ve been hearing a lot about “collaboration,” especially in the digital humanities. Lisa Spiro at Rice University has written a very informative post about Collaborative Authorship in the Humanities as well as another post providing Examples of Collaborative Digital Humanities Projects. Both of these posts are worth reading, and Spiro offers some well-thought-out and well-researched perspectives.

My own experiences with collaboration include both research and authorship. I have seen firsthand how fruitful collaboration, especially interdisciplinary collaboration, can be. It is safe to say that I’m a believer. In fact, the course I have been teaching for the last two years, Literary Studies and the Digital Library, is designed entirely around collaborative research projects. And yet I have to say that I am entirely suspicious of the current rage for “collaboration.”

No doubt the current popularity of collaboration, at least in the humanities, is a natural extension of the movement toward interdisciplinary studies. Though collaboration with people outside our individual disciplines has led to fruitful work, there seems to be an unnatural desire on the part of some administrators and even some colleagues to “foster collaboration,” as if collaboration were something that occurs in a petri dish, something that needs only to be “fostered” in order to evolve.

But collaboration does not arise out of a petri dish, it arises out of need. Sure, there are serendipitous collaborations that arise out of proximity: X bumps into Y at the water cooler and they get to talking . . . but more often successful collaboration arises out of need: X wishes to investigate a topic but requires the skills of Y in order to do a good job.

Failed collaborations, on the other hand, are all too often the result of well-intentioned but overly forced attempts to bring people together. I attended a seminar a couple of years ago on the subject of “fostering collaboration in the humanities.” The organizers of the meeting certainly understood the promise of new knowledge that might be derived through interaction, but they entirely miscalculated when it came to individual motivation to collaborate. It’s a classic case of putting the cart before the horse. In my experience, fruitful collaboration evolves organically and is motivated by the underlying research questions, questions that are always too big and too complex to be addressed by a single researcher.

Auto Converting Project Gutenberg Text to TEI

Those who do corpus-level computational text analysis are always hungry for more and more texts to analyze. Though we’ve become adept at locating texts from a wide range of sources (our own institutional repositories as well as a number of other places including Google Books, the Internet Archive, and Project Gutenberg), we still face a number of preprocessing tasks to bring those various files into some standard format. The texts found at these resources are not always in a format friendly to the tools we use for processing them. For example, I’ve developed lots of processing scripts that are designed to leverage the metadata that is frequently encoded in TEI-based XML. A text from Project Gutenberg, however, is not only plain text but also has a lot of boilerplate at the beginning and end of each file that needs to be removed prior to text analysis.

I’m currently building a corpus of 19th-century novels and discovered that many of the texts I would like to include have already been digitized by Project Gutenberg. This, of course, was great news. But the system I have developed for ingesting texts into my corpus assumes that the texts will all be in TEI-XML with markup indicating such important things as “author,” “title,” and “date” of publication. I downloaded about 100 novels and was about to begin opening them up one by one and adding the metadata. . . eek! I quickly realized the mundanity of the task and thought, “hmm, I bet someone has written a nice regex script for doing this sort of thing.” A quick trawl of the web led me to the web page of Michiel Overtoom, who had developed some Python scripts for downloading and cleaning up (“beautifying,” in his language) Dutch Gutenberg texts for his eBook reader. Overtoom’s process is mainly designed to strip out the boilerplate and then rename the files with naming conventions that reflect the author and title of the books.

With Overtoom’s script as a base, I reengineered the code to convert a Gutenberg text into a minimally encoded, TEI-compliant XML file. The script builds a teiHeader that includes the author and title of the work (unfortunately, Project Gutenberg texts do not include publication dates–why?) and then adds the “text,” “body,” “div,” and all the “p” tags. The final result is a document that meets basic TEI requirements. The script is copied below, but since the all-important Python spacing may be destroyed by this posting, it’s better to download it here and then change the file extension from “.txt” to “.py”. Enjoy!
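For readers who just want the gist, here is a compressed sketch of the same idea (an approximation, not the actual script; the Gutenberg marker strings and the sample metadata are assumptions): strip everything outside the START/END boilerplate markers, split the remainder into paragraphs on blank lines, and wrap the result in a bare-bones teiHeader/text/body/div structure.

```python
# Compressed sketch of the conversion: strip Project Gutenberg boilerplate
# and wrap the remainder in a minimal TEI document. The START/END marker
# strings and the sample metadata below are assumptions.
import re
from xml.sax.saxutils import escape

def gutenberg_to_tei(raw, author, title):
    """Convert Gutenberg plain text to a minimally encoded TEI string."""
    # Keep only the text between the standard Gutenberg START/END markers.
    m = re.search(r"\*\*\* ?START OF.*?\*\*\*(.*)\*\*\* ?END OF", raw, re.S)
    body = m.group(1) if m else raw
    # Blank lines delimit paragraphs in Gutenberg plain text.
    paras = [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]
    p_tags = "\n".join(f"<p>{escape(p)}</p>" for p in paras)
    return (f"<TEI><teiHeader><fileDesc><titleStmt>"
            f"<title>{escape(title)}</title><author>{escape(author)}</author>"
            f"</titleStmt></fileDesc></teiHeader>"
            f"<text><body><div>{p_tags}</div></body></text></TEI>")

sample = ("*** START OF THE PROJECT GUTENBERG EBOOK ***\n"
          "It was the best of times.\n\n"
          "It was the worst of times.\n"
          "*** END OF THE PROJECT GUTENBERG EBOOK ***")
print(gutenberg_to_tei(sample, "Dickens, Charles", "A Tale of Two Cities"))
```

A full teiHeader would also carry a publicationStmt and sourceDesc; this sketch stops at the author/title metadata the post describes.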

Panning for Memes

Over in the English Department Literature Lab, we have been experimenting with Topic Modeling as a means of discovering latent themes (aka topics) in a corpus of 19th-century novels. Topic Modeling is an unsupervised machine learning process that employs Latent Dirichlet allocation, which, as the standard description has it, “posits that each document is a mixture of a small number of topics and that each word’s creation is attributable to one of the document’s topics.”

We’ve been experimenting using the Java-based MAchine Learning for LanguagE Toolkit (MALLET) from UMass Amherst and a corpus of British and American novels from the 19th century. In one experiment we ran the topic modeler over just the British corpus, in another over just the American corpus. But when we combined the two collections and ran the model over the whole corpus, we discovered that certain topics showed up in only one corpus or the other. For example, one solely American topic was composed of words related to slavery and words written in southern dialect. And there was a strictly British topic clearly indicative of royalty and aristocracy: words such as “lord,” “king,” “duke,” “sir,” “lady.” This was an interesting result, and not simply because it provides a quantitative way of distinguishing topics or themes that are distinct to one nation or another, but also because the topics themselves could be read and interpreted in context.

More interesting for me, however, were two topics that appeared in both corpora. The first, which appeared more often in the British corpus, was related to “soldiering.” A second topic, which was more common in the American corpus, had to do with Indian wars. The “soldiering” topic was composed of the following words:

“men,” “general,” “captain,” “colonel,” “army,” “horse,” “sir,” “enemy,” “soldier,” “battle,” “day,” “war,” “officer,” “great,” “country,” “house,” “time,” “head,” “left,” “road,” “british,” “soldiers,” “washington,” “night,” “fire,” “father,” “officers,” “heard,” “moment.”

The Indians topic included:

“indian,” “men,” “indians,” “great,” “time,” “chief,” “river,” “party,” “red,” “white,” “place,” “savages,” “woods,” “day,” “side,” “fire,” “war,” “savage,” “water,” “canoe,” “rifle,” “people,” “warriors,” “returned,” “feet,” “friends,” “tree,” “night,” “distance.”

What was most fascinating, however, was that when the soldiering topic was found in the American corpus it usually had to do with Indians, and when the Indian topic appeared in the British corpus it was almost completely in the context of the Irish! As an Irish-Studies scholar who wrote a thesis on the role of the American West in Irish and Irish-American literature, I found this an incredibly rich discovery. The literature of the Irish and the Irish Diaspora is filled with comparisons between the Irish situation vis-à-vis the British and the Native American situation vis-à-vis what one Irish-American author described as the “Tide of Empire.”

Readers wishing to follow this line of comparison in some more contemporary works might want to have a look at Joyce’s short story “An Encounter,” Flann O’Brien’s book At Swim-Two-Birds, Paul Muldoon’s Madoc, and Patrick McCabe’s The Butcher Boy.

What is a Literature Lab: Not Grunts and Dullards

Yesterday’s Chronicle of Higher Education ran an article by Marc Parry about the work we are doing here in our new Literature Lab with “big data.” It’s awfully nice to be compared to Lewis and Clark exploring the frontiers of literary scholarship, but I think the article fails to give due credit to the exceptional group of students who have been working with Franco Moretti and me in the Lab. Far from being the lab “grunts” that Parry calls them, these students are the lifeblood of the lab, and the projects we are working on spring from their ideas and their passion for literature.

I understand how such a confusion of roles could happen: in the science lab, students often work *under* a faculty member who has received a grant to pursue a particular line of research. Indeed, the funding of grad students in the sciences is often based on this model; the grant pays for the work they do in the lab.

Our literature lab is nothing at all like this; it is a far more egalitarian enterprise, and there are no monetary incentives for the students. Instead, the motivation for the students is pure research, the opportunity to push the envelope and experience the excitement of discovering new territory. Yes, Moretti and I serve as guides, advisors, and mentors in this process, but it is important to emphasize the truly collaborative nature of the enterprise.

Moretti and I do not have the answers, nor do we necessarily make up the questions. In the case of the most recent work, in fact, all the questions have come directly from the students themselves. A recent article by Amanda Chang and Corrie Goldman (“Stanford Students Use Digital Tools to Analyze Classic Texts“) captures the students’ role quite accurately. Our lab is based on the idea that any good question deserves to be pursued. Students or faculty may pose those questions, and teams evolve organically based on interest in the questions. The result, of course, is that we *all* learn a lot.

As a teenager, one of my favorite films was The Magnificent Seven, a remake of the Seven Samurai with gunslingers played by Yul Brynner, Steve McQueen, and other marquee names. The basic idea is that these gunslingers hire out to protect a small town being ravaged by bad guys. The excitement of the film comes as Brynner assembles his team of seven gunslingers, each with a special talent. They then train the local residents to defend themselves, and as they do, the villagers and the gunmen develop a deep sense of respect and admiration for each other.

Working with this group of students in the lab has been a similar experience (without the guns). The work quite literally could not be done without the team and each member of the team has brought a unique talent to the project. One student in the group is an accomplished coder, another has read most every key book in the corpus, another has a penchant for math, another loves research. They are the magnificent seven, and I have never had the pleasure of working with a more talented group: yes, they are students but for me they are already colleagues. I trust their judgements and have a profound respect, and sometimes awe, for what they already know.

Parry’s article contains bits and pieces of an interview he conducted with Yale Professor Katie Trumpener. Speaking of our work and of Moretti’s notion of “distant reading,” Trumpener apparently said the following:

But what happens when his “dullard” descendants take up “distant reading” for their research?

“If the whole field did that, that would be a disaster,” she says, one that could yield a slew of insignificant numbers with “jumped-up claims about what they mean.”

“Dullard”? Really? I do hope that Ms. Trumpener’s comment was somehow taken out of context here and that she will very quickly write to the Chronicle to set the record straight. Otherwise I fear that some less forgiving souls might conclude that Ms. Trumpener is one herself. . .

UPDATE: Over the weekend I received a clarification via email. Ms. Trumpener writes: “I was referring to Moretti’s potential methodological “descendants”–i.e., those coming after him, even long after him, not his current team-mates.” Ms. Trumpener notes that when she was interviewed the discussion was not about our Literature Lab at Stanford, but about Moretti’s approaches and general matters of how she and Moretti approach the study of literature. Her comments were made in that context and not in the context of a discussion of the current work of the lab.

Stalker (R) and the journey of the Jockers iPhone

Lots of hoopla in the last few days over the discovery that the iPhone keeps a database of locations it has traveled. It wasn’t long before someone in the R community figured out how to tap into this file, and with a mere two lines of code you can visualize where your phone has been on a map.

The code library comes compliments of Drew Conway over on the r-bloggers page. I installed the code and within a few seconds had several maps of my recent travels. I attach two images below (don’t tell my mom).
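For the curious, the underlying trick is ordinary SQL against a local SQLite file. A hedged Python equivalent of the read step might look like the following; the table and column names (CellLocation, Latitude, Longitude) are my assumptions about the file’s schema, not something verified here:

```python
# Hedged sketch of the read step in Python (the original used R). The table
# and column names below are assumptions about consolidated.db's schema.
import sqlite3

def read_locations(db_path):
    """Return (latitude, longitude) pairs from a copy of consolidated.db."""
    with sqlite3.connect(db_path) as conn:
        return conn.execute(
            "SELECT Latitude, Longitude FROM CellLocation").fetchall()
```

From there, any plotting library can put the points on a map, which is essentially what Conway’s two lines of R do.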

Digital Humanities: Methodology and Questions

Students in our new Literature Lab doing what English Majors do!


Folks keep expressing concern about the future of the humanities and the “need” for a next big thing. In fact, the title of a blog entry in the April 23, 2010 New York Times takes it for granted that the humanities need “saving.” The blog entry is a follow-up to an article from March 31, which explores how some literary critics are applying scientific methodologies and approaches to the study of literature. Of course, this isn’t really new. One only needs to read a few back issues of Literary and Linguistic Computing to know that we’ve been doing this kind of work for a long time (and even longer if one wants to consider the approaches suggested by the Russian Formalists). What is new is that the mainstream humanities and the mainstream press are taking notice.

In her response to the article, Blakey Vermeule (full disclosure, her office is just three doors from mine) makes a key point to take away from the discussion. She writes: “The theory wars are long gone and nobody regrets their passing. What has replaced them is just what was there all along: research and scholarship, but with a new openness to scientific ideas and methods” (emphasis mine). Before explaining why this is the key take-away, a little story. . .

Not too long ago, a colleague took me aside and asked in all earnestness, “What do I need to do to break into this field of Digital Humanities?” This struck me as a rather odd question to ask: a very clear putting of the cart before the horse. Digital Humanities (DH) is a wide-stretching umbrella term that attempts to encompass and describe everything from new media theory and gaming to computational text analysis and digital archiving. There is a lot of room under the umbrella, and it really is, therefore, impossible to think of DH as a unified “field” that one can break into. In fact, there are no barriers to entry at all; the doors are wide open, come on in.

But I’ll go out on a limb and argue that the DH community can be split into two primary groups: Group “A” is composed of researchers who study digital objects; Group “B” is composed of researchers who utilize digital tools to study objects (digital or otherwise). Group A, I would argue, is primarily concerned with theoretical matters and Group B with methodological matters. In reality, of course, the lines are blurry, but this is a workable classification.

. . . What Vermeule and others are describing in the New York Times piece falls most cleanly into Group B. But I would not describe this movement toward empirical methodologies as revolutionary: when interested in certain types of questions, an empirical methodology just makes good common sense. I came to utilize computation in my research not because the siren’s song of revolution was tempting me away from my dusty, tired, and antiquated approaches to literature. Rather, computational tools and statistical methods simply offered a way of asking and exploring the questions that I (and others such as those pictured above) have about the literary field. What has changed is not the object of study but the nature of the questions.

So, the answer to my colleague who asked what is needed to “break into this field of Digital Humanities” is simply this: questions, you need questions.

Who’s Your DH Blog Mate: Match-Making the Day of DH Bloggers with Topic Modeling

Social networking for digital humanities nerds? Which DH bloggers are you most compatible with? Let’s get the right nerds with the right nerds–matchmaking made in digital humanities heaven.

After seeing Stefan Sinclair’s Voyeuristic analysis of the Day of DH blog posts, I wrote and asked him how to get access to the “corpus” of posts. He hooked me up, and I pre-processed the data with a few PHP scripts, then ran an LDA topic modeling process, and then did some more post-processing with R in order to see the most important themes of the day and also to cluster the 117 bloggers based on their thematic similarity.

So, here’s the what and then the how. As for the why? Why not?


117 Day of DH Bloggers

10 Unsupervised Topics (10 is arbitrary–I could have picked 100). These topics are generated by an analysis of the words and word sequences in the individual bloggers’ sites. The purpose is to harvest out the most prominent “themes” or topics. These themes are presented in a series of word lists. It is up to the researcher to then “label” the word clusters. I have labeled a few of them (in [brackets] at the beginning of the word lists below–you might use another label–this is the subjective part). Here they are:

  1. [human interaction in DH] work today people time working things email year week days bit good meeting tomorrow
  2. day thing mail dh de image based fact called things change ago encoding house
  3. [Academic Writing–including Grants] day time dh start post blog proposal google write great posts lunch nice articles
  4. [Digital publishing and archives] http talk future collection making online version publishing field morning life traditional daily large
  5. conference university blog morning read internet access couple computers archive involved including great written
  6. [DH Teaching] students dh teaching humanities class technology scholars university lab group library support scholarship student
  7. [DH Projects] digital project humanities work projects room meeting collections office building task database spent st
  8. data project xml working projects web interesting user set spend system ways couple time
  9. digital day humanities media writing post computing twitter english humanist real phd web rest
  10. [reading and text-analysis] book text tools software books today reading literary texts coffee edition search tool textual

Unfortunately, the Day of DH corpus isn’t truly big enough to get the sort of crystal clear topics that I have harvested from much larger collections, but still, the topics above, seen in aggregate, do give us a sense of what’s “hot” to talk about in the field.

But let’s get to the sexy part. . .

In addition to harvesting out the prominent topics, the modeling tool outputs data indicating how much (what proportion) of each blog is about each topic. The resulting matrix is of dimension 117×10 (117 blogs and 10 topics). The data in the cells are percentages for each topic in each author’s blog. The values in each row add up to 100%. With a little massaging in R, I read in the matrix and then use some simple distance and clustering functions to group the bloggers into 10 (again an arbitrary number) groups, groups based on shared themes. Using this data, I then output a matrix showing which authors have the most in common; thus, I do a little subtle match-making in advance of our digital rendezvous in London–birds of a feather blog together?
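The clustering step itself is small. Here is a Python sketch of the equivalent operation (the original used R’s distance and clustering functions; scipy’s hierarchical clustering is an assumed stand-in, and the matrix below is random placeholder data rather than the real 117×10 doc-topic matrix):

```python
# Sketch of the grouping step in Python (an assumed equivalent of R's
# dist()/hclust()/cutree()). Rows are bloggers, columns are topic
# proportions; hierarchical clustering cuts the 117 rows into 10 groups.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
doc_topics = rng.random((117, 10))                   # placeholder data
doc_topics /= doc_topics.sum(axis=1, keepdims=True)  # rows sum to 100%

Z = linkage(doc_topics, method="ward")               # ~ hclust(dist(m))
groups = fcluster(Z, t=10, criterion="maxclust")     # ~ cutree(tree, k = 10)

for g in range(1, 11):
    print(f"Group{g}: {int((groups == g).sum())} bloggers")
```

Swapping the random matrix for the real doc-topic proportions yields groupings like the ones listed below.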

Here are the groups:

  • Group1
    1. aimeemorrison
    2. ariefwidodo
    3. barbarabordalejo
    4. caraleitch
    5. carlosmartinez
    6. carlwhithaus
    7. clairewarwick
    8. craigharkema
    9. ellimylonas
    10. geoffreyrockwell
    11. glenworthey
    12. guydaarmstrong
    13. henrietterouedcunliffe
    14. ianjohnson
    15. janrybicki
    16. jenterysayers
    17. jonbath
    18. juliaflanders
    19. juliannenyhan
    20. justinerichardson
    21. kai-christianbruhn
    22. kathleenfitzpatrick
    23. keithlawson
    24. lauramandell
    25. lauraweakly
    26. malterehbein
    27. matthewjockers
    28. meganmeredith-lobay
    29. melissaterras
    30. milenaradzikowska
    31. miranhladnik
    32. patricksahle
    33. paulspence
    34. peterrobinson
    35. pouyllau
    36. rafaelalvarado
    37. raysiemens
    38. reneaudet
    39. rogerosborne
    40. rudymcdaniel
    41. stanruecker
    42. stephanieschlitz
    43. susangreenberg
    44. victoriasmith
    45. vikazafrin
    46. williamturkel
  • Group2
    1. alejandrogiacometti
    2. annacaprarelli
    3. danasolomon
    4. ernestopriego
    5. karensmith
    6. leedurbin
    7. matthewcarlos
    8. paolosordi
    9. sarasteger
    10. stephanethibault
    11. yinliu
  • Group3
    1. alialbarran
    2. amandagailey
    3. cyrilbriquet
    4. federicomeschini
    5. ntlab
    6. stefansinclair
    7. torstenschassan
  • Group4
    1. aligrotkowski
    2. ashtonnichols
    3. calenhenry
    4. devonfitzgerald
    5. enricasalvatori
    6. ericforcier
    7. garrywong
    8. jameschartrand
    9. joelyuvienco
    10. johnnewman
    11. peterorganisciak
    12. shannonlucky
    13. silviarussell
    14. simonmahony
    15. sophiahoosein
    16. stevenhayes
    17. taraandrews
    18. violalasmana
    19. willardmccarty
  • Group5
    1. alunedwards
    2. hopegreenberg
    3. lewisulman
  • Group6
    1. amandavisconti
    2. jamessmith
    3. martinholmes
    4. sperberg-mcqueen
    5. waynegraham
  • Group7
    1. bethanynowviskie
    2. josephgilbert
    3. katherineharris
    4. kellyjohnston
    5. kirstenuszkalo
    6. margaretgraham
    7. matthewgold
    8. paulyoungman
  • Group8
    1. charlestravis
    2. craigbellamy
    3. franzfischer
    4. jeremyboggs
    5. johnwall
    6. kathrynbarre
    7. shawnday
    8. teresadobson
  • Group9
    1. jasonboyd
    2. jolanda-pieta
    3. joriszundert
    4. michaelmaguire
    5. thomascrombez
    6. williamallen
  • Group10
    1. louburnard
    2. nevenjovanovic
    3. sharongoetz
    4. stephenramsay

Twitterers @sramsay and @mattwilkens were poking around here today and wondered what the topics would look like with only five topics and five clusters instead of ten and ten. Here are the topics:

  1. data work time text working tools people thing system xml mail software things texts
  2. day time morning lot work bit find web class teaching student days dh real
  3. digital humanities day tomorrow book twitter university blog computing reading books writing tei emails
  4. day dh today time post things write start online writing working computer year hours
  5. project digital work projects students meeting today people humanities dh scholars library year lab

And here are the Blogger-Mates clusters when I set n=5:

  • Group1
    1. aimeemorrison
    2. alejandrogiacometti
    3. alialbarran
    4. amandagailey
    5. annacaprarelli
    6. ashtonnichols
    7. barbarabordalejo
    8. carlosmartinez
    9. carlwhithaus
    10. clairewarwick
    11. craigbellamy
    12. craigharkema
    13. danasolomon
    14. devonfitzgerald
    15. enricasalvatori
    16. ernestopriego
    17. garrywong
    18. glenworthey
    19. guydaarmstrong
    20. henrietterouedcunliffe
    21. ianjohnson
    22. jameschartrand
    23. janrybicki
    24. jenterysayers
    25. joelyuvienco
    26. johnnewman
    27. jonbath
    28. juliannenyhan
    29. justinerichardson
    30. karensmith
    31. kathleenfitzpatrick
    32. keithlawson
    33. leedurbin
    34. lewisulman
    35. malterehbein
    36. matthewgold
    37. matthewjockers
    38. meganmeredith-lobay
    39. melissaterras
    40. michaelmaguire
    41. miranhladnik
    42. nevenjovanovic
    43. patricksahle
    44. peterrobinson
    45. raysiemens
    46. reneaudet
    47. rogerosborne
    48. shannonlucky
    49. silviarussell
    50. simonmahony
    51. sophiahoosein
    52. stefansinclair
    53. stephanieschlitz
    54. susangreenberg
    55. taraandrews
    56. thomascrombez
    57. torstenschassan
    58. vikazafrin
    59. violalasmana
    60. willardmccarty
    61. williamallen
    62. williamturkel
    63. yinliu
  • Group2
    1. aligrotkowski
    2. ariefwidodo
    3. calenhenry
    4. caraleitch
    5. charlestravis
    6. ericforcier
    7. geoffreyrockwell
    8. jolanda-pieta
    9. juliaflanders
    10. lauraweakly
    11. margaretgraham
    12. matthewcarlos
    13. milenaradzikowska
    14. nt2lab
    15. paolosordi
    16. peterorganisciak
    17. rudymcdaniel
    18. sarasteger
    19. sharongoetz
    20. stanruecker
    21. stevenhayes
    22. victoriasmith
  • Group3
    1. alunedwards
    2. hopegreenberg
    3. katherineharris
    4. stephanethibault
    5. teresadobson
  • Group4
    1. amandavisconti
    2. cyrilbriquet
    3. federicomeschini
    4. jamessmith
    5. joriszundert
    6. martinholmes
    7. rafaelalvarado
    8. sperberg-mcqueen
    9. stephenramsay
    10. waynegraham
  • Group5
    1. bethanynowviskie
    2. ellimylonas
    3. franzfischer
    4. jasonboyd
    5. jeremyboggs
    6. johnwall
    7. josephgilbert
    8. kai-christianbruhn
    9. kathrynbarre
    10. kellyjohnston
    11. kirstenuszkalo
    12. lauramandell
    13. louburnard
    14. paulspence
    15. paulyoungman
    16. pouyllau
    17. shawnday

Analyze This (Page)

“TAToo” is a fun Flash widget developed by Peter Organisciak at the University of Alberta. Peter works under the supervision of Digital Humanists Par Excellence and TAPoR Gurus Geoffrey Rockwell and Stan Ruecker. The widget (just some embeddable code) does “layman’s” text analysis on the web pages in which its code is embedded. I’ve added the code to the right sidebar of my blog to take it for a test drive.

The tool offers several “views” of your text. The default view is a word cloud in which more frequent words appear larger and bolder. Looking at the word cloud can give you a pretty quick sense of the page’s key terms and concepts.

By clicking on the “Tool:” bar at the top of the widget, you can select other options. The “List Words” view provides a term frequency list. You can then click on any word in the list to see its collocates. Alternatively, there are both a Collocate view and a Concordance view that allow users to enter a specific word and get information about the company that word keeps.
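The computations behind these views are conceptually simple. As an illustration only (TAToo itself is a Flash widget, and this is not Peter's code), here is a minimal Python sketch of term frequencies (the word cloud and “List Words” views) and collocates (words appearing within a small window of a target word):

```python
# Illustrative sketch of TAToo-style text analysis: term frequencies
# and collocates within a fixed word window. Not the widget's actual
# implementation.
from collections import Counter
import re

def term_frequencies(text):
    """Count word occurrences (the basis of a word cloud or frequency list)."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

def collocates(text, target, window=3):
    """Count words appearing within `window` words of each `target` occurrence."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = Counter()
    for i, w in enumerate(words):
        if w == target:
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            hits.update(words[lo:i] + words[i + 1:hi])
    return hits

text = "text analysis tools make text analysis of web text easy"
print(term_frequencies(text).most_common(2))  # [('text', 3), ('analysis', 2)]
print(collocates(text, "analysis", window=2))
```

A concordance view is the same windowing idea, except that it displays the surrounding words in context rather than tallying them.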

Kudos to Peter and the rest of the TAPoR Tools team for continuing to pursue the fine art of tool making.