For their final project, the students in my Introduction to Digital Humanities seminar decided to analyze narrative style in Faulkner’s The Sound and the Fury. In addition to significant off-line analysis, we are building a web-based application that allows visitors to compare the different sections of the novel to each other and also to new, unseen texts that visitors to the site can enter themselves.
To achieve this end, the web application must be able to “ingest” a text sample, tokenize it, extract data about the various features that will be used in the comparison, and then prepare (organize, tabulate) those features in a manner that will allow for statistical analysis/comparison. Since my course is not a course in statistics, we decided that I would be responsible for the number crunching.
So, while my students work out the text ingestion, tokenization, and preparation part of the project, I was tasked with figuring out how to crunch the numbers. My first instinct (not a good one) was to begin thinking about how to do the required math in php, which is the language the students are using for the rest of the project. Writing a complex statistical algorithm in php did not sound like much fun. My facility with statistics is almost entirely limited to descriptive statistics, and though I do employ more complex statistical procedures in my work, I can’t say that I fully understand the equations that underlie them. So I quickly found myself wishing for a web-based version of (or maybe an API for) the open-source stats application “R”, which I use frequently in my own work. It turns out, of course, that lots of other folks had thought of this before me, and there are all sorts of web implementations of R. This was good news, but unfortunately it wasn’t exactly what I was after. I did not want to replicate the R interface online. Instead, I wanted to harness the power of R through a user-friendly front end. After five or six hours of hammering away, I eventually got what I was after: a way to call R from within php and return the results to a web interface.
What follows is a simple step-by-step guide for replicating what I did. I offer this not as an example of something revolutionary or unprecedented (others have figured this out, and there is nothing exceptional here) but as a way of documenting, in one place, the process I discovered after scouring the R archives, the various php sites, and the brain of my more R-savvy colleague Claudia. Hopefully this will prove useful to other rookies who may want to take a stab at something similar.
The steps below outline how to set up a php web page that accesses the “R” statistics package and outputs a .jpg Cluster Dendrogram showing how the texts clustered. The steps assume that you have already developed a script that ingests and processes a user-submitted text file and then adds that data to an existing data file containing data for the “canned” texts you wish to compare the user-submitted file against. I also assume that you have php and “R” installed on your server.
It warrants noting that what I present here is not exactly how I finally implemented the solution. What I show here provides the clearest explanation of the process and works perfectly well, but it is not the “streamlined” final version of the script.
For this solution, I use four separate files:
- form.html: a simple html form with two fields. The first field allows the user to enter a name for their file (e.g. “MySample”), and the second field is a “textarea” where the user can paste the text for comparison.
- rtest.php: a result page for the form that gets executed after the user hits submit on the html form page. The php code in this file executes the R code on the server.
- rCommands.txt: a text file containing a “canned” series of R commands.
- cannedData.txt: a tab-delimited text file containing rows of data for analysis. Each row contains three columns: “textId,” “word,” and “frequency,” where textId is the name of the text sample (e.g. “Faulkner’s Benjy”) and frequency is a normalized value giving the percentage frequency of the “word” in the given text sample.
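To make the row layout concrete, here is a minimal sketch (not the code from the downloadable source files) of how a pasted text sample could be turned into rows of this shape. The function name `sampleToRows`, the lowercasing step, and the rounding to four decimal places are all assumptions made for illustration; only `str_word_count` and `array_count_values` are built-in php functions.

```php
<?php
// Hypothetical helper: turn raw text into rows shaped like cannedData.txt.
// The name, the lowercasing, and the rounding are illustrative assumptions.
function sampleToRows(string $sampleName, string $text): array
{
    // With format 1, str_word_count returns an array of the words found.
    $words = str_word_count(strtolower($text), 1);
    $total = count($words);
    if ($total === 0) {
        return [];
    }

    $rows = [];
    // array_count_values maps each word type to its raw count.
    foreach (array_count_values($words) as $word => $count) {
        // Normalize the raw count to a percentage of all tokens.
        $percent = round(100 * $count / $total, 4);
        $rows[] = "\"$sampleName\"\t\"$word\"\t\"$percent\"\n";
    }
    return $rows;
}

// Example: print the rows for a six-word sample.
echo implode('', sampleToRows('MySample', 'the dog and the cat and'));
```

Each returned string is one tab-delimited line, so the rows can be appended directly to a file that already holds the canned data.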
Now the Steps:
- The user submits a text and text name at form.html.
- The text of the user-submitted sample is processed into a term frequency array “$sampleData” using the built-in php functions “str_word_count” and “array_count_values” to compute relative word frequencies for every word type in the file.
- The contents of cannedData.txt are read into a new php variable.
- A temporary file “data.txt” is created on the server and the contents of cannedData.txt are written to “data.txt.”
- The contents of $sampleData are appended to the end of “data.txt” in the format "\"$sampleName\"\t\"$word\"\t\"$percent\"\n", where “$sampleName” is the user-entered name for the text sample, “$word” is a given word type from the sample, and “$percent” is the normalized frequency of the word in the sample. Upon completion “data.txt” is closed.
- Using the php “exec” function, the script executes the unix command “cat rCommands.txt | /usr/bin/R --vanilla”, which launches “R” and executes the R commands found inside the file “rCommands.txt”.
- “rCommands.txt” is a canned sequence of commands that loads “data.txt” into a data frame and runs the cross-tabulation function (xtabs) to create a term frequency matrix, which is then processed with the dist and hclust functions as follows:
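```r
# Load the tab-delimited data (canned texts plus the user's sample).
file <- read.csv("data.txt", header = TRUE, sep = "\t")
# Cross-tabulate into a text-by-word frequency matrix. The names in the
# formula must match the header row of data.txt.
xt <- xtabs(freq ~ bookId + word, data = file)
# Cluster the texts on the distances between their word-frequency rows.
cluster <- hclust(dist(xt))
# jpeg() sizes are in pixels by default, so state the units explicitly
# (inches are assumed here).
jpeg("plot.jpg", width = 8, height = 6, units = "in", res = 100)
par(family = "mono")
plot(cluster)
dev.off()
# Save the full distance matrix and the cross-tabulation for inspection.
distanceMatrix <- as.matrix(dist(xt, upper = TRUE, diag = TRUE))
x <- row.names(distanceMatrix)
write.table(distanceMatrix, file = "dataMatrix.txt", sep = "\t", eol = "\n", col.names = x, row.names = FALSE)
write.table(xt, file = "xt.txt", sep = "\t", eol = "\n")
```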
- The result is a .jpg file (“plot.jpg”) of the cluster dendrogram, created in the current directory.
- “plot.jpg” is then displayed with simple html (e.g. <img src="plot.jpg">) as the final step of the script.
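The file-assembly and exec steps above can be sketched roughly as follows. This is a simplified illustration rather than the contents of rtest.php: the function names `buildDataFile` and `runR` are hypothetical, the single row in the usage example is a placeholder value, and the R path is the one quoted in the steps and may differ on your server.

```php
<?php
// Hypothetical helpers sketching the file-assembly and exec steps.

// Copy the canned data into data.txt and append the user's rows.
function buildDataFile(string $cannedPath, string $dataPath, array $sampleRows): bool
{
    if (!is_readable($cannedPath)) {
        return false;
    }
    $canned = file_get_contents($cannedPath);
    if ($canned === false) {
        return false;
    }
    // Make sure the canned data ends with a newline so the first
    // appended row does not run into its last line.
    if ($canned !== '' && substr($canned, -1) !== "\n") {
        $canned .= "\n";
    }
    return file_put_contents($dataPath, $canned . implode('', $sampleRows)) !== false;
}

// Pipe rCommands.txt into R and report whether R exited cleanly.
function runR(string $rBinary = '/usr/bin/R'): bool
{
    $command = 'cat rCommands.txt | ' . escapeshellarg($rBinary) . ' --vanilla 2>&1';
    exec($command, $output, $status);
    return $status === 0;
}

// Usage, run from the directory holding cannedData.txt and rCommands.txt.
// The row below is a placeholder, not real data.
$rows = ["\"MySample\"\t\"the\"\t\"4.2\"\n"];
if (buildDataFile('cannedData.txt', 'data.txt', $rows) && runR()) {
    echo '<img src="plot.jpg">';
} else {
    echo 'The analysis could not be completed.';
}
```

One thing to keep in mind with this layout: because “data.txt” and “plot.jpg” are fixed names, two visitors submitting at the same moment would overwrite each other's files, so a busier site would probably want per-request file names.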
Readers can download all the source files here: execRSourceFiles.zip.