For their final project, the students in my Introduction to Digital Humanities seminar decided to analyze narrative style in Faulkner’s The Sound and the Fury. In addition to significant off-line analysis, we are building a web-based application that allows visitors to compare the different sections of the novel to each other and also to new, unseen texts that visitors to the site can enter themselves.

To achieve this end, the web application must be able to “ingest” a text sample, tokenize it, extract data about the various features that will be used in the comparison, and then prepare (organize, tabulate) those features in a manner that will allow for statistical analysis/comparison. Since my course is not a course in statistics, we decided that I would be responsible for the number crunching.
So, while my students work out the text ingestion, tokenization, and preparation parts of the project, I was tasked with figuring out how to crunch the numbers. My first instinct (not a good one) was to begin thinking about how to do the required math in php, which is the language the students are using for the rest of the project. Writing a complex statistical algorithm in php did not sound like much fun. My facility with statistics is almost entirely limited to descriptive statistics, and though I do employ more complex statistical procedures in my work, I can’t say that I fully understand the equations that underlie them. So I quickly found myself wishing for a web-based version of (or maybe an API for) the open-source stats application “R,” which I use frequently in my own work. It turns out, of course, that lots of other folks had thought of this before me, and there are all sorts of web implementations of R. This was good news, but it wasn’t exactly what I was after: I did not want to replicate the R interface online. Instead, I wanted to be able to use the power of R through a user-friendly front end. After five or six hours of hammering away, I got what I was after: a way to call R from within php and return the results to a web interface.
What follows here is a simple step-by-step guide for replicating what I did. I offer this not as an example of something revolutionary or unprecedented (others have figured this out, and there is nothing exceptional here) but as a way of documenting, in one place, the process I discovered after scraping the R archives, the various php sites, and the brain of my more R-savvy colleague Claudia. Hopefully this will prove useful to other rookies who may want to take a stab at something similar.
The steps below outline how to set up a php web page that accesses the “R” statistics package and outputs a .jpg Cluster Dendrogram showing how the texts clustered. The steps assume that you have already developed a script that ingests and processes a user-submitted text file and then adds that data to an existing data file containing data for the “canned” texts you wish to compare the user-submitted file against. I also assume that you have php and “R” installed on your server.
It warrants noting that what I present here is not exactly how I finally implemented the solution. What I show here provides the clearest explanation of the process and works perfectly well, but it is not the “streamlined” final version of the script.
For this solution, I use four separate files:
- form.html: a simple html form with two fields. The first field allows the user to enter a name for their file (e.g. “MySample”), and the second field is a “textarea” where the user can paste the text for comparison.
- rtest.php: a result page for the form that gets executed after the user hits submit on the html form page. The php code in this file executes the R code on the server.
- rCommands.txt: a text file containing a “canned” series of R commands.
- cannedData.txt: a tab delimited text file containing rows of data for analysis. Each row contains three columns: a text identifier, a word, and a frequency (named “bookId,” “word,” and “freq” in the header row, to match the R commands shown later). The text identifier is the name of the text sample (e.g. “Faulkner’s Benjy”), and the frequency is a normalized value giving the percentage frequency of the word in the given text sample.
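To make the layout of the data file concrete, here is a minimal sketch of what a few rows could look like and how php can read them back. The sample rows, their values, and the helper name parseRows are invented for illustration, and the header names are taken from the formula in the R commands (bookId, word, freq), which is an assumption about the actual file.

```php
<?php
// Hypothetical sample in the cannedData.txt layout: a header row, then
// one tab-delimited row per (text, word) pair. The values are made up.
$sample = "\"bookId\"\t\"word\"\t\"freq\"\n"
        . "\"Benjy\"\t\"the\"\t\"4.2\"\n"
        . "\"Benjy\"\t\"said\"\t\"2.9\"\n"
        . "\"Quentin\"\t\"the\"\t\"3.7\"\n";

// parseRows() is an illustrative helper, not part of the original script.
// It splits the text into lines and each line into its named fields.
function parseRows(string $tsv): array
{
    $lines  = explode("\n", trim($tsv));
    $header = str_getcsv(array_shift($lines), "\t", '"', "\\");
    $rows   = [];
    foreach ($lines as $line) {
        $rows[] = array_combine($header, str_getcsv($line, "\t", '"', "\\"));
    }
    return $rows;
}

$rows = parseRows($sample);
echo $rows[0]['bookId'], " ", $rows[0]['word'], " ", $rows[0]['freq'], "\n";
```

Each parsed row is an associative array keyed by the header names, which mirrors how R will see the columns once the file is loaded with a header.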
Now the Steps:
- The user submits a text and text name at form.html.
- The text of the user-submitted sample is processed into a term frequency array “$sampleData” using the built-in php functions “str_word_count” and “array_count_values” to compute term frequencies (relative word frequencies) for every word type in the file.
- The contents of cannedData.txt are read into a new php variable.
- A temporary file “data.txt” is created on the server and the contents of cannedData are written to “data.txt.”
- The contents of $sampleData are appended to the end of “data.txt” in the format "\"$sampleName\"\t\"$word\"\t\"$percent\"\n", where “$sampleName” is the user-entered name for the text sample, “$word” is a given word type from the sample, and “$percent” is the normalized frequency of the word in the sample. Upon completion “data.txt” is closed.
- Using the php “exec” function, the script executes the unix command “cat rCommands.txt | /usr/bin/R --vanilla”, which launches “R” and executes the R commands found inside the file “rCommands.txt.”
- “rCommands.txt” is a canned sequence of commands that loads “data.txt” into a data frame and runs the cross tabulation function to create a term frequency matrix that can be processed with the dist and hclust functions, as follows:

```r
# Load the tab-delimited data (canned texts plus the user's sample)
file <- read.csv("data.txt", header=TRUE, sep="\t")
# Cross-tabulate into a text-by-word frequency matrix
xt <- xtabs(freq ~ bookId+word, data=file)
# Hierarchical clustering on the distances between texts
cluster <- hclust(dist(xt))
# Draw the dendrogram to plot.jpg; units and res are added here because
# jpeg() sizes default to pixels, so width=8, height=6 alone is 8x6 px
jpeg("plot.jpg", width=8, height=6, units="in", res=100)
par(family="mono")
plot(cluster)
dev.off()
# Save the full distance matrix and the cross-tabulation for inspection
distanceMatrix <- as.matrix(dist(xt, upper=TRUE, diag=TRUE))
x <- row.names(distanceMatrix)
write.table(distanceMatrix, file="dataMatrix.txt", sep="\t", eol="\n", col.names=x, row.names=FALSE)
write.table(xt, file="xt.txt", sep="\t", eol="\n")
```

- The result is a .jpg file (“plot.jpg”) of the cluster dendrogram, created in the current directory.
- “plot.jpg” is then called in simple html (e.g. <img src="plot.jpg">) as the final step of the script.
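The php side of the steps above can be sketched roughly as follows. This is a minimal illustration and not the code from rtest.php: the helper names termFrequencies and formatRow, the lowercasing of the text, the relative file paths, and the check on the exit status are all assumptions made for the example. Only str_word_count, array_count_values, exec, and the R command line come from the description above.

```php
<?php
// Illustrative helper (not the original code): relative word frequencies,
// as percentages, for every word type in a text sample.
function termFrequencies(string $text): array
{
    // Lowercasing is an assumption; format 1 returns the list of words.
    $tokens = str_word_count(strtolower($text), 1);
    $total  = count($tokens);
    if ($total === 0) {
        return [];
    }
    $freqs = [];
    foreach (array_count_values($tokens) as $word => $count) {
        $freqs[$word] = 100 * $count / $total;
    }
    return $freqs;
}

// Illustrative helper: one output row in the quoted, tab-delimited format.
function formatRow(string $sampleName, string $word, float $percent): string
{
    return "\"$sampleName\"\t\"$word\"\t\"$percent\"\n";
}

$sampleName = 'MySample';   // would come from the form in practice
$sampleData = termFrequencies('The dog saw the cat. The cat ran.');

// Start data.txt from the canned data, then append the sample's rows.
// The relative paths are assumptions about where the files live.
if (is_readable('cannedData.txt')) {
    file_put_contents('data.txt', file_get_contents('cannedData.txt'));
    foreach ($sampleData as $word => $percent) {
        file_put_contents('data.txt', formatRow($sampleName, (string) $word, $percent), FILE_APPEND);
    }
    // Run the canned R commands; R writes plot.jpg into the current directory.
    exec('cat rCommands.txt | /usr/bin/R --vanilla', $output, $status);
    if ($status === 0) {
        echo '<img src="plot.jpg">';
    }
}
```

Keeping the frequency calculation and the row formatting in small functions makes each piece easy to check on its own before R is involved at all.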
Readers can try the demo at http://www.stanford.edu/~mjockers/cgi-bin/rtest/form.html and download all the source files here http://www.stanford.edu/~mjockers/cgi-bin/rtest/execRSourceFiles.zip. For the demo I auto-truncate the user entered file to 1000 words.
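The truncation used for the demo could be done in several ways; one minimal sketch, assuming words are simply whitespace-separated, is shown here. The helper name truncateWords and the splitting rule are assumptions, not the code used on the demo site.

```php
<?php
// Illustrative helper (name and approach are assumptions): keep only
// the first $limit words of a text, splitting on runs of whitespace.
function truncateWords(string $text, int $limit = 1000): string
{
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    return implode(' ', array_slice($words, 0, $limit));
}

echo truncateWords('one two three four five', 3), "\n"; // prints "one two three"
```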
Hi,
Very nice article. Really useful. I am trying to do something similar but I face some difficulties. I wonder if you could give me a piece of advice.
For some reason, when I try to execute this function in php:
exec("cat my_rscript.R | Applications/R ");
I don’t get anything back. But when I run the R script from the terminal, I get as a result the plot that I need.
I also tried Rscript and BATCH as follows:
exec("Rscript /Applications/MAMP/htdocs/php_test/my_rscript.R");
exec("Applications/R CMD BATCH /Applications/MAMP/htdocs/php_test/my_rscript.R");
But none of them work. I checked whether the exec function is working properly, and it seems that there is no problem.
I would really appreciate if you could help me. Thank you in advance.
Dimitris
Dimitris,
This is strange indeed. If it runs ok from the terminal and if the exec function is working for other commands, then I’d guess either one of these two things might be the problem: 1) It’s a permissions issue and php is unable to write files into the directory, or 2) the file is being written somewhere else in your system (that is, not where you were expecting it). It has been a good many years since I played with any of this, so there may be something else that I’m missing. Good Luck!
M
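Both possibilities raised in the reply above can be checked from php itself. The following is a minimal sketch, assuming a unix-like shell: the helper name runAndReport is hypothetical, and the "echo hello" command is only a stand-in for the R call being debugged.

```php
<?php
// Illustrative helper (not from the original post): run a shell command,
// capture stdout and stderr together, and report where php is running.
function runAndReport(string $command): array
{
    $output = [];
    $status = 0;
    exec($command . ' 2>&1', $output, $status);   // 2>&1 folds error messages into $output
    $cwd = getcwd();
    return [
        'status'   => $status,                     // 0 usually means success
        'output'   => $output,                     // lines printed by the command
        'cwd'      => $cwd,                        // relative output paths resolve here
        'writable' => $cwd !== false && is_writable($cwd), // false suggests a permissions problem
    ];
}

// Example with a harmless command; substitute the R call being debugged.
$report = runAndReport('echo hello');
print_r($report);
```

If the status is non-zero, the captured output usually contains the shell's or R's own error message. If the status is zero but no plot appears, the reported working directory is the first place to look for the file.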