Text Analysis with R for Students of Literature

Text Analysis with R for Students of Literature provides a practical introduction to computational text analysis using the open source programming language R. Readers begin working with text right away and each chapter works through a new technique or process such that readers gain a broad exposure to core R procedures and a basic understanding of the possibilities of computational text analysis at both the micro and macro scale. View the Book Flyer [pdf 1.4MB]


Introduction to the RStudio Programming Environment [Video].

Corrections:

Page 41 (section 4.4.3).  The sentence reading “The key difference is that lapply requires a list as a second arguments, and it requires the name of some other function as a second argument.”
should be
“The key difference is that lapply requires a list as its first argument, and it requires the name of some other function as a second argument.”
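
The argument order described in the corrected sentence can be illustrated with a minimal sketch. The list words.l and its contents are hypothetical and not taken from the book; only lapply and length are base R functions.

```r
# A hypothetical three-item list of word vectors (illustration only)
words.l <- list(
  first  = c("call", "me", "ishmael"),
  second = c("some", "years", "ago"),
  third  = c("never", "mind", "how", "long", "precisely")
)

# First argument: the list; second argument: the function applied to each item
lengths.l <- lapply(words.l, length)

# lapply returns a list with one result per item of the input list
str(lengths.l)
```

Because the result is itself a list, it can be passed to unlist when a plain vector is easier to work with.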

Page 141. The following line of code

if(length(chunks.l[[length(chunk.l)]] <= length(chunks.l[[length(chunk.l)]]/2)

should read

if(length(chunks.l[[length(chunk.l)]]) <= chunk.size/2)

as it does on page 140.
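
To show why the comparison against chunk.size/2 matters, here is a minimal sketch of the idea rather than the book's own function. It assumes the list of chunks is named chunks.l throughout, and the inputs words.v and chunk.size are hypothetical.

```r
# Hypothetical inputs: 25 "words" and a chunk size of 10
words.v <- paste0("w", 1:25)
chunk.size <- 10

# Split the words into consecutive chunks of at most chunk.size items
x <- seq_along(words.v)
chunks.l <- split(words.v, ceiling(x / chunk.size))

# If the final chunk is no longer than half of chunk.size,
# fold it into the preceding chunk and then drop it
last <- length(chunks.l)
if (last > 1 && length(chunks.l[[last]]) <= chunk.size / 2) {
  chunks.l[[last - 1]] <- c(chunks.l[[last - 1]], chunks.l[[last]])
  chunks.l[[last]] <- NULL
}

# Number of words in each remaining chunk
lengths(chunks.l)
```

The check compares the size of the last chunk with a fixed threshold, so a short trailing chunk is merged instead of being treated as a full segment.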

Reviewer Comments

“This is a well written book on the topic of Text Analysis. There is enough information to give you a good start using R. Followed by easy to understand details about text analysis. … This is a good book to have if you are doing text analysis.” (Mary Anne, Cats and Dogs with Data, maryannedata.com, August, 2014)

“A remarkably well-crafted book that will allow students to get a quick start and progress toward quite sophisticated text mining tasks. … exercises provided at the end of each chapter, with solutions at the end of the book, should serve well to help students solidify their knowledge and gain more confidence in their text mining skills. … a great addition to the libraries of digital humanists and natural language enthusiasts who wish to expand their programming literacy … .” (Denilson Barbosa, Computing Reviews, August, 2014)

"This book is an essential resource for anyone who wants to study literature using computational methods."(Mark Algee-Hewitt, Amazon.com, November 2014)

"I can't think of a more qualified person to guide readers through powerful R techniques for text analysis. While extremely useful for people studying literature, these techniques can be also used by anybody working with texts. Even if you simply want to understand how companies and data scientists are analyzing all kinds of texts, go through this book." (Lev Manovich, Department of Computer Science, The Graduate Center, City University of New York & author of The Language of New Media)

"The open source programming language R has become one of the most central statistical and analytical tool in many sciences. While it has already been used in linguistic applications, this book is the first to discuss the application of (corpus-linguistic and other) methods with R in the context of literary studies. The author covers a wide range of descriptive, analytical, and exploratory methods beautifully and in detail in a book that will appeal to a wide and diverse audience of both students and seasoned researchers from literary studies, linguistic computing, and the digital humanities more generally." (Stefan Th. Gries, Department of Linguistics, University of California, Santa Barbara & author of Quantitative corpus linguistics with R: A Practical Introduction)

"This book does a great service for literary scholars interested in computational approaches to text analysis, giving them ready access to powerful methods for exploring patterns and relationships across large quantities of text. Its clear and lucid explanations will also make it an easy textbook to teach from, especially for instructors with prior background who can then use it as a stepping stone to introducing more complex methods. Amateurs and those with little programming background will find it imminently accessible." (Hoyt Long, Department of East Asian Languages and Civilizations, University of Chicago)

"Through my work as an epidemiologist, I encounter electronic health records in an unstructured form (i.e. text), and Text Analysis with R covers many of the initial steps for studying these records. The book is very accessible; it provides a straightforward introduction to manipulating text information without presuming a background in programming or a familiarity with the jargon used in this field. I also appreciated Jockers' thoughtful inclusion of supplemental explanations and information in footnotes throughout the book. For example, text analysis often involves the use of "regular expressions"; a footnote concisely explains wildcard and escape characters and this explanation spared me a fair bit of confusion in my own work. Although I am not a "student of literature", I thought the book contained many generalizable and expertly-taught lessons that make it a valuable introduction to manipulating and analyzing text." (Matthew Maenner, Ph.D.)

"This book is a worthy introduction to computational text analysis, and it fills an important gap in the literature. It’s very accessible and contains plenty of interesting examples and real applications, which have been collected and crafted over the many years the author taught text analysis to undergraduate and graduate students. Although it focuses on the study of literature, I would highly recommend this book to students in business administration and related fields." (Joao Quariguasi Frota Neto, School of Management, University of Bath)

Reader Contributions and Corrections

Thanks to all of those below who provided comments and feedback.

Page | Type | Description | Contributor
Title page | clarification | With cap or lc the W? Presumably depends on publisher's style. | Charles Shirley
vii | correction | importantly importantly or important? Use is inconsistent in the book. Being somewhat old-school, I recommend "important." | Charles Shirley
vii | correction | patters typo | Charles Shirley
vii | correction | 'toolkit' not 'tool kit' | Yihui Xie
vii | clarification | Elaborate further on final point about 'new discoveries.' | Alexander Huber
ix | correction | Baayen and Hadley serial comma before "and"? I think it's included or omitted inconsistently. | Charles Shirley
x | correction | Knitr; two looks like semicolon should be comma or possibly colon | Charles Shirley
x | correction | Yihui Xie Dynamic Documents with R and knitr. Chapman and Hall/CRC, 2013. ISBN 978-1482203530. | Yihui Xie
x | correction | 'Sweave' not 'Sweve.' | Jerid Francom
6 | correction | 'alongside' not 'along side' | Ashanka Kumari
6 | clarification | Added footnote reading: 'Console is a word used to refer to a command line window where you enter commands. Sometimes this is called a Terminal or Shell.' | John Laudun
6 | clarification | Added a footnote to clarify what you will see when opening RStudio for the first time: 'Actually, the first time you launch RStudio you will only be able to see three of the panes. The R scripting or Source pane will likely be collapsed so you will only see the word Source until you either create a new script (File > New > R Script) or un-collapse the Source window pane.' | John Laudun
7 | correction | are determined not an error, but reads like a slight insult; maybe just "decide"? | Charles Shirley
7 | correction | Linux missing period at end | Charles Shirley
13 | correction | plaintext should be camel case plainText | Stephen Pentecost
13 | clarification | Added footnote to explain my variable naming convention, as follows: 'Throughout this book I will use a naming convention when instantiating new R objects. In the example seen here, I have named the object text.v. The .v extension is a convention I have adopted to indicate the R data type of the object, in this case a vector object. This will make more sense as you learn about R's different data types. For now, just understand that you can name R objects in ways that will make sense to you as a human reader of the code.' | Austin Wehrwein
14 | clarification | By default the 'max.print' option is set to show 10,000 lines so that when entering text.v you will only see the first 10,000 lines of the book in the console followed by this warning: [ reached getOption('max.print') omitted 8874 entries ]. The max.print option can be reset using options(max.print=1000000) so that you will see the entire text of Moby Dick. | Mark Wolff
16 | correction | 'shortcut' not 'short cut' and 'shortcutting' not 'short-cutting' | Ashanka Kumari
16 | clarification | Added footnote as follows: 'Programming code is extremely finicky. If you do not type the commands exactly as they appear here, you will likely get an error. In my experience about 95% of the errors and bugs one sees when coding are the result of careless typing. If the program is not responding the way you expect or if you are getting errors, check your typing. Everything counts: capital letters must be consistently capitalized, commas between arguments must be outside of the quotes and so on.' | John Laudun
16 | correction | forth typo; should be "fourth" | Charles Shirley
16 | correction | in stylistic note: through the book I saw a number of cases like this "in between" where the unnecessary preposition creates a colloquial flavor; recommend thinking about whether the intended audience will notice or object | Charles Shirley
17 | correction | book;-) hmmm...what terminal punctuation, if any, should be used with an emoticon? interesting question, definitely no old-school answer! | Charles Shirley
18 | correction | expression typo, should be plural | Charles Shirley
19 | clarification | Added a page reference in the footnote pointing to page 136, where a better regular expression is explained. | Alexander Huber
19 | correction | help would it be appropriate to mention RStudio's help tab? once I discovered it, I found myself using it constantly | Charles Shirley
19 | correction | internet normal practice seems to be initial cap | Charles Shirley
20 | correction | 'occurrences' not 'occurecnes' in code sample. | Mikal Brotnov
20 | correction | 'an object' not 'a object' | Matthew J Maenner
20 | correction | <- query for here and elsewhere: will inconsistent spacing around <- seem sloppy to intended readers? is it worth mentioning RStudio's handy shortcut of alt-hyphen to type the <- plus a space before and after? | Charles Shirley
20 | correction | gut check isn't "gut check" usually about courage? maybe also intuition, which fits here, but I found the phrase distracting | Charles Shirley
21 | correction | 'into a vector of words' not 'into vector of words' | John Laudun
21 | correction | whale higher on page, "whale" as word of interest is in quotes, not italic; minor inconsistency, likely not worth worrying about | Charles Shirley
21 | correction | list since lists are a data type often used in the book, maybe don't use it loosely here; just "will return the index positions"? | Charles Shirley
22 | clarification | When entering whale.hits.v/total.words.v the textbook shows the result rounded to the sixth decimal place. You will see the result rounded to the seventh in your console. This difference was the result of using the knitr package for production of the textbook document. I have added an option to correct this discrepancy. | Mark Wolff
22 | clarification | Added a footnote to clarify that there is a more elegant way to calculate the length of the hits for whale without using which() as in: length(moby.word.v[moby.word.v=='whale']). Even though it is verbose, I'll continue to use which in the example because I think which is easier for a beginner to understand. | Yihui Xie
22 | correction | length(unique(moby.word.v)) should be 16871. | Ashanka Kumari
22 | correction | 'comments into code' not 'comments into to code' | Brandon Hawk
22 | correction | occurecnes typo in code sample | Charles Shirley
23 | correction | seems like I've seen cautions against using T or F for TRUE or FALSE | Charles Shirley
24 | correction | Unnecessary question mark changed to period. | Ashanka Kumari
24 | clarification | Added note about adjusting the size of the plotting pane in RStudio so as to avoid 'Error in plot.new() : figure margins too large'. | Austin Wehrwein
24 | correction | novel? sentence isn't a question; replace ? with period | Charles Shirley
25 | correction | recycling and recommend at least a comma after "recycling" because the two halves of the compound sentence are such different topics; maybe even make two sentences? | Charles Shirley
26 | correction | tabled a bit distracting to see data type used as participle; no big objection, just including my personal reaction | Charles Shirley
27 | clarification | New text explaining the use of the axis() and names() functions as follows: Here I'll add a few more arguments to plot in order to convey more information about the resulting image, and I'll call the axis function to reset the values on the x-axis with the names of the top ten words. Notice that the names function can be used to set, or in this case, get the names of an object. | Carmen McCue
27 | correction | asterisks an asterisk rather than "the asterisks"? | Charles Shirley
27 | correction | two 3.2 to specify exercise number | Charles Shirley
28 | correction | Exercise 1 probably needs a more precise reference; 1.X? | Charles Shirley
31 | correction | 'You now need to create a sequence of numbers from 1 to n, where n is the position, or index number, of the last word in Moby Dick' not 'You now need to create a sequence of numbers from 1 to n, where n is the position, or index number) of last word in Moby Dick.' | Ashanka Kumari
31 | correction | If elsewhere in book, R words at beginning of sentence are lower-cased if that's the correct R form | Charles Shirley
31 | correction | one, ten seems more appropriate to use 1, 10 | Charles Shirley
32 | correction | The text regarding TRUE, FALSE and NA has been revised as follows: 'Another vector containing the values for plotting on the y-axis is now needed, and in this case, the values need only be some reflection of the logical condition of TRUE where a whale is found and FALSE or none found when an instance of whale is not found. In R you can represent the logical value TRUE with a number 1 and FALSE with a 0. Here, however, since we are not really counting items but, instead, noting their presence or absence, I'll introduce a special character sequence NA as in 'not available' for places where there is no found match.' | Alexander Huber
32 | clarification | Moved code for plotting 'simple plot of whale' to page 32 and added reference to figure 4.1 | Brandon Hawk
32 | correction | NA inconsistent spacing around dashes | Charles Shirley
33 | correction | 'separate' and 'occurrences' are misspelled. | Ashanka Kumari
33 | correction | seperate typo in code sample | Charles Shirley
34 | correction | instantiated probably OK, but "instantiated" may confuse some readers; meaning becomes clear from context as the word is used through the book | Charles Shirley
35 | correction | Unnecessary use of unlist in code reading novel.lines.v<-unlist(novel.lines.v). novel.lines.v is already a vector. | Charles Shirley
35 | correction | 'Now identify the chapter break positions in the vector using the grep function' not 'Now identify the chapter break positions in the list using the grep function' | Carmen McCue
35 | correction | As written, the code in the for loop that follows on page 39 will fail to capture the last line of the novel. To remedy this error, both the text instructions and the code need modification. The instructions on this page now read as follows: 'This technique works perfectly except for the last chapter where there is no following chapter! There are several ways you might address this situation, but a simple solution is to just add one more line to the novel.lines.v object and then add the position of this new line to the chap.positions.v vector. You will find that last position easily enough with the length function.' Item #2 of the list that follows has been modified to read: 'Add a new item to the end of the novel.lines.v object using the c function. Here I have set the value of that last item to END. You will see later on that this last item serves to mark the end boundary for the last chapter. Now get the last position using the length function and add it to the chap.positions.v vector using the c function:' The code in item #2 is rewritten as follows:

novel.lines.v <- c(novel.lines.v, "END")
last.position.v <- length(novel.lines.v)
chap.positions.v <- c(chap.positions.v, last.position.v)

| Carmen McCue
38 | correction | Missing the word 'to' in 'To summarize, the for loop will need to iterate over each item' | Ashanka Kumari
38 | correction | need iterate looks like "need" is extraneous | Charles Shirley
39 | clarification | Moved description of list object to chapter two (page 18) where list objects are first introduced. | Kevin McMullen
39 | correction | pauses suggests the user will notice the code pause in execution; actual meaning seems to be a "pause" in the logic | Charles Shirley
40 | correction | Extra 'to' in sentence 'I must -to- add 1 to i in its capacity as an index' | Ashanka Kumari
40 | correction | Missing 'v' in object name 'novel.lines.v' | Ashanka Kumari
42 | correction | 'each of the drawers contains an integer vector' not 'each of the drawers contains is an integer vector' | Carmen McCue
42 | correction | 'R will keep recycling from the shorter vector until it reaches the end of the process' not 'R will keep recycling from the longer vector until it reaches the end of the process' | Carmen McCue
42 | correction | 'Sometimes' not 'Sometime' | Ashanka Kumari
42 | correction | Sometime missing s on end | Charles Shirley
42 | correction | incredibly too colloquial? | Charles Shirley
43 | correction | Missing 'v' in object name 'novel.lines.v' | Ashanka Kumari
43 | correction | Missing the word 'the' between 'by' and 'first' and again between 'by' and 'second' | Ashanka Kumari
43 | correction | Missing the word 'one' in 'a file cabinet with three drawers, and each one of the drawers contains an integer vector' | Ashanka Kumari
43 | correction | bracketed sub setting I think it's usually sub-setting elsewhere; my personal preference is subsetting | Charles Shirley
44 | correction | sub setting inconsistent use with and without hyphen in this paragraph | Charles Shirley
44 | correction | called, comma splice, or just informal? | Charles Shirley
46 | correction | 'simpler object' not 'simpler objects' | Ashanka Kumari
46 | clarification | The columns in matrix objects are already vectors, so using the as.vector function here is unnecessary. | NA
46 | correction | an even simpler vector objects should be singular: "object" | Charles Shirley
48 | correction | i.e. looks like e.g. fits the meaning better | Charles Shirley
49 | clarification | Added brief discussion and example of correlation coefficient used in literary context. | Alexander Huber
50 | correction | 'you need to replace the NA values' not 'you need to replace with these NA values' | Ashanka Kumari
52 | correction | See lowercase: see | Charles Shirley
52 | correction | 1s seen in various places, just noted here: font for numerals often looks like that for R words rather than normal text; seems a bit odd to me, but no big deal; in this case, maybe better to use apostrophe: 1's | Charles Shirley
53 | correction | short hand shorthand (one word), I think | Charles Shirley
53 | correction | now Needs initial cap: "Now" | Charles Shirley
54 | correction | Once in random order, you'll an old-school bugaboo: introductory phrase referring to something other than the next word | Charles Shirley
56 | correction | once run needs initial cap (also another introductory-phrase violation, for readers who feel violated) | Charles Shirley
56 | correction | [Wikipedia URL] looks like all-caps is unintentional | Charles Shirley
61 | correction | 'Students frequently' not 'Students frequency' | Alexander Huber
61 | correction | 'chapters of the novel' not 'chapters of he novel' | Brandon Hawk
61 | correction | 'chapter.freqs.l' not 'chapter.list.freqs' | Brandon Hawk
61 | correction | frequency typo; should be "frequently" | Charles Shirley
61 | correction | token shouldn't it be "type"? | Charles Shirley
63 | correction | An should be "A" | Charles Shirley
67 | correction | 'how to use R's' not 'how use R's' | Brandon Hawk
67 | correction | 'provide lapply' not 'provide lappy' | Brandon Hawk
67 | correction | 'fourth' not 'forth' | Kimberley Tedrow
67 | correction | how use should be "how to use" | Charles Shirley
68 | correction | revered a delightful typo, but alas, should be "reversed" | Charles Shirley
69 | correction | 'in' not 'n' | Ashanka Kumari
68 | correction | follow, provide here and some similar instances, the comma seems unnecessary and possibly wrong (though a perfectly good 18th-19th century practice that I regret being out of fashion) | Charles Shirley
69 | correction | result? period, not ? | Charles Shirley
72 | clarification | chapter.lengths.m is created only as part of a practice exercise (6.1) and not in the main text, so text here now includes reminder from exercise. | Alexander Huber
72 | correction | this looks like a colon is needed; noticed similar omissions in various places, not sure they're all included in this list | Charles Shirley
74 | correction | incredibly that word again; author's choice, of course | Charles Shirley
75 | correction | Citation for Firth is incomplete. | Alexander Huber
75 | correction | you access appears it should be "you to access" | Charles Shirley
75 | correction | relative path unfamiliar term for many readers; like "instantiate," becomes clearer as it is reused; may or may not be worth inserting a brief explanation | Charles Shirley
76 | correction | function, function better to omit comma and depend on font/face difference to separate the words visually? | Charles Shirley
76 | correction | Improve the regular expression for finding files with a '.txt' file extension as: '\\.txt$' | NA
77 | correction | with of rather than "with"? or "objective of using"? | Charles Shirley
77 | correction | command + return would suggest adding Windows keystrokes, which I think are Ctrl + Enter | Charles Shirley
80 | correction | file needs colon: "file:" | Charles Shirley
83 | correction | abstract another technical term; not likely to be a serious problem, but potentially confusing to someone who is still worrying about the basics of coding | Charles Shirley
84 | correction | 's check the font; is it as intended? | Charles Shirley
84 | correction | return back the hits another old-school bugaboo; just "return the hits"? | Charles Shirley
84 | correction | widowing another wonderful typo... | Charles Shirley
85 | correction | Published doesn't need initial cap (three instances in the two code samples) | Charles Shirley
86 | correction | see, Enter question mark rather than comma | Charles Shirley
91 | correction | internet internet/Internet again; if instances noted in this list are changed, better do a global search to avoid inconsistency in case I didn't indicate them all | Charles Shirley
91 | correction | OCR correct font? | Charles Shirley
91 | correction | in is, not in (is not perfect) | Charles Shirley
91 | correction | or of, not or (margin of error) | Charles Shirley
92 | correction | along side alongside probably preferable | Charles Shirley
92 | correction | URL in footnote maybe OK, but in present pagination (footnote starting on p. 91), may look like a stray footnote | Charles Shirley
93 | correction | get seems (too?) colloquial | Charles Shirley
94 | correction | you should be "your" | Charles Shirley
94 | correction | 'chapter' first single quote needs to be reversed (or changed to straight quotes like the rest of the expression) | Charles Shirley
95 | clarification | I have not explained why I chose 'd' as the name for the TEI namespace (the decision was arbitrary). I have now changed it to the slightly more verbose but clearer 'tei.' | Alexander Huber
95 | correction | the third argument see note in PDF; "The third argument" confused me | Charles Shirley
95 | correction | namespace in code sample: should "namespace" be "namespaces"? | Charles Shirley
96 | correction | 'because they' not 'because the' | Alexander Huber
96 | correction | with help from grep hard to be sure whether the phrase modifies "did previously" or "you can put..." | Charles Shirley
96 | correction | the typo: should be "they" | Charles Shirley
104 | correction | above: typo: period, not colon | Charles Shirley
104 | correction | non isn't it "not available"? | Charles Shirley
108 | correction | xpath normal form seems to be XPath | Charles Shirley
108 | correction | a an, to match pronunciation? | Charles Shirley
108 | correction | 'doc.object' not 'doc' | Alexander Huber
109 | correction | up colloquialism "chunks up" may be misread as chunks / up the string | Charles Shirley
109 | correction | read needs colon | Charles Shirley
110 | correction | bi-grams normally written without hyphen? | Charles Shirley
110 | correction | 67 footnote needs closing period | Charles Shirley
112 | correction | Formula squares | Alexander Huber
112 | correction | ultimately repetition in two sentences seems awkward | Charles Shirley
112 | correction | five-hundred hyphen not needed | Charles Shirley
112 | correction | Formula needs correction | Charles Shirley
114 | correction | factors may be worth mentioning factors will be explained in a later section | Charles Shirley
117 | correction | adequate: period, not colon | Charles Shirley
117 | correction | code: sentence continues after the code snippet, so colon seems wrong here | Charles Shirley
119 | correction | document document's? | Charles Shirley
121 | correction | built in built-in? | Charles Shirley
122 | correction | items, each separate into two sentences? | Charles Shirley
122 | correction | corresponding one-tenth missing "to" | Charles Shirley
123 | correction | items which you can ascertain items, as you can ascertain? | Charles Shirley
124 | correction | allows delete "allows" | Charles Shirley
130 | correction | function missing period at end of sentence | Charles Shirley
130 | correction | Rybinci Rybicki, isn't it? | Charles Shirley
130 | correction | site: needs space after colon | Charles Shirley
133 | correction | 'I was calling' not 'I calling' | Alexander Huber
133 | correction | LSA Latent LSA, or something else to connect acronym and term | Charles Shirley
134 | correction | 'was a dazzling' not 'was dazzling' | Alexander Huber
134 | correction | importantly similar phrase is "more important" somewhere else (old-school stylists taught me to prefer "more important") | Charles Shirley
134 | correction | total, most connecting clauses with comma seems awkward, could reduce to "mostly novels" | Charles Shirley
134 | correction | works, extraneous comma | Charles Shirley
136 | correction | that use an apostrophe, maybe "... possessives that use apostrophes get split" (or "are split" for slightly more formality) | Charles Shirley
136 | correction | straight-forward no hyphen | Charles Shirley
138 | correction | all need appears to be missing "you" | Charles Shirley
139 | clarification | A more computationally efficient way to achieve this same result is to build a list object inside the loop and then use do.call to rbind the list elements. The code for this could be written as follows:

topic.l<-NULL
for(i in 1:length(files.v)){
  # parse one XML file and split its text into chunks
  doc.object<-xmlTreeParse(file.path(inputDir, files.v[i]),
    useInternalNodes=TRUE)
  chunk.m<-makeFlexTextChunks(doc.object,
    chunk.size, percentage=FALSE)
  # strip the file extension to get a text name
  textname<-gsub("\\..*","", files.v[i])
  # label each segment as textname_segmentnumber
  segments.m<-cbind(paste(textname,
    segment=1:nrow(chunk.m), sep="_"), chunk.m)
  # store the result in the list instead of growing a matrix
  topic.l[[textname]]<-segments.m
}
# bind all list elements into one matrix in a single step
topic.m<-do.call(rbind, topic.l)

| Paul Johnson
139 | correction | number should be plural | Charles Shirley
141 | correction | unicode usually capped, I think; the unicode.org website spells it as Unicode | Charles Shirley
141 | correction | solutions, comma between subject and predicate | Charles Shirley
142 | correction | method personal note: I found the term "method" baffling when I first encountered it while trying to write Excel macros; nothing to suggest here, but thought it worth pointing out that others may be baffled | Charles Shirley
142 | correction | token should be plural | Charles Shirley
143 | correction | Output and description of output from calling head(word.freqs) differ. | Alexander Huber
144 | correction | begin...Begin repetition is awkward | Charles Shirley
144 | correction | edu/ extraneous space in URL | Charles Shirley
147 | correction | mind, comma not needed | Charles Shirley
147 | correction | function needs comma after "function" | Charles Shirley
147 | correction | see needs initial cap | Charles Shirley
147 | correction | just recommend deleting "just" | Charles Shirley
148 | correction | real change to "really" for slightly less colloquial flavor (or even delete) | Charles Shirley
149 | correction | able assign missing "to" | Charles Shirley
149 | correction | proportion, comma between subject and predicate | Charles Shirley
149 | correction | that: delete either "that" or colon | Charles Shirley
149 | correction | Friends topic would be easier to read as "...Friends, topic..." | Charles Shirley
149 | correction | .36. quote needs to be closed | Charles Shirley
150 | correction | .df needs colon to introduce code snippet | Charles Shirley
151 | correction | Scatter should be "Bar" | Charles Shirley
152 | correction | this : extraneous space before colon | Charles Shirley
153 | correction | characters initial-cap (or lowercase "Main") | Charles Shirley
156 | correction | file should be plural | Charles Shirley
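
The page 76 entry above recommends the anchored pattern '\\.txt$' for finding text files. A minimal sketch of what the anchoring changes follows. The file names are hypothetical, and the looser pattern "txt" is used only for comparison; it is not necessarily the pattern printed in the book.

```r
# Hypothetical file names (illustration only)
files.v <- c("austen.txt", "melville.txt", "notes.txt.bak", "txt_index.csv")

# A loose pattern matches "txt" anywhere in the name
loose.v <- grep("txt", files.v, value = TRUE)

# The anchored pattern requires a literal dot followed by "txt" at the very end
strict.v <- grep("\\.txt$", files.v, value = TRUE)

print(loose.v)
print(strict.v)
```

The doubled backslash escapes the dot so that it matches a literal period, and the dollar sign ties the match to the end of the string. The same pattern can be supplied to the pattern argument of dir when listing the files in a directory.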