Introduction to Text Analysis and Topic Modeling with R
General Description:
Introduction to Text Analysis and Topic Modeling with R is a set of two workshops that will provide a practical introduction to text analysis, with a special emphasis on topic modeling. Taken together, the workshops will cover basic text processing, data ingestion, data preparation, and topic modeling. The main computing environment for the workshops will be R: “the open source programming language and software environment for statistical computing and graphics.”
While no programming experience is required, students must have basic computer skills, must be familiar with their computer’s file system, and must be comfortable entering commands in a command line environment.
Though the two workshops are designed to stand alone, the second is more advanced and assumes some basic familiarity with topic modeling. Participants might want to read The LDA Buffet is Now Open; or, Latent Dirichlet Allocation for English Majors for a general overview.
Suggested Workshop Preparation:
While not required, participants are encouraged to work through at least the first two of the seven basic R lessons available at R Code School prior to taking this workshop.
In advance of the workshop, students should:
- Download the current version of R (version 3.0.0 at the time of this writing) from the CRAN website by clicking on the link appropriate to your operating system (see http://cran.at.r-project.org):
- If you use MS Windows, click on “base” and then on the link to the executable (i.e. “.exe”) setup file.
- If you are running Mac OS X, choose the link to the most current package.
- If you use Linux, choose your distribution and then the installer file. Follow the instructions to install R in the standard or “default” directory. You will then have the base installation of R on your system.
- If you are on a Windows or Macintosh computer, you will find the R application in the directory where Programs (Windows) or Applications (Macintosh) are stored. You can double-click the icon to launch the R GUI, but we will not be using the R GUI in the workshop; we will use RStudio (see below).
- If you are on a Linux/Unix system, simply type “R” at the command line to enter the R program environment.
- Download and Install RStudio
- The R GUI application is fine for a lot of simple programming, but RStudio offers a much nicer environment for writing and running R programs. RStudio is an IDE (“Integrated Development Environment”) for R, and it runs happily on Windows, Mac, and Linux. After you have installed R (by following the instructions above), download the “Desktop” version (i.e. not the Server version) of RStudio from http://www.rstudio.com. Follow the installation instructions and then launch RStudio just as you would any other program/application. When you launch RStudio, you do not have to launch R as well; RStudio accesses the R program you installed in the first step. (A quick sanity check that everything is working appears after this list.)
- Download the workshop materials.
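Once R and RStudio are installed, a quick sanity check (my suggestion, not part of the workshop materials) is to type a couple of lines into the RStudio console and confirm you get sensible output:

# confirm which version of R RStudio is talking to
R.version.string
# a trivial computation to confirm the console is live
2 + 2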
Workshop Syllabus:
IMPORTANT: It is critical that you arrive on time to every session and be ready to roll with RStudio installed and running. The workshop will begin on schedule, and if you miss the first few minutes of any session you’ll be lost!
Workshop One:
Introduction to Text Analysis with Applications in R
Thursday, February 20, 10:00 – 12:30
Summary: In this workshop you will be introduced to the R programming language while learning the basics of computational text analysis. You will learn basic R syntax and be introduced to the RStudio programming environment. Text analysis topics covered will include text ingestion and tokenization, word frequency analysis, dispersion plots, and, if time permits, correlation analysis.
- SESSION ONE (10:00-11:15)
- The R computing environment
- R console vs. RStudio
- Basic text manipulation in R
- Word Frequency
- BREAK (11:15-11:30)
- SESSION TWO (11:30-12:30)
- Dispersion Plots
- Correlation
Workshop Two:
Introduction to Topic Modeling with Applications in R
Thursday, February 20, 9:10 – 12:00
IMPORTANT: It is critical that you arrive on time to every session and be ready to roll with RStudio installed and running. The workshop will begin on schedule, and if you miss the first few minutes you’ll be lost!
Summary: In this workshop, you will be introduced to topic modeling and learn how to analyze and visualize topic model output in R. For this work, we will use the R implementation of MALLET developed by David Mimno. Students will also learn how to parse TEI-based XML and how to segment large texts into chunks. We will discuss various text pre-processing procedures, including how to do part-of-speech tagging in R using the openNLP package. Though this will be a hands-on workshop, some of the techniques explored here are quite advanced, and those unfamiliar with such things as XML document structure and basic text analysis may find it better to observe and then use the included documentation to practice the techniques at home.
- SESSION THREE (9:10-10:30)
- Loading a corpus
- Preparing files for Topic Modeling
- BREAK (10:30-10:45)
- SESSION FOUR (10:45-12:00)
- Running the Model
- Exploring topic coherence with term clouds
- Topic data analysis
Workshop One Code Examples:
Word Frequency
text.v <- scan("data/plainText/melville.txt", what="character", sep="\n") start.v <- which(text.v == "CHAPTER 1. Loomings.") end.v <- which(text.v == "orphan.") start.metadata.v <- text.v[1:start.v -1] end.metadata.v <- text.v[(end.v+1):length(text.v)] metadata.v <- c(start.metadata.v, end.metadata.v) novel.lines.v <- text.v[start.v:end.v] novel.v <- paste(novel.lines.v, collapse=" ") novel.lower.v <- tolower(novel.v) moby.words.l <- strsplit(novel.lower.v, "\\W") moby.word.v <- unlist(moby.words.l) not.blanks.v <- which(moby.word.v!="") moby.word.v <- moby.word.v[not.blanks.v] moby.freqs.t <- table(moby.word.v) sorted.moby.freqs.t <- sort(moby.freqs.t , decreasing=TRUE) sorted.moby.freqs.t[1:10] plot(sorted.moby.freqs.t[1:10]) plot(sorted.moby.freqs.t[1:10], type="b", xlab="Top Ten Words", ylab="Word Count",xaxt = "n") axis(1,1:10, labels=names(sorted.moby.freqs.t[1:10])) |
Accessing and Comparing Word Frequency Data
# First with Moby Dick
text.v <- scan("data/plainText/melville.txt", what="character", sep="\n")
start.v <- which(text.v == "CHAPTER 1. Loomings.")
end.v <- which(text.v == "orphan.")
start.metadata.v <- text.v[1:(start.v-1)]
end.metadata.v <- text.v[(end.v+1):length(text.v)]
metadata.v <- c(start.metadata.v, end.metadata.v)
novel.lines.v <- text.v[start.v:end.v]
novel.v <- paste(novel.lines.v, collapse=" ")
novel.lower.v <- tolower(novel.v)
moby.words.l <- strsplit(novel.lower.v, "\\W")
moby.word.v <- unlist(moby.words.l)
not.blanks.v <- which(moby.word.v != "")
moby.word.v <- moby.word.v[not.blanks.v]
moby.freqs.t <- table(moby.word.v)
sorted.moby.freqs.t <- sort(moby.freqs.t, decreasing=TRUE)
# convert raw counts to percentages so the two novels can be compared
sorted.moby.rel.freqs.t <- 100*(sorted.moby.freqs.t/sum(sorted.moby.freqs.t))
plot(sorted.moby.rel.freqs.t[1:10], main="Moby Dick", type="b",
     xlab="Top Ten Words", ylab="Percentage", xaxt="n")
axis(1, 1:10, labels=names(sorted.moby.rel.freqs.t[1:10]))

# Now with Jane Austen
text.v <- scan("data/plainText/austen.txt", what="character", sep="\n")
start.v <- which(text.v == "CHAPTER 1")
end.v <- which(text.v == "THE END")
novel.lines.v <- text.v[start.v:end.v]
novel.v <- paste(novel.lines.v, collapse=" ")
novel.lower.v <- tolower(novel.v)
sense.words.l <- strsplit(novel.lower.v, "\\W")
sense.word.v <- unlist(sense.words.l)
not.blanks.v <- which(sense.word.v != "")
sense.word.v <- sense.word.v[not.blanks.v]
sense.freqs.t <- table(sense.word.v)
sorted.sense.freqs.t <- sort(sense.freqs.t, decreasing=TRUE)
sorted.sense.rel.freqs.t <- 100*(sorted.sense.freqs.t/sum(sorted.sense.freqs.t))
plot(sorted.sense.rel.freqs.t[1:10], main="Sense and Sensibility", type="b",
     xlab="Top Ten Words", ylab="Percentage", xaxt="n")
axis(1, 1:10, labels=names(sorted.sense.rel.freqs.t[1:10]))

# And now some comparison
# all words appearing in either top-ten list
unique(c(names(sorted.moby.rel.freqs.t[1:10]),
         names(sorted.sense.rel.freqs.t[1:10])))
# top-ten words the two novels share
names(sorted.sense.rel.freqs.t[which(names(sorted.sense.rel.freqs.t[1:10])
      %in% names(sorted.moby.rel.freqs.t[1:10]))])
# top-ten words unique to Sense and Sensibility
presentSense <- which(names(sorted.sense.rel.freqs.t[1:10])
                      %in% names(sorted.moby.rel.freqs.t[1:10]))
names(sorted.sense.rel.freqs.t[1:10])[-presentSense]
# top-ten words unique to Moby Dick
presentMoby <- which(names(sorted.moby.rel.freqs.t[1:10])
                     %in% names(sorted.sense.rel.freqs.t[1:10]))
names(sorted.moby.rel.freqs.t[1:10])[-presentMoby]
Token Distribution Analysis and Dispersion Plots
n.time.v <- seq_along(moby.word.v) # "novel time": one tick per word
whales.v <- which(moby.word.v == "whale")
w.count.v <- rep(NA, length(n.time.v))
w.count.v[whales.v] <- 1
plot(w.count.v, main="Dispersion Plot of 'whale' in Moby Dick",
     xlab="Novel Time", ylab="whale", type="h", ylim=c(0,1), yaxt='n')
ahabs.v <- which(moby.word.v == "ahab") # find 'ahab'
a.count.v <- rep(NA, length(n.time.v)) # change prefix 'w' to 'a' to keep whales and ahabs in separate variables
a.count.v[ahabs.v] <- 1 # mark the occurrences with a 1
plot(a.count.v, main="Dispersion Plot of 'ahab' in Moby Dick",
     xlab="Novel Time", ylab="ahab", type="h", ylim=c(0,1), yaxt='n')
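Because the two plots differ only in the word being tracked, the logic is easy to wrap in a small helper; dispersionPlot is a hypothetical function of my own, not part of the workshop materials:

# plot the dispersion of any term across a word vector
dispersionPlot <- function(word.v, term){
  hits.v <- which(word.v == term)    # positions where the term occurs
  count.v <- rep(NA, length(word.v)) # NA everywhere else
  count.v[hits.v] <- 1               # mark the occurrences with a 1
  plot(count.v, main=paste("Dispersion Plot of '", term, "'", sep=""),
       xlab="Novel Time", ylab=term, type="h", ylim=c(0,1), yaxt='n')
}
dispersionPlot(moby.word.v, "queequeg") # usage example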
Token Distribution Analysis and Dispersion Plots (Using Grep to find Chapter Breaks)
text.v <- scan("data/plainText/melville.txt", what="character", sep="\n")
start.v <- which(text.v == "CHAPTER 1. Loomings.")
end.v <- which(text.v == "orphan.")
novel.lines.v <- text.v[start.v:end.v]
chap.positions.v <- grep("^CHAPTER \\d", novel.lines.v)
novel.lines.v <- c(novel.lines.v, "END")
last.position.v <- length(novel.lines.v)
chap.positions.v <- c(chap.positions.v, last.position.v)
# Two empty lists
chapter.raws.l <- list()
chapter.freqs.l <- list()
# A for loop: tabulate raw and relative word frequencies per chapter
for(i in 1:length(chap.positions.v)){
  if(i != length(chap.positions.v)){
    chapter.title <- novel.lines.v[chap.positions.v[i]]
    start <- chap.positions.v[i]+1
    end <- chap.positions.v[i+1]-1
    chapter.lines.v <- novel.lines.v[start:end]
    chapter.words.v <- tolower(paste(chapter.lines.v, collapse=" "))
    chapter.words.l <- strsplit(chapter.words.v, "\\W")
    chapter.word.v <- unlist(chapter.words.l)
    chapter.word.v <- chapter.word.v[which(chapter.word.v != "")]
    chapter.freqs.t <- table(chapter.word.v)
    chapter.raws.l[[chapter.title]] <- chapter.freqs.t
    chapter.freqs.t.rel <- 100*(chapter.freqs.t/sum(chapter.freqs.t))
    chapter.freqs.l[[chapter.title]] <- chapter.freqs.t.rel
  }
}
whale.l <- lapply(chapter.freqs.l, '[', 'whale')
whales.m <- do.call(rbind, whale.l)
ahab.l <- lapply(chapter.freqs.l, '[', 'ahab')
ahabs.m <- do.call(rbind, ahab.l)
whales.v <- whales.m[,1]
ahabs.v <- ahabs.m[,1]
whales.ahabs.m <- cbind(whales.v, ahabs.v)
dim(whales.ahabs.m)
colnames(whales.ahabs.m) <- c("whale", "ahab")
barplot(whales.ahabs.m, beside=TRUE, col="grey")
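Session Two lists correlation, for which no code example appears above. A minimal sketch building on the whales.ahabs.m matrix (my addition, not the workshop code): zero out the NA cells and ask how the two words' chapter-level frequencies co-vary.

# chapters where a word never occurs are NA; treat them as 0
whales.ahabs.m[which(is.na(whales.ahabs.m))] <- 0
# Pearson correlation of the per-chapter frequencies of "whale" and "ahab"
cor(whales.ahabs.m[, "whale"], whales.ahabs.m[, "ahab"])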
Workshop Two Code Examples:
Setup
# Prelims
inputDir <- "data/miniCorpus"
files.v <- dir(path=inputDir, pattern=".*txt")
chunk.size <- 1000 # number of words per chunk
Chunking Function
makeFlexTextChunks <- function(inputDir, file.name, chunk.size=1000){
  text.file.path <- file.path(inputDir, file.name)
  text.lines.v <- scan(text.file.path, what="character", sep="\n")
  novel.v <- paste(text.lines.v, collapse=" ")
  novel.lower.v <- tolower(novel.v)
  novel.lower.l <- strsplit(novel.lower.v, "\\W")
  novel.word.v <- unlist(novel.lower.l)
  not.blanks.v <- which(novel.word.v != "")
  novel.word.v <- novel.word.v[not.blanks.v]
  x <- seq_along(novel.word.v)
  chunks.l <- split(novel.word.v, ceiling(x/chunk.size))
  # deal with a small chunk at the end: if the last chunk is half the
  # target size or smaller, merge it into the chunk before it
  if(length(chunks.l[[length(chunks.l)]]) <= chunk.size/2){
    chunks.l[[length(chunks.l)-1]] <- c(chunks.l[[length(chunks.l)-1]],
                                        chunks.l[[length(chunks.l)]])
    chunks.l[[length(chunks.l)]] <- NULL
  }
  chunks.l <- lapply(chunks.l, paste, collapse=" ")
  chunks.df <- do.call(rbind, chunks.l)
  return(chunks.df)
}
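Before running the full loop below, the function can be tried on a single file to see what it returns (a usage sketch; assumes files.v from the Setup block found at least one file):

chunk.m <- makeFlexTextChunks(inputDir, files.v[1], chunk.size)
dim(chunk.m)                  # one row per chunk, one text column
substr(chunk.m[1, 1], 1, 60)  # the first 60 characters of the first chunk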
Loop for chunking each text in the corpus directory
topic.l <- list()
for(i in 1:length(files.v)){
  chunk.m <- makeFlexTextChunks(inputDir, files.v[i], chunk.size)
  # use the file name (minus its extension) as the text name
  textname <- gsub("\\..*", "", files.v[i])
  # give each chunk a unique id of the form textname_1, textname_2, ...
  segments.m <- cbind(paste(textname, 1:nrow(chunk.m), sep="_"), chunk.m)
  topic.l[[textname]] <- segments.m
}
# stack the per-text chunk matrices into one big matrix
topic.m <- do.call(rbind, topic.l)
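A quick check (my addition) that the loop produced the expected structure:

dim(topic.m)        # total chunk count across the corpus, by 2 columns
head(topic.m[, 1])  # chunk ids of the form textname_1, textname_2, ...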
Convert the matrix to a data frame for mallet processing
# Prepare for mallet
documents <- as.data.frame(topic.m, stringsAsFactors=FALSE)
colnames(documents) <- c("id", "text")
Load and run Mallet
library(mallet)
mallet.instances <- mallet.import(documents$id, documents$text,
                                  "data/stoplist.csv", FALSE,
                                  token.regexp="[\\p{L}']+")
## Create a topic trainer object.
topic.model <- MalletLDA(num.topics=43)
topic.model$loadDocuments(mallet.instances)
vocabulary <- topic.model$getVocabulary()
# examine some of the vocabulary
vocabulary[1:50]
word.freqs <- mallet.word.freqs(topic.model)
# examine some of the word frequencies:
head(word.freqs)
topic.model$setAlphaOptimization(40, 80)
topic.model$train(400)
topic.words.m <- mallet.topic.words(topic.model, smoothed=TRUE,
                                    normalized=TRUE)
# how big is the resulting matrix?
dim(topic.words.m)
# set the column names to make the matrix easier to read:
colnames(topic.words.m) <- vocabulary
# examine a specific topic
topic.num <- 1      # the topic id you wish to examine
num.top.words <- 10 # the number of top words in the topic you want to examine
mallet.top.words(topic.model, topic.words.m[topic.num,], num.top.words)
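Session Four lists topic data analysis; a natural next step after training (a sketch using the mallet package's mallet.doc.topics function) is to pull the document-by-topic proportions and see which chunks give a topic the most weight:

# document-by-topic proportions: one row per chunk, one column per topic
doc.topics.m <- mallet.doc.topics(topic.model, smoothed=TRUE, normalized=TRUE)
dim(doc.topics.m)
# label the rows with the chunk ids, then rank chunks by weight in topic 1
rownames(doc.topics.m) <- documents$id
head(sort(doc.topics.m[, topic.num], decreasing=TRUE), 10)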
Visualize topics as word clouds
# be sure you have installed the wordcloud package
library(wordcloud)
topic.num <- 1
num.top.words <- 100
topic.top.words <- mallet.top.words(topic.model,
                                    topic.words.m[topic.num,], num.top.words)
wordcloud(topic.top.words$words, topic.top.words$weights,
          c(4,.8), rot.per=0, random.order=FALSE)
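To draw a cloud for every topic rather than one at a time, the same calls can be wrapped in a loop (a sketch of my own, not workshop code):

# loop over all 43 topics (matching num.topics above) and draw one cloud each
for(topic.num in 1:43){
  topic.top.words <- mallet.top.words(topic.model,
                                      topic.words.m[topic.num,], num.top.words)
  wordcloud(topic.top.words$words, topic.top.words$weights,
            c(4,.8), rot.per=0, random.order=FALSE)
}

Parsing TEI XML (supplementary sketch)
The Workshop Two summary promises parsing of TEI-based XML, which the examples above do not cover. Here is a minimal sketch using the XML package; the file name and the XPath expression are assumptions for illustration, not workshop code:

library(XML)
# parse a TEI document (hypothetical file name)
doc <- xmlTreeParse("data/XML/melville.xml", useInternalNodes=TRUE)
# TEI documents conventionally declare this namespace
ns <- c(tei="http://www.tei-c.org/ns/1.0")
# extract the text content of every paragraph element in the body
paras.v <- xpathSApply(doc, "//tei:body//tei:p", xmlValue, namespaces=ns)
length(paras.v) # how many paragraphs were captured?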