Michigan State University: February 20-21, 2014

Introduction to Text Analysis and Topic Modeling with R

General Description:

Introduction to Text Analysis and Topic Modeling with R is a set of two workshops that will provide a practical introduction to text analysis with a special emphasis on topic modeling. Taken together, the workshops will cover basic text processing, data ingestion, data preparation, and topic modeling. The main computing environment for the workshops will be R: “the open source programming language and software environment for statistical computing and graphics.”

While no programming experience is required, students must have basic computer skills, must be familiar with their computer’s file system, and must be comfortable entering commands in a command line environment.

Though the two workshops are designed to stand alone, the second one is more advanced and assumes some basic familiarity with topic modeling. Participants might want to visit The LDA Buffet is Now Open; or, Latent Dirichlet Allocation for English Majors for a general overview.

Suggested Workshop Preparation:

While not required, participants are encouraged to work through at least the first two of the seven basic R lessons available at R Code School prior to taking this workshop.

In advance of the workshop, students should:

  1. Download the current version of R (version 3.0.0 at the time of this writing) from the CRAN website (see http://cran.at.r-project.org) by clicking on the link that is appropriate to your operating system:
    • If you use MS Windows, click on “base” and then on the link to the executable (i.e. “.exe”) setup file.
    • If you are running Mac OS X, choose the link to the most current package.
    • If you use Linux, choose your distribution and then the installer file. 
Follow the instructions for installing R on your system in the standard or “default” directory. You will now have the base installation of R on your system.
    • If you are on a Windows or Macintosh computer, you will find the R application in the directory where Programs (Windows) or Applications (Macintosh) are stored. You can double-click the icon to launch the R GUI, but we will not be using the R GUI in the workshop. We will use RStudio (see below).
    • If you are on a Linux/Unix system, simply type “R” at the command line to enter the R program environment.
  2. Download and Install RStudio
    • The R GUI application is fine for a lot of simple programming, but RStudio offers a much nicer environment for writing and running R programs. RStudio is an IDE (“Integrated Development Environment”) for R, and it runs happily on Windows, Mac, and Linux. After you have installed R (by following the instructions above), download the “Desktop” version (i.e. not the Server version) of RStudio from http://www.rstudio.com. Follow the installation instructions and then launch RStudio just like you would any other program/application. When you launch RStudio, you do not have to also launch the R program. RStudio accesses the R program you installed in the first step.
  3. Download the workshop materials.

Workshop Syllabus:

IMPORTANT: It is critical that you arrive on time to every session and be ready to roll with RStudio installed and running. The workshop will begin on schedule, and if you miss the first few minutes of any session you’ll be lost!

Workshop One:
Introduction to Text Analysis with Applications in R

Thursday, February 20, 10:00 – 12:30

Summary: In this workshop you will be introduced to the R programming language while learning the basics of computational text analysis. You will learn basic R syntax and be introduced to the RStudio programming environment. Text analysis topics covered will include text ingestion and tokenization, word frequency analysis, dispersion plots, and, if time permits, correlation analysis.

  • SESSION ONE (10:00-11:15)
    • The R computing environment
    • R console vs. RStudio
    • Basic text manipulation in R
    • Word Frequency
  • BREAK (11:15-11:30)
  • SESSION TWO (11:30-12:30)
    • Dispersion Plots
    • Correlation
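
As a rough preview of the correlation topic in Session Two, the sketch below correlates the per-chapter frequencies of two words and then runs a simple shuffle test. The frequency values and all variable names are invented for illustration; this is not the workshop's own code.

```r
# Hypothetical per-chapter relative frequencies (percent) for two words;
# these numbers are invented for illustration.
whale_freqs <- c(0.20, 0.50, 0.10, 0.80, 0.40, 0.30)
ahab_freqs  <- c(0.10, 0.40, 0.20, 0.70, 0.30, 0.35)

# Pearson correlation between the two frequency series
observed_r <- cor(whale_freqs, ahab_freqs)

# Randomization check: shuffle one series many times and see how often
# chance alone gives a correlation at least as strong as the observed one
set.seed(42)  # fixed seed so the shuffles are reproducible
shuffled_r <- replicate(1000, cor(sample(whale_freqs), ahab_freqs))
p_estimate <- mean(abs(shuffled_r) >= abs(observed_r))

observed_r
p_estimate
```

With real data, each vector would hold one value per chapter, so the two vectors must be the same length and in the same chapter order.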

Workshop Two:
Introduction to Topic Modeling with Applications in R

Friday, February 21, 9:10 – 12:00

IMPORTANT: It is critical that you arrive on time to every session and be ready to roll with RStudio installed and running. The workshop will begin on schedule, and if you miss the first few minutes you’ll be lost!

Summary: In this workshop, you will be introduced to topic modeling and learn how to analyze and visualize topic model output in R. For this work, we will use the R implementation of MALLET that was developed by David Mimno. Students will also learn how to parse TEI-based XML and how to segment large texts into chunks. We will discuss various text pre-processing procedures, including how to do part-of-speech tagging in R using the openNLP package. Though this will be a hands-on workshop, some techniques explored here are quite advanced, and those unfamiliar with such things as XML document structure and basic text analysis may find it better to observe and then use the included documentation to practice the techniques at home.

  • SESSION THREE (9:10-10:30)
    • Loading a corpus
    • Preparing files for Topic Modeling
  • BREAK (10:30-10:45)
  • SESSION FOUR (10:45-12:00)
    • Running the Model
    • Exploring topic coherence with term clouds
    • Topic data analysis

Workshop One Code Examples:

Word Frequency
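
For readers who want a feel for this step before the session, here is a minimal base R sketch of tokenizing a text and counting word frequencies. The sample lines and variable names are assumptions made for illustration, not the workshop materials; in practice the lines would come from a file read with scan() or readLines().

```r
# A few sample lines stand in for a full text (an assumption for this sketch)
text_lines <- c("Call me Ishmael.",
                "Some years ago, never mind how long precisely,",
                "I thought I would sail about a little and see",
                "the watery part of the world.")

# Collapse the lines into one lowercase string
text_string <- tolower(paste(text_lines, collapse = " "))

# Tokenize by splitting on runs of non-word characters, then drop empty tokens
words <- unlist(strsplit(text_string, "\\W+"))
words <- words[words != ""]

# Count each word type and sort from most to least frequent
word_freqs <- sort(table(words), decreasing = TRUE)

head(word_freqs, 5)
```

Splitting on "\\W+" is a simple choice that also breaks apart contractions and hyphenated words; a real project may want a more careful tokenizer.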

Accessing and Comparing Word Frequency Data
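
One possible way to look up and compare individual words is sketched below, assuming a vector of tokens like the one produced by a tokenization step. The token vector and names are hypothetical.

```r
# Hypothetical tokens; a real vector would come from a tokenized text
words <- c("he", "said", "she", "said", "he", "went",
           "he", "left", "she", "stayed")

# Raw counts per word type
word_freqs <- table(words)

# Relative frequencies as percentages of all tokens, which makes
# texts of different lengths comparable
rel_freqs <- 100 * (word_freqs / sum(word_freqs))

# Access single words by name
rel_freqs["he"]
rel_freqs["she"]

# Compare two words directly: how many times more often "he" occurs than "she"
he_she_ratio <- as.numeric(word_freqs["he"] / word_freqs["she"])
he_she_ratio
```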

Token Distribution Analysis and Dispersion Plots
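
A dispersion plot shows where in a text a given token occurs. The sketch below is one simple way to build one in base R; the short token vector and the variable names are assumptions for illustration only.

```r
# Hypothetical token sequence standing in for a whole novel
words <- c("the", "whale", "swam", "and", "the",
           "ship", "followed", "the", "whale")

# Treat each token's index as a point in "novelistic time"
n_time <- seq_along(words)

# Positions at which the token of interest occurs
whale_hits <- which(words == "whale")

# Build a vector that is 1 where the token occurs and NA everywhere else
whale_track <- rep(NA, length(n_time))
whale_track[whale_hits] <- 1

# Draw one vertical line per occurrence (only when run interactively,
# for example in the RStudio console)
if (interactive()) {
  plot(whale_track, type = "h", ylim = c(0, 1), yaxt = "n",
       xlab = "Position in text", ylab = "whale",
       main = "Dispersion plot of 'whale'")
}
```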

Token Distribution Analysis and Dispersion Plots (Using Grep to find Chapter Breaks)
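
To illustrate the idea of using grep to find chapter breaks, here is a small sketch that locates chapter headings and counts the words in each chapter. It assumes headings of the form "CHAPTER 1"; the sample lines, the heading pattern, and the names are all hypothetical.

```r
# Hypothetical text, one element per line, with "CHAPTER n" headings
text_lines <- c("CHAPTER 1", "Call me Ishmael.", "Some years ago.",
                "CHAPTER 2", "I stuffed a shirt or two into my bag.",
                "CHAPTER 3", "Entering that inn.", "It was a queer place.")

# Line numbers on which a chapter heading starts
chapter_starts <- grep("^CHAPTER [0-9]", text_lines)

# Add a boundary one past the last line so the final chapter has an end
boundaries <- c(chapter_starts, length(text_lines) + 1)

# For each chapter, take the lines between its heading and the next
# boundary, tokenize them, and count the non-empty tokens
chapter_word_counts <- sapply(seq_along(chapter_starts), function(i) {
  body <- text_lines[(boundaries[i] + 1):(boundaries[i + 1] - 1)]
  tokens <- unlist(strsplit(tolower(paste(body, collapse = " ")), "\\W+"))
  sum(tokens != "")
})
names(chapter_word_counts) <- text_lines[chapter_starts]

chapter_word_counts
```

The sketch assumes every heading is followed by at least one line of text; a chapter with no body lines would need an extra check.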

Workshop Two Code Examples:

Setup

Chunking Function
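
Topic models tend to work better on segments of a long text than on whole novels, which is why the workshop segments texts into chunks. The function below is a minimal sketch of one way to do that; make_chunks and its chunk_size parameter are hypothetical names, not the function used in the workshop.

```r
# Split a vector of tokens into consecutive chunks of chunk_size tokens
# and return one space-separated string per chunk.
# (make_chunks and chunk_size are illustrative names.)
make_chunks <- function(words, chunk_size = 1000) {
  # Assign every token a chunk number: 1, 1, ..., 2, 2, ...
  groups <- ceiling(seq_along(words) / chunk_size)
  chunks <- split(words, groups)
  # Paste each chunk's tokens back together into a single string
  vapply(chunks, paste, character(1), collapse = " ")
}

# Tiny example: seven tokens in chunks of three
words <- c("a", "b", "c", "d", "e", "f", "g")
chunks <- make_chunks(words, chunk_size = 3)
chunks
```

In this simple version the last chunk can be much shorter than the others; one option is to merge a short final chunk into the one before it.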

Loop for chunking each text in the corpus directory

Convert the matrix to a data frame for mallet processing
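
The sketch below shows one way this conversion might look, assuming the chunking loop has produced a two-column character matrix with a chunk id and the chunk text in each row. The matrix contents and the column names id and text are assumptions for illustration.

```r
# Hypothetical result of chunking: one row per chunk, with a chunk id
# in the first column and the chunk text in the second
chunk_matrix <- rbind(
  c("novel1_1", "call me ishmael some years ago"),
  c("novel1_2", "never mind how long precisely"),
  c("novel2_1", "it is a truth universally acknowledged")
)

# Convert to a data frame, keeping the text as character strings
# rather than factors
documents <- as.data.frame(chunk_matrix, stringsAsFactors = FALSE)
colnames(documents) <- c("id", "text")

str(documents)
```

Keeping both columns as plain character vectors matters because the topic modeling step works from a vector of document ids and a vector of document texts.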

Load and run Mallet

Visualize topics as word clouds