University of Kansas

Introduction to Text Analysis and Topic Modeling with R

September 12, 2013

Instructor Contact:

Matthew L. Jockers
Email: mjockers@unl.edu
Twitter: @mljockers

Description:

This course introduces computational text analysis and topic modeling in R. It covers basic text processing, data ingestion, data preparation, and topic modeling, and all of the work will be done in the R computing environment. No programming experience is required, but students must have basic computer skills, must be familiar with their computer’s file system, and must be comfortable entering commands in a command-line environment.

Suggested Reading:

While not required, students are strongly encouraged to work through at least the first two of the seven basic R lessons available at http://tryr.codeschool.com/ prior to taking this class.

Download and Install R

Go to the CRAN website and click on the link that is appropriate to your operating system.

  • If you use MS Windows, click on “base” and then on the link to the executable (i.e. “.exe”) setup file (currently http://cran.at.r-project.org/bin/windows/base/R-2.14.1-win.exe).
  • If you are running Mac OS X, choose the link to R-2.14.0.pkg (http://cran.at.r-project.org/bin/macosx/R-2.14.0.pkg).
  • If you use Linux, choose your distribution and then the installer file.

Follow the instructions for installing R on your system in the standard or “default” directory. You will now have the base installation of R on your system.

  • If you are on a Windows or Macintosh computer, you will find the R application in the directory where Programs (Windows) or Applications (Macintosh) are stored. To launch R, double-click the icon to start the R GUI.
  • If you are on a Linux/Unix system, simply type “R” at the command line to enter the R program environment.
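
Once R is running, a quick check can confirm that the installation works. The lines below are a minimal sketch, not part of the class materials; the variable names r_version and check are chosen here for illustration. Type them at the R prompt:

```r
# Print the version string of the R installation that is currently running.
r_version <- R.version.string
print(r_version)

# Evaluate a trivial calculation to confirm the console accepts commands.
check <- 1 + 1
print(check)
```

If R prints a version string and the number 2, the base installation is ready to use.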

Download and Install RStudio

RStudio is an open source application that offers a very nice user environment for writing and running R programs. RStudio is an IDE (“Integrated Development Environment”) for R, and it runs happily on Windows, Mac, and Linux. After you have installed R (by following the instructions above), download the “Desktop” version (i.e. not the Server version) of RStudio from http://www.rstudio.com. Follow the installation instructions and then launch RStudio just like you would any other program/application.

Download the Class Materials:

Class Materials (22MB Zip File)

Workshop Schedule:

(IMPORTANT: It is critical that you arrive on time to every session and be ready to roll with RStudio installed and running. The workshop will begin on schedule, and if you miss the first few minutes you’ll be lost.)

  • SESSION ONE (9:00-10:15)
    • The R computing environment
    • R console vs. RStudio
    • Basic text manipulation in R
    • Word Frequency
  • BREAK (10:15-10:30)
  • SESSION TWO (10:30-12:00)
    • Dispersion Plots
    • Correlation
  • LUNCH BREAK (12:00-1:00)
  • SESSION THREE (1:00-2:30)
    • Loading a corpus
    • Preparing files for Topic Modeling
  • BREAK (2:30-2:45)
  • SESSION FOUR (2:45-4:30 or 5:00)
    • Running the Model
    • Exploring topic coherence with term clouds
    • Topic data analysis
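
As a taste of the Session One topics (basic text manipulation and word frequency), here is a minimal sketch in base R. It assumes a short made-up sentence rather than a text from the class materials, and the variable names are illustrative only; the workshop itself may take a different approach.

```r
# A made-up sample sentence (an assumption, not from the class materials).
text <- "The whale is the thing; the whale is all."

# Lowercase the text and split it on runs of non-word characters.
words <- unlist(strsplit(tolower(text), "\\W+"))

# Drop any empty strings the split may have produced.
words <- words[words != ""]

# Count how often each word occurs, sorted from most to least frequent.
freqs <- sort(table(words), decreasing = TRUE)
print(freqs)
```

Replacing the sample sentence with the contents of a plain-text file (read in with scan() or readLines()) would give the same kind of frequency table for a whole document.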