
English 184E: Literary Text Mining

Ryan Heuser
Tu/Th 4:30 - 6:20

This course will train students in applied methods for computationally analyzing texts for humanities research. Students will gain skills in basic programming for textual analysis, applied statistical evaluation of results, and the presentation of those results in a formal research paper or talk. Students will also learn the prerequisite steps of such an analysis, including corpus selection and cleaning, metadata collection, and selecting and creating an appropriate visualization for the results.

Syllabus text: 

What happens when computers read Shakespeare? What can digital methods tell us about literary language: about how it works, how it evolved, and how it relates to other forms of language? How, for instance, are the rhythms of poetry different from the rhythms of newspapers? What kinds of social networks do plays and novels create? How are female characters in novels described differently than male characters? New techniques in the digital humanities, computational linguistics, and natural language processing make it possible for scholars to ask these questions of literature. In this course, we will learn how to participate in this exciting new area of interdisciplinary research, while also probing its challenges and limitations.

This course is an introduction to the theories and methods of computational literary studies. It presumes no background in programming, computer science, or literary criticism. Students will begin with the building blocks of the Python programming language before moving on to more complex analyses of literary texts. Students will learn a variety of ways to discover patterns in textual data, visualize these patterns, and present them as part of a broader literary critical argument. Each week will present a new method of analysis, along with a canonical example of digital humanities criticism. We will progress through several modules of classes, each representing a distinct, and perhaps broader, domain of literary language: from words, to sentence syntax, to narratorial style, to character-spaces, to semantics and thematics, and finally to genres and the broader literary field. At the same time, throughout the course we will read one novel and one short story cycle, each of which rapidly alternates among narrators, styles, plots, and settings. We will try to model this formal experimentation computationally, while also reflecting on what aspects of the texts elude such methods of analysis.

Course objectives

Upon completing this course, students will be able to:

  • Distinguish between methods of computational literary analysis, and identify the appropriate method for a given literary critical question.
  • Digitize texts and build corpora appropriate for a given literary critical question.
  • Program custom literary text analyses with the Python programming language, as well as call upon established tools and algorithms.
  • Analyze and visualize data to determine its impact on a literary critical question.
  • Deploy data and visualizations as part of a literary-critical argument.


Course texts

We will read one novel and one short story cycle in this course, listed below in that order.

  • Karen Tei Yamashita, Tropic of Orange (Minneapolis: Coffee House Press, 1997)
  • Jennifer Egan, A Visit from the Goon Squad (New York: Knopf, 2010)


We will also read a number of examples of digital humanities criticism, including:

  • Alan Liu, “The Meaning of the Digital Humanities,” PMLA 128.2 (2013): 409-423.
  • Katherine Bode, “The Equivalence of ‘Close’ and ‘Distant’ Reading,” MLQ 78.1 (2017): 77-106.
  • Lisa Marie Rhody, “Why I Dig: Feminist Approaches to Text Analysis,” Debates in the Digital Humanities (Minneapolis: University of Minnesota Press, 2016).
  • Ted Underwood, David Bamman, and Sabrina Lee, “The Transformation of Gender in English-Language Fiction,” Cultural Analytics (13 Feb 2018).
  • Matthew Wilkens, “The Geographic Imagination of Civil War Era Fiction,” American Literary History 25.4 (2013): 803-40.
  • Michael Gavin, “Vector Semantics, William Empson, and the Study of Ambiguity,” Critical Inquiry 44 (2018): 641-73.


Homework due at the beginning of each week will give students ample opportunity to practice the techniques learned in the course. In addition to homework and class participation, students’ grades will be assessed according to two major assignments: a midterm project, done individually, of a single data visualization embedded in a 3-5 page critical argument; and a final project, done collaboratively in small groups, of a 10-12 page essay in which data analysis and visualization play a key argumentative role.




  • Week 1. What is literary text mining? and Introduction to Python


  • Week 2. Key words in context (KWIC) and interpreting concordances
  • Week 3. Word counts and quantitative significance


  • Week 4. Most distinctive words (MDW) and authorial style


  • Week 6. Named entity recognition (NER) and literary geography


  • Week 7. Social network analysis and character networks


  • Week 8. Topic modeling and thematic analysis
  • Week 9. Word embedding models and semantic analysis


  • Week 9. Machine learning and genre classification


  • Week 10. Final project workshopping
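
To give a sense of the early units in the schedule above (key words in context and word counts), here is a minimal sketch in Python using only the standard library. It is an illustration, not the course's own code: the function names `tokenize` and `kwic`, the `window` parameter, and the one-line sample text are all assumptions chosen for this example, and the tokenizer is deliberately crude.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for each occurrence of keyword."""
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            # Take up to `window` tokens on each side of the match.
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits


# Hypothetical sample text; a real analysis would read a full corpus from disk.
sample = "To be, or not to be, that is the question."
tokens = tokenize(sample)

# Word counts: how often each token appears.
counts = Counter(tokens)

# Concordance: each occurrence of "be" with two words of context on either side.
for left, word, right in kwic(tokens, "be", window=2):
    print(f"{left:>12} | {word} | {right}")
```

A concordance like this supports close reading of how a word is used, while the `Counter` gives the raw frequencies that later methods, such as finding most distinctive words, would build on.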