Topic Analysis

Topic analysis groups recurring word patterns in a large corpus. The interpretation remains yours.


What it is

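Topic analysis identifies clusters of co-occurring words across a corpus. Latent Dirichlet Allocation (LDA) and the Structural Topic Model (STM) are the standard models students encounter first. BERTopic and related embedding methods fill a similar role with different machinery: they cluster document embeddings and then describe each cluster by its most characteristic words. The output is usually a ranked word list for each topic and a set of topic proportions for each document.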
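
To make those two outputs concrete, here is a minimal sketch of how they are usually read. It does not fit a model. The vocabulary, the matrices topic_word and doc_topic, and the helper top_words are all hypothetical, and every number is made up for illustration.

```python
# Hypothetical output of a fitted topic model; all numbers are made up.
vocab = ["election", "vote", "party", "river", "flood", "rain"]

# topic_word[k][v]: weight of vocabulary word v in topic k (each row sums to 1).
topic_word = [
    [0.40, 0.30, 0.20, 0.04, 0.03, 0.03],
    [0.02, 0.03, 0.05, 0.30, 0.35, 0.25],
]

# doc_topic[d][k]: share of document d attributed to topic k (each row sums to 1).
doc_topic = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.5, 0.5],
]


def top_words(weights, vocab, n=3):
    """Return the n highest-weighted words of one topic, best first."""
    ranked = sorted(zip(vocab, weights), key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:n]]


# One ranked word list per topic: this is what you would try to name.
word_lists = [top_words(row, vocab) for row in topic_word]
for k, words in enumerate(word_lists):
    print(f"topic {k}: {', '.join(words)}")

# One row of proportions per document: this is what you would compare across documents.
for d, shares in enumerate(doc_topic):
    print(f"doc {d}: " + ", ".join(f"topic {k} = {s:.0%}" for k, s in enumerate(shares)))
```

The word lists are what you label and argue about. The proportions are what you aggregate or compare across documents.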

A topic model identifies statistical regularities. Naming the topics, deciding whether they are meaningful, and explaining what they do for the argument remain your responsibility.


What you learn in the DH course

In the DH course, students treat topic models as aids to interpretation. The work centers on these tasks.

  • Reading LDA as mixed membership: each document is a mixture of topics, and each topic is a distribution over words
  • Reading STM as LDA plus covariates that shift topic prevalence and content
  • Comparing embedding-based topic methods (BERTopic, Top2Vec) with LDA
  • Choosing K (the number of topics) with diagnostics and interpretability checks
  • Validating topics through intruder tests and human coding on a sample
  • Reporting model choices in a methodology chapter
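
The intruder test in the list above can be sketched in a few lines. This is a simplified version under stated assumptions, not a standard implementation: the function names make_intrusion_item and precision are hypothetical, the topic word lists are made up, and the intruder is simply a top word from another topic that is absent from the target topic's top words.

```python
import random


def make_intrusion_item(topics, target, rng, n_top=5):
    """Build one word-intrusion item for the target topic.

    topics maps a topic id to its words, ranked from most to least probable.
    Returns the shuffled options shown to an annotator and the true intruder.
    """
    shown = topics[target][:n_top]
    # Candidate intruders: top words of other topics that are not shown for the target.
    candidates = [
        word
        for other, words in topics.items()
        if other != target
        for word in words[:n_top]
        if word not in shown
    ]
    intruder = rng.choice(candidates)
    options = shown + [intruder]
    rng.shuffle(options)
    return options, intruder


def precision(picks, intruder):
    """Share of annotators who correctly identified the intruder."""
    return sum(pick == intruder for pick in picks) / len(picks)


# Hypothetical ranked word lists for two topics.
topics = {
    "A": ["election", "vote", "party", "campaign", "ballot", "minister"],
    "B": ["flood", "river", "rain", "storm", "dam", "rescue"],
}

rng = random.Random(0)  # fixed seed so the item is reproducible
options, intruder = make_intrusion_item(topics, "A", rng)
print(options)
```

If annotators reliably spot the intruder, the topic's top words hang together for human readers. If they guess at chance, the topic may be a statistical artifact that needs no name.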

What you need to learn first

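  • Preprocessing. Topic models are sensitive to preprocessing choices such as stopword removal, stemming or lemmatization, and pruning of rare words, so the same corpus can yield different topics. See Preprocessing.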
  • Basic statistics and probability. You need enough to understand “mixture over distributions” without treating the model as magic.
  • R or Python. STM is implemented in the R package stm. LDA and BERTopic have strong Python tooling (gensim, scikit-learn, bertopic).
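
As a rough illustration of what “mixture over distributions” means, the sketch below combines one document's topic proportions with two topic-word distributions to get the word probabilities the model implies for that document. The names theta, beta, and word_probabilities are hypothetical, and all numbers are made up; nothing here is estimated from data.

```python
import random

vocab = ["love", "heart", "letter", "ship", "sea", "storm"]

# beta[k][v]: probability of word v under topic k (made-up values, rows sum to 1).
beta = [
    [0.35, 0.30, 0.25, 0.04, 0.03, 0.03],
    [0.03, 0.03, 0.04, 0.30, 0.35, 0.25],
]

# theta[k]: share of one document that belongs to topic k (made-up values).
theta = [0.7, 0.3]


def word_probabilities(theta, beta):
    """Mix the topic-word distributions, weighting each topic by its share."""
    n_words = len(beta[0])
    return [
        sum(share * topic[v] for share, topic in zip(theta, beta))
        for v in range(n_words)
    ]


probs = word_probabilities(theta, beta)
for word, p in zip(vocab, probs):
    print(f"{word}: {p:.3f}")

# Drawing words from the mixture imitates how the model "imagines" a document.
rng = random.Random(1)
print(rng.choices(vocab, weights=probs, k=8))
```

The document is mostly about the first topic, so its words dominate, but words from the second topic still appear. That is the mixed-membership idea in miniature.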

What you can do with it

  • Track how themes in a news corpus shift across a political crisis
  • Compare how political parties frame the same issue
  • Identify candidate genres in a literary corpus
  • Choose passages for later close reading
  • Produce a descriptive map of a corpus that is too large to read end to end
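
Several of these uses reduce to the same step: average the per-document topic proportions within a group, then compare groups. The sketch below assumes you already have proportions from a fitted model. The function mean_topic_share_by_group, the group labels, and all numbers are hypothetical.

```python
from collections import defaultdict


def mean_topic_share_by_group(records):
    """Average topic proportions within each group.

    records is a list of (group_label, topic_shares) pairs, where
    topic_shares is one document's list of topic proportions.
    """
    sums = {}
    counts = defaultdict(int)
    for group, shares in records:
        if group not in sums:
            sums[group] = [0.0] * len(shares)
        for k, share in enumerate(shares):
            sums[group][k] += share
        counts[group] += 1
    return {
        group: [total / counts[group] for total in totals]
        for group, totals in sums.items()
    }


# Made-up proportions for four news articles and two topics.
records = [
    ("before", [0.8, 0.2]),
    ("before", [0.6, 0.4]),
    ("during", [0.3, 0.7]),
    ("during", [0.1, 0.9]),
]

means = mean_topic_share_by_group(records)
for group, shares in means.items():
    print(group, [round(s, 2) for s in shares])
```

With time periods as the group label this tracks thematic shift. With party names as the label it supports a comparison of framing. Either way, the averages describe the corpus and still need interpretation before they support a claim.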