Topic Analysis

Topic analysis groups recurring word patterns in a large corpus. The interpretation remains yours.

What it is

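Topic analysis identifies clusters of co-occurring words across a corpus. Latent Dirichlet Allocation (LDA) and the Structural Topic Model (STM) are the standard models students encounter first. BERTopic and related embedding methods fill a similar role with different machinery: they cluster document embeddings rather than fitting a probabilistic model of word counts. The output is usually a set of word lists, one per topic, and a topic proportion for each document.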

A topic model identifies statistical regularities. Naming the topics, deciding whether they are meaningful, and explaining what they do for the argument remain your responsibility.
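
As a rough illustration of that output, the sketch below takes two hypothetical tables of the kind a fitted model returns (topic-by-word weights and document-by-topic proportions) and prints the top words per topic and the dominant topic per document. All names and numbers are invented for illustration; they are not results from a real corpus, and the helper functions are not part of any library.

```python
# Toy illustration of topic-model output; every value below is made up.
vocab = ["election", "vote", "party", "virus", "vaccine", "hospital"]

# Hypothetical topic-word weights: one row per topic, one column per vocab word.
topic_word = [
    [0.40, 0.30, 0.20, 0.04, 0.03, 0.03],
    [0.03, 0.02, 0.05, 0.35, 0.30, 0.25],
]

# Hypothetical document-topic proportions: each row sums to 1.
doc_topic = {
    "doc_a": [0.90, 0.10],
    "doc_b": [0.25, 0.75],
}

def top_words(weights, vocab, n=3):
    """Return the n highest-weighted words for one topic."""
    ranked = sorted(zip(vocab, weights), key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:n]]

def dominant_topic(proportions):
    """Return the index of the topic with the largest share in a document."""
    return max(range(len(proportions)), key=lambda k: proportions[k])

topic_labels = [top_words(row, vocab) for row in topic_word]
dominant = {doc: dominant_topic(p) for doc, p in doc_topic.items()}

for k, words in enumerate(topic_labels):
    print(f"topic {k}: {', '.join(words)}")
for doc, k in dominant.items():
    print(f"{doc}: mostly topic {k}")
```

The word lists are all the model provides. Calling one of them "electoral politics" is an interpretive step that the code does not and cannot perform.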

What you learn in the DH course

In the DH course, students treat topic models as aids to interpretation. The work centers on these tasks.

Reading LDA as mixed membership over word distributions
Reading STM as LDA plus covariates that shift topic prevalence and content
Comparing embedding-based topic methods (BERTopic, Top2Vec) with LDA
Choosing K, the number of topics, with diagnostics and interpretability checks
Validating topics through intruder tests and human coding on a sample
Reporting model choices in a methodology chapter
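
The intruder test in the list above can be sketched in a few lines. This is a simplified word intrusion item, assuming you already have a ranked top-word list for each topic. The function names `make_intruder_item` and `intruder_accuracy` and the example word lists are hypothetical, not part of any library.

```python
import random

def make_intruder_item(topic_words, other_topic_words, n_shown=5, rng=None):
    """Build one word-intrusion item.

    Takes the top words of the topic under test, adds one top word from a
    different topic (the intruder), and shuffles them. If human coders can
    reliably spot the intruder, that is evidence the topic is coherent.
    """
    rng = rng or random.Random()
    shown = list(topic_words[:n_shown])
    # The intruder must not already appear among the tested topic's words.
    candidates = [w for w in other_topic_words if w not in topic_words]
    if not candidates:
        raise ValueError("no valid intruder: the two topics share all words")
    intruder = rng.choice(candidates)
    options = shown + [intruder]
    rng.shuffle(options)
    return options, intruder

def intruder_accuracy(answers, intruders):
    """Share of items on which the coder picked the true intruder."""
    hits = sum(1 for a, truth in zip(answers, intruders) if a == truth)
    return hits / len(intruders)

# Made-up top-word lists for two topics.
politics = ["election", "vote", "party", "campaign", "minister", "poll"]
health = ["virus", "vaccine", "hospital", "patient", "dose"]

options, intruder = make_intruder_item(politics, health, rng=random.Random(0))
print(options)  # six words in shuffled order; one comes from the health topic
```

Accuracy near chance (one in six here) suggests the coder could not tell the topic's words from an outsider, which is a reason to revisit K or the preprocessing before interpreting that topic.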

What you need to learn first

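Preprocessing. Topic models are sensitive to preprocessing choices such as stopword removal, stemming or lemmatization, and vocabulary pruning, so the same corpus can yield different topics under different pipelines. See Preprocessing.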
Basic statistics and probability. You need enough to understand “mixture over distributions” without treating the model as magic.
R or Python. STM is implemented in the R package stm. LDA and BERTopic have strong Python tooling (gensim, scikit-learn, bertopic).
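
To make "mixture over distributions" concrete, here is a minimal sketch of the generative story LDA assumes, written with the Python standard library only. The two topics, their probabilities, and the function names are invented for illustration; a real model estimates these quantities from data rather than taking them as given.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one probability vector from a Dirichlet via normalised gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Generate one document under LDA's generative story.

    topics: list of dicts mapping word -> probability (one dict per topic).
    alpha:  Dirichlet concentration parameters, one per topic.
    """
    # 1. Each document gets its own mixture over topics (mixed membership).
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # 2. Each word position first picks a topic from that mixture...
        k = rng.choices(range(len(topics)), weights=theta)[0]
        # 3. ...then picks a word from that topic's word distribution.
        vocab = list(topics[k])
        probs = list(topics[k].values())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Two made-up topics; the probabilities are illustrative only.
topics = [
    {"election": 0.5, "vote": 0.3, "party": 0.2},
    {"virus": 0.5, "vaccine": 0.3, "hospital": 0.2},
]

rng = random.Random(42)
theta, words = generate_document(topics, alpha=[0.5, 0.5], length=12, rng=rng)
print([round(t, 2) for t in theta])
print(" ".join(words))
```

Fitting a topic model runs this story in reverse: given only the words, it estimates the per-document mixtures and the per-topic word distributions that could plausibly have produced them.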

What you can do with it

Track how themes in a news corpus shift across a political crisis
Compare how political parties frame the same issue
Identify candidate genres in a literary corpus
Choose passages for later close reading
Produce a descriptive map for a larger corpus that would otherwise be impossible to read end-to-end
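
The first use case, tracking themes over time, often comes down to averaging document-topic proportions within each period. The sketch below assumes you already have one proportion vector per document and a period label such as a month. The function name and the numbers are hypothetical.

```python
def mean_topic_share_by_period(records):
    """Average document-topic proportions within each time period.

    records: iterable of (period, proportions) pairs, where proportions is a
    list of topic shares for one document.
    Returns {period: [mean share of topic 0, mean share of topic 1, ...]}.
    """
    sums = {}
    counts = {}
    for period, proportions in records:
        if period not in sums:
            sums[period] = [0.0] * len(proportions)
            counts[period] = 0
        for k, share in enumerate(proportions):
            sums[period][k] += share
        counts[period] += 1
    # Sort periods so the result reads chronologically for ISO-style labels.
    return {p: [s / counts[p] for s in sums[p]] for p in sorted(sums)}

# Made-up proportions for four articles over two months.
records = [
    ("2020-02", [0.8, 0.2]),
    ("2020-02", [0.6, 0.4]),
    ("2020-03", [0.3, 0.7]),
    ("2020-03", [0.1, 0.9]),
]

trend = mean_topic_share_by_period(records)
for period, shares in trend.items():
    print(period, [round(s, 2) for s in shares])
```

A shift in these averages is a descriptive pattern, not an explanation. Whether it reflects a change in coverage, in the mix of sources, or in vocabulary is a question for close reading of sampled documents.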

Related methods

Preprocessing shapes every topic the model produces.
Word Embeddings covers embedding-based topic methods.
Framing Analysis pairs well with topic models when topics are treated as candidate frames.
Discourse Analysis can use topic output to guide sampling.
