Topic Modeling & LDA
Digital Humanities: Text-as-Data (BA3, Korean Studies – Leiden University)
This document introduces Topic Modeling, with a focus on Latent Dirichlet Allocation (LDA), using the Orange Data Mining text-mining add-on.
The goal is to help you understand:
- What topic modeling does
- How LDA works (intuitively)
- How to choose and interpret topics
- How to use Orange’s Topic Modeling and LDAvis widgets effectively
1. What Is Topic Modeling?
Topic modeling is an unsupervised method for discovering hidden thematic structure in large collections of text. It identifies sets of words that tend to appear together and uses these sets (topics) to describe each document.
You can think of topic modeling as answering:
What themes appear across this corpus, and how much does each document express each theme?
Why we use it
- To summarize long or numerous documents
- To identify structure in a corpus
- To compare themes across time, authors, or genres
- To reduce complexity into interpretable patterns
What a “topic” looks like
A topic is a cluster of words that frequently co-occur, for example:
- Topic A: 고구려, 삼국시대, 백제 → Ancient Korea
- Topic B: 독립, 저항, 일제 → Colonial period / resistance
- Topic C: 민주화, 발전, 산업화 → Modern Korea
Documents usually mix several topics in different proportions.
2. LDA (Latent Dirichlet Allocation): The Standard Model
LDA is the most widely used topic modeling algorithm.
It assumes:
- Documents are mixtures of topics. A chapter might be 60% ancient history, 25% colonial history, and 15% modern development.
- Topics are probability distributions over words. Each topic is a weighted list of words that tend to appear together.
- Words in a document are generated by first picking a topic, then picking a word from that topic.
You choose the number of topics (k), and LDA discovers the patterns that best explain the data.
Example Output
Topics (top words):
| Topic | Words |
|---|---|
| T1 | 고구려, 백제, 신라 |
| T2 | 독립, 일제, 저항 |
| T3 | 산업화, 민주화, 발전 |
Document → topic proportions:
| Document | T1 | T2 | T3 |
|---|---|---|---|
| Doc 1 | 0.90 | 0.05 | 0.05 |
| Doc 2 | 0.10 | 0.85 | 0.05 |
| Doc 3 | 0.05 | 0.10 | 0.85 |
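To make this concrete, here is a minimal sketch of the same idea in Python using gensim, a common LDA library. You never write this code when using the Orange widget; the sketch only shows what happens under the hood. The toy documents and k = 3 are illustrative assumptions, not real course data.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy corpus: each document is already tokenized and preprocessed.
docs = [
    ["goguryeo", "baekje", "silla", "kingdom", "ancient"],
    ["independence", "resistance", "colonial", "movement"],
    ["industrialization", "democratization", "development", "economy"],
]

dictionary = Dictionary(docs)                    # word <-> id mapping
corpus = [dictionary.doc2bow(d) for d in docs]   # bag-of-words vectors

# k = 3 topics, fixed seed so the sketch is reproducible
lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=3, passes=10, random_state=42)

# Topics as probability distributions over words
for topic_id, words in lda.print_topics(num_words=5):
    print(f"T{topic_id + 1}:", words)

# Document -> topic proportions (the mixtures shown in the table above)
for i, bow in enumerate(corpus):
    print(f"Doc {i + 1}:", lda.get_document_topics(bow, minimum_probability=0.0))
```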
3. Choosing the Number of Topics
You must specify how many topics the model should find.
Rough guidelines:
| Corpus Size | Suggested k |
|---|---|
| 10–50 docs | 3–6 topics |
| 50–200 docs | 5–12 topics |
| 200+ docs | 10–20 topics |
- Too few topics → themes are overly broad and mixed.
- Too many topics → themes become noisy or incoherent.
In Orange, you can experiment interactively with different values of k and see how the topics change.
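If you want a more systematic complement to eyeballing, a common scripted approach is to compute a topic coherence score for each candidate k and prefer higher values. A hedged sketch using gensim's CoherenceModel follows; the docs variable and the 3–12 range are placeholders, and coherence should guide, not replace, your own reading of the topics.

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

def coherence_for_k(docs, k):
    """Train an LDA model with k topics and return its c_v coherence."""
    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=k, passes=10, random_state=42)
    cm = CoherenceModel(model=lda, texts=docs,
                        dictionary=dictionary, coherence="c_v")
    return cm.get_coherence()

# docs = ...  # your tokenized corpus (a list of lists of tokens)
# for k in range(3, 13):
#     print(k, round(coherence_for_k(docs, k), 3))
```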
4. Using Topic Modeling in Orange
Required widgets (Text Mining add-on)
- Preprocess Text
- Topic Modeling (LDA)
- LDAvis
- Data Table / Distributions
Basic pipeline
Raw Text
↓
Preprocess Text
- Tokenize
- Remove stopwords
- Keep nouns/verbs/adjectives
↓
Topic Modeling (set number of topics)
↓
LDAvis (interpret topics)
↓
Data Table (inspect document–topic proportions)
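For reference, here is a rough scripted sketch of what the Preprocess Text step does. The tokenizer, stopword list, and length filter are illustrative assumptions only; for Korean text you would substitute a morphological analyzer (e.g. KoNLPy) for the simple pattern used here, and the nouns/verbs/adjectives filter would come from its POS tagger.

```python
import re

# Illustrative stopword list only; use a real list for actual work.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}

def preprocess(text):
    """Lowercase, tokenize on word characters, drop stopwords and 1-char tokens."""
    tokens = re.findall(r"\w+", text.lower())  # \w also matches Hangul in Python 3
    return [t for t in tokens if t not in STOPWORDS and len(t) > 1]

raw_texts = ["The rise of the ancient kingdoms",
             "Resistance and the independence movement"]
docs = [preprocess(t) for t in raw_texts]
print(docs)
```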
5. LDA Output in Orange
The Topic Modeling widget produces:
- A list of topics with their top words and weights
- A document–topic matrix (how much each topic appears in each document)
- Outputs that can be sent to Data Table, Distributions, or LDAvis
You can:
- View which words define each topic
- See which documents are most strongly associated with each topic
- Export topic weights for further analysis (e.g., grouping by period or author)
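As a sketch of the last point: once you export the document–topic matrix, you can group it by metadata with a few lines of pandas. The column names (period, T1–T3) are hypothetical; match them to your own exported table.

```python
import pandas as pd

# Hypothetical export: one row per document, topic weights plus metadata.
df = pd.DataFrame({
    "period": ["ancient", "ancient", "colonial", "modern"],
    "T1": [0.90, 0.80, 0.10, 0.05],
    "T2": [0.05, 0.15, 0.85, 0.10],
    "T3": [0.05, 0.05, 0.05, 0.85],
})

# Mean topic proportion per period: how each era's documents mix the topics
print(df.groupby("period")[["T1", "T2", "T3"]].mean())
```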
6. LDAvis: The Most Important Tool for Interpretation
LDAvis is an interactive visualization of LDA results. It shows:
- Topics as circles on a 2D map
- Distances between topics (far = more distinct; close = similar)
- For a selected topic, a bar chart of the most relevant words
- A λ (lambda) relevance slider to balance frequency and distinctiveness
The λ slider
- λ = 1.0 → ranks words by frequency within the topic
- λ = 0.0 → ranks words by how distinctive they are to the topic
- In practice, λ around 0.2–0.35 is often most informative
This helps you see words that are characteristic of a topic, not just common across all topics.
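Under the hood, the slider implements the relevance measure from Sievert & Shirley (2014): relevance(w, t | λ) = λ · log p(w|t) + (1 − λ) · log [ p(w|t) / p(w) ]. A small sketch, with made-up probability values for illustration:

```python
import math

def relevance(p_w_given_t, p_w, lam):
    """Sievert & Shirley (2014) relevance: lam=1.0 ranks by in-topic
    frequency, lam=0.0 by distinctiveness (lift over corpus frequency)."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)

# A word common everywhere vs. one rarer overall but concentrated in the topic
common = relevance(0.05, 0.040, lam=0.3)    # frequent in topic AND corpus
distinct = relevance(0.02, 0.002, lam=0.3)  # rarer, but distinctive to the topic
print(common, distinct)  # the distinctive word scores higher at low lambda
```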
7. Strengths and Limitations of LDA
Strengths
- Reveals hidden thematic structure in large corpora
- Summarizes documents in terms of a manageable set of topics
- Allows documents to mix topics rather than belong to just one cluster
- Works well with LDAvis for interpretation
Limitations
- Topics reflect statistical patterns, not “ground truth” meanings
- Results depend heavily on preprocessing (tokenization, stopwords, POS filtering)
- Results depend on the chosen number of topics
- Some topics can be incoherent or “junk” topics (artifacts of noise)
8. How to Interpret Topics (Responsibly!)
Good practice:
- Use LDAvis to inspect top words for each topic, especially with λ ≈ 0.2–0.35
- Check documents where a given topic has high weight; read them! (a sorting sketch follows this list)
- Give each topic a clear, substantive label based on your domain knowledge
- Watch for topics that mix vocabulary from different periods or themes
- Treat results as hypothesis-generating, not definitive facts
- You can run LDA on subsets of your corpus, but note that the resulting topics will differ from those you would get by modeling the whole corpus.
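A sketch of the close-reading point above: sorting an exported document–topic table to find the documents you should read for a given topic. The column names are hypothetical; match them to your own Data Table export.

```python
import pandas as pd

# Hypothetical document-topic table exported from Orange's Data Table widget
df = pd.DataFrame({
    "title": ["Doc 1", "Doc 2", "Doc 3", "Doc 4"],
    "T2":    [0.05, 0.85, 0.10, 0.60],
})

# The documents most dominated by topic T2 -- these are the ones to close-read
print(df.sort_values("T2", ascending=False).head(3))
```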
Remember:
Topic modeling assists interpretation; it does not replace it.
9. Additional Resources
Academic background
- Blei, David M. (2012). “Probabilistic Topic Models.” Communications of the ACM.
- Jockers, Matthew L. (2014). Text Analysis with R for Students of Literature.
- Sievert, Carson & Kenneth Shirley (2014). “LDAvis: A Method for Visualizing and Interpreting Topics.”
Orange tutorials
Orange maintains a beginner-friendly YouTube playlist:
- Getting Started with Orange — Orange Data Mining (linked in the course materials)
Look in particular for videos on:
- Text Mining
- Topic Modeling
- LDA visualization (LDAvis)
10. Key Takeaways (TL;DR)
- Topic modeling groups words into themes and represents documents as mixtures of those themes.
- LDA is the standard topic model and underlies the Topic Modeling widget in Orange.
- The number of topics and preprocessing choices matter a lot.
- LDAvis (plus the λ slider) is crucial for understanding what each topic “means.”
- Human interpretation, grounded in your substantive expertise, is always required.
Use topic modeling to explore patterns, generate hypotheses, and connect computational findings with close reading and area studies knowledge.