Topic Modeling & LDA

Digital Humanities: Text-as-Data (BA3, Korean Studies – Leiden University)

This document introduces Topic Modeling, with a focus on Latent Dirichlet Allocation (LDA), using the Orange Data Mining text-mining add-on.

The goal is to help you understand:


1. What Is Topic Modeling?

Topic modeling is an unsupervised method for discovering hidden thematic structure in large collections of text. It identifies sets of words that tend to appear together and uses these sets (topics) to describe each document.

You can think of topic modeling as answering:

What themes appear across this corpus, and how much does each document express each theme?

Why we use it

What a “topic” looks like

A topic is a cluster of words that frequently co-occur, for example:

Documents usually mix several topics in different proportions.


2. LDA (Latent Dirichlet Allocation): The Standard Model

LDA is the most widely used topic modeling algorithm.

It assumes:

  1. Documents are mixtures of topics.
    For example, a chapter might be 60% ancient history, 25% colonial history, and 15% modern development.

  2. Topics are probability distributions over words.
    Each topic is a weighted list of words that tend to appear together.

  3. Words in a document are generated by first picking a topic, then picking a word from that topic.

You choose the number of topics (k); LDA then infers the topics and the per-document topic proportions that best explain the words observed in the corpus.
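
The three assumptions above can be read as a recipe for generating a document. The sketch below is a minimal illustration of that recipe using only Python's standard library, not the code Orange runs. The topic word lists, the probabilities, and the `alpha` values are invented for the example, and `sample_dirichlet` and `generate_document` are hypothetical helper names.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, n_words, rng):
    """Generate a toy document following LDA's generative story."""
    # Assumption 1: the document is a mixture of topics (theta sums to 1).
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # Assumption 3: first pick a topic according to theta ...
        k = rng.choices(range(len(topics)), weights=theta)[0]
        # ... then pick a word from that topic's distribution.
        vocab = list(topics[k].keys())
        probs = list(topics[k].values())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Assumption 2: each topic is a probability distribution over words.
# These numbers are made up for illustration.
TOPICS = [
    {"고구려": 0.4, "백제": 0.3, "신라": 0.3},
    {"독립": 0.4, "일제": 0.3, "저항": 0.3},
    {"산업화": 0.4, "민주화": 0.3, "발전": 0.3},
]

rng = random.Random(0)  # fixed seed so the run is repeatable
theta, words = generate_document(TOPICS, alpha=[0.5, 0.5, 0.5], n_words=10, rng=rng)
print([round(p, 2) for p in theta])
print(words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic mixtures and the topic word distributions that most plausibly produced them.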

Example Output

Topics (top words):

Topic   Words
T1      고구려, 백제, 신라
T2      독립, 일제, 저항
T3      산업화, 민주화, 발전

Document → topic proportions:

Document   T1     T2     T3
Doc 1      0.90   0.05   0.05
Doc 2      0.10   0.85   0.05
Doc 3      0.05   0.10   0.85
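
To make the document-topic table concrete, here is a small sketch of how you might read it programmatically. It uses the example proportions shown above; `dominant_topic` is a hypothetical helper name, and this is not an export format that Orange itself produces.

```python
# Document-topic proportions copied from the example output above.
doc_topics = {
    "Doc 1": {"T1": 0.90, "T2": 0.05, "T3": 0.05},
    "Doc 2": {"T1": 0.10, "T2": 0.85, "T3": 0.05},
    "Doc 3": {"T1": 0.05, "T2": 0.10, "T3": 0.85},
}

def dominant_topic(proportions):
    """Return the topic label with the largest proportion."""
    return max(proportions, key=proportions.get)

for doc, proportions in doc_topics.items():
    # Each row is a mixture, so its proportions should add up to 1.
    assert abs(sum(proportions.values()) - 1.0) < 1e-9
    print(doc, dominant_topic(proportions))
# prints:
# Doc 1 T1
# Doc 2 T2
# Doc 3 T3
```

The dominant topic is only a summary. The smaller proportions still matter when a document genuinely mixes themes.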

3. Choosing the Number of Topics

You must specify how many topics the model should find.

Rough guidelines:

Corpus size    Suggested k
10–50 docs     3–6 topics
50–200 docs    5–12 topics
200+ docs      10–20 topics

In Orange, you can experiment interactively with different values of k and see how the topics change.
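
If you want to be systematic about which values of k to try, the rough guidelines can be turned into a list of candidates. The sketch below assumes that the boundary sizes (exactly 50 or 200 documents) fall in the smaller row and that corpora under 10 documents are outside the guidelines. Both choices are assumptions made for this example, and the function names are hypothetical.

```python
def suggested_k_range(n_docs):
    """Return the (low, high) range of k from the rough guidelines."""
    if n_docs < 10:
        # The guidelines do not cover very small corpora (assumption).
        raise ValueError("guidelines start at 10 documents")
    if n_docs <= 50:
        return (3, 6)
    if n_docs <= 200:
        return (5, 12)
    return (10, 20)

def candidate_ks(n_docs):
    """List every k worth trying for a corpus of this size."""
    low, high = suggested_k_range(n_docs)
    return list(range(low, high + 1))

print(candidate_ks(30))   # [3, 4, 5, 6]
print(candidate_ks(120))  # [5, 6, 7, 8, 9, 10, 11, 12]
```

Treat the result as a starting point. The final choice of k should rest on whether the topics are interpretable.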


4. Using Topic Modeling in Orange

Required widgets (Text Mining add-on)

Basic pipeline

Raw Text
    ↓
Preprocess Text
    - Tokenize
    - Remove stopwords
    - Keep nouns/verbs/adjectives
    ↓
Topic Modeling (set number of topics)
    ↓
LDAvis (interpret topics)
    ↓
Data Table (inspect document–topic proportions)
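
For intuition about what the Preprocess Text step does, here is a minimal standard-library sketch of the first two sub-steps, tokenizing and removing stopwords. It does not reproduce Orange's implementation. The stopword list is a tiny invented sample, and the part-of-speech filter is left out because it needs a tagger (for Korean, a morphological analyzer).

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "was", "by"}

def preprocess(text, stopwords=STOPWORDS):
    """Lowercase and tokenize the text, then drop stopwords."""
    # \w+ matches runs of letters and digits, including Hangul.
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stopwords]

raw = "The kingdom of Silla was founded in the south."
print(preprocess(raw))
# ['kingdom', 'silla', 'founded', 'south']
```

Splitting on word boundaries is crude for Korean, where particles attach to nouns, so in practice the tokenizer and part-of-speech settings in the widget matter a great deal.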

5. LDA Output in Orange

The Topic Modeling widget produces:

You can:


6. LDAvis: The Most Important Tool for Interpretation

LDAvis is an interactive visualization of LDA results. It shows:

The λ slider

Adjusting λ changes how the words for a selected topic are ranked. Lower values help you see words that are characteristic of a topic, not just common across all topics.
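
The effect of λ can be illustrated with the relevance score defined for the original LDAvis tool: λ times the log probability of a word within the topic, plus (1 − λ) times the log of that probability divided by the word's overall probability in the corpus. The sketch below assumes that definition. The two words and their probabilities are invented, and `relevance` and `rank` are hypothetical helper names.

```python
import math

def relevance(p_word_given_topic, p_word, lam):
    """Relevance of a word to a topic for a weight lam between 0 and 1."""
    lift = p_word_given_topic / p_word
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(lift)

# Made-up probabilities: (p(word | topic), p(word) in the whole corpus).
words = {
    "한국": (0.08, 0.07),   # frequent in this topic, but frequent everywhere
    "백제": (0.05, 0.006),  # less frequent, but concentrated in this topic
}

def rank(words, lam):
    """Order words from most to least relevant for the given lam."""
    return sorted(words, key=lambda w: relevance(*words[w], lam), reverse=True)

print(rank(words, lam=1.0))  # ['한국', '백제']
print(rank(words, lam=0.3))  # ['백제', '한국']
```

With λ = 1 the ranking follows raw frequency within the topic. Lowering λ promotes the word that is distinctive for the topic.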


7. Strengths and Limitations of LDA

Strengths

Limitations


8. How to Interpret Topics (Responsibly!)

Good practice:

Remember:

Topic modeling assists interpretation; it does not replace it.


9. Additional Resources

Academic background

Orange tutorials

Orange maintains a beginner-friendly YouTube playlist:

Look in particular for videos on:


10. Key Takeaways (TL;DR)

Use topic modeling to explore patterns, generate hypotheses, and connect computational findings with close reading and area studies knowledge.