Topic Modeling (LDA): Korean History Textbooks
Fit LDA on the full 67-book NIKH corpus. Watch coherence scores guide the choice of k, read the discovered topics, and see how textbook themes shift across Colonial, Authoritarian, and Democratic eras.
What we're doing, and why
In lecture we ran LDA on an 11-book sample so the workflow would fit in a class session. Here we run it on the full 67-book NIKH corpus — Korean history textbooks curated by the National Institute of Korean History and supplemented with additional books from the instructor's library, covering 1895 through 2016. The method is the same: Kiwi tokenization on nouns, stopword removal, document-frequency filtering, LDA on the bag-of-words counts.
Three things become visible on the full corpus that the 11-book demo couldn't show you:
- How coherence scores can help you pick k (we'll explain what coherence is — Orange doesn't surface this metric).
- What the discovered topics look like when the model has enough data to separate themes it couldn't separate on 11 books.
- How the era-level topic mix shifts: colonial-era textbooks emphasize different things than post-1987 democratic-era ones.
The corpus
The periodization follows the course convention: Colonial for Japanese-colonial-era textbooks (roughly 1895–1945), Authoritarian for the developmental-state decades (1946–1987), and Democratic for the post-1987 period. The Authoritarian period dominates the corpus by book count because that is when the state published textbooks most intensively.
The R walkthroughs below run the same pipeline the visuals on this page use: the full 67-book NIKH corpus, noun-only tokenization, a 5%–95% document-frequency filter, and LDA at k = 5. Download nikh_corpus.csv from the nlp_corpora repo (or clone the repo) and save it to your data/ folder before running the code.
topicmodels uses Gibbs sampling. Same algorithm family, different solver and different random stream, so the topics you get in R will look similar in content but will not be byte-identical to the ones shown above. If your 67-book fit is slow, swap in the 11-book clustering demo or the 9-book demo from the Data & Scripts page — the code works on any CSV with a full_text column.
Show R code: load the full 67-book NIKH corpus and stopwords
# ── Packages ──────────────────────────────────────────────────────
library(tidyverse)
library(tidytext)
library(elbird) # Korean morphological analysis (Kiwi)
library(topicmodels) # LDA
library(LDAvis) # interactive topic visualization
# ── Load the full 67-book NIKH corpus ─────────────────────────────
# Download once from the nlp_corpora repo and save to your data/ folder:
# https://github.com/scdenney/nlp_corpora/blob/main/data/nikh/nikh_corpus.csv
corpus <- read_csv("data/nikh_corpus.csv") |>
  drop_na(full_text, book_id) |>
  distinct(book_id, .keep_all = TRUE) |>
  mutate(era = case_when(
    period %in% c("Colonial", "Late Choson") ~ "Colonial",
    period == "Democratic" ~ "Democratic",
    TRUE ~ "Authoritarian"
  ))
# ── Korean stopwords ──────────────────────────────────────────────
stopwords_ko <- read_lines("data/stopwords_ko.txt") |>
  str_trim() |>
  discard(~ .x == "")
corpus |>
  count(era, name = "books")
elbird wraps Kiwi, the same tokenizer used in our Orange scripts. topicmodels is the standard R package for LDA. LDAvis is the R sibling of pyLDAvis. The era recoding collapses the five raw period labels into the three-era view used on this page.
Preprocessing
We apply the exact pipeline from the Orange demo, just scripted; the visuals on this page run it in Python, and the R code below mirrors it step for step:
- Tokenize each book with Kiwi.
- Keep only nouns (NNG, NNP).
- Drop stopwords and tokens shorter than 2 characters.
- Drop terms that appear in very few books (likely OCR noise) or in almost every book (nearly-universal, uninformative).
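To make the document-frequency thresholds concrete: with all 67 books in play, 5% and 95% work out to 3 and 63 books respectively. A quick check you can run in the console before the full pipeline:

n_docs <- 67
max(2, floor(0.05 * n_docs))  # min_df: keep words in at least 3 books
floor(0.95 * n_docs)          # max_df: keep words in at most 63 books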
Show R code: tokenize with Kiwi, filter, and build a document-term matrix
# ── Helper: tokenize one text with Kiwi via elbird ────────────────
tokenize_kiwi <- function(text) {
  result <- tokenize(text, flatten = TRUE)
  tibble(form = result$form, tag = result$tag)
}
# ── Tokenize and keep only nouns ──────────────────────────────────
tokens <- corpus |>
  select(book_id, era, full_text) |>
  mutate(morphemes = map(full_text, tokenize_kiwi)) |>
  unnest(morphemes) |>
  filter(
    tag %in% c("NNG", "NNP"),
    !form %in% stopwords_ko,
    str_length(form) >= 2,
    !str_detect(form, "^[0-9]+$")
  ) |>
  select(book_id, era, word = form)
# ── Count (book, word) ────────────────────────────────────────────
counts <- tokens |>
  count(book_id, word)
# ── Document-frequency filter ─────────────────────────────────────
# Same thresholds as the Python pipeline: in ≥ 5% of books and ≤ 95%.
n_docs <- n_distinct(counts$book_id)
min_df <- max(2, floor(0.05 * n_docs))
max_df <- floor(0.95 * n_docs)
doc_freq <- counts |>
  distinct(book_id, word) |>
  count(word, name = "df")
keep_words <- doc_freq |>
  filter(df >= min_df, df <= max_df) |>
  pull(word)
counts_filt <- counts |> filter(word %in% keep_words)
# ── Cast to a document-term matrix for topicmodels ────────────────
dtm <- counts_filt |>
  cast_dtm(document = book_id, term = word, value = n)
dim(dtm) # books × vocabulary size
topicmodels::LDA() expects a DocumentTermMatrix, which is just the raw count table in a shape it understands. cast_dtm() from tidytext does the conversion from a long tidy tibble.
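If you want to sanity-check the cast, tidy() also goes the other way: it converts a DocumentTermMatrix back into the long (document, term, count) form. A quick spot check of the heaviest cells:

tidy(dtm) |>
  arrange(desc(count)) |>
  slice_head(n = 5)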
Choosing k: coherence scores
In class we said k is a research choice with no universal rule. One tool that can help — but that Orange's Topic Modeling widget does not surface — is a coherence score.
The intuition behind coherence: a topic is good when its top words actually appear together in real documents. The c_v coherence score formalizes this: it measures how often each pair of top words from a topic appears together (relative to chance), averaged across topics. Higher is usually better. See Röder et al. (2015) for the full definition.
We fit LDA at several values of k and compute c_v for each.
Coherence isn't the last word. Two topics can look equally “coherent” to the algorithm but not equally useful for your research question. And coherence rewards models that use a narrower vocabulary, which can hide real variety. Read it as one signal, together with reading the topics themselves.
Show R code: compare several values of k with ldatuning
# ── Package for k-selection metrics ───────────────────────────────
library(ldatuning)
# ── Fit LDA at several k values and score each fit ────────────────
tune <- FindTopicsNumber(
  dtm,
  topics = c(3, 4, 5, 6, 7, 8, 10, 12),
  metrics = c("Arun2010", "CaoJuan2009", "Deveaud2014"),
  method = "Gibbs",
  control = list(seed = 42),
  mc.cores = 2,
  verbose = TRUE
)
FindTopicsNumber_plot(tune)
Deveaud2014 should be high (maximize), while Arun2010 and CaoJuan2009 should be low (minimize). The best k is a compromise. These are not identical to the c_v score we plot above, but they answer the same question: which k gives the cleanest topics.
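If you want something closer in spirit to c_v, a UMass-style coherence (Mimno et al. 2011) is easy to compute by hand from document co-occurrence counts. A minimal sketch, assuming the dtm and counts_filt built above; coherence_umass() is our own helper, not a packaged function:

# UMass coherence for one topic's top words (most probable first):
# sum over word pairs of log((co-doc count + 1) / doc count of the
# more probable word). Closer to zero = more coherent.
coherence_umass <- function(words, doc_word) {
  docs_of <- split(doc_word$book_id, doc_word$word)
  words <- intersect(words, names(docs_of))
  pairs <- combn(words, 2, simplify = FALSE)
  sum(map_dbl(pairs, function(p) {
    co <- length(intersect(docs_of[[p[1]]], docs_of[[p[2]]]))
    log((co + 1) / length(docs_of[[p[1]]]))
  }))
}
doc_word <- counts_filt |> distinct(book_id, word)
fit5 <- LDA(dtm, k = 5, method = "Gibbs", control = list(seed = 42))
top10 <- terms(fit5, 10)  # matrix: 10 top words × 5 topics
apply(top10, 2, coherence_umass, doc_word = doc_word)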
The topics
Below are the topics from k = 5, each with its top 15 words and a suggested label (the label is ours, not the algorithm's — you can disagree with it and propose your own).
Show R code: fit LDA and inspect the top words per topic
# ── Fit the model (k = 5 matches the interactive's default) ───────
set.seed(20260420)
k <- 5
lda_fit <- LDA(
  dtm,
  k = k,
  method = "Gibbs",
  control = list(iter = 500, burnin = 200, seed = 20260420)
)
# ── β: topic–word probabilities (ϕ_k(w)) ──────────────────────────
beta <- tidy(lda_fit, matrix = "beta")
# Top 15 words per topic
top_words <- beta |>
  group_by(topic) |>
  slice_max(beta, n = 15, with_ties = FALSE) |>
  ungroup() |>
  arrange(topic, desc(beta))
print(top_words, n = k * 15)
The tidy() function from tidytext pulls β out of the fitted model in long format so you can pipe and plot it.
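To read the topics side by side, the usual tidytext move is a faceted bar chart of the top words. A sketch using the top_words tibble from above:

top_words |>
  mutate(term = reorder_within(term, beta, topic)) |>
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free_y") +
  scale_y_reordered() +
  labs(x = "β (word probability within topic)", y = NULL)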
Documents and their topic mixtures
LDA hands each document a mixture over topics — the θ_d(k) row from lecture. The bar on each row below shows that mixture. The color of each segment matches the topic color above.
(Interactive table: one row per book, with columns for Book, Era, Year, Tokens, the topic-mixture bar, and the Dominant topic.)
Average mixture by era
Averaging each era's topic mixtures gives you a sense of which themes each period concentrated on.
Show R code: document–topic mixtures and era averages
# ── γ: document–topic proportions (θ_d(k)) ────────────────────────
gamma <- tidy(lda_fit, matrix = "gamma") |>
  rename(book_id = document)
# Dominant topic per book (with_ties = FALSE so a tie can't duplicate a book)
dominant <- gamma |>
  group_by(book_id) |>
  slice_max(gamma, n = 1, with_ties = FALSE) |>
  ungroup() |>
  left_join(corpus |> select(book_id, title, era, year),
            by = "book_id") |>
  arrange(year)
print(dominant, n = nrow(dominant))
# ── Era-level topic mix ───────────────────────────────────────────
era_mix <- gamma |>
  left_join(corpus |> select(book_id, era), by = "book_id") |>
  group_by(era, topic) |>
  summarise(share = mean(gamma), .groups = "drop")
era_mix |>
  pivot_wider(names_from = topic, values_from = share,
              names_prefix = "T")
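A stacked bar chart makes the era comparison easier to read than the wide table. A quick sketch from the era_mix tibble (each era's shares sum to 1, so the bars stack to full height):

era_mix |>
  ggplot(aes(era, share, fill = factor(topic))) +
  geom_col() +
  labs(x = NULL, y = "Mean topic share", fill = "Topic")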
LDAvis
LDAvis is an interactive map of the LDA model. The left-hand circles are topics (size = prevalence, distance = dissimilarity). The right-hand bar chart shows the words that define the topic you click on. The λ slider blends raw frequency (λ = 1) with distinctiveness (λ = 0) — start at around 0.3 to see which words make a topic specific.
Opens in an iframe. If it feels cramped, open it in a new tab.
Show R code: build the same LDAvis view from your fitted model
# ── Extract the pieces LDAvis expects ─────────────────────────────
phi <- posterior(lda_fit)$terms # K × V, rows sum to 1
theta <- posterior(lda_fit)$topics # D × K, rows sum to 1
vocab <- colnames(phi)
doc_lengths <- as.matrix(dtm) |> rowSums()
term_freqs <- as.matrix(dtm) |> colSums()
# ── Build the JSON and open it in a browser ───────────────────────
json <- createJSON(
  phi = phi,
  theta = theta,
  doc.length = doc_lengths,
  vocab = vocab,
  term.frequency = term_freqs
)
serVis(json, out.dir = "lda_view", open.browser = TRUE)
That's all serVis() does: it writes the LDAvis HTML to lda_view/ and opens it in your browser. The λ slider, circle map, and bar chart work the same as the iframe above. Share the folder with a classmate and they can open index.html to see your model.
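For the curious, the λ slider implements the relevance score from Sievert & Shirley (2014): relevance(w, k | λ) = λ·log ϕ_k(w) + (1 − λ)·log(ϕ_k(w) / p(w)), where p(w) is the word's overall corpus probability. You can reproduce the ranking from the phi and term_freqs objects above (assuming their columns line up, which they do when both come from the same dtm):

# Rank topic-1 words at λ = 0.3, the slider setting suggested above
lambda <- 0.3
p_w <- term_freqs / sum(term_freqs)  # marginal word probabilities
relevance <- lambda * log(phi) +
  (1 - lambda) * log(sweep(phi, 2, p_w, "/"))
head(sort(relevance[1, ], decreasing = TRUE), 10)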
What to take away
- LDA is a reading aid. The model finds co-occurring words; you read them and decide what they mean. Every topic label on this page is an interpretation.
- Coherence is one signal, not the answer. It's useful to rule out obviously bad k values. It is not a replacement for actually reading the topics.
- Era effects come from the mixture, not the clustering. A textbook can be mostly ancient-history and partly colonial-resistance. Clustering would have put it in one bucket; LDA keeps the mix.
- Orange uses the same algorithm. When you run the final assignment in Orange, you'll get topics in the same shape as these — the gensim library under the hood is the same.