Topic Modeling (LDA): Korean History Textbooks
Fit LDA on the full 67-book NIKH corpus. Watch coherence scores guide the choice of k, read the discovered topics, and see how textbook themes shift across Colonial, Authoritarian, and Democratic eras.
What we're doing, and why
In lecture we ran LDA on an 11-book sample so the workflow would fit in a class session. Here we run it on the full 67-book NIKH corpus — Korean history textbooks curated by the National Institute of Korean History and supplemented with additional books from the instructor's library, covering 1895 through 2016. The method is the same: Kiwi tokenization on nouns, stopword removal, document-frequency filtering, LDA on the bag-of-words counts.
Three things become visible on the full corpus that the 11-book demo couldn't show you:
- How coherence scores can help you pick k (we'll explain what coherence is — Orange doesn't surface this metric).
- What the discovered topics look like when the model has enough data to separate themes it couldn't separate on 11 books.
- How the era-level topic mix shifts: colonial-era textbooks emphasize different things than post-1987 democratic-era ones.
The corpus
The periodization follows the course convention: Colonial for Japanese-colonial-era textbooks (roughly 1895–1945), Authoritarian for the developmental-state decades (1946–1987), and Democratic for the post-1987 period. The Authoritarian period dominates the corpus by book count because that is when the state published textbooks most intensively.
The R walkthroughs below run the same pipeline the visuals on this page use: the full 67-book NIKH corpus, noun-only tokenization, a 5%–95% document-frequency filter, and LDA at k = 5. Download nikh_corpus.csv from the nlp_corpora repo (or clone the repo) and save it to your data/ folder before running the code.
topicmodels uses Gibbs sampling. Same algorithm family, different solver and different random stream, so the topics you get in R will look similar in content but will not be byte-identical to the ones shown above. If your 67-book fit is slow, swap in the 11-book clustering demo or the 9-book demo from the Data & Scripts page — the code works on any CSV with a full_text column.
Show R code: load the full 67-book NIKH corpus and stopwords
# ── Packages ──────────────────────────────────────────────────────
library(tidyverse)
library(tidytext)
library(elbird) # Korean morphological analysis (Kiwi)
library(topicmodels) # LDA
library(LDAvis) # interactive topic visualization
# ── Load the full 67-book NIKH corpus ─────────────────────────────
# Download once from the nlp_corpora repo and save to your data/ folder:
# https://github.com/scdenney/nlp_corpora/blob/main/data/nikh/nikh_corpus.csv
corpus <- read_csv("data/nikh_corpus.csv") |>
  drop_na(full_text, book_id) |>
  distinct(book_id, .keep_all = TRUE) |>
  mutate(era = case_when(
    period %in% c("Colonial", "Late Choson") ~ "Colonial",
    period == "Democratic" ~ "Democratic",
    TRUE ~ "Authoritarian"
  ))
# ── Korean stopwords ──────────────────────────────────────────────
stopwords_ko <- read_lines("data/stopwords_ko.txt") |>
  str_trim() |>
  discard(~ .x == "")
corpus |>
  count(era, name = "books")
elbird wraps Kiwi, the same tokenizer used in our Orange scripts. topicmodels is the standard R package for LDA. LDAvis is the R sibling of pyLDAvis. The era recoding collapses the five raw period labels into the three-era view used on this page.
Preprocessing
We apply the exact pipeline from the Orange demo, just scripted; the visuals on this page run it in Python, and the R code below mirrors it step for step:
- Tokenize each book with Kiwi.
- Keep only nouns (NNG, NNP).
- Drop stopwords and tokens shorter than 2 characters.
- Drop terms that appear in very few books (likely OCR noise) or in almost every book (nearly-universal, uninformative).
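To make the document-frequency thresholds concrete: with all 67 books in play, 5% and 95% work out to 3 and 63 books respectively. A quick check you can run in the console before the full pipeline:

n_docs <- 67
max(2, floor(0.05 * n_docs))  # min_df: keep words in at least 3 books
floor(0.95 * n_docs)          # max_df: keep words in at most 63 books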
Show R code: tokenize with Kiwi, filter, and build a document-term matrix
# ── Helper: tokenize one text with Kiwi via elbird ────────────────
tokenize_kiwi <- function(text) {
  result <- tokenize(text, flatten = TRUE)
  tibble(form = result$form, tag = result$tag)
}
# ── Tokenize and keep only nouns ──────────────────────────────────
tokens <- corpus |>
  select(book_id, era, full_text) |>
  mutate(morphemes = map(full_text, tokenize_kiwi)) |>
  unnest(morphemes) |>
  filter(
    tag %in% c("NNG", "NNP"),
    !form %in% stopwords_ko,
    str_length(form) >= 2,
    !str_detect(form, "^[0-9]+$")
  ) |>
  select(book_id, era, word = form)
# ── Count (book, word) ────────────────────────────────────────────
counts <- tokens |>
  count(book_id, word)
# ── Document-frequency filter ─────────────────────────────────────
# Same thresholds as the Python pipeline: in ≥ 5% of books and ≤ 95%.
n_docs <- n_distinct(counts$book_id)
min_df <- max(2, floor(0.05 * n_docs))
max_df <- floor(0.95 * n_docs)
doc_freq <- counts |>
  distinct(book_id, word) |>
  count(word, name = "df")
keep_words <- doc_freq |>
  filter(df >= min_df, df <= max_df) |>
  pull(word)
counts_filt <- counts |> filter(word %in% keep_words)
# ── Cast to a document-term matrix for topicmodels ────────────────
dtm <- counts_filt |>
  cast_dtm(document = book_id, term = word, value = n)
dim(dtm) # books × vocabulary size
topicmodels::LDA() expects a DocumentTermMatrix, which is just the raw count table in a shape it understands. cast_dtm() from tidytext does the conversion from a long tidy tibble.
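If you want to sanity-check the cast, tidy() also goes the other way: it converts a DocumentTermMatrix back into the long (document, term, count) form. A quick spot check of the heaviest cells:

tidy(dtm) |>
  arrange(desc(count)) |>
  slice_head(n = 5)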
Choosing k: coherence scores
In class we said k is a research choice with no universal rule. One tool that can help — but that Orange's Topic Modeling widget does not surface — is a coherence score.
The intuition behind coherence: a topic is good when its top words actually appear together in real documents. The c_v coherence score formalizes this: it measures how often each pair of top words from a topic appears together (relative to chance), averaged across topics. Higher is usually better. See Röder et al. (2015) for the full definition.
We fit LDA at several values of k and compute c_v for each.
Coherence isn't the last word. Two topics can look equally “coherent” to the algorithm but not equally useful for your research question. And coherence rewards models that use a narrower vocabulary, which can hide real variety. Read it as one signal, together with reading the topics themselves.
Show R code: compare several values of k with ldatuning
# ── Package for k-selection metrics ───────────────────────────────
library(ldatuning)
# ── Fit LDA at several k values and score each fit ────────────────
tune <- FindTopicsNumber(
  dtm,
  topics = c(3, 4, 5, 6, 7, 8, 10, 12),
  metrics = c("Arun2010", "CaoJuan2009", "Deveaud2014"),
  method = "Gibbs",
  control = list(seed = 42),
  mc.cores = 2,
  verbose = TRUE
)
FindTopicsNumber_plot(tune)
Deveaud2014 should be high (maximize), while Arun2010 and CaoJuan2009 should be low (minimize). The best k is a compromise. These are not identical to the c_v score we plot above, but they answer the same question: which k gives the cleanest topics.
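If you want something closer in spirit to c_v, a UMass-style coherence (Mimno et al. 2011) is easy to compute by hand from document co-occurrence counts. A minimal sketch, assuming the dtm and counts_filt built above; coherence_umass() is our own helper, not a packaged function:

# UMass coherence for one topic's top words (most probable first):
# sum over word pairs of log((co-doc count + 1) / doc count of the
# more probable word). Closer to zero = more coherent.
coherence_umass <- function(words, doc_word) {
  docs_of <- split(doc_word$book_id, doc_word$word)
  words <- intersect(words, names(docs_of))
  pairs <- combn(words, 2, simplify = FALSE)
  sum(map_dbl(pairs, function(p) {
    co <- length(intersect(docs_of[[p[1]]], docs_of[[p[2]]]))
    log((co + 1) / length(docs_of[[p[1]]]))
  }))
}
doc_word <- counts_filt |> distinct(book_id, word)
fit5 <- LDA(dtm, k = 5, method = "Gibbs", control = list(seed = 42))
top10 <- terms(fit5, 10)  # matrix: 10 top words × 5 topics
apply(top10, 2, coherence_umass, doc_word = doc_word)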
The topics
Below are the topics from k = 5, each with its top 15 words and a suggested label (the label is ours, not the algorithm's — you can disagree with it and propose your own).
Show R code: fit LDA and inspect the top words per topic
# ── Fit the model (k = 5 matches the interactive's default) ───────
set.seed(20260420)
k <- 5
lda_fit <- LDA(
  dtm,
  k = k,
  method = "Gibbs",
  control = list(iter = 500, burnin = 200, seed = 20260420)
)
# ── β: topic–word probabilities (ϕ_k(w)) ──────────────────────────
beta <- tidy(lda_fit, matrix = "beta")
# Top 15 words per topic
top_words <- beta |>
  group_by(topic) |>
  slice_max(beta, n = 15, with_ties = FALSE) |>
  ungroup() |>
  arrange(topic, desc(beta))
print(top_words, n = k * 15)
The tidy() function from tidytext pulls β out of the fitted model in long format so you can pipe and plot it.
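To read the topics side by side, the usual tidytext move is a faceted bar chart of the top words. A sketch using the top_words tibble from above:

top_words |>
  mutate(term = reorder_within(term, beta, topic)) |>
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free_y") +
  scale_y_reordered() +
  labs(x = "β (word probability within topic)", y = NULL)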
Documents and their topic mixtures
LDA hands each document a mixture over topics — the θ_d(k) row from lecture. The bar on each row below shows that mixture. The color of each segment matches the topic color above.
(Interactive table: one row per book, with columns for Book, Era, Year, Tokens, the topic-mixture bar, and the Dominant topic.)
Average mixture by era
Averaging each era's topic mixtures gives you a sense of which themes each period concentrated on.
Show R code: document–topic mixtures and era averages
# ── γ: document–topic proportions (θ_d(k)) ────────────────────────
gamma <- tidy(lda_fit, matrix = "gamma") |>
  rename(book_id = document)
# Dominant topic per book (with_ties = FALSE so a tie can't duplicate a book)
dominant <- gamma |>
  group_by(book_id) |>
  slice_max(gamma, n = 1, with_ties = FALSE) |>
  ungroup() |>
  left_join(corpus |> select(book_id, title, era, year),
            by = "book_id") |>
  arrange(year)
print(dominant, n = nrow(dominant))
# ── Era-level topic mix ───────────────────────────────────────────
era_mix <- gamma |>
  left_join(corpus |> select(book_id, era), by = "book_id") |>
  group_by(era, topic) |>
  summarise(share = mean(gamma), .groups = "drop")
era_mix |>
  pivot_wider(names_from = topic, values_from = share,
              names_prefix = "T")
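A stacked bar chart makes the era comparison easier to read than the wide table. A quick sketch from the era_mix tibble (each era's shares sum to 1, so the bars stack to full height):

era_mix |>
  ggplot(aes(era, share, fill = factor(topic))) +
  geom_col() +
  labs(x = NULL, y = "Mean topic share", fill = "Topic")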
LDAvis
LDAvis is an interactive map of the LDA model. The left-hand circles are topics (size = prevalence, distance = dissimilarity). The right-hand bar chart shows the words that define the topic you click on. The λ slider blends raw frequency (λ = 1) with distinctiveness (λ = 0) — start at around 0.3 to see which words make a topic specific.
Opens in an iframe. If it feels cramped, open it in a new tab.
Show R code: build the same LDAvis view from your fitted model
# ── Extract the pieces LDAvis expects ─────────────────────────────
phi <- posterior(lda_fit)$terms # K × V, rows sum to 1
theta <- posterior(lda_fit)$topics # D × K, rows sum to 1
vocab <- colnames(phi)
doc_lengths <- as.matrix(dtm) |> rowSums()
term_freqs <- as.matrix(dtm) |> colSums()
# ── Build the JSON and open it in a browser ───────────────────────
json <- createJSON(
  phi = phi,
  theta = theta,
  doc.length = doc_lengths,
  vocab = vocab,
  term.frequency = term_freqs
)
serVis(json, out.dir = "lda_view", open.browser = TRUE)
That's all serVis() does: it writes the LDAvis HTML to lda_view/ and opens it in your browser. The λ slider, circle map, and bar chart work the same as the iframe above. Share the folder with a classmate and they can open index.html to see your model.
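For the curious, the λ slider implements the relevance score from Sievert & Shirley (2014): relevance(w, k | λ) = λ·log ϕ_k(w) + (1 − λ)·log(ϕ_k(w) / p(w)), where p(w) is the word's overall corpus probability. You can reproduce the ranking from the phi and term_freqs objects above (assuming their columns line up, which they do when both come from the same dtm):

# Rank topic-1 words at λ = 0.3, the slider setting suggested above
lambda <- 0.3
p_w <- term_freqs / sum(term_freqs)  # marginal word probabilities
relevance <- lambda * log(phi) +
  (1 - lambda) * log(sweep(phi, 2, p_w, "/"))
head(sort(relevance[1, ], decreasing = TRUE), 10)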
What to take away
- LDA is a reading aid. The model finds co-occurring words; you read them and decide what they mean. Every topic label on this page is an interpretation.
- Coherence is one signal, not the answer. It's useful to rule out obviously bad k values. It is not a replacement for actually reading the topics.
- Era effects come from the mixture, not the clustering. A textbook can be mostly ancient-history and partly colonial-resistance. Clustering would have put it in one bucket; LDA keeps the mix.
- Orange uses the same algorithm. When you run the final assignment in Orange, you'll get topics in the same shape as these — the gensim library under the hood is the same.