Clustering Korean History Textbooks
Hierarchical clustering, dendrogram visualization, and distinctive words by cluster
In Weeks 3–5 we learned to preprocess text and measure word frequencies. This week we ask a different question: can an algorithm group texts by similarity without knowing anything about their labels? We use hierarchical clustering on TF-IDF vectors from 11 Korean history textbooks spanning three political eras (Colonial, Authoritarian, and Democratic) and see whether the clusters the algorithm discovers correspond to the eras we know.
The Data: 11 History Textbooks
Our demo corpus is an 11-book subset of the 67-book NIKH (National Institute of Korean History) textbook corpus. The books span three political eras: 3 colonial-era textbooks (1940), 4 authoritarian-era textbooks (1973–1987), and 4 democratic-era textbooks (1995–2002). The CSV contains the raw full_text of each book. We preprocess it ourselves in Step 2.
Show R code: load the clustering demo corpus
# ── Packages ──────────────────────────────────────────────────────
library(tidyverse)
library(tidytext)
library(elbird) # Korean morphological analysis (wraps Kiwi)
# ── Load the clustering demo corpus ───────────────────────────────
corpus <- read_csv("data/nikh_textbooks/nikh_clustering_demo.csv")
# ── Load Korean stopwords ─────────────────────────────────────────
stopwords_ko <- read_lines("data/stopwords_ko.txt") |>
  str_trim() |>
  discard(~ .x == "")
# Quick look at the data
corpus |>
  select(book_id, title, era, year) |>
  print(n = 11)
Install the package with install.packages("elbird"). It wraps Kiwi, the same Korean morphological analyzer used in our Orange preprocessing scripts; the first run downloads the model automatically.
(Table: the 11 textbooks, listing title, era, level, year, and token count for each.)
Preprocessing
We tokenize each book with Kiwi's morphological analyzer, keep only common nouns (NNG) and proper nouns (NNP), remove stopwords, and filter out single-character tokens. This is the same preprocessing pipeline from Weeks 3–5 and from our Orange workflows. We then count how often each word appears in each book. No document-frequency filter is applied: every word is kept, just as in the Week 5 demo.
Show R code: tokenize with Kiwi, filter nouns, remove stopwords, count
# ── Helper: tokenize one text with Kiwi via elbird ────────────────
tokenize_kiwi <- function(text) {
  result <- tokenize(text, flatten = TRUE)
  tibble(form = result$form, tag = result$tag)
}
# ── Tokenize and filter ───────────────────────────────────────────
tokens <- corpus |>
  select(book_id, era, full_text) |>
  mutate(
    morphemes = map(full_text, tokenize_kiwi)
  ) |>
  unnest(morphemes) |>
  filter(
    tag %in% c("NNG", "NNP"),       # keep nouns only
    !form %in% stopwords_ko,        # remove stopwords
    str_length(form) >= 2,          # drop single characters
    !str_detect(form, "^[0-9]+$")   # drop pure numbers
  ) |>
  select(book_id, era, word = form)
# ── Count words per book (raw frequencies) ────────────────────────
# No document-frequency filter: every word is kept.
word_counts <- tokens |>
  count(book_id, word)
# How many unique words per book?
word_counts |>
  count(book_id, name = "unique_words") |>
  left_join(corpus |> select(book_id, title, era), by = "book_id") |>
  select(book_id, title, era, unique_words)
# Top 10 most frequent words across all books
tokens |>
  count(word, sort = TRUE) |>
  slice_head(n = 10)
tokenize_kiwi(): This wraps elbird's tokenize() function and returns a tidy tibble with form (the surface word) and tag (the POS tag). The flatten = TRUE argument returns all tokens in a single flat structure. This is the same helper from the Week 5 demo.
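The POS and stopword filters can be sanity-checked without running Kiwi at all. Here they are applied to a hand-made token tibble (toy data with a one-word stopword list, not the real corpus):

```r
library(tibble)
library(dplyr)
library(stringr)

# Toy tokens: one row per morpheme, mimicking tokenize_kiwi() output.
toy <- tribble(
  ~form,   ~tag,
  "역사",  "NNG",  # common noun, but in the toy stopword list: dropped
  "한국",  "NNP",  # proper noun: kept
  "가다",  "VV",   # verb: dropped by the POS filter
  "수",    "NNG",  # single character: dropped by the length filter
  "1945",  "SN"    # number tag: dropped by the POS filter
)
stop_toy <- c("역사")  # pretend 역사 is a stopword

toy |>
  filter(
    tag %in% c("NNG", "NNP"),
    !form %in% stop_toy,
    str_length(form) >= 2,
    !str_detect(form, "^[0-9]+$")
  )
# only 한국 survives all four filters
```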
TF-IDF Vectorization & Cosine Distance
Now we weight those raw word counts using TF-IDF (Term Frequency–Inverse Document Frequency), the same technique from Week 4. TF-IDF down-weights common words that appear in every book (like 나라, "country") and highlights words that are distinctive to particular books. We then compute cosine distance between every pair of books, the same metric you select in Orange's Distances widget.
Show R code: TF-IDF weighting and cosine distance matrix
# ── TF-IDF weighting ──────────────────────────────────────────────
# tf = word count / total words in book
# idf = ln(n_books / n_books_containing_word)
# tf_idf = tf * idf
# All words are kept — no minimum document-frequency cutoff.
tfidf <- word_counts |>
  bind_tf_idf(word, book_id, n)
# ── Build document-term matrix ────────────────────────────────────
# Rows = books, columns = words, values = TF-IDF weights
dtm <- tfidf |>
  select(book_id, word, tf_idf) |>
  pivot_wider(names_from = word, values_from = tf_idf, values_fill = 0)
# Convert to a matrix for clustering
mat <- dtm |> select(-book_id) |> as.matrix()
rownames(mat) <- dtm$book_id
# ── Cosine distance ───────────────────────────────────────────────
# Cosine measures the angle between two vectors, ignoring length.
# cosine similarity = (a · b) / (||a|| * ||b||)
# cosine distance = 1 - cosine similarity
# Same metric as Orange → Distances → Cosine.
cosine_dist <- function(m) {
  sim <- m %*% t(m) /
    (sqrt(rowSums(m^2)) %o% sqrt(rowSums(m^2)))
  as.dist(1 - sim)
}
d <- cosine_dist(mat)
# Quick sanity check: peek at the first 5 x 5 corner of the distance
# matrix (smaller values = more similar books)
round(as.matrix(d)[1:5, 1:5], 3)
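To convince yourself the cosine arithmetic behaves as advertised, here is the same formula applied to a tiny hand-made matrix (toy vectors, not the textbook corpus). A vector pointing the same direction as another should land at distance 0; an orthogonal one at distance 1:

```r
# Toy check of the cosine-distance arithmetic on three hand-made vectors.
m <- rbind(
  a = c(1, 0, 2),
  b = c(2, 0, 4),   # same direction as a: distance should be ~0
  c = c(0, 3, 0)    # orthogonal to a: distance should be 1
)
sim <- m %*% t(m) /
  (sqrt(rowSums(m^2)) %o% sqrt(rowSums(m^2)))
d_toy <- as.dist(1 - sim)
round(as.matrix(d_toy), 3)
# a-b is 0 (length is ignored), a-c and b-c are 1
```

This is why cosine suits TF-IDF vectors: a long book and a short book with the same vocabulary profile come out nearly identical.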
Dendrogram: Hierarchical Clustering
Using the cosine distances from Step 3, Ward's method builds a hierarchy by repeatedly merging the two most similar groups, minimizing within-cluster variance at each step. The result is a dendrogram: a tree that shows which textbooks are most similar and when groups merge. The height of each merge indicates how different the merged groups are. The dashed red line marks the cut at k = 3 clusters.
Show R code: hierarchical clustering and dendrogram
# ── Hierarchical clustering (Ward's method on cosine distances) ───
hc <- hclust(d, method = "ward.D2")
# ── Plot dendrogram ───────────────────────────────────────────────
plot(hc, labels = dtm$book_id,
     main = "Hierarchical Clustering of NIKH Textbooks",
     xlab = "", sub = "", cex = 0.8)
# ── Cut into 3 clusters ───────────────────────────────────────────
clusters <- cutree(hc, k = 3)
rect.hclust(hc, k = 3, border = c("#b45309", "#7c3aed", "#0891b2"))
# ── View assignments ──────────────────────────────────────────────
corpus |>
  mutate(cluster = clusters) |>
  select(book_id, title, era, cluster) |>
  arrange(cluster) |>
  print(n = 11)
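Why cut at k = 3? The merge heights stored in the hclust object give a quick heuristic: a large jump between successive merge heights suggests a natural stopping point. A standalone toy illustration (synthetic 1-D points, not the textbook data):

```r
# Three tight groups of 1-D points: Ward's method merges within each
# group at low heights and across groups at much larger heights.
x <- c(1.0, 1.1, 1.2, 5.0, 5.1, 9.0, 9.2)
hc_toy <- hclust(dist(x), method = "ward.D2")
# The final (cross-group) merge heights dwarf the earlier ones;
# the jump before them marks k = 3.
tail(hc_toy$height, 4)
# Cutting at k = 3 recovers the three groups.
cutree(hc_toy, k = 3)
```

For the textbook corpus, k = 3 is also the number of eras we expect, so the choice doubles as a test of the era hypothesis.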
Cluster vs. Era: Do Clusters Match?
The algorithm had no access to the era labels. It worked only from the TF-IDF word vectors. Yet it recovered groupings that largely correspond to the historical eras. Below, each card shows one cluster and the books it contains. The era labels confirm: the language of history textbooks reflects the political era in which they were written.
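In code, the comparison the cards visualize is just a cross-tabulation of cluster against era. A minimal sketch with toy stand-ins for the pipeline's `clusters` vector and era labels (swap in the real objects to reproduce the check):

```r
# Toy stand-ins: 11 books, their known eras, and the cluster IDs that
# cutree() assigned. The algorithm never saw the era column.
era <- c(rep("Colonial", 3), rep("Authoritarian", 4), rep("Democratic", 4))
clusters <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
# Rows = era labels, columns = discovered clusters. A clean diagonal
# (one era per cluster) means the clustering recovered the eras.
table(era, clusters)
```

With the real corpus, any off-diagonal cell flags a book whose vocabulary reads more like a different era than its publication date suggests, which is exactly the kind of case worth reading closely.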
Distinctive Words by Cluster
What makes each cluster distinctive? We re-run TF-IDF treating each cluster as a single pseudo-document and extract the top-weighted words. This connects back to the word analysis from Weeks 3–5, but now the grouping comes from the clustering algorithm rather than our own labels.
Show R code: extract top TF-IDF words per cluster
# ── Add cluster assignments to tokens ─────────────────────────────
cluster_labels <- tibble(
  book_id = names(clusters),
  cluster = as.character(clusters)
)
# ── TF-IDF by cluster (pseudo-documents) ──────────────────────────
cluster_tfidf <- tokens |>
  left_join(cluster_labels, by = "book_id") |>
  count(cluster, word) |>
  bind_tf_idf(word, cluster, n) |>
  group_by(cluster) |>
  slice_max(tf_idf, n = 20) |>
  ungroup()
# ── Plot ──────────────────────────────────────────────────────────
cluster_tfidf |>
  mutate(word = reorder_within(word, tf_idf, cluster)) |>
  ggplot(aes(tf_idf, word, fill = cluster)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ cluster, scales = "free") +
  scale_y_reordered() +
  labs(x = "TF-IDF", y = NULL) +
  theme_minimal()
The distinctive words tell a clear story. The colonial-era cluster features 천황 (emperor), 일본 (Japan), 고구려 (Goguryeo), and 군대 (army) — reflecting Japanese imperial framing of Korean history. The authoritarian cluster foregrounds 문화 (culture), 사회 (society), and 발전 (development) — nation-building narratives. The democratic cluster highlights 운동 (movement), 민족 (nation/people), and 독립 (independence) — a shift toward social movements and self-determination.
This is the power of combining clustering with word analysis: the algorithm groups texts by vocabulary similarity, and distinctive words explain why each group is different. The method confirms what domain experts already know (Korean history textbooks are products of their political moment) but does so from the text alone, without relying on metadata.