Clustering Korean History Textbooks
Hierarchical clustering, dendrogram visualization, and distinctive words by cluster
In Weeks 3–5 we learned to preprocess text and measure word frequencies. This week we ask a different question: can an algorithm group texts by similarity without knowing anything about their labels? We use hierarchical clustering on TF-IDF vectors from 11 Korean history textbooks spanning three political eras (Colonial, Authoritarian, and Democratic) and see whether the clusters the algorithm discovers correspond to the eras we know.
The Data: 11 History Textbooks
Our demo corpus is an 11-book subset of the 67-book NIKH (National Institute of Korean History) textbook corpus. The books span three political eras: 3 colonial-era textbooks (1940), 4 authoritarian-era textbooks (1973–1987), and 4 democratic-era textbooks (1995–2002). The CSV contains the raw full_text of each book. We preprocess it ourselves in Step 2.
Show R code: load the clustering demo corpus
# ── Packages ──────────────────────────────────────────────────────
library(tidyverse)
library(tidytext)
library(elbird) # Korean morphological analysis (wraps Kiwi)
# ── Load the clustering demo corpus ───────────────────────────────
corpus <- read_csv("data/nikh_textbooks/nikh_clustering_demo.csv")
# ── Load Korean stopwords ─────────────────────────────────────────
stopwords_ko <- read_lines("data/stopwords_ko.txt") |>
  str_trim() |>
  discard(~ .x == "")
# Quick look at the data
corpus |>
  select(book_id, title, era, year) |>
  print(n = 11)
Install the package with install.packages("elbird"). It wraps Kiwi, the same Korean morphological analyzer used in our Orange preprocessing scripts; the first run downloads the model automatically.
(Table: the 11 textbooks, listing title, era, level, year, and token count for each.)
Preprocessing
We tokenize each book with Kiwi's morphological analyzer, keep only common nouns (NNG) and proper nouns (NNP), remove stopwords, and filter out single-character tokens. This is the same preprocessing pipeline from Weeks 3–5 and from our Orange workflows. We then count how often each word appears in each book. No document-frequency filter is applied: every word is kept, just as in the Week 5 demo.
Show R code: tokenize with Kiwi, filter nouns, remove stopwords, count
# ── Helper: tokenize one text with Kiwi via elbird ────────────────
tokenize_kiwi <- function(text) {
  result <- tokenize(text, flatten = TRUE)
  tibble(form = result$form, tag = result$tag)
}
# ── Tokenize and filter ───────────────────────────────────────────
tokens <- corpus |>
  select(book_id, era, full_text) |>
  mutate(
    morphemes = map(full_text, tokenize_kiwi)
  ) |>
  unnest(morphemes) |>
  filter(
    tag %in% c("NNG", "NNP"),       # keep nouns only
    !form %in% stopwords_ko,        # remove stopwords
    str_length(form) >= 2,          # drop single characters
    !str_detect(form, "^[0-9]+$")   # drop pure numbers
  ) |>
  select(book_id, era, word = form)
# ── Count words per book (raw frequencies) ────────────────────────
# No document-frequency filter: every word is kept.
word_counts <- tokens |>
  count(book_id, word)
# How many unique words per book?
word_counts |>
  count(book_id, name = "unique_words") |>
  left_join(corpus |> select(book_id, title, era), by = "book_id") |>
  select(book_id, title, era, unique_words)
# Top 10 most frequent words across all books
tokens |>
  count(word, sort = TRUE) |>
  slice_head(n = 10)
tokenize_kiwi(): This wraps elbird's tokenize() function and returns a tidy tibble with form (the surface word) and tag (the POS tag). The flatten = TRUE argument returns all tokens in a single flat structure. This is the same helper from the Week 5 demo.
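The POS and stopword filters can be sanity-checked without running Kiwi at all. Here they are applied to a hand-made token tibble (toy data with a one-word stopword list, not the real corpus):

```r
library(tibble)
library(dplyr)
library(stringr)

# Toy tokens: one row per morpheme, mimicking tokenize_kiwi() output.
toy <- tribble(
  ~form,   ~tag,
  "역사",  "NNG",  # common noun, but in the toy stopword list: dropped
  "한국",  "NNP",  # proper noun: kept
  "가다",  "VV",   # verb: dropped by the POS filter
  "수",    "NNG",  # single character: dropped by the length filter
  "1945",  "SN"    # number tag: dropped by the POS filter
)
stop_toy <- c("역사")  # pretend 역사 is a stopword

toy |>
  filter(
    tag %in% c("NNG", "NNP"),
    !form %in% stop_toy,
    str_length(form) >= 2,
    !str_detect(form, "^[0-9]+$")
  )
# only 한국 survives all four filters
```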
TF-IDF Vectorization & Cosine Distance
Now we weight those raw word counts using TF-IDF (Term Frequency–Inverse Document Frequency), the same technique from Week 4. TF-IDF down-weights common words that appear in every book (like 나라, "country") and highlights words that are distinctive to particular books. We then compute cosine distance between every pair of books, the same metric you select in Orange's Distances widget.
Show R code: TF-IDF weighting and cosine distance matrix
# ── TF-IDF weighting ──────────────────────────────────────────────
# tf = word count / total words in book
# idf = ln(n_books / n_books_containing_word)
# tf_idf = tf * idf
# All words are kept — no minimum document-frequency cutoff.
tfidf <- word_counts |>
  bind_tf_idf(word, book_id, n)
# ── Build document-term matrix ────────────────────────────────────
# Rows = books, columns = words, values = TF-IDF weights
dtm <- tfidf |>
  select(book_id, word, tf_idf) |>
  pivot_wider(names_from = word, values_from = tf_idf, values_fill = 0)
# Convert to a matrix for clustering
mat <- dtm |> select(-book_id) |> as.matrix()
rownames(mat) <- dtm$book_id
# ── Cosine distance ───────────────────────────────────────────────
# Cosine measures the angle between two vectors, ignoring length.
# cosine similarity = (a · b) / (||a|| * ||b||)
# cosine distance = 1 - cosine similarity
# Same metric as Orange → Distances → Cosine.
cosine_dist <- function(m) {
  sim <- m %*% t(m) /
    (sqrt(rowSums(m^2)) %o% sqrt(rowSums(m^2)))
  as.dist(1 - sim)
}
d <- cosine_dist(mat)
# Quick sanity check: peek at the first 5 x 5 corner of the distance
# matrix (smaller values = more similar books)
round(as.matrix(d)[1:5, 1:5], 3)
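To convince yourself the cosine arithmetic behaves as advertised, here is the same formula applied to a tiny hand-made matrix (toy vectors, not the textbook corpus). A vector pointing the same direction as another should land at distance 0; an orthogonal one at distance 1:

```r
# Toy check of the cosine-distance arithmetic on three hand-made vectors.
m <- rbind(
  a = c(1, 0, 2),
  b = c(2, 0, 4),   # same direction as a: distance should be ~0
  c = c(0, 3, 0)    # orthogonal to a: distance should be 1
)
sim <- m %*% t(m) /
  (sqrt(rowSums(m^2)) %o% sqrt(rowSums(m^2)))
d_toy <- as.dist(1 - sim)
round(as.matrix(d_toy), 3)
# a-b is 0 (length is ignored), a-c and b-c are 1
```

This is why cosine suits TF-IDF vectors: a long book and a short book with the same vocabulary profile come out nearly identical.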
Dendrogram: Hierarchical Clustering
Using the cosine distances from Step 3, Ward's method builds a hierarchy by repeatedly merging the two most similar groups, minimizing within-cluster variance at each step. The result is a dendrogram: a tree that shows which textbooks are most similar and when groups merge. The height of each merge indicates how different the merged groups are. The dashed red line marks the cut at k = 3 clusters.
Show R code: hierarchical clustering and dendrogram
# ── Hierarchical clustering (Ward's method on cosine distances) ───
hc <- hclust(d, method = "ward.D2")
# ── Plot dendrogram ───────────────────────────────────────────────
plot(hc, labels = dtm$book_id,
     main = "Hierarchical Clustering of NIKH Textbooks",
     xlab = "", sub = "", cex = 0.8)
# ── Cut into 3 clusters ───────────────────────────────────────────
clusters <- cutree(hc, k = 3)
rect.hclust(hc, k = 3, border = c("#b45309", "#7c3aed", "#0891b2"))
# ── View assignments ──────────────────────────────────────────────
corpus |>
  mutate(cluster = clusters) |>
  select(book_id, title, era, cluster) |>
  arrange(cluster) |>
  print(n = 11)
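Why cut at k = 3? The merge heights stored in the hclust object give a quick heuristic: a large jump between successive merge heights suggests a natural stopping point. A standalone toy illustration (synthetic 1-D points, not the textbook data):

```r
# Three tight groups of 1-D points: Ward's method merges within each
# group at low heights and across groups at much larger heights.
x <- c(1.0, 1.1, 1.2, 5.0, 5.1, 9.0, 9.2)
hc_toy <- hclust(dist(x), method = "ward.D2")
# The final (cross-group) merge heights dwarf the earlier ones;
# the jump before them marks k = 3.
tail(hc_toy$height, 4)
# Cutting at k = 3 recovers the three groups.
cutree(hc_toy, k = 3)
```

For the textbook corpus, k = 3 is also the number of eras we expect, so the choice doubles as a test of the era hypothesis.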
Cluster vs. Era: Do Clusters Match?
The algorithm had no access to the era labels. It worked only from the TF-IDF word vectors. Yet it recovered groupings that largely correspond to the historical eras. Below, each card shows one cluster and the books it contains. The era labels confirm: the language of history textbooks reflects the political era in which they were written.
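In code, the comparison the cards visualize is just a cross-tabulation of cluster against era. A minimal sketch with toy stand-ins for the pipeline's `clusters` vector and era labels (swap in the real objects to reproduce the check):

```r
# Toy stand-ins: 11 books, their known eras, and the cluster IDs that
# cutree() assigned. The algorithm never saw the era column.
era <- c(rep("Colonial", 3), rep("Authoritarian", 4), rep("Democratic", 4))
clusters <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
# Rows = era labels, columns = discovered clusters. A clean diagonal
# (one era per cluster) means the clustering recovered the eras.
table(era, clusters)
```

With the real corpus, any off-diagonal cell flags a book whose vocabulary reads more like a different era than its publication date suggests, which is exactly the kind of case worth reading closely.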
Distinctive Words by Cluster
What makes each cluster distinctive? We re-run TF-IDF treating each cluster as a single pseudo-document and extract the top-weighted words. This connects back to the word analysis from Weeks 3–5, but now the grouping comes from the clustering algorithm rather than our own labels.
Show R code: extract top TF-IDF words per cluster
# ── Add cluster assignments to tokens ─────────────────────────────
cluster_labels <- tibble(
  book_id = names(clusters),
  cluster = as.character(clusters)
)
# ── TF-IDF by cluster (pseudo-documents) ──────────────────────────
cluster_tfidf <- tokens |>
  left_join(cluster_labels, by = "book_id") |>
  count(cluster, word) |>
  bind_tf_idf(word, cluster, n) |>
  group_by(cluster) |>
  slice_max(tf_idf, n = 20) |>
  ungroup()
# ── Plot ──────────────────────────────────────────────────────────
cluster_tfidf |>
  mutate(word = reorder_within(word, tf_idf, cluster)) |>
  ggplot(aes(tf_idf, word, fill = cluster)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ cluster, scales = "free") +
  scale_y_reordered() +
  labs(x = "TF-IDF", y = NULL) +
  theme_minimal()
The distinctive words tell a clear story. The colonial-era cluster features 천황 (emperor), 일본 (Japan), 고구려 (Goguryeo), and 군대 (army) — reflecting Japanese imperial framing of Korean history. The authoritarian cluster foregrounds 문화 (culture), 사회 (society), and 발전 (development) — nation-building narratives. The democratic cluster highlights 운동 (movement), 민족 (nation/people), and 독립 (independence) — a shift toward social movements and self-determination.
This is the power of combining clustering with word analysis: the algorithm groups texts by vocabulary similarity, and distinctive words explain why each group is different. The method confirms what domain experts already know (Korean history textbooks are products of their political moment) but does so from the text alone, without relying on metadata.