Exploring Korean History Textbooks in R

Word clouds, frequency analysis, and concordance with the NIKH corpus

Week 5 · R + tidyverse + tidytext · NIKH History Textbook Corpus (9-book demo from 67-book corpus, 1895–2016)

This walkthrough replicates part of the Week 5 hands-on lesson in R. We load 9 Korean history textbooks from three political eras — Colonial, Authoritarian, and Democratic — sampled from the 67-book NIKH corpus (1895–2016). We preprocess the text and explore how language differs across eras using word clouds and concordance analysis. All code runs top to bottom in RStudio.

Setup & Load Data

We use tidyverse for data wrangling, tidytext for text analysis structure, elbird for Korean morphological analysis (it wraps the same Kiwi engine used in our Orange preprocessing scripts), and ggwordcloud for word clouds.

R code: load packages, corpus, and stopwords

# ── Packages ──────────────────────────────────────────────────────
library(tidyverse)
library(tidytext)
library(elbird)
library(ggwordcloud)

# ── Load the NIKH demo corpus ─────────────────────────────────────
corpus <- read_csv("data/nikh_textbooks/nikh_textbooks_demo.csv")

# ── Load Korean stopwords ─────────────────────────────────────────
stopwords_ko <- read_lines("data/stopwords_ko.txt") |>
  str_trim() |>
  discard(~ .x == "")

# Quick look at the data
corpus |> select(book_id, title, era, period) |> print(n = 9)

About elbird: Install it with install.packages("elbird"). It wraps Kiwi, the same Korean morphological analyzer used in our Orange preprocessing scripts. The first run downloads the Kiwi model automatically.
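Before loading the packages, it can help to check which ones are missing. The following is a minimal sketch, assuming all four packages are installed from CRAN; missing_packages() is a hypothetical helper written for this walkthrough, not part of any package.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("tidyverse", "tidytext", "elbird", "ggwordcloud")
to_install <- missing_packages(needed)

# Install only what is missing, and only in an interactive session
# so that a scripted run never triggers a download by surprise.
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}
```

Run this once in the RStudio console; after that the library() calls above should load without errors.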

Preprocessing

We tokenize each book with Kiwi's morphological analyzer, keep only common nouns (NNG) and proper nouns (NNP), remove stopwords, and filter out single-character tokens and pure numbers. This is the same pipeline as our Orange workflow. The demo CSV also includes a processed_text column with pre-tokenized nouns, so you can skip the tokenization step and work directly with the cleaned tokens.

R code: tokenize, filter nouns, remove stopwords

# ── Helper: tokenize one text with Kiwi via elbird ────────────────
tokenize_kiwi <- function(text) {
  result <- tokenize(text, flatten = TRUE)
  tibble(form = result$form, tag = result$tag)
}

# ── Tokenize and filter ───────────────────────────────────────────
tokens <- corpus |>
  select(book_id, era, full_text) |>
  mutate(
    morphemes = map(full_text, tokenize_kiwi)
  ) |>
  unnest(morphemes) |>
  filter(
    tag %in% c("NNG", "NNP"),        # keep common and proper nouns
    !form %in% stopwords_ko,         # remove stopwords
    str_length(form) >= 2,           # drop single characters
    !str_detect(form, "^[0-9]+$")    # drop pure numbers
  ) |>
  select(book_id, era, word = form)

# ── Count words per era ───────────────────────────────────────────
era_counts <- tokens |>
  count(era, word, sort = TRUE)

# Top 10 per era (with_ties = FALSE keeps exactly 10 rows per era)
era_counts |>
  group_by(era) |>
  slice_max(n, n = 10, with_ties = FALSE) |>
  print(n = 30)

Note on tokenize_kiwi(): This helper wraps elbird's tokenize() function and returns a tidy tibble with one row per morpheme: form (the morpheme's surface form) and tag (its part-of-speech tag). The flatten = TRUE argument returns all tokens in a single flat structure.
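If you start from the processed_text column instead of running Kiwi, the same one-word-per-row table can be built by splitting on whitespace. This is a minimal sketch on a toy two-row tibble. It assumes that processed_text stores tokens separated by spaces, which you should check against your copy of the CSV.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Toy stand-in for the demo CSV; the real file has one row per textbook.
toy_corpus <- tibble(
  book_id = c("b1", "b2"),
  era = c("Colonial", "Democratic"),
  processed_text = c("신라 통일 반도", "평화 통일 남북 통일")
)

# Split on single spaces (assumed delimiter) and reshape to one word per row.
toy_tokens <- toy_corpus |>
  mutate(word = str_split(str_squish(processed_text), " ")) |>
  select(book_id, era, word) |>
  unnest(word)

# Same counting step as in the main pipeline.
toy_counts <- toy_tokens |> count(era, word, sort = TRUE)
toy_counts
```

With the real data, replace toy_corpus with corpus; the result has the same book_id, era, and word columns as the tokens table built above.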

Word Clouds by Era

Word clouds give a quick visual sense of what each era's textbooks emphasize. The colonial-era texts (written under Japanese rule) feature terms about kingdoms and military conflict. The authoritarian-era texts foreground nation, society, and culture. The democratic-era texts add movements, independence, and development.

R code: generate word clouds with ggwordcloud

# ── Word clouds by era ────────────────────────────────────────────
# Era palette: amber, violet, cyan
era_colors <- c(
  Colonial      = "#b45309",  # amber
  Authoritarian = "#7c3aed",  # violet
  Democratic    = "#0891b2"   # cyan
)

era_counts |>
  group_by(era) |>
  slice_max(n, n = 50) |>
  ungroup() |>
  ggplot(aes(label = word, size = n, color = era)) +
  geom_text_wordcloud(area_corr = TRUE) +
  scale_color_manual(values = era_colors) +
  scale_size_area(max_size = 14) +
  facet_wrap(~ era) +
  theme_minimal() +
  theme(strip.text = element_text(face = "bold", size = 12))

Output: faceted word clouds of the top 50 nouns, one panel per era.
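By default ggplot2 orders facet panels alphabetically, so Authoritarian appears before Colonial. One way to show the eras chronologically is to turn era into a factor with explicit levels before plotting. This is a small sketch on toy counts; era_levels is our own name, not part of the lesson code.

```r
library(dplyr)

# Chronological order of the three eras used in this walkthrough.
era_levels <- c("Colonial", "Authoritarian", "Democratic")

# Toy stand-in for era_counts.
toy_era_counts <- tibble(
  era = c("Democratic", "Colonial", "Authoritarian"),
  word = c("통일", "반도", "민족"),
  n = c(5L, 3L, 4L)
)

# facet_wrap(~ era) follows factor levels, so this fixes the panel order.
toy_era_counts <- toy_era_counts |>
  mutate(era = factor(era, levels = era_levels))

levels(toy_era_counts$era)
```

Applying the same mutate() to era_counts before the ggplot() call should put the Colonial panel first.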

Tracking 통일 (Unification) Across Eras

The word 통일 (tongil, unification) appears in all three eras, but its meaning shifts dramatically. In colonial-era textbooks, it refers to ancient territorial unification of kingdoms. In authoritarian-era texts, it takes on nationalist overtones. In democratic-era texts, it centers on North-South reunification and peace.

R code: count and plot 통일 frequency per textbook

# ── Count 통일 per document ───────────────────────────────────────
# Summarise over all tokens so books with zero hits stay in the table
tongil_counts <- tokens |>
  group_by(book_id, era) |>
  summarise(
    tongil_n = sum(word == "통일"),
    total_n  = n(),
    .groups  = "drop"
  ) |>
  mutate(per_1k = tongil_n / total_n * 1000) |>
  left_join(corpus |> select(book_id, title), by = "book_id")

# ── Plot ──────────────────────────────────────────────────────────
tongil_counts |>
  mutate(title = str_trunc(title, 25)) |>
  mutate(title = fct_reorder(title, per_1k)) |>
  ggplot(aes(x = per_1k, y = title, fill = era)) +
  geom_col() +
  scale_fill_manual(values = era_colors) +
  labs(
    x = "Occurrences per 1,000 tokens",
    y = NULL,
    fill = "Era",
    title = "통일 (unification) across textbooks"
  ) +
  theme_minimal(base_size = 12)

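Output: horizontal bar chart of 통일 occurrences per 1,000 tokens for each textbook, with bars colored by era (Colonial, Authoritarian, Democratic).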

Reading the chart: The raw count is simply how many times 통일 appears in each textbook. Longer books naturally contain more of every word, so raw counts can mislead. The per-1,000 rate adjusts for document length: divide the raw count by the total number of tokens in that book, then multiply by 1,000. Here "tokens" means the nouns kept after preprocessing, not every word in the original text. The result is a rate ("out of every 1,000 retained nouns, how many are 통일?"), so you can fairly compare books of different lengths.
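The arithmetic can be checked by hand. The sketch below uses made-up counts for two hypothetical books (not values from the corpus) to show how a longer book with more raw hits can still have the lower rate.

```r
# Rate per 1,000 tokens: raw count divided by document length, times 1,000.
per_1k <- function(count, total) count / total * 1000

# Made-up numbers for two books of different lengths.
short_book <- per_1k(count = 12, total = 4000)   # 12 hits in 4,000 tokens
long_book  <- per_1k(count = 30, total = 20000)  # 30 hits in 20,000 tokens

short_book  # 3 per 1,000
long_book   # 1.5 per 1,000
```

The long book has more than twice the raw hits, but the short book uses the word twice as often relative to its length.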

Concordance: 통일 in Context

A word count tells you how often a word appears. A concordance shows you how it is used. Below are five sentences containing 통일, one or two from each era. Notice how the same word carries entirely different meanings depending on the political context in which the textbook was written.

R code: concordance search for 통일

# ── Sentence-level concordance for 통일 ───────────────────────────
# Split after sentence-final punctuation, then keep sentences with 통일
kwic_results <- corpus |>
  mutate(
    sentences = map(full_text, ~ str_split(.x, "(?<=[.!?])\\s+") |> pluck(1))
  ) |>
  unnest(sentences) |>
  filter(str_detect(sentences, "통일")) |>
  select(book_id, era, title, sentence = sentences)

# Browse the results
kwic_results |> print(n = 20)
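The code above returns whole sentences. A classic keyword-in-context (KWIC) view instead shows a fixed window of characters on each side of every match. This is a minimal sketch using stringr; kwic_window() and its window argument are hypothetical names introduced here, and the sample text is a shortened version of one of the sentences quoted in this lesson.

```r
library(stringr)
library(tibble)

# Hypothetical helper: one row per match, with `window` characters of context.
kwic_window <- function(text, keyword, window = 10) {
  # Start and end positions of every literal match in a single string.
  loc <- str_locate_all(text, fixed(keyword))[[1]]
  tibble(
    left    = str_sub(text, pmax(loc[, "start"] - window, 1), loc[, "start"] - 1),
    keyword = keyword,
    right   = str_sub(text, loc[, "end"] + 1, loc[, "end"] + window)
  )
}

sample_text <- "이러한 평화 통일을 위한 노력은 남북 대화가 중단된 후에도 계속되었다."
kwic_window(sample_text, "통일", window = 6)
```

The helper takes one string at a time; to run it over the corpus you could map it across full_text in the same way tokenize_kiwi() is mapped earlier.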

Curated Concordance

통일

tongil — unification

Colonial

태조는 즉위한 지 19년 만에 신라와 후백제를 병합하고 반도를 통일했다.

King Taejo annexed Silla and Later Baekje, unifying the peninsula within nineteen years of ascending the throne.

심상소학국사보충아동용 · Colonial Period

Authoritarian

신라는 우리 땅을 지배하려는 당을 몰아 내고 마침내 삼국 통일을 이룩하여, 민족의 굳건한 기백을 보여 주었다.

Silla drove out Tang, which sought to dominate our land, and finally achieved the Three Kingdoms unification, demonstrating the unyielding spirit of the nation.

중학교 국사 4차(상) · Authoritarian · 1981

Authoritarian

우리는 바라던 독립을 차지하였으나, 아직도 통일을 이루지 못하고 있으니, 앞으로 더욱 뭉쳐서 통일과 발전을 위하여 노력하여야 하겠다.

We have achieved the independence we longed for, but have still not achieved unification; we must unite further and strive for unification and development.

초등학교 사회생활 6-1(1차) · Authoritarian · 1954

Democratic

이러한 평화 통일을 위한 노력은 남북 대화가 중단된 후에도 계속되어 우리 정부는 북한에 상호 불가침 협정을 제안하기도 하였다.

These efforts for peaceful unification continued even after inter-Korean dialogue was suspended, and our government proposed a mutual non-aggression pact to North Korea.

중학교 국사 6차(하) · Democratic · 1995

Democratic

광복 후 분단을 딛고 일어선 대한 민국은 민주 정치의 발전, 경제적 번영, 그리고 복지 사회 건설과 민족 통일을 목표로 성장해 왔다.

Since liberation, the Republic of Korea has risen above division and grown with the goals of democratic development, economic prosperity, welfare society, and national unification.

초등학교 사회 6-1(7차) · Democratic · 2002

The same word, 통일, carries the weight of its era. Colonial textbooks use it as a neutral historical term for ancient territorial consolidation. Authoritarian-era texts frame it through nationalist ideology — unification as proof of the Korean people's spirit, and as an urgent Cold War imperative. Democratic-era texts reframe it around peace, diplomacy, and democratic values. This is what concordance analysis reveals: not just how often a word appears, but what it means.