K-Means Clustering: Presidential Speeches
Step through a k-means clustering of 749 presidential speeches. Each step shows how the algorithm finds structure in the data and what that structure means.
Step A. Load packages and data
We need four packages. elbird wraps the Kiwi Korean morphological analyzer; install it with install.packages("elbird"). The first run downloads the Kiwi model automatically.
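If any of the packages are missing, a one-time setup would look like this (a sketch, assuming all four are available from CRAN):

```r
# One-time setup; elbird fetches the Kiwi model on first use.
install.packages(c("tidyverse", "tidytext", "elbird", "cluster"))
```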
library(tidyverse)
library(tidytext)
library(elbird) # Korean morphological analysis (wraps Kiwi)
library(cluster) # for silhouette()
corpus <- read_csv("data/president_speeches/president_speeches_democratic_era.csv")
stopwords_ko <- read_lines("data/stopwords_ko.txt") |> str_trim() |> discard(~ .x == "")
Step B. Tokenize and filter
We tokenize each speech with Kiwi, keep only nouns (NNG, NNP), and remove stopwords and single-character tokens. This is the same preprocessing pipeline from earlier weeks and the same one used in our Orange workflows.
tokenize_kiwi <- function(text) {
  result <- tokenize(text, flatten = TRUE)
  tibble(form = result$form, tag = result$tag)
}
tokens <- corpus |>
  mutate(doc_id = row_number(),
         morphemes = map(speech_text, tokenize_kiwi)) |>
  unnest(morphemes) |>
  filter(tag %in% c("NNG", "NNP"),
         !form %in% stopwords_ko,
         str_length(form) >= 2,
         !str_detect(form, "^[0-9]+$")) |>
  select(doc_id, president, kind, word = form)

# Check per-president speech counts
tokens |> distinct(doc_id, president) |> count(president)
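The filtering rules in Step B can be checked in isolation on a hand-made token table (toy data, so no Kiwi model is needed; the tags and words here are invented for illustration):

```r
library(dplyr)
library(stringr)

# Toy morpheme table mimicking tokenize_kiwi() output.
toy <- tibble(
  form = c("경제", "것", "2024", "발전", "하다"),
  tag  = c("NNG", "NNB", "SN", "NNG", "VV")
)
toy_stopwords <- c("발전")

kept <- toy |>
  filter(
    tag %in% c("NNG", "NNP"),        # nouns only
    !form %in% toy_stopwords,        # drop stopwords
    str_length(form) >= 2,           # drop single-character tokens
    !str_detect(form, "^[0-9]+$")    # drop pure numbers
  )
kept$form  # only "경제" survives all four rules
```

Each of the other four tokens is removed by exactly one rule, which makes this a quick way to sanity-check a stopword list before running the full pipeline.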
Step C. Build TF-IDF matrix
We weight word counts with TF-IDF (same technique from Week 4), but first we filter the vocabulary: words must appear in at least 5 speeches (not too rare) and no more than 60% of speeches (not too common — effectively stopwords). This keeps the feature space focused on words that actually differentiate speeches.
n_docs <- tokens |> distinct(doc_id) |> nrow()

word_doc_freq <- tokens |>
  distinct(doc_id, word) |>
  count(word, name = "doc_freq")

keep_words <- word_doc_freq |>
  filter(doc_freq >= 5, doc_freq <= 0.6 * n_docs) |>
  pull(word)

tokens_filtered <- tokens |> filter(word %in% keep_words)

tfidf <- tokens_filtered |>
  count(doc_id, word) |>
  bind_tf_idf(word, doc_id, n)

dtm <- tfidf |>
  select(doc_id, word, tf_idf) |>
  pivot_wider(names_from = word, values_from = tf_idf, values_fill = 0)

mat <- dtm |> select(-doc_id) |> as.matrix()
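To see why overly common words are useless as features, here is bind_tf_idf on a three-document toy corpus (invented counts): a word appearing in every document gets idf = log(3/3) = 0, so its tf_idf is 0 and it contributes nothing to the clustering.

```r
library(dplyr)
library(tidytext)
library(tidyr)

# Toy per-document word counts (invented data).
toy_counts <- tribble(
  ~doc_id, ~word,  ~n,
  1,       "국민", 5,
  1,       "경제", 2,
  2,       "국민", 4,
  2,       "평화", 3,
  3,       "국민", 6,
  3,       "경제", 1
)

toy_tfidf <- toy_counts |> bind_tf_idf(word, doc_id, n)
# "국민" appears in all 3 docs -> idf = 0 -> tf_idf = 0 in every row
toy_tfidf |> filter(word == "국민") |> pull(tf_idf)

# Widening to a document-term matrix, exactly as in Step C:
toy_dtm <- toy_tfidf |>
  select(doc_id, word, tf_idf) |>
  pivot_wider(names_from = word, values_from = tf_idf, values_fill = 0)
```

The 60% document-frequency cap in Step C removes such words up front, before TF-IDF zeroes them out anyway.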
Step D. Choose k with silhouette scores
We run k-means on the TF-IDF matrix for k = 2 through 8 and compute the average silhouette score for each. The silhouette measures how well each speech fits its assigned cluster relative to the next-best cluster; higher is better. We pick the k with the highest average score, the same criterion used when selecting k in Orange's k-Means widget.
set.seed(42)  # kmeans starts are random; fix the seed for reproducibility
d <- dist(mat)  # compute the distance matrix once, not once per k

sil_results <- tibble(k = 2:8) |>
  mutate(
    km = map(k, ~ kmeans(mat, centers = .x, nstart = 10)),
    sil = map_dbl(km, function(m) {
      s <- silhouette(m$cluster, d)
      mean(s[, 3])  # column 3 holds each observation's silhouette width
    })
  )

best_k <- sil_results |> slice_max(sil, n = 1, with_ties = FALSE) |> pull(k)
best_k  # print it
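The silhouette computation itself is easy to see on toy 2D data (invented points): two well-separated blobs, labeled correctly, give an average silhouette close to 1.

```r
library(cluster)

set.seed(1)
# Two clearly separated blobs of 10 points each.
pts <- rbind(
  matrix(rnorm(20, mean = 0,  sd = 0.1), ncol = 2),
  matrix(rnorm(20, mean = 10, sd = 0.1), ncol = 2)
)
labels <- rep(1:2, each = 10)

sil <- silhouette(labels, dist(pts))
sil_mean <- mean(sil[, 3])
sil_mean  # close to 1: every point sits deep inside its own cluster
```

Shrinking the gap between the blobs, or splitting one blob across two labels, drags the average down; that is exactly the signal the k-selection loop above exploits.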
Step E. Run k-means and inspect clusters
Now we run k-means one final time with the best k. To understand what each cluster is "about," we re-apply TF-IDF at the cluster level — treating all speeches in a cluster as one big document — and extract the top distinctive words.
set.seed(42)
best_km <- kmeans(mat, centers = best_k, nstart = 10)
doc_clusters <- tibble(doc_id = dtm$doc_id, cluster = as.character(best_km$cluster))

# Top TF-IDF words per cluster (each cluster treated as one document)
cluster_tfidf <- tokens |>
  left_join(doc_clusters, by = "doc_id") |>
  count(cluster, word) |>
  bind_tf_idf(word, cluster, n) |>
  group_by(cluster) |>
  slice_max(tf_idf, n = 15, with_ties = FALSE) |>
  ungroup()
# Plot
cluster_tfidf |>
  mutate(word = reorder_within(word, tf_idf, cluster)) |>
  ggplot(aes(tf_idf, word, fill = cluster)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ cluster, scales = "free") +
  scale_y_reordered() +
  labs(x = "TF-IDF", y = NULL) +
  theme_minimal()
Step F. President distribution across clusters
Finally, we check whether the clusters track presidents or topics. If each president's speeches scatter across multiple clusters, that suggests topic, not speaker identity, is driving the grouping.
pres_order <- c("노태우", "김영삼", "김대중", "노무현", "이명박", "박근혜", "문재인")

pres_cluster <- tokens |>
  distinct(doc_id, president) |>
  left_join(doc_clusters, by = "doc_id") |>
  count(president, cluster)

pres_cluster |>
  mutate(president = factor(president, levels = pres_order)) |>
  ggplot(aes(x = president, y = n, fill = cluster)) +
  geom_col(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(x = NULL, y = "Share of speeches", fill = "Cluster") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
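What geom_col(position = "fill") plots can also be read off numerically as within-president shares; a minimal sketch with invented counts:

```r
library(dplyr)

# Toy counts of speeches per (president, cluster); data invented for illustration.
toy_counts <- tibble(
  president = c("A", "A", "B", "B"),
  cluster   = c("1", "2", "1", "2"),
  n         = c(30, 10, 5, 45)
)

# Within-president shares: each president's bars sum to 1.
shares <- toy_counts |>
  group_by(president) |>
  mutate(share = n / sum(n)) |>
  ungroup()
shares
# President A leans toward cluster 1 (0.75), B toward cluster 2 (0.90).
```

A table like this makes it easy to quantify the scatter, for instance by checking whether any president places more than, say, 80% of their speeches in a single cluster.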