K-Means Clustering: Presidential Speeches
Step through a k-means clustering of 749 presidential speeches. Each step shows how the algorithm finds structure in the data and what that structure means.
Step A. Load packages and data
We need four packages. elbird wraps the Kiwi Korean morphological analyzer; install it with install.packages("elbird"). The first run downloads the Kiwi model automatically.
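If any of the packages are missing, a one-time setup would look like this (a sketch, assuming all four are available from CRAN):

```r
# One-time setup; elbird fetches the Kiwi model on first use.
install.packages(c("tidyverse", "tidytext", "elbird", "cluster"))
```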
library(tidyverse)
library(tidytext)
library(elbird) # Korean morphological analysis (wraps Kiwi)
library(cluster) # for silhouette()
corpus <- read_csv("data/president_speeches/president_speeches_democratic_era.csv")
stopwords_ko <- read_lines("data/stopwords_ko.txt") |> str_trim() |> discard(~ .x == "")
Step B. Tokenize and filter
We tokenize each speech with Kiwi, keep only nouns (NNG, NNP), and remove stopwords and single-character tokens. This is the same preprocessing pipeline from earlier weeks and the same one used in our Orange workflows.
tokenize_kiwi <- function(text) {
  result <- tokenize(text, flatten = TRUE)
  tibble(form = result$form, tag = result$tag)
}
tokens <- corpus |>
  mutate(doc_id = row_number(),
         morphemes = map(speech_text, tokenize_kiwi)) |>
  unnest(morphemes) |>
  filter(tag %in% c("NNG", "NNP"),
         !form %in% stopwords_ko,
         str_length(form) >= 2,
         !str_detect(form, "^[0-9]+$")) |>
  select(doc_id, president, kind, word = form)

# Check per-president speech counts
tokens |> distinct(doc_id, president) |> count(president)
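The filtering rules in Step B can be checked in isolation on a hand-made token table (toy data, so no Kiwi model is needed; the tags and words here are invented for illustration):

```r
library(dplyr)
library(stringr)

# Toy morpheme table mimicking tokenize_kiwi() output.
toy <- tibble(
  form = c("경제", "것", "2024", "발전", "하다"),
  tag  = c("NNG", "NNB", "SN", "NNG", "VV")
)
toy_stopwords <- c("발전")

kept <- toy |>
  filter(
    tag %in% c("NNG", "NNP"),        # nouns only
    !form %in% toy_stopwords,        # drop stopwords
    str_length(form) >= 2,           # drop single-character tokens
    !str_detect(form, "^[0-9]+$")    # drop pure numbers
  )
kept$form  # only "경제" survives all four rules
```

Each of the other four tokens is removed by exactly one rule, which makes this a quick way to sanity-check a stopword list before running the full pipeline.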
Step C. Build TF-IDF matrix
We weight word counts with TF-IDF (same technique from Week 4), but first we filter the vocabulary: words must appear in at least 5 speeches (not too rare) and no more than 60% of speeches (not too common — effectively stopwords). This keeps the feature space focused on words that actually differentiate speeches.
n_docs <- tokens |> distinct(doc_id) |> nrow()

word_doc_freq <- tokens |>
  distinct(doc_id, word) |>
  count(word, name = "doc_freq")

keep_words <- word_doc_freq |>
  filter(doc_freq >= 5, doc_freq <= 0.6 * n_docs) |>
  pull(word)

tokens_filtered <- tokens |> filter(word %in% keep_words)

tfidf <- tokens_filtered |>
  count(doc_id, word) |>
  bind_tf_idf(word, doc_id, n)

dtm <- tfidf |>
  select(doc_id, word, tf_idf) |>
  pivot_wider(names_from = word, values_from = tf_idf, values_fill = 0)

mat <- dtm |> select(-doc_id) |> as.matrix()
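To see why overly common words are useless as features, here is bind_tf_idf on a three-document toy corpus (invented counts): a word appearing in every document gets idf = log(3/3) = 0, so its tf_idf is 0 and it contributes nothing to the clustering.

```r
library(dplyr)
library(tidytext)
library(tidyr)

# Toy per-document word counts (invented data).
toy_counts <- tribble(
  ~doc_id, ~word,  ~n,
  1,       "국민", 5,
  1,       "경제", 2,
  2,       "국민", 4,
  2,       "평화", 3,
  3,       "국민", 6,
  3,       "경제", 1
)

toy_tfidf <- toy_counts |> bind_tf_idf(word, doc_id, n)
# "국민" appears in all 3 docs -> idf = 0 -> tf_idf = 0 in every row
toy_tfidf |> filter(word == "국민") |> pull(tf_idf)

# Widening to a document-term matrix, exactly as in Step C:
toy_dtm <- toy_tfidf |>
  select(doc_id, word, tf_idf) |>
  pivot_wider(names_from = word, values_from = tf_idf, values_fill = 0)
```

The 60% document-frequency cap in Step C removes such words up front, before TF-IDF zeroes them out anyway.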
Step D. Choose k with silhouette scores
We run k-means on the TF-IDF matrix for k = 2 through 8 and compute the average silhouette score for each. The silhouette measures how well each speech fits its assigned cluster relative to the next-best cluster; higher is better. We pick the k with the highest average score, the same criterion used when selecting k in Orange's k-Means widget.
set.seed(42)  # kmeans starts are random; fix the seed for reproducibility
d <- dist(mat)  # compute the distance matrix once, not once per k

sil_results <- tibble(k = 2:8) |>
  mutate(
    km = map(k, ~ kmeans(mat, centers = .x, nstart = 10)),
    sil = map_dbl(km, function(m) {
      s <- silhouette(m$cluster, d)
      mean(s[, 3])  # column 3 holds each observation's silhouette width
    })
  )

best_k <- sil_results |> slice_max(sil, n = 1, with_ties = FALSE) |> pull(k)
best_k  # print it
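The silhouette computation itself is easy to see on toy 2D data (invented points): two well-separated blobs, labeled correctly, give an average silhouette close to 1.

```r
library(cluster)

set.seed(1)
# Two clearly separated blobs of 10 points each.
pts <- rbind(
  matrix(rnorm(20, mean = 0,  sd = 0.1), ncol = 2),
  matrix(rnorm(20, mean = 10, sd = 0.1), ncol = 2)
)
labels <- rep(1:2, each = 10)

sil <- silhouette(labels, dist(pts))
sil_mean <- mean(sil[, 3])
sil_mean  # close to 1: every point sits deep inside its own cluster
```

Shrinking the gap between the blobs, or splitting one blob across two labels, drags the average down; that is exactly the signal the k-selection loop above exploits.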
Step E. Run k-means and inspect clusters
Now we run k-means one final time with the best k. To understand what each cluster is "about," we re-apply TF-IDF at the cluster level — treating all speeches in a cluster as one big document — and extract the top distinctive words.
set.seed(42)
best_km <- kmeans(mat, centers = best_k, nstart = 10)
doc_clusters <- tibble(doc_id = dtm$doc_id, cluster = as.character(best_km$cluster))

# Top TF-IDF words per cluster (each cluster treated as one document)
cluster_tfidf <- tokens |>
  left_join(doc_clusters, by = "doc_id") |>
  count(cluster, word) |>
  bind_tf_idf(word, cluster, n) |>
  group_by(cluster) |>
  slice_max(tf_idf, n = 15, with_ties = FALSE) |>
  ungroup()
# Plot
cluster_tfidf |>
  mutate(word = reorder_within(word, tf_idf, cluster)) |>
  ggplot(aes(tf_idf, word, fill = cluster)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ cluster, scales = "free") +
  scale_y_reordered() +
  labs(x = "TF-IDF", y = NULL) +
  theme_minimal()
Step F. President distribution across clusters
Finally, we check whether the clusters track presidents or topics. If each president's speeches scatter across multiple clusters, that suggests topic, not speaker identity, is driving the grouping.
pres_order <- c("노태우", "김영삼", "김대중", "노무현", "이명박", "박근혜", "문재인")

pres_cluster <- tokens |>
  distinct(doc_id, president) |>
  left_join(doc_clusters, by = "doc_id") |>
  count(president, cluster)

pres_cluster |>
  mutate(president = factor(president, levels = pres_order)) |>
  ggplot(aes(x = president, y = n, fill = cluster)) +
  geom_col(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(x = NULL, y = "Share of speeches", fill = "Cluster") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
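What geom_col(position = "fill") plots can also be read off numerically as within-president shares; a minimal sketch with invented counts:

```r
library(dplyr)

# Toy counts of speeches per (president, cluster); data invented for illustration.
toy_counts <- tibble(
  president = c("A", "A", "B", "B"),
  cluster   = c("1", "2", "1", "2"),
  n         = c(30, 10, 5, 45)
)

# Within-president shares: each president's bars sum to 1.
shares <- toy_counts |>
  group_by(president) |>
  mutate(share = n / sum(n)) |>
  ungroup()
shares
# President A leans toward cluster 1 (0.75), B toward cluster 2 (0.90).
```

A table like this makes it easy to quantify the scatter, for instance by checking whether any president places more than, say, 80% of their speeches in a single cluster.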