From Word Counts to Patterns: Hierarchical Clustering

Connecting Descriptive Statistics to Pattern Discovery

by Steven Denney


The Bridge: From Description to Discovery

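So far, you’ve learned to describe a corpus with word counts: turning each document into a bag of words and summarizing it with descriptive statistics.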

Now the question becomes:

“Can we discover patterns in our corpus without manually reading everything?”

This is where clustering comes in.


What Is Hierarchical Clustering?

The Basic Idea:

Hierarchical clustering asks:

“Which documents are most similar to each other based on the words they use?”

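It automatically groups documents into clusters where documents in the same cluster use similar words, and documents in different clusters use different words.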

Why “Hierarchical”?

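Unlike simply dividing documents into groups, hierarchical clustering also shows how those groups relate: small clusters nest inside larger ones, all the way up to a single cluster that contains every document.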

Think of it like a family tree for your documents.


The Connection: Bag of Words → Distance → Clustering

Step 1: Bag of Words (What You Already Know)

Remember: Each document becomes a vector of word counts.

Document | 역사 (history) | 고구려 (Goguryeo) | 독립 (independence) | 민족 (nation) | 문화 (culture)
Doc 1    | 5             | 12               | 0                  | 1            | 3
Doc 2    | 4             | 10               | 1                  | 2            | 4
Doc 3    | 3             | 0                | 15                 | 8            | 2

Step 2: Calculate Similarity (or Distance)

The key insight: Documents with similar word counts are “close” to each other.

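Example: In the table above, Doc 1 and Doc 2 have similar counts (both high on 고구려, zero or near zero on 독립), so they are close. Doc 3 (high on 독립 and 민족, zero on 고구려) is far from both.

Distance measures: Common choices include Euclidean distance (straight-line distance between count vectors) and cosine distance (which compares word proportions and ignores document length).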
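To make the idea of "close" concrete, here is a minimal Python sketch that computes two widely used measures, Euclidean and cosine distance, for the three documents in the table above. The function names are illustrative, and the choice of these two measures is an assumption rather than something this lesson prescribes.

```python
import math

# Word-count vectors from the table above
# (columns: 역사, 고구려, 독립, 민족, 문화).
docs = {
    "Doc 1": [5, 12, 0, 1, 3],
    "Doc 2": [4, 10, 1, 2, 4],
    "Doc 3": [3, 0, 15, 8, 2],
}

def euclidean_distance(a, b):
    """Straight-line distance between two count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 minus cosine similarity: compares word proportions, not length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

pairs = [("Doc 1", "Doc 2"), ("Doc 1", "Doc 3"), ("Doc 2", "Doc 3")]
for first, second in pairs:
    a, b = docs[first], docs[second]
    print(f"{first} vs {second}: "
          f"euclidean={euclidean_distance(a, b):.2f}, "
          f"cosine={cosine_distance(a, b):.2f}")
```

Under both measures, Doc 1 and Doc 2 come out far closer to each other than either is to Doc 3.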

Step 3: Group Similar Documents

Hierarchical clustering algorithm (the agglomerative, bottom-up version):

  1. Start: Each document is its own cluster
  2. Find the two most similar documents → merge into one cluster
  3. Repeat: Find next most similar (documents or clusters) → merge
  4. Continue until everything is in one big cluster
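The four steps above can be sketched in plain Python. This is a teaching sketch, not an efficient implementation. It assumes Euclidean distance and average linkage (the distance between two clusters is the mean distance over all cross-cluster document pairs). The steps above do not name a linkage rule, so that choice is an assumption, and all function names are illustrative.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(cluster_a, cluster_b, vectors):
    """Mean distance over all cross-cluster document pairs (assumed rule)."""
    total = sum(euclidean(vectors[i], vectors[j])
                for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def agglomerative_clustering(vectors):
    """Return the merge history as (members_a, members_b, distance)."""
    # Step 1: every document starts as its own cluster.
    clusters = [(name,) for name in vectors]
    merges = []
    # Step 4: keep going until a single cluster remains.
    while len(clusters) > 1:
        # Steps 2 and 3: find the closest pair of clusters and merge it.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(pair[0], pair[1], vectors))
        merges.append((a, b, average_linkage(a, b, vectors)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

vectors = {
    "Doc 1": [5, 12, 0, 1, 3],
    "Doc 2": [4, 10, 1, 2, 4],
    "Doc 3": [3, 0, 15, 8, 2],
}
merges = agglomerative_clustering(vectors)
for a, b, dist in merges:
    print(f"merge {a} + {b} at distance {dist:.2f}")
```

The recorded merge distances are the heights at which branches join in a dendrogram. In practice you would likely use a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` rather than this loop.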

The Dendrogram: Reading the Family Tree

What You See:

        ┌─────────────┐
        │             │
    ┌───┴───┐     ┌───┴────┐
    │       │     │        │
  ┌─┴─┐   ┌─┴─┐ ┌─┴─┐   ┌──┴──┐
  │   │   │   │ │   │   │     │
 D1  D2  D3  D4 D5  D6  D7    D8

How to Read It:

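Height = Distance:

The height at which two branches join is the distance between the groups they connect. Low joins mean very similar documents; high joins mean dissimilar groups.

Example:

In the tree above, D1 and D2 join at the lowest level, so they are each other's closest match. The D1 to D4 group and the D5 to D8 group join only at the very top, so those two groups are the least similar.

Cutting the tree:

Draw a horizontal line across the dendrogram at a chosen height. Each branch the line crosses becomes one cluster. In the tree above, cutting just below the top gives 2 clusters; cutting one level lower gives 4.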
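Cutting a tree at a chosen height can be sketched in a few lines of Python. The merge history below is hypothetical: it mirrors the shape of the eight-document tree drawn above, but the join heights (1.0, 2.0, 3.0) are made up for illustration, and `cut_tree` is an illustrative helper, not a library function.

```python
# Hypothetical merge history for the eight-document tree above.
# Each entry: (a document in the left branch, a document in the right
# branch, height of the join). The heights are invented for illustration.
merges = [
    ("D1", "D2", 1.0), ("D3", "D4", 1.0), ("D5", "D6", 1.0), ("D7", "D8", 1.0),
    ("D1", "D3", 2.0), ("D5", "D7", 2.0),
    ("D1", "D5", 3.0),
]

def cut_tree(documents, merges, height):
    """Apply only the joins below `height`; return the resulting clusters."""
    parent = {doc: doc for doc in documents}

    def find(doc):
        # Follow parent links up to the cluster's representative document.
        while parent[doc] != doc:
            doc = parent[doc]
        return doc

    for left, right, join_height in merges:
        if join_height < height:
            parent[find(right)] = find(left)

    clusters = {}
    for doc in documents:
        clusters.setdefault(find(doc), []).append(doc)
    return list(clusters.values())

documents = [f"D{i}" for i in range(1, 9)]
print(cut_tree(documents, merges, height=2.5))  # two clusters
print(cut_tree(documents, merges, height=1.5))  # four clusters
```

Lowering the cut height yields more, smaller clusters; raising it yields fewer, larger ones.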


From Bag of Words to Interpretation: The Full Pipeline

Step-by-Step Process:

1. Preprocessing (What gets counted matters!)

Raw Text → Clean → Tokenize → POS Filter → Stopwords → Processed Text

2. Bag of Words (Counting)

Processed Text → Word Counts per Document
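Steps 1 and 2 can be sketched together in Python. This is a simplified sketch: the stopword list and sample sentences are hypothetical, tokenization is by whitespace only, and the POS filter is left out because for Korean it would normally require a morphological analyzer (for example, one of those wrapped by the KoNLPy package).

```python
import re
from collections import Counter

# Hypothetical stopword list and sample texts, for illustration only.
STOPWORDS = {"그리고", "그러나", "이", "그"}
raw_documents = {
    "Doc A": "고구려 역사, 그리고 고구려 문화!",
    "Doc B": "독립 운동 그리고 민족 독립.",
}

def preprocess(text):
    """Clean -> tokenize -> remove stopwords (POS filtering omitted)."""
    cleaned = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    tokens = cleaned.split()                  # whitespace tokenization
    return [token for token in tokens if token not in STOPWORDS]

def bag_of_words(documents):
    """Return (vocabulary, {document name: count vector})."""
    counts = {name: Counter(preprocess(text))
              for name, text in documents.items()}
    vocabulary = sorted(set().union(*counts.values()))
    vectors = {name: [c[word] for word in vocabulary]
               for name, c in counts.items()}
    return vocabulary, vectors

vocabulary, vectors = bag_of_words(raw_documents)
print(vocabulary)
for name, vector in vectors.items():
    print(name, vector)
```

Every document ends up with a count vector over the same shared vocabulary, which is what makes the vectors comparable in the distance step.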

3. Distance Calculation (Measuring similarity)

Bag of Words → Calculate how similar/different each pair of documents is

4. Hierarchical Clustering (Pattern discovery)

Distances → Dendrogram showing document relationships

5. Interpretation (The hard part!)

Clusters → Read sample documents → Identify what they have in common

Example: Korean History Textbook Corpus

Your Data:

The Clustering Process:

After running Bag of Words → Distance → Hierarchical Clustering:

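Suppose the dendrogram shows 3 main clusters (illustrative examples only, not real results):

Cluster 1 (15 documents):
├─ High counts of 기원전 (BCE), 삼국시대 (Three Kingdoms period), 고구려 (Goguryeo), 백제 (Baekje)
└─ Ancient history chapters

Cluster 2 (20 documents):
├─ High counts of 독립 (independence), 일제 (Japanese imperial rule), 저항 (resistance), 만세 (manse, "long live")
└─ Japanese occupation period

Cluster 3 (16 documents):
├─ High counts of 민주화 (democratization), 경제 (economy), 산업화 (industrialization), 발전 (development)
└─ Modern Korea chapters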

Key Questions for Analysis

1. Descriptive Questions (What do you see?)

About the dendrogram:

About the documents:

2. Investigative Questions (What’s in the clusters?)

Look at word frequencies by cluster:
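One way to look at word frequencies by cluster is to add up the word counts of all documents in each cluster and list the most frequent words. The sketch below reuses the counts from the three-document table earlier in this lesson. The cluster assignment (`cluster_of`) is hypothetical, and the function name is illustrative.

```python
from collections import Counter

# Per-document word counts (zero counts omitted).
doc_counts = {
    "Doc 1": Counter({"역사": 5, "고구려": 12, "민족": 1, "문화": 3}),
    "Doc 2": Counter({"역사": 4, "고구려": 10, "독립": 1, "민족": 2, "문화": 4}),
    "Doc 3": Counter({"역사": 3, "독립": 15, "민족": 8, "문화": 2}),
}
# Hypothetical cluster labels, e.g. obtained by cutting the dendrogram.
cluster_of = {"Doc 1": 1, "Doc 2": 1, "Doc 3": 2}

def top_words_by_cluster(doc_counts, cluster_of, n=3):
    """Sum counts within each cluster and return its n most frequent words."""
    totals = {}
    for doc, counts in doc_counts.items():
        totals.setdefault(cluster_of[doc], Counter()).update(counts)
    return {cluster: total.most_common(n) for cluster, total in totals.items()}

for cluster, words in top_words_by_cluster(doc_counts, cluster_of).items():
    print(f"Cluster {cluster}: {words}")
```

Raw totals favor words that are frequent everywhere (here, 역사 appears in both lists), which is one reason to also check a weighting such as TF-IDF before labeling a cluster.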

Look at actual documents:

3. Interpretive Questions (What does it mean?)

About the content:

About the corpus:

4. Critical Questions (What are the limitations?)

About the method:


From Clustering to Insight: The Interpretive Process

Step 1: Observe the Clusters (Descriptive)

“The dendrogram shows 3 main clusters. Cluster 1 has 15 documents that join at height 0.3.”

Step 2: Investigate the Content (Analytical)

“Examining word frequencies in Cluster 1, I see high counts of 삼국시대, 고구려, 백제. Looking at TF-IDF, these terms have scores above 4.5, indicating they’re distinctive for this cluster.”
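For readers who want to see where a TF-IDF score comes from, here is a minimal sketch using one common variant: raw term count multiplied by the natural log of (number of documents / number of documents containing the term). This variant is an assumption. The lesson does not say which weighting produced the 4.5 in the example sentence, and other variants give scores on a different scale. The counts are from the three-document table earlier in this lesson.

```python
import math

vocabulary = ["역사", "고구려", "독립", "민족", "문화"]
counts = {
    "Doc 1": [5, 12, 0, 1, 3],
    "Doc 2": [4, 10, 1, 2, 4],
    "Doc 3": [3, 0, 15, 8, 2],
}

def tf_idf(counts):
    """Raw count times ln(N / document frequency), one common variant."""
    n_docs = len(counts)
    n_terms = len(next(iter(counts.values())))
    # Document frequency: how many documents contain each term at least once.
    df = [sum(1 for vec in counts.values() if vec[i] > 0)
          for i in range(n_terms)]
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    return {doc: [tf * weight for tf, weight in zip(vec, idf)]
            for doc, vec in counts.items()}

scores = tf_idf(counts)
for doc, vec in scores.items():
    ranked = sorted(zip(vocabulary, vec), key=lambda item: item[1], reverse=True)
    print(doc, [(word, round(score, 2)) for word, score in ranked[:2]])
```

Words that occur in every document (such as 역사 here) get a score of zero under this variant, so only vocabulary that separates documents stands out.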

Step 3: Interpret the Pattern (Interpretive)

“Cluster 1 appears to represent chapters about ancient Korean kingdoms. These chapters cluster together because they share specialized historical vocabulary about the Three Kingdoms period.”

Step 4: Validate and Refine (Critical)

“Reading sample documents confirms this interpretation. However, one document about the early Joseon period also appears here, possibly because it references the Three Kingdoms for historical context.”

Step 5: Consider Implications (Synthetic)

“The clear separation between ancient, colonial, and modern clusters suggests Korean history textbooks organize content into distinct narrative eras with little vocabulary overlap. This may reflect how Korean history education emphasizes periodization.”