From Word Counts to Patterns: Hierarchical Clustering
Connecting Descriptive Statistics to Pattern Discovery
by Steven Denney
The Bridge: From Description to Discovery
So far, you’ve learned:
- How to count words (frequency)
- How to identify distinctive words (TF-IDF)
- How to represent text as numbers (Bag of Words)
Now the question becomes:
“Can we discover patterns in our corpus without manually reading everything?”
This is where clustering comes in.
What Is Hierarchical Clustering?
The Basic Idea:
Hierarchical Clustering asks:
“Which documents are most similar to each other based on the words they use?”
It automatically groups documents into clusters where:
- Documents within a cluster share similar vocabulary
- Documents in different clusters use different vocabulary
Why “Hierarchical”?
Rather than only dividing documents into flat groups, hierarchical clustering shows how those groups relate:
- Some documents are very similar (they merge early, low in the tree)
- Some are somewhat similar (they merge later, higher in the tree)
- Some are quite different (they merge only at the very top)
Think of it like a family tree for your documents.
The Connection: Bag of Words → Distance → Clustering
Step 1: Bag of Words (What You Already Know)
Remember: Each document becomes a vector of word counts
| Document | 역사 (history) | 고구려 (Goguryeo) | 독립 (independence) | 민족 (nation) | 문화 (culture) |
|---|---|---|---|---|---|
| Doc 1 | 5 | 12 | 0 | 1 | 3 |
| Doc 2 | 4 | 10 | 1 | 2 | 4 |
| Doc 3 | 3 | 0 | 15 | 8 | 2 |
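As a reminder of how such a table is produced, here is a minimal sketch using only Python's standard library. The two token lists are made-up toy data (not chapters from the real corpus), and `to_vector` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical, already-tokenized documents (toy data for illustration only).
docs = {
    "Doc A": ["고구려", "역사", "고구려", "문화"],
    "Doc B": ["독립", "민족", "독립", "역사"],
}

# Vocabulary: every distinct token, in a fixed (sorted) column order.
vocab = sorted({token for tokens in docs.values() for token in tokens})

def to_vector(tokens, vocab):
    """Return one document's word counts, one entry per vocabulary word."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

# One row of counts per document, like the rows of the table above.
bow = {name: to_vector(tokens, vocab) for name, tokens in docs.items()}

for name, vector in bow.items():
    print(name, dict(zip(vocab, vector)))
```

Every document gets a vector of the same length because all of them share one vocabulary; a word that a document never uses simply gets a count of 0.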
Step 2: Calculate Similarity (or Distance)
The key insight: Documents with similar word counts are “close” to each other.
Example:
- Doc 1 and Doc 2 both have high 고구려 counts (12 and 10) → similar (close)
- Doc 3 has a high 독립 count (15) and no 고구려 → different (far)
Distance measures:
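- Cosine distance: Compares the relative mix of words, regardless of document length (the most common choice for text)
- Euclidean distance: Straight-line distance between count vectors in multi-dimensional space (sensitive to document length)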
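To make the two measures concrete, the sketch below computes both for the three example documents in the table above. It uses only the standard library; the function names are illustrative choices, not part of any particular package.

```python
import math

# Word-count vectors from the table above (columns: 역사, 고구려, 독립, 민족, 문화).
docs = {
    "Doc 1": [5, 12, 0, 1, 3],
    "Doc 2": [4, 10, 1, 2, 4],
    "Doc 3": [3, 0, 15, 8, 2],
}

def euclidean_distance(a, b):
    """Straight-line distance between two count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 minus the cosine of the angle between two count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

for first, second in [("Doc 1", "Doc 2"), ("Doc 1", "Doc 3"), ("Doc 2", "Doc 3")]:
    cos = cosine_distance(docs[first], docs[second])
    euc = euclidean_distance(docs[first], docs[second])
    print(f"{first} vs {second}: cosine={cos:.3f}, euclidean={euc:.2f}")
```

With these counts, the cosine distance between Doc 1 and Doc 2 is close to 0 (about 0.02), while the distances from either of them to Doc 3 are much larger (about 0.88 and 0.75). Both measures agree here that Doc 3 is the outlier.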
Step 3: Group Similar Documents
Hierarchical clustering algorithm:
- Start: Each document is its own cluster
- Find the two most similar documents → merge into one cluster
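- Repeat: Find the next most similar pair (documents or clusters) → merge. How the distance between two clusters is measured depends on the linkage rule (for example single, complete, or average linkage)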
- Continue until everything is in one big cluster
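The merge loop described above can be sketched in a few lines of plain Python. This is a teaching sketch under two assumptions that the slides do not fix: cosine distance between documents, and average linkage (mean pairwise distance) between clusters. The function names are hypothetical.

```python
import math
from itertools import combinations

# Word-count vectors for the three example documents.
docs = {
    "Doc 1": [5, 12, 0, 1, 3],
    "Doc 2": [4, 10, 1, 2, 4],
    "Doc 3": [3, 0, 15, 8, 2],
}

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def average_linkage(c1, c2, vectors):
    """Mean pairwise distance between the members of two clusters (assumed linkage rule)."""
    pairs = [(i, j) for i in c1 for j in c2]
    return sum(cosine_distance(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

def agglomerate(vectors):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [(name,) for name in vectors]  # start: each document is its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the closest pair among the current clusters.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_linkage(pair[0], pair[1], vectors))
        height = average_linkage(c1, c2, vectors)
        # Replace the two clusters with their union.
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        merges.append((c1, c2, height))
    return merges

merges = agglomerate(docs)
for left, right, height in merges:
    print(f"merge {left} + {right} at height {height:.3f}")
```

Doc 1 and Doc 2 merge first at a low height, and Doc 3 joins them only in the final merge, at a much greater height. For a real corpus, a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` would normally replace this hand-written loop.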
The Dendrogram: Reading the Family Tree
What You See:
      ┌───────────────┐
      │               │
  ┌───┴───┐       ┌───┴────┐
  │       │       │        │
┌─┴─┐   ┌─┴─┐   ┌─┴─┐   ┌──┴──┐
│   │   │   │   │   │   │     │
D1  D2  D3  D4  D5  D6  D7    D8
How to Read It:
Height = Distance (dissimilarity) at which two groups merge:
- Documents joined low = very similar
- Documents joined high = less similar
Example (reading heights in a real dendrogram):
- If D1 and D2 join first (lowest) → they are the most similar pair of documents
- If D7 and D8 join higher up → they are less similar to each other
- The final join at the top → links the two least similar groups
Cutting the tree:
- Cut low → many small clusters (fine-grained)
- Cut high → few large clusters (broad categories)
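The effect of the cut height can be sketched as follows. The merge history and its heights are invented for illustration (they are not measured from any corpus), and `cut_tree` is a hypothetical helper: it keeps only the merges that lie at or below the cut.

```python
# Hypothetical merge history for eight documents: (left members, right members, height).
leaves = ["D1", "D2", "D3", "D4", "D5", "D6", "D7", "D8"]
merges = [
    (["D1"], ["D2"], 0.10),
    (["D3"], ["D4"], 0.20),
    (["D5"], ["D6"], 0.25),
    (["D7"], ["D8"], 0.40),
    (["D1", "D2"], ["D3", "D4"], 0.50),
    (["D5", "D6"], ["D7", "D8"], 0.60),
    (["D1", "D2", "D3", "D4"], ["D5", "D6", "D7", "D8"], 0.90),
]

def cut_tree(leaves, merges, threshold):
    """Return the clusters that remain when the dendrogram is cut at `threshold`."""
    cluster_of = {leaf: frozenset([leaf]) for leaf in leaves}
    for left, right, height in merges:
        if height > threshold:
            continue  # this merge lies above the cut, so it is not applied
        merged = frozenset(left) | frozenset(right)
        for leaf in merged:
            cluster_of[leaf] = merged
    return sorted(sorted(cluster) for cluster in set(cluster_of.values()))

for threshold in (0.05, 0.30, 0.70, 1.00):
    clusters = cut_tree(leaves, merges, threshold)
    print(f"cut at {threshold:.2f}: {len(clusters)} clusters -> {clusters}")
```

With these invented heights, a cut at 0.30 leaves five small clusters, a cut at 0.70 leaves two broad ones, and a cut above the top merge leaves a single cluster containing everything.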
From Bag of Words to Interpretation: The Full Pipeline
Step-by-Step Process:
1. Preprocessing (What gets counted matters!)
Raw Text → Clean → Tokenize → POS Filter → Remove Stopwords → Processed Text
- Decisions here affect clustering results
- Different preprocessing = different patterns
2. Bag of Words (Counting)
Processed Text → Word Counts per Document
- Each document = row
- Each word = column
- Values = counts (or TF-IDF scores)
3. Distance Calculation (Measuring similarity)
Bag of Words → Calculate how similar/different each pair of documents is
- Creates a distance matrix
- Shows which documents are “close” vs. “far”
4. Hierarchical Clustering (Pattern discovery)
Distances → Dendrogram showing document relationships
- Reveals hidden structure
- Shows groupings you might not have noticed
5. Interpretation (The hard part!)
Clusters → Read sample documents → Identify what they have in common
- Descriptive: “These 12 documents form Cluster A”
- Interpretive: “Cluster A focuses on ancient Korean history”
- Evidence: Look at TF-IDF words, read actual documents
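Steps 1 to 3 of the pipeline can be sketched together to show why preprocessing decisions matter. The sketch below is a toy example under stated assumptions: two invented token lists, a single hypothetical stopword (것, "thing"), cosine distance, and illustrative function names.

```python
import math
from collections import Counter

def bag_of_words(token_lists, stopwords=frozenset()):
    """Count tokens per document after dropping stopwords; one row per document."""
    kept = [[t for t in tokens if t not in stopwords] for tokens in token_lists]
    vocab = sorted({t for tokens in kept for t in tokens})
    counts = [Counter(tokens) for tokens in kept]
    return [[c[word] for word in vocab] for c in counts]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumption: treat an empty document as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

def distance_matrix(vectors):
    """Pairwise distances between all documents (symmetric, zeros on the diagonal)."""
    n = len(vectors)
    return [[cosine_distance(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]

# Toy documents: different topics, but both padded with the same filler word.
docs = [
    ["고구려", "백제", "것", "것", "것"],
    ["독립", "만세", "것", "것", "것"],
]

raw = distance_matrix(bag_of_words(docs))
filtered = distance_matrix(bag_of_words(docs, stopwords={"것"}))

print(f"distance with the filler word kept:    {raw[0][1]:.3f}")
print(f"distance with the filler word removed: {filtered[0][1]:.3f}")
```

With the filler word kept, the two documents look fairly similar because they share it; once it is removed, they share no vocabulary at all and the distance rises to 1.0. The same corpus can therefore cluster differently depending on what is counted.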
Example: Korean History Textbook Corpus
Your Data:
- 51 chapters from Korean history textbooks
- After preprocessing: clean Korean morphemes
- Each chapter described by word counts
The Clustering Process:
After running Bag of Words → Distance → Hierarchical Clustering:
Suppose the dendrogram shows 3 main clusters (illustrative, not actual results):
Cluster 1 (15 documents):
├─ High 기원전 (BCE), 삼국시대 (Three Kingdoms period), 고구려 (Goguryeo), 백제 (Baekje)
└─ Ancient history chapters
Cluster 2 (20 documents):
├─ High 독립 (independence), 일제 (Imperial Japan), 저항 (resistance), 만세 (manse, a rallying cry)
└─ Japanese occupation period
Cluster 3 (16 documents):
├─ High 민주화 (democratization), 경제 (economy), 산업화 (industrialization), 발전 (development)
└─ Modern Korea chapters
Key Questions for Analysis
1. Descriptive Questions (What do you see?)
About the dendrogram:
- How many major clusters are there?
- Which documents cluster together?
- Are some clusters tighter (more similar internally) than others?
About the documents:
- How many documents in each cluster?
- Which cluster is largest/smallest?
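Cluster size and "tightness" can both be read off numerically. The sketch below assumes you already have cluster assignments and pairwise distances; the values are invented, and mean within-cluster distance is only one possible way to define tightness.

```python
from itertools import combinations
from statistics import mean

# Hypothetical cluster assignments and within-cluster pairwise distances.
clusters = {
    "A": ["D1", "D2", "D3"],
    "B": ["D4", "D5"],
}
distances = {
    frozenset(["D1", "D2"]): 0.10,
    frozenset(["D1", "D3"]): 0.20,
    frozenset(["D2", "D3"]): 0.15,
    frozenset(["D4", "D5"]): 0.40,
}

def cluster_summary(clusters, distances):
    """Report each cluster's size and its mean pairwise internal distance."""
    summary = {}
    for label, members in clusters.items():
        pairs = list(combinations(sorted(members), 2))
        # A single-document cluster has no pairs; report 0.0 in that case.
        tightness = mean(distances[frozenset(p)] for p in pairs) if pairs else 0.0
        summary[label] = {"size": len(members), "mean_distance": tightness}
    return summary

summary = cluster_summary(clusters, distances)
for label, stats in summary.items():
    print(label, stats)
```

A lower mean distance indicates a tighter cluster. In this toy data, cluster A is both larger and tighter than cluster B.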
2. Investigative Questions (What’s in the clusters?)
Look at word frequencies by cluster:
- What are the most frequent words in Cluster 1?
- Which words (by TF-IDF score) distinguish Cluster 1 from Cluster 2?
- Do the word patterns make sense?
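One way to answer the frequency questions is to pool the tokens of all documents in a cluster and list the most common ones. The sketch below uses invented chapter names, token lists, and cluster labels; `top_words_by_cluster` is a hypothetical helper.

```python
from collections import Counter, defaultdict

# Hypothetical tokenized chapters and their cluster labels (toy data).
tokens_by_doc = {
    "ch01": ["고구려", "백제", "고구려"],
    "ch02": ["고구려", "삼국시대", "백제", "고구려"],
    "ch03": ["독립", "만세", "독립"],
    "ch04": ["독립", "일제", "일제"],
}
cluster_of = {"ch01": 1, "ch02": 1, "ch03": 2, "ch04": 2}

def top_words_by_cluster(tokens_by_doc, cluster_of, n=2):
    """Pool token counts per cluster and return the n most frequent words in each."""
    totals = defaultdict(Counter)
    for doc, tokens in tokens_by_doc.items():
        totals[cluster_of[doc]].update(tokens)
    return {label: [word for word, _ in counts.most_common(n)]
            for label, counts in totals.items()}

top = top_words_by_cluster(tokens_by_doc, cluster_of)
for label, words in sorted(top.items()):
    print(f"Cluster {label}: {words}")
```

Raw frequency shows what a cluster talks about most. To find what sets one cluster apart from another, compare these lists across clusters or rank the words by TF-IDF instead.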
Look at actual documents:
- Read 2-3 sample documents from each cluster
- What do they have in common?
- Are there any “odd” documents that seem misplaced?
3. Interpretive Questions (What does it mean?)
About the content:
- What is each cluster “about”?
- Can you give each cluster a meaningful label?
- Why might these documents group together?
About the corpus:
- Does the clustering reveal structure in your corpus?
- Do clusters correspond to known categories (time periods, topics)?
- What does this tell you about how Korean history is narrated?
4. Critical Questions (What are the limitations?)
About the method:
- Could different preprocessing produce different clusters?
- Are there meaningful documents that didn’t cluster as expected?
- What does clustering miss? (Context, narrative structure, argument)
From Clustering to Insight: The Interpretive Process
Step 1: Observe the Clusters (Descriptive)
“The dendrogram shows 3 main clusters. Cluster 1 has 15 documents that join at height 0.3.”
Step 2: Investigate the Content (Analytical)
“Examining word frequencies in Cluster 1, I see high counts of 삼국시대, 고구려, 백제. Looking at TF-IDF, these terms have scores above 4.5, indicating they’re distinctive for this cluster.”
Step 3: Interpret the Pattern (Interpretive)
“Cluster 1 appears to represent chapters about ancient Korean kingdoms. These chapters cluster together because they share specialized historical vocabulary about the Three Kingdoms period.”
Step 4: Validate and Refine (Critical)
“Reading sample documents confirms this interpretation. However, one document about the early Joseon period also appears here, possibly because it references the Three Kingdoms for historical context.”
Step 5: Consider Implications (Synthetic)
“The clear separation between ancient, colonial, and modern clusters suggests Korean history textbooks organize content into distinct narrative eras with little vocabulary overlap. This may reflect how Korean history education emphasizes periodization.”