Clustering & Vectorization
Digital Humanities: Text-as-Data (BA3 – Korean Studies, Leiden University)
This document expands on the Week 5 slides. It explains how text is converted into numerical vectors, how distance metrics define similarity, and how clustering algorithms use these distances to reveal structure in an unlabeled corpus. It is meant to sit alongside your topic modeling explainer.
1. What Clustering Does
Clustering is an unsupervised method: the algorithm is not told what categories exist. Instead, it discovers structure based on similarity between items.
In text-as-data, clustering answers:
Which documents “speak” in similar ways, based on their numerical representations?
Two core principles:
- Cohesion: items within a cluster should be close together.
- Separation: different clusters should be far apart.
These ideas only make sense once documents have been transformed into vectors.
2. Why Distance Matters
Clustering is fundamentally geometric. Distance defines the geometry.
Euclidean distance (L2)
Straight-line distance between vectors:
\[d(A, B) = \sqrt{\sum_i (A_i - B_i)^2}\]
Euclidean distance is sensitive to vector magnitude. It works well when absolute length (e.g. document size, overall frequency) is meaningful.
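As a quick illustration, here is the formula in NumPy (NumPy and the toy vectors are assumptions for this sketch, not part of the Orange workflow):

```python
import numpy as np

# Two toy document vectors (e.g. raw term counts over a 4-word vocabulary).
A = np.array([3.0, 0.0, 1.0, 2.0])
B = np.array([1.0, 1.0, 0.0, 2.0])

# Straight-line (L2) distance: square root of the summed squared differences.
euclidean = np.sqrt(np.sum((A - B) ** 2))
print(euclidean)  # same result as np.linalg.norm(A - B)
```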
Cosine similarity / cosine distance
Cosine measures angle, not magnitude:
\[\text{cosine}(A,B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}\]
Cosine distance is:
\[\text{cosine\_distance}(A,B) = 1 - \text{cosine}(A,B)\]
Interpretation:
- Small angle (cosine close to 1) → similar distribution of features.
- Large angle (cosine near 0 or negative) → different distribution of features.
For text, cosine is usually preferred because document length varies a lot and we care more about relative term proportions than raw counts.
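A small NumPy sketch of both measures; the point is that cosine ignores length while Euclidean distance does not (the vectors are invented):

```python
import numpy as np

def cosine_similarity(a, b):
    # Angle-based similarity: dot product divided by the product of vector lengths.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

A = np.array([3.0, 0.0, 1.0, 2.0])
A_long = 10 * A  # the "same" document repeated ten times over

print(cosine_similarity(A, A_long))      # 1.0 -> identical term proportions
print(1 - cosine_similarity(A, A_long))  # cosine distance of 0.0
print(np.linalg.norm(A - A_long))        # large Euclidean distance despite the same profile
```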
3. Why Text Must Become Numbers
Computers cannot operate on raw text. Every text-as-data method relies on converting documents into numeric representations in a common space.
Two main families:
- Lexical (sparse vectors)
  - Bag-of-Words (BoW)
  - TF–IDF
- Semantic (dense vectors)
  - Word embeddings
  - Document embeddings
Your choice of representation largely determines what “similarity” means.
4. Route A: Bag-of-Words and TF–IDF
Bag-of-Words (BoW)
BoW represents each document by counts of each word in a fixed vocabulary:
- Extremely high-dimensional (one dimension per term).
- Mostly zeros (sparse).
- Captures lexical overlap: documents are similar if they use many of the same words.
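For concreteness, a minimal sketch of a BoW matrix using scikit-learn's CountVectorizer (scikit-learn and the toy sentences are assumptions; in the course you build this with Orange's Bag of Words widget):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the king ruled the kingdom",
    "the queen ruled the kingdom",
    "trade routes crossed the sea",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # sparse matrix: documents x vocabulary
print(vectorizer.get_feature_names_out())   # one dimension per term
print(X.toarray())                          # mostly zeros, raw counts
```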
TF–IDF
TF–IDF reweights BoW counts to emphasize informative words:
\[\text{TF–IDF}(w,d) = \text{TF}(w,d) \times \log\left(\frac{N}{n_w}\right)\]
Where:
- $\text{TF}(w,d)$ = frequency of word $w$ in document $d$
- $N$ = total number of documents
- $n_w$ = number of documents containing $w$
Intuition:
- Words that are frequent in a single document but rare overall get high TF–IDF.
- Very common words (appearing everywhere) get low TF–IDF.
This is usually a better starting point than raw counts for clustering documents.
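A worked example with invented numbers, assuming the natural logarithm: suppose a word occurs 4 times in a document and appears in 10 of the 1,000 documents in the corpus. Then
\[\text{TF–IDF}(w,d) = 4 \times \log\left(\frac{1000}{10}\right) = 4 \times \log(100) \approx 18.4\]
With base-10 logarithms the weight would be $4 \times 2 = 8$; the choice of base rescales all weights equally and does not change which words rank highest.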
Distance on TF–IDF
On TF–IDF vectors we typically use cosine distance:
- High cosine similarity → documents share a similar profile of weighted terms.
- Low cosine similarity → documents emphasize different vocabulary.
This is what drives the intuition that documents with similar wording end up in the same cluster.
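A minimal sketch of this route in scikit-learn (an assumption; Orange's widgets wrap comparable functionality, and scikit-learn's default IDF is a smoothed variant of the formula above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the king ruled the kingdom",
    "the queen ruled the kingdom",
    "trade routes crossed the sea",
]

tfidf = TfidfVectorizer()       # smoothed IDF by default (smooth_idf=True)
X = tfidf.fit_transform(docs)   # sparse TF-IDF matrix, rows L2-normalized

sim = cosine_similarity(X)      # documents x documents similarity matrix
dist = 1 - sim                  # cosine distance, as used for clustering
print(sim.round(2))
```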
5. Route B: Word and Document Embeddings
Embeddings are dense vectors learned from large corpora. They capture contextual meaning rather than raw counts.
Properties:
- Typically 300–768 dimensions
- Words/documents that appear in similar contexts get nearby vectors.
- Captures semantic similarity, not just exact word overlap.
A classic example relation:
\[\text{king} - \text{man} + \text{woman} \approx \text{queen}\]
For longer documents (chapters, textbooks), we can aggregate word embeddings or use document-level models. Clusters then reflect similar meanings, themes, or narrative styles, rather than just shared vocabulary.
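A sketch of document embeddings using the sentence-transformers library and the all-MiniLM-L6-v2 model (both are assumptions for illustration; the lecture's Orange workflow uses the Document Embedding widget instead):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

docs = [
    "The king ruled the kingdom wisely.",
    "The monarch governed her realm with care.",
    "Merchants sailed the trade routes to Japan.",
]

# A small general-purpose model; any sentence-embedding model would do.
model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(docs)   # dense matrix: one row per document (384 dimensions here)

def cos(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# The first two sentences share almost no words, yet their vectors should come out
# closer to each other than to the third: embeddings capture contextual meaning.
print(cos(vectors[0], vectors[1]), cos(vectors[0], vectors[2]))
```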
6. TF–IDF vs Embeddings
A quick comparison:
| Feature | TF–IDF | Embeddings |
|---|---|---|
| Representation | Sparse term-frequency | Dense learned vectors |
| Captures | Lexical overlap | Contextual / semantic similarity |
| Dimensionality | Very high (1 per word) | Moderate (hundreds) |
| Interpretability | High (each dim = a word) | Lower (dimensions are abstract) |
| Best for | Transparent clustering | Meaning-based structure |
In practice:
- Start with TF–IDF if you want clear, inspectable clusters and simple pipelines.
- Use embeddings when you care more about deeper semantic relationships.
7. Clustering Algorithms
This course uses two clustering approaches in Orange: hierarchical clustering and K-Means.
Hierarchical clustering
- Starts with each document as its own cluster.
- Iteratively merges the closest clusters according to a distance metric and linkage rule.
- Produces a dendrogram (tree).
- You can “cut” the tree at different heights to get different numbers of clusters.
Strengths:
- No need to choose the number of clusters in advance (you explore).
- Great for visualizing similarity and identifying natural groupings.
Requirements:
- A full distance matrix (e.g. cosine distances between all document pairs).
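Outside Orange, the same procedure can be sketched with SciPy: build a cosine distance matrix, merge with a linkage rule (average linkage is just one reasonable choice), and cut the tree. The random matrix below stands in for real TF–IDF vectors:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# X: documents x features matrix (e.g. TF-IDF rows); random placeholders here.
rng = np.random.default_rng(0)
X = rng.random((8, 20))

distances = pdist(X, metric="cosine")        # condensed pairwise cosine distances
tree = linkage(distances, method="average")  # iteratively merge the closest clusters

labels = fcluster(tree, t=3, criterion="maxclust")  # "cut" the tree into 3 clusters
print(labels)
# dendrogram(tree) would draw the tree, as Orange's Hierarchical Clustering widget does.
```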
K-Means clustering
K-Means partitions the data into $k$ clusters by minimizing within-cluster variance:
\[\min_{\{C_i\}} \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2\]
Where $C_i$ is cluster $i$ and $\mu_i$ is its centroid.
Algorithm (informally):
- Choose $k$ (the number of clusters).
- Initialize $k$ centroids (often randomly).
- Assign each document to the nearest centroid.
- Recompute centroids as the mean of their assigned points.
- Repeat until assignments stabilize.
Strengths:
- Fast and scalable.
- Works directly on document vectors (no distance matrix needed).
Limitations:
- You must choose $k$ ahead of time.
- Solutions can depend on initialization.
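A minimal scikit-learn sketch (an assumption; Orange's k-Means widget performs the same steps), using a random matrix in place of real document vectors:

```python
import numpy as np
from sklearn.cluster import KMeans

# X: documents x features matrix (TF-IDF or embeddings); random placeholder here.
rng = np.random.default_rng(0)
X = rng.random((12, 20))

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)  # fix k and the random seed
labels = kmeans.fit_predict(X)  # assign -> recompute centroids -> repeat until stable

print(labels)                         # cluster index for each document
print(kmeans.cluster_centers_.shape)  # k centroids, one per cluster
```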
8. Pipelines in Orange
Below are the main pipelines from the lecture.
TF–IDF + Hierarchical Clustering
- Preprocess text
- Bag-of-Words
- TF–IDF
- Distance (cosine)
- Hierarchical Clustering
TF–IDF + K-Means
- Preprocess
- Bag-of-Words → TF–IDF
- K-Means
- Inspect clusters
Embeddings + Clustering
- Preprocess
- Document Embedding
- Distance → Hierarchical Clustering
or
- Preprocess
- Document Embedding
- K-Means
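For reference, a rough plain-Python approximation of the first pipeline (TF–IDF + cosine distance + hierarchical clustering); the widget names above are Orange's, while the libraries and toy documents here are assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

docs = [
    "the king ruled the kingdom",
    "the queen ruled the kingdom",
    "trade routes crossed the sea",
]

# Preprocess + Bag-of-Words + TF-IDF (lowercasing, tokenization, stopword removal)
X = TfidfVectorizer(lowercase=True, stop_words="english").fit_transform(docs)

# Distance (cosine) -> Hierarchical Clustering -> cut the tree into 2 clusters
tree = linkage(pdist(X.toarray(), metric="cosine"), method="average")
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)
```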
9. Interpreting Clusters
Clustering is exploratory. It does not prove categories; it suggests patterns.
Good practice:
- Inspect top terms or features for each cluster.
- Read several documents from each cluster.
- Look for historical, thematic, or stylistic consistency.
- Identify “junk” clusters created by noise or boilerplate.
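One concrete way to inspect top terms per cluster, sketched for the TF–IDF + K-Means route (the toy sentences and scikit-learn calls are assumptions):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the king ruled the kingdom",
    "the queen ruled the kingdom",
    "merchants sailed the trade routes",
    "trade ships crossed the sea routes",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)
terms = tfidf.get_feature_names_out()

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# For each cluster, rank the vocabulary by the centroid's TF-IDF weight.
for i, centroid in enumerate(kmeans.cluster_centers_):
    top = np.argsort(centroid)[::-1][:5]
    print(f"cluster {i}:", [terms[j] for j in top])
```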
10. Key Takeaways
- Clustering groups documents based on distances between vectors.
- Cosine distance is the standard for text.
- TF–IDF captures vocabulary patterns; embeddings capture meaning.
- Hierarchical clustering visualizes structure; K-Means assigns groups.
- Interpretation requires domain knowledge and close reading.
Use this alongside the slides and the Orange workflows: the slides show how; this explainer clarifies why.