Clustering & Vectorization

Digital Humanities: Text-as-Data (BA3 – Korean Studies, Leiden University)

This document expands on the Week 5 slides. It explains how text is converted into numerical vectors, how distance metrics define similarity, and how clustering algorithms use these distances to reveal structure in an unlabeled corpus. It is meant to sit alongside your topic modeling explainer.


1. What Clustering Does

Clustering is an unsupervised method: the algorithm is not told what categories exist. Instead, it discovers structure based on similarity between items.

In text-as-data, clustering answers:

Which documents “speak” in similar ways, based on their numerical representations?

Two core principles:

  • Documents within the same cluster should be as similar as possible.
  • Documents in different clusters should be as dissimilar as possible.

These ideas only make sense once documents have been transformed into vectors.


2. Why Distance Matters

Clustering is fundamentally geometric. Distance defines the geometry.

Euclidean distance (L2)

Straight-line distance between vectors:

\[d(A, B) = \sqrt{\sum_i (A_i - B_i)^2}\]

Euclidean distance is sensitive to vector magnitude. It works well when absolute length (e.g. document size, overall frequency) is meaningful.

Cosine similarity / cosine distance

Cosine measures angle, not magnitude:

\[\text{cosine}(A,B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}\]

Cosine distance is:

\[\text{cosine\_distance}(A,B) = 1 - \text{cosine}(A,B)\]

Interpretation:

  • A cosine distance near 0 (similarity near 1) means the vectors point in almost the same direction: the documents use words in similar proportions.
  • A cosine distance near 1 (similarity near 0) means the vectors are nearly orthogonal: the documents share few or no informative terms.

For text, cosine is usually preferred because document length varies a lot and we care more about relative term proportions than raw counts.
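To make the contrast concrete, here is a minimal NumPy sketch (the toy count vectors and vocabulary are invented): a short document and a much longer one with the same word proportions are far apart in Euclidean terms but have cosine distance close to zero.

```python
import numpy as np

# Toy term-count vectors over the vocabulary ["king", "queen", "court", "trade"].
# doc_b is essentially a longer version of doc_a (same proportions, larger counts).
doc_a = np.array([2.0, 1.0, 1.0, 0.0])
doc_b = np.array([20.0, 10.0, 10.0, 0.0])

def euclidean(a, b):
    """Straight-line (L2) distance: sensitive to vector magnitude."""
    return np.sqrt(np.sum((a - b) ** 2))

def cosine_distance(a, b):
    """1 - cosine similarity: sensitive to direction only, not to length."""
    cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - cos_sim

print(euclidean(doc_a, doc_b))        # large: doc_b has much bigger counts
print(cosine_distance(doc_a, doc_b))  # ~0.0: identical word proportions
```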


3. Why Text Must Become Numbers

Computers cannot operate on raw text. Every text-as-data method relies on converting documents into numeric representations in a common space.

Two main families:

  1. Lexical (sparse vectors)
    • Bag-of-Words (BoW)
    • TF–IDF
  2. Semantic (dense vectors)
    • Word embeddings
    • Document embeddings

Your choice of representation largely determines what “similarity” means.


4. Route A: Bag-of-Words and TF–IDF

Bag-of-Words (BoW)

BoW represents each document by a vector of counts over a fixed vocabulary; word order and grammar are discarded, only frequencies remain.
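As a minimal sketch, scikit-learn's CountVectorizer builds exactly such a matrix; the toy documents below are invented, and this corresponds to the Bag-of-Words step in the Orange pipelines of Section 8.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; in practice these would be your preprocessed documents.
docs = [
    "the king ruled the court",
    "the queen ruled the court",
    "merchants traded silk and silver",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)       # sparse document-term count matrix

print(vectorizer.get_feature_names_out())  # the fixed vocabulary
print(bow.toarray())                       # one row of counts per document
```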

TF–IDF

TF–IDF reweights BoW counts to emphasize informative words:

\[\text{TF–IDF}(w,d) = \text{TF}(w,d) \times \log\left(\frac{N}{n_w}\right)\]

Where:

  • $\text{TF}(w,d)$ is the frequency of word $w$ in document $d$,
  • $N$ is the total number of documents in the corpus,
  • $n_w$ is the number of documents that contain $w$.

Intuition:

  • Words that are frequent in a document but rare across the corpus get high weights: they are distinctive of that document.
  • Words that appear in almost every document (function words, boilerplate) get weights close to zero.

This is usually a better starting point than raw counts for clustering documents.

Distance on TF–IDF

On TF–IDF vectors we typically use cosine distance, so two documents count as close when they use the same informative words in similar proportions, regardless of their length.

This is what drives the intuition that documents with similar wording end up in the same cluster.
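A sketch of this in code, reusing the toy corpus from above with scikit-learn's TfidfVectorizer and cosine_similarity; scikit-learn's TF–IDF variant adds smoothing and L2 normalization, so the exact numbers differ from the formula above, though the intuition is the same.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the king ruled the court",
    "the queen ruled the court",
    "merchants traded silk and silver",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)   # sparse TF-IDF matrix, one row per document

sim = cosine_similarity(X)      # pairwise cosine similarities
dist = 1.0 - sim                # cosine distances used for clustering

print(dist.round(2))  # documents 0 and 1 are close; document 2 is far from both
```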


5. Route B: Word and Document Embeddings

Embeddings are dense vectors learned from large corpora. They capture contextual meaning rather than raw counts.

Properties:

  • Words that occur in similar contexts get similar vectors, so near-synonyms end up close together in the space.
  • Some semantic relations are approximately captured by vector arithmetic.

A classic example relation:

\[\text{king} - \text{man} + \text{woman} \approx \text{queen}\]

For longer documents (chapters, textbooks), we can aggregate word embeddings or use document-level models. Clusters then reflect similar meanings, themes, or narrative styles, rather than just shared vocabulary.
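A hedged sketch of both ideas, assuming gensim and its downloadable pretrained GloVe model "glove-wiki-gigaword-50" are available; the example sentence is invented, and mean-pooling is only one simple way to build a document vector.

```python
import numpy as np
import gensim.downloader as api

# Assumes the small pretrained GloVe model available via gensim's downloader.
wv = api.load("glove-wiki-gigaword-50")   # word -> 50-dimensional dense vector

# The classic analogy: king - man + woman lands near queen.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

def doc_vector(tokens, model):
    """Mean-pool the word vectors of known tokens into one document vector."""
    vectors = [model[t] for t in tokens if t in model]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

doc = "the queen held court in the palace".split()
print(doc_vector(doc, wv).shape)  # (50,)
```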


6. TF–IDF vs Embeddings

A quick comparison:

Feature            TF–IDF                        Embeddings
Representation     Sparse term-frequency         Dense learned vectors
Captures           Lexical overlap               Contextual / semantic similarity
Dimensionality     Very high (1 per word)        Moderate (hundreds)
Interpretability   High (each dim = a word)      Lower (dimensions are abstract)
Best for           Transparent clustering        Meaning-based structure

In practice:

  • Start with TF–IDF when you want clusters you can explain through specific, visible words.
  • Move to embeddings when documents should group by meaning even if they share little vocabulary (paraphrase, synonyms, different registers).


7. Clustering Algorithms

This course uses two clustering approaches in Orange: hierarchical clustering and K-Means.

Hierarchical clustering

Strengths:

  • No need to fix the number of clusters in advance: the dendrogram shows groupings at every level of granularity.
  • The merge structure itself is interpretable (which documents join first, and at what distance).

Requirements:

  • A pairwise distance matrix (in Orange, a Distances widget, typically with cosine distance).
  • A linkage criterion (e.g. single, complete, average, or Ward) defining the distance between clusters.
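Outside Orange, the same distance-then-cluster step can be sketched with SciPy; the toy corpus, linkage method, and cut threshold below are illustrative choices, not recommendations.

```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the king ruled the court",
    "the queen ruled the court",
    "merchants traded silk and silver",
]

# TF-IDF matrix with one row per document (dense, because pdist needs a dense array).
X = TfidfVectorizer().fit_transform(docs).toarray()

distances = pdist(X, metric="cosine")              # condensed pairwise distance matrix
Z = linkage(distances, method="average")           # agglomerative merge tree
labels = fcluster(Z, t=0.8, criterion="distance")  # cut the tree at distance 0.8

print(labels)  # cluster id per document; scipy's dendrogram(Z) draws the tree (needs matplotlib)
```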

K-Means clustering

K-Means partitions the data into $k$ clusters by minimizing within-cluster variance:

\[\min_{\{C_i\}} \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2\]

Where $C_i$ is cluster $i$ and $\mu_i$ is its centroid.

Algorithm (informally):

  1. Choose $k$ (the number of clusters).
  2. Initialize $k$ centroids (often randomly).
  3. Assign each document to the nearest centroid.
  4. Recompute centroids as the mean of their assigned points.
  5. Repeat until assignments stabilize.
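These steps can be written out directly in NumPy. The toy implementation below (random initialization from the data points, no handling of empty clusters) is meant for intuition, not for real use.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-Means: X is an (n_docs, n_features) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 2: initialize centroids by picking k random documents.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign each document to the nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned documents.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop when centroids (equivalently, assignments) no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

labels, centroids = kmeans(np.random.rand(12, 4), k=3)
print(labels)  # cluster id per row
```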

Strengths:

  • Fast and scalable, even for large corpora.
  • Produces a flat, easy-to-report partition into exactly $k$ groups.

Limitations:

  • $k$ must be chosen in advance.
  • Results depend on the (often random) initialization of the centroids.
  • It assumes roughly spherical, similarly sized clusters and is built around Euclidean distance, which is not always ideal for sparse TF–IDF vectors.


8. Pipelines in Orange

Below are the main pipelines from the lecture.

TF–IDF + Hierarchical Clustering

  1. Preprocess text
  2. Bag-of-Words
  3. TF–IDF
  4. Distance (cosine)
  5. Hierarchical Clustering

TF–IDF + K-Means

  1. Preprocess
  2. Bag-of-Words → TF–IDF
  3. K-Means
  4. Inspect clusters

Embeddings + Clustering

  1. Preprocess
  2. Document Embedding
  3. Distance → Hierarchical Clustering

or

  1. Preprocess
  2. Document Embedding
  3. K-Means
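For reference, the TF–IDF + K-Means pipeline has a rough scikit-learn analogue outside Orange; the corpus and the choice of $k$ below are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Placeholder corpus: in practice these would be your preprocessed documents.
docs = [
    "the king ruled the court",
    "the queen ruled the court",
    "merchants traded silk and silver",
    "merchants traded spices and silver",
]

X = TfidfVectorizer().fit_transform(docs)             # Preprocess / BoW / TF-IDF
km = KMeans(n_clusters=2, n_init=10, random_state=0)  # K-Means with a chosen k
labels = km.fit_predict(X)                            # cluster id per document

for doc, label in zip(docs, labels):                  # Inspect clusters
    print(label, doc)
```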

9. Interpreting Clusters

Clustering is exploratory. It does not prove categories; it suggests patterns.

Good practice:

  • Read several documents from each cluster before giving it a name.
  • Inspect the most characteristic words per cluster (or the documents closest to each centroid).
  • Vary the number of clusters, the representation (TF–IDF vs embeddings), and the distance metric; patterns that survive these changes are more trustworthy.
  • Treat cluster labels as hypotheses to be checked through close reading, not as findings in themselves.


10. Key Takeaways

  • Clustering is unsupervised: it groups documents by similarity without predefined categories.
  • What "similar" means is determined by the representation (BoW, TF–IDF, embeddings) and the distance metric (usually cosine for text).
  • TF–IDF clusters on shared wording and is easy to interpret; embeddings cluster on meaning but are harder to inspect.
  • Hierarchical clustering reveals nested structure without fixing the number of clusters; K-Means gives a flat partition but requires choosing $k$.
  • Clustering is exploratory: results are starting points for interpretation, not proof of categories.

Use this alongside the slides and the Orange workflows: slides show how, this explainer clarifies why.