Word Embeddings

Numerical representations of words (and documents) that encode semantic relationships as geometric ones — the foundation of most modern NLP and a powerful tool for tracing concepts at scale.


What it is

Word embeddings represent each word (or document) as a vector in a high-dimensional space such that semantically related items land close to each other. Two broad families, both sketched in code below:

  1. Static embeddings: one vector per word, learned from co-occurrence patterns in a training corpus. Word2Vec (skip-gram, CBOW), GloVe, and fastText are the classics. Fast to train and easy to inspect, but blind to context: “bank” gets one vector whether it means a riverbank or a financial institution.
  2. Contextual embeddings: a fresh vector per word in context, produced by a pre-trained transformer (BERT, RoBERTa, sentence-transformers). Much more accurate for downstream tasks, harder to interpret directly, and computationally heavier.
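
A minimal sketch of the two families, assuming gensim 4.x and the sentence-transformers package are installed; the toy corpus and the model name all-MiniLM-L6-v2 are illustrative choices, not course requirements:

    from gensim.models import Word2Vec
    from sentence_transformers import SentenceTransformer

    # Static: one vector per word, learned from co-occurrence in a (toy) corpus.
    corpus = [
        ["the", "bank", "approved", "the", "loan"],
        ["we", "walked", "along", "the", "river", "bank"],
    ]
    w2v = Word2Vec(corpus, vector_size=50, window=3, min_count=1, sg=1)  # sg=1: skip-gram
    print(w2v.wv["bank"][:5])   # the single vector "bank" gets, whatever the context

    # Contextual: vectors computed per input by a pre-trained transformer.
    st = SentenceTransformer("all-MiniLM-L6-v2")
    vecs = st.encode(["The bank approved the loan.",
                      "We walked along the river bank."])
    print(vecs.shape)           # (2, 384): the two "bank" sentences get different vectors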

Embeddings aren’t an analysis on their own. They’re a representation you feed into something else: a similarity search, a classifier, a clustering algorithm, a time-trajectory measurement.


What you learn in the DH course

This page draws from the course’s word-embedding material. Students who take the course come away with:

  • Vector-space semantics and the distributional hypothesis (“a word is known by the company it keeps”)
  • Training a Word2Vec / GloVe / fastText model on your own corpus vs. using a pre-trained model
  • Contextual embeddings: BERT, multilingual BERT, sentence-transformers, and when each is worth the compute
  • Similarity operations: cosine distance, nearest neighbours, analogy tasks (sketched in code after this list)
  • Aligning embedding spaces across time (to measure semantic change) or across languages
  • Using embeddings as input features for classification, clustering, or topic analysis
  • Reporting embedding-based methods: pinning model versions, documenting training corpus, acknowledging limits
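
A minimal sketch of those similarity operations, using gensim’s downloader and the small pre-trained glove-wiki-gigaword-50 vectors (an illustrative choice, fetched on first use, not the course’s required model):

    import gensim.downloader as api

    # Small pre-trained GloVe vectors (downloaded the first time this runs).
    glove = api.load("glove-wiki-gigaword-50")

    # Cosine similarity between two words.
    print(glove.similarity("river", "bank"))

    # Nearest neighbours in the vector space.
    print(glove.most_similar("parliament", topn=5))

    # Analogy via vector arithmetic: king - man + woman ≈ ?
    print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))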

What you need to learn first

  • Preprocessing: embeddings learn from the vocabulary you feed them; decisions here propagate into the geometry. See Preprocessing.
  • Linear algebra basics: cosine similarity, vector arithmetic, dimensionality reduction. You don’t need to derive it, but you need a mental model (a small numpy sketch follows this list).
  • Python: essentially all embedding tooling is Python-first (gensim, transformers, sentence-transformers). R bindings exist but lag.
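
For the mental model: cosine similarity is just a measure of the angle between two vectors. A minimal numpy sketch with made-up vectors:

    import numpy as np

    def cosine_similarity(a, b):
        # dot product divided by the product of the vector lengths
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    a = np.array([0.2, 0.9, 0.1])   # toy vectors, not real embeddings
    b = np.array([0.1, 0.8, 0.3])
    print(cosine_similarity(a, b))  # near 1.0 = similar direction = similar meaning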

What you can do with it

  • Measure how the meaning of a political keyword shifts across decades (diachronic embeddings; see the alignment sketch after this list)
  • Surface near-synonyms and related terms you’d otherwise miss in keyword searches
  • Cluster documents by semantic similarity, even when they share no keywords
  • Build a retrieval system for a large corpus (semantic search instead of exact keyword matching; see the search sketch after this list)
  • Feed sentence- or document-level embeddings into a sentiment or classification model
  • Cross-language alignment: find the Korean equivalent of an English concept by projecting embeddings into a shared space
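
One common recipe for the diachronic case in the first bullet is to train a static model per period, rotate one space onto the other with an orthogonal Procrustes alignment, and then compare a keyword’s vectors across the aligned spaces. A toy-sized sketch under those assumptions; the two corpora, the hyperparameters, and the keyword are all illustrative (and corpora this small learn nothing real):

    import numpy as np
    from gensim.models import Word2Vec
    from scipy.linalg import orthogonal_procrustes

    # Toy stand-ins for two period-specific corpora (real work needs far more text).
    corpus_a = [["the", "broadcast", "reached", "every", "radio"],
                ["the", "radio", "played", "music"]]
    corpus_b = [["the", "stream", "reached", "every", "phone"],
                ["the", "phone", "streamed", "music"]]
    model_a = Word2Vec(corpus_a, vector_size=50, min_count=1, seed=1)
    model_b = Word2Vec(corpus_b, vector_size=50, min_count=1, seed=1)

    # Stack vectors for the shared vocabulary, then find the rotation that best
    # maps space A onto space B (rows are often length-normalised first).
    shared = [w for w in model_a.wv.index_to_key if w in model_b.wv.key_to_index]
    A = np.stack([model_a.wv[w] for w in shared])
    B = np.stack([model_b.wv[w] for w in shared])
    R, _ = orthogonal_procrustes(A, B)

    # After alignment, low cosine similarity for the same word hints at semantic change.
    w = "music"  # illustrative keyword; must occur in both corpora
    a, b = model_a.wv[w] @ R, model_b.wv[w]
    print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))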
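
For the retrieval use, a minimal semantic-search sketch with sentence-transformers; the three-document corpus, the query, and the model name are placeholders:

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")  # one reasonable default, not the only choice

    corpus = ["A letter about grain prices in 1848.",
              "Minutes of a parliamentary debate on suffrage.",
              "A diary entry describing a harvest festival."]
    corpus_emb = model.encode(corpus, convert_to_tensor=True)

    # Rank documents by cosine similarity to the query, no exact keyword overlap needed.
    query_emb = model.encode("voting rights", convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]
    for hit in hits:
        print(round(hit["score"], 2), corpus[hit["corpus_id"]])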

Related methods

  • Preprocessing — sets the vocabulary the embedding sees.
  • Topic Analysis — embedding-based topic methods (BERTopic, Top2Vec) are built directly on contextual embeddings.
  • Sentiment Analysis — modern sentiment classifiers use contextual embeddings as their feature layer.