Understanding Text Analysis: From Words to Meaning
A Guide to Descriptive Text Statistics
by Steven Denney
Introduction: Why Count Words?
When we analyze text computationally, we start with the most basic question: What words appear, and how often?
But not all words are equally important. Some appear everywhere. Some appear rarely but meaningfully. Some distinguish one type of document from another.
This guide explains the key metrics we use to move from raw word counts to meaningful patterns.
1. Word Frequency (Term Frequency / TF)
What It Is:
The simplest measure: How many times does a word appear?
Example:
In a Korean history textbook:
- 역사 (history) appears 150 times
- 기원전 (BCE) appears 54 times
- 문화 (culture) appears 32 times
What It Tells Us:
- Which topics are discussed most
- The basic vocabulary of the corpus
- Surface-level content overview
Limitations:
High frequency ≠ high importance
Common words like “하다” (to do), “것” (thing), “있다” (to exist) appear frequently but carry little meaning. This is why we remove stopwords.
In Orange:
- Word Cloud - visual display of frequency (bigger = more frequent)
- Data Table after Bag of Words - raw counts
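Outside Orange, the same count can be sketched in a few lines of Python. This is a minimal illustration, not what Orange does internally: the token list is made up, and it assumes the text has already been split into words by a Korean morphological analyzer.

```python
from collections import Counter

# Hypothetical pre-tokenized text; real tokens would come from a
# Korean morphological analyzer, not from splitting on spaces.
tokens = ["역사", "문화", "역사", "기원전", "역사", "문화"]

# Term frequency: how many times each word appears.
term_freq = Counter(tokens)

# List words from most to least frequent.
for word, count in term_freq.most_common():
    print(word, count)
```

A word cloud is a picture of exactly this table: the larger the count, the larger the word.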
2. Bag of Words (BoW)
What It Is:
A way of representing text as numbers that computers can analyze.
Each document becomes a vector (row) where each word is a feature (column) with a count.
Example:
| Document | 역사 | 기원전 | 문화 | 독립 |
|---|---|---|---|---|
| Doc 1 | 5 | 8 | 2 | 0 |
| Doc 2 | 3 | 0 | 4 | 7 |
| Doc 3 | 6 | 1 | 3 | 2 |
Document 1 contains “역사” 5 times, “기원전” 8 times, etc.
What It Tells Us:
- Transforms text into analyzable data
- Enables comparison between documents
- Foundation for most text analysis methods
Key Assumption:
“The meaning of a document can be captured by which words it contains, ignoring word order”
This is called the “bag” metaphor - like dumping all words into a bag and counting them.
Limitations:
- Loses context: “not good” and “good” both contain “good” but have opposite meanings
- Loses grammar: word order is discarded, so sentence structure cannot be recovered
- High dimensionality: thousands of columns (one per unique word)
In Orange:
- Bag of Words widget - creates this representation
- Input: preprocessed text
- Output: document-term matrix
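To see what a document-term matrix looks like in code, here is a small sketch in plain Python. The two documents and their tokens are invented for illustration, and the variable names (`docs`, `vocab`, `matrix`) are our own, not part of Orange.

```python
from collections import Counter

# Hypothetical tokenized documents (already preprocessed).
docs = {
    "Doc 1": ["역사", "기원전", "역사", "문화"],
    "Doc 2": ["독립", "문화", "독립", "역사"],
}

# The vocabulary is every unique word in the corpus, in a fixed order.
vocab = sorted({word for tokens in docs.values() for word in tokens})

# Each document becomes one row of counts, one column per vocabulary word.
matrix = {}
for name, tokens in docs.items():
    counts = Counter(tokens)
    matrix[name] = [counts[word] for word in vocab]

print(vocab)
for name, row in matrix.items():
    print(name, row)
```

Notice that shuffling the tokens inside a document would produce the same row. That is the "bag" assumption at work.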
3. Document Frequency (DF)
What It Is:
How many documents contain a word? (Not how many times the word appears in total.)
Example:
- 역사 appears in 45 out of 50 documents (DF = 45)
- 기원전 appears in 12 out of 50 documents (DF = 12)
- 독립운동 appears in 3 out of 50 documents (DF = 3)
What It Tells Us:
- How widespread vs. specialized a word is
- Common vocabulary vs. rare/technical terms
Two Extremes:
High DF (appears in most documents):
- Very common words
- General vocabulary
- Less distinctive
Low DF (appears in few documents):
- Specialized terms
- Topic-specific vocabulary
- More distinctive
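The difference between counting occurrences and counting documents is easy to miss, so here is a short sketch. The three-document corpus is hypothetical; the key step is turning each document's tokens into a set so that a word is counted at most once per document.

```python
from collections import Counter

# Hypothetical tokenized corpus: one list of tokens per document.
corpus = [
    ["역사", "역사", "기원전"],
    ["역사", "문화"],
    ["역사", "독립운동", "독립운동"],
]

# Count each word once per document by converting tokens to a set first.
doc_freq = Counter()
for tokens in corpus:
    doc_freq.update(set(tokens))

print(doc_freq["역사"])      # number of documents containing 역사
print(doc_freq["독립운동"])  # counted once even though it appears twice
```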
4. Inverse Document Frequency (IDF)
What It Is:
A weight that increases a word’s importance if it’s rare across documents.
Formula (simplified):
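IDF = log10(Total Documents / Documents Containing Word)
(The examples in this guide use a base-10 logarithm. Other bases change the scale of the scores but not the ranking of words.)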
Intuition:
- Words in ALL documents → low IDF (not distinctive)
- Words in FEW documents → high IDF (distinctive!)
Example:
Corpus of 50 Korean history textbooks:
| Word | Document Frequency | IDF Score |
|---|---|---|
| 역사 | 46/50 documents | 0.04 (LOW) |
| 기원전 | 35/50 documents | 0.15 |
| 삼국시대 | 15/50 documents | 0.52 |
| 갑오개혁 | 5/50 documents | 1.00 (HIGH) |
Interpretation:
- 역사 appears in almost every textbook → not distinctive → low IDF
- 갑오개혁 appears in only a few → distinctive → high IDF
Why This Matters:
IDF helps us find distinctive words, not just frequent ones.
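The simplified formula can be checked with a few lines of Python. This sketch assumes a base-10 logarithm, which reproduces the scores shown in the table for 기원전, 삼국시대, and 갑오개혁. The helper name `idf` and the extra row for a word that appears in every document are our own additions for illustration.

```python
import math

def idf(total_docs: int, docs_with_word: int) -> float:
    """Simplified IDF: log10(total documents / documents containing the word)."""
    return math.log10(total_docs / docs_with_word)

total_docs = 50

# Document frequencies from the table, plus a hypothetical word
# that appears in all 50 documents.
doc_freq = {
    "기원전": 35,
    "삼국시대": 15,
    "갑오개혁": 5,
    "(word in every document)": 50,
}

for word, df in doc_freq.items():
    print(word, round(idf(total_docs, df), 2))
```

A word found in every document gets an IDF of exactly 0, which is why very common words drop out of TF-IDF rankings. Software packages often use smoothed variants of this formula, so their scores may not match these numbers exactly.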
5. TF-IDF (Term Frequency - Inverse Document Frequency)
What It Is:
The central metric in this guide: it combines frequency WITH distinctiveness.
Formula:
TF-IDF = (How often in this document) × (How rare across all documents)
How It Works:
High TF-IDF means:
- Word appears OFTEN in this document (high TF)
- Word is RARE across the corpus (high IDF)
- = Distinctive and important for THIS document
Low TF-IDF means:
- Either: appears rarely in this document
- Or: appears in most documents (not distinctive)
- = Not particularly important
Example:
Document about 갑오개혁 (Gabo Reform):
| Word | Frequency in Doc | IDF | TF-IDF |
|---|---|---|---|
| 역사 | 8 times | 0.04 | 0.32 (LOW) |
| 개혁 | 12 times | 0.45 | 5.40 |
| 갑오개혁 | 15 times | 1.00 | 15.00 (HIGH!) |
Interpretation:
- 역사 appears often but isn’t distinctive for THIS document
- 갑오개혁 appears often AND is rare overall = highly distinctive for this document
What TF-IDF Tells Us:
“What words are most CHARACTERISTIC of each document?”
Not just “what appears most” but “what makes this document unique”
In Orange:
- Extract Keywords widget - uses TF-IDF
- Bag of Words widget - has TF-IDF option
- Shows which words best represent each document
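The arithmetic behind the example table can be reproduced directly. The sketch below simply multiplies the TF and IDF values given in the table and sorts the results. It is not how Orange computes its scores, which may use a different weighting or normalization.

```python
# (term frequency in this document, IDF) copied from the example table.
table = {
    "역사": (8, 0.04),
    "개혁": (12, 0.45),
    "갑오개혁": (15, 1.00),
}

# TF-IDF is the product of the two numbers.
tf_idf = {word: tf * idf for word, (tf, idf) in table.items()}

# Rank words from most to least characteristic of the document.
ranked = sorted(tf_idf, key=tf_idf.get, reverse=True)
for word in ranked:
    print(word, round(tf_idf[word], 2))
```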
Putting It All Together: A Comparison
Scenario: Analyzing 3 Korean history textbook chapters
Chapter 1: Ancient Korea (삼국시대)
- Most Frequent: 역사, 시대, 국가, 왕
- Highest TF-IDF: 삼국시대, 고구려, 백제, 신라
Chapter 2: Japanese Occupation (일제강점기)
- Most Frequent: 역사, 시대, 국가, 일본
- Highest TF-IDF: 독립운동, 일제, 저항, 만세
Chapter 3: Modern Korea (현대)
- Most Frequent: 역사, 시대, 국가, 발전
- Highest TF-IDF: 민주화, 경제, 산업화, 개발
Notice:
- Frequency alone - “역사” tops all chapters (not helpful for distinguishing!)
- TF-IDF - shows what makes EACH chapter unique
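The contrast between the two rankings can be sketched starting from tokens. The chapter contents below are hypothetical and heavily shortened, and the IDF uses the simplified base-10 formula from this guide. Real chapters would contain thousands of tokens.

```python
import math
from collections import Counter

# Hypothetical, heavily shortened token lists for three chapters.
chapters = {
    "Ancient Korea": ["역사", "역사", "역사", "고구려", "고구려", "신라"],
    "Japanese Occupation": ["역사", "역사", "역사", "독립운동", "독립운동", "일제"],
    "Modern Korea": ["역사", "역사", "역사", "민주화", "민주화", "경제"],
}

# Document frequency: in how many chapters does each word appear?
doc_freq = Counter()
for tokens in chapters.values():
    doc_freq.update(set(tokens))

n_docs = len(chapters)
top_by_count = {}
top_by_tfidf = {}
for name, tokens in chapters.items():
    tf = Counter(tokens)
    # Simplified TF-IDF: raw count times log10(N / DF).
    scores = {w: c * math.log10(n_docs / doc_freq[w]) for w, c in tf.items()}
    top_by_count[name] = tf.most_common(1)[0][0]
    top_by_tfidf[name] = max(scores, key=scores.get)
    print(name, "| most frequent:", top_by_count[name],
          "| top TF-IDF:", top_by_tfidf[name])
```

Because 역사 appears in all three chapters, its IDF is 0 and it can never rank first by TF-IDF, no matter how often it is used.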
Common Student Questions
Q: “Why remove stopwords? They’re frequent for a reason!”
A: Stopwords (like 하다, 있다, 것) appear frequently but don’t help us understand CONTENT. They’re grammatical scaffolding, not meaning.
Think of it like this: if you analyze English news, “the” appears everywhere. It doesn’t tell us whether an article is about sports or politics.
Q: “Why is TF-IDF better than just counting?”
A: Raw counts reward common words. TF-IDF rewards distinctive words.
Example: 역사 (history) appears 100 times across 50 history textbooks. Not surprising, not distinctive. 갑오개혁 appears 50 times but only in 3 textbooks about that specific reform. VERY distinctive!
Q: “Can’t important words be common?”
A: Yes! That’s why we look at BOTH metrics:
- Word clouds show what’s discussed most (frequency)
- TF-IDF shows what’s distinctive (characteristic vocabulary)
Use both to tell a complete story.
Q: “Why does preprocessing matter?”
A: Preprocessing determines WHAT gets counted:
- Keep all words → noise dominates
- Remove stopwords → content words emerge
- Keep only nouns → different story than nouns+verbs
Preprocessing = analytical choices that affect results!
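A small sketch shows how one preprocessing choice changes the result. The tokens are made up, and the stopword list contains only the three examples used in this guide. A real stopword list would be much longer.

```python
from collections import Counter

# Hypothetical tokens after morphological analysis.
tokens = ["것", "역사", "하다", "것", "있다", "하다", "역사", "것", "문화"]

# Stopword list taken from the examples in this guide; real lists are longer.
stopwords = {"하다", "것", "있다"}

raw_counts = Counter(tokens)
content_counts = Counter(t for t in tokens if t not in stopwords)

print(raw_counts.most_common(1))      # a stopword tops the raw counts
print(content_counts.most_common(1))  # a content word tops the filtered counts
```

The same text produces a different "most frequent word" depending on what you decide to filter out, which is why preprocessing choices belong in your write-up.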
From Descriptive to Interpretive: Making Claims
Descriptive Claims (What you see):
- “기원전 appears 54 times”
- “민족 has the highest TF-IDF score”
- “This cluster contains 15 documents”
Interpretive Claims (What it might mean):
- “The high frequency of 기원전 suggests emphasis on ancient history”
- “The prominence of 민족 indicates nationalist discourse”
- “Documents cluster by historical period”
Critical Claims (Limitations):
- “However, frequency doesn’t capture how these terms are used in context”
- “TF-IDF may overweight rare technical terms”
- “These patterns could also be explained by…”
Visual Summary: Which Metric When?
┌─────────────────────────────────────────────────────┐
│ RESEARCH QUESTION         → USE THIS METRIC         │
├─────────────────────────────────────────────────────┤
│ What's discussed most?    → Word Frequency          │
│ What topics appear?       → Word Cloud              │
│ What makes docs similar?  → Bag of Words            │
│ What's distinctive?       → TF-IDF                  │
│ What defines clusters?    → TF-IDF by cluster       │
│ How do groups differ?     → Keyword contrast        │
└─────────────────────────────────────────────────────┘
Workflow in Orange: The Pipeline
1. Raw Text
↓
2. Preprocessing
- Korean morphological analysis
- POS tagging
- Stopword removal
↓
3. Bag of Words
- Convert to numbers
- Count words per document
↓
4. Analysis Options:
→ Word Cloud (frequency)
→ Extract Keywords (TF-IDF)
→ Hierarchical Clustering (find patterns)
→ Compare groups (keyword contrast)
Practice Exercise
Given this information about 3 documents:
| Document | 역사 (TF) | 역사 (IDF) | 독립 (TF) | 독립 (IDF) |
|---|---|---|---|---|
| Doc A | 10 | 0.1 | 0 | 0.8 |
| Doc B | 8 | 0.1 | 15 | 0.8 |
| Doc C | 12 | 0.1 | 2 | 0.8 |
Questions:
- Which document is most about 독립 (independence)?
- Which word is more distinctive: 역사 or 독립?
- What would be the TF-IDF for 독립 in Doc B?
Answers:
- Doc B (TF = 15, highest frequency)
- 독립 (IDF = 0.8 vs. 0.1) - appears in fewer documents
- TF-IDF = 15 × 0.8 = 12.0 (high)
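If you want to check the answers yourself, the exercise table can be worked through in Python. This sketch copies the TF and IDF values from the table. The variable names are our own.

```python
# Values copied from the exercise table: TF per document, one IDF per word.
tf = {
    "Doc A": {"역사": 10, "독립": 0},
    "Doc B": {"역사": 8, "독립": 15},
    "Doc C": {"역사": 12, "독립": 2},
}
idf = {"역사": 0.1, "독립": 0.8}

# TF-IDF for every (document, word) pair.
tf_idf = {
    doc: {word: count * idf[word] for word, count in counts.items()}
    for doc, counts in tf.items()
}

# Question 1: the document with the highest score for 독립.
most_about_independence = max(tf_idf, key=lambda doc: tf_idf[doc]["독립"])
print(most_about_independence)

# Question 3: the TF-IDF for 독립 in Doc B.
print(round(tf_idf["Doc B"]["독립"], 2))
```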
Key Takeaways
Remember:
- Frequency = What appears most
- Document Frequency = How widespread a word is
- IDF = How distinctive/rare a word is
- TF-IDF = Frequency × Distinctiveness = What’s characteristic
For your analysis:
- Use word frequency to understand general content
- Use TF-IDF to find distinctive vocabulary
- Use both to tell a complete story
- Always justify interpretations with descriptive evidence
The goal: Move from “here’s what the data shows” to “here’s what it might mean” - while staying honest about limitations!
Questions for Reflection
As you work with these metrics in your own analysis:
- What does high frequency tell you? What doesn’t it tell you?
- When might TF-IDF be misleading?
- How do your preprocessing choices affect these metrics?
- What’s the difference between describing patterns and interpreting them?
Remember: Computational text analysis is a tool, not an answer. Your job is to use these metrics thoughtfully to make evidence-based interpretations, acknowledging their limitations and supporting them with your own domain expertise. Good analysis combines data skills with area studies expertise.