Understanding Text Analysis: From Words to Meaning

A Guide to Descriptive Text Statistics

by Steven Denney

Introduction: Why Count Words?

When we analyze text computationally, we start with the most basic question: What words appear, and how often?

But not all words are equally important. Some appear everywhere. Some appear rarely but meaningfully. Some distinguish one type of document from another.

This guide explains the key metrics we use to move from raw word counts to meaningful patterns.

1. Word Frequency (Term Frequency / TF)

What It Is:

The simplest measure: How many times does a word appear?

Example:

In a Korean history textbook:

역사 (history) appears 150 times
기원전 (BCE) appears 54 times
문화 (culture) appears 32 times
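
As a minimal sketch of the same idea in Python (outside Orange), counting can be done with the standard library's `Counter`. The token list below is hypothetical and typed in by hand; real Korean text would first need morphological analysis to split it into words.

```python
from collections import Counter

# Hypothetical, already-tokenized text (an assumption for illustration;
# a real Korean corpus would first go through morphological analysis).
tokens = ["역사", "기원전", "역사", "문화", "역사", "기원전"]

# Counter maps each distinct token to the number of times it occurs.
term_frequency = Counter(tokens)

# most_common() lists words from most to least frequent.
for word, count in term_frequency.most_common():
    print(word, count)
```

This is the same table that a word cloud draws: the larger the count, the larger the word.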

What It Tells Us:

Which topics are discussed most
The basic vocabulary of the corpus
Surface-level content overview

Limitations:

High frequency ≠ high importance

Common words like “하다” (to do), “것” (thing), “있다” (to exist) appear frequently but carry little meaning. This is why we remove stopwords.

In Orange:

Word Cloud - visual display of frequency (bigger = more frequent)
Data Table after Bag of Words - raw counts

2. Bag of Words (BoW)

What It Is:

A way of representing text as numbers that computers can analyze.

Each document becomes a vector (row) where each word is a feature (column) with a count.

Example:

Document	역사	기원전	문화	독립
Doc 1	5	8	2	0
Doc 2	3	0	4	7
Doc 3	6	1	3	2

Document 1 contains “역사” 5 times, “기원전” 8 times, etc.

What It Tells Us:

Transforms text into analyzable data
Enables comparison between documents
Foundation for most text analysis methods

Key Assumption:

“The meaning of a document can be captured by which words it contains, ignoring word order”

This is called the “bag” metaphor - like dumping all words into a bag and counting them.
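
The representation described above can be sketched in a few lines of Python. The two documents are hypothetical toy data, and the names `docs`, `vocabulary`, and `matrix` are chosen only for this illustration; Orange builds the equivalent table for you.

```python
from collections import Counter

# Hypothetical toy documents, already split into tokens.
docs = {
    "Doc 1": ["역사", "기원전", "역사", "문화"],
    "Doc 2": ["독립", "역사", "독립"],
}

# The vocabulary gives one column per unique word (sorted for a stable order).
vocabulary = sorted({word for tokens in docs.values() for word in tokens})

# One row per document: the count of each vocabulary word in that document.
matrix = {}
for name, tokens in docs.items():
    counts = Counter(tokens)
    matrix[name] = [counts[word] for word in vocabulary]  # missing words count as 0

print(vocabulary)
for name, row in matrix.items():
    print(name, row)

# Word order is ignored: two sentences with the same words give the same bag.
same_bag = Counter(["dog", "bites", "man"]) == Counter(["man", "bites", "dog"])
print(same_bag)
```

The last two lines illustrate the key assumption: once the words are in the bag, the model cannot tell who bit whom.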

Limitations:

Loses context: “not good” vs “good” - both have “good” but opposite meanings
Loses grammar: word order doesn’t matter
High dimensionality: thousands of columns (one per unique word)

In Orange:

Bag of Words widget - creates this representation
Input: preprocessed text
Output: document-term matrix

3. Document Frequency (DF)

What It Is:

How many documents contain a word? (Not how many times total)

Example:

역사 appears in 45 out of 50 documents (DF = 45)
기원전 appears in 12 out of 50 documents (DF = 12)
독립운동 (independence movement) appears in 3 out of 50 documents (DF = 3)

What It Tells Us:

How widespread vs. specialized a word is
Common vocabulary vs. rare/technical terms

Two Extremes:

High DF (appears in most documents):

Very common words
General vocabulary
Less distinctive

Low DF (appears in few documents):

Specialized terms
Topic-specific vocabulary
More distinctive
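
A short Python sketch may make the difference between document frequency and raw frequency concrete. The three-document corpus is hypothetical; the key step is turning each document into a set so that a word is counted once per document.

```python
from collections import Counter

# Hypothetical toy corpus: each document is a list of tokens.
corpus = [
    ["역사", "기원전", "역사"],
    ["역사", "독립운동"],
    ["역사", "기원전", "문화"],
]

# set(tokens) removes repeats, so each word adds at most 1 per document.
document_frequency = Counter()
for tokens in corpus:
    document_frequency.update(set(tokens))

for word, df in document_frequency.most_common():
    print(f"{word}: in {df} of {len(corpus)} documents")
```

Note that 역사 occurs four times in total but has a document frequency of 3, because the repeat inside the first document is not counted again.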

4. Inverse Document Frequency (IDF)

What It Is:

A weight that increases a word’s importance if it’s rare across documents.

Formula (simplified):

IDF = log(Total Documents / Documents Containing Word)

Intuition:

Words in ALL documents → low IDF (not distinctive)
Words in FEW documents → high IDF (distinctive!)

Example:

Corpus of 50 Korean history textbooks:

Word	Document Frequency	IDF Score (base-10 log)
역사 (history)	48/50 documents	0.02 (LOW)
기원전 (BCE)	35/50 documents	0.15
삼국시대 (Three Kingdoms period)	15/50 documents	0.52
갑오개혁 (Gabo Reform)	5/50 documents	1.00 (HIGH)

Interpretation:

역사 appears in almost every textbook → not distinctive → low IDF
갑오개혁 appears in only a few → distinctive → high IDF
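
The simplified formula can be checked with a few lines of Python. This sketch assumes a base-10 logarithm; a different base changes the absolute values but not the ranking of the words. Software packages often apply smoothed variants of the formula, so their numbers may differ from a hand calculation. The function name `idf` is chosen only for this illustration.

```python
import math

def idf(total_documents, documents_containing_word):
    """Simplified IDF: log10(N / DF). The base-10 log is an assumption."""
    return math.log10(total_documents / documents_containing_word)

# Document frequencies (out of 50) taken from the example table.
document_frequency = {"역사": 48, "기원전": 35, "삼국시대": 15, "갑오개혁": 5}

for word, df in document_frequency.items():
    print(word, round(idf(50, df), 2))
```

A word that appears in every document gets `idf(50, 50)`, which is log(1) = 0: it carries no distinguishing weight at all.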

Why This Matters:

IDF helps us find distinctive words, not just frequent ones.

5. TF-IDF (Term Frequency - Inverse Document Frequency)

What It Is:

The most important metric: Combines frequency WITH distinctiveness

Formula:

TF-IDF = (How often in this document) × (How rare across all documents)

How It Works:

High TF-IDF means:

Word appears OFTEN in this document (high TF)
Word is RARE across the corpus (high IDF)
= Distinctive and important for THIS document

Low TF-IDF means:

Either: appears rarely in this document
Or: appears in most documents (not distinctive)
= Not particularly important

Example:

Document about 갑오개혁 (Gabo Reform):

Word	Frequency in Doc (TF)	IDF	TF-IDF (TF × IDF)
역사	8 times	0.02	0.16 (LOW)
개혁 (reform)	12 times	0.45	5.40
갑오개혁	15 times	1.00	15.00 (HIGH!)

Interpretation:

역사 appears often but isn’t distinctive for THIS document
갑오개혁 appears often AND is rare overall = highly distinctive for this document
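
To see the whole calculation end to end, here is a minimal Python sketch on a hypothetical three-document corpus. It assumes raw counts for TF and log10(N / DF) for IDF; the names `corpus`, `df`, and `tf_idf` are illustrative, and Orange's own weighting options may produce different numbers.

```python
import math
from collections import Counter

# Hypothetical toy corpus of three tokenized documents.
corpus = {
    "Doc A": ["역사", "역사", "역사", "갑오개혁", "갑오개혁", "개혁"],
    "Doc B": ["역사", "역사", "문화"],
    "Doc C": ["역사", "개혁", "문화"],
}
n_docs = len(corpus)

# Document frequency: in how many documents each word occurs.
df = Counter()
for tokens in corpus.values():
    df.update(set(tokens))

def tf_idf(tokens):
    """Raw count in this document times log10(N / DF), for each word."""
    tf = Counter(tokens)
    return {word: count * math.log10(n_docs / df[word]) for word, count in tf.items()}

scores = tf_idf(corpus["Doc A"])
for word, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(word, round(score, 2))
```

In Doc A, 역사 is the most frequent word, yet its score is zero because it occurs in every document. 갑오개혁 occurs fewer times but only in Doc A, so it ranks first.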

What TF-IDF Tells Us:

“What words are most CHARACTERISTIC of each document?”

Not just “what appears most” but “what makes this document unique”

In Orange:

Extract Keywords widget - can rank words by TF-IDF
Bag of Words widget - has TF-IDF option
Shows which words best represent each document

Putting It All Together: A Comparison

Scenario: Analyzing 3 Korean history textbook chapters

Chapter 1: Ancient Korea (삼국시대)

Most Frequent: 역사 (history), 시대 (era), 국가 (state), 왕 (king)
Highest TF-IDF: 삼국시대 (Three Kingdoms period), 고구려 (Goguryeo), 백제 (Baekje), 신라 (Silla)

Chapter 2: Japanese Occupation (일제강점기)

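Most Frequent: 역사 (history), 시대 (era), 국가 (state), 일본 (Japan)
Highest TF-IDF: 독립운동 (independence movement), 일제 (Imperial Japan), 저항 (resistance), 만세 (manse, "long live")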

Chapter 3: Modern Korea (현대)

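Most Frequent: 역사 (history), 시대 (era), 국가 (state), 발전 (development)
Highest TF-IDF: 민주화 (democratization), 경제 (economy), 산업화 (industrialization), 개발 (development)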

Notice:

Frequency alone - “역사” tops all chapters (not helpful for distinguishing!)
TF-IDF - shows what makes EACH chapter unique

Common Student Questions

Q: “Why remove stopwords? They’re frequent for a reason!”

A: Stopwords (like 하다, 있다, 것) appear frequently but don’t help us understand CONTENT. They’re grammatical scaffolding, not meaning.

Think of it like this: If you analyze English news, “the” appears everywhere. That doesn’t tell us whether an article is about sports or politics.

Q: “Why is TF-IDF better than just counting?”

A: Raw counts reward common words. TF-IDF rewards distinctive words.

Example: 역사 (history) appears 100 times, spread across 50 history textbooks. That is neither surprising nor distinctive. 갑오개혁 appears only 50 times, but all of those occurrences are in the 3 textbooks about that specific reform. VERY distinctive!

Q: “Can’t important words be common?”

A: Yes! That’s why we look at BOTH metrics:

Word clouds show what’s discussed most (frequency)
TF-IDF shows what’s distinctive (characteristic vocabulary)

Use both to tell a complete story.

Q: “Why does preprocessing matter?”

A: Preprocessing determines WHAT gets counted:

Keep all words → noise dominates
Remove stopwords → content words emerge
Keep only nouns → different story than nouns+verbs

Preprocessing = analytical choices that affect results!
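
A small Python sketch shows how much one preprocessing choice changes the result. The tokens are hypothetical, and the stopword set is a tiny illustrative sample, not a complete Korean stopword list.

```python
from collections import Counter

# Hypothetical tokens; the stopword set is only a small sample for illustration.
tokens = ["하다", "것", "역사", "하다", "것", "있다", "독립", "하다", "것", "역사"]
stopwords = {"하다", "것", "있다"}

# Choice 1: keep every token.
all_counts = Counter(tokens)

# Choice 2: drop stopwords before counting.
content_counts = Counter(t for t in tokens if t not in stopwords)

print("All words:        ", all_counts.most_common(2))
print("Stopwords removed:", content_counts.most_common(2))
```

With every token kept, the top of the list is grammatical scaffolding; after filtering, the content words 역사 and 독립 come to the top. The same data tells a different story depending on what you choose to count.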

From Descriptive to Interpretive: Making Claims

Descriptive Claims (What you see):

“기원전 appears 54 times”
“민족 (nation) has the highest TF-IDF score”
“This cluster contains 15 documents”

Interpretive Claims (What it might mean):

“The high frequency of 기원전 suggests emphasis on ancient history”
“The prominence of 민족 indicates nationalist discourse”
“Documents cluster by historical period”

Critical Claims (Limitations):

“However, frequency doesn’t capture how these terms are used in context”
“TF-IDF may overweight rare technical terms”
“These patterns could also be explained by…”

Visual Summary: Which Metric When?

┌─────────────────────────────────────────────────────┐
│  RESEARCH QUESTION          →    USE THIS METRIC    │
├─────────────────────────────────────────────────────┤
│  What's discussed most?     →    Word Frequency     │
│  What topics appear?        →    Word Cloud         │
│  What makes docs similar?   →    Bag of Words       │
│  What's distinctive?        →    TF-IDF             │
│  What defines clusters?     →    TF-IDF by cluster  │
│  How do groups differ?      →    Keyword contrast   │
└─────────────────────────────────────────────────────┘

Workflow in Orange: The Pipeline

1. Raw Text
   ↓
2. Preprocessing 
   - Korean morphological analysis
   - POS tagging
   - Stopword removal
   ↓
3. Bag of Words
   - Convert to numbers
   - Count words per document
   ↓
4. Analysis Options:
   → Word Cloud (frequency)
   → Extract Keywords (TF-IDF)
   → Hierarchical Clustering (find patterns)
   → Compare groups (keyword contrast)

Practice Exercise

Given this information about 3 documents:

Document	역사 (TF)	역사 (IDF)	독립 (TF)	독립 (IDF)
Doc A	10	0.1	0	0.8
Doc B	8	0.1	15	0.8
Doc C	12	0.1	2	0.8

Questions:

Which document is most about 독립 (independence)?
Which word is more distinctive: 역사 or 독립?
What would be the TF-IDF for 독립 in Doc B?

Answers:

Doc B (TF = 15, highest frequency)
독립 (IDF = 0.8 vs. 0.1) - appears in fewer documents
TF-IDF = 15 × 0.8 = 12.0 (high)
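
If you want to check the answers yourself, the arithmetic can be sketched in Python. The TF and IDF values are copied from the practice table; the variable names are chosen only for this illustration.

```python
# TF and IDF values copied from the practice table.
tf_history = {"Doc A": 10, "Doc B": 8, "Doc C": 12}       # 역사
tf_independence = {"Doc A": 0, "Doc B": 15, "Doc C": 2}   # 독립
idf_history = 0.1
idf_independence = 0.8

# TF-IDF = TF × IDF, computed for both words in every document.
tfidf_history = {doc: round(tf * idf_history, 2) for doc, tf in tf_history.items()}
tfidf_independence = {
    doc: round(tf * idf_independence, 2) for doc, tf in tf_independence.items()
}

for doc in tf_history:
    print(doc, "역사:", tfidf_history[doc], "독립:", tfidf_independence[doc])
```

Doc C is worth a second look: 역사 appears there six times as often as 독립, yet after weighting 독립 scores higher, because it is the rarer word across the corpus.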

Key Takeaways

Remember:

Frequency = What appears most
Document Frequency = How widespread a word is
IDF = How distinctive/rare a word is
TF-IDF = Frequency × Distinctiveness = What’s characteristic

For your analysis:

Use word frequency to understand general content
Use TF-IDF to find distinctive vocabulary
Use both to tell a complete story
Always justify interpretations with descriptive evidence

The goal: Move from “here’s what the data shows” to “here’s what it might mean” - while staying honest about limitations!

Questions for Reflection

As you work with these metrics in your own analysis:

What does high frequency tell you? What doesn’t it tell you?
When might TF-IDF be misleading?
How do your preprocessing choices affect these metrics?
What’s the difference between describing patterns and interpreting them?

Remember: Computational text analysis is a tool, not an answer. Your job is to use these metrics thoughtfully: make evidence-based interpretations, acknowledge the limitations of the metrics, and support your claims with your own domain expertise. Good analysis combines data skills with area studies expertise.