Understanding Text Analysis: From Words to Meaning

A Guide to Descriptive Text Statistics

by Steven Denney


Introduction: Why Count Words?

When we analyze text computationally, we start with the most basic question: What words appear, and how often?

But not all words are equally important. Some appear everywhere. Some appear rarely but meaningfully. Some distinguish one type of document from another.

This guide explains the key metrics we use to move from raw word counts to meaningful patterns.


1. Word Frequency (Term Frequency / TF)

What It Is:

The simplest measure: How many times does a word appear?

Example:

In a Korean history textbook:

What It Tells Us:

Limitations:

High frequency ≠ high importance

Common words like “하다” (to do), “것” (thing), “있다” (to exist) appear frequently but carry little meaning. This is why we remove stopwords.
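The effect of stopwords on raw frequency can be sketched in a few lines of Python. This is a minimal illustration, not what Orange does internally: the token list and the three-word stopword set are made up for the example, and a real Korean workflow would first need a morphological analyzer to produce the tokens.

```python
from collections import Counter

# Hypothetical, already-tokenized text (illustrative only).
tokens = ["역사", "것", "하다", "역사", "문화", "있다",
          "것", "역사", "하다", "것", "것"]

# A tiny illustrative stopword set (an assumption, not a standard list).
stopwords = {"하다", "것", "있다"}

# Raw word frequency: every token is counted.
raw_counts = Counter(tokens)

# Frequency after stopword removal: only content words are counted.
content_counts = Counter(t for t in tokens if t not in stopwords)

print(raw_counts.most_common(3))
print(content_counts.most_common(3))
```

In this toy data the most frequent raw token is the stopword “것”; once stopwords are removed, “역사” moves to the top.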

In Orange:


2. Bag of Words (BoW)

What It Is:

A way of representing text as numbers that computers can analyze.

Each document becomes a vector (row) where each word is a feature (column) with a count.

Example:

Document | 역사 | 기원전 | 문화 | 독립
Doc 1    | 5    | 8      | 2    | 0
Doc 2    | 3    | 0      | 4    | 7
Doc 3    | 6    | 1      | 3    | 2

Document 1 contains “역사” 5 times, “기원전” 8 times, etc.

What It Tells Us:

Key Assumption:

“The meaning of a document can be captured by which words it contains, ignoring word order”

This is called the “bag” metaphor - like dumping all words into a bag and counting them.
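The bag metaphor can be made concrete with a small sketch. The function name `bag_of_words` and the three toy documents are hypothetical, chosen only to show that two documents with the same words in a different order end up with identical rows.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, matrix): one row per document, one column per word."""
    vocabulary = sorted({token for doc in docs for token in doc})
    matrix = []
    for doc in docs:
        counts = Counter(doc)  # missing words count as 0
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

# Hypothetical, already-tokenized documents.
docs = [
    ["역사", "문화", "역사"],
    ["문화", "역사", "역사"],  # same words as the first, different order
    ["독립", "역사"],
]

vocabulary, matrix = bag_of_words(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

The first two rows come out the same, which is exactly what "ignoring word order" means in practice.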

Limitations:

In Orange:


3. Document Frequency (DF)

What It Is:

How many documents contain a word? (Not how many times the word appears in total.)
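The difference between total count and document frequency is easy to see in code. The sketch below uses a hypothetical helper, `document_frequency`, and made-up tokenized documents; each document is converted to a set so that repeated uses inside one document count only once.

```python
from collections import Counter

def document_frequency(docs):
    """Count, for each word, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # set(): repeats within a document count once
    return df

# Hypothetical, already-tokenized documents.
docs = [
    ["역사", "역사", "기원전"],
    ["역사", "독립"],
    ["역사", "문화", "문화"],
]

df = document_frequency(docs)
total_counts = Counter(token for doc in docs for token in doc)

print(df["역사"], total_counts["역사"])
print(df["문화"], total_counts["문화"])
```

Here “역사” appears 4 times in total but has a document frequency of 3, and “문화” appears twice but in only one document.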

Example:

What It Tells Us:

Two Extremes:

High DF (appears in most documents):

Low DF (appears in few documents):


4. Inverse Document Frequency (IDF)

What It Is:

A weight that increases a word’s importance if it’s rare across documents.

Formula (simplified):

IDF = log(Total Documents / Documents Containing Word)

Intuition:

Example:

Corpus of 50 Korean history textbooks:

Word     | Document Frequency | IDF Score (log base 10)
역사     | 48/50 documents    | 0.02 (LOW)
기원전   | 35/50 documents    | 0.15
삼국시대 | 15/50 documents    | 0.52
갑오개혁 | 5/50 documents     | 1.00 (HIGH)
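The simplified formula does not name the base of the logarithm. The sketch below assumes base 10, under which a word found in 5 of 50 documents scores exactly 1.00; other tools may use the natural logarithm or add smoothing terms, so treat the function `idf` as an illustration of the formula rather than a description of any particular software.

```python
import math

def idf(total_docs, docs_with_word):
    """Simplified IDF: log10(total documents / documents containing the word)."""
    return math.log10(total_docs / docs_with_word)

# Document frequencies from the 50-textbook example.
document_frequency = {"역사": 48, "기원전": 35, "삼국시대": 15, "갑오개혁": 5}

for word, df in document_frequency.items():
    print(word, round(idf(50, df), 2))
```

The rarer the word, the larger the ratio inside the logarithm and the higher the score; a word that appears in every document would score log(1) = 0.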

Interpretation:

Why This Matters:

IDF helps us find distinctive words, not just frequent ones.


5. TF-IDF (Term Frequency - Inverse Document Frequency)

What It Is:

The most important metric: it combines frequency WITH distinctiveness.

Formula:

TF-IDF = (How often in this document) × (How rare across all documents)

How It Works:

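High TF-IDF means: the word appears often in this document and is rare across the other documents.

Low TF-IDF means: the word is rare in this document, common across all documents, or both.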

Example:

Document about 갑오개혁 (Gabo Reform):

Word     | Frequency in Doc | IDF  | TF-IDF
역사     | 8 times          | 0.02 | 0.16 (LOW)
개혁     | 12 times         | 0.45 | 5.40
갑오개혁 | 15 times         | 1.00 | 15.00 (HIGH!)

Interpretation:

What TF-IDF Tells Us:

“What words are most CHARACTERISTIC of each document?”

Not just “what appears most” but “what makes this document unique”
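To see how "most frequent" and "most characteristic" can differ, here is a minimal from-scratch sketch. It assumes raw counts for TF and a base-10 logarithm for IDF; the function `tf_idf` and the toy documents are hypothetical, and real tools often normalize or smooth these values differently.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: score} dict per document (raw count x log10 IDF)."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency: one count per document
    scores = []
    for doc in docs:
        tf = Counter(doc)    # term frequency within this document
        scores.append({w: c * math.log10(n_docs / df[w]) for w, c in tf.items()})
    return scores

# Hypothetical, already-tokenized documents.
docs = [
    ["역사", "역사", "삼국시대", "삼국시대", "삼국시대"],
    ["역사", "역사", "역사", "독립", "독립"],
    ["역사", "경제", "경제"],
]

scores = tf_idf(docs)
top_words = [max(s, key=s.get) for s in scores]
print(top_words)
```

In the second document “역사” is the most frequent word, but because it appears in every document its score is 0, and “독립” comes out as the most characteristic word.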

In Orange:


Putting It All Together: A Comparison

Scenario: Analyzing 3 Korean history textbook chapters

Chapter 1: Ancient Korea (삼국시대)

Chapter 2: Japanese Occupation (일제강점기)

Chapter 3: Modern Korea (현대)

Notice:


Common Student Questions

Q: “Why remove stopwords? They’re frequent for a reason!”

A: Stopwords (like 하다, 있다, 것) appear frequently but don’t help us understand CONTENT. They’re grammatical scaffolding, not meaning.

Think of it like this: If analyzing English news, “the” appears everywhere. That doesn’t tell us whether an article is about sports or politics.

Q: “Why is TF-IDF better than just counting?”

A: Raw counts reward common words. TF-IDF rewards distinctive words.

Example: 역사 (history) appears 100 times across 50 history textbooks. Not surprising, not distinctive. 갑오개혁 appears 50 times but only in 3 textbooks about that specific reform. VERY distinctive!

Q: “Can’t important words be common?”

A: Yes! That’s why we look at BOTH metrics: raw frequency shows what is discussed most, and TF-IDF shows what is distinctive.

Use both to tell a complete story.

Q: “Why does preprocessing matter?”

A: Preprocessing determines WHAT gets counted:

Preprocessing = analytical choices that affect results!


From Descriptive to Interpretive: Making Claims

Descriptive Claims (What you see):

Interpretive Claims (What it might mean):

Critical Claims (Limitations):


Visual Summary: Which Metric When?

┌─────────────────────────────────────────────────────┐
│  RESEARCH QUESTION          →    USE THIS METRIC    │
├─────────────────────────────────────────────────────┤
│  What's discussed most?     →    Word Frequency     │
│  What topics appear?        →    Word Cloud         │
│  What makes docs similar?   →    Bag of Words       │
│  What's distinctive?        →    TF-IDF             │
│  What defines clusters?     →    TF-IDF by cluster  │
│  How do groups differ?      →    Keyword contrast   │
└─────────────────────────────────────────────────────┘
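For the "What makes docs similar?" row, one common way to compare Bag of Words rows is cosine similarity. The guide does not name a similarity measure, so this choice is an assumption made for illustration; the vectors reuse the counts from the Bag of Words example (columns 역사, 기원전, 문화, 독립).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Note: an all-zero vector would cause a division by zero here.
    return dot / (norm_a * norm_b)

# Rows from the Bag of Words example: 역사, 기원전, 문화, 독립.
doc1 = [5, 8, 2, 0]
doc2 = [3, 0, 4, 7]
doc3 = [6, 1, 3, 2]

print(round(cosine_similarity(doc1, doc2), 2))
print(round(cosine_similarity(doc1, doc3), 2))
print(round(cosine_similarity(doc2, doc3), 2))
```

Doc 1 and Doc 2 share the least vocabulary weight (one is heavy on 기원전, the other on 독립), so they score lowest; clustering methods build on pairwise comparisons of this kind.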

Workflow in Orange: The Pipeline

1. Raw Text
   ↓
2. Preprocessing 
   - Korean morphological analysis
   - POS tagging
   - Stopword removal
   ↓
3. Bag of Words
   - Convert to numbers
   - Count words per document
   ↓
4. Analysis Options:
   → Word Cloud (frequency)
   → Extract Keywords (TF-IDF)
   → Hierarchical Clustering (find patterns)
   → Compare groups (keyword contrast)

Practice Exercise

Given this information about 3 documents:

Document | 역사 (TF) | 역사 (IDF) | 독립 (TF) | 독립 (IDF)
Doc A    | 10        | 0.1        | 0         | 0.8
Doc B    | 8         | 0.1        | 15        | 0.8
Doc C    | 12        | 0.1        | 2         | 0.8

Questions:

  1. Which document is most about 독립 (independence)?
  2. Which word is more distinctive: 역사 or 독립?
  3. What would be the TF-IDF for 독립 in Doc B?

Answers:

  1. Doc B (TF = 15, highest frequency)
  2. 독립 (IDF = 0.8 vs. 0.1) - appears in fewer documents
  3. TF-IDF = 15 × 0.8 = 12.0 (high)
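The answers can be checked with a short script. It simply multiplies the TF and IDF values given in the exercise table; the variable names are illustrative, and results are rounded to two decimals to avoid floating-point noise.

```python
# TF values per document and IDF values per word, copied from the exercise table.
tf = {
    "Doc A": {"역사": 10, "독립": 0},
    "Doc B": {"역사": 8, "독립": 15},
    "Doc C": {"역사": 12, "독립": 2},
}
idf = {"역사": 0.1, "독립": 0.8}

# TF-IDF = TF x IDF for every word in every document.
tf_idf = {
    doc: {word: round(count * idf[word], 2) for word, count in counts.items()}
    for doc, counts in tf.items()
}

for doc, scores in tf_idf.items():
    print(doc, scores)

# The document with the highest TF-IDF for 독립.
most_about_independence = max(tf_idf, key=lambda doc: tf_idf[doc]["독립"])
print(most_about_independence)
```

Note that in Doc C, 독립 (1.6) outscores 역사 (1.2) even though 역사 appears six times as often, because 독립 has the higher IDF.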

Key Takeaways

Remember:

  1. Frequency = What appears most
  2. Document Frequency = How widespread a word is
  3. IDF = How distinctive/rare a word is
  4. TF-IDF = Frequency × Distinctiveness = What’s characteristic

For your analysis:

The goal: Move from “here’s what the data shows” to “here’s what it might mean” - while staying honest about limitations!


Questions for Reflection

As you work with these metrics in your own analysis:

  1. What does high frequency tell you? What doesn’t it tell you?
  2. When might TF-IDF be misleading?
  3. How do your preprocessing choices affect these metrics?
  4. What’s the difference between describing patterns and interpreting them?

Remember: Computational text analysis is a tool, not an answer. Your job is to use these metrics thoughtfully to make evidence-based interpretations, acknowledging their limitations and grounding them in your own domain expertise. That is how data skills and area studies expertise work together.