Midterm Assessment — Answer Guide
For in-class review after the assessment
- A single word as separated by spaces in the original text
- A numerical value representing a word’s frequency in the corpus
- A unit of text, such as a word, morpheme, or character, that serves as the basic unit of analysis
- A complete sentence that has been cleaned and preprocessed
A token is the basic unit that text gets broken into for analysis. Unlike a raw space-separated word (which in Korean is an eojeol containing multiple morphemes), a token can be a morpheme, word, or character depending on the tokenization level chosen.
- Sentence level — splitting text into full sentences
- Eojeol level — splitting on spaces
- Morpheme level — decomposing words into their smallest meaningful units
- Syllable level — splitting into individual syllable blocks
Korean can be tokenized at many levels, but our Kiwi-based preprocessing scripts operate at the morpheme level, decomposing eojeols into their smallest meaningful parts and tagging each with a POS label.
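For reference, here is a minimal sketch of eojeol-level versus morpheme-level tokenization, assuming the kiwipiepy package (the usual Python interface to Kiwi) is installed; the sample sentence is illustrative, not from the corpus:

```python
from kiwipiepy import Kiwi

kiwi = Kiwi()
sentence = "대통령이 경제 정책을 발표했다"  # illustrative sample sentence

# Eojeol level: simply split on spaces
print(sentence.split())

# Morpheme level: Kiwi decomposes each eojeol and tags every morpheme
for token in kiwi.tokenize(sentence):
    print(token.form, token.tag)  # e.g. 대통령/NNG, 이/JKS, 경제/NNG, ...
```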
- They are misspelled
- They are too short to analyze
- They are high-frequency words that carry little topical meaning
- They only appear in certain eras of text
Words like 있다, 하다, and 것 are grammatically valid and pass noun/verb POS filters, but they appear so frequently and carry so little topical information that they add noise rather than signal to the analysis.
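As a minimal sketch of this filtering step, with a hypothetical stopword list (not the course's actual list) and an illustrative token stream:

```python
# Hypothetical excerpt of a stopword list applied after POS filtering
stopwords = {"있다", "하다", "것", "되다", "수"}

tokens = ["경제", "있다", "정책", "하다", "것", "성장"]  # illustrative token stream
content_tokens = [t for t in tokens if t not in stopwords]
print(content_tokens)  # ['경제', '정책', '성장']: only topically meaningful words remain
```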
- A single token: 경제를
- Two morphemes with POS tags: 경제/NNG + 를/JKO
- The base form 경제 with the particle deleted
- A frequency count of how often 경제 appears in the document
Kiwi performs morphological analysis: it breaks the eojeol into its constituent morphemes and assigns a POS tag to each. 경제 is tagged NNG (common noun) and 를 is tagged JKO (object particle). POS filtering then keeps 경제 and discards 를.
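The same analysis can be checked directly in Python; this sketch assumes kiwipiepy with its default model and keeps only noun tags:

```python
from kiwipiepy import Kiwi

kiwi = Kiwi()
tokens = kiwi.tokenize("경제를")
print([(t.form, t.tag) for t in tokens])  # expected: [('경제', 'NNG'), ('를', 'JKO')]

# POS filtering: keep nouns (tags starting with NN), drop particles such as JKO
nouns = [t.form for t in tokens if t.tag.startswith("NN")]
print(nouns)  # ['경제']
```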
- Word frequencies and document metadata
- Word order, grammar, and context
- The number of documents in the corpus
- Information about which words appear in which documents
BoW deliberately throws away word order, grammar, and context. This is a drastic simplification, but it works because which words a document uses often characterizes it well enough for many analytical tasks.
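A quick way to see this simplification, using Python's Counter as a stand-in for the Bag of Words step (tokens are illustrative):

```python
from collections import Counter

doc1 = ["정부", "경제", "성장", "강조"]
doc2 = ["경제", "강조", "정부", "성장"]  # same words, different order

# Once reduced to counts, the two documents are indistinguishable
print(Counter(doc1) == Counter(doc2))  # True: word order and context are gone
```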
- The Bag of Words widget duplicated the metadata columns for each document
- Each unique word in the corpus becomes its own column, so the DTM has one column per word rather than one column per metadata field
- The preprocessing script created a new column for every sentence in every speech
- Orange automatically generates extra columns for TF, IDF, and TF-IDF for each original column
The original dataframe has metadata columns (president, date, etc.). After BoW, every unique word in the corpus becomes its own column in the DTM. With thousands of unique words, the DTM has thousands of columns — a fundamentally different shape from the original 7-column dataframe.
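A sketch of the same reshaping outside Orange, using scikit-learn's CountVectorizer on a toy corpus (the real DTM for the speeches would have thousands of word columns):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "경제 성장 정책",   # toy documents: space-joined morphemes
    "복지 정책 확대",
    "경제 위기 극복",
]
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # one column per unique word
print(dtm.shape)                           # (3, 7): 3 documents x 7 word columns
print(dtm.toarray())                       # rows are documents, cells are counts
```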
- The word is important and distinctive for that document
- The word is frequent in this document but also common across the corpus, so it is not distinctive
- The preprocessing pipeline failed for that word
- The word should be added to the stopword list
TF-IDF = TF × IDF. If TF is high but TF-IDF is low, then IDF must be very low — meaning the word appears in most documents across the corpus. It’s frequent here, but it’s frequent everywhere, so it doesn’t distinguish this document.
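A worked example with hypothetical numbers (50 speeches, a word found in 48 of them) shows how a high TF can still produce a low TF-IDF; natural log is used here, and Orange's exact IDF variant may differ slightly:

```python
import math

N = 50       # hypothetical corpus size: 50 speeches
tf = 12      # the word appears 12 times in this speech (high TF)
df = 48      # ...but it also appears in 48 of the 50 speeches

idf = math.log(N / df)
print(round(idf, 3))       # 0.041: IDF is nearly zero
print(round(tf * idf, 3))  # 0.49: TF-IDF stays low despite the high TF
```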
- A filtered version of the corpus with only the most important words remaining in the text
- A document-term matrix — a numerical table where rows are documents, columns are words, and cells are counts or weights
- A frequency-ranked list of all words in the corpus, sorted from most to least common
- A preprocessed corpus with stopwords removed and morphemes tagged
The Bag of Words widget is the step that converts text into numbers. It takes the preprocessed corpus and produces a DTM — the numerical representation that all downstream analysis (word clouds, bar plots, etc.) draws from.
- It is not misleading — higher raw counts always mean the word is more important in that document
- Longer documents naturally produce higher counts for every word. Normalization scales each document to the same length so they can be fairly compared
- The word should be removed as a stopword since it appears in both speeches
- You should only compare speeches by the same president to avoid this problem
A 5,000-word speech will naturally contain higher raw counts for almost every word compared to a 500-word speech. The rate of 경제 is actually higher in Speech B (10/500 = 2%) than Speech A (20/5000 = 0.4%). Normalization scales each document to the same length, removing this document-length bias so documents can be compared fairly.
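The arithmetic from the explanation, written out with the question's hypothetical counts:

```python
# Hypothetical counts of 경제 from the question
count_a, length_a = 20, 5000   # Speech A: 20 occurrences in 5,000 words
count_b, length_b = 10, 500    # Speech B: 10 occurrences in 500 words

print(count_a / length_a)  # 0.004 -> 0.4% of Speech A
print(count_b / length_b)  # 0.02  -> 2.0% of Speech B: higher rate despite the lower raw count
```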
- Words that appear most frequently in the corpus, because frequent words are the most informative
- Words that appear in very few documents, because rarity across the corpus signals that a word is distinctive and informative
- Words with the most syllables, because longer words tend to carry more meaning
- Words that appear exactly once in every document, because even distribution means they are neutral
IDF = log(N/DF). Words appearing in nearly every document get IDF close to 0 (uninformative). Words appearing in only a few documents get high IDF scores because their rarity makes them useful for distinguishing those documents from the rest of the corpus.
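Contrasting two hypothetical words in a 50-speech corpus makes this concrete (natural log, as in the sketch above):

```python
import math

N = 50                   # hypothetical corpus of 50 speeches
print(math.log(N / 49))  # ~0.02: appears in 49 of 50 speeches -> nearly uninformative
print(math.log(N / 2))   # ~3.22: appears in only 2 speeches -> highly distinctive
```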
Grading Formula
| Component | Scoring | Weight |
| --- | --- | --- |
| Concepts Quiz (10 questions) | 1 point each | Weighted to 8 points: (raw / 10) × 8 |
| Preprocessing Task | 0, 1, or 2 points | 2 points |
| Total | | 10 points |
Preprocessing task rubric:
- 0 — Did not preprocess or did not follow directions (e.g., loaded a previous workflow)
- 1 — Attempted but incomplete: missing steps, pipeline errors, or output not clean
- 2 — Successful end-to-end preprocessing with clean Word Cloud output and thoughtful reflection