Midterm Assessment — Answer Guide

For in-class review after the assessment

1 In computational text analysis, what is a token?
A token is the basic unit that text gets broken into for analysis. Unlike a raw space-separated unit (in Korean, an eojeol, which often packs several morphemes together), a token can be a morpheme, word, or character, depending on the tokenization level chosen.
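For review, the non-morpheme levels can be seen with plain Python (a minimal sketch; the morpheme level requires a morphological analyzer such as Kiwi and is covered in the next questions):

```python
import unicodedata

sentence = "경제를 배운다"

# Eojeol level: split on whitespace (Korean space-separated units)
eojeols = sentence.split()          # ['경제를', '배운다']

# Syllable level: each Hangul syllable block is one character
syllables = [ch for ch in sentence if ch != " "]   # 6 syllables

# Jamo level: NFD normalization decomposes a syllable into its jamo
jamo = list(unicodedata.normalize("NFD", "경"))    # 3 jamo: ᄀ, ᅧ, ᆼ

print(eojeols, len(syllables), len(jamo))
```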
2 Korean can be tokenized at multiple levels: sentence, eojeol, morpheme, syllable, and jamo. At which level do our preprocessing scripts tokenize Korean text using Kiwi?
Korean can be tokenized at many levels, but our Kiwi-based preprocessing scripts operate at the morpheme level, decomposing eojeols into their smallest meaningful parts and tagging each with a POS label.
3 Why do we remove stopwords like 있다, 하다, and 것 even though they may pass POS filtering?
Words like 있다, 하다, and 것 are grammatically valid and pass noun/verb POS filters, but they appear so frequently and carry so little topical information that they add noise rather than signal to the analysis.
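A minimal sketch of stopword removal after POS filtering (the token list and stopword set here are illustrative, not our actual pipeline's list, which is longer and corpus-specific):

```python
from collections import Counter

# Hypothetical post-POS-filter lemmas from one document
tokens = ["경제", "하다", "성장", "있다", "것", "경제", "정책", "하다"]

# Small illustrative stopword set: frequent but topically empty words
stopwords = {"있다", "하다", "것"}

# Keep only content-bearing tokens
content_tokens = [t for t in tokens if t not in stopwords]
print(Counter(content_tokens))
```

After removal, the counts that remain (경제, 성장, 정책) actually say something about the document's topic.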
4 The sentence 경제를 배운다 contains two eojeols. What does Kiwi produce when it analyzes the eojeol 경제를?
Kiwi performs morphological analysis: it breaks the eojeol into its constituent morphemes and assigns a POS tag to each. 경제 is tagged NNG (common noun) and 를 is tagged JKO (object particle). POS filtering then keeps 경제 and discards 를.
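A sketch of the filtering step using hardcoded Kiwi-style output (with kiwipiepy you would get these pairs from the analyzer itself; the 경제를 analysis matches the answer above, while the 배운다 split into 배우/VV + ㄴ다/EF is an assumed illustrative analysis):

```python
# Kiwi-style (form, POS tag) pairs for 경제를 배운다.
# 경제/NNG + 를/JKO is from the answer above; 배우/VV + ㄴ다/EF is assumed.
morphemes = [("경제", "NNG"), ("를", "JKO"), ("배우", "VV"), ("ㄴ다", "EF")]

# POS filter: keep content tags (common nouns, verbs), drop particles/endings
KEEP = {"NNG", "VV"}
kept = [form for form, tag in morphemes if tag in KEEP]
print(kept)  # ['경제', '배우']
```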
5 The Bag of Words model is a deliberate simplification. What does it discard?
BoW deliberately throws away word order, grammar, and context. This is a drastic simplification, but it works because which words a document uses often characterizes it well enough for many analytical tasks.
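A two-line demonstration of what gets thrown away (English words used so the reversal is obvious):

```python
from collections import Counter

# Two sentences with opposite meanings...
a = Counter("dog bites man".split())
b = Counter("man bites dog".split())

# ...are indistinguishable under Bag of Words: word order is gone
print(a == b)  # True
```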
6 Your presidential speeches dataframe has 749 rows (one per speech) and 7 columns (president, date, title, etc.). After preprocessing and running Bag of Words, the resulting DTM has 749 rows and over 10,000 columns. Why did the number of columns change so dramatically?
The original dataframe has metadata columns (president, date, etc.). After BoW, every unique word in the corpus becomes its own column in the DTM. With thousands of unique words, the DTM has thousands of columns — a fundamentally different shape from the original 7-column dataframe.
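A minimal sketch of how a DTM's columns come from the vocabulary, not from the original metadata (tiny made-up corpus for illustration):

```python
from collections import Counter

# Tiny corpus: each document is a pre-tokenized list of words
docs = [["경제", "성장", "경제"], ["복지", "성장"], ["경제", "복지", "복지"]]

# Vocabulary: every unique word in the corpus becomes a column
vocab = sorted({w for d in docs for w in d})

# DTM: one row per document, one count per vocabulary word
counts = [Counter(d) for d in docs]
dtm = [[c[w] for w in vocab] for c in counts]
print(vocab)  # ['경제', '복지', '성장']
print(dtm)    # [[2, 0, 1], [0, 1, 1], [1, 2, 0]]
```

With 3 documents and 3 unique words this is a 3×3 matrix; with 749 speeches and 10,000+ unique words, the same construction yields a 749 × 10,000+ DTM.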
7 If a word has high TF in a document but low TF-IDF, what does that tell you?
TF-IDF = TF × IDF. If TF is high but TF-IDF is low, then IDF must be very low — meaning the word appears in most documents across the corpus. It’s frequent here, but it’s frequent everywhere, so it doesn’t distinguish this document.
8 In Orange Data Mining, what does the Bag of Words widget produce from your preprocessed corpus?
The Bag of Words widget is the step that converts text into numbers. It takes the preprocessed corpus and produces a DTM — the numerical representation that all downstream analysis (word clouds, bar plots, etc.) draws from.
9 Speech A is 5,000 words long and contains the word 경제 (economy) 20 times. Speech B is 500 words long and contains 경제 10 times. Based on raw term frequency alone, Speech A has the higher count. Why can this be misleading, and how do we address it?
A 5,000-word speech will naturally contain higher raw counts for almost every word compared to a 500-word speech. The rate of 경제 is actually higher in Speech B (10/500 = 2%) than Speech A (20/5000 = 0.4%). Normalization converts raw counts into relative frequencies (or otherwise scales each document to a common length), removing this document-length bias so documents can be compared fairly.
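The arithmetic from the answer, worked out:

```python
# Raw counts mislead when documents differ in length
count_a, len_a = 20, 5000   # Speech A: 경제 appears 20 times
count_b, len_b = 10, 500    # Speech B: 경제 appears 10 times

# Normalize to a rate (relative frequency) per document
rate_a = count_a / len_a    # 0.004 -> 0.4%
rate_b = count_b / len_b    # 0.02  -> 2%

print(rate_b > rate_a)  # True: Speech B uses 경제 more intensively
```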
10 In TF-IDF, the IDF (Inverse Document Frequency) component gives a higher score to certain words. Which words receive the highest IDF scores, and why is this useful?
IDF = log(N/DF). Words appearing in nearly every document get IDF close to 0 (uninformative). Words appearing in only a few documents get high IDF scores because their rarity makes them useful for distinguishing those documents from the rest of the corpus.
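A minimal numerical sketch of the IDF formula from the answer (N = 100 is an assumed corpus size; natural log used for illustration, and implementations vary in base and smoothing):

```python
import math

N = 100  # assumed number of documents in the corpus

def idf(df, n=N):
    """IDF = log(N / DF), where DF = documents containing the word."""
    return math.log(n / df)

# A word in nearly every document is nearly uninformative...
print(round(idf(99), 3))   # close to 0

# ...a word in only a few documents scores high
print(round(idf(2), 3))    # about 3.912

# Question 7 revisited: high TF times near-zero IDF is still a low TF-IDF
tf = 50
print(tf * idf(99) < tf * idf(2))  # True
```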

Grading Formula

Component                      Scoring              Weight
Concepts Quiz (10 questions)   1 point each         Weighted to 8 points: (raw / 10) × 8
Preprocessing Task             0, 1, or 2 points    2 points
Total                                               out of 10

Preprocessing task rubric:

0 — Did not preprocess or did not follow directions (e.g., loaded a previous workflow)
1 — Attempted but incomplete: missing steps, pipeline errors, or output not clean
2 — Successful end-to-end preprocessing with clean Word Cloud output and thoughtful reflection