Midterm Assessment — Answer Guide
For in-class review after the assessment
- A single word as separated by spaces in the original text
- A numerical value representing a word’s frequency in the corpus
- A unit of text, such as a word, morpheme, or character, that serves as the basic unit of analysis
- A complete sentence that has been cleaned and preprocessed
A token is the basic unit that text gets broken into for analysis. Unlike a raw space-separated word (which in Korean is an eojeol containing multiple morphemes), a token can be a morpheme, word, or character depending on the tokenization level chosen.
- Sentence level — splitting text into full sentences
- Eojeol level — splitting on spaces
- Morpheme level — decomposing words into their smallest meaningful units
- Syllable level — splitting into individual syllable blocks
Korean can be tokenized at many levels, but our Kiwi-based preprocessing scripts operate at the morpheme level, decomposing eojeols into their smallest meaningful parts and tagging each with a POS label.
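For reference, here is a minimal sketch of eojeol-level versus morpheme-level tokenization, assuming the kiwipiepy package (the usual Python interface to Kiwi) is installed; the sample sentence is illustrative, not from the corpus:

```python
from kiwipiepy import Kiwi

kiwi = Kiwi()
sentence = "대통령이 경제 정책을 발표했다"  # illustrative sample sentence

# Eojeol level: simply split on spaces
print(sentence.split())

# Morpheme level: Kiwi decomposes each eojeol and tags every morpheme
for token in kiwi.tokenize(sentence):
    print(token.form, token.tag)  # e.g. 대통령/NNG, 이/JKS, 경제/NNG, ...
```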
- They are misspelled
- They are too short to analyze
- They are high-frequency words that carry little topical meaning
- They only appear in certain eras of text
Words like 있다, 하다, and 것 are grammatically valid and pass noun/verb POS filters, but they appear so frequently and carry so little topical information that they add noise rather than signal to the analysis.
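As a minimal sketch of this filtering step, with a hypothetical stopword list (not the course's actual list) and an illustrative token stream:

```python
# Hypothetical excerpt of a stopword list applied after POS filtering
stopwords = {"있다", "하다", "것", "되다", "수"}

tokens = ["경제", "있다", "정책", "하다", "것", "성장"]  # illustrative token stream
content_tokens = [t for t in tokens if t not in stopwords]
print(content_tokens)  # ['경제', '정책', '성장']: only topically meaningful words remain
```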
- A single token: 경제를
- Two morphemes with POS tags: 경제/NNG + 를/JKO
- The base form 경제 with the particle deleted
- A frequency count of how often 경제 appears in the document
Kiwi performs morphological analysis: it breaks the eojeol into its constituent morphemes and assigns a POS tag to each. 경제 is tagged NNG (common noun) and 를 is tagged JKO (object particle). POS filtering then keeps 경제 and discards 를.
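The same analysis can be checked directly in Python; this sketch assumes kiwipiepy with its default model and keeps only noun tags:

```python
from kiwipiepy import Kiwi

kiwi = Kiwi()
tokens = kiwi.tokenize("경제를")
print([(t.form, t.tag) for t in tokens])  # expected: [('경제', 'NNG'), ('를', 'JKO')]

# POS filtering: keep nouns (tags starting with NN), drop particles such as JKO
nouns = [t.form for t in tokens if t.tag.startswith("NN")]
print(nouns)  # ['경제']
```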
- Word frequencies and document metadata
- Word order, grammar, and context
- The number of documents in the corpus
- Information about which words appear in which documents
BoW deliberately throws away word order, grammar, and context. This is a drastic simplification, but it works because which words a document uses often characterizes it well enough for many analytical tasks.
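A quick way to see this simplification, using Python's Counter as a stand-in for the Bag of Words step (tokens are illustrative):

```python
from collections import Counter

doc1 = ["정부", "경제", "성장", "강조"]
doc2 = ["경제", "강조", "정부", "성장"]  # same words, different order

# Once reduced to counts, the two documents are indistinguishable
print(Counter(doc1) == Counter(doc2))  # True: word order and context are gone
```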
- The Bag of Words widget duplicated the metadata columns for each document
- Each unique word in the corpus becomes its own column, so the DTM has one column per word rather than one column per metadata field
- The preprocessing script created a new column for every sentence in every speech
- Orange automatically generates extra columns for TF, IDF, and TF-IDF for each original column
The original dataframe has metadata columns (president, date, etc.). After BoW, every unique word in the corpus becomes its own column in the DTM. With thousands of unique words, the DTM has thousands of columns — a fundamentally different shape from the original 7-column dataframe.
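A sketch of the same reshaping outside Orange, using scikit-learn's CountVectorizer on a toy corpus (the real DTM for the speeches would have thousands of word columns):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "경제 성장 정책",   # toy documents: space-joined morphemes
    "복지 정책 확대",
    "경제 위기 극복",
]
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # one column per unique word
print(dtm.shape)                           # (3, 7): 3 documents x 7 word columns
print(dtm.toarray())                       # rows are documents, cells are counts
```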
- The word is important and distinctive for that document
- The word is frequent in this document but also common across the corpus, so it is not distinctive
- The preprocessing pipeline failed for that word
- The word should be added to the stopword list
TF-IDF = TF × IDF. If TF is high but TF-IDF is low, then IDF must be very low — meaning the word appears in most documents across the corpus. It’s frequent here, but it’s frequent everywhere, so it doesn’t distinguish this document.
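A worked example with hypothetical numbers (50 speeches, a word found in 48 of them) shows how a high TF can still produce a low TF-IDF; natural log is used here, and Orange's exact IDF variant may differ slightly:

```python
import math

N = 50       # hypothetical corpus size: 50 speeches
tf = 12      # the word appears 12 times in this speech (high TF)
df = 48      # ...but it also appears in 48 of the 50 speeches

idf = math.log(N / df)
print(round(idf, 3))       # 0.041: IDF is nearly zero
print(round(tf * idf, 3))  # 0.49: TF-IDF stays low despite the high TF
```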
- A filtered version of the corpus with only the most important words remaining in the text
- A document-term matrix — a numerical table where rows are documents, columns are words, and cells are counts or weights
- A frequency-ranked list of all words in the corpus, sorted from most to least common
- A preprocessed corpus with stopwords removed and morphemes tagged
The Bag of Words widget is the step that converts text into numbers. It takes the preprocessed corpus and produces a DTM — the numerical representation that all downstream analysis (word clouds, bar plots, etc.) draws from.
- It is not misleading — higher raw counts always mean the word is more important in that document
- Longer documents naturally produce higher counts for every word. Normalization scales each document to the same length so they can be fairly compared
- The word should be removed as a stopword since it appears in both speeches
- You should only compare speeches by the same president to avoid this problem
A 5,000-word speech will naturally contain higher raw counts for almost every word compared to a 500-word speech. The rate of 경제 is actually higher in Speech B (10/500 = 2%) than Speech A (20/5000 = 0.4%). Normalization scales each document to the same length, removing this document-length bias so documents can be compared fairly.
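The arithmetic from the explanation, written out with the question's hypothetical counts:

```python
# Hypothetical counts of 경제 from the question
count_a, length_a = 20, 5000   # Speech A: 20 occurrences in 5,000 words
count_b, length_b = 10, 500    # Speech B: 10 occurrences in 500 words

print(count_a / length_a)  # 0.004 -> 0.4% of Speech A
print(count_b / length_b)  # 0.02  -> 2.0% of Speech B: higher rate despite the lower raw count
```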
- Words that appear most frequently in the corpus, because frequent words are the most informative
- Words that appear in very few documents, because rarity across the corpus signals that a word is distinctive and informative
- Words with the most syllables, because longer words tend to carry more meaning
- Words that appear exactly once in every document, because even distribution means they are neutral
IDF = log(N/DF). Words appearing in nearly every document get IDF close to 0 (uninformative). Words appearing in only a few documents get high IDF scores because their rarity makes them useful for distinguishing those documents from the rest of the corpus.
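Contrasting two hypothetical words in a 50-speech corpus makes this concrete (natural log, as in the sketch above):

```python
import math

N = 50                   # hypothetical corpus of 50 speeches
print(math.log(N / 49))  # ~0.02: appears in 49 of 50 speeches -> nearly uninformative
print(math.log(N / 2))   # ~3.22: appears in only 2 speeches -> highly distinctive
```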
Grading Formula
| Component | Scoring | Weight |
| --- | --- | --- |
| Concepts Quiz (10 questions) | 1 point each | Weighted to 8 points: (raw / 10) × 8 |
| Preprocessing Task | 0, 1, or 2 points | 2 points |
| Total | | 10 points |
Preprocessing task rubric:
- 0 — Did not preprocess or did not follow directions (e.g., loaded a previous workflow)
- 1 — Attempted but incomplete: missing steps, pipeline errors, or output not clean
- 2 — Successful end-to-end preprocessing with clean Word Cloud output and thoughtful reflection