Text Preprocessing Pipeline

Step through the text preprocessing pipeline on real sentences from Korean presidential speeches. Each step shows how the text changes on its way from raw input to cleaned, analysis-ready output. Use the POS tag and stopword controls in step 5 to experiment with different filtering settings.

Why can't I select every token? Korean is an agglutinative language: words are built by attaching grammatical morphemes (particles, endings, suffixes) to content morphemes (nouns, verbs, adjectives). The toggles above control which content POS tags to keep. Grammatical morphemes such as subject particles (JKS: 이/가), object particles (JKO: 을/를), verb endings (EF, EC), and verb-deriving suffixes (XSV: 하) are always filtered out because they mark grammatical structure rather than carry lexical meaning. Dropping them is common practice in Korean computational text analysis.
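
To make the filtering rule concrete, here is a minimal Python sketch of the idea. It is an illustration only, not this tool's actual code: it assumes the text has already been split into (morpheme, POS tag) pairs by a morphological analyzer, and the names `GRAMMATICAL_TAGS` and `filter_tokens`, the hand-tagged example sentence, and the stopword are all hypothetical.

```python
# Grammatical tags named in the text above; these are dropped no matter
# which content tags the user selects. (A real pipeline would likely
# cover more grammatical tags than these five.)
GRAMMATICAL_TAGS = {"JKS", "JKO", "EF", "EC", "XSV"}


def filter_tokens(tagged, keep_tags, stopwords=frozenset()):
    """Return morphemes whose tag is selected, non-grammatical, and not a stopword."""
    kept = []
    for morpheme, tag in tagged:
        if tag in GRAMMATICAL_TAGS:
            continue  # particles, endings, suffixes: always removed
        if tag not in keep_tags:
            continue  # content tag the user did not select
        if morpheme in stopwords:
            continue  # user-supplied stopword
        kept.append(morpheme)
    return kept


# Hand-tagged example (assumed analysis of "국민이 경제를 걱정한다",
# roughly "The people worry about the economy").
tagged = [
    ("국민", "NNG"), ("이", "JKS"),
    ("경제", "NNG"), ("를", "JKO"),
    ("걱정", "NNG"), ("하", "XSV"), ("ㄴ다", "EF"),
]

nouns = filter_tokens(tagged, keep_tags={"NNG", "NNP"})
nouns_no_stop = filter_tokens(tagged, keep_tags={"NNG", "NNP"}, stopwords={"국민"})
print(nouns)          # ['국민', '경제', '걱정']
print(nouns_no_stop)  # ['경제', '걱정']
```

Note that selecting a grammatical tag has no effect in this sketch: even if `keep_tags` included `"JKS"`, the particle 이 would still be removed, which mirrors why those tokens cannot be toggled on.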