Data & Scripts

This page collects the datasets and scripts we use throughout the course. Resources are added each week as needed — check back for updates.

Datasets are sampled subsets of larger corpora maintained in the NLP Corpora for Korean Studies repository. For full corpora and detailed documentation, see that repository.

Organizing your files: Create subfolders within your /data directory by corpus or file type — e.g., /data/president_speeches/, /data/scripts/. This keeps things tidy as we accumulate more files over the semester.
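If you would rather script the setup than click through Finder or Explorer, the suggested layout can be created in a few lines of Python (the folder names beyond the two mentioned above are just examples):

```python
# One-time setup sketch: create subfolders inside your repository's
# /data directory. Folder names here are illustrative examples.
from pathlib import Path

for sub in ("president_speeches", "scripts"):
    Path("data", sub).mkdir(parents=True, exist_ok=True)

print(sorted(p.name for p in Path("data").iterdir()))
```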


Datasets

Week(s) | Dataset | Description | Download
2–5 | Presidential Speeches | 749 democratic-era presidential speeches (Roh Tae-woo–Moon Jae-in), sampled from 5,840. See documentation. | CSV (~4.4 MB)
2–5 | Presidential Speeches (Small) | 100 randomly selected speeches from the last three presidents (Lee Myung-bak, Park Geun-hye, Moon Jae-in). Use this if Orange runs slowly or crashes with the full file. | CSV (~500 KB)
5+ | NIKH History Textbooks (Demo) | 9 Korean history textbooks across 3 eras (Colonial, Authoritarian, Democratic), sampled from the 67-book NIKH corpus (1895–2016). Includes a processed_text column (pre-tokenized nouns, ready for analysis). See full corpus documentation. | CSV (~1.8 MB)
7 | NIKH Clustering Demo | 11 Korean history textbooks (3 Colonial, 4 Authoritarian, 4 Democratic) for the Week 7 clustering exercise. Contains full_text for preprocessing in Orange or R. | CSV (~3.1 MB)
4+ | Korean Stopwords | 678 Korean stopwords (punctuation, numbers, and high-frequency grammatical words). Load in the Preprocess Text widget under Filtering → Stopwords → From File. | TXT
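To see what the stopword list actually does, here is a minimal sketch of the filtering logic. Orange's Preprocess Text widget applies the downloaded file for you; the tiny word list and token sequence below are made-up stand-ins for illustration.

```python
# Sketch of stopword filtering: tokens that appear in the stopword file
# are dropped before analysis. The list and tokens below are made up.
stopword_file_contents = "은\n는\n이\n가\n을\n.\n,\n1\n2\n"  # stands in for the TXT file

stopwords = {line.strip() for line in stopword_file_contents.splitlines() if line.strip()}

tokens = ["대통령", "은", "연설", "을", ".", "정책", "1"]  # hypothetical tokenized text
kept = [t for t in tokens if t not in stopwords]
print(kept)
```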

Scripts

Week | Script | Description | Download
3 | Korean Preprocessing (Mac) | POS-based tokenization for Orange Data Mining. Auto-installs kiwipiepy. | Python
3 | Korean Preprocessing (Windows) | Same as above, but requires kiwipiepy to be installed beforehand. | Python
4 | Korean Preprocessing — Annotated (Mac) | Fully annotated version of the Mac preprocessing script. Read this to understand what each step does and why. | Python
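The core idea behind these scripts is filtering tokens by part-of-speech tag so that only content words survive. A minimal sketch of that step is below; in the real scripts kiwipiepy's tokenizer supplies the (form, tag) pairs, which are hard-coded here so the example runs without the library. Tags follow the Sejong-style tagset, where noun tags begin with "NN".

```python
# Sketch of the POS-filtering step the preprocessing scripts perform.
# The (form, tag) pairs are hard-coded stand-ins for tokenizer output.
tagged = [
    ("대통령", "NNG"),  # common noun
    ("이", "JKS"),      # subject particle
    ("연설", "NNG"),    # common noun
    ("하", "VV"),       # verb stem
    ("다", "EF"),       # sentence-final ending
]

# Keep only nouns: tags that start with "NN" (NNG, NNP, ...)
nouns = [form for form, tag in tagged if tag.startswith("NN")]
print(" ".join(nouns))  # space-joined nouns, like a processed_text column
```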

How to Use

Datasets:

  1. Download the CSV file
  2. Save it to a subfolder in your GitHub repository (e.g., /data/president_speeches/)
  3. Commit and push via GitHub Desktop
  4. Load the file in Orange Data Mining using the Corpus widget

Scripts:

  1. Download the .py file for your operating system
  2. In Orange, add a Python Script widget and paste the code
  3. Change TEXT_COLUMN to match your corpus column name (see the script comments)
  4. Connect the widget into your Orange workflow and run it
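The TEXT_COLUMN edit in step 3 can be pictured like this. Only TEXT_COLUMN is a name taken from the actual scripts; the placeholder preprocessing function and sample row below are hypothetical, for demonstration only.

```python
# Illustrative sketch of the one setting you change before running.
TEXT_COLUMN = "text"  # edit this to match the text column in your corpus

def preprocess(raw):
    """Stand-in for the script's POS-based tokenization."""
    return raw.lower()

# A hypothetical corpus row, keyed by column name
row = {"title": "Speech 1", "text": "Some Raw Text"}
row["processed_text"] = preprocess(row[TEXT_COLUMN])
print(row["processed_text"])
```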