Data & Scripts
This page collects the datasets and scripts we use throughout the course. Resources are added each week as needed — check back for updates.
Datasets are sampled subsets of larger corpora maintained in the NLP Corpora for Korean Studies repository. For full corpora and detailed documentation, see that repository.
Organizing your files: Create subfolders within your /data directory by corpus or file type — e.g., /data/president_speeches/, /data/scripts/. This keeps things tidy as we accumulate more files over the semester.
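The suggested layout can be created in a couple of lines. A minimal sketch using Python's pathlib (the folder names are just the examples above; a scratch directory stands in for your repository):

```python
import tempfile
from pathlib import Path

# Work in a scratch directory so the example has no side effects;
# in practice this would be the root of your course repository.
root = Path(tempfile.mkdtemp())
data_dir = root / "data"
for sub in ("president_speeches", "scripts"):
    (data_dir / sub).mkdir(parents=True, exist_ok=True)

created = sorted(p.name for p in data_dir.iterdir())
print(created)  # ['president_speeches', 'scripts']
```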
Datasets
| Week(s) | Dataset | Description | Download |
|---|---|---|---|
| 2–5 | Presidential Speeches | 749 democratic-era presidential speeches (Roh Tae-woo–Moon Jae-in), sampled from 5,840. See documentation. | CSV (~4.4 MB) |
| 2–5 | Presidential Speeches (Small) | 100 randomly selected speeches from the last three presidents (Lee Myung-bak, Park Geun-hye, Moon Jae-in). Use this if Orange runs slowly or crashes with the full file. | CSV (~500 KB) |
| 5+ | NIKH History Textbooks (Demo) | 9 Korean history textbooks across 3 eras (Colonial, Authoritarian, Democratic), sampled from the 67-book NIKH corpus (1895–2016). Includes a processed_text column (pre-tokenized nouns, ready for analysis). See the full corpus documentation. | CSV (~1.8 MB) |
| 7 | NIKH Clustering Demo | 11 Korean history textbooks (3 Colonial, 4 Authoritarian, 4 Democratic) for the Week 7 clustering exercise. Contains full_text for preprocessing in Orange or R. | CSV (~3.1 MB) |
| 4+ | Korean Stopwords | 678 Korean stopwords (punctuation, numbers, and high-frequency grammatical words). Load in the Preprocess Text widget under Filtering → Stopwords → From File. | TXT |
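The stopword file is plain text with one entry per line, which you can sanity-check before loading it in Orange. A small sketch (the sample words here are illustrative, not drawn from the actual 678-entry list):

```python
import tempfile
from pathlib import Path

# Write a tiny illustrative stopword file: one entry per line, UTF-8.
sample = "의\n를\n하다\n.\n,\n"
path = Path(tempfile.mkdtemp()) / "korean_stopwords.txt"
path.write_text(sample, encoding="utf-8")

# Read it back the way a loader would: one word per line, blanks dropped.
stopwords = [line.strip()
             for line in path.read_text(encoding="utf-8").splitlines()
             if line.strip()]
print(len(stopwords))  # 5
```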
Scripts
| Week | Script | Description | Download |
|---|---|---|---|
| 3 | Korean Preprocessing (Mac) | POS-based tokenization for Orange Data Mining. Auto-installs kiwipiepy. | Python |
| 3 | Korean Preprocessing (Windows) | Same as above but requires kiwipiepy pre-installed. | Python |
| 4 | Korean Preprocessing — Annotated (Mac) | Fully annotated version of the Mac preprocessing script. Read this to understand what each step does and why. | Python |
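The preprocessing scripts keep nouns by part-of-speech tag. A minimal sketch of that filtering step, operating on (form, tag) pairs like those kiwipiepy's Kiwi.tokenize returns — the sample tokens below are hand-written for illustration, not actual analyzer output:

```python
# NNG (common noun) and NNP (proper noun) in the Sejong-style tagset
# that kiwipiepy uses; the course scripts may keep additional tags.
NOUN_TAGS = {"NNG", "NNP"}

def extract_nouns(tokens):
    """Keep only the surface forms whose POS tag marks a noun."""
    return [form for form, tag in tokens if tag in NOUN_TAGS]

# Hand-tagged example sentence: "대통령이 연설을 했다"
tokens = [("대통령", "NNG"), ("이", "JKS"), ("연설", "NNG"),
          ("을", "JKO"), ("하", "VV"), ("었다", "EP+EF")]
print(" ".join(extract_nouns(tokens)))  # 대통령 연설
```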
How to Use
Datasets:
- Download the CSV file
- Save it to a subfolder in your GitHub repository (e.g., /data/president_speeches/)
- Commit and push via GitHub Desktop
- Load the file in Orange Data Mining using the Corpus widget
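Before loading a CSV in the Corpus widget, it can help to confirm the column names, since you will need them later when pointing the scripts at the text column. A quick check with Python's csv module, using a made-up two-column file in place of a course dataset:

```python
import csv
import io

# Stand-in for one of the course CSVs: a header row plus one record.
sample = "title,text\nNew Year Address,국민 여러분 새해 복 많이 받으십시오\n"

reader = csv.reader(io.StringIO(sample))
header = next(reader)  # the first row lists the column names
print(header)  # ['title', 'text']
```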
Scripts:
- Download the .py file for your operating system
- In Orange, add a Python Script widget and paste the code
- Change TEXT_COLUMN to match your corpus column name (see the script comments)
- Connect it to your data flow and run
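TEXT_COLUMN is just a constant near the top of the script that names which column holds the raw text. A stripped-down illustration of the pattern — the function and row structure here are illustrative, not taken from the actual scripts:

```python
# Set this to the column in your corpus that holds the raw text
# (the Week 7 clustering demo, for example, uses full_text).
TEXT_COLUMN = "text"

def get_texts(rows):
    """Pull the raw text out of each row (rows as dicts, one per document)."""
    return [row[TEXT_COLUMN] for row in rows]

rows = [{"title": "Address", "text": "국민 여러분"},
        {"title": "Speech", "text": "감사합니다"}]
print(get_texts(rows))  # ['국민 여러분', '감사합니다']
```

If the script raises a KeyError here, the constant does not match your corpus: check the column names shown in Orange's Corpus widget.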