Data & Scripts

This page collects the datasets, scripts, and dictionaries used throughout the course. Resources are added each week — check back for updates.

Datasets are sampled subsets of larger corpora maintained in the NLP Corpora for Korean Studies repository. For full corpora and documentation, see that repository.

Organizing your files: Create subfolders within your /data directory by corpus or file type — e.g., /data/president_speeches/, /data/scripts/. This keeps things tidy as the semester accumulates files.


Datasets

| Week(s) | Dataset | Description | Download |
|---|---|---|---|
| 2–5 | Presidential Speeches | 749 democratic-era presidential speeches (Roh Tae-woo – Moon Jae-in), sampled from 5,840. See README. | CSV (~4.4 MB) |
| 2–5 | Presidential Speeches (Small) | 100 randomly selected speeches from the last three presidents. Use this if Orange runs slowly with the full file. | CSV (~500 KB) |
| 5, 10 | NIKH History Textbooks (Demo) | 9 Korean history textbooks across 3 eras (Colonial, Authoritarian, Democratic), sampled from the 67-book NIKH corpus (1895–2016). Includes a pre-tokenized processed_text column. The smaller option for the Week 10 LDA assignment. See the full corpus documentation. | CSV (~1.8 MB) |
| 7, 10 | NIKH Clustering Demo | 11 Korean history textbooks (3 Colonial, 4 Authoritarian, 4 Democratic). Used for the Week 7 clustering exercise and also an option for the Week 10 LDA assignment. Contains a full_text column for preprocessing in Orange. | CSV (~3.1 MB) |
| 9 | Moon Jae-in Tweets | 3,148 tweets from @moonriver365 (2012–2020), with favorites, retweets, and a period3 column (Pre-presidency / Transition / Presidency). See README. | CSV |
| 11 (Final Assessment) | Kyongje Yongu (KJYG) sample | 360 articles from the DPRK economics journal, 1987–2017, balanced at 120 per leader era (Kim Il-sung / Kim Jong-il / Kim Jong-un). See README and the data dictionary. | CSV (~3.1 MB) |
| 11 (Final Assessment) | Cheong Wa Dae Petitions sample | 360 citizen petitions from the Cheong Wa Dae online platform, 2017–2018, balanced at 60 per category: 정치개혁 (political reform), 인권·성평등 (human rights & gender equality), 외교·통일·국방 (foreign affairs, unification & defense), 육아·교육 (childcare & education), 보건복지 (health & welfare), 일자리 (jobs). See README and the data dictionary. | CSV (~600 KB) |
| Final Paper | Final Paper Dataset Menu | Curated 11-corpus menu for the final research paper, hosted in a dedicated repo so the course site stays lightweight. Pick one corpus and write a 2,500–6,000-word research paper. | Repo |

Scripts

| Week(s) | Script | Description | Download |
|---|---|---|---|
| 3–8, 10 | Korean Preprocessing (Mac) | POS-based Kiwi tokenization for Orange Data Mining. Auto-installs kiwipiepy. Keeps NNG and NNP tags (nouns only). | Python |
| 3–8, 10 | Korean Preprocessing (Windows) | Same as above; kiwipiepy must be pre-installed. | Python |
| 3 | Korean Preprocessing — Annotated | Fully annotated Mac version. Read this to understand what each step does and why. | Python |
| 9 | Sentiment Preprocessing (Mac) | Kiwi tokenization for sentiment analysis. Auto-installs kiwipiepy. Keeps NNG, NNP, VV, and VA tags (nouns, verbs, and adjectives). | Python |
| 9 | Sentiment Preprocessing (Windows) | Same as above; kiwipiepy must be pre-installed. | Python |
| Final paper | Hanja Preprocessing (Mac) | Hanja-aware variant for Hanmun-mixed corpora (Colonial Magazines, Kaebyok, older newspaper articles). Converts Chinese characters to their Hangul readings, then runs Kiwi tokenization. Auto-installs kiwipiepy and hanja. | Python |
| Final paper | Hanja Preprocessing (Windows) | Same as above; kiwipiepy and hanja must be pre-installed (pip install kiwipiepy hanja). | Python |
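All of these scripts share one core pattern: tokenize each document with Kiwi, keep only tokens whose POS tag is on an allow-list, and join the surviving word forms back into a space-separated string. The sketch below shows that filtering step in isolation. With kiwipiepy installed, the tokens would come from `Kiwi().tokenize(text)` (each token has `.form` and `.tag` attributes); here a hand-written `(form, tag)` list stands in for Kiwi's output so the snippet runs without the library, and the tagging of the example sentence is illustrative rather than Kiwi's exact analysis.

```python
# Sketch of the POS-filtering step shared by the preprocessing scripts.
# (form, tag) tuples stand in for kiwipiepy token objects.

KEEP_TAGS = {"NNG", "NNP"}  # common + proper nouns (the noun-only scripts)

def filter_tokens(tokens, keep_tags=KEEP_TAGS):
    """Keep tokens whose POS tag is in keep_tags; return a space-joined string."""
    return " ".join(form for form, tag in tokens if tag in keep_tags)

# A plausible analysis of "대통령이 연설을 했다" ("the president gave a speech"):
tokens = [
    ("대통령", "NNG"),  # common noun: kept
    ("이", "JKS"),      # subject particle: dropped
    ("연설", "NNG"),    # common noun: kept
    ("을", "JKO"),      # object particle: dropped
    ("하", "VV"),       # verb stem: dropped in noun-only mode
    ("었", "EP"),       # past-tense marker: dropped
    ("다", "EF"),       # final ending: dropped
]
print(filter_tokens(tokens))  # 대통령 연설
```

The sentiment scripts differ only in widening the allow-list to `{"NNG", "NNP", "VV", "VA"}`, and the Hanja variants first map Chinese characters to their Hangul readings (via the hanja package) before tokenizing.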

Dictionaries & Word Lists

| Week(s) | File | Description | Download |
|---|---|---|---|
| 4–8, 10 | Korean Stopwords | 678 Korean stopwords (particles, auxiliaries, and common grammatical words). Load in Preprocess Text → Filtering → Stopwords → From File. | TXT |
| 9 | KNU Positive Word List | 4,868 positive-polarity Korean words (Park et al. 2018, Kunsan National University). Load in the Sentiment Analysis widget as the positive word list. | TXT |
| 9 | KNU Negative Word List | 9,824 negative-polarity Korean words (Park et al. 2018). Load in the Sentiment Analysis widget as the negative word list. | TXT |
| Reference | KNU Full Dictionary | Original SentiWord_Dict with polarity and intensity scores. For use in R with read_tsv() for weighted sentiment analysis. | TXT |
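Weighted sentiment analysis means summing the dictionary's intensity scores for every matched token, rather than just counting positive and negative hits. The course suggests doing this in R with read_tsv(), but the same logic can be sketched with Python's standard library. The two-line sample below assumes the SentiWord_Dict layout of one word and one score per tab-separated line; the entries shown (좋다 "good" = 2, 나쁘다 "bad" = -2) are illustrative, not quoted from the file.

```python
import csv
import io

# Hypothetical two-line excerpt in a word<TAB>score layout; in practice,
# open the downloaded KNU Full Dictionary TXT file here instead.
sample = "좋다\t2\n나쁘다\t-2\n"

def load_dictionary(fileobj):
    """Read word<TAB>score lines into a {word: int score} dict."""
    return {word: int(score) for word, score in csv.reader(fileobj, delimiter="\t")}

def weighted_score(tokens, lexicon):
    """Sum the dictionary scores of every token; unknown tokens count as 0."""
    return sum(lexicon.get(tok, 0) for tok in tokens)

lexicon = load_dictionary(io.StringIO(sample))
print(weighted_score(["좋다", "모르다", "나쁘다", "좋다"], lexicon))  # 2
```

Tokens not in the lexicon contribute zero, so this drops in naturally after the Week 9 sentiment preprocessing step.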

How to Use

Datasets:

  1. Download the CSV file
  2. Save it to a subfolder in your GitHub repository (e.g., /data/moon_twitter/)
  3. Commit and push via GitHub Desktop
  4. Load the file in Orange using the Corpus widget

Scripts:

  1. Download the .py file for your operating system
  2. In Orange, add a Python Script widget and paste the code
  3. Change TEXT_COLUMN to match your corpus’s text column name
  4. Connect it to your data flow and run
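Step 3 refers to a constant near the top of each script. As a sketch, the pattern looks like the snippet below; the helper function and row dicts are stand-ins for the Orange Table the widget actually receives, and "full_text" is the column name used by the NIKH Clustering Demo (your corpus may differ).

```python
# Sketch of the TEXT_COLUMN pattern: set the constant to your corpus's
# text column before running the Python Script widget.
TEXT_COLUMN = "full_text"  # e.g. the NIKH Clustering Demo column

def get_texts(rows, column=TEXT_COLUMN):
    """Pull the configured text column from row dicts
    (a stand-in for the Orange Table the widget receives)."""
    return [row[column] for row in rows]

rows = [{"full_text": "첫 번째 문서"}, {"full_text": "두 번째 문서"}]
print(get_texts(rows))  # ['첫 번째 문서', '두 번째 문서']
```

If the column name does not match your file, the script will fail to find any text, so check the Corpus widget's column list first.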

Dictionaries:

  1. Download the .txt file and save it somewhere you can find it (e.g., /data/sentiment_dic/)
  2. In Orange, open the relevant widget — Preprocess Text for stopwords, Sentiment Analysis for the KNU lists
  3. Point the widget to the file using the file picker in its settings
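The stopword file is plain text with one word per line, which is the format Orange's From File option expects. What the filter then does can be sketched in plain Python; the three particles below are illustrative entries, not necessarily drawn from the course list.

```python
import io

# Stand-in for the downloaded TXT file: one stopword per line
# (은, 는, 이 are common Korean particles, shown for illustration).
stopword_file = io.StringIO("은\n는\n이\n")

stopwords = {line.strip() for line in stopword_file if line.strip()}

def remove_stopwords(tokens, stopwords=stopwords):
    """Drop any token that appears in the stopword set."""
    return [tok for tok in tokens if tok not in stopwords]

print(remove_stopwords(["대통령", "은", "연설", "이"]))  # ['대통령', '연설']
```

Because matching is exact, run the stopword filter after tokenization (as the Preprocess Text widget does), not on raw text.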