Assignments
Assignments are posted here as they are announced. Not every week has an assignment; some weeks involve only in-class work. Refer to the Syllabus for the complete schedule.
Standing policy: All assignments are due by the beginning of the next class unless otherwise specified.
Week 1: R Programming with Swirl
Assigned: Feb. 02 Due: Feb. 09 (before class)
Complete the following Swirl R Programming lessons:
| Lesson | Topic |
|---|---|
| 1 | Basic Building Blocks |
| 2 | Workspace and Files |
| 4 | Vectors |
| 6 | Subsetting Vectors |
| 7 | Matrices and Data Frames |
| 12 | Looking at Data |
How to complete:
- Open RStudio
- Type `library(swirl)` then `swirl()` in the console
- Select “R Programming” and work through each lesson listed above
Submission: You will confirm completion via an in-class poll at the start of Week 2. No screenshots or documentation required.
Week 2: R Programming & Orange Data Mining
Assigned: Feb. 09 Due: Feb. 16 (before class)
R Programming:
Complete DataCamp: Introduction to Text Analysis in R – Chapter 1: Wrangling Text.
Optional: Replicate the in-class Orange demo
For extra practice, replicate the in-class demo by loading the presidential speeches corpus into Orange Data Mining and exploring it with the Corpus widget.
Download the corpus from the Data & Scripts page.
Steps:
- Download the presidential speeches CSV from the Data & Scripts page
- Add it to a subfolder in your repo (e.g., `/data/president_speeches/`)
- Commit and push via GitHub Desktop
- Open Orange Data Mining
- Add a Corpus widget and load the CSV from your data folder
- Explore the corpus – browse speeches, filter, search
- Save your workflow (`.ows` file)
Submission: Upload your .ows file and a screenshot of your Orange workflow to your GitHub repository.
Week 3: Text Preprocessing Basics
Assigned: Feb. 16 Due: Feb. 23 (before class)
Required — R Programming:
Complete DataCamp: Introduction to R — Chapter 1 (Intro to Basics) and Chapter 2 (Vectors).
Optional — Preprocessing Practice (choose one or both):
Practice the full Korean text preprocessing pipeline on the presidential speeches corpus using Orange Data Mining, R, or both. To demonstrate your preprocessing, generate a word cloud from the result — this is a quick way to verify that your pipeline is working and producing meaningful output.
Option A: Orange Data Mining
Download the preprocessing script for your OS from the Data & Scripts page. Refer to the widget pipeline on the Presentations page (Orange column) for the full workflow.
- Open Orange and create a new workflow
- Load the presidential speeches corpus using the Corpus widget
- Add a Preprocess Text widget — connect it to Corpus
- Add a Python Script widget and paste the preprocessing script
- Change `TEXT_COLUMN` to match your corpus column name
- Add a Word Cloud widget and connect it to the output
- Adjust settings until you have a meaningful word cloud
- Save your deliverables (see below)
Saving your ODM work:
- Screenshot: Right-click the canvas background and select Save As Image, or use your system screenshot tool (`Cmd+Shift+4` on macOS, `Win+Shift+S` on Windows)
- Workflow file: Go to File → Save As and save with the `.ows` extension
Option B: RStudio
- Download the R script: week03_preprocessing.R
- Open it in RStudio
- Read the comments — the script walks you through each step
- Run the script section by section (select lines and press `Ctrl+Enter` / `Cmd+Enter`)
- The script will save `wordcloud.png` to your working directory
Note: The first run installs Python + Kiwi automatically (this takes a few minutes). After that, you can skip the installation steps.
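A word cloud is a picture of token frequencies, so a frequency table gives the same sanity check in text form. The sketch below is a minimal base-R illustration on made-up tokens; the `tokens` vector is a hypothetical stand-in for your pipeline's output and is not taken from week03_preprocessing.R.

```r
# Toy tokens standing in for the output of a preprocessing pipeline
# (hypothetical data, not the presidential speeches corpus).
tokens <- c("economy", "peace", "economy", "people", "peace", "economy")

# Count each token and sort from most to least frequent.
freq <- sort(table(tokens), decreasing = TRUE)

# The head of this table holds the terms a word cloud would draw largest.
top_terms <- head(freq, 3)
print(top_terms)
```

If the top of this table is dominated by particles, punctuation, or stopwords, the preprocessing settings probably need another pass.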
Submitting your work:
- Create a `week03/` folder inside `assignments/` in your repository
- Add your deliverables to that folder:
  - ODM: screenshot (`.png`) + workflow file (`.ows`)
  - R: the saved word cloud image (`wordcloud.png`)
- In GitHub Desktop: you will see the new files listed as changes
- Write a short commit message (e.g., “Add week 3 word cloud”)
- Click Commit to main, then Push origin
- Confirm your files appear on github.com in your repository
The instructor has access to your repository and will review your submission there.
Week 4: From Words to Numbers — BoW & TF-IDF
Assigned: Feb. 23 Due: Mar. 02 (before class) | R Programming extended deadline: Mar. 09, 15:15
Required — R Programming:
Complete DataCamp: Introduction to R — Factors Chapter and Matrices Chapter.
Orange Data Mining Workflow:
Reproduce the in-class Orange workflow to practice building a complete text analysis pipeline from corpus to visualization.
- Corpus widget: load the presidential speeches CSV
- Python Script widget: paste the preprocessing script for your OS
- Corpus widget (second): reload to pick up processed text
- Preprocess Text: tokenize by whitespace, load stopword list
- Bag of Words: select TF-IDF weighting
- Connect at least two visualization widgets (Word Cloud, Bar Plot, Distributions, or Statistics)
- Take a screenshot of your workflow
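The Preprocess Text and Bag of Words steps above can be sketched in base R to show roughly what the widgets compute. This is a minimal illustration on made-up documents, assuming the textbook weighting of term count times log(N / document frequency); the `docs` and `stopwords` values are hypothetical, and Orange's own weighting options may produce different numbers.

```r
# Toy documents standing in for preprocessed speeches (hypothetical data).
docs <- c(
  d1 = "the economy and the people",
  d2 = "peace and the economy",
  d3 = "economy for the people"
)
stopwords <- c("the", "and", "for")  # illustrative stopword list

# Tokenize by whitespace, then drop stopwords.
tokens <- lapply(strsplit(docs, "\\s+"), function(x) x[!x %in% stopwords])

# Bag of words: one row per document, one column per term, raw counts.
vocab <- sort(unique(unlist(tokens)))
count_terms <- function(x) as.integer(table(factor(x, levels = vocab)))
bow <- t(vapply(tokens, count_terms, integer(length(vocab))))
colnames(bow) <- vocab

# TF-IDF (assumed textbook form): count * log(N / document frequency).
idf <- log(nrow(bow) / colSums(bow > 0))
tfidf <- sweep(bow, 2, idf, `*`)

print(bow)
print(round(tfidf, 3))
```

In this toy example "economy" appears in every document, so its TF-IDF weight drops to zero while its raw count stays high. That contrast is what the two required figures (BoW Count versus TF-IDF) are meant to show.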
Submitting your work:
- Create a `week04/` folder inside `assignments/` in your repository
- Add your deliverables:
  - Workflow file (`.ows`)
  - Screenshot of your Orange workflow (`.png`)
  - Two visualization figures (`.png`): one from raw counts (BoW Count) and one from TF-IDF (e.g., Word Cloud, Bar Plot, or another widget for each)
- In GitHub Desktop: write a short commit message (e.g., “Add week 4 workflow”)
- Click Commit to main, then Push origin
- Confirm your files appear on github.com in your repository
- Mark your completion on the shared Google Sheet
Week 5: Practice & Deepen — Midterm Prep
Assigned: Mar. 02 Due: Mar. 09 (before class)
Required — R Programming:
Complete DataCamp: Introduction to R — Factors Chapter and Matrices Chapter (extended deadline: Mar. 09, 15:15).
Midterm Preparation:
The Week 6 midterm assessment has two parts:
- Online quiz (~20 min): multiple-choice questions covering concepts from Weeks 1–5.
- Hands-on task (~20 min): download a small new corpus, preprocess it in Orange, and produce a clean Word Cloud. Upload your `.ows` file, a saved Word Cloud image, and a short `.md` file (research question + expected findings) to your GitHub repo.
Use the study guide to review all key concepts, the preprocessing pipeline, BoW/TF-IDF, and the Orange workflow.
Study Guide: Week 6 Assessment Study Guide (PDF)
How to prepare:
- Review the study guide — work through the self-check questions
- Practice building Orange workflows end-to-end (File → Corpus → Python Script → Corpus → BoW → Visualization)
- Make sure you understand why each preprocessing step exists, not just how to do it
Week 6: Midterm Review Week
Assigned: Mar. 09 Due: Mar. 16 (before class)
R Programming:
Complete DataCamp: Introduction to R — Lists Chapter and Data Frames Chapter.
Orange Data Mining Tutorials:
Watch the following four tutorials before next class. These cover the clustering methods we will use in Week 7:
- Hierarchical Clustering (Getting Started #05)
- k-Means (Getting Started #11)
- k-Means Explained (Getting Started #12)
- Document Clustering and Cluster Exploration (Text Mining #09)
Week 7: Clustering
Assigned: Mar. 16 Due: Mar. 30 (before class)
Orange Data Mining — Hierarchical Clustering:
Replicate the in-class hierarchical clustering demo using the NIKH clustering demo corpus (11 textbooks). Download it from the Data & Scripts page.
- Load the corpus in Orange using the Corpus widget
- Preprocess the text (Python Script → Preprocess Text → Bag of Words with TF-IDF)
- Compute Distances (Cosine)
- Run Hierarchical Clustering (Ward linkage)
- Select two clusters from the dendrogram
- Explore each cluster using descriptive tools of your choice (e.g., Word Cloud, Bar Plot, Corpus Viewer) — try to understand what makes these clusters different
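The distance, clustering, and cluster-selection steps above can be sketched in base R. This is a minimal illustration on a made-up TF-IDF matrix, not the NIKH corpus; the book names and values are hypothetical, and the result need not match Orange exactly because the two tools may implement Ward linkage differently.

```r
# Toy TF-IDF matrix: rows are documents, columns are terms
# (hypothetical values, not the NIKH textbook corpus).
tfidf <- rbind(
  bookA = c(2.0, 1.5, 0.0, 0.0),
  bookB = c(1.8, 1.2, 0.1, 0.0),
  bookC = c(0.0, 0.2, 2.1, 1.7),
  bookD = c(0.1, 0.0, 1.9, 2.2)
)

# Cosine distance = 1 - cosine similarity between row vectors.
unit <- tfidf / sqrt(rowSums(tfidf^2))   # scale each row to length 1
cos_dist <- as.dist(1 - unit %*% t(unit))

# Ward linkage; "ward.D2" is the stats::hclust option that implements
# Ward's clustering criterion.
tree <- hclust(cos_dist, method = "ward.D2")

# Cut the dendrogram into two clusters, as in the assignment.
clusters <- cutree(tree, k = 2)
print(clusters)
```

Calling `plot(tree)` draws the dendrogram, which can be compared with the one shown in Orange's Hierarchical Clustering widget.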
Write-up: Create a short Markdown file (analysis.md) describing:
- What you did — which settings you chose and how you set up your workflow
- Why you did it — your reasoning for the choices you made
- What you found — which books ended up in each cluster, and what vocabulary or themes distinguish them
This does not need to be long; a few clear paragraphs are enough. The goal is to reflect on the process and interpret what the clustering reveals.
Submitting your work:
- Create a `week07/` folder inside `assignments/` in your repository
- Add the following to that folder:
  - Your Orange workflow file (`.ows`)
  - Visualization screenshots (`.png`) showing your two-cluster comparison
  - Your analysis write-up (`analysis.md`)
- In GitHub Desktop: write a short commit message (e.g., “Add week 7 clustering”)
- Click Commit to main, then Push origin
- Confirm your files appear on github.com in your repository
- Mark your completion on the shared Google Sheet
R Programming — Swirl Tutorials:
Complete two Swirl lessons from the Exploratory Data Analysis course. You will need to install this course first — it is separate from the R Programming course you have been using.
Installing the course:
- Open RStudio
- Run the following in the console:
library(swirl)
install_course("Exploratory_Data_Analysis")
swirl()
- Select Exploratory Data Analysis from the course list
For more details on installing Swirl courses, see the Swirl student page.
Complete these two lessons:
| Lesson | Topic |
|---|---|
| 11 | Hierarchical Clustering |
| 12 | K-Means Clustering |
Documenting completion: Take a screenshot of the completion message for each lesson and save them as .png files in your week07/ folder alongside your Orange deliverables.
Optional — R Programming:
Complete DataCamp: Intermediate R — The apply family Chapter and Utilities Chapter.
Orange Data Mining Tutorials (preparation for Week 8):
Watch the following tutorials before the next class. These cover the word embedding methods we will use in Week 8:
- Word Embedding and Nearest Neighbors (Text Mining #01)
- Semantic Word Search (Text Mining #02)
- Document Embedding (Text Mining #05)
Uploading Your Work
Unless otherwise noted, coursework must be documented with screenshots and relevant files, then uploaded to your individual GitHub repository. See the Getting Started guide for repository setup and structure.
Optional: Supplementary R Programming
For students interested in developing deeper R programming skills:
Swirl R Programming:
- Lesson 5: Missing Values
- Lesson 8: Logic
- Lesson 9: Functions
Swirl Exploratory Data Analysis:
- Lesson 13: Dimension Reduction
DataCamp:
- Intermediate R — Conditionals and Control Flow
- Intermediate R — Loops
- Data Manipulation & Visualization
- Text Mining with Bag-of-Words in R