Assignments

Assignments are posted here as they are given out. Not every week has an assignment; some weeks involve only in-class work. Refer to the Syllabus for the complete schedule.

Standing policy: All assignments are due by the beginning of the next class unless otherwise specified.


Week 1: R Programming with Swirl

Assigned: Feb. 02 | Due: Feb. 09 (before class)

Complete the following Swirl R Programming lessons:

Lesson  Topic
1       Basic Building Blocks
2       Workspace and Files
4       Vectors
6       Subsetting Vectors
7       Matrices and Data Frames
12      Looking at Data

How to complete:

  1. Open RStudio
  2. Type library(swirl) then swirl() in the console
  3. Select “R Programming” and work through each lesson listed above
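
The steps above can be run from the console roughly as follows. This is a minimal sketch that assumes swirl may not be installed yet; `swirl_ready` is a hypothetical helper name used only for illustration and is not part of swirl.

```r
# Hypothetical helper: TRUE if the swirl package is already installed.
swirl_ready <- function() {
  requireNamespace("swirl", quietly = TRUE)
}

if (!swirl_ready()) {
  # One-time setup; run the install line yourself in the console.
  message('swirl is not installed. Run install.packages("swirl") first.')
} else if (interactive()) {
  library(swirl)
  swirl()  # then choose "R Programming" from the course menu
}
```

The `interactive()` check keeps the script from trying to open the swirl menu when it is run non-interactively.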

Submission: You will confirm completion via an in-class poll at the start of Week 2. No screenshots or documentation required.

Week 2: R Programming & Orange Data Mining

Assigned: Feb. 09 | Due: Feb. 16 (before class)

R Programming:

Complete DataCamp: Introduction to Text Analysis in R – Chapter 1: Wrangling Text.

Optional: Replicate the in-class Orange demo

For extra practice, replicate the in-class demo by loading the presidential speeches corpus into Orange Data Mining and exploring it with the Corpus widget.

Download the corpus from the Data & Scripts page.

Steps:

  1. Download the presidential speeches CSV from the Data & Scripts page
  2. Add it to a subfolder in your repo (e.g., /data/president_speeches/)
  3. Commit and push via GitHub Desktop
  4. Open Orange Data Mining
  5. Add a Corpus widget and load the CSV from your data folder
  6. Explore the corpus – browse speeches, filter, search
  7. Save your workflow (.ows file)
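
Before loading the CSV into Orange, you may want to confirm in R that it reads cleanly. The sketch below is one way to do that. It writes a small stand-in file so that it runs anywhere; the column names `president` and `text` are assumptions, not the actual columns of the course corpus, so replace `csv_path` with the path to your own copy.

```r
# Stand-in for the downloaded CSV so the sketch runs anywhere.
# Replace csv_path with the path to your own file in your data folder.
csv_path <- tempfile(fileext = ".csv")
sample_rows <- data.frame(
  president = c("A", "B"),                          # assumed column name
  text = c("first speech text", "second speech text"),  # assumed column name
  stringsAsFactors = FALSE
)
write.csv(sample_rows, csv_path, row.names = FALSE, fileEncoding = "UTF-8")

# Read the file as UTF-8, which matters for Korean text.
speeches <- read.csv(csv_path, fileEncoding = "UTF-8", stringsAsFactors = FALSE)

print(dim(speeches))    # number of rows and columns
print(names(speeches))  # column names you will see in the Corpus widget
# Count missing or empty cells per column.
print(colSums(is.na(speeches) | speeches == ""))
```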

Submission (if you complete the optional exercise): Upload your .ows file and a screenshot of your Orange workflow to your GitHub repository.

Week 3: Text Preprocessing Basics

Assigned: Feb. 16 | Due: Feb. 23 (before class)

Required — R Programming:

Complete DataCamp: Introduction to R — Chapter 1 (Intro to Basics) and Chapter 2 (Vectors).

Optional — Preprocessing Practice (choose one or both):

Practice the full Korean text preprocessing pipeline on the presidential speeches corpus using Orange Data Mining, R, or both. To demonstrate your preprocessing, generate a word cloud from the result — this is a quick way to verify that your pipeline is working and producing meaningful output.

Option A: Orange Data Mining

Download the preprocessing script for your OS from the Data & Scripts page. Refer to the widget pipeline on the Presentations page (Orange column) for the full workflow.

  1. Open Orange and create a new workflow
  2. Load the presidential speeches corpus using the Corpus widget
  3. Add a Preprocess Text widget — connect it to Corpus
  4. Add a Python Script widget and paste the preprocessing script
  5. Change TEXT_COLUMN to match your corpus column name
  6. Add a Word Cloud widget — connect it to the output
  7. Adjust settings until you have a meaningful word cloud
  8. Save your deliverables (see below)

Saving your Orange Data Mining (ODM) work:

  • Screenshot: Right-click the canvas background and select Save As Image, or use your system screenshot tool (Cmd+Shift+4 on macOS, Win+Shift+S on Windows)
  • Workflow file: Go to File → Save As and save with the .ows extension

Option B: RStudio

  1. Download the R script: week03_preprocessing.R
  2. Open it in RStudio
  3. Read the comments — the script walks you through each step
  4. Run the script section by section (select lines and press Ctrl+Enter / Cmd+Enter)
  5. The script will save wordcloud.png to your working directory

Note: The first run installs Python + Kiwi automatically (this takes a few minutes). After that, you can skip the installation steps.
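
To see what the pipeline does at each stage, here is a toy sketch in base R. It is not the course script: it uses English stand-in text, splits on whitespace instead of running Kiwi morphological analysis, and uses a made-up stopword list. The resulting frequency table is the kind of input a word cloud is drawn from.

```r
# Toy stand-in documents; the real corpus is Korean and is tokenized with Kiwi.
docs <- c(
  "The economy and the people.",
  "The people and the future!",
  "Economy growth for the future"
)
# Hypothetical stopword list for this toy example.
stopwords <- c("the", "and", "for")

# 1. Normalize: lower-case and replace punctuation with spaces.
clean <- gsub("[[:punct:]]+", " ", tolower(docs))
# 2. Tokenize on whitespace (a simplification of morphological analysis).
tokens <- strsplit(trimws(clean), "\\s+")
# 3. Remove stopwords from every document.
tokens <- lapply(tokens, function(x) x[!x %in% stopwords])
# 4. Count word frequencies across the whole corpus.
word_freq <- sort(table(unlist(tokens)), decreasing = TRUE)
print(word_freq)
```

If a stopword still shows up in the table, the cleaning or stopword step is not doing what you expect, which is the same check the word cloud gives you visually.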

Submitting your work:

  1. Create a week03/ folder inside assignments/ in your repository
  2. Add your deliverables to that folder:
    • ODM: screenshot (.png) + workflow file (.ows)
    • R: the saved word cloud image (wordcloud.png)
  3. In GitHub Desktop: you will see the new files listed as changes
  4. Write a short commit message (e.g., “Add week 3 word cloud”)
  5. Click Commit to main, then Push origin
  6. Confirm your files appear on github.com in your repository
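
Before committing, you can check from R that the folder contains the expected file types. This is an optional sketch; `missing_deliverables` is a hypothetical helper, and the temporary folder stands in for your real assignments/week03/ folder.

```r
# Hypothetical helper: report which required file extensions are absent.
missing_deliverables <- function(dir, extensions) {
  files <- list.files(dir)
  found <- vapply(
    extensions,
    function(ext) any(grepl(paste0("\\.", ext, "$"), files, ignore.case = TRUE)),
    logical(1)
  )
  extensions[!found]
}

# Demo in a temporary folder standing in for assignments/week03/.
demo_dir <- file.path(tempdir(), "week03_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, "wordcloud.png")))

# Lists the extensions that still have no matching file in the folder.
print(missing_deliverables(demo_dir, c("png", "ows")))
```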

The instructor has access to your repository and will review your submission there.

Week 4: From Words to Numbers — BoW & TF-IDF

Assigned: Feb. 23 | Due: Mar. 02 (before class) | R Programming extended deadline: Mar. 09, 15:15

Required — R Programming:

Complete DataCamp: Introduction to R — Factors Chapter and Matrices Chapter.

Orange Data Mining Workflow:

Reproduce the in-class Orange workflow to practice building a complete text analysis pipeline from corpus to visualization.

  1. Corpus widget: load the presidential speeches CSV
  2. Python Script widget: paste the preprocessing script for your OS
  3. Corpus widget (second): reload to pick up processed text
  4. Preprocess Text: tokenize by whitespace, load stopword list
  5. Bag of Words: select TF-IDF weighting
  6. Connect at least two visualization widgets (Word Cloud, Bar Plot, Distributions, or Statistics)
  7. Take a screenshot of your workflow
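
The difference between raw counts and TF-IDF can be worked through by hand on a tiny example. The sketch below uses toy tokens and assumes the plain variant IDF = log(N / df); Orange's Bag of Words widget offers its own weighting options, so its numbers may differ.

```r
# Toy tokenized documents (stand-ins for preprocessed speeches).
docs <- list(
  d1 = c("economy", "people", "economy"),
  d2 = c("people", "future"),
  d3 = c("economy", "growth", "future")
)
vocab <- sort(unique(unlist(docs)))

# Bag of Words: rows = documents, columns = terms, cells = raw counts.
bow <- t(sapply(docs, function(tokens) table(factor(tokens, levels = vocab))))

# Document frequency: in how many documents each term appears.
df <- colSums(bow > 0)
# Inverse document frequency (assumed variant: log(N / df)).
idf <- log(nrow(bow) / df)

# TF-IDF: scale each term's column of counts by that term's IDF.
tfidf <- sweep(bow, 2, idf, "*")

print(bow)
print(round(tfidf, 3))
```

In this example "growth" appears in only one document, so it receives the highest IDF, while terms shared by two of the three documents are down-weighted.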

Submitting your work:

  1. Create a week04/ folder inside assignments/ in your repository
  2. Add your deliverables:
    • Workflow file (.ows)
    • Screenshot of your Orange workflow (.png)
    • Two visualization figures (.png): one from raw counts (BoW Count) and one from TF-IDF (e.g., Word Cloud, Bar Plot, or another widget for each)
  3. In GitHub Desktop: write a short commit message (e.g., “Add week 4 workflow”)
  4. Click Commit to main, then Push origin
  5. Confirm your files appear on github.com in your repository
  6. Mark your completion on the shared Google Sheet

Week 5: Practice & Deepen — Midterm Prep

Assigned: Mar. 02 | Due: Mar. 09 (before class)

Required — R Programming:

Complete DataCamp: Introduction to R — Factors Chapter and Matrices Chapter (extended deadline: Mar. 09, 15:15).

Midterm Preparation:

The Week 6 midterm assessment has two parts:

  1. Online quiz (~20 min) — multiple-choice questions covering concepts from Weeks 1–5.
  2. Hands-on task (~20 min) — download a small new corpus, preprocess it in Orange, and produce a clean Word Cloud. Upload your .ows file, a saved Word Cloud image, and a short .md file (research question + expected findings) to your GitHub repo.

Use the study guide to review all key concepts, the preprocessing pipeline, BoW/TF-IDF, and the Orange workflow.

Study Guide: Week 6 Assessment Study Guide (PDF)

How to prepare:

  • Review the study guide — work through the self-check questions
  • Practice building Orange workflows end-to-end (Corpus → Python Script → Corpus → Preprocess Text → Bag of Words → Visualization)
  • Make sure you understand why each preprocessing step exists, not just how to do it

Week 6: Midterm Review Week

Assigned: Mar. 09 | Due: Mar. 16 (before class)

R Programming:

Complete DataCamp: Introduction to R — Lists Chapter and Data Frames Chapter.

Orange Data Mining Tutorials:

Watch the following four tutorials before next class. These cover the clustering methods we will use in Week 7:

Week 7: Clustering

Assigned: Mar. 16 | Due: Mar. 30 (before class)

Orange Data Mining — Hierarchical Clustering:

Replicate the in-class hierarchical clustering demo using the NIKH clustering demo corpus (11 textbooks). Download it from the Data & Scripts page.

  1. Load the corpus in Orange using the Corpus widget
  2. Preprocess the text (Python Script → Preprocess Text → Bag of Words with TF-IDF)
  3. Compute Distances (Cosine)
  4. Run Hierarchical Clustering (Ward linkage)
  5. Select two clusters from the dendrogram
  6. Explore each cluster using descriptive tools of your choice (e.g., Word Cloud, Bar Plot, Corpus Viewer) — try to understand what makes these clusters different
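
The same sequence (distances, Ward linkage, cutting into two clusters) can be sketched in base R. This is only an illustration on a made-up matrix: the book names, terms, and weights are hypothetical, and Orange's implementation may differ in detail from `hclust` with `"ward.D2"`.

```r
# Toy document-term matrix (rows = documents); values stand in for TF-IDF weights.
m <- rbind(
  bookA = c(3, 2, 0, 0),
  bookB = c(2, 3, 0, 0),
  bookC = c(0, 0, 3, 1),
  bookD = c(0, 0, 2, 3)
)
colnames(m) <- c("king", "dynasty", "trade", "industry")

# Cosine distance = 1 - cosine similarity between row vectors.
norms <- sqrt(rowSums(m^2))
cos_sim <- (m %*% t(m)) / (norms %o% norms)
cos_dist <- as.dist(1 - cos_sim)

# Hierarchical clustering with Ward linkage.
hc <- hclust(cos_dist, method = "ward.D2")

# Cut the dendrogram into two clusters.
clusters <- cutree(hc, k = 2)
print(clusters)

# Summed term weights per cluster: a rough view of what distinguishes them.
print(aggregate(m, by = list(cluster = clusters), FUN = sum))
```

Calling `plot(hc)` in RStudio draws the dendrogram, which you can compare with the one Orange shows.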

Write-up: Create a short Markdown file (analysis.md) describing:

  • What you did — which settings you chose and how you set up your workflow
  • Why you did it — your reasoning for the choices you made
  • What you found — which books ended up in each cluster, and what vocabulary or themes distinguish them

This does not need to be long: a few clear paragraphs are enough. The goal is to reflect on the process and interpret what the clustering reveals.

Submitting your work:

  1. Create a week07/ folder inside assignments/ in your repository
  2. Add the following to that folder:
    • Your Orange workflow file (.ows)
    • Visualization screenshots (.png) showing your two-cluster comparison
    • Your analysis write-up (analysis.md)
  3. In GitHub Desktop: write a short commit message (e.g., “Add week 7 clustering”)
  4. Click Commit to main, then Push origin
  5. Confirm your files appear on github.com in your repository
  6. Mark your completion on the shared Google Sheet

R Programming — Swirl Tutorials:

Complete two Swirl lessons from the Exploratory Data Analysis course. You will need to install this course first — it is separate from the R Programming course you have been using.

Installing the course:

  1. Open RStudio
  2. Run the following in the console:
       library(swirl)
       install_course("Exploratory_Data_Analysis")
       swirl()
  3. Select Exploratory Data Analysis from the course list

For more details on installing Swirl courses, see the Swirl student page.

Complete these two lessons:

Lesson  Topic
11      Hierarchical Clustering
12      K-Means Clustering
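
As a warm-up for the K-Means lesson, here is a minimal sketch using base R's `kmeans` on made-up 2-D points. The data and the choice of two clusters are assumptions for illustration and are not taken from the Swirl lesson.

```r
set.seed(42)  # k-means starts from random centers, so fix the seed

# Two well-separated toy groups of 10 points each.
pts <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2)
)

# Ask for two clusters; nstart = 10 tries several random starts and keeps the best.
km <- kmeans(pts, centers = 2, nstart = 10)

print(km$centers)         # estimated cluster centers
print(table(km$cluster))  # number of points assigned to each cluster
```

Unlike hierarchical clustering, k-means needs the number of clusters up front, which is why `centers = 2` is given explicitly.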

Documenting completion: Take a screenshot of the completion message for each lesson and save them as .png files in your week07/ folder alongside your Orange deliverables.

Optional — R Programming:

Complete DataCamp: Intermediate R — The apply family Chapter and Utilities Chapter.

Orange Data Mining Tutorials (preparation for Week 8):

Watch the following tutorials before the next class. These cover the word embedding methods we will use in Week 8:


Uploading Your Work

Unless otherwise noted, coursework must be documented with screenshots and relevant files, then uploaded to your individual GitHub repository. See the Getting Started guide for repository setup and structure.


Optional: Supplementary R Programming

For students interested in developing deeper R programming skills:

Swirl R Programming:

Swirl Exploratory Data Analysis:

DataCamp: