Week 3 Deliverable: Clustering Documents and Understanding Text Weights

Course: Topical Reading: Digital Humanities (BA3 Korean Studies)
Date: October 23
Instructors: Steven Denney & Aron van der Pol

Due: Thursday, October 30, 2025 by 17:00 (longer than usual!)


Objective

Learn how text preprocessing and weighting schemes (raw counts vs. TF-IDF) affect document clustering and keyword extraction in Orange Data Mining (ODM). Also, explore the value of document clustering.
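
To make the contrast between the two weighting schemes concrete, here is a minimal sketch in plain Python. It assumes a hypothetical three-sentence toy corpus (not the course data), a whitespace tokenizer, and one common textbook TF-IDF variant (raw count times log(N / document frequency)). ODM's Bag of Words widget may use different settings, so treat the numbers as illustrative only.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not course data): three tiny "documents".
docs = {
    "doc_a": "the king visited the palace",
    "doc_b": "the farmer visited the market",
    "doc_c": "the king spoke to the farmer",
}

def tokenize(text):
    """Minimal preprocessing: lowercase and split on whitespace."""
    return text.lower().split()

# Raw counts: how often each token occurs in each document.
counts = {name: Counter(tokenize(text)) for name, text in docs.items()}

# Document frequency: in how many documents each token appears.
df = Counter(token for c in counts.values() for token in c)
n_docs = len(docs)

def tfidf(count_vector):
    """Weight each raw count by log(N / df); an assumed textbook variant."""
    return {t: tf * math.log(n_docs / df[t]) for t, tf in count_vector.items()}

weights = {name: tfidf(c) for name, c in counts.items()}

def top_term(vector):
    """Return the highest-weighted term; ties are broken alphabetically."""
    return min(vector, key=lambda t: (-vector[t], t))

for name in docs:
    print(name, "| raw:", top_term(counts[name]), "| tf-idf:", top_term(weights[name]))
```

With raw counts, the most frequent token in every toy document is "the". Under this TF-IDF variant, a token that appears in every document gets weight zero, so terms specific to one document (such as "palace" in doc_a) rise to the top.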


Tasks

1. Read and Reflect on Text Preprocessing (Python file) - originally for Mac users only; update: a working file for Windows users is now available

2. Compare Word Counts vs. TF-IDF

a) Export two visualizations:

b) Analysis:

3. Hierarchical Clustering and Cluster Analysis

4. Write a short report:

5. Provide your .ows file
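
As background for the hierarchical clustering task, the following is a minimal sketch of bottom-up (agglomerative) clustering in plain Python. The document vectors, the choice of cosine distance, and average linkage are all assumptions made for illustration. ODM's Distances and Hierarchical Clustering widgets offer several options, and your workflow may use different ones.

```python
import math
from itertools import combinations

# Hypothetical document vectors (term -> weight), not course data.
vectors = {
    "doc_a": {"king": 1.0, "palace": 2.0},
    "doc_b": {"king": 1.0, "palace": 1.5},
    "doc_c": {"farmer": 2.0, "market": 1.0},
    "doc_d": {"farmer": 1.0, "market": 2.0},
}

def cosine_distance(u, v):
    """1 - cosine similarity; 0 means same direction, 1 means no shared terms."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # treat an empty document as maximally distant
    return 1.0 - dot / (norm_u * norm_v)

def average_linkage(c1, c2):
    """Mean pairwise distance between the members of two clusters."""
    pairs = [(a, b) for a in c1 for b in c2]
    return sum(cosine_distance(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)

def agglomerate(names, n_clusters):
    """Start with one cluster per document; repeatedly merge the closest pair."""
    clusters = [(name,) for name in sorted(names)]
    while len(clusters) > n_clusters:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: average_linkage(*p))
        merged = tuple(sorted(c1 + c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [merged]
    return sorted(clusters)

print(agglomerate(vectors, 2))
```

In this toy example the two "king/palace" documents merge first because they share vocabulary in similar proportions, and the two "farmer/market" documents form the second cluster. A dendrogram displays this same sequence of merges, with the merge distance on one axis.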

Deliverables

Place the following files in the week03 assignment folder of your GitHub repo:

Folder structure:

/week03/
│
├── raw_counts_visualization.png
├── tfidf_visualization.png
├── clustering_report.pdf
├── week03.ows
└── README.md

README.md should include:


Optional (R Track)

For students in the R Programming extension, complete:

DataCamp – Introduction to Text Analysis in R

R track assignments are due before the next class.