Week 3 Deliverable: Clustering Documents and Understanding Text Weights
Course: Topical Reading: Digital Humanities (BA3 Korean Studies)
Date: October 23
Instructors: Steven Denney & Aron van der Pol
Due: Thursday, October 30, 2025 by 17:00 (longer than usual!)
Objective
Learn how text preprocessing and weighting schemes (raw counts vs. TF-IDF) affect document clustering and keyword extraction in Orange Data Mining (ODM). Also, explore the value of document clustering.
Tasks
1. Read and Reflect on Text Preprocessing (Python file) - originally Mac users only (see the update below for Windows users; a workable file now exists)
- Open and read through the annotated .py file on text preprocessing. It’s in the /data folder. Note: there are files with and without annotations. You can use either, but it’s cleaner to use the one without in ODM.
- Try running the custom Python script; FYI, it keeps only nouns (common and proper nouns).
- If it works, continue using it. If not, that’s fine; revert to your previous preprocessing pipeline.
- Let us know in your README.md for this week if it worked, any questions you have, etc.
- UPDATE: A Windows fix was added (still a little buggy); Windows users can, in fact, try this step and report back to me.
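To get a feel for what "keeps only nouns" means, here is a minimal sketch of a noun filter. It is not the course script: the tag names NNG (common noun) and NNP (proper noun) are an assumption based on tag sets used by several Korean part-of-speech taggers, and the hand-tagged sentence is a made-up stand-in for real tagger output.

```python
# Assumption: NNG = common noun, NNP = proper noun. The course script may
# rely on a different tagger or different tag names.
NOUN_TAGS = {"NNG", "NNP"}


def keep_nouns(tagged_tokens):
    """Return only the words whose part-of-speech tag is a noun tag."""
    return [word for word, tag in tagged_tokens if tag in NOUN_TAGS]


# Hypothetical, hand-tagged sentence ("Sejong read a book"); a real pipeline
# would get these (word, tag) pairs from a part-of-speech tagger.
tagged = [
    ("세종", "NNP"),  # proper noun
    ("이", "JKS"),    # subject particle
    ("책", "NNG"),    # common noun
    ("을", "JKO"),    # object particle
    ("읽", "VV"),     # verb stem
    ("었", "EP"),     # past-tense ending
    ("다", "EF"),     # sentence-final ending
]

nouns = keep_nouns(tagged)
print(nouns)
```

Particles, verbs, and endings are dropped, so only content-bearing nouns reach the later counting and clustering steps.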
2. Compare Word Counts vs. TF-IDF
a) Export two visualizations:
- Create a Word Cloud or Bar Chart using raw word counts (unweighted)
- Create a Word Cloud or Bar Chart using TF-IDF weights
- For both: limit the number of words displayed to a reasonable amount (e.g., 50-100 words)
- Save both visualizations and upload them.
b) Analysis:
- What changes between raw counts and TF-IDF?
- What stays the same?
- What does this tell you about how TF-IDF works?
- Write your observations in your README.md (3-5 sentences is fine)
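If it helps to see the two weighting schemes side by side outside ODM, the following is a minimal sketch in plain Python. The toy corpus is hypothetical, and the formula used here, tf(t, d) * log(N / df(t)), is one common textbook variant; ODM may apply a different variant (for example with smoothing or normalization), so the numbers will not match your widget output.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized toy corpus; not the course data.
docs = [
    ["king", "palace", "king", "decree"],
    ["king", "rice", "tax", "rice"],
    ["king", "envoy", "tribute", "envoy"],
]


def raw_counts(docs):
    """Total occurrences of each word across the whole corpus."""
    total = Counter()
    for doc in docs:
        total.update(doc)
    return total


def tfidf(docs):
    """Per-document weights using tf(t, d) * log(N / df(t))."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each word once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


counts = raw_counts(docs)
weights = tfidf(docs)

print("Most frequent word overall:", counts.most_common(1))
for i, w in enumerate(weights):
    top = max(w, key=w.get)
    print(f"doc {i}: top TF-IDF word = {top} ({w[top]:.3f})")
```

In this toy corpus "king" is the most frequent word by raw count, but because it appears in every document its TF-IDF weight is zero, while words concentrated in a single document rise to the top.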
3. Hierarchical Clustering and Cluster Analysis
- Follow the instructor’s pipeline to generate hierarchical clusters from the nikh corpus
- (Hint: The .ows file is available in the /presentations folder if you need help)
- Select at least 2 different clusters from the dendrogram
- For each cluster, generate either a Word Cloud or Bar Plot showing the top words
- Remember to limit the number of words displayed
- Label each export clearly (e.g., “cluster_1_wordcloud.png”, “cluster_2_wordcloud.png”)
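For intuition about what the dendrogram represents, here is a minimal sketch of bottom-up (agglomerative) clustering in plain Python. The document vectors are hypothetical, and cosine distance with average linkage is an assumption made for illustration; the instructor's pipeline in ODM may use different distance and linkage settings.

```python
import math
from itertools import combinations

# Hypothetical document vectors (e.g. word weights); not the course data.
# Vectors are assumed to be non-zero so cosine distance is defined.
vectors = {
    "doc_a": [2.0, 1.0, 0.0, 0.0],
    "doc_b": [1.0, 2.0, 0.0, 0.0],
    "doc_c": [0.0, 0.0, 3.0, 1.0],
    "doc_d": [0.0, 0.0, 1.0, 3.0],
}


def cosine_distance(u, v):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def average_linkage(c1, c2):
    """Mean pairwise distance between the members of two clusters."""
    pairs = [(a, b) for a in c1 for b in c2]
    return sum(cosine_distance(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)


def agglomerate(names, n_clusters):
    """Start with one cluster per document; merge the closest pair until n_clusters remain."""
    clusters = [(name,) for name in names]
    merges = []  # merge history, which is what a dendrogram draws
    while len(clusters) > n_clusters:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: average_linkage(*p))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return clusters, merges


clusters, merges = agglomerate(list(vectors), n_clusters=2)
print(clusters)
```

Selecting a cluster in the dendrogram corresponds to picking one of these merged groups; the word cloud for that cluster is then built from only the documents inside it.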
4. Write a short report:
- Use any word processor you prefer
- Explain:
- What you did (your clustering approach)
- Why you did it (what you were trying to discover)
- What you found (what do the clusters represent? What themes emerged?)
- Don’t worry about length; just write enough to clearly explain your process and findings
- Export as PDF and upload to your assignments folder
5. Provide your .ows file
- Show your flow! Upload it, please.
Deliverables
Place the following files in your GitHub repo folder for week03 assignment:
Folder structure:
/week03/
│
├── raw_counts_visualization.png
├── tfidf_visualization.png
├── clustering_report.pdf
├── week03.ows
└── README.md
README.md should include:
- Your thoughts about running the custom Python script
- Your comparison of raw counts vs. TF-IDF
Optional (R Track)
For students in the R Programming extension, complete:
DataCamp – Introduction to Text Analysis in R
- Module 2: Visualizing Text
- Module 3: Sentiment Analysis
R track assignments are due before the next class.