Optional Add-On (R Programming Track): Full LDA Assignment

(Top Terms • Topic–Document Distribution • Word Probabilities)

This task mirrors the ODM LDA exercise but uses R to extract full statistical results.
You will generate:

Top terms per topic
Topic–document distribution (θ)
Topic–word probabilities (φ)
Coherence scores for multiple values of k
A short comparison of k options

1. Fit LDA models in R

Use the same nikh__sentences.csv file as in ODM.
After preprocessing and creating a DTM, run LDA for at least three values of k:

library(topicmodels)

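# One model per candidate k. LDA() fits with variational EM by default
# (step 2 uses Gibbs sampling); the fixed seed keeps runs reproducible.
lda_k4 <- LDA(dtm, k = 4, control = list(seed = 123))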
lda_k6 <- LDA(dtm, k = 6, control = list(seed = 123))
lda_k8 <- LDA(dtm, k = 8, control = list(seed = 123))

Pick one model to use for extraction in steps 3–5 (usually the k favored by the scores in step 2).
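
Step 1 assumes that a dtm object already exists. As a minimal sketch of one way to build it, the code below assumes the tm and slam packages and a hypothetical text column named sentence (the real column name in nikh__sentences.csv may differ). A small in-memory data frame stands in for the CSV so the sketch runs on its own:

```r
library(tm)

# Stand-in for read.csv("nikh__sentences.csv"). The column name `sentence`
# is an assumption; adjust it to match the actual file.
sentences <- data.frame(
  sentence = c(
    "The council approved the new budget.",
    "Teachers asked for smaller classes!",
    "And so it was.",  # stop words only, so likely empty after cleaning
    "Budget cuts hit schools and teachers."
  ),
  stringsAsFactors = FALSE
)

corpus <- VCorpus(VectorSource(sentences$sentence))

# Lowercase, strip punctuation and numbers, and drop English stop words
# while building the document-term matrix.
dtm <- DocumentTermMatrix(
  corpus,
  control = list(
    tolower = TRUE,
    removePunctuation = TRUE,
    removeNumbers = TRUE,
    stopwords = TRUE
  )
)

# LDA() rejects documents with no remaining terms, so drop empty rows.
dtm <- dtm[slam::row_sums(dtm) > 0, ]
dim(dtm)
```

Dropping empty rows matters for sentence-level data, where short sentences can lose every token during cleaning. If rows are dropped, θ will have fewer rows than the CSV.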

2. Compute coherence scores for choosing k

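Use the ldatuning package, whose four metrics serve as the coherence scores here (lower is better for CaoJuan2009 and Arun2010; higher is better for Griffiths2004 and Deveaud2014):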

library(ldatuning)

result <- FindTopicsNumber(
  dtm,
  topics = c(4, 6, 8),
  metrics = c("CaoJuan2009", "Arun2010", "Griffiths2004", "Deveaud2014"),
  method = "Gibbs",
  control = list(seed = 123),
  verbose = TRUE
)

FindTopicsNumber_plot(result)

Export this plot as coherence_plot.png.
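
As a sketch of how the result table can be read and the plot exported: FindTopicsNumber() returns a data frame with a topics column and one column per metric. CaoJuan2009 and Arun2010 are meant to be minimized, Griffiths2004 and Deveaud2014 maximized. The numbers below are invented purely to make the example self-contained; with real output, skip the toy data frame and use your own result.

```r
# Invented metric values shaped like the data frame FindTopicsNumber()
# returns (one row per k, one column per metric). These are NOT real results.
result <- data.frame(
  topics        = c(4, 6, 8),
  CaoJuan2009   = c(0.21, 0.15, 0.18),
  Arun2010      = c(35.2, 30.1, 28.7),
  Griffiths2004 = c(-5210, -5105, -5130),
  Deveaud2014   = c(1.90, 2.35, 2.10)
)

minimize <- c("CaoJuan2009", "Arun2010")
maximize <- c("Griffiths2004", "Deveaud2014")

# k favored by each metric: lowest value for the first pair,
# highest value for the second pair.
best_k <- c(
  sapply(result[minimize], function(x) result$topics[which.min(x)]),
  sapply(result[maximize], function(x) result$topics[which.max(x)])
)
print(best_k)

# Export the plot when ldatuning and a PNG device are available.
if (requireNamespace("ldatuning", quietly = TRUE) && capabilities("png")) {
  png("coherence_plot.png", width = 1600, height = 1000, res = 150)
  ldatuning::FindTopicsNumber_plot(result)
  invisible(dev.off())
}
```

The metrics will not always agree on one k, which is worth mentioning in the reflection.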

3. Extract top terms per topic

terms(lda_k6, 15)    # or whichever model you selected

Save the output as lda_r_top_terms.txt.
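
One possible way to save the output, sketched with base R only: terms() returns a character matrix with one column per topic, which write.table() can store as tab-separated text. The matrix below is an invented stand-in for terms(lda_k6, 15), and the file goes to a temporary directory so the toy data cannot overwrite a real deliverable.

```r
# Stand-in for `terms(lda_k6, 15)`: a character matrix with one column per
# topic and one row per rank. The words are invented for illustration.
top_terms <- matrix(
  c("budget", "council", "tax",
    "teacher", "school", "class"),
  nrow = 3,
  dimnames = list(NULL, c("Topic 1", "Topic 2"))
)

# For the real deliverable, use "lda_r_top_terms.txt" in your working directory.
out_file <- file.path(tempdir(), "lda_r_top_terms.txt")

# Tab-separated text: a header row of topic labels, then one rank per line.
write.table(
  top_terms,
  file = out_file,
  sep = "\t",
  quote = FALSE,
  row.names = FALSE
)

readLines(out_file)
```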

4. Extract the topic–document distribution (θ)

theta <- posterior(lda_k6)$topics
head(theta)

write.csv(theta, "lda_r_topic_distribution.csv")
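
Before submitting, it can help to sanity-check θ. The sketch below uses an invented matrix in place of posterior(lda_k6)$topics (rows are documents, columns are topics). It checks that every row sums to 1, finds the most probable topic per document, and writes a version of the table with an explicit document column. The file goes to a temporary directory so the toy data cannot overwrite a real deliverable.

```r
# Stand-in for `posterior(lda_k6)$topics`. The numbers are invented.
theta <- matrix(
  c(0.70, 0.20, 0.10,
    0.05, 0.15, 0.80,
    0.30, 0.40, 0.30),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("1", "2", "3"))
)

# Each row is a probability distribution over topics.
stopifnot(all(abs(rowSums(theta) - 1) < 1e-8))

# Most probable topic for each document.
dominant_topic <- colnames(theta)[max.col(theta, ties.method = "first")]
names(dominant_topic) <- rownames(theta)
dominant_topic

# Same table with the document id as an explicit first column.
theta_df <- data.frame(document = rownames(theta), theta, check.names = FALSE)
out_file <- file.path(tempdir(), "lda_r_topic_distribution.csv")
write.csv(theta_df, out_file, row.names = FALSE)
```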

5. Extract the topic–word probabilities (φ)

phi <- posterior(lda_k6)$terms
head(phi)

write.csv(phi, "lda_r_word_probabilities.csv")
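
The φ matrix has one row per topic and one column per vocabulary term, so the CSV can be very wide. As a sketch with an invented matrix in place of posterior(lda_k6)$terms, the code below checks that each topic's probabilities sum to 1, ranks terms by probability within each topic, and reshapes φ into a long table with one row per topic-term pair:

```r
# Stand-in for `posterior(lda_k6)$terms`. The numbers are invented.
phi <- matrix(
  c(0.40, 0.30, 0.20, 0.05, 0.05,
    0.05, 0.05, 0.10, 0.45, 0.35),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("1", "2"),
    c("budget", "tax", "council", "school", "teacher")
  )
)

# Each row is a probability distribution over the vocabulary.
stopifnot(all(abs(rowSums(phi) - 1) < 1e-8))

# Top n terms per topic, ranked by probability (one column per topic).
top_n <- 3
top_by_phi <- apply(phi, 1, function(p) {
  names(sort(p, decreasing = TRUE))[seq_len(top_n)]
})
top_by_phi

# Long format: one row per (topic, term) pair.
phi_long <- as.data.frame(
  as.table(phi),
  responseName = "probability",
  stringsAsFactors = FALSE
)
names(phi_long)[1:2] <- c("topic", "term")
head(phi_long)
```

On a fitted model, ranking each row of φ this way should give the same ordering that terms() reports in step 3.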

6. Visualize the “Two Pillars of LDA”

Produce:

a plot of θ for one document (theta_example.png)
a bar plot of φ for one topic (phi_example.png)

These correspond to the Two Pillars slide.
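
A minimal base-graphics sketch of the two plots follows. The theta and phi matrices are invented stand-ins for the objects from steps 4 and 5, and doc_id and topic_id are hypothetical choices. The files go to a temporary directory here; for the deliverables, write them to your working directory.

```r
# Invented stand-ins for posterior(lda_k6)$topics and posterior(lda_k6)$terms.
theta <- matrix(
  c(0.70, 0.20, 0.10,
    0.05, 0.15, 0.80),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("doc1", "doc2"), c("1", "2", "3"))
)
phi <- matrix(
  c(0.40, 0.30, 0.20, 0.05, 0.05,
    0.05, 0.05, 0.10, 0.45, 0.35,
    0.20, 0.20, 0.20, 0.20, 0.20),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("1", "2", "3"),
    c("budget", "tax", "council", "school", "teacher")
  )
)

doc_id   <- "doc1"  # hypothetical document to plot
topic_id <- "1"     # hypothetical topic to plot
out_dir  <- tempdir()

# Pillar 1: how one document is spread across topics.
png(file.path(out_dir, "theta_example.png"), width = 800, height = 600)
barplot(
  theta[doc_id, ],
  names.arg = paste("Topic", colnames(theta)),
  ylim = c(0, 1),
  ylab = "Probability",
  main = paste("Topic mixture (theta) for", doc_id)
)
invisible(dev.off())

# Pillar 2: the most probable words of one topic.
top_words <- sort(phi[topic_id, ], decreasing = TRUE)[seq_len(3)]
png(file.path(out_dir, "phi_example.png"), width = 800, height = 600)
barplot(
  rev(top_words),  # reversed so the largest bar is drawn at the top
  horiz = TRUE,
  las = 1,
  xlab = "Probability",
  main = paste("Top words (phi) for topic", topic_id)
)
invisible(dev.off())
```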

7. Reflection (README.md)

Write 4–6 sentences covering:

Which k values you tried
What the coherence plot suggested
Which k you think is most interpretable
Whether your preferred k matches the statistical suggestion
Whether R topics resemble your ODM topics

Deliverables

Place the following files in /week06/r_track/:

coherence_plot.png
lda_r_top_terms.txt
lda_r_topic_distribution.csv
lda_r_word_probabilities.csv
theta_example.png
phi_example.png
README.md

Notes

θ = document–topic mixture
φ = topic–word probabilities (true p(w | k))
ODM shows relative weights; R gives normalized distributions (each row of θ and of φ sums to 1)
Differences between ODM and R are expected
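
To illustrate how the two matrices fit together: under LDA, the probability of word w in document d is the sum over topics of θ[d, k] × φ[k, w], which is the matrix product of θ and φ. A small sketch with invented numbers (not output from any fitted model):

```r
# Invented theta (documents x topics) and phi (topics x terms).
theta <- matrix(
  c(0.9, 0.1,
    0.5, 0.5,
    0.2, 0.8),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("1", "2"))
)
phi <- matrix(
  c(0.5, 0.3, 0.1, 0.1,
    0.1, 0.1, 0.4, 0.4),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("1", "2"), c("budget", "tax", "school", "teacher"))
)

# p(w | d): mix each topic's word distribution by the document's topic weights.
p_w_given_d <- theta %*% phi
round(p_w_given_d, 3)

# Because rows of theta and phi each sum to 1, so do the rows of the product.
rowSums(p_w_given_d)
```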