# Lecture 5: Reduced-dimensionality representations for documents: Gibbs sampling and topic models

• Watch the new talk and write a summary
• Noah Smith: squash network
• Main points:
• difference between LSA & SVD
• Bayesian graphical models
• informative priors are useful in the model
• Bayesian network
• DAG
• Nodes X1, X2, …, Xn
• The joint distribution factorizes over the DAG: P(X1, X2, …, Xn) = ∏_i P(Xi | parents(Xi))
• Generative story: HMM (dependencies)
• A and B are conditionally independent given C iff P(A,B|C) = P(A|C) * P(B|C)
• Diagram: C → A and C → B (A and B share the parent C)
• Example (sprinkler network): Cloudy → Sprinkler, Cloudy → Rain, Sprinkler → Wet Grass, Rain → Wet Grass
• Wet Grass is the observed variable (see the sketch below)
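As a concrete check of the factorization and the conditional independence statement above, here is a minimal sketch of the sprinkler network in Python; all CPT numbers are invented for illustration, not taken from the lecture.

```python
# Minimal sketch of the sprinkler Bayesian network; CPT numbers are made up.
import itertools

P_C = {True: 0.5, False: 0.5}                       # P(Cloudy)
P_S = {True: 0.1, False: 0.5}                       # P(Sprinkler=T | Cloudy)
P_R = {True: 0.8, False: 0.2}                       # P(Rain=T | Cloudy)
P_W = {(True, True): 0.99, (True, False): 0.9,      # P(WetGrass=T | Sprinkler, Rain)
       (False, True): 0.9, (False, False): 0.0}

def joint(c, s, r, w):
    """P(C,S,R,W) = P(C) P(S|C) P(R|C) P(W|S,R)."""
    p = P_C[c]
    p *= P_S[c] if s else 1 - P_S[c]
    p *= P_R[c] if r else 1 - P_R[c]
    p *= P_W[(s, r)] if w else 1 - P_W[(s, r)]
    return p

# Sanity check: the joint sums to 1.
total = sum(joint(c, s, r, w)
            for c, s, r, w in itertools.product([True, False], repeat=4))
print(total)  # 1.0

# Check S ⟂ R | C: P(S,R|C) equals P(S|C) * P(R|C).
c = True
p_sr = sum(joint(c, True, True, w) for w in [True, False]) / P_C[c]
print(abs(p_sr - P_S[c] * P_R[c]) < 1e-12)  # True
```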
• State transition chain: π → S1 → S2 → S3 → … → St
• The ω's (emissions) are observable; the states are hidden
• HMM parameters: μ = (A, B, π), i.e. transition matrix, emission matrix, and initial state distribution
• Put a prior on the parameter π for C: π ~ Beta(γ1, γ2)
• C ~ Bernoulli(π)
• S ~ Bernoulli(θ(S | C)), with a separate parameter for each value of C
• R ~ Bernoulli(θ(R | C))
• W ~ Bernoulli(τ(W | S, R))
• π~Dirichlet(1)
• S1 ~ Cat(π);  S_{t+1} ~ Cat(a_{s_t,1}, a_{s_t,2}, …, a_{s_t,N}), i.e. each next state is drawn from the row of the transition matrix A indexed by the current state
• ω_t ~ Cat(b_{s_t,1}, b_{s_t,2}, …, b_{s_t,M}), i.e. each emission is drawn from the row of the emission matrix B indexed by the current state (see the sketch below)
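A minimal sketch of this generative story, assuming a 2-state, 3-symbol HMM; only π is actually drawn from the Dirichlet(1) prior, and the matrices A and B are invented for illustration.

```python
# HMM generative story: pi ~ Dirichlet(1), states from rows of the transition
# matrix A, emissions from rows of the emission matrix B (A, B invented here).
import numpy as np

rng = np.random.default_rng(0)

n_states, n_symbols, T = 2, 3, 10

pi = rng.dirichlet(np.ones(n_states))        # pi ~ Dirichlet(1)
A = np.array([[0.7, 0.3],                    # A[i, j] = P(s_{t+1}=j | s_t=i)
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],               # B[i, k] = P(w_t=k | s_t=i)
              [0.1, 0.3, 0.6]])

states, emissions = [], []
s = rng.choice(n_states, p=pi)               # s_1 ~ Cat(pi)
for t in range(T):
    states.append(s)
    emissions.append(rng.choice(n_symbols, p=B[s]))  # w_t ~ Cat(B[s_t, :])
    s = rng.choice(n_states, p=A[s])         # s_{t+1} ~ Cat(A[s_t, :])

print(states)
print(emissions)
```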
• The model just introduced is the unigram model; the bigram model conditions each state on the previous one: P(s1, s2, …, sn) = P(s1) P(s2 | s1) P(s3 | s2) ⋯ P(sn | s_{n−1}) (see the sketch below)
• For EM we need to re-derive the updates, because the conditional distributions are different
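A tiny sketch of the bigram factorization; the initial distribution and transition matrix here are invented for illustration.

```python
# Bigram (first-order) factorization: P(s_1..s_n) = P(s_1) * prod_t P(s_t | s_{t-1}).
import numpy as np

pi = np.array([0.6, 0.4])            # P(s_1)
A = np.array([[0.7, 0.3],            # A[a, b] = P(s_t = b | s_{t-1} = a)
              [0.4, 0.6]])

seq = [0, 0, 1, 1, 0]
log_p = np.log(pi[seq[0]]) + sum(np.log(A[a, b]) for a, b in zip(seq, seq[1:]))
print(np.exp(log_p))                 # probability of the whole sequence
```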
• Some distributions (see the sketch below):
• Binomial vs. Bernoulli: count of successes over n trials vs. a single trial
• Multinomial vs. discrete/categorical: counts over n trials vs. a single draw
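A quick numpy illustration of these pairs: a Bernoulli draw is a Binomial with n = 1, and a categorical draw is a Multinomial with n = 1.

```python
import numpy as np

rng = np.random.default_rng(0)

print(rng.binomial(n=1, p=0.3, size=5))   # Bernoulli(0.3) draws (0/1 each)
print(rng.binomial(n=10, p=0.3))          # Binomial(10, 0.3): number of successes

p = [0.2, 0.5, 0.3]
print(rng.multinomial(n=1, pvals=p))      # one-hot categorical draw
print(rng.multinomial(n=10, pvals=p))     # Multinomial(10, p): counts per category
```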
• Document squashing

## MCMC

• X = HHHH TTTTTT
• π_MLE = argmax_π P(X | π); prediction: P(y | X) ≈ P(y | π_MLE)
• π_MAP = argmax_π P(π | X); prediction: P(y | X) ≈ P(y | π_MAP)
• Fully Bayesian: P(y | X) = ∫ P(y | π) P(π | X) dπ
• To avoid the integration, use Monte Carlo (random samples); see the sketch below
• E_{p(Z)}[f(Z)] = ∫ f(Z) p(Z) dZ = lim_{T→∞} (1/T) ∑_{t=1..T} f(z^(t)) ≈ (1/T) ∑_{t=1..T} f(z^(t)) for finite T
• where each sample z^(t) ~ p(Z)
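A sketch of all three estimators for the coin data above; the Beta(2, 2) prior is an assumption, and the Monte Carlo average over posterior samples approximates the predictive integral.

```python
# Coin data X = HHHH TTTTTT: MLE, MAP, and a Monte Carlo estimate of P(y=H | X).
import numpy as np

heads, tails = 4, 6
a, b = 2.0, 2.0                     # Beta prior hyperparameters (assumed)

pi_mle = heads / (heads + tails)                         # argmax P(X | pi)
pi_map = (heads + a - 1) / (heads + tails + a + b - 2)   # argmax P(pi | X)

# Fully Bayesian: P(y=H | X) = integral of P(y=H | pi) P(pi | X) dpi,
# approximated by averaging over posterior samples pi^(t) ~ Beta(a+heads, b+tails).
rng = np.random.default_rng(0)
samples = rng.beta(a + heads, b + tails, size=100_000)
p_heads_mc = samples.mean()

print(pi_mle, pi_map, p_heads_mc)
# Exact posterior predictive for comparison: (a + heads) / (a + b + heads + tails).
```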
• MCMC theory:
• z^(0): random start (“state”)
• for t = 0, 1, 2, …: draw z^(t+1) from a transition rule g(· | z^(t)); after a burn-in of τ steps, the proportion of visits to each state is proportional to p(z) (see the sketch below)
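The notes do not specify the transition rule g; as one illustration, here is a random-walk Metropolis chain over four states. The target p(z) and the proposal are chosen just for this sketch; the point is that visit frequencies after burn-in approach p(z).

```python
# MCMC idea: run a chain z(0), z(1), ... whose long-run visit frequencies
# are proportional to p(z). Target and proposal are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.1, 0.2, 0.4, 0.3])   # target distribution over states {0,1,2,3}

z = 0                                 # z(0): arbitrary start
burn_in, n_steps = 1_000, 100_000
counts = np.zeros(len(p))

for t in range(n_steps):
    z_prop = (z + rng.choice([-1, 1])) % len(p)    # propose a neighbouring state
    if rng.random() < min(1.0, p[z_prop] / p[z]):  # Metropolis accept/reject
        z = z_prop
    if t >= burn_in:                               # discard burn-in visits
        counts[z] += 1

print(counts / counts.sum())   # approximately [0.1, 0.2, 0.4, 0.3]
```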

## Gibbs Sampling

• Assume Z = <z1, z2, z3>
• Define Z’ = <z1′, z2′, z3′>
• new value: z1′ ~ P(z1 | z2, z3)
• new value: z2′ ~ P(z2 | z1′, z3)
• new value: z3′ ~ P(z3 | z1′, z2′) (see the sketch below)
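A self-contained Gibbs-sampling sketch for three binary variables. The joint table is invented for illustration; each full conditional P(zi | z_{−i}) is computed from that model, which is the "use the model" point below.

```python
# Gibbs sampling for Z = <z1, z2, z3> with binary components.
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized joint over (z1, z2, z3); the weights are made up.
joint = {zs: w for zs, w in zip(itertools.product([0, 1], repeat=3),
                                [1, 2, 3, 4, 4, 3, 2, 1])}
Ztot = sum(joint.values())

def conditional(i, z):
    """P(z_i = 1 | z_{-i}), read off the joint (the model)."""
    z0, z1 = list(z), list(z)
    z0[i], z1[i] = 0, 1
    w0, w1 = joint[tuple(z0)], joint[tuple(z1)]
    return w1 / (w0 + w1)

z = [0, 0, 0]                         # arbitrary starting state
counts = {zs: 0 for zs in joint}
for t in range(200_000):
    for i in range(3):                # resample each coordinate in turn
        z[i] = int(rng.random() < conditional(i, z))
    if t >= 1_000:                    # burn-in
        counts[tuple(z)] += 1

# Compare true probabilities with the Gibbs visit frequencies.
total = sum(counts.values())
for zs in joint:
    print(zs, joint[zs] / Ztot, round(counts[zs] / total, 3))
```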
• Good reference:
• How do we get the conditional distributions we sample from?
• From the model: the joint distribution defines every P(zi | z_{−i})