# Lecture 3: Information Theory

• Hypothesis testing
• Collocations
• Info theory
• Hypothesis Testing
• Last lecture covered the methodology.
• Collocation
• “religion war”
• PMI, PPMI
• PMI = pointwise mutual information
• PMI = log2(P(x,y)/(P(x)P(y))) = I(x,y)
• PPMI = positive PMI = max(0, PMI)
• Example: “Hong Kong”. The words “Hong” and “Kong” are each infrequent, but the bigram “Hong Kong” is frequent, so its PMI is high.
• A non-statistical measure, but very useful; not a classic hypothesis test.
• Note: if a word occurs fewer than 5 times, the PMI of the bigram is unreliable even when it is high, because it rests on too little evidence.
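The PMI and PPMI definitions above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `pmi_scores`, the toy corpus, and the choice to apply the "fewer than 5" cutoff to the bigram count as well as to both word counts are all assumptions.

```python
import math
from collections import Counter

def pmi_scores(tokens, min_count=5):
    """Return {(x, y): (pmi, ppmi)} for adjacent word pairs in `tokens`."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    scores = {}
    for (x, y), c_xy in bigrams.items():
        # Assumption: skip the pair if any of the three counts is below min_count.
        if min(c_xy, unigrams[x], unigrams[y]) < min_count:
            continue
        p_xy = c_xy / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        pmi = math.log2(p_xy / (p_x * p_y))
        scores[(x, y)] = (pmi, max(0.0, pmi))
    return scores

# Made-up toy corpus, only for illustration.
toy_tokens = ("hong kong is busy . " * 5 + "the city is big . " * 5).split()
toy_scores = pmi_scores(toy_tokens)
```

On this toy corpus, “hong kong” scores higher than “is busy” because “is” also appears in other contexts, and the pair (".", "hong") is dropped because it occurs only 4 times.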
• RFR (relative frequency ratio)
• rfr(w) = (freq(w in DP) / #DP) / (freq(w in background corpus) / #background corpus)
• DP is a subset of the background corpus.
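A small sketch of the ratio, assuming both corpora are available as token lists. The name `rfr`, its arguments, and the toy data are hypothetical; the background list is built to contain DP, as the note requires.

```python
def rfr(word, dp_tokens, background_tokens):
    """Relative frequency of `word` in DP divided by that in the background corpus."""
    dp_rate = dp_tokens.count(word) / len(dp_tokens)
    bg_rate = background_tokens.count(word) / len(background_tokens)
    # A word absent from the background has no finite ratio.
    return dp_rate / bg_rate if bg_rate > 0 else float("inf")

# Made-up toy data: DP is a subset of the background corpus.
dp = "the entropy of the code".split()
background = dp + "the cat sat on the mat with the dog .".split()
```

A ratio well above 1 (here for "entropy") marks a word as characteristic of DP, while a common word such as "the" stays close to 1.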
• Chi-square test
• Drawing the contingency table is important.
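For a bigram, the table is 2×2: rows are "first word is w1 / is not w1", columns are "second word is w2 / is not w2". The sketch below uses the standard closed form of the chi-square statistic for a 2×2 table; the function name and the counts are hypothetical.

```python
def chi_square_2x2(o11, o12, o21, o22):
    """Pearson chi-square statistic for a 2x2 contingency table.

    o11: w1 followed by w2        o12: w1 followed by another word
    o21: another word then w2     o22: neither
    """
    n = o11 + o12 + o21 + o22
    numerator = n * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return numerator / denominator

# Hypothetical counts for a candidate bigram.
chi2 = chi_square_2x2(8, 4667, 15820, 14287173)
# With 1 degree of freedom, the 5% critical value is 3.841.
is_collocation = chi2 > 3.841
```

With these counts the statistic stays below the critical value, so independence of the two words would not be rejected.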
• T-test
• Non-statistical measure but very useful.
• Not a classic hypothesis test.
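• A sliding window scans the sentence for the target bigram; each window position is one trial.
• found -> 1, not found -> 0
• Sentence: “A fox saw a new company’s emerge…..” (target bigram: “new company”)
• A fox -> 0
• fox saw -> 0
• saw a -> 0
• a new -> 0
• new company -> 1
• company emerge -> 0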
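The sliding-window scan can be sketched as below. The 0/1 outcomes are treated as Bernoulli trials, and the t score compares their mean with the rate expected if the two words were independent. The function names are hypothetical, and the possessive is dropped from the sentence, as in the list above.

```python
import math

def window_indicators(tokens, w1, w2):
    """1 for each adjacent pair equal to (w1, w2), else 0."""
    return [1 if pair == (w1, w2) else 0 for pair in zip(tokens, tokens[1:])]

def bigram_t_score(tokens, w1, w2):
    """t = (sample mean - expected mean under independence) / standard error."""
    indicators = window_indicators(tokens, w1, w2)
    n = len(indicators)
    x_bar = sum(indicators) / n
    # Null hypothesis: P(w1 w2) = P(w1) * P(w2).
    mu = (tokens.count(w1) / len(tokens)) * (tokens.count(w2) / len(tokens))
    variance = x_bar * (1 - x_bar)  # variance of a Bernoulli trial
    if variance == 0:
        return float("nan")  # bigram never (or always) found
    return (x_bar - mu) / math.sqrt(variance / n)

sentence = "a fox saw a new company emerge".split()
indicators = window_indicators(sentence, "new", "company")
t_score = bigram_t_score(sentence, "new", "company")
```

A single sentence is far too little data for a meaningful score; the example only shows the mechanics.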
• Info theory
• Entropy: degree of uncertainty of X; average information content of X; expected pointwise information (surprisal, in linguistics) of X; closely related to perplexity
• X~P, H(X) = E[-log2(p(x))] = -sum(p(x)log2(p(x)))
• H(X,Y) = H(X) + H(Y) – I(X, Y)
• H(X, Y) is the cost of communicating X and Y together;
• H(X) is the cost of communicating X alone;
• H(Y) is the cost of communicating Y alone;
• I(X, Y) is the saving from the information that X and Y share.
• Example: 8 horses race; the probability of each horse winning is:
• Horse: 1, 2, 3, 4, 5, 6, 7, 8
• Prob: 1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64
• Code1: 000, 001, 010, 011, 100, 101, 110, 111
• Fixed-length code: 3 bits per horse, which would equal H(X) only if all horses were equally likely.
• Code2: 0, 10, 110, 1110, 111100, 111101, 111110, 111111
• Huffman code.
• Expected length = H(X) = 2 bits.
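The numbers in the horse example can be checked directly. This sketch assumes that the four least likely horses each win with probability 1/64, the distribution for which the variable-length code averages exactly 2 bits, and that the sixth codeword is 111101; the function names are hypothetical.

```python
import math
from fractions import Fraction

probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 16)] + [Fraction(1, 64)] * 4
code1 = ["000", "001", "010", "011", "100", "101", "110", "111"]
code2 = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]

def entropy_bits(ps):
    """H(X) = -sum p(x) log2 p(x), in bits."""
    return -sum(float(p) * math.log2(float(p)) for p in ps if p > 0)

def expected_length(ps, code):
    """Average number of bits sent per outcome under `code`."""
    return float(sum(p * len(word) for p, word in zip(ps, code)))

h = entropy_bits(probs)
len1 = expected_length(probs, code1)  # fixed-length code
len2 = expected_length(probs, code2)  # variable-length code
```

The variable-length code reaches the entropy because every probability is a power of 1/2, so each codeword length equals -log2 p(x) exactly.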
• Structure
• [W -> encoder] -> X -> [channel: P(y|x)] -> Y -> [decoder -> W^]
• Source: [W -> encoder]; channel: P(y|x)
• Anything before Y is the noisy channel model.
• P(X,Y) is the joint model; P(X|Y) is the discriminative model.
• T^ (estimated T)
• =argmax_T (P(T|W))
• = argmax_T(P(W|T)P(T)/P(W))
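• = argmax_T(P(W|T)P(T)) (P(W) can be dropped because it is just a scalar that does not depend on T)
• = argmax_T(P(w1|t1)P(w2|t2)…P(w.n|t.n) P(t1)P(t2|t1)P(t3|t1,t2)…P(t.n|t1,t2,…,t.n-1)) (each word depends only on its own t)
• = argmax_T(P(w1|t1)P(w2|t2)…P(w.n|t.n) P(t1)P(t2|t1)P(t3|t2)…P(t.n|t.n-1)) (Markov assumption: each t depends only on the previous one)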
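The final line of the derivation can be evaluated by brute force on a toy example. This sketch reads T as a tag sequence for the words W, which the factorization suggests but the notes do not state, and all probabilities are invented. A real tagger would use dynamic programming (Viterbi) instead of enumerating every sequence.

```python
import itertools

def best_sequence(words, tags, emit, trans, start):
    """argmax over T of prod P(w.i|t.i) * P(t1) * prod P(t.i|t.i-1)."""
    def score(seq):
        p = start.get(seq[0], 0.0) * emit.get((words[0], seq[0]), 0.0)
        for i in range(1, len(words)):
            p *= trans.get((seq[i - 1], seq[i]), 0.0)
            p *= emit.get((words[i], seq[i]), 0.0)
        return p
    # Enumerate every possible tag sequence: exponential, toy sizes only.
    return max(itertools.product(tags, repeat=len(words)), key=score)

# Invented toy parameters (missing entries count as probability 0).
tags = ["DET", "NOUN", "VERB"]
start = {"DET": 0.8, "NOUN": 0.1, "VERB": 0.1}
trans = {("DET", "NOUN"): 0.9, ("DET", "VERB"): 0.1,
         ("NOUN", "VERB"): 0.6, ("NOUN", "NOUN"): 0.3}
emit = {("the", "DET"): 0.9, ("fox", "NOUN"): 0.5,
        ("saw", "VERB"): 0.4, ("saw", "NOUN"): 0.1}

best = best_sequence(["the", "fox", "saw"], tags, emit, trans, start)
```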
• E^ = argmax_E P(F|E)P(E)
• F = French, E = English: decoding translates French into English; the channel model P(F|E) goes from English to French.
• Relative entropy: KL divergence
• We have an estimated model q and a true distribution p
• D(p||q) = sum_x(p(x) log(p(x)/q(x)))
• Properties:
• D(p||q) >= 0, with equality if and only if p = q
• D(p||q) != D(q||p) in general (not symmetric)
• D(p||q) = E_p[log p(x)] – E_p[log q(x)]; –E_p[log q(x)] is the cross entropy H(p,q)
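The properties can be checked numerically. A minimal sketch, assuming distributions are given as dictionaries from outcomes to probabilities and logs are taken base 2; the function names and the two toy distributions are made up.

```python
import math

def kl_divergence(p, q):
    """D(p||q) = sum_x p(x) log2(p(x)/q(x)); infinite if q rules out an event p allows."""
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0:
            return math.inf
        total += px * math.log2(px / qx)
    return total

def entropy(p):
    """H(p) = -E_p[log2 p(x)]."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

def cross_entropy(p, q):
    """H(p, q) = -E_p[log2 q(x)]; assumes q(x) > 0 wherever p(x) > 0."""
    return -sum(px * math.log2(q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.25, "c": 0.25}
q = {"a": 0.8, "b": 0.1, "c": 0.1}
```

Here `kl_divergence(p, q)` and `kl_divergence(q, p)` differ, and `cross_entropy(p, q)` equals `entropy(p) + kl_divergence(p, q)` up to rounding.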
• Example:
• Model q: p(x)p(y) (X and Y independent)
• Truth p: p(x,y) (joint)
• D(p||q)
• = sum_x,y [p(x,y) log2(p(x,y)/(p(x)p(y)))]
• = E_x,y [log2(p(x,y)/(p(x)p(y)))] = I(X, Y): average mutual information is the expectation of pointwise mutual information
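A short sketch of this example: mutual information computed as the divergence between a joint table and the product of its marginals. The function name and the two toy joint distributions are hypothetical.

```python
import math
from collections import defaultdict

def mutual_information(joint):
    """I(X, Y) = sum_{x,y} p(x,y) log2(p(x,y) / (p(x) p(y))), in bits."""
    px = defaultdict(float)
    py = defaultdict(float)
    for (x, y), pxy in joint.items():
        px[x] += pxy  # marginal of X
        py[y] += pxy  # marginal of Y
    return sum(
        pxy * math.log2(pxy / (px[x] * py[y]))
        for (x, y), pxy in joint.items()
        if pxy > 0
    )

# Y is a copy of X: knowing one fully determines the other.
copied = {(0, 0): 0.5, (1, 1): 0.5}
# X and Y are independent fair coins.
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
```

The copied pair shares one full bit, while the independent pair shares none, so the independence model q is exact in the second case.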
• Cross entropy: (perplexity)
• Question: how good is my language model? How do I measure?
• p(w1, w2, w3, w4) = p(w1) p(w2|w1) … p(w4|w1 w2 w3)
• p(w4|w1 w2 w3) ≈ p(w4|w*, w2 w3), where w* stands for any earlier history
• Make it Markov: collapse everything except the previous word and the one before it.
• D(p||m) = sum[p(x.1 … x.n)log(p(x.1 … x.n)/m(x.1 … x.n))]
• but how do you know the truth?
• what if n is very large?
• H(p,m)=H(p) + D(p||m)  = -sum[p(x)log(m(x))]
• H(p) is fixed, so minimizing H(p,m) also minimizes D(p||m); H(p,m) can stand in for D(p||m), then:
• We can sample from p and estimate H(p,m) from the samples, then:
• H(L,m) = lim_(n->infinity) [-1/n sum(p(x.1 … x.n)log(m(x.1 … x.n)))], then:
• we need x.1 … x.n to be a good sample of L, so that
• H(L,m) = lim_(n->infinity) [-1/n log m(x.1 … x.n)]
• m(x.1 … x.n) = product(m(x.i| x.i-1))
• If m(*) = 0 for an event that actually occurs -> H(p,m) = infinity
• Should associate the cost with something…
• smoothing (lots of ways for smoothing)
• Perplexity:
• 2^(H(x.1 … x.n, m)), i.e. 2^(cross entropy)
• Interpretable as the effective vocabulary size (the number of equally likely choices per word).
• Converts the unit from “bits” back to the original one (a count of choices).
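The sample-based estimate and the perplexity can be sketched as follows. The model here is a bigram model with add-k smoothing, chosen only as one of the many possible smoothing schemes; the function names, the `vocab` argument, and the default `k` are assumptions. The first token is used only as context.

```python
import math
from collections import Counter

def make_bigram_model(train_tokens, vocab, k=1.0):
    """Return m(word | prev) with add-k smoothing over `vocab` (use k > 0 for unseen contexts)."""
    context = Counter(train_tokens[:-1])
    bigrams = Counter(zip(train_tokens, train_tokens[1:]))
    def m(word, prev):
        return (bigrams[(prev, word)] + k) / (context[prev] + k * len(vocab))
    return m

def cross_entropy(m, tokens):
    """-1/n * log2 m(x.1 … x.n), with m factored into bigram terms."""
    pairs = list(zip(tokens, tokens[1:]))
    total = 0.0
    for prev, word in pairs:
        prob = m(word, prev)
        if prob == 0:
            return math.inf  # an observed event the model ruled out
        total += math.log2(prob)
    return -total / len(pairs)

def perplexity(m, tokens):
    """2 ** cross entropy: the effective number of equally likely choices."""
    return 2 ** cross_entropy(m, tokens)

# A model that spreads probability evenly over 8 words has perplexity 8.
uniform = lambda word, prev: 1 / 8
uniform_ppl = perplexity(uniform, list("abcdefgh"))
```

Setting `k=0` gives the unsmoothed model, for which a single unseen bigram in the sample makes the cross entropy infinite.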

After class (M&S book)

• 6.2.1 MLE: gives the highest probability to the training corpus.
• C(w1, … ,wn) is the frequency of n-gram w1…wn in the training set.
• N is the number of training instances.
• P_MLE(w1, …, wn) = C(w1, … , wn) / N
• P_MLE(wn | w.1, …, w.n-1) = C(w1, … , wn)/C(w1, … , w.n-1)
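Both MLE formulas can be sketched for the bigram case (n = 2). The function name and the toy sequence are hypothetical, and N is taken to be the number of bigram instances in the training data.

```python
from collections import Counter

def mle_bigram(tokens):
    """Return (joint, conditional) MLE estimates for bigrams in `tokens`."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    context = Counter(tokens[:-1])  # C(w1): times w1 starts a bigram
    n = len(tokens) - 1             # N: number of bigram instances
    joint = {gram: count / n for gram, count in bigrams.items()}
    conditional = {gram: count / context[gram[0]] for gram, count in bigrams.items()}
    return joint, conditional

joint, conditional = mle_bigram("a b a b a c".split())
```

Any bigram that never occurs in the training data gets probability 0 under both estimates, which is what the smoothing methods in the following sections address.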
• 6.2.2 Laplace’s law, Lidstone’s law and the Jeffreys-Perks law
• Laplace’s law
• the original smoothing method (add one to every count)
• Lidstone’s law
• Jeffreys-Perks law
• also called expected likelihood estimation (ELE)
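The three laws share one formula, P = (C + λ) / (N + Bλ), where B is the number of possible n-grams (bins): λ = 1 gives Laplace’s law, λ = 1/2 gives the Jeffreys-Perks law (ELE), and other positive values give Lidstone’s law. A minimal sketch with hypothetical names and numbers:

```python
def lidstone(count, n, bins, lam):
    """Smoothed probability (C + lam) / (N + B * lam)."""
    return (count + lam) / (n + bins * lam)

def laplace(count, n, bins):
    return lidstone(count, n, bins, 1.0)

def jeffreys_perks(count, n, bins):
    return lidstone(count, n, bins, 0.5)

# Toy setting: N = 10 training instances spread over B = 5 bins.
counts = [6, 3, 1, 0, 0]
smoothed = [laplace(c, 10, 5) for c in counts]
```

Unseen events (count 0) now get a small positive probability, and the estimates still sum to 1 over the bins.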
• 6.2.3 Held Out Estimation
• Take further text (held-out data) and see how often bigrams that appeared r times in the training text tend to turn up in that further text.
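The idea can be sketched by grouping training bigrams by their training count r and averaging how often each group shows up in held-out text. The function name and toy data are hypothetical, and only the average held-out count per r is computed; turning it into a probability needs a further normalization that is left out here.

```python
from collections import Counter, defaultdict

def held_out_average_counts(train_tokens, held_out_tokens):
    """Map r -> average held-out count of bigrams seen r times in training."""
    train = Counter(zip(train_tokens, train_tokens[1:]))
    held = Counter(zip(held_out_tokens, held_out_tokens[1:]))
    n_r = Counter(train.values())  # number of bigram types with training count r
    t_r = defaultdict(int)         # total held-out count of those types
    for gram, r in train.items():
        t_r[r] += held[gram]
    return {r: t_r[r] / n_r[r] for r in n_r}

averages = held_out_average_counts(
    "a b a b c d".split(),    # training text (toy)
    "a b c d a b a".split(),  # held-out text (toy)
)
```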
• 6.2.4 Cross Validation
• Cross validation:
• Rather than using some of the training data only for frequency counts and some only for smoothing probability estimates, more efficient schemes are possible in which each part of the training data is used both as initial training data and as held-out data.
• Deleted Estimation: two-way cross validation: