 Information about midterm
 PCFG
 Start with S
 ∑_γ Pr(A → γ | A) = 1
 the (conditional) probabilities of all expansions of a given nonterminal have to sum to one
 Pr(O = o1, o2, …, on | µ)
 HMM: Forward
 PCFG: Inside-Outside
 Guess Z: argmax_Z [ Pr(Z | O, µ) ]
 HMM: use Viterbi to get Ẑ
 PCFG: use Viterbi CKY to get Ẑ
 *Z is the best sequence of states
 Guess µ: argmax_µ [ Pr(O | µ) ]
 HMM: use Forward-Backward to get µ̂
 PCFG: use Inside-Outside to get µ̂
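As a concrete illustration of the HMM half of this table, here is a minimal Viterbi sketch; the two-state transition/emission matrices are invented for the example:

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Return argmax_Z Pr(Z | O, mu) for an HMM mu = (pi, A, B),
    where A[i][j] = Pr(state j | state i) and B[i][o] = Pr(obs o | state i)."""
    pi, A, B = map(np.asarray, (pi, A, B))
    T, n = len(obs), len(pi)
    delta = np.zeros((T, n))             # best log-prob of a path ending in each state
    back = np.zeros((T, n), dtype=int)   # back-pointers to recover the best path
    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A)   # scores[i, j]: come from i into j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    z = [int(delta[-1].argmax())]        # best final state, then follow back-pointers
    for t in range(T - 1, 0, -1):
        z.append(int(back[t][z[-1]]))
    return z[::-1]

# Two hidden states, two observation symbols (all numbers invented):
best = viterbi([0, 0, 1],
               pi=[0.6, 0.4],
               A=[[0.7, 0.3], [0.4, 0.6]],
               B=[[0.9, 0.1], [0.2, 0.8]])
print(best)   # [0, 0, 1]
```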
 Example:
 Sentence: people eats roasted peanuts
                 S
          NP            VP
        people      V         NP
                  eats    Adj       N
                        roasted  peanuts
 Problem: both orders use exactly the same rules, so a plain PCFG assigns them equal probability:
 Pr_µ(peanuts eat roasted people) = Pr_µ(people eat roasted peanuts)
 We can try to generate the head of each phrase:
                         S (head: eat)
        NP (head: people)              VP (head: eat)
        people               V (head: eat)         NP (head: peanut)
                             eats           Adj (head: N/A)    N (head: peanut)
                                              roasted            peanuts
 Should have: Pr[S (head: eat) → NP(head: people) VP(head: eat)] > Pr[S (head: eat) → NP(head: peanut) VP(head: eat)]
 Dependency representation:
 Sentence: the people eat roasted peanuts
                 eat
         people       peanuts
          the         roasted
 Lexical (bottom-up):
 NP → Det N
 Evaluation
 Reference Reading: How Evaluation Guides AI Research
 Intrinsic evaluation
 Extrinsic evaluation
 Kappa statistic (inter-annotator agreement)
 Metrics: precision, recall
 How do we evaluate two structures which could generate the same sentence?
 Answer: generate more than one output for each input, convert the outputs into sets, and use precision and recall to measure the overlap.
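A minimal sketch of that set-based comparison; the labeled-span tuples are invented for illustration:

```python
def precision_recall(predicted, reference):
    """Compare two outputs as sets: precision = |P ∩ R| / |P|, recall = |P ∩ R| / |R|."""
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)            # items produced and also in the reference
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    return precision, recall

# e.g. labeled spans produced by two parsers for the same sentence:
p, r = precision_recall({('NP', 0, 2), ('VP', 2, 5), ('PP', 3, 5)},
                        {('NP', 0, 2), ('VP', 2, 5)})
print(p, r)   # 2/3 precision, 1.0 recall
```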
 Reader evaluation:
 if the reader’s score agrees with the machine’s, stop;
 else, let another reader read the essay.
Author: xiaoxumeng
03-10-2018
 Gensim Tutorial:
 https://radimrehurek.com/gensim/models/word2vec.html
 Use the GoogleNews model as the pretrained model
 clustering based on a distance matrix
 Question: how do we do the clustering?
 should we cluster on the keywords?
 should we cluster on the keyword-related words?
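One possible answer, sketched with toy 2-d vectors so it runs on its own; in practice the vectors would come from the pretrained GoogleNews model via gensim, as in the commented lines (the keyword list is invented):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# With the pretrained model the vectors would come from gensim, e.g.:
#   from gensim.models import KeyedVectors
#   kv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
#   vectors = np.stack([kv[w] for w in keywords])
keywords = ['cat', 'dog', 'car', 'truck']
vectors = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])

dist = pdist(vectors, metric='cosine')     # condensed pairwise distance matrix
tree = linkage(dist, method='average')     # hierarchical clustering on the distances
labels = fcluster(tree, t=2, criterion='maxclust')

clusters = {}
for word, lab in zip(keywords, labels):
    clusters.setdefault(lab, []).append(word)
print(sorted(clusters.values()))           # animals vs. vehicles
```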
 Leg dissection demo:
 18 cameras, 30 frames: 10 GB
 5 cameras, 100 frames: 6 GB
 Question:
 what is our task?
 we cannot change the focal length now; we can only change the viewpoint
 if we want dynamic scenes, we should have a dynamic mesh?
 Foveated raytracing:
 input: eye rays + 1 spp
 output: foveated image
 question: If we use foveated image as ground truth, what should be the denoising algorithm for the ground truth?
 TODO:
 read G3D code and change sample number
 read papers (NVIDIA, Disney)
 Homework
Lecture 6: Context-free parsing
Questions:
 Generative model: P(X, Y)
 Discriminative model: P(Y | X)
Main points
 Block sampler: instead of sampling one element at a time, we can sample a batch of elements in Gibbs sampling.
 Lag and burn-in: can be viewed as parameters (we can control the number of iterations)
 lag: mark some iterations in the loop as lag and throw them away, so that the remaining samples become (closer to) independent.
 Example: run 1000 iters → run 100 lags → run 1000 iters → 100 lags …
 burn-in: throw away the initial iterations (e.g. the first 10 or 20), where the model has not converged.
 The right way is to test whether the model has converged.
 Topic model:
 The parameters for the words in a topic don’t need to sum to one
 The derivatives (branches) of LDA (nonparametric models):
 Supervised LDA
 Chinese restaurant process (CRP)
 Hierarchy models
 example: SHLDA
 gun safety (Clinton) & gun control (Trump)
Parsing:
Any language which can be processed with an FSA can also be processed with a CFG, but not vice versa.
?        | Turing machine           | (not covered in this lecture)
CSL      | Tree-adjoining grammar   | PTAGs (not covered in this lecture)
CF       | PDA/CFGs                 | PCFGs
Regular  | FSA/regular expressions  | HMM
PDA/CFGs can handle some languages that cannot be processed by an FSA.
For example: S → aSb generates {a^n b^n}, which cannot be processed by an FSA because we need to count n. An FSM only remembers a finite set of states; it cannot count.
Example1:
The rat that the cat that the dog bit chased died.
Example2:
Sentence: The tall man loves Mary.
            loves
       man        Mary
   The    tall
Structure:
               S
        NP            VP
   Det  Adj   N      V      NP
   The  tall  man   loves   Mary
Example3: CKY Algorithm
0 The 1 tall 2 man 3 loves 4 Mary 5
[w, i, j] and A → w ∈ G  ⇒  [A, i, j]
Chart (bottom up parsing algorithm):
0 The 1 tall 2 man 3 loves 4 Mary 5
 Det(0,1)  Adj(1,2)  N(2,3)  V(3,4)  NP(4,5)
 NP(0,3)                     VP(3,5)
 S(0,5)
Then if I have:
[B, i, j]
[C, j, k]
and a rule A → B C, I can derive: [A, i, k]   # B, C are nonterminal phrases
 NP → Det N
 VP → V NP
 S → NP VP
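The chart-filling above can be sketched as a small CKY recognizer. The grammar is the toy one from the example, binarized with an extra N → Adj N rule (an assumption, since the tree drawn earlier keeps the NP flat):

```python
from collections import defaultdict

# Toy CNF grammar (terminals handled via unary lexical entries):
lexical = {'the': {'Det'}, 'tall': {'Adj'}, 'man': {'N'}, 'loves': {'V'}, 'Mary': {'NP'}}
binary = [('NP', 'Det', 'N'), ('N', 'Adj', 'N'), ('VP', 'V', 'NP'), ('S', 'NP', 'VP')]

def cky(words):
    """Fill chart[(i, k)] with every A such that A derives words[i:k]."""
    n = len(words)
    chart = defaultdict(set)
    for i, w in enumerate(words):                 # axioms: [w, i, i+1] and A -> w gives [A, i, i+1]
        chart[(i, i + 1)] |= lexical.get(w, set())
    for width in range(2, n + 1):                 # widen spans bottom-up
        for i in range(n - width + 1):
            k = i + width
            for j in range(i + 1, k):             # split point: [B,i,j] + [C,j,k] => [A,i,k]
                for A, B, C in binary:
                    if B in chart[(i, j)] and C in chart[(j, k)]:
                        chart[(i, k)].add(A)
    return chart

chart = cky('the tall man loves Mary'.split())
print('S' in chart[(0, 5)])   # True: the whole sentence parses as S
```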
Example4:
I saw the man in the park with a telescope.
The PP spans ("in the park", "with a telescope") can attach to different NPs,
so several NP spans are possible and the sentence has multiple parses.
A → B • C D
B: already recognized
C D: predicted
Items: [i, A → α • β, j]
Initial item: [0, S’ → • S, 0]
scan: make progress through rules.
[i, A → α • w_{j+1} β, j]  ⇒  [i, A → α w_{j+1} • β, j+1]
Example: A → [the tall] • B C, where [the tall] spans i to j
Prediction: top-down prediction for each rule B → γ
[i, A → α • B β, j]  ⇒  [j, B → • γ, j]
Combine
Complete (move the dot):
[i, A → α • B β, k] and [k, B → γ •, j]
i ——— [A → α • B β] ——— k ——— [B → γ •] ——— j
Then I have:
[i, A → α B • β, j]
Lecture 5: Reduced-dimensionality representations for documents: Gibbs sampling and topic models

 watch the new talk and write a summary
 Noah Smith: squash network
 Main points:
 difference between LSA & SVD
 Bayesian graphical models
 informative priors are useful in the model
 Bayesian network
 DAG
 X1, X2, …, Xn
 P(X1, X2, …, Xn) = ∏_i P(Xi | parents(Xi))
 Generative story: HMM (dependencies)
 A and B are conditionally independent given C iff P(A, B | C) = P(A | C) · P(B | C)
 C → A, C → B
 Example:
 Cloudy → Sprinkler, Cloudy → Rain
 Sprinkler → WetGrass; Rain → WetGrass
 WetGrass (final node)
 w’s are observable
 State change graph: π → S1 → S2 → S3 → … → St
 µ ~ (A, B, π)
 Have an initial π for C: π ~ Beta(γ1, γ2)
 C ~ Bernoulli(π)
 S ~ Bernoulli(π(S | C))
 R ~ Bernoulli(π(R | C))
 W ~ Bernoulli(τ(ω | S, R))
 π ~ Dirichlet(1)
 S1 ~ Cat(π); S2 ~ Cat(a_{s1,1}, a_{s1,2}, …, a_{s1,n})   *the Cat parameters come from the transition matrix
 ω1 ~ Cat(b_{s1,1}, …, b_{s1,n}); ω2 ~ Cat(b_{s2,1}, …, b_{s2,n})   *the Cat parameters come from the emission matrix
 What was just introduced is the unigram model; here is the bigram model:
 P(s1, s2, …, sn) = P(s1) P(s2 | s1) P(s3 | s2) … P(sn | s_{n−1})
 For EM, we need to recalculate everything; the conditional distributions are different.
 Some distributions:
 Binomial vs. Bernoulli
 Multinomial vs. discrete / categorical
 Document squashing
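The HMM generative story above (S1 ~ Cat(π), St ~ Cat(a_{s,·}), ωt ~ Cat(b_{s,·})) can be sampled directly; a minimal sketch with invented toy matrices:

```python
import numpy as np

rng = np.random.default_rng(0)

pi = np.array([0.5, 0.5])                # initial state distribution
A = np.array([[0.8, 0.2], [0.3, 0.7]])   # transition matrix a_{s, s'}
B = np.array([[0.9, 0.1], [0.1, 0.9]])   # emission matrix b_{s, w}

def sample(T):
    """Draw (states, words): s1 ~ Cat(pi), s_t ~ Cat(A[s_{t-1}]), w_t ~ Cat(B[s_t])."""
    states, words = [], []
    s = int(rng.choice(2, p=pi))
    for _ in range(T):
        states.append(s)
        words.append(int(rng.choice(2, p=B[s])))
        s = int(rng.choice(2, p=A[s]))
    return states, words

states, words = sample(10)
print(states, words)
```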
An interesting WordCloud website
https://www.wordclouds.com/
When loading .obj point data into a vertex buffer:
GLuint posLength = sizeof(PointStruct) * PointCloudData.size();  // correct: total byte size of the point data
GLuint posLength = sizeof(PointCloudData);                       // wrong: size of the vector object itself, not its contents
glBufferData(GL_ARRAY_BUFFER, posLength, &PointCloudData[0], GL_STATIC_DRAW);
 Cross entropy
 H(p,q) = D(p||q) + H(p)
 H(p) is the inherent randomness in p
 D(p||q) is what we care about; we can get at D(p||q) by calculating the cross entropy.
 Conclusion: a model is good if it assigns high probability to the observed data, so we need to find a good q.
 Main points:
 Example: She loves her. (It’s a correct string, but English is not like this. It should be “She loves herself.”)
 We need a meaning pair.
 Two orthogonal dimensions: units and probability.
              Units                Prob
 String       {a^n b^n | n ≥ 1}    P(w1, w2, …, wn)
 Structure    a tree structure     PCFG
 L1 = L2: Language 1 is equal to Language 2
 Weak equivalence
 the sets of generated strings are the same.
 Strong equivalence
 the structures of language 1 and 2 are the same.
 G1 = G2 iff {x | G1 generates string x} = {x | G2 generates string x} (all and only the same structures)

G1: S → a S, S → e        G2: S → S a, S → e
G1 and G2 are weakly equivalent (they generate the same strings) but not strongly equivalent.
 Weak equivalence
 Example: Jon loves Mary
 Questions:
 How to measure equivalence?
 binary judgements?
 EM
 Question: How to find a good model? Expectation maximization (EM)
 The structure of model is given, we need to find the parameters for the model.
 Coin: H H H H T T T T T T
 MLE: argmax_p [ P(x | p) ]
 Solve:
 Result: p = k/n (k heads out of n flips; here 4/10)
 HMM: µ = <A, B, π>
 π: initial probabilities
 N states
 V words
 recipe: http://www.umiacs.umd.edu/~resnik/ling773_sp2011/readings/em_recipe.v2.only_hmm.pdf
 Three fundamental questions for EM:
 What is P(O | µ)?
 What are the best hidden events given O and µ?
 What’s the best model I can have? argmax_µ P(O | µ)
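The coin example above can be checked numerically; a brute-force grid search over p confirms that the closed form k/n really maximizes the likelihood:

```python
import math
from collections import Counter

flips = list('HHHHTTTTTT')            # the coin data from above: 4 heads, 6 tails
counts = Counter(flips)
p_mle = counts['H'] / len(flips)      # closed form: p = k/n

def log_lik(p):
    """Log-likelihood of the flips under a Bernoulli(p) coin."""
    return counts['H'] * math.log(p) + counts['T'] * math.log(1 - p)

# grid search over p in (0, 1) as a sanity check of argmax_p P(x | p)
best_p = max((i / 1000 for i in range(1, 1000)), key=log_lik)
print(p_mle, best_p)   # 0.4 0.4
```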
Lecture 3: HMMs and Expectation Maximization
Lecture 3: Information Theory
 Today’s class is about:
 Hypothesis testing
 collocations
 Info theory
 Hypothesis Testing
 Last lecture covered the methodology.
 Collocation
 “religion war”
 PMI, PPMI
 PMI = pointwise mutual information
 PMI = log2( P(x,y) / (P(x)P(y)) ) = I(x, y)
 PPMI = positive PMI = max(0, PMI)
 Example: Hong Kong. The individual frequencies of “Hong” and “Kong” are low, but the frequency of “Hong Kong” is high.
 Non-statistical measure but very useful; not a classic hypothesis test.
 Note: if the frequency of the word is less than 5, then even if the PMI is high, the bigram is meaningless.
 RFR
 rfr(w) = ( freq(w in DP) / #DP ) / ( freq(w in background corpus) / #background corpus )
 DP is a child corpus of the background corpus.
 Chi-square test
 Drawing the contingency table is important.
 T-test
 Non-statistical measure but very useful.
 Not a classic hypothesis test.
 Use a sliding window to scan the sentence for the bigram:
 found → 1
 not found → 0
 A fox saw a new company emerge …
 a fox → 0
 fox saw → 0
 saw a → 0
 a new → 0
 new company → 1
 company emerge → 0
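A small sketch of PMI/PPMI with invented probabilities, illustrating the “Hong Kong” case:

```python
import math

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information: log2 of how much more often x and y
    co-occur than they would under independence."""
    return math.log2(p_xy / (p_x * p_y))

def ppmi(p_xy, p_x, p_y):
    """Positive PMI: clip negative associations to zero."""
    return max(0.0, pmi(p_xy, p_x, p_y))

# "Hong" and "Kong" are individually rare but almost always co-occur:
print(pmi(0.0009, 0.001, 0.001))    # high PMI: a strong collocation
# independent words: P(x,y) = P(x)P(y), so PMI = 0
print(pmi(0.000001, 0.001, 0.001))
```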
 Info theory
 Entropy: degree of uncertainty of X; information content of X; pointwise information of X; (ling) surprisal, perplexity
 X ~ p:  H(X) = E[−log2 p(x)] = −∑_x p(x) log2 p(x)
 H(X,Y) = H(X) + H(Y) – I(X, Y)
 H(X, Y) is the cost of communicating between X and Y;
 H(X) is the cost of communicating X;
 H(Y) is the cost of communicating Y;
 I(X, Y) is the saving because X and Y can communicate with each other.
 Example: 8 horses run; the probabilities of winning are:
 horse:  1    2    3    4     5     6     7     8
 Prob:  1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64
 Code1: 000, 001, 010, 011, 100, 101, 110, 111
 expected code length = 3 bits
 Code2: 0, 10, 110, 1110, 111100, 111101, 111110, 111111
 a Huffman code.
 expected code length = H(X) = 2 bits
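Checking the horse-race numbers directly (the probabilities follow the classic version of this example, where the last four horses each have probability 1/64):

```python
import math

probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
entropy = -sum(p * math.log2(p) for p in probs)   # H(X) in bits

code1 = ['000', '001', '010', '011', '100', '101', '110', '111']           # fixed length
code2 = ['0', '10', '110', '1110', '111100', '111101', '111110', '111111'] # Huffman

avg1 = sum(p * len(c) for p, c in zip(probs, code1))   # expected bits, fixed-length code
avg2 = sum(p * len(c) for p, c in zip(probs, code2))   # expected bits, Huffman code
print(entropy, avg1, avg2)   # the Huffman code achieves H(X)
```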
 Structure
 [W → encoder] → X → [channel P(y|x)] → Y → [decoder] → Ŵ
 Source-channel model (P(y|x))
 Everything before Y is the noisy channel model.
 P(X, Y) is the joint model; P(X | Y) is the discriminative model.
 T̂ (estimated tag sequence)
 = argmax_T P(T | W)
 = argmax_T P(W | T) P(T) / P(W)
 = argmax_T P(W | T) P(T)    (P(W) can be omitted because it’s just a scalar)
 = argmax_T P(w1|t1) P(w2|t2) … P(wn|tn) · P(t1) P(t2|t1) P(t3|t1,t2) … P(tn|t1, t2, …, tn−1)
 = argmax_T P(w1|t1) P(w2|t2) … P(wn|tn) · P(t1) P(t2|t1) P(t3|t2) … P(tn|tn−1)
 Ê = argmax_E P(F | E) P(E)
 F = French, E = English; the channel P(F | E) goes from English to French, so decoding translates French into English.
 Relative entropy: KL divergence
 We have an estimated model q and a true distribution p
 D(p||q) = ∑_x p(x) · log( p(x) / q(x) )
 Properties:
 D(p||q) >= 0, with equality iff p == q
 D(p||q) != D(q||p)
 D(p||q) = E_p[log p(x)] − E_p[log q(x)]; −E_p[log q(x)] is the cross entropy
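A quick numeric check of the properties above, with made-up distributions p and q:

```python
import math

def kl(p, q):
    """D(p || q) = sum_x p(x) * log2(p(x) / q(x)); terms with p(x) = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]
print(kl(p, p))            # 0.0: equal distributions
print(kl(p, q), kl(q, p))  # both positive, and not equal: KL is asymmetric
```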
 Example:
 Model q: P(X)P(Y), treating X and Y as independent
 Truth p: the joint distribution P(X, Y)
 D(p||q)
 = ∑_{x,y} [ p(x,y) log2( p(x,y) / (p(x)p(y)) ) ]
 = E_{x,y} [ log2( p(x,y) / (p(x)p(y)) ) ] = I(X; Y): average mutual information is the expectation of pointwise mutual information
 Cross entropy: (perplexity)
 Question: how good is my language model? How do I measure?
 p(w1, w2, w3, w4) = p(w1) p(w2 | w1) … p(w4 | w1, w2, w3)
 p(w4 | w1, w2, w3) ≈ p(w4 | w2, w3)
 make it Markov, and collapse everything besides the previous word and the one before it.
 D(p||m) = ∑ [ p(x1 … xn) log( p(x1 … xn) / m(x1 … xn) ) ]
 but how do you know the truth?
 what if n is very large?
 H(p,m) = H(p) + D(p||m) = −∑ p(x) log m(x)
 H(p,m) can stand in for D(p||m) (they differ only by the constant H(p)), then:
 we can sample from p and estimate H(p,m) from the samples, then:
 H(L,m) = lim_{n→∞} [ −1/n ∑ p(x1 … xn) log m(x1 … xn) ], then:
 we need x1 … xn to be a good sample of L.
 H(L,m) = lim_{n→∞} [ −1/n log m(x1 … xn) ]
 m(x1 … xn) = ∏_i m(xi | x_{i−1})
 If m(x) = 0 for some observed x → H(p,m) = ∞
 Should associate the cost with something…
 smoothing (lots of ways for smoothing)
 Perplexity:
 2^{H(x1 … xn, m)}, i.e. 2^(cross entropy)
 effective vocabulary size
 recovers the unit from “bits” back to the original one.
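Putting the last few lines together: estimating cross entropy from per-word model probabilities on a sample, then converting to perplexity (the probabilities are invented):

```python
import math

def cross_entropy(word_probs):
    """H ≈ -(1/n) * sum_i log2 m(w_i | history), estimated from a held-out sample.
    Note a zero probability would make this infinite, hence smoothing."""
    return -sum(math.log2(p) for p in word_probs) / len(word_probs)

word_probs = [0.25, 0.1, 0.5, 0.05]   # m's probability for each word in a 4-word sample
H = cross_entropy(word_probs)
perplexity = 2 ** H                   # back from bits to an effective vocabulary size
print(H, perplexity)
```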
After class (M& S book)
 6.2.1 MLE: gives the highest probability to the training corpus.
 C(w1, …, wn) is the frequency of the n-gram w1 … wn in the training set.
 N is the number of training instances.
 P_MLE(w1, …, wn) = C(w1, …, wn) / N
 P_MLE(wn | w1, …, wn−1) = C(w1, …, wn) / C(w1, …, wn−1)
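The two MLE formulas, sketched over a toy corpus:

```python
from collections import Counter

corpus = 'the cat sat on the mat the cat ran'.split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
N = len(corpus)

p_uni = unigrams['the'] / N                          # P_MLE(w) = C(w) / N
p_cond = bigrams[('the', 'cat')] / unigrams['the']   # P_MLE(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
print(p_uni, p_cond)   # "the" occurs 3 of 9 times; "the cat" follows 2 of those 3
```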
 6.2.2 Laplace’s law, Lidstone’s law and the Jeffreys-Perks law
 Laplace’s law
 original smoothing
 Lidstone’s law
 JeffreysPerks law
 Expected likelihood estimation (ELE)
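The three laws differ only in the constant added to each count; a minimal sketch (λ = 1 gives Laplace’s law, λ = 0.5 gives the Jeffreys-Perks law / ELE):

```python
def lidstone(count, total, vocab_size, lam):
    """Lidstone's law: P(w) = (C(w) + lam) / (N + lam * |V|).
    lam = 1   -> Laplace's law
    lam = 0.5 -> Jeffreys-Perks law (expected likelihood estimation)"""
    return (count + lam) / (total + lam * vocab_size)

# An unseen word no longer gets probability zero:
print(lidstone(0, 100, 1000, lam=1.0))   # Laplace: 1/1100
print(lidstone(0, 100, 1000, lam=0.5))   # Jeffreys-Perks: 0.5/600
```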
 6.2.3 Held Out Estimation
 Take further test and see how often bigrams that appeared r times in the training text tend to turn up in the future text.
 6.2.4 Cross Validation
 Cross validation:
 Rather than using some of the training data only for frequency counts and some only for smoothing probability estimates, more efficient schemes are possible where each part of the training data is used both as initial training data and as held-out data.
 Deleted estimation: two-way cross validation
The test speed of a neural network?
Basically, the time spent on testing depends on:
 the complexity of the neural network
 For example, the fastest network should be the fully-connected network.
 CNNs should be faster than LSTMs because LSTMs are sequential (sequential = slow)
 Currently, there are many ways to compress deep learning models (e.g. remove nodes with lighter weights)
 the complexity of data
 Try to build parallel models.
Analysis of different deep learning models:
 RNN
 CNN
 LSTM
 GRU
 Seq2seq:
 used for NLP
 Autoencoder:
 learn the features and reduce dimensionality
Other machine learning models:
 random forest
 used for classification
Machine learning thought: