## Lecture 8: Evaluation

• PCFG
• ∑Pr(A -> gamma | A) = 1
• (conditional) probability of each item has to sum to one
• Pr(O = o1,o2,…,on|µ)
• HMM: Forward
• PCFG: Inside-Outside
• Guess Pr: argmax_(Z)[ Pr(Z|O, µ) ]
• HMM:Use Viterbi to get
• PCFG: Use Viterbi CKY to get
• *Z is the best sequence of states
• Guess µ: argmax_(µ)[Pr(O|µ)]
• HMM:Use forward-backward to get
• PCFG: Use Inside-outside to get
• Example:
• Sentence:
• ——————-S
• ——–NP—————-VP
• ——–NP———-V————-NP
• —————————roasted—-peanuts
• Problem:
• Pr_µ(peanuts eat roasted people) = Pr_µ(people eat roasted peanut)
• We can try to generate head of each phrase:
• Dependency representation:
• Sentence:
• —————————eat
• —————people—————peanuts
• —————–the—————–roasted
• Lexical (bottom up)
• NP ->det N
• Evaluation
• Reference Reading:How Evaluation Guides AI Research
• Intrinsic evaluation
• Extrinsic evaluation
• Kappa’s evaluation
• Metric: precision recall
• How to evaluate two structures which could generate the same sentence?
• Answer: Generate more than one output for each input, convert the output into set of output, and use precision and recall to measure.
• If the reader’s score agree with the machine, stop

## 03102018

• Gensim Tutorial:
• Use google-news model as pre-trained model
• clustering based on distance matrix
• Question: how do we do the clustering?
• should cluster on the keywords?
• should cluster on the keywords-related words?
• Leg dissection demo:
• 18 cameras 30frames 10G
• 5 cameras 100 frames 6G
• Question:
• we cannot change focal length now. we can only change the viewpoint
• if we want dynamic, we should have dynamic mesh?
• Foveated ray-tracing:
• input: eye ray + 1spp
• output: foveated image
• question: If we use foveated image as ground truth, what should be the denoising algorithm for the ground truth?
• TODO:
• read G3D code and change sample number
• Homework

## Lecture 6: Context-free parsing

Questions:

1. Generative Model P(X,Y)
2. Discriminative model P(Y|X)

MainPoints

1. Block sampler: Instead of sample one element at a time, we can sample a batch of samples in Gibbs Sampling.
2. Lag and Burn-in: can be viewed as parameters (we can control the number of iterations)
1. lag: mark some iterations in the loop as lag, then throw away the lag iterations, then the other samples become independent.
1. Example: run 1000 iters -> run 100 lags -> run 1000 iters -> 100 lags …
2. burn in: throw away the initial 10 or 20 iterations (burn-in iterations), where the model has not converged.
1. The right way is to test whether the model has converged.
3. Topic model:
4. The sum of the parameter of each word in a topic doesn’t need to be one
5. The derivative (branches) of LDA (Non-parametric model):
1. Supervised LDA
2. Chinese restaurant process (CRP)
3. Hierarchy models
1. example: SHLDA
2. gun-safety (Clinton) &  gun-control (Trump)

Parsing:

Any context which can be processed with FSA can be processed with CFGs. But not vice versa.

 ? turning machine (Don’t cover in this lecture) CSL Tree adjusting Grammar PTAGs (Don’t cover in this lecture) CF PDA/CFGs PCFGs PDA/CFGs Allow some negative examples. And can handle some cases that cannot be processed by FSA. For example: S -> aSb {a^nb^n} cannot be processed by FSA because we need to know the variable n. But FSM only remember the states, it cannot count. Regular FSA/regular expressions HMM

Example1:

The rat that the cat that the dog bit chased died.

Example2:

Sentence: The tall man loves Mary.

————-loves

—–man————Mary

-The——tall

Structure:

——————-S

——–NP——————VP

—the-tall–man       loves—–Mary

Example3: CKY Algorithm

0 The 1 tall 2 man 3 loves 4 Mary 5

[w, i, j] A->w \in G

[A, i, j]

Chart (bottom up parsing algorithm):

0 The 1 tall 2 man 3 loves 4 Mary 5

–Det —— N ———V ——NP—-

———–NP ———–VP ———

—————- S ———————

Then I have:

[B, i, j]

[C, j, k]

So I can have:

A->BC : [A, i, k] # BC are non-determinal phrases

NP ->Det N

VP -> V NP

S -> NP VP

Example4:

I      saw         the        man       in        the       park        with      a       telescope.

——————– NP—————– PP ———–

–                         —————NP———————

————————NP—————————–

A->B . CD

CD: predicted

[A -> a*ß, j]

[0, S’-> *S, 0]

scan: make progress through rules.

[i, A -> a* (w_j+1) ß, j]

A [the tall] B*C

i, [the tall], j

Prediction: top-down prediction B-> γ

[i, A-> a * Bß, j]

[j, B->*γ, j]

Combine

Complete (move the dot):

[i, A->a*ß, k] [k, B->γ, j]

I                          k                       j

–[A->a*Bß]———[B->γ*]—–

Then I have:

[I, A->aB*ß, k]

## Lecture 5: Reduced-dimensionality representations for documents: Gibbs sampling and topic models

• watch the new talk and write summary
• Noah Smith: squash network
• Main points:
• difference between LSA & SVD
• Bayesian graphical models
• informative priors are useful in the model
• Bayesian network
• DAG
• X1X2…Xn
• Po(X1, X2, …, Xn)
• Generative story: HMM (dependencies)
• A and B are conditionally independent given C iff P(A,B|C) = P(A|C) * P(B|C)
•       C (to A, to B)
•   A                             B
• Example:
•               Cloudy (to Sprinter and rain)
• Sprintder(to wet grass)                   Rain (to wet grass)
•                       Wet grass(final state)
• w’s are observable
• State change graph:  π -> S1 -> S2 -> S3 -> … -> St
• w’s are observable
• μ~(A,B,π)
• Have a initial π for C, π~Beta(Gamma(π1), Gamma(π2))
• C is Bernoulli (π)
• S~ Bernoulli(π(S | C))
• R~Bernoulli(π(S | C))
• W ~ Bernoulli(\tao (ω | S,R))
• π~Dirichlet(1)
• S1~Cat(π)S2~Cat(a_s1,1  ,  a_s1,2  ,   ….  ,   a_(s1,n))* Cat is chosen from the transition matrix
• ω1~Cat(b_s1,1  ,  b_s1,2  ,   ….  ,   b_(s1,n))ω2~Cat(b_s1,1  ,  b_s1,2  ,   ….  ,   b_(s1,n))* Cat is chosen from the transition matrix
• What just been introduced is the unigram model, here is the bigram model
• P(s1, s2, …. , sn) = P(s1) P(s2|s1) P(s3|s1,s2) P(s4|s2,s3)..
• For EM, we need to recalculate everything, conditional distribution are different
• Some distributions:
• Binomial vs. Bernoulli
• Multinomial vs. discrete / categorical
• Document squashing

## An interesting WordCloud website

https://www.wordclouds.com/

GLuint posLength = sizeof(PointStruct) * PointCloudData.size(); correct

GLuint posLength = sizeof(PointCloudData) ; wrong

glBufferData(GL_ARRAY_BUFFER, posLength, &PointCloudData[0], GL_STATIC_DRAW);

## Hello

• Cross entropy
• H(p,q) = D(p||q) + H(p)
• H(p) is some inherent randomness in p
• D(p||q) is what we care about. we can try to get D(p||q) by calculating cross entropy.
• Conclusion: a model is good is that it assign good approximation to the observed data. So we need to find some good q
• Main points:
• Example: She loves her. (It’s a correct string, but English is not like this. It should be “She loves herself.”)
• We need a meaning pair.
• Two orthogonal dimensions:
• probability for the strings.
•  Units Prob String {anbn|n>=1} P(w1, w2,…, wn) Structure A tree structure PCFG
• L1 = L2: Language 1 is equal to Language 2
• Weak equivalence
• Sense of string are the same.
• Strong equivalence
• Structure of language 1 and 2 are the same.
• G1 = G2 iff {x| G1 generates string x} = {x|G2 generates string x} (all and only the same structures)
•  G1 G2 S->a s s->s a s->e s->e
• G1 and G2 are weak equivalent (they generate the same strings) but not strong equivalent
• Example: Jon loves mary
• Questions:
• How to measure equivalence?
• binary judgements?
• EM
• Question: How to find a good model? Expectation maximization (EM)
• The structure of model is given, we need to find the parameters for the model.
• Coin: H H H H T T T T T T
•
• MLE: argmax [p(x|mu)]
• Solve:
• Result: p = k/n
• HMM <A,B, pi>

## Lecture3: Information Theory

• Hypothesis testing
• collocations
• Info theory
• Hypothesis Testing
• Last lecture covered the methodology.
• Collocation
• “religion war”
• PMI, PPMI
• PMI = pointwise mutual information
• PMI = log2(P(x,y)/(P(x)P(y))) = I(x,y)
• PPMI = positive PMI = max(0, PMI)
• Example: Hong Kong, the frequency of “Hong” and “Kong” are low, but the frequency for “Hong Kong” is high.
• Non-statistical measure but very useful: Not a classic hypothesis testing.
• Note: If the frequency of the word is less than 5, then even if the PMI is high, the bigram is meaningless.
• RFR
• rfr(w) = (freq(w in DP)/(#DP))/(freq (w in background corpus)/(#background corpus))
• DP is an child class in background corpus.
• Chi-square test
• Draw the table is important.
• T-test
• Non-statistical measure but very useful.
• Not a classic hypothesis testing.
• sliding window scan the sentence to find the bigram.
• found ->1
• A fox saw a new company’s emerge…..
• A fox ->0
• fox saw -> 0
• saw a ->0
• a new -> 0
• new company -> 1
• company emerge -> 0
• Info theory
• Entropy: degree of certainty of X; information content of X; pointwise information of X; (ling) surprisal, perplexity
• X~P, H(X) = E[-log2(p(x))] = -sum(p(x)log2(p(x)))
• H(X,Y) = H(X) + H(Y) – I(X, Y)
• H(X, Y) is the cost of communicating between X and Y;
• H(X) is the cost of communicating X;
• H(Y) is the cost of communicating Y;
• I(X, Y) is the saving because X and Y can communicate with each other.
• Example: 8 horses run, the probability of winning is:
• horse:    1  .     2   .   3   .    4   .    5   .   6   .     7   .       8
• Prob:    1/2, 1/4   , 1/8,  1/16, 1/32, 1/64, 1/128, 1/256
• Code1:   000, 001, 010,  011,   100,  101,     110,      111
• H(X) = 3;
• Code2:0, 10, 110,  1110,  111100, 1111101, 111110, 111111
• Huffman code.
• H(X) = 2;
• Structure
• [W -> encoder] -> X-> [                    ]->    Y  -> [decoder  -> W^]
•        Source                    channel (P(y|x))
• Anything before Y is the noisy channel model.
• P(X,Y) is the joint model, P(X|Y) discriminative model.
• T^ (estimated T)
• =argmax_T (P(T|W))
• = argmax_T(P(W|T)P(T)/P(W))
• = argmax_T(P(W|T)P(T)) (W can be omitted because its’s just a scaler)
• = argmax_T(P(w1|t1)P(w2|t2)…p(w.n|t.n) P(t1)P(t2|t1)P(t3|t2,t1)…P(tn|t1,t2, …, t.n-1))
• = argmax_T(P(w1|t1)P(w2|t2)…p(w.n|t.n) P(t1)P(t2|t1)P(t3|t2)…P(t.n))
• E^ = argmax P(F|E)P(E)
• F = French, E = English, translation from English to French.
• Relative entropy: KL divergence
• We have an estimated model q and a true distribution p
• D(p||q) = sum_x(p(x).log(p(x))/ log(q(x)))
• Properties:
• D(p||q) >= 0, equal is true when p == q
• D(p||q) != D(q||p)
• D(p||q) = E_p[p(x)] – E_p[q(x)]   E_p(q(x)) is the cross entropy
• Example:
• Model q: P(X)P(Y) independent
• truth   p: Q(X, Y) joint
• D(p||q)
• = sum_x,y [p(x,y) log2(p(x,y)/p(x)p(y))]
• = E_x,y [log2(p(x,y)/p(x)p(y))] average mutual info is the expectation of pointwise mutual info
• Cross entropy: (perplexity)
• Question: how good is my language model? How do I measure?
• p(w1, w2, w3, w4) = p(w1) p(w2|w1) … p(w4|w1 w2 w3)
• p(w4|w1 w2 w3) = p(w4|w*, w2 w3)
• make it Markov, and combine everything beside the prev and the prev-prev.
• D(p||m) = sum[p(x.1 … x.n)log(p(x.1 … x.n)/m(x.1 … x.n))]
• but how do you know the truth?
• what if n is very large?
• H(p,m)=H(p) + D(p||m)  = -sum[p(x)log(m(x))]
• H(p,m) can be an approximation of D(p||m), then:
• We can sample from p and check H(p,m) by the samples, then:
• H(L,m) = lim_(n->infinity) [- 1/n sum(p(x.1 … x.n)log(m(x.1 … x.n)))], then:
• we need p(x.1 … x.n) be a good sample of L.
• H(L,m) = lim_(n->infinity) [1/n log m(x.1 … x.n)]
• m(x.1 … x.n) = product(m(x.i| x.i-1))
• If m(*) = 0 -> H(p,m) = infinity
• Should associate the cost with something…
• smoothing (lots of ways for smoothing)
• Perplexity:
• 2^(H(x.1 … x.n, m)) i.e. 2^ (cross entropy)
• effective vocabulary size
• recover the unit from “bits” to the original one.

After class (M& S book)

• 6.2.1 MLE: gives the highest probability to the training corpus.
• C(w1, … ,wn) is the frequency of n-gram w1…wn in the training set.
• N is the number of training instances.
• P_MLE(w1, …, wn) = C(w1, … , wn) / N
• P_MLE(wn | w.1, …, w.n-1) = C(w1, … , wn)/C(w1, … , w.n-1)
• 6.2.2 Laplace’s law, Lidstone’s law and the Jeffreys-Perks law
• Laplace’s law
• original smoothing
• Lidstone’s law
• Jeffreys-Perks law
• Expected likelihood estimation (ELE)
• 6.2.3 Held Out Estimation
• Take further test and see how often bigrams that appeared r times in the training text tend to turn up in the future text.
• 6.2.4 Cross Validation
• Cross validation:
• Rather than using some of the training data only for frequency counts and some only for smoothing probability estimates, more efficient schemes are possible where each part of the training data is used both as initial traiining data and as held out data.
• Deleted Estimation: two-way cross validation:

## The test speed of neural network?

Basically, the time spent on testing depends on:

• the complexity of the neural network
• For example, the fastest network should be the fully-connected network.
• CNN should be faster than LSTM because LSTM is sequential (sequential = slow)
• Currently, there are many ways to compress deep learning model (remove nodes with lighter weight)
• the complexity of data
• Try to build parallel models.

Analyze of different deep learning models:

• RNN
• CNN
• LSTM
• GRU
• Seq2seq:
• used for NLP
• Autoencoder:
• learn the features and reduce dimension’

Other machine learning models:

• random forest
• used for classification

Machine learning thought: