Lecture 2: Lexical association measures and hypothesis testing

Pre-lecture Readings.

  • Lexical association
    • Named entities: http://www.nltk.org/book/ch07.html
      • Information extraction architecture
        • raw text -> sentence segmentation -> tokenization -> part-of-speech tagging -> entity detection -> relation detection
        • chunking: segments and labels multi-token sequences, as illustrated in the chunk-segmentation figure (2.1) of the NLTK chapter
          • Noun-phrase (NP) chunking
          • tag patterns: describe sequences of tagged words
          • Chunking with Regular Expressions
          • Exploring Text Corpora
        • Chinking: define a chink to be a sequence of tokens that is not included in a chunk
        • Representing chunks: tags & trees
        • Training Classifier-Based Chunkers
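        A minimal NP-chunking sketch in the spirit of the NLTK chapter linked above (assumes nltk is installed);
        the tag pattern below chunks an optional determiner, any number of adjectives, and a noun:

```python
import nltk

# A toy POS-tagged sentence (Penn Treebank tags).
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

# Tag pattern: optional determiner, any number of adjectives, then a noun.
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunker = nltk.RegexpParser(grammar)

tree = chunker.parse(sentence)
print(tree)   # NP chunks are subtrees; the remaining tokens form chinks
```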
    • Accurate methods for the statistics of surprise and coincidence
  • An introduction to lexical association measures:
    • We use the term lexical association to refer to the association between words.
    • 3 types of association:
      • collocational association: restricting the combination of words into phrases
      • semantic association: reflecting the semantic relationship between words
      • cross-language association: corresponding to potential translations of words between different languages
  • Topics:
    • Knowledge/data -> Zipf's law
    • Hypothesis testing
    • Collocations
    • NER
  • Knowledge/data
    • Rationalist: driven by prior knowledge and hand-written rules
    • Empiricist: driven by observation (data)
    • Zipf’s law: the frequency of any word is inversely proportional to its rank in the frequency table.
      •                   Empiricist      Rationalist
      • core              frequent
      • periphery         infrequent      exception relation
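      A minimal sketch for checking Zipf's law empirically; the corpus file name below is a placeholder
      (any large tokenized text works). Rank * frequency should stay roughly constant for the top-ranked words.

```python
from collections import Counter

# Placeholder corpus file; substitute any large plain-text corpus.
tokens = open("corpus.txt", encoding="utf-8").read().lower().split()
counts = Counter(tokens)

# Zipf's law: frequency is inversely proportional to rank,
# so rank * frequency should be roughly constant.
for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
    print(f"{rank:>3}  {word:<15} freq={freq:<8} rank*freq={rank * freq}")
```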
    • Example 1
      • "Alice likes it / a book."
      • Question: what does Alice like?
      • Instead of using a string of words, use a tree structure to represent the sentence.
      • Tree sketch: [CP [Spec what] [IP Alice like]]
    • Example 2 (Rationalist)
      • "Which book did Mary say that Bill thinks Alice likes?"
      • "He dropped a book about info theory."
      • "What did he drop a book about?" (questionable) vs. "What did he write a book about?" (good)
      • He dropped [NP a book about info theory]: the bracketed phrase is an NP inside the S.
    • Example 3
      • What does Mary read?
      • Treebank-style parse: (SBARQ (WHNP (WP what)) (SQ (VBZ does) (NP Mary) (VP (VB read) (NP *T*-1))))
    • Balance between rationalist and empiricist
    • Accuracy vs. coverage
    • Data-driven algorithms are less concerned with overgeneration
    • Semi-supervised learning
    • How to relate the algorithms to the way we learn.
    • 80-20 rule
    • Remove the low-frequency words and focus on the frequent words when modeling the structure.
  • Hypothesis testing (Chi^2 test)
    • P(H|E) H: hypothesis, E: evidence
    • P(q = 0.7 | 76 successes, 24 failures)
    • pmf: probability mass function
    • R = P(H1|E) / P(H2|E): ratio of the posteriors of two competing hypotheses
    • Recipe for statistical hypothesis testing
      • Step 1: H0: null hypothesis (the hypothesis we try to reject, i.e., what we think is not true)
        • Example: coin flip: coin has only one side.
      • Step 2: Test statistic: a number we compute from the collected evidence
      • Step 3: Identify the distribution the test statistic would follow if H0 were true
      • Step 4: Pick a threshold (significance level) beyond which we are convinced that H0 is not true
      • Step 5: Do the experiment and calculate the test statistic
      • Step 6: If the result is sufficiently improbable assuming H0 is true, reject H0
        • H0: q = 0.5, H1: q != 0.5
    • Example:
      • H0: P(heads) = 0.5
      • R ~ Binomial(n, p)
      • Likelihood function: b(r; n, p) = C(n, r) p^r (1-p)^(n-r)
      • Test statistic: number of heads
      • Threshold:
        • alpha too small (e.g., 10^-10) -> more Type II errors ("miss", "false negative")
        • alpha too large (e.g., 0.2) -> more Type I errors ("false positive")
        • Typically, people choose alpha = 0.05
      • The test-statistic value at the boundary of the rejection region (2.5% in each tail for a two-sided alpha = 0.05) is the critical value.
      • Tips:
        • alpha is selected in advance (choosing it after seeing the data is p-hacking)
        • A test cannot prove that something is true; it can only show that something (H0) is unlikely to be true
        • The Chi^2 statistic is also written X^2
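      A minimal sketch of the coin-flip test above, assuming n = 100 flips with 76 heads (matching the
      "76 successes, 24 failures" line) and H0: q = 0.5; it sums the pmf b(r; n, p) over all outcomes at
      least as extreme as the observed one and compares the resulting p-value to alpha.

```python
from math import comb

def binom_pmf(r, n, p):
    """b(r; n, p) = C(n, r) * p^r * (1 - p)^(n - r)"""
    return comb(n, r) * p**r * (1 - p)**(n - r)

def two_sided_pvalue(heads, n, p0=0.5):
    """Exact two-sided test: sum P(r) over all outcomes r that are
    no more likely than the observed count under H0."""
    observed = binom_pmf(heads, n, p0)
    return sum(binom_pmf(r, n, p0) for r in range(n + 1)
               if binom_pmf(r, n, p0) <= observed)

n, heads, alpha = 100, 76, 0.05      # test statistic: number of heads
p_value = two_sided_pvalue(heads, n)
print(f"p-value = {p_value:.3g}, reject H0: {p_value < alpha}")
```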
    • Difference between Chi^2 test and T-test
    • How many trials do we need before the experiment is convincing?
    • Collocation: a collocation is a sequence of words or terms that co-occur more often than would be expected by chance.
      • "kick the bucket" = to die
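    A minimal collocation-extraction sketch, assuming nltk and its Brown corpus are available
    (nltk.download("brown") on first use); it ranks bigrams by Dunning's log-likelihood ratio,
    the measure from the "surprise and coincidence" reading.

```python
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# nltk.download("brown")   # uncomment on first run
words = [w.lower() for w in nltk.corpus.brown.words()]

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(5)   # ignore very rare bigrams

# Bigrams that co-occur far more often than chance would predict.
print(finder.nbest(measures.likelihood_ratio, 10))
```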

CMSC773: HW1

  • Question 2:
    • Word order:
      • Explanation: word order refers to the order in which words appear in a sentence:
        • Alexa, when is your birthday?
        • (Alexa answers)
        • Alexa, when your birthday is?
        • (Alexa answers)
        • This tests whether Alexa handles incorrect word order.
    • Inflectional morphology:
      • Explanation:
  • Question 3:
  • Question 4:
  • Question 5:
  • Question 6:
    • New definitions:
      • log-entropy weighting
      • cosine similarity
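      A minimal sketch of cosine similarity between two term vectors; the numpy vectors below are
      made-up stand-ins for rows of a log-entropy weighted term-document matrix.

```python
import numpy as np

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (||u|| * ||v||)"""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors; in practice these would be log-entropy weighted term vectors.
u = np.array([1.0, 0.5, 0.0, 2.0])
v = np.array([0.8, 0.4, 0.1, 1.5])
print(cosine_similarity(u, v))   # 1.0 = same direction, 0.0 = orthogonal
```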

CMSC773: pre-course knowledge

  • Dependency Parsing
    • Dependency: focuses on relations between words
      • Typed: Label indicating relationship between words
      • Untyped: only which words depend on which (no relation labels)
    • Phrase structure: focuses on identifying phrases and their recursive structure
    • Dependency parsing methods:
      • Shift reduce:
        • Predict from left to right
        • Fast, but slightly less accurate
        • MaltParser
      • Spanning tree:
        • Calculate full tree at once
        • Slightly more accurate, slower
        • MSTParser, Eda
      • Cascaded chunking
        • Chunk words into phrases, find heads, delete non-heads, repeat
        • CaboCha
    • Example
      • Maximum Spanning Tree
        • Each dependency is an edge in a directed graph
        • Assign each edge a score
        • Keep the tree with the highest score
      • Cascaded Chunking
        • Used mainly for Japanese
      • Shift-Reduce
        • Two data structures:
          • Queue: of unprocessed words
          • Stack: of partially processed words
        • Process: at each point, choose one of:
          • Shift: move one word from queue to stack
          • Reduce left: top word on stack is head of second word
          • Reduce right: second word on stack is head of top word
        • Classification for shift-reduce
          • We have a weight vector for each action ("shift", "reduce left", "reduce right"): Ws, Wl, Wr
          • Calculate a feature vector from the queue and stack: φ(queue, stack)
          • Multiply the weights by the features to get scores, e.g. Ss = Ws · φ(queue, stack)
          • Choose the action with the highest score
        • Features for Shift Reduce:
          • Features should generally cover at least the last stack entries and first queue entry
        • Algorithm:
          • Input:
            • weights:Ws, Wl, Wr
            • queue: [(1, word1, POS1), (2, word2, POS2), …]
            • stack: [(0, ROOT, ROOT)]
          • Output:
            • heads = [-1, head1, head2, …]
          • Pseudo code (see the Python sketch at the end of this section)
          • Training
            • Can be trained using perceptron algorithm
            • Do parsing; if the correct answer corr differs from the classifier's answer ans, update the weights
            • Shift-reduce training algorithm
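          A minimal greedy shift-reduce sketch following the description above. The hashed feature
          function phi, the feature dimension, and the random weights are illustrative assumptions;
          a MaltParser-style system uses sparse indicator features with perceptron-trained weights.

```python
import numpy as np

DIM = 16   # size of the hashed feature vector (illustrative)

def phi(queue, stack):
    """Toy feature function over the top two stack items and the first queue item.
    Real parsers use sparse indicator features; here we hash (position, word, POS)."""
    feats = np.zeros(DIM)
    context = [stack[-1] if len(stack) > 0 else None,
               stack[-2] if len(stack) > 1 else None,
               queue[0] if queue else None]
    for i, item in enumerate(context):
        if item is not None:
            _, word, pos = item
            feats[hash((i, word, pos)) % DIM] += 1.0
    return feats

def parse(queue, Ws, Wl, Wr):
    """Greedy parsing: pick the highest-scoring legal action at each step."""
    queue = list(queue)
    stack = [(0, "ROOT", "ROOT")]
    heads = {}                               # dependent index -> head index
    while queue or len(stack) > 1:
        f = phi(queue, stack)
        scores = {"shift": Ws @ f, "left": Wl @ f, "right": Wr @ f}
        if not queue:
            scores["shift"] = -np.inf        # nothing left to shift
        if len(stack) < 2:
            scores["left"] = scores["right"] = -np.inf
        elif stack[-2][0] == 0:
            scores["left"] = -np.inf         # ROOT can never become a dependent
        action = max(scores, key=scores.get)
        if action == "shift":
            stack.append(queue.pop(0))
        elif action == "left":               # top of stack is head of the second item
            dep = stack.pop(-2)
            heads[dep[0]] = stack[-1][0]
        else:                                # second item is head of the top of stack
            dep = stack.pop(-1)
            heads[dep[0]] = stack[-1][0]
    return heads

# Random weights just to exercise the code; perceptron training would set these.
rng = np.random.default_rng(0)
Ws, Wl, Wr = rng.normal(size=(3, DIM))
queue = [(1, "Mary", "NNP"), (2, "reads", "VBZ"), (3, "books", "NNS")]
print(parse(queue, Ws, Wl, Wr))   # {dependent: head}; arbitrary with untrained weights
```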
    • CKY (Cocke-Kasami-Younger) Parsing Algorithm & Earley Parsing Algorithms
      • Definitions:
        • CFG (Context-free Grammar): G = (N, Σ, P, S)
          • N is a finite set of non-terminal symbols
          • Σ is a finite set of terminal symbols (tokens)
          • P is a finite set of productions (rules)
            • Production : (α, β)∈P will be written α → β where α is in N and β is in (N∪Σ)*
          • S is the start symbol in N
        • CNF (Chomsky Normal Form):
          A context-free grammar where the right-hand side of each production is restricted to either two non-terminals or a single
          terminal, i.e., every production has one of the forms A -> BC or A -> a.

          Any CFG can be converted to a weakly equivalent grammar in CNF
      • Parsing Algorithms:
        • Should know:
          • CFGs are basis for describing (syntactic) structure of NL sentences
          • Parsing Algorithms are core of NL analysis systems
          • Recognition vs. Parsing:
            • Recognition: deciding membership in the language
            • Parsing: recognition plus producing a parse tree for the input
          • Parsing is more “difficult” than recognition (time complexity)
          • Ambiguity: an input may have exponentially many parses
        • Top-down vs. bottom-up
          • Top-down (goal driven): from start symbol down
          • Bottom-up (data driven): from the symbols up
        • Naive vs. dynamic programming
          • Naive: enumerate everything
          • Back tracking: try something, discard partial solutions
          • DP: save partial solutions in a table
        • CKY: Bottom-up DP
        • Earley Parsing: top-down DP
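        A minimal CKY recognizer sketch for a CFG in CNF (recognition only; a full parser would also
        store backpointers to recover trees). The toy grammar and sentence are made up for illustration.

```python
from collections import defaultdict

def cky_recognize(words, grammar, start="S"):
    """CKY recognition for a grammar in Chomsky Normal Form.
    grammar: rules (A, (B, C)) for A -> B C, or (A, (a,)) for A -> a."""
    n = len(words)
    chart = defaultdict(set)               # (i, j) -> non-terminals spanning words[i:j]
    lexical = defaultdict(set)             # terminal a -> {A | A -> a}
    binary = []                            # (A, B, C) for A -> B C
    for lhs, rhs in grammar:
        if len(rhs) == 1:
            lexical[rhs[0]].add(lhs)
        else:
            binary.append((lhs, rhs[0], rhs[1]))
    for i, w in enumerate(words):          # width-1 spans
        chart[(i, i + 1)] |= lexical[w]
    for width in range(2, n + 1):          # longer spans, bottom-up
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):      # split point
                for A, B, C in binary:
                    if B in chart[(i, k)] and C in chart[(k, j)]:
                        chart[(i, j)].add(A)
    return start in chart[(0, n)]

# Toy CNF grammar for "Mary reads books"
grammar = [("S", ("NP", "VP")), ("VP", ("V", "NP")),
           ("NP", ("Mary",)), ("NP", ("books",)), ("V", ("reads",))]
print(cky_recognize("Mary reads books".split(), grammar))   # True
```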
    • Chart Parsing:
    • Lambda Calculus
    • Perceptron Learning
    • Restricted Boltzmann Machines
    • Statistical Hypothesis Testing
    • Latent Semantic Analysis
    • Latent Dirichlet Allocation
    • Gibbs Sampling
    • Inter-rater Reliability
    • IBM Models for MT
    • Phrase-based Statistical MT
    • Hierarchical Phrase-based Statistical MT
    • Lambda Reductions
    • Probabilistic Soft Logic
    • Open Information Extraction
    • Semantic Role Labeling
    • Combinatory Categorial Grammar (CCG)
    • Tree Adjoining Grammar
    • Abstract Meaning Representation (AMR)
    • Wikification