Lecture 2: Lexical association measures and hypothesis testing

Pre-lecture Readings.

  • Lexical association
    • Named entities: http://www.nltk.org/book/ch07.html
      • Information extraction architecture
        • raw text -> sentence segmentation -> tokenization -> part-of-speech tagging -> entity detection -> relation detection
        • chunking: segments and labels multi-token sequences, as illustrated in the chunk-segmentation figure (2.1) of the NLTK chapter
          • Noun-phrase (NP) chunking
          • tag patterns: describe sequences of tagged words
          • Chunking with Regular Expressions
          • Exploring Text Corpora
        • Chinking: define a chink to be a sequence of tokens that is not included in a chunk
        • Representing chunks: tags & trees
        • Training Classifier-Based Chunkers
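        A minimal NP-chunking sketch in the spirit of the NLTK chapter linked above (assumes nltk is installed);
        the tag pattern below chunks an optional determiner, any number of adjectives, and a noun:

```python
import nltk

# A toy POS-tagged sentence (Penn Treebank tags).
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

# Tag pattern: optional determiner, any number of adjectives, then a noun.
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunker = nltk.RegexpParser(grammar)

tree = chunker.parse(sentence)
print(tree)   # NP chunks are subtrees; the remaining tokens form chinks
```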
    • Accurate methods for the statistics of surprise and coincidence
  • An introduction to lexical association measures:
    • We use the term lexical association to refer to the association between words.
    • 3 types of association:
      • collocational association: restricting the combination of words into phrases
      • semantic association: reflecting the semantic relationship between words
      • cross-language association: corresponding to potential translations of words between different languages
  • Topics:
    • Knowledge/data -> Zipf's law
    • Hypothesis testing
    • Collocations
    • NER
  • Knowledge/data
    • Rationalist: driven by prior knowledge and hand-written rules
    • Empiricist: driven by observation (data)
    • Zipf’s law: the frequency of any word is inversely proportional to its rank in the frequency table.
      •                   Empiricist      Rationalist
      • core              frequent
      • periphery         infrequent      exception relation
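      A minimal sketch for checking Zipf's law empirically; the corpus file name below is a placeholder
      (any large tokenized text works). Rank * frequency should stay roughly constant for the top-ranked words.

```python
from collections import Counter

# Placeholder corpus file; substitute any large plain-text corpus.
tokens = open("corpus.txt", encoding="utf-8").read().lower().split()
counts = Counter(tokens)

# Zipf's law: frequency is inversely proportional to rank,
# so rank * frequency should be roughly constant.
for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
    print(f"{rank:>3}  {word:<15} freq={freq:<8} rank*freq={rank * freq}")
```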
    • Example 1
      • "Alice likes it / a book."
      • Question: what does Alice like?
      • Instead of using a string of words, use a tree structure to represent the sentence.
      • Tree sketch: [CP [Spec what] [IP Alice like]]
    • Example 2 (Rationalist)
      • "Which book did Mary say that Bill thinks Alice likes?"
      • "He dropped a book about info theory."
      • "What did he drop a book about?" (questionable) vs. "What did he write a book about?" (good)
      • He dropped [NP a book about info theory]: the bracketed phrase is an NP inside the S.
    • Example 3
      • What does Mary read?
      • Treebank-style parse: (SBARQ (WHNP (WP what)) (SQ (VBZ does) (NP Mary) (VP (VB read) (NP *T*-1))))
    • Balance between rationalist and empiricist
    • Accuracy vs. coverage
    • Data-driven algorithms are less concerned with overgeneration
    • Semi-supervised learning
    • How to relate the algorithms to the way we learn.
    • 80-20 rule
    • Remove the low-frequency words and focus on the frequent words when modeling the structure.
  • Hypothesis testing (Chi^2 test)
    • P(H|E) H: hypothesis, E: evidence
    • P(q = 0.7 | 76 successes, 24 failures)
    • pmf: probability mass function
    • R = P(H1|E) / P(H2|E): ratio of the posteriors of two competing hypotheses
    • Recipe for statistical hypothesis testing
      • Step 1: H0: null hypothesis (the hypothesis we try to reject, i.e., what we think is not true)
        • Example: coin flip: coin has only one side.
      • Step 2: Test statistic: a number we compute from the collected evidence
      • Step 3: Identify the distribution the test statistic would follow if H0 were true
      • Step 4: Pick a threshold (significance level) beyond which we are convinced that H0 is not true
      • Step 5: Do the experiment and calculate the test statistic
      • Step 6: If the result is sufficiently improbable assuming H0 is true, reject H0
        • H0: q = 0.5, H1: q != 0.5
    • Example:
      • H0: P(heads) = 0.5
      • R ~ Binomial(n, p)
      • Likelihood function: b(r; n, p) = C(n, r) p^r (1-p)^(n-r)
      • Test statistic: number of heads
      • Threshold:
        • alpha too small (e.g., 10^-10) -> more Type II errors ("miss", "false negative")
        • alpha too large (e.g., 0.2) -> more Type I errors ("false positive")
        • Typically, people choose alpha = 0.05
      • The test-statistic value at the boundary of the rejection region (2.5% in each tail for a two-sided alpha = 0.05) is the critical value.
      • Tips:
        • alpha is selected in advance (choosing it after seeing the data is p-hacking)
        • A test cannot prove that something is true; it can only show that something (H0) is unlikely to be true
        • The Chi^2 statistic is also written X^2
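      A minimal sketch of the coin-flip test above, assuming n = 100 flips with 76 heads (matching the
      "76 successes, 24 failures" line) and H0: q = 0.5; it sums the pmf b(r; n, p) over all outcomes at
      least as extreme as the observed one and compares the resulting p-value to alpha.

```python
from math import comb

def binom_pmf(r, n, p):
    """b(r; n, p) = C(n, r) * p^r * (1 - p)^(n - r)"""
    return comb(n, r) * p**r * (1 - p)**(n - r)

def two_sided_pvalue(heads, n, p0=0.5):
    """Exact two-sided test: sum P(r) over all outcomes r that are
    no more likely than the observed count under H0."""
    observed = binom_pmf(heads, n, p0)
    return sum(binom_pmf(r, n, p0) for r in range(n + 1)
               if binom_pmf(r, n, p0) <= observed)

n, heads, alpha = 100, 76, 0.05      # test statistic: number of heads
p_value = two_sided_pvalue(heads, n)
print(f"p-value = {p_value:.3g}, reject H0: {p_value < alpha}")
```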
    • Difference between Chi^2 test and T-test
    • How many trials do we need before the experiment is convincing?
    • Collocation: a collocation is a sequence of words or terms that co-occur more often than would be expected by chance.
      • "kick the bucket" = to die
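    A minimal collocation-extraction sketch, assuming nltk and its Brown corpus are available
    (nltk.download("brown") on first use); it ranks bigrams by Dunning's log-likelihood ratio,
    the measure from the "surprise and coincidence" reading.

```python
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# nltk.download("brown")   # uncomment on first run
words = [w.lower() for w in nltk.corpus.brown.words()]

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(5)   # ignore very rare bigrams

# Bigrams that co-occur far more often than chance would predict.
print(finder.nbest(measures.likelihood_ratio, 10))
```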

CMSC773: HW1

  • Question 2:
    • Word order:
      • Explanation: word order refers to the order in which words appear in a sentence:
        • Alexa, when is your birthday?
        • (Alexa answers)
        • Alexa, when your birthday is?
        • (Alexa answers)
        • This tests whether Alexa handles incorrect word order.
    • Inflectional morphology:
      • Explanation:
  • Question 3:
  • Question 4:
  • Question 5:
  • Question 6:
    • New definitions:
      • log-entropy weighting
      • cosine similarity
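      A minimal sketch of cosine similarity between two term vectors; the numpy vectors below are
      made-up stand-ins for rows of a log-entropy weighted term-document matrix.

```python
import numpy as np

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (||u|| * ||v||)"""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors; in practice these would be log-entropy weighted term vectors.
u = np.array([1.0, 0.5, 0.0, 2.0])
v = np.array([0.8, 0.4, 0.1, 1.5])
print(cosine_similarity(u, v))   # 1.0 = same direction, 0.0 = orthogonal
```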

CMSC773: pre-course knowledge

  • Dependency Parsing
    • Dependency: focuses on relations between words
      • Typed: Label indicating relationship between words
      • Untyped: only which words depend on which (no relation labels)
    • Phrase structure: focuses on identifying phrases and their recursive structure
    • Dependency parsing methods:
      • Shift reduce:
        • Predict from left to right
        • Fast, but slightly less accurate
        • MaltParser
      • Spanning tree:
        • Calculate full tree at once
        • Slightly more accurate, slower
        • MSTParser, Eda
      • Cascaded chunking
        • Chunk words into phrases, find heads, delete non-heads, repeat
        • CaboCha
    • Example
      • Maximum Spanning Tree
        • Each dependency is an edge in a directed graph
        • Assign each edge a score
        • Keep the tree with the highest score
      • Cascaded Chunking
        • Used mainly for Japanese
      • Shift-Reduce
        • Two data structures:
          • Queue: of unprocessed words
          • Stack: of partially processed words
        • Process: at each point, choose one of:
          • Shift: move one word from queue to stack
          • Reduce left: top word on stack is head of second word
          • Reduce right: second word on stack is head of top word
        • Classification for shift-reduce
          • We have a weight vector for each action ("shift", "reduce left", "reduce right"): Ws, Wl, Wr
          • Calculate a feature vector from the queue and stack: φ(queue, stack)
          • Multiply the weights by the features to get scores, e.g. Ss = Ws · φ(queue, stack)
          • Choose the action with the highest score
        • Features for Shift Reduce:
          • Features should generally cover at least the last stack entries and first queue entry
        • Algorithm:
          • Input:
            • weights:Ws, Wl, Wr
            • queue: [(1, word1, POS1), (2, word2, POS2), …]
            • stack: [(0, ROOT, ROOT)]
          • Output:
            • heads = [-1, head1, head2, …]
          • Pseudo code (see the Python sketch at the end of this section)
          • Training
            • Can be trained using perceptron algorithm
            • Do parsing; if the correct answer corr differs from the classifier's answer ans, update the weights
            • Shift-reduce training algorithm
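          A minimal greedy shift-reduce sketch following the description above. The hashed feature
          function phi, the feature dimension, and the random weights are illustrative assumptions;
          a MaltParser-style system uses sparse indicator features with perceptron-trained weights.

```python
import numpy as np

DIM = 16   # size of the hashed feature vector (illustrative)

def phi(queue, stack):
    """Toy feature function over the top two stack items and the first queue item.
    Real parsers use sparse indicator features; here we hash (position, word, POS)."""
    feats = np.zeros(DIM)
    context = [stack[-1] if len(stack) > 0 else None,
               stack[-2] if len(stack) > 1 else None,
               queue[0] if queue else None]
    for i, item in enumerate(context):
        if item is not None:
            _, word, pos = item
            feats[hash((i, word, pos)) % DIM] += 1.0
    return feats

def parse(queue, Ws, Wl, Wr):
    """Greedy parsing: pick the highest-scoring legal action at each step."""
    queue = list(queue)
    stack = [(0, "ROOT", "ROOT")]
    heads = {}                               # dependent index -> head index
    while queue or len(stack) > 1:
        f = phi(queue, stack)
        scores = {"shift": Ws @ f, "left": Wl @ f, "right": Wr @ f}
        if not queue:
            scores["shift"] = -np.inf        # nothing left to shift
        if len(stack) < 2:
            scores["left"] = scores["right"] = -np.inf
        elif stack[-2][0] == 0:
            scores["left"] = -np.inf         # ROOT can never become a dependent
        action = max(scores, key=scores.get)
        if action == "shift":
            stack.append(queue.pop(0))
        elif action == "left":               # top of stack is head of the second item
            dep = stack.pop(-2)
            heads[dep[0]] = stack[-1][0]
        else:                                # second item is head of the top of stack
            dep = stack.pop(-1)
            heads[dep[0]] = stack[-1][0]
    return heads

# Random weights just to exercise the code; perceptron training would set these.
rng = np.random.default_rng(0)
Ws, Wl, Wr = rng.normal(size=(3, DIM))
queue = [(1, "Mary", "NNP"), (2, "reads", "VBZ"), (3, "books", "NNS")]
print(parse(queue, Ws, Wl, Wr))   # {dependent: head}; arbitrary with untrained weights
```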
    • CKY (Cocke-Kasami-Younger) Parsing Algorithm & Earley Parsing Algorithms
      • Definitions:
        • CFG (Context-free Grammar): G = (N, Σ, P, S)
          • N is a finite set of non-terminal symbols
          • Σ is a finite set of terminal symbols (tokens)
          • P is a finite set of productions (rules)
            • Production : (α, β)∈P will be written α → β where α is in N and β is in (N∪Σ)*
          • S is the start symbol in N
        • CNF (Chomsky Normal Form):
          A context-free grammar where the right-hand side of each production is restricted to either two non-terminals or a single
          terminal, i.e., every production has one of the forms A -> BC or A -> a.

          Any CFG can be converted to a weakly equivalent grammar in CNF
      • Parsing Algorithms:
        • Should know:
          • CFGs are basis for describing (syntactic) structure of NL sentences
          • Parsing Algorithms are core of NL analysis systems
          • Recognition vs. Parsing:
            • Recognition: deciding membership in the language
            • Parsing: recognition plus producing a parse tree for the input
          • Parsing is more “difficult” than recognition (time complexity)
          • Ambiguity: an input may have exponentially many parses
        • Top-down vs. bottom-up
          • Top-down (goal driven): from start symbol down
          • Bottom-up (data driven): from the symbols up
        • Naive vs. dynamic programming
          • Naive: enumerate everything
          • Back tracking: try something, discard partial solutions
          • DP: save partial solutions in a table
        • CKY: Bottom-up DP
        • Earley Parsing: top-down DP
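        A minimal CKY recognizer sketch for a CFG in CNF (recognition only; a full parser would also
        store backpointers to recover trees). The toy grammar and sentence are made up for illustration.

```python
from collections import defaultdict

def cky_recognize(words, grammar, start="S"):
    """CKY recognition for a grammar in Chomsky Normal Form.
    grammar: rules (A, (B, C)) for A -> B C, or (A, (a,)) for A -> a."""
    n = len(words)
    chart = defaultdict(set)               # (i, j) -> non-terminals spanning words[i:j]
    lexical = defaultdict(set)             # terminal a -> {A | A -> a}
    binary = []                            # (A, B, C) for A -> B C
    for lhs, rhs in grammar:
        if len(rhs) == 1:
            lexical[rhs[0]].add(lhs)
        else:
            binary.append((lhs, rhs[0], rhs[1]))
    for i, w in enumerate(words):          # width-1 spans
        chart[(i, i + 1)] |= lexical[w]
    for width in range(2, n + 1):          # longer spans, bottom-up
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):      # split point
                for A, B, C in binary:
                    if B in chart[(i, k)] and C in chart[(k, j)]:
                        chart[(i, j)].add(A)
    return start in chart[(0, n)]

# Toy CNF grammar for "Mary reads books"
grammar = [("S", ("NP", "VP")), ("VP", ("V", "NP")),
           ("NP", ("Mary",)), ("NP", ("books",)), ("V", ("reads",))]
print(cky_recognize("Mary reads books".split(), grammar))   # True
```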
    • Chart Parsing:
    • Lambda Calculus
    • Perceptron Learning
    • Restricted Boltzmann Machines
    • Statistical Hypothesis Testing
    • Latent Semantic Analysis
    • Latent Dirichlet Allocation
    • Gibbs Sampling
    • Inter-rater Reliability
    • IBM Models for MT
    • Phrase-based Statistical MT
    • Hierarchical Phrase-based Statistical MT
    • Lambda Reductions
    • Probabilistic Soft Logic
    • Open Information Extraction
    • Semantic Role Labeling
    • Combinatory Categorial Grammar (CCG)
    • Tree Adjoining Grammar
    • Abstract Meaning Representation (AMR)
    • Wikification