2012-04-11

Maximum Entropy (Maxent) Models

http://spark-public.s3.amazonaws.com/nlp/slides/Maximum_Entropy_Classifiers_v2.pdf

So far we've looked at different types of statistical models:
- "Generative": language models, Naïve Bayes.
  - They model how the data is generated, using the joint probability P(c, d) as opposed to the conditional P(c | d).
- "Discriminative": conditional models.
  - They give high accuracy.
  - They make it easy to incorporate lots of linguistically important features.
  - They allow automatic building of language-independent, retargetable NLP modules.

Everything we've used until now has been generative, i.e. it used joint probabilities.
Conditional likelihood: this is what we'll try to maximize instead.

 Objective     Training Set Accuracy (%)   Test Set Accuracy (%)
 Joint Like.   86.8                        73.6
 Cond. Like.   98.5                        76.1

What does this table mean? With all features, smoothing, ... unchanged and only the training objective switched, maximizing conditional likelihood is more accurate on both the training set and the test set.

The conditional model asks a more direct question. Instead of "how often do 'a' and (hidden) class 'b' occur together?",
it's "given that we see 'a', how likely is class 'b'?"

Features:
- f1 (c, d) ≡ [c = LOCATION ∧ w_(-1) = “in” ∧ isCapitalized(w)] --> in Acadia
- f2 (c, d) ≡ [c = LOCATION ∧ hasAccentedLatinChar(w)] --> in Québec
- f3 (c, d) ≡ [c = DRUG ∧ ends(w, “c”)] --> Zantac
  - (c = hidden category, d = data)

Each feature gets a weight:
- Positive: Likely correct
- Negative: Likely incorrect

There are two types of expectation:
- Empirical (count): E_empirical(f_i) = Σ_{(c,d) ∈ observed(C,D)} f_i(c,d) --> just counting how many times feature f_i fires on the observed (c, d) pairs.
- Model: E_model(f_i) = Σ_{(c,d) ∈ (C,D)} P(c,d) · f_i(c,d)
  - C, D -> sets of classes and data.
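
A small sketch of the two expectations for a single feature. The observed pairs and the uniform joint distribution are made-up toy values, chosen only so the sums are easy to follow.

```python
from itertools import product


def f(c, d):
    # Toy feature: candidate class is DRUG and the word ends in "c".
    return int(c == "DRUG" and d.endswith("c"))


# Hypothetical observed (class, data) pairs.
observed = [
    ("DRUG", "Zantac"),
    ("LOCATION", "Québec"),
    ("DRUG", "aspirin"),
    ("LOCATION", "Acadia"),
]

# Empirical expectation: count how often the feature fires on observed pairs.
empirical = sum(f(c, d) for c, d in observed)

# Model expectation: sum over ALL (c, d) pairs, weighted by the model's P(c, d).
classes = ["DRUG", "LOCATION"]
data = ["Zantac", "Québec", "aspirin", "Acadia"]
pairs = list(product(classes, data))
P = {pair: 1.0 / len(pairs) for pair in pairs}  # assumed uniform toy model

model = sum(P[(c, d)] * f(c, d) for c, d in pairs)

print("empirical count:", empirical)
print("model expectation:", model)
```

The empirical value is a raw count while the model value is a probability-weighted sum, so one of them has to be rescaled by the number of observations before the two are compared.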

Features: a Boolean function of the data [Φ(d)] paired with a class [c_j].
We will say that Φ(d) is a feature of the data d when, for each c_j, the conjunction Φ(d) ∧ c = c_j is a feature of the data-class pair (c, d).

Feature-based models:
- Text categorization
- Word-sense disambiguation
- POS tagging

Only ACTIVE features matter for a decision about a data point.

Data: BUSINESS: Stocks hit a yearly low …
Label: BUSINESS
Features: {…, stocks, hit, a, yearly, low, …}

In text categorization, each feature pairs the presence of a word in the document (d) with the document class (c).
- Logistic regression was 86.4% accurate on the Reuters data set.
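
A rough sketch of how the active features for the "Stocks hit a yearly low" example might be collected. The whitespace tokenizer and lowercasing are assumptions, and only the words visible in the example are used (the elided parts are left out).

```python
def active_features(text):
    # Φ features: here simply the set of lowercased words present in the text.
    return {token.lower() for token in text.split()}


def joint_features(phi, c):
    # Pair every active Φ feature with a candidate class c,
    # giving the (Φ, class) features the model actually weights.
    return {(feature, c) for feature in phi}


doc = "Stocks hit a yearly low"
phi = active_features(doc)
print(sorted(phi))
print(sorted(joint_features(phi, "BUSINESS")))
```

Words that do not occur in the document contribute nothing, which is why only the active features matter for the decision.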

Building Maxent:

Features are defined over data points:
- Words, but also “word contains number”, “word ends with -ing”, etc.
Each Φ feature (active features, see above) is a string:
- Each f_i(c, d) ≡ [Φ(d) ∧ c = c_j] feature gets a real-number weight.
- We mostly talk about the Φ features, but the math is written in terms of the indexed features f_i.

Lots of math. The model gives P(c | d, λ), where λ are the parameters (the feature weights). More here: http://spark-public.s3.amazonaws.com/nlp/slides/Maximum_Entropy_Classifiers_v2.pdf, in the "Generative vs. Discriminative models: The problem of overcounting evidence" section.
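
To make P(c | d, λ) concrete: a maxent model sums the weights of the features that fire for each candidate class, exponentiates, and normalizes over the classes, i.e. P(c | d, λ) = exp(Σ_i λ_i f_i(c, d)) / Σ_c' exp(Σ_i λ_i f_i(c', d)). The sketch below reuses the three example features; the weights, the two-class set, and the dict layout of d are illustrative assumptions, not values from the notes.

```python
import math
import unicodedata

CLASSES = ("LOCATION", "DRUG")  # assumed class set for this toy example


def has_accent(word):
    # Assumption: accented means a combining mark appears after decomposition.
    return any(unicodedata.combining(ch) for ch in unicodedata.normalize("NFD", word))


def f1(c, d):
    return int(c == "LOCATION" and d["prev"] == "in" and d["word"][:1].isupper())


def f2(c, d):
    return int(c == "LOCATION" and has_accent(d["word"]))


def f3(c, d):
    return int(c == "DRUG" and d["word"].endswith("c"))


FEATURES = [f1, f2, f3]
WEIGHTS = [1.8, -0.6, 0.3]  # hypothetical λ values, one per feature


def score(c, d):
    # Linear score: sum of the weights of the features that fire for (c, d).
    return sum(w * f(c, d) for w, f in zip(WEIGHTS, FEATURES))


def p_class_given_data(d):
    # Exponentiate each class score and normalize so the values sum to 1.
    exp_scores = {c: math.exp(score(c, d)) for c in CLASSES}
    z = sum(exp_scores.values())
    return {c: v / z for c, v in exp_scores.items()}


probs = p_class_given_data({"prev": "in", "word": "Québec"})
print(probs)
```

With these made-up weights, LOCATION scores 1.8 - 0.6 = 1.2 and DRUG scores 0.3, so LOCATION ends up with a probability of about 0.71. A negative weight (here on f2) lowers the class score even though the feature fires.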

Info. Extraction/Named Entity Recognition.

http://spark-public.s3.amazonaws.com/nlp/slides/Information_Extraction_and_Named_Entity_Recognition_v2.pdf

http://start.csail.mit.edu
More precision/recall/F-measure: see 2012-04-04 for refresher

Assignment

Submitted to Coursera.

I would like to keep playing with this. I probably would adapt it a little because it's currently written to run over their submission server – I just need it to run locally.

Updated Code from Last Time

compare.py

http://cs.marlboro.edu/courses/spring2012/jims_tutorials/elias/2012-04-11
last modified Wednesday April 11 2012 12:30 am EDT
