2012-04-11
Maximum Entropy (Maxent) Models
- So far we've looked at different types of statistical models
- "Generative": Language models, Naïve Bayes.
- Can generate random data from the statistics. Uses the joint probability rather than the conditional: P(A,B) vs. P(B|A).
- "Discriminative": conditional models.
- They give high accuracy.
- They make it easy to incorporate lots of linguistically important features.
- They allow automatic building of language-independent, retargetable NLP modules.
- Everything we've used until now has been generative, i.e. it used joint probabilities.
- Conditional likelihood: this is what we'll try to maximize.
Training Set                Test Set
Objective     Accuracy      Objective     Accuracy
Joint Like.   86.8          Joint Like.   73.6
Cond. Like.   98.5          Cond. Like.   76.1
What do these tables mean? With all features, smoothing, ... unchanged, maximizing conditional likelihood gives higher accuracy than maximizing joint likelihood, on both the training set and the test set.
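To make the two objectives concrete for myself, here is a minimal sketch that computes both the joint and the conditional log-likelihood from the same toy distribution. The probabilities, the class names and the one-word "documents" are all made up for illustration; none of them come from the lecture.

```python
import math

# Toy joint distribution P(c, d) over (class, data) pairs. Invented numbers.
joint = {
    ("BUSINESS", "stocks"): 0.30,
    ("BUSINESS", "goal"): 0.05,
    ("SPORTS", "stocks"): 0.10,
    ("SPORTS", "goal"): 0.55,
}

def conditional(c, d, joint):
    """P(c | d) = P(c, d) / sum over all classes c' of P(c', d)."""
    marginal = sum(p for (_, d2), p in joint.items() if d2 == d)
    return joint[(c, d)] / marginal

def joint_log_likelihood(pairs, joint):
    """Sum of log P(c, d) over the labelled pairs."""
    return sum(math.log(joint[(c, d)]) for c, d in pairs)

def conditional_log_likelihood(pairs, joint):
    """Sum of log P(c | d) over the labelled pairs."""
    return sum(math.log(conditional(c, d, joint)) for c, d in pairs)

# A tiny labelled "training set" (hypothetical).
pairs = [("BUSINESS", "stocks"), ("SPORTS", "goal"), ("SPORTS", "goal")]
print(joint_log_likelihood(pairs, joint))
print(conditional_log_likelihood(pairs, joint))
```

The conditional objective only scores how well the class is predicted given the data; it does not spend any effort on modelling how likely the data itself is.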
- We are told more about the system. Instead of just "how often does 'a' occur together with (hidden) class 'b'?", it's "given that we see 'a', how likely is class 'b'?"
- Features:
- f1 (c, d) ≡ [c = LOCATION ∧ w_(-1) = “in” ∧ isCapitalized(w)] --> in Acadia
- f2 (c, d) ≡ [c = LOCATION ∧ hasAccentedLatinChar(w)] --> in Québec
- f3 (c, d) ≡ [c = DRUG ∧ ends(w, “c”)] --> Zantac
- (c = hidden category, d = data)
- Each feature gets a weight:
- Positive: this class/data configuration is likely correct
- Negative: this class/data configuration is likely incorrect
- There are two types of expectation:
- Empirical (count): E_empirical(f_i) = sum {over (c,d) in observed(C,D)} f_i(c,d) --> just counting how many times feature f_i fires on the observed (class, data) pairs.
- Model: E(f_i) = sum {over (c,d) in (C,D)} P(c,d) · f_i(c,d)
- C,D -> Sets of classes and data.
- Features: a Boolean test on the data [Φ(d)] combined with a class [c_j]
- We will say that Φ(d) is a feature of the data d, when, for each c_j, the conjunction Φ(d) ∧ c = c_j is a feature of the data-class pair (c, d)
- Feature-based models:
- Text categorization
- Word-sense disambiguation
- POS tagging
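The three example features and the two expectations above can be sketched as follows. This is only my own illustration: I assume a data point d is a (previous word, current word) pair, the context word "taking" and the uniform toy distribution are invented, and the accent check is a simplification that does not verify the character is Latin.

```python
import unicodedata

def is_capitalized(w):
    return w[:1].isupper()

def has_accented_latin_char(w):
    # Simplification: a character counts as accented if its decomposed
    # form (NFD) contains a combining mark, e.g. "é" -> "e" + acute accent.
    return any(unicodedata.combining(ch) for ch in unicodedata.normalize("NFD", w))

# Each feature is an indicator (0 or 1) over a (class, data) pair.
def f1(c, d):
    prev, w = d
    return int(c == "LOCATION" and prev == "in" and is_capitalized(w))

def f2(c, d):
    _, w = d
    return int(c == "LOCATION" and has_accented_latin_char(w))

def f3(c, d):
    _, w = d
    return int(c == "DRUG" and w.endswith("c"))

# Observed (class, data) pairs; "taking" is a made-up context word.
observed = [
    ("LOCATION", ("in", "Acadia")),
    ("LOCATION", ("in", "Québec")),
    ("DRUG", ("taking", "Zantac")),
]

def empirical_expectation(f, pairs):
    """Count how many observed pairs the feature fires on."""
    return sum(f(c, d) for c, d in pairs)

def model_expectation(f, joint):
    """Sum of P(c, d) * f(c, d) over the pairs the model assigns mass to."""
    return sum(p * f(c, d) for (c, d), p in joint.items())

# Hypothetical model: uniform probability over the three observed pairs.
toy_joint = {pair: 1 / len(observed) for pair in observed}

print(empirical_expectation(f1, observed))  # 2: fires on Acadia and Québec
print(empirical_expectation(f2, observed))  # 1: fires on Québec only
print(empirical_expectation(f3, observed))  # 1: fires on Zantac only
print(model_expectation(f1, toy_joint))
```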
Only ACTIVE features matter for a decision about a data point.
Data: BUSINESS: Stocks hit a yearly low …
Label: BUSINESS
Features: {…, stocks, hit, a, yearly, low, …}
- In text categorization, each feature pairs the presence of a word in the document (d) with the document class (c)
- Logistic regression was 86.4% accurate on the Reuters data set
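A rough sketch of why only the active features matter: each class is scored by summing the weights of the (word, class) features that actually fire on the document, and everything else contributes nothing. The weights and the SPORTS class are invented, and the exp-and-normalize step is the usual maxent (softmax) form, which these notes have not written down yet, so treat it as an assumption here.

```python
import math

# Hypothetical weights for (word, class) features; unseen pairs weigh 0.
weights = {
    ("stocks", "BUSINESS"): 1.5,
    ("yearly", "BUSINESS"): 0.4,
    ("low", "BUSINESS"): 0.3,
    ("hit", "SPORTS"): 0.8,
}
classes = ["BUSINESS", "SPORTS"]

def class_scores(words, classes, weights):
    """Sum the weights of the features that are active for each class."""
    present = set(words)  # word presence, not word counts
    return {c: sum(weights.get((w, c), 0.0) for w in present) for c in classes}

def class_probs(scores):
    """Turn scores into probabilities by exponentiating and normalizing."""
    z = sum(math.exp(s) for s in scores.values())
    return {c: math.exp(s) / z for c, s in scores.items()}

doc = ["stocks", "hit", "a", "yearly", "low"]
scores = class_scores(doc, classes, weights)
probs = class_probs(scores)
best = max(probs, key=probs.get)
print(scores, probs, best)
```

Words with no weight for a class (like "a") simply drop out of the sum, which is the sense in which inactive features are irrelevant to the decision.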
Building Maxent:
- Features defined over data points
- Words, but also “word contains number”, “word ends with -ing”, etc.
- Each Φ feature (an active feature, see above) is encoded as a string
- Each feature f_i(c, d) ≡ [Φ(d) ∧ c = c_j] gets a real-number weight.
- The focus is on the Φ features, but the math refers to features by their index i (f_i)
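A minimal sketch of how I picture this, not the assignment code: Φ features are plain strings computed from a word, and pairing each string with a class gives one indexed feature f_i. The string names ("word=...", "contains-number", "ends-with-ing") and the class labels "O" and "ORG" are my own placeholders.

```python
def phi_features(word):
    """Return the active Φ features of a word, each encoded as a string."""
    feats = ["word=" + word.lower()]
    if any(ch.isdigit() for ch in word):
        feats.append("contains-number")
    if word.endswith("ing"):
        feats.append("ends-with-ing")
    return feats

def build_index(words, classes):
    """Give every (Φ string, class) pair its own integer index i."""
    index = {}
    for w in words:
        for phi in phi_features(w):
            for c in classes:
                # len(index) is the next unused index; existing keys are kept.
                index.setdefault((phi, c), len(index))
    return index

classes = ["O", "ORG"]  # hypothetical label set
index = build_index(["running", "B2B"], classes)

# One real-valued weight per indexed feature, all starting at zero.
weights = [0.0] * len(index)
print(phi_features("running"))
print(len(index))
```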
Info. Extraction/Named Entity Recognition.
Assignment
Submitted to Coursera.
I would like to keep playing with this. I probably would adapt it a little because it's currently written to run over their submission server – I just need it to run locally.
Updated Code from Last Time