2012-03-28
Stanford's online NLP Class started over Spring Break.
Week one
- Covered text processing (including Regular Expressions) and edit distance.
Text processing
Topics covered:
- RegEx
- Tokenization
- Stemming
- Sentence Segmentation
Edit Distance
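As a rough illustration of edit distance, here is a minimal dynamic-programming sketch. The function name `edit_distance` and the `sub_cost` parameter are my own choices for this example; insertions and deletions are assumed to cost 1, and the substitution cost is left configurable because different presentations use 1 or 2.

```python
def edit_distance(source: str, target: str, sub_cost: int = 1) -> int:
    """Minimum edit distance between two strings (hypothetical helper).

    Assumes insertion and deletion cost 1; substitution costs `sub_cost`.
    """
    n, m = len(source), len(target)
    # dist[i][j] = cheapest way to turn source[:i] into target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i  # delete every character of source[:i]
    for j in range(1, m + 1):
        dist[0][j] = j  # insert every character of target[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + 1,                            # deletion
                dist[i][j - 1] + 1,                            # insertion
                dist[i - 1][j - 1] + (0 if same else sub_cost),  # match or substitution
            )
    return dist[n][m]
```

For example, `edit_distance("kitten", "sitting")` gives 3 with the default costs.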
Week two
- Language Modeling and spelling correction
Language Modeling
- N-grams
- Interpolation
- P̂(w_n | w_(n-1), w_(n-2)) = λ_1 P(w_n | w_(n-1), w_(n-2)) + λ_2 P(w_n | w_(n-1)) + λ_3 P(w_n), where Σ λ_i = 1
- Training | Held-Out | Test
Choose the λ values that maximize the likelihood of the Held-Out data.
- Unknown-word marker (<UNK>): train it as if it were a regular word.
- Pruning: remove n-grams with count 1 (Zipf's Law says there are many of them), or prune based on entropy.
- Huffman coding (fit a large number of words into 2 bytes each)
- Stupid Backoff Interpolation
- Add-One Smoothing (Laplace)
- Good-Turing Smoothing
- Estimate the total probability of things never seen as Count(things seen only once) / Count(all things seen).
- Kneser-Ney Smoothing
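To make two of the items above concrete, here is a small sketch of linear interpolation over trigram, bigram, and unigram estimates, plus the Good-Turing estimate of the probability mass for unseen things. The function names (`ngrams`, `interpolated_prob`, `good_turing_unseen_mass`) are hypothetical, and the default λ values are arbitrary placeholders; in practice they would be chosen on held-out data.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def interpolated_prob(w, prev1, prev2, tokens, lambdas=(0.5, 0.3, 0.2)):
    """Interpolated estimate of P(w | prev1, prev2).

    prev1 is the word right before w (w_(n-1)); prev2 is the one before that.
    The lambdas are placeholder weights (an assumption) and must sum to 1.
    """
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9, "lambdas must sum to 1"
    uni = Counter(tokens)
    bi = Counter(ngrams(tokens, 2))
    tri = Counter(ngrams(tokens, 3))
    # Maximum-likelihood estimates; fall back to 0 when the context is unseen.
    p_uni = uni[w] / len(tokens)
    p_bi = bi[(prev1, w)] / uni[prev1] if uni[prev1] else 0.0
    p_tri = (tri[(prev2, prev1, w)] / bi[(prev2, prev1)]
             if bi[(prev2, prev1)] else 0.0)
    return l1 * p_tri + l2 * p_bi + l3 * p_uni


def good_turing_unseen_mass(tokens):
    """Count(things seen only once) / Count(all things seen)."""
    counts = Counter(tokens)
    seen_once = sum(1 for c in counts.values() if c == 1)
    return seen_once / len(tokens)


tokens = "the cat sat on the mat the cat ran".split()
p = interpolated_prob("cat", "the", "on", tokens)  # trigram unseen, bigram seen
unseen = good_turing_unseen_mass(tokens)           # 4 singletons out of 9 tokens
```

Here the trigram "on the cat" never occurs, so the estimate for "cat" comes entirely from the bigram and unigram terms. That is the point of interpolating rather than relying on the highest-order model alone.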
Spelling Correction
- Noisy channel
- Real world application
- State-of-the-art
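The noisy channel idea can be sketched as picking the candidate word w that maximizes P(typed | w) × P(w). The code below is only a toy under strong assumptions: candidates are limited to known words within one edit of the typed string, the prior P(w) comes from a hypothetical `word_counts` dictionary, and the channel model is a crude placeholder (`p_no_error`) instead of a real error model estimated from data.

```python
import string


def edits1(word):
    """All strings one insertion, deletion, substitution, or transposition away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct(typed, word_counts, p_no_error=0.9):
    """Return the known word maximizing channel(typed | w) * prior(w).

    `word_counts` and `p_no_error` are illustrative assumptions: the channel
    score is p_no_error if w equals the typed string, else 1 - p_no_error.
    """
    total = sum(word_counts.values())
    candidates = {w for w in edits1(typed) | {typed} if w in word_counts}
    if not candidates:
        return typed  # nothing known nearby; leave the input unchanged

    def score(w):
        prior = word_counts[w] / total
        channel = p_no_error if w == typed else 1 - p_no_error
        return channel * prior

    return max(candidates, key=score)


counts = {"the": 100, "then": 10, "ten": 5}  # made-up counts for illustration
fixed = correct("teh", counts)
```

With these made-up counts, "teh" has two known candidates one edit away ("the" and "ten"), and the much higher prior for "the" wins.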