Jim's Tutorials

Spring 2012

2012-03-28

Stanford's online NLP Class started over Spring Break.

Week one

Covered text processing (including Regular Expressions) and edit distance.

Text processing

Topics covered:

RegEx
Tokenization
Stemming
Sentence Segmentation
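
A minimal sketch of regex-based tokenization and sentence segmentation, two of the topics above. The pattern and the names `TOKEN_RE`, `tokenize` and `split_sentences` are illustrative assumptions, not anything taken from the class materials.

```python
import re

# Assumed token pattern: words (allowing internal apostrophes or hyphens),
# numbers with an optional decimal part, or a single punctuation mark.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*|\d+(?:\.\d+)?|[^\w\s]")


def tokenize(text):
    """Return the list of tokens matched by TOKEN_RE, in order."""
    return TOKEN_RE.findall(text)


def split_sentences(text):
    """Naive segmentation: break after . ! or ? when followed by
    whitespace and a capital letter. Abbreviations such as "Dr. Smith"
    will be split incorrectly; a real segmenter needs more rules."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())


print(tokenize("Don't panic, it costs 3.50!"))
print(split_sentences("It works. Does it? Yes!"))
```

Stemming is left out here because even a simple Porter-style stemmer needs a long rule list.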

Edit Distance

Definition
Computation: Hamming, Levenshtein
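
The two distances named above can be sketched as follows. This is one straightforward way to compute them, assuming unit costs for insertion and deletion; the `sub_cost` parameter is an added assumption so that substitutions can be charged 1 or 2.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b, sub_cost=1):
    """Minimum edit distance by dynamic programming, keeping one row
    of the table at a time. Insertions and deletions cost 1."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                              # delete x
                curr[j - 1] + 1,                          # insert y
                prev[j - 1] + (0 if x == y else sub_cost),  # match or substitute
            ))
        prev = curr
    return prev[-1]


print(hamming("karolin", "kathrin"))        # 3
print(levenshtein("kitten", "sitting"))     # 3
print(levenshtein("intention", "execution", sub_cost=2))  # 8
```

Hamming distance only allows substitutions, which is why it requires equal lengths; Levenshtein also allows insertions and deletions.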

Week two

Language Modeling and spelling correction

Language Modeling

N-grams
Interpolation
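- P̂(w_n|w_(n-1),w_(n-2)) = λ_1 P(w_n|w_(n-1),w_(n-2)) + λ_2 P(w_n|w_(n-1)) + λ_3 P(w_n), where Sum(λ_i) = 1
- Split the data: Training | Held-Out | Test
  Choose the λ values that maximize the probability of the Held-Out data
- Unknown words: add an unknown-word marker (<UNK>) and train it as a regular word.
- Pruning: remove n-grams with count 1 (by Zipf's Law there are many of them) or prune by entropy
- Huffman coding (fits a large number of words into 2 bytes each)
- Stupid Backoff (a simpler alternative to interpolation for very large models)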
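
A rough sketch of linear interpolation and of picking the λ weights on held-out data. The function names are hypothetical, the probabilities are plain maximum-likelihood estimates, and the grid search is a simplification chosen for brevity; it is not claimed to be the method used in the class.

```python
import math
from collections import Counter


def train_counts(tokens):
    """Unigram, bigram and trigram counts from a token list."""
    return (Counter(tokens),
            Counter(zip(tokens, tokens[1:])),
            Counter(zip(tokens, tokens[1:], tokens[2:])))


def interpolated_prob(w, prev1, prev2, counts, lambdas):
    """Interpolated P(w | prev2, prev1), where prev1 = w_(n-1) and
    prev2 = w_(n-2). Uses raw counts as denominators, which is slightly
    off for the final tokens of the corpus; acceptable for a sketch."""
    unigrams, bigrams, trigrams = counts
    l1, l2, l3 = lambdas
    total = sum(unigrams.values())
    p_uni = unigrams[w] / total if total else 0.0
    p_bi = bigrams[(prev1, w)] / unigrams[prev1] if unigrams[prev1] else 0.0
    ctx = bigrams[(prev2, prev1)]
    p_tri = trigrams[(prev2, prev1, w)] / ctx if ctx else 0.0
    return l1 * p_tri + l2 * p_bi + l3 * p_uni


def held_out_log_prob(tokens, counts, lambdas):
    """Log probability of the held-out tokens under the interpolated model."""
    total = 0.0
    for w2, w1, w in zip(tokens, tokens[1:], tokens[2:]):
        p = interpolated_prob(w, w1, w2, counts, lambdas)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total


def choose_lambdas(held_out, counts, n=10):
    """Grid search over weights (i/n, j/n, k/n) with i + j + k = n."""
    best, best_ll = None, float("-inf")
    for i in range(n + 1):
        for j in range(n + 1 - i):
            lambdas = (i / n, j / n, (n - i - j) / n)
            ll = held_out_log_prob(held_out, counts, lambdas)
            if ll > best_ll:
                best, best_ll = lambdas, ll
    return best


counts = train_counts("a b a b a c".split())
print(interpolated_prob("b", "a", "b", counts, (0.6, 0.3, 0.1)))
print(choose_lambdas("a b a b".split(), counts))
```

The counts come from the training data only; the held-out data is used solely to score candidate λ values.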

Add-One Smoothing (Laplace)
Good-Turing Smoothing
- Estimate the probability of things never seen as N_1/N = Count(things seen only once)/Count(all things seen).
Kneser-Ney Smoothing
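
A small sketch of two of the estimates above: add-one (Laplace) smoothing for bigrams and the Good-Turing estimate of the probability mass for unseen items. The function names and the toy counts are assumptions made for illustration; Kneser-Ney is not shown.

```python
from collections import Counter


def add_one_bigram_prob(w, prev, unigrams, bigrams):
    """Laplace estimate: (c(prev, w) + 1) / (c(prev) + V),
    where V is the vocabulary size."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + vocab_size)


def good_turing_unseen_mass(counts):
    """N_1 / N: the number of types seen exactly once, divided by
    the total number of observations."""
    n1 = sum(1 for c in counts.values() if c == 1)
    n = sum(counts.values())
    return n1 / n if n else 0.0


tokens = "a b a b a c".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
print(add_one_bigram_prob("b", "a", unigrams, bigrams))  # (2 + 1) / (3 + 3) = 0.5
print(add_one_bigram_prob("c", "b", unigrams, bigrams))  # (0 + 1) / (2 + 3) = 0.2

# Toy catch counts: three species were seen exactly once out of 18 fish.
catch = Counter({"carp": 10, "perch": 3, "whitefish": 2,
                 "trout": 1, "salmon": 1, "eel": 1})
print(good_turing_unseen_mass(catch))  # 3 / 18
```

Add-one gives every unseen bigram the same small probability, while the Good-Turing estimate uses the singletons to decide how much total mass to reserve for unseen events.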

Spelling Correction

Noisy channel
Real world application
State-of-the-art
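
A minimal sketch of noisy-channel correction: for an observed word x, pick the candidate w that maximizes P(x|w) * P(w). The channel model here is a deliberately crude assumption (a fixed `p_typo` for any single edit) standing in for a real confusion-matrix model, and the names `edits1` and `correct` are hypothetical.

```python
import string
from collections import Counter


def edits1(word):
    """All strings one deletion, transposition, substitution or
    insertion away from word (lowercase ASCII letters only)."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct(x, word_counts, p_typo=0.05):
    """Return argmax over candidates w of P(x|w) * P(w).
    Assumed channel: P(x|w) = 1 - p_typo if w == x, else p_typo."""
    total = sum(word_counts.values())
    candidates = {w for w in edits1(x) | {x} if w in word_counts}
    if not candidates:
        return x  # nothing known within one edit; leave the word alone

    def score(w):
        prior = word_counts[w] / total            # language model P(w)
        channel = (1 - p_typo) if w == x else p_typo  # channel model P(x|w)
        return prior * channel

    return max(candidates, key=score)


counts = Counter({"the": 100, "then": 10, "they": 5})
print(correct("teh", counts))   # the
print(correct("then", counts))  # then
```

Here a unigram count plays the role of the language model; a bigram or trigram model over the surrounding words could be substituted for P(w).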

http://cs.marlboro.edu/courses/spring2012/jims_tutorials/elias/2012-03-28
last modified Wednesday March 28 2012 10:23 am EDT