2012-03-07

NLP-Class updates

Official start date has been scheduled for 3/12.
First week's lectures, assignment, and quiz are up.
I've watched half of the lectures.

Started looking at the first assignment for that class, due 03-27.
It's data extraction: using regexes to pull phone numbers (e.g. (###)###-####) and email addresses (e.g. abc@xyz.com, qwerty@bar.foo.edu) out of a file.
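
A first rough pass at that kind of extraction might look like the sketch below. The two patterns and the helper name extract_contacts are my own assumptions for illustration, not the assignment's reference solution, and they only cover the plain formats mentioned above (no obfuscated addresses).

```python
import re

# Assumed patterns, for illustration only: a 3-3-4 digit phone number with
# optional parentheses and separators, and a simple dotted email address.
PHONE_RE = re.compile(r"\(?(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

def extract_contacts(text):
    """Return (phones, emails); phones are normalized to ###-###-####."""
    phones = ["-".join(m.groups()) for m in PHONE_RE.finditer(text)]
    emails = EMAIL_RE.findall(text)
    return phones, emails

sample = "Call (555)123-4567 or mail abc@xyz.com and qwerty@bar.foo.edu."
print(extract_contacts(sample))
# (['555-123-4567'], ['abc@xyz.com', 'qwerty@bar.foo.edu'])
```

For a real file, the same function could be applied to the result of open(path).read(), or line by line.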

Tutorial Stuff

Looking at "make_alignment_probabilities".

 P(a, f|e) = { Prod(P(f_1|e_1), P(f_2|e_1), ..., P(f_n|e_1)),
               Prod(P(f_1|e_2), P(f_2|e_2), ..., P(f_n|e_2)),
               Prod(P(f_1|e_3), P(f_2|e_3), ..., P(f_n|e_3)),
               ...
             }

Alignments

Assuming the sentence corpus is:

 [[["the", "blue", "house"], ["la", "maison", "bleue"]],
  [["blue", "house"], ["maison", "bleue"]],
  [["blue"], ["bleue"]]]

Possible alignments are:

   Eng    Fre      P(a, f|e)
 #Sentence 1
   the    la        0.333
   the    maison    0.333
   the    bleue     0.333
   blue   la        0.333
   blue   maison    0.333
   blue   bleue     0.333
   house  la        0.333
   house  maison    0.333
   house  bleue     0.333
 
 #Sentence 2
   blue   maison    0.333
   blue   bleue     0.333
   house  maison    0.333
   house  bleue     0.333
 
 #Sentence 3
   blue   bleue     0.333
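
The word pairs in the table above can be generated mechanically: for each sentence pair, take every English word against every French word. A minimal sketch, where candidate_pairs is a hypothetical helper name (not one of the tutorial's functions):

```python
from itertools import product

corpus = [[["the", "blue", "house"], ["la", "maison", "bleue"]],
          [["blue", "house"], ["maison", "bleue"]],
          [["blue"], ["bleue"]]]

def candidate_pairs(corpus):
    """For each [english, french] sentence pair, list every (e, f) word pair."""
    return [list(product(e_s, f_s)) for e_s, f_s in corpus]

for number, pairs in enumerate(candidate_pairs(corpus), start=1):
    print("#Sentence", number)
    for e, f in pairs:
        print("  ", e, f)
```

This yields 9, 4, and 1 pairs for the three sentences, in the same order as the table.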

Pseudocode (p. 25):

 initialize t(f|e) uniformly
 do
   set count(f|e) to 0 for all f,e
   set total(e) to 0 for all e
   for all sentence pairs (f_s,e_s)
     for all unique words f in f_s
       n_f = count of f in f_s
       total_s = 0
       for all unique words e in e_s
         total_s += t(f|e) * n_f
       for all unique words e in e_s
         n_e = count of e in e_s
         count(f|e) += t(f|e) * n_f * n_e / total_s
         total(e) += t(f|e) * n_f * n_e / total_s
   for all e in domain( total(.) )
     for all f in domain( count(.|e) )
       t(f|e) = count(f|e) / total(e)
 until convergence
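
The pseudocode above could be translated into Python roughly as follows. This is a sketch under a few assumptions of mine: the name train_model1 is hypothetical, the corpus uses the [english, french] pair layout from these notes, t is stored as t[e][f], and a fixed number of iterations stands in for "until convergence".

```python
from collections import Counter, defaultdict

def train_model1(corpus, iterations=10):
    """Estimate t[e][f] from a list of [english_words, french_words] pairs."""
    french_vocab = {f for _, f_s in corpus for f in f_s}
    english_vocab = {e for e_s, _ in corpus for e in e_s}
    # initialize t(f|e) uniformly over the French vocabulary
    uniform = 1.0 / len(french_vocab)
    t = {e: {f: uniform for f in french_vocab} for e in english_vocab}
    # assumption: a fixed iteration count replaces the convergence test
    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))  # count(f|e)
        total = defaultdict(float)                       # total(e)
        for e_s, f_s in corpus:
            e_counts, f_counts = Counter(e_s), Counter(f_s)
            for f, n_f in f_counts.items():
                total_s = sum(t[e][f] * n_f for e in e_counts)
                for e, n_e in e_counts.items():
                    delta = t[e][f] * n_f * n_e / total_s
                    count[e][f] += delta
                    total[e] += delta
        # re-estimate t(f|e) for every pair that received counts
        for e in total:
            for f in count[e]:
                t[e][f] = count[e][f] / total[e]
    return t

corpus = [[["the", "blue", "house"], ["la", "maison", "bleue"]],
          [["blue", "house"], ["maison", "bleue"]],
          [["blue"], ["bleue"]]]
t = train_model1(corpus, iterations=50)
for e in sorted(t):
    print(e, max(t[e], key=t[e].get))
```

On this toy corpus the most probable French word for each English word should settle on the/la, blue/bleue, house/maison as the iterations proceed.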

 >>> corpus = [[["the", "blue", "house"], ["la", "maison", "bleue"]], [["blue", "house"], ["maison", "bleue"]], [["blue"], ["bleue"]]]
 >>> params2 = make_alignment_parameters(corpus)
 {'blue': {'maison': 0.33333333333333331, 'bleue': 0.33333333333333331, 'la': 0.33333333333333331},
  'house': {'maison': 0.33333333333333331, 'bleue': 0.33333333333333331, 'la': 0.33333333333333331},
  'the': {'maison': 0.33333333333333331, 'bleue': 0.33333333333333331, 'la': 0.33333333333333331}}
 >>> align2 = make_alignment_probabilities(corpus, params2)
 [{'blue': {'maison': 0.1111111111111111, 'bleue': 0.1111111111111111, 'la': 0.1111111111111111},
   'house': {'maison': 0.1111111111111111, 'bleue': 0.1111111111111111, 'la': 0.1111111111111111},
   'the': {'maison': 0.1111111111111111, 'bleue': 0.1111111111111111, 'la': 0.1111111111111111}},
  {'blue': {'maison': 0.25, 'bleue': 0.25},
   'house': {'maison': 0.25, 'bleue': 0.25}},
  {'blue': {'bleue': 1.0}}]
 ...
 >>> normalize(corpus, align2)
 {'blue': {'bleue': 1.0},
  'house': {'maison': 0.5, 'bleue': 0.5},
  'the': {'maison': 0.33333333333333331, 'bleue': 0.33333333333333331, 'la': 0.33333333333333331}}

Comments from Tutorial

Stop working on IBM models -- NLP-Class has started.
If I did continue, the align2 method would probably need changes: look at E_i and F_i (i from 0 to len(E)-1, or len(F)-1), not E_i and F_j.

http://cs.marlboro.edu/courses/spring2012/jims_tutorials/elias/2012-03-07
last modified Tuesday March 27 2012 10:07 pm EDT
