Jim's Tutorials

Spring 2012

2012-03-07

NLP-Class updates

Started looking at the first assignment for that class, due 03-27.
It's data extraction: specifically, using regexes to extract phone numbers (e.g. (###)###-####) and email addresses (e.g. abc@xyz.com, qwerty@bar.foo.edu) from a file.
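A rough sketch of what the regex part might look like. The two patterns and the helper name extract_contacts are my own guesses for illustration, not taken from the assignment, and the sample numbers are made up; the real data likely needs broader patterns.

```python
import re

# Hypothetical patterns: they cover (###)###-#### and ###-###-#### phone
# numbers and simple name@host.tld addresses, nothing more exotic.
PHONE_RE = re.compile(r"\(?(\d{3})\)?[\s-]?(\d{3})-(\d{4})")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def extract_contacts(text):
    """Return (phones, emails); phones are normalized to ###-###-####."""
    phones = ["-".join(m.groups()) for m in PHONE_RE.finditer(text)]
    emails = EMAIL_RE.findall(text)
    return phones, emails

sample = "Call (555)123-4567 or 555-123-4567; mail abc@xyz.com, qwerty@bar.foo.edu."
phones, emails = extract_contacts(sample)
print(phones)
print(emails)
```

Normalizing the phone numbers to one format makes it easier to compare the output against an answer key, assuming the assignment grades on a canonical form.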

Tutorial Stuff

Looking at "make_alignment_probabilities".
P(a, f|e) = { Prod(P(f_1|e_1), P(f_2|e_1), ..., P(f_n|e_1)),
              Prod(P(f_1|e_2), P(f_2|e_2), ..., P(f_n|e_2)),
              Prod(P(f_1|e_3), P(f_2|e_3), ..., P(f_n|e_3)),
              ... }

Alignments

Assume the sentence corpus is:
[[["the", "blue", "house"], ["la", "maison", "bleue"]], [["blue", "house"], ["maison", "bleue"]], [["blue"], ["bleue"]]]
Possible alignments are:
Eng     Fre      P(a, f|e)
#Sentence 1
the     la       0.333
the     maison   0.333
the     bleue    0.333
blue    la       0.333
blue    maison   0.333
blue    bleue    0.333
house   la       0.333
house   maison   0.333
house   bleue    0.333
#Sentence 2
blue    maison   0.333
blue    bleue    0.333
house   maison   0.333
house   bleue    0.333
#Sentence 3
blue    bleue    0.333

Sentence 1 ("the blue house"), six alignments:
    la maison bleue  |  la bleue maison
    maison bleue la  |  maison la bleue
    bleue maison la  |  bleue la maison

Sentence 2 ("blue house"), two alignments:
    maison bleue  |  bleue maison

Sentence 3 ("blue"), one alignment:
    bleue

Initial alignment probabilities, P(a, f|e):
    Sent1: 0.333*0.333*0.333 ≈ 0.0369
    Sent2: 0.333*0.333 ≈ 0.111
    Sent3: 0.333

P(a, f|e) = PI[t(f_j|e_{a_j})], j = 1..m
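To check the Sentence 1 numbers, here is a small sketch that multiplies t(f_j|e_{a_j}) over each alignment. It assumes a uniform starting table of 1/3 and only one-to-one alignments (each ordering of the French words), as in the diagram; alignment_probability is a name made up for this example, not a function from translation_v4.py.

```python
from itertools import permutations

def alignment_probability(english, french, t):
    """Multiply t(f_j|e_a_j) over word pairs; french[j] is aligned to english[j]."""
    p = 1.0
    for e, f in zip(english, french):
        p *= t[e][f]
    return p

english = ["the", "blue", "house"]
french = ["la", "maison", "bleue"]

# Uniform starting table: t(f|e) = 1/3 for every pair (assumed, as in the notes).
t = {e: {f: 1.0 / len(french) for f in french} for e in english}

# One-to-one alignments only: each ordering of the French words is one alignment.
probs = {ordering: alignment_probability(english, ordering, t)
         for ordering in permutations(french)}

for ordering, p in probs.items():
    print(" ".join(ordering), round(p, 4))
```

With exact thirds each of the six products is about 0.037; the 0.0369 figure comes from rounding to 0.333 before multiplying.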

Pseudocode p.25

initialize t(f|e) uniformly
do
    set count(f|e) to 0 for all f,e
    set total(e) to 0 for all e
    for all sentence pairs (f_s,e_s)
        for all unique words f in f_s
            n_f = count of f in f_s
            total_s = 0
            for all unique words e in e_s
                total_s += t(f|e) * n_f
            for all unique words e in e_s
                n_e = count of e in e_s
                count(f|e) += t(f|e) * n_f * n_e / total_s
                total(e) += t(f|e) * n_f * n_e / total_s
    for all e in domain( total(.) )
        for all f in domain( count(.|e) )
            t(f|e) = count(f|e) / total(e)
until convergence
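One way the pseudocode might translate to Python, as a sketch rather than what translation_v4.py actually does. The name train_model1, the t[e][f] dictionary layout, and the fixed iteration count (standing in for "until convergence") are assumptions of this example.

```python
from collections import Counter, defaultdict

def train_model1(corpus, iterations=10):
    """EM estimate of t(f|e), stored as t[e][f].

    corpus is a list of [english_words, french_words] pairs.
    A fixed iteration count stands in for "until convergence".
    """
    e_vocab = {e for e_s, _ in corpus for e in e_s}
    f_vocab = {f for _, f_s in corpus for f in f_s}
    # initialize t(f|e) uniformly
    t = {e: {f: 1.0 / len(f_vocab) for f in f_vocab} for e in e_vocab}

    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)
        for e_s, f_s in corpus:
            e_counts, f_counts = Counter(e_s), Counter(f_s)
            for f, n_f in f_counts.items():
                # normalizer for f over the English words in this sentence
                total_s = sum(t[e][f] * n_f for e in e_counts)
                for e, n_e in e_counts.items():
                    c = t[e][f] * n_f * n_e / total_s
                    count[e][f] += c
                    total[e] += c
        # re-estimate t(f|e) from the expected counts
        t = {e: {f: count[e][f] / total[e] for f in count[e]} for e in total}
    return t

corpus = [[["the", "blue", "house"], ["la", "maison", "bleue"]],
          [["blue", "house"], ["maison", "bleue"]],
          [["blue"], ["bleue"]]]
t = train_model1(corpus, iterations=20)
best = {e: max(t[e], key=t[e].get) for e in t}
print(best)
```

Rebuilding t from the counts each round keeps only word pairs that co-occur in some sentence, so every row of the table sums to 1.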

>>> corpus = [[["the", "blue", "house"], ["la", "maison", "bleue"]], [["blue", "house"], ["maison", "bleue"]], [["blue"], ["bleue"]]]
>>> params2 = make_alignment_parameters(corpus)
{'blue': {'maison': 0.33333333333333331, 'bleue': 0.33333333333333331, 'la': 0.33333333333333331},
 'house': {'maison': 0.33333333333333331, 'bleue': 0.33333333333333331, 'la': 0.33333333333333331},
 'the': {'maison': 0.33333333333333331, 'bleue': 0.33333333333333331, 'la': 0.33333333333333331}}
>>> align2 = make_alignment_probabilities(corpus, params2)
[{'blue': {'maison': 0.1111111111111111, 'bleue': 0.1111111111111111, 'la': 0.1111111111111111},
  'house': {'maison': 0.1111111111111111, 'bleue': 0.1111111111111111, 'la': 0.1111111111111111},
  'the': {'maison': 0.1111111111111111, 'bleue': 0.1111111111111111, 'la': 0.1111111111111111}},
 {'blue': {'maison': 0.25, 'bleue': 0.25},
  'house': {'maison': 0.25, 'bleue': 0.25}},
 {'blue': {'bleue': 1.0}}]
...
>>> normalize(corpus, align2)
{'blue': {'bleue': 1.0},
 'house': {'maison': 0.5, 'bleue': 0.5},
 'the': {'maison': 0.33333333333333331, 'bleue': 0.33333333333333331, 'la': 0.33333333333333331}}

Comments from Tutorial

http://cs.marlboro.edu/courses/spring2012/jims_tutorials/elias/2012-03-07
last modified Tuesday March 27 2012 10:07 pm EDT

attachments

     name                 last modified          size
     translation_v4.py    Mar 6 2012 10:20 pm    7.31kB