2012-03-07
NLP-Class updates
- Official start date has been scheduled for 3/12.
- First week's lectures, assignment, and quiz are up.
- I've watched half of the lectures.
Started looking at the first assignment for that class, due 03-27.
It's data extraction: using regex to extract phone numbers ((###)###-####, etc.) and email addresses (e.g. abc@xyz.com, qwerty@bar.foo.edu...) from a file.
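A minimal sketch of that kind of extraction in Python, assuming only the simple formats mentioned above. The patterns and the helper names (extract_phones, extract_emails) are assumptions for illustration, not the assignment's spec; the real data probably needs more cases.

```python
import re

# Assumed patterns: a 3-3-4 digit phone number with optional parentheses
# and optional separators, and a plain user@host.tld style email address.
PHONE_RE = re.compile(r"\(?(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

def extract_phones(text):
    """Return phone numbers found in text, normalized to ###-###-####."""
    return ["-".join(m.groups()) for m in PHONE_RE.finditer(text)]

def extract_emails(text):
    """Return email addresses found in text, in order of appearance."""
    return EMAIL_RE.findall(text)

sample = "Call (555)123-4567 or mail abc@xyz.com and qwerty@bar.foo.edu."
print(extract_phones(sample))
print(extract_emails(sample))
```

The trailing period in the sample is not swallowed, because each dot in the domain part must be followed by at least one more label character.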
Tutorial Stuff
Looking at "make_alignment_probabilities".
P(a, f|e) = { Prod(P(f_1|e_1), P(f_2|e_1), ..., P(f_n|e_1)),
              Prod(P(f_1|e_2), P(f_2|e_2), ..., P(f_n|e_2)),
              Prod(P(f_1|e_3), P(f_2|e_3), ..., P(f_n|e_3)),
              ...
            }
Alignments
Assuming: Sentence corpus is:
[[["the", "blue", "house"], ["la", "maison", "bleue"]],
[["blue", "house"], ["maison", "bleue"]],
[["blue"], ["bleue"]]]
Possible alignments are:
Eng     Fre      P(a, f|e)
# Sentence 1
the     la       0.333
the     maison   0.333
the     bleue    0.333
blue    la       0.333
blue    maison   0.333
blue    bleue    0.333
house   la       0.333
house   maison   0.333
house   bleue    0.333
# Sentence 2
blue    maison   0.333
blue    bleue    0.333
house   maison   0.333
house   bleue    0.333
# Sentence 3
blue    bleue    0.333
Sentence 1 (the blue house), position-by-position alignments:
  the blue house -> la maison bleue
  the blue house -> la bleue maison
  the blue house -> maison bleue la
  the blue house -> maison la bleue
  the blue house -> bleue maison la
  the blue house -> bleue la maison
  Sent1 Initial Alignments: P(a, f|e) = 0.333*0.333*0.333 ≈ 0.0369
Sentence 2 (blue house):
  blue house -> maison bleue
  blue house -> bleue maison
  Sent2 Initial Alignments: P(a, f|e) = 0.333*0.333 ≈ 0.111
Sentence 3 (blue):
  blue -> bleue
  Sent3 Initial Alignment: P(a, f|e) = 0.333

P(a, f|e) = PI[j=1..m] t(f_j|e_a_j)
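A small sketch to check the product formula against the alignments listed above. It assumes the uniform starting value t(f|e) = 1/3 for every word pair and treats each alignment as a reordering of the French words, paired position by position with the English ones. The function name alignment_probability is made up for this sketch.

```python
from itertools import permutations

corpus = [[["the", "blue", "house"], ["la", "maison", "bleue"]],
          [["blue", "house"], ["maison", "bleue"]],
          [["blue"], ["bleue"]]]

# Assumption: uniform start, t[e][f] = 1 / (number of distinct French words).
french_vocab = {f for _, fr in corpus for f in fr}
t = {e: {f: 1.0 / len(french_vocab) for f in french_vocab}
     for en, _ in corpus for e in en}

def alignment_probability(english, french_order, t):
    """Product of t(f_j | e_a_j) over the word positions of one alignment."""
    p = 1.0
    for e, f in zip(english, french_order):
        p *= t[e][f]
    return p

for english, french in corpus:
    # Every reordering of the French words is one candidate alignment.
    for order in permutations(french):
        print(english, order, round(alignment_probability(english, order, t), 4))
```

With an exact 1/3 the products are 1/27, 1/9 and 1/3. The 0.0369 figure comes from multiplying the rounded value 0.333 instead.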
Pseudocode
p.25
initialize t(f|e) uniformly
do
    set count(f|e) to 0 for all f,e
    set total(e) to 0 for all e
    for all sentence pairs (f_s,e_s)
        for all unique words f in f_s
            n_f = count of f in f_s
            total_s = 0
            for all unique words e in e_s
                total_s += t(f|e) * n_f
            for all unique words e in e_s
                n_e = count of e in e_s
                count(f|e) += t(f|e) * n_f * n_e / total_s
                total(e) += t(f|e) * n_f * n_e / total_s
    for all e in domain( total(.) )
        for all f in domain( count(.|e) )
            t(f|e) = count(f|e) / total(e)
until convergence
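A rough Python transcription of the pseudocode above, as a sketch rather than the tutorial's actual code. Assumptions: the corpus is a list of [english, french] pairs as in these notes, t is stored as t[e][f], the name train_model1 is made up, and a fixed number of iterations stands in for "until convergence".

```python
from collections import Counter, defaultdict

def train_model1(corpus, iterations=10):
    """corpus: list of [english_words, french_words]. Returns t[e][f] = t(f|e)."""
    french_vocab = {f for _, fr in corpus for f in fr}
    english_vocab = {e for en, _ in corpus for e in en}
    # initialize t(f|e) uniformly
    t = {e: {f: 1.0 / len(french_vocab) for f in french_vocab}
         for e in english_vocab}
    for _ in range(iterations):  # assumption: fixed count instead of a convergence test
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)
        for e_s, f_s in corpus:
            f_counts, e_counts = Counter(f_s), Counter(e_s)
            for f, n_f in f_counts.items():
                # normalizer for this French word within this sentence pair
                total_s = sum(t[e][f] * n_f for e in e_counts)
                for e, n_e in e_counts.items():
                    c = t[e][f] * n_f * n_e / total_s
                    count[e][f] += c
                    total[e] += c
        # re-estimate t(f|e) only for pairs that were actually counted
        for e in total:
            for f in count[e]:
                t[e][f] = count[e][f] / total[e]
    return t

corpus = [[["the", "blue", "house"], ["la", "maison", "bleue"]],
          [["blue", "house"], ["maison", "bleue"]],
          [["blue"], ["bleue"]]]
t = train_model1(corpus)
for e in sorted(t):
    print(e, max(t[e], key=t[e].get))
```

On this corpus a single iteration already moves t(bleue|blue) from 1/3 to 11/18, since sentence 3 contributes a full count to that pair.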
>>> corpus = [[["the", "blue", "house"], ["la", "maison", "bleue"]],
[["blue", "house"], ["maison", "bleue"]],
[["blue"], ["bleue"]]]
>>> params2 = make_alignment_parameters(corpus)
{'blue': {'maison': 0.33333333333333331, 'bleue': 0.33333333333333331, 'la': 0.33333333333333331},
'house': {'maison': 0.33333333333333331, 'bleue': 0.33333333333333331, 'la': 0.33333333333333331},
'the': {'maison': 0.33333333333333331, 'bleue': 0.33333333333333331, 'la': 0.33333333333333331}}
>>> align2 = make_alignment_probabilities(corpus, params2)
[{'blue': {'maison': 0.1111111111111111, 'bleue': 0.1111111111111111, 'la': 0.1111111111111111},
'house': {'maison': 0.1111111111111111, 'bleue': 0.1111111111111111, 'la': 0.1111111111111111},
'the': {'maison': 0.1111111111111111, 'bleue': 0.1111111111111111, 'la': 0.1111111111111111}},
{'blue': {'maison': 0.25, 'bleue': 0.25},
'house': {'maison': 0.25, 'bleue': 0.25}},
{'blue': {'bleue': 1.0}}]
...
>>> normalize(corpus, align2)
{'blue': {'bleue': 1.0},
'house': {'maison': 0.5, 'bleue': 0.5},
'the': {'maison': 0.33333333333333331, 'bleue': 0.33333333333333331, 'la': 0.33333333333333331}}
Comments from Tutorial
- Stop working on IBM models -- NLP-Class has started.
- If I did continue, the align2 method would probably need changes: look at E_i and F_i (i from 0 to len(E) - 1, or len(F) - 1), not E_i and F_j.
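One possible reading of that last note, sketched in Python: pairing every English word with every French word (E_i with F_j) versus pairing same-position words only (E_i with F_i). The variable names are made up for the sketch, and the positional version assumes both sentences have the same length.

```python
from itertools import product

english = ["the", "blue", "house"]
french = ["la", "maison", "bleue"]

# E_i with F_j: every English word against every French word.
all_pairs = list(product(english, french))

# E_i with F_i: same-position words only. zip stops at the shorter
# sentence, so unequal lengths would silently drop words.
positional_pairs = list(zip(english, french))

print(len(all_pairs), len(positional_pairs))
print(positional_pairs)
```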