2011-11-8
Jim: attached a version 6; final round of editing. Measuring conditional probabilities works. Verterbi seems to work, though understanding is thin. Understanding of theoretical conditional probabilities of what's seen is missing.
Continued playing around with HMMs.
Sample output from hm4.py: hm4.out Shows P(j|i) where i, j are subsequent (hidden or visible) states.
Answers to questions
1.
a+b) See attached code.
c) The probabilities found with the code are in the range of the expected probabilities.
And you know this ... how? All I see are lots of numbers;
no discussion of what they're supposed to be, or how close. - Jim
2.
a) See attached code.
b) Theoretical P(i|j) are found by conjunctions of conditional transfer and emission probabilities.
Example: With [s] being the pre-initial state (before the start state is chosen),
a visible sequence "a, b, d, d, e",
and hidden sequence "1, 1, 3, 2, 3",
then:
P(e|d) = P(a|[s])*P(b|[s], a)*P(d|[s], a, b)*P(d|[s], a, b, d)*P(e|[s],a, b, d, d)
So why don't you calculate both theory and experiment and compare directly,
rather than talking about doing the comparison? - Jim
3. Viterbi algorithm:
a) Input: The observable/visible states; eg: ['a','b','d','c','a','d','d','e','c','d','b,'e'], as well as the transition, emission, and start probabilities.
b) Output: The most likely path of hidden states that produced the observables.
The algorithm (in the form given on Wikipedia) makes another dictionary-of-dictionaries ("path") that has the transition probabilities for each state in the sequence. It then adds the point with the highest probability to path to find the most likely path of hidden states.
Jim: "that has the transition probabilities" - really? I don't think so.
V = [{'1': 0.20000000000000001, '3': 0.020000000000000004, '2': 0.050000000000000003},
{'1': 0.020000000000000004, '3': 0.012, '2': 0.010000000000000002},
{'1': 0.0, '3': 0.0024000000000000007, '2': 0.0020000000000000005},
{'1': 0.00060000000000000016, '3': 0.00024000000000000009, '2': 4.8000000000000022e-05},
{'1': 0.0, '3': 7.2000000000000015e-05, '2': 6.0000000000000022e-05},
{'1': 1.0800000000000002e-05, '3': 3.6000000000000015e-06, '2': 4.320000000000001e-06},
{'1': 0.0, '3': 1.2960000000000002e-06, '2': 1.0800000000000002e-06},
{'1': 3.2400000000000004e- 07, '3': 1.2960000000000002e-07, '2': 2.5920000000000006e-08},
{'1': 6.4800000000000016e-09, '3': 2.9160000000000001e-08, '2': 3.2400000000000006e- 08},
{'1': 0.0, '3': 7.7760000000000013e-09, '2': 1.1664000000000002e-09}]
This is where all the magic happens:
for t in range(1,len(obs)):
V.append({})
newpath = {}
for y in states:
(prob, state) = max([(V[t-1][y0] * trans_p[y0][y] * emit_p[y][obs[t]], y0) for y0 in states]) # This is the real magic!
V[t][y] = prob
newpath[y] = path[state] + [y]
path = newpath
It returns something like
(7.7760000000000013e-09, ['1', '2', '3', '1', '3', '1', '3', '1', '2', '3'])
which is the conditional probability of, as well as the most likely hidden sequence.
So what was the actual hidden sequence?
And if this really is the probability of this "most likely"
sequence, why is it so very tiny? - Jim
Elias playing with the code
When running hm5_viterbi.py or hm4.py, the doctest throws this:
>>> hm4.calc_conditionals(['b', 'a', 'a', 'a'])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "hm4.py", line 124, in calc_conditionals
result[j][i] = float(result[j][i])/counts[j]
KeyError: 'b'
>>> hm5_viterbi.calc_conditionals(['b', 'a', 'a', 'a'])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "hm5_viterbi.py", line 124, in calc_conditionals
result[j][i] = float(result[j][i])/counts[j]
KeyError: 'b'
I can see what's going on, where it's happening, but I'm not sure how to fully fix it.
I fixed the error above by changing
(i, j) = (states[index], states[index+1])
to
(j, i) = (states[index], states[index+1])
How does this "fix the error?" Is it doing the right thing, or just
letting it compile?
However, the iteration over the list (for "counts") still stops at the second-to-last element.
I think this is because of the (j,i) pairing, as there are only (n-1) pairs in a list of length n.
I added a "brute force" fix for this error:
for index in range(len(states)-1, len(states)): # to fix error thrown
counts[states[index]] = counts.get(j,0) + 1 # at (j,i) = (n-2, n-1)
New output (with debugging print statements everywhere):
>>> hm5_viterbi.calc_conditionals(['b', 'a', 'a', 'a','c','b'])
['a', 'b', 'c']
{'a': {'a': 0, 'c': 0, 'b': 0}, 'c': {'a': 0, 'c': 0, 'b': 0}, 'b': {'a': 0, 'c': 0, 'b': 0}}
0
1
2
3
4
5
{'a': {'a': 2, 'c': 1, 'b': 0}, 'c': {'a': 0, 'c': 0, 'b': 1}, 'b': {'a': 1, 'c': 0, 'b': 0}} {'a': 3, 'c': 1, 'b': 2} 6
{'a': {'a': 0.66666666666666663, 'c': 0.33333333333333331, 'b': 0.0}, 'c': {'a': 0.0, 'c': 0.0, 'b': 1.0}, 'b': {'a': 0.5, 'c': 0.0, 'b': 0.0}}
NLTK Book
I re-downloaded the python nltk files.
Chapter 1
...is Text Processing. Basically what we've been doing in the past few sessions.
Text is specific to NLTK and demonstrates the use of functions in nltk.py: strings and lists (and associated functions), bigrams, counting, (nested) iterations.
New stuff!
- Babelizer: machine translation -- translates to Chinese, English, French, German, Greek, Italian, Japanese, Korean, Portuguese, Russian, Spanish then translates it back to English.
- "Word Comparison Operators"
- Question 25 from the end of Ch. 1:
Define sent to be the list of words ['she', 'sells', 'sea', 'shells', 'by', 'the', 'sea', 'shore'].
Now write code to perform the following tasks:
1. Print all words beginning with sh
2. Print all words longer than four characters
Chapter 2
Accessing Text Corpora and Lexical Resources
- Demonstrates "from A import B"
- Different corpora (Gutenberg, Web/Chat, Brown, ...)
- UNICODE! (utf8)
- WordNet
- NLTK has methods to access all the corpora and analyze the content
Chapter 3
Text Processing
-I've been doing a lot of this.
- Read from file/URL
- String methods:
Method Functionality
s.find(t) index of first instance of string t inside s (-1 if not found)
s.rfind(t) index of last instance of string t inside s (-1 if not found)
s.index(t) like s.find(t) except it raises ValueError if not found
s.rindex(t) like s.rfind(t) except it raises ValueError if not found
s.join(text) combine the words of the text into a string using s as the glue
s.split(t) split s into a list wherever a t is found (whitespace by default)
s.splitlines() split s into a list of strings, one per line
s.lower() a lowercased version of the string s
s.upper() an uppercased version of the string s
s.title() a titlecased version of the string s
s.strip() a copy of s without leading or trailing whitespace
s.replace(t, u) replace instances of t with u inside s
- Encoding - More about Unicode.
- Tokenization
- More Sentence Manipulation
- Text wrapping (could be interesting)
- Translation
- This should come in handy very soon. As well as the Babelizer.
I like NLTK. It is easy to read and understand, very hands-on, and isn't as "hand-wavey" as SLP was.
However, it seems to be very NLTK-specific which, while not a BAD thing, limits things a bit.
(NLTK is fun!)