Jim's
Tutorials

Fall 2011
course
navigation

2011-11-8

Jim: attached a version 6; final round of editing. Measuring conditional probabilities works. Verterbi seems to work, though understanding is thin. Understanding of theoretical conditional probabilities of what's seen is missing.

Continued playing around with HMMs.

Sample output from hm4.py: hm4.out Shows P(j|i) where i, j are subsequent (hidden or visible) states.

Answers to questions

1. a+b) See attached code.
c) The probabilities found with the code are in the range of the expected probabilities.
And you know this ... how? All I see are lots of numbers; no discussion of what they're supposed to be, or how close. - Jim
2. a) See attached code.
b) Theoretical P(i|j) are found by conjunctions of conditional transfer and emission probabilities.
Example: With [s] being the pre-initial state (before the start state is chosen), a visible sequence "a, b, d, d, e", and hidden sequence "1, 1, 3, 2, 3", then: P(e|d) = P(a|[s])*P(b|[s], a)*P(d|[s], a, b)*P(d|[s], a, b, d)*P(e|[s],a, b, d, d)
So why don't you calculate both theory and experiment and compare directly, rather than talking about doing the comparison? - Jim
3. Viterbi algorithm:
a) Input: The observable/visible states; eg: ['a','b','d','c','a','d','d','e','c','d','b,'e'], as well as the transition, emission, and start probabilities.
b) Output: The most likely path of hidden states that produced the observables. The algorithm (in the form given on Wikipedia) makes another dictionary-of-dictionaries ("path") that has the transition probabilities for each state in the sequence. It then adds the point with the highest probability to path to find the most likely path of hidden states.
Jim: "that has the transition probabilities" - really? I don't think so.
V = [{'1': 0.20000000000000001, '3': 0.020000000000000004, '2': 0.050000000000000003}, {'1': 0.020000000000000004, '3': 0.012, '2': 0.010000000000000002}, {'1': 0.0, '3': 0.0024000000000000007, '2': 0.0020000000000000005}, {'1': 0.00060000000000000016, '3': 0.00024000000000000009, '2': 4.8000000000000022e-05}, {'1': 0.0, '3': 7.2000000000000015e-05, '2': 6.0000000000000022e-05}, {'1': 1.0800000000000002e-05, '3': 3.6000000000000015e-06, '2': 4.320000000000001e-06}, {'1': 0.0, '3': 1.2960000000000002e-06, '2': 1.0800000000000002e-06}, {'1': 3.2400000000000004e- 07, '3': 1.2960000000000002e-07, '2': 2.5920000000000006e-08}, {'1': 6.4800000000000016e-09, '3': 2.9160000000000001e-08, '2': 3.2400000000000006e- 08}, {'1': 0.0, '3': 7.7760000000000013e-09, '2': 1.1664000000000002e-09}]
This is where all the magic happens: for t in range(1,len(obs)): V.append({}) newpath = {} for y in states: (prob, state) = max([(V[t-1][y0] * trans_p[y0][y] * emit_p[y][obs[t]], y0) for y0 in states]) # This is the real magic! V[t][y] = prob newpath[y] = path[state] + [y] path = newpath
It returns something like
(7.7760000000000013e-09, ['1', '2', '3', '1', '3', '1', '3', '1', '2', '3'])
which is the conditional probability of, as well as the most likely hidden sequence.
So what was the actual hidden sequence? And if this really is the probability of this "most likely" sequence, why is it so very tiny? - Jim

Elias playing with the code

When running hm5_viterbi.py or hm4.py, the doctest throws this:
>>> hm4.calc_conditionals(['b', 'a', 'a', 'a']) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "hm4.py", line 124, in calc_conditionals result[j][i] = float(result[j][i])/counts[j] KeyError: 'b' >>> hm5_viterbi.calc_conditionals(['b', 'a', 'a', 'a']) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "hm5_viterbi.py", line 124, in calc_conditionals result[j][i] = float(result[j][i])/counts[j] KeyError: 'b'
I can see what's going on, where it's happening, but I'm not sure how to fully fix it.
I fixed the error above by changing (i, j) = (states[index], states[index+1]) to (j, i) = (states[index], states[index+1])
How does this "fix the error?" Is it doing the right thing, or just letting it compile?
However, the iteration over the list (for "counts") still stops at the second-to-last element.
I think this is because of the (j,i) pairing, as there are only (n-1) pairs in a list of length n.
I added a "brute force" fix for this error:
for index in range(len(states)-1, len(states)): # to fix error thrown counts[states[index]] = counts.get(j,0) + 1 # at (j,i) = (n-2, n-1)
New output (with debugging print statements everywhere): >>> hm5_viterbi.calc_conditionals(['b', 'a', 'a', 'a','c','b']) ['a', 'b', 'c'] {'a': {'a': 0, 'c': 0, 'b': 0}, 'c': {'a': 0, 'c': 0, 'b': 0}, 'b': {'a': 0, 'c': 0, 'b': 0}} 0 1 2 3 4 5 {'a': {'a': 2, 'c': 1, 'b': 0}, 'c': {'a': 0, 'c': 0, 'b': 1}, 'b': {'a': 1, 'c': 0, 'b': 0}} {'a': 3, 'c': 1, 'b': 2} 6 {'a': {'a': 0.66666666666666663, 'c': 0.33333333333333331, 'b': 0.0}, 'c': {'a': 0.0, 'c': 0.0, 'b': 1.0}, 'b': {'a': 0.5, 'c': 0.0, 'b': 0.0}}

NLTK Book

NLTK book
I re-downloaded the python nltk files.

Chapter 1

...is Text Processing. Basically what we've been doing in the past few sessions.
Text is specific to NLTK and demonstrates the use of functions in nltk.py: strings and lists (and associated functions), bigrams, counting, (nested) iterations.
New stuff!
Define sent to be the list of words ['she', 'sells', 'sea', 'shells', 'by', 'the', 'sea', 'shore']. Now write code to perform the following tasks: 1. Print all words beginning with sh 2. Print all words longer than four characters

Chapter 2

Accessing Text Corpora and Lexical Resources

Chapter 3

Text Processing -I've been doing a lot of this.
Method Functionality s.find(t) index of first instance of string t inside s (-1 if not found) s.rfind(t) index of last instance of string t inside s (-1 if not found) s.index(t) like s.find(t) except it raises ValueError if not found s.rindex(t) like s.rfind(t) except it raises ValueError if not found s.join(text) combine the words of the text into a string using s as the glue s.split(t) split s into a list wherever a t is found (whitespace by default) s.splitlines() split s into a list of strings, one per line s.lower() a lowercased version of the string s s.upper() an uppercased version of the string s s.title() a titlecased version of the string s s.strip() a copy of s without leading or trailing whitespace s.replace(t, u) replace instances of t with u inside s
I like NLTK. It is easy to read and understand, very hands-on, and isn't as "hand-wavey" as SLP was.
However, it seems to be very NLTK-specific which, while not a BAD thing, limits things a bit.

(NLTK is fun!)
http://cs.marlboro.edu/ courses/ fall2011/jims_tutorials/ elias/ 2011-11-8
last modified Tuesday November 8 2011 4:13 pm EST

attachments [paper clip]

     name last modified size
   h5c.out Nov 8 2011 2:06 am 6.33kB    hm4.out Nov 7 2011 6:47 pm 12.9kB    hm4.py Nov 7 2011 11:41 pm 4.67kB    hm5_viterbi.py Nov 8 2011 2:56 am 6.83kB    hmv6.py Nov 8 2011 4:13 pm 6.92kB