2011-10-02
Elias
Uploaded word_count1.py, adding general N-gram and output.
Jim
I wrote a template for a word-counting program in Python and uploaded it as a starting point for playing around with this stuff.
Elias
My laptop keyboard just died, so I will do what I can with the computers in the library.
That said, I am picking some exercises from chapter 4, and starting chapter 5 for Tuesday.
4.1
"Write out the equation for trigram probability estimation (modifying Eq. 4.14 [p. 89])"
Reasoning: In this case, N = 3. Using the general formula (4.15) on p. 89, the trigram looks at the probability of a word given the two words before it ($w_{n-2}$ and $w_{n-1}$).
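Written out as a sketch, assuming Eq. 4.15 is the usual relative-frequency (maximum likelihood) estimate for N-grams, where $C(\cdot)$ is a count in the training corpus and $w_{n-N+1}^{n-1}$ stands for the N-1 words before $w_n$, substituting N = 3 would give:

```latex
\begin{align*}
P(w_n \mid w_{n-N+1}^{n-1}) &= \frac{C(w_{n-N+1}^{n-1}\, w_n)}{C(w_{n-N+1}^{n-1})}
  && \text{general N-gram form (assumed Eq. 4.15)} \\
P(w_n \mid w_{n-2}\, w_{n-1}) &= \frac{C(w_{n-2}\, w_{n-1}\, w_n)}{C(w_{n-2}\, w_{n-1})}
  && \text{trigram case, } N = 3
\end{align*}
```

In words: count how often the three words occur together, then divide by how often the first two occur together.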
4.2 onwards
These questions involve programming that I am not sure I can do. I'd like to start the next tutorial with this.
For 4.2 I know:
- I need to find the probability of:
  - every unigram (single word)
  - every bigram (two-word phrase)
- The formulae needed to do so
I don't know how I would go about writing this. I haven't had much luck with file I/O in Lisp, and although I know how to read and write files in Python, I don't know how I would compute these probabilities.
I guess a pseudo code version could look like:
input file
read file to list
count total words
store as variable (e.g. defvar and setq)
for unigrams:
    sort list
    while words:
        loop through word list
        count the occurrence of each word
        remove each occurrence of the word from the list
        divide count by total number of words
        add to output array/vector
        reset count
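One way the pseudocode could turn into Python, as a minimal sketch rather than the actual contents of word_count1.py. The names tokenize, unigram_probs and bigram_probs are made up for illustration, the whitespace tokenizer is a crude assumption, and the estimates are unsmoothed relative frequencies. It uses collections.Counter in place of the sort-and-remove loop, since Counter does the tallying in one pass.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def unigram_probs(words):
    """P(w) = C(w) / total number of words."""
    total = len(words)
    counts = Counter(words)
    return {w: c / total for w, c in counts.items()}


def bigram_probs(words):
    """P(w2 | w1) = C(w1 w2) / C(w1), with no smoothing."""
    # Pair each word with its successor to get the bigrams.
    pair_counts = Counter(zip(words, words[1:]))
    # Count each word only where it starts a bigram, so the
    # probabilities for a given first word sum to 1.
    first_counts = Counter(words[:-1])
    return {
        (w1, w2): c / first_counts[w1]
        for (w1, w2), c in pair_counts.items()
    }


words = tokenize("the cat sat on the mat")
print(unigram_probs(words)["the"])          # 2 of the 6 words are "the"
print(bigram_probs(words)[("the", "cat")])  # "the" starts 2 bigrams, 1 is "the cat"
```

The same zip trick would extend to trigrams with zip(words, words[1:], words[2:]).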
I think this might work for reading a file in:
import os
if os.path.isfile(inputstring):
    with open(inputstring) as f:
        inputstring = f.read()  # one string; f.readlines() would give a list of lines instead
I'd put it inside the calc_ngram function, above the check for inputstring.
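As a self-contained sketch of that idea, the check could live in a small helper that calc_ngram calls first. The helper name load_text is hypothetical, and so is the assumption that calc_ngram takes a single inputstring argument, which may not match what word_count1.py actually does.

```python
import os


def load_text(inputstring):
    """Return the file's contents if inputstring names a file, else inputstring itself."""
    if os.path.isfile(inputstring):
        # 'with' closes the file even if reading fails part-way.
        with open(inputstring, encoding="utf-8") as f:
            return f.read()
    # Not a path to an existing file, so treat it as the text to analyse.
    return inputstring


# No file has this name, so the string comes back unchanged.
print(load_text("the cat sat on the mat"))
```

This keeps the file handling in one place, so the rest of the function only ever sees a plain string.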
Elias...later
Found MIT code for bigram segmenting. Uses nltk.