Jim's Tutorials

Fall 2011

2011-10-02

Elias

Uploaded word_count1.py, adding general N-gram and output.

Jim

I wrote a template for a word-counting program in Python and uploaded it as a starting point for playing around with this stuff.

Elias

My laptop keyboard just died, so I will do what I can with the computers in the library. That said, I am picking some exercises from chapter 4, and starting chapter 5 for Tuesday.
Interesting links: wikipedia: n-gram | wikipedia: Markov chain | wikipedia: Hidden Markov model | Google N-gram viewer | Microsoft Web N-gram Services | W3C N-gram Specifications

4.1

"Write out the equation for trigram probability estimation (modifying Eq. 4.14 [p. 89])"
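Reasoning: In this case, N = 3. Using the general formula, (4.15) on p. 89,
$$P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1}\, w_n)}{C(w_{n-N+1}^{n-1})}$$
and setting N = 3 gives
$$P(w_n \mid w_{n-2}\, w_{n-1}) = \frac{C(w_{n-2}\, w_{n-1}\, w_n)}{C(w_{n-2}\, w_{n-1})}$$
The trigram looks at the probability of a word given the two words before it ($w_{n-2}$ and $w_{n-1}$).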
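The trigram estimate described above (the count of a three-word sequence divided by the count of its first two words) might be sketched in Python roughly as follows. The function name `trigram_mle` and the toy sentence are made up for illustration, and no smoothing or sentence-boundary markers are used.

```python
from collections import Counter

def trigram_mle(tokens):
    """Unsmoothed estimate: P(w3 | w1 w2) = C(w1 w2 w3) / C(w1 w2).

    `trigram_mle` is a hypothetical helper, not part of word_count.py.
    """
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return {
        (w1, w2, w3): count / bigram_counts[(w1, w2)]
        for (w1, w2, w3), count in trigram_counts.items()
    }

# Toy input, chosen only so that "i am" is followed by two different words.
tokens = "i am sam sam i am i do not like green eggs and ham".split()
probs = trigram_mle(tokens)

# "i am" occurs twice, once followed by "sam", so the estimate is 1/2.
print(probs[("i", "am", "sam")])  # 0.5
```

On a real corpus the counts would come from the whole text rather than one line, and unseen trigrams would simply be missing from the dictionary.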

4.2 onwards

These questions involve programming that I am not sure I can do. I'd like to start the next tutorial with this.
For 4.2 I know:
I don't know how I would go about writing this. I haven't had much luck with file I/O in Lisp, and although I know how to input and output files in Python, I don't know how I would compute these.
I guess a pseudocode version could look like:
    input file
    read file to list
    count total words
    store as variable (e.g. defvar and setq)
    for unigrams:
        sort list
        while words:
            loop through word list
            count the occurrence of each word
            remove each occurrence of the word from the list
            divide count by total number of words
            add to output array/vector
            reset count
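The pseudocode above could be turned into Python along these lines. This is only a sketch: `tokenize`, `unigram_probs`, and `bigram_probs` are hypothetical names, splitting on whitespace is a crude assumption about what counts as a word, and `collections.Counter` stands in for the sort-count-remove loop.

```python
from collections import Counter

def tokenize(text):
    """Crude tokenizer (an assumption): lowercase, split on whitespace."""
    return text.lower().split()

def unigram_probs(tokens):
    """P(w) = C(w) / total number of words."""
    counts = Counter(tokens)   # counts every distinct word in one pass
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

def bigram_probs(tokens):
    """Unsmoothed P(w2 | w1) = C(w1 w2) / C(w1)."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))   # adjacent word pairs
    return {
        (w1, w2): count / unigram_counts[w1]
        for (w1, w2), count in bigram_counts.items()
    }

tokens = tokenize("The cat sat on the mat")
unigrams = unigram_probs(tokens)
bigrams = bigram_probs(tokens)

# "the" occurs twice and is followed by "cat" once, so the estimate is 1/2.
print(bigrams[("the", "cat")])  # 0.5
```

Using a dictionary of counts avoids having to sort the list or remove words from it while looping, which is the part of the pseudocode most likely to go wrong.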
I found this n-gram calculator which calculates just from a raw input.
I think this might work for reading a file in (it needs import os at the top of the file):
    if os.path.isfile(inputstring):
        with open(inputstring) as f:
            inputstring = f.read()  # or should it just be f.readlines()?
I'd put it inside the calc_ngram function, above the check for inputstring.
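On the question of which read call to use, here is a small self-contained sketch. The helper name `load_text` is hypothetical, and it assumes the n-gram code wants one string: `f.read()` returns the whole file as a single string, whereas `f.readlines()` returns a list of lines that would still need joining.

```python
import os

def load_text(source):
    """Return the file's contents if `source` is a path to an existing file,
    otherwise treat `source` itself as the raw input text.

    `load_text` is a hypothetical helper, not part of the n-gram calculator.
    """
    if os.path.isfile(source):
        with open(source, encoding="utf-8") as f:
            return f.read()   # one string, newlines included
    return source

# Raw text that is not a filename passes through unchanged.
text = load_text("to be or not to be")
```

A call such as `load_text("corpus.txt")` (filename made up here) would return the file's text if that file exists, so the same function could sit in front of either kind of input.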

Elias...later

Found MIT code for bigram segmenting. Uses nltk.
http://cs.marlboro.edu/courses/fall2011/jims_tutorials/elias/2011-10-02
last modified Monday October 24 2011 5:01 pm EDT

attachments

     name              last modified           size
     segment.py        Oct 24 2011 5:01 pm     3.94kB
     word_count.py     Oct 4 2011 4:04 pm      1.44kB
     word_count1.py    Oct 5 2011 11:20 am     2.07kB