Jim's Tutorials

Fall 2011

2011-11-22

Looking at http://www.nltk.org/book

Chapter 4

Writing structured programs: the basics. Covers writing functions, style conventions, passing functions as arguments to other functions, algorithm design (recursion, space/time trade-offs, dynamic programming), and sample libraries. This is background info.
12. Initialize an n-by-m list of lists of empty strings using list multiplication, e.g. word_table = [[''] * n] * m. What happens when you set one of its values, e.g. word_table[1][2] = "hello"? Explain why this happens. Now write an expression using range() to construct a list of lists, and show that it does not have this problem.
    word_table = [[''] * 5] * 5
    # word_table
    # [['', '', '', '', ''], ['', '', '', '', ''],
    #  ['', '', '', '', ''], ['', '', '', '', ''], ['', '', '', '', '']]
    word_table[3][2] = 'hello'
    # word_table
    # [['', '', 'hello', '', ''], ['', '', 'hello', '', ''],
    #  ['', '', 'hello', '', ''], ['', '', 'hello', '', ''], ['', '', 'hello', '', '']]
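This happens because the outer list is created by multiplying a one-element list whose single element is the inner list of empty strings. Multiplying a list by n in Python does not copy its elements; it repeats references (pointers) to the same objects n times. All five rows are therefore the same list object, and a change made through one row shows up in every row.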
    word_table2 = []
    n = 5
    for i in range(n):
        # [''] * 5 is evaluated on every pass, so each row is a new list
        word_table2.append([''] * 5)
    word_table2[3][2] = 'hello'
    # word_table2
    # [['', '', '', '', ''], ['', '', '', '', ''],
    #  ['', '', '', '', ''], ['', '', 'hello', '', ''], ['', '', '', '', '']]
This time each nested list is a separate object with its own pointer, so assigning to word_table2[i][j] changes only row i and leaves the other rows alone.
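The exercise asks for an expression using range(), and the loop above can be collapsed into a list comprehension. The following is a minimal sketch; make_table is a hypothetical helper name, not something from the NLTK book. It uses `is` to check directly whether two rows are the same object.

```python
def make_table(n_rows, n_cols, fill=''):
    """Build a table whose rows are distinct list objects (hypothetical helper)."""
    # The inner [fill] * n_cols is re-evaluated once per row.
    return [[fill] * n_cols for _ in range(n_rows)]

aliased = [[''] * 5] * 5        # every row refers to the same list object
independent = make_table(5, 5)  # every row is its own list object

aliased[3][2] = 'hello'
independent[3][2] = 'hello'

print(aliased[0] is aliased[3])          # True: one shared row
print(independent[0] is independent[3])  # False: separate rows
print([row[2] for row in aliased])       # 'hello' in every row
print([row[2] for row in independent])   # 'hello' only in row 3
```

Multiplying the inner list is still safe here because its elements are strings, which are immutable; the aliasing only matters when the repeated element is itself mutable.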
16. Read up on Gematria, a method for assigning numbers to words, and for mapping between words having the same number to discover the hidden meaning of texts (http://en.wikipedia.org/wiki/Gematria, http://essenes.net/gemcal.htm).
a) Write a function gematria() that sums the numerical values of the letters of a word, according to the letter values in letter_vals:
letter_vals = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5, 'f':80, 'g':3, 'h':8, 'i':10, 'j':10, 'k':20, 'l':30, 'm':40, 'n':50, 'o':70, 'p':80, 'q':100, 'r':200, 's':300, 't':400, 'u':6, 'v':6, 'w':800, 'x':60, 'y':10, 'z':7}
b) Process a corpus (e.g. nltk.corpus.state_union) and for each document, count how many of its words have the number 666.
c) Write a function decode() to process a text, randomly replacing words with their Gematria equivalents, in order to discover the "hidden meaning" of the text.
a)

    def gematria(word):
        # assumes word contains only lowercase a-z (other characters raise KeyError)
        total = 0
        for letter in word:
            total += letter_vals[letter]
            # trace each letter's value and the running total
            print("%s: %d" % (letter, letter_vals[letter]))
            print("sum = %d" % total)
        return total

    gematria('hello')
    # h: 8
    # sum = 8
    # e: 5
    # sum = 13
    # l: 30
    # sum = 43
    # l: 30
    # sum = 73
    # o: 70
    # sum = 143
    # 143
    gematria('elias')
    # e: 5
    # sum = 5
    # l: 30
    # sum = 35
    # i: 10
    # sum = 45
    # a: 1
    # sum = 46
    # s: 300
    # sum = 346
    # 346
b)

    import pprint
    import string
    import nltk

    def gematria_666(text):
        tokens = nltk.word_tokenize(text)
        count = 0
        seen = set()   # tokens already checked, so each distinct token counts once
        words = []
        for token in tokens:
            if token not in seen:
                seen.add(token)
                total = 0
                for letter in token:
                    # skip punctuation, digits and anything else outside a-z
                    if letter in string.ascii_lowercase:
                        total += letter_vals[letter]
                if total == 666:
                    count += 1
                    words.append(token)
        pprint.pprint(words)
        return count

    gematria_666(nltk.corpus.state_union.raw().lower())
    # ['eloquent', 'outlook', 'market', 'retain', 'market.', 'extra',
    #  'miraculous', 'philosophy', 'squander', 'papers', 'competency',
    #  'retina', 'smallpox...and', 'papers.']
    # 14
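The answer to (b) above counts distinct tokens over the whole corpus at once, while the exercise asks for a count per document. The following is a minimal sketch of the per-document version. The names count_666 and count_by_document and the toy docs dictionary are made up for illustration, and it assumes each document is already a list of word tokens.

```python
letter_vals = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5, 'f':80, 'g':3, 'h':8,
               'i':10, 'j':10, 'k':20, 'l':30, 'm':40, 'n':50, 'o':70, 'p':80,
               'q':100, 'r':200, 's':300, 't':400, 'u':6, 'v':6, 'w':800,
               'x':60, 'y':10, 'z':7}

def gematria(word):
    """Sum the letter values of word; characters outside a-z count as 0."""
    return sum(letter_vals.get(ch, 0) for ch in word.lower())

def count_666(words, target=666):
    """Count the distinct words of one document whose value equals target."""
    distinct = set(w.lower() for w in words)
    return sum(1 for w in distinct if gematria(w) == target)

def count_by_document(docs, target=666):
    """Map each document name to its count (docs: name -> list of words)."""
    return {name: count_666(words, target) for name, words in docs.items()}

# Toy stand-in for a corpus (hypothetical data, not from state_union).
docs = {
    'doc_a.txt': ['The', 'market', 'will', 'retain', 'market'],
    'doc_b.txt': ['hello', 'world'],
}
print(count_by_document(docs))  # {'doc_a.txt': 2, 'doc_b.txt': 0}
```

With NLTK and the state_union corpus installed, the docs dictionary could presumably be built from nltk.corpus.state_union.fileids() and nltk.corpus.state_union.words(fileid). Counting distinct words rather than every occurrence follows the choice made in the answer above; dropping the set() would count repeated occurrences too.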

Chapter 5

Tagging (POS)
NLTK has pre-tagged corpora and sample data.
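14. Use sorted() and set() to get a sorted list of tags used in the Brown corpus, removing duplicates.

    import pprint

    def make_tag_set(tagged_words):
        tags = []
        for (w, t) in tagged_words:
            tags.append(t)
        tag_s = set(tags)          # drop duplicate tags
        tag_list = sorted(tag_s)
        pprint.pprint(tag_list)
        return tag_list            # pprint.pprint() itself returns None
    # output: see attachment tags.out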
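The same result can be written as a single expression, since set() removes the duplicates and sorted() accepts any iterable. This is a minimal sketch; sorted_tags is a hypothetical name, and the sample pairs are hand-made in the (word, tag) shape of Brown-style tagged words rather than read from the corpus.

```python
def sorted_tags(tagged_words):
    """Return the distinct tags in tagged_words as a sorted list."""
    return sorted(set(tag for (_, tag) in tagged_words))

# Hand-made sample of (word, tag) pairs (illustrative only).
sample = [('The', 'AT'), ('jury', 'NN'), ('said', 'VBD'),
          ('the', 'AT'), ('election', 'NN')]

print(sorted_tags(sample))  # ['AT', 'NN', 'VBD']
```

With the Brown corpus installed, the argument would be nltk.corpus.brown.tagged_words().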

Chapter 6

Text classification, using POS-tagging and other features.

Chapter 7

Information extraction
Drawing syntactic trees:

    import nltk

    tree1 = nltk.Tree('NP', ['Alice'])
    print(tree1)
    # (NP Alice)
    tree2 = nltk.Tree('NP', ['the', 'rabbit'])
    tree2
    # Tree('NP', ['the', 'rabbit'])
    print(tree2)
    # (NP the rabbit)
    tree3 = nltk.Tree('VP', ['chased', tree2])
    tree4 = nltk.Tree('S', [tree1, tree3])
    print(tree4)
    # (S (NP Alice) (VP chased (NP the rabbit)))
    tree4.draw()
    # see attachment alicetree.png
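To see where the bracketed form comes from, the nesting can be reproduced without NLTK. The sketch below is not how nltk.Tree is implemented; it assumes a made-up representation in which a node is a (label, children) tuple and a leaf is a plain string, and bracketed is a hypothetical helper that walks the structure recursively.

```python
def bracketed(node):
    """Render a (label, children) tuple, or a string leaf, in bracket notation."""
    if isinstance(node, str):
        return node                      # leaf: just the word
    label, children = node
    parts = [label] + [bracketed(child) for child in children]
    return '(' + ' '.join(parts) + ')'

np_alice = ('NP', ['Alice'])
np_rabbit = ('NP', ['the', 'rabbit'])
vp = ('VP', ['chased', np_rabbit])
sentence = ('S', [np_alice, vp])

print(bracketed(sentence))  # (S (NP Alice) (VP chased (NP the rabbit)))
```

Each subtree is rendered before its parent, which is why the noun phrase for "the rabbit" ends up nested inside the verb phrase.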
http://cs.marlboro.edu/courses/fall2011/jims_tutorials/elias/2011-11-22
last modified Tuesday November 22 2011 3:40 pm EST

attachments

    name            last modified          size
    alicetree.png   Nov 22 2011 4:57 am    12.2kB
    errors.out      Nov 22 2011 3:52 am    142kB
    rules.yaml      Nov 22 2011 3:52 am    158kB
    tags.out        Nov 22 2011 4:11 am    5.38kB