2011-11-22
Chapter 4
About programs: the basics. Writing functions, style conventions, passing functions as arguments to other functions, algorithm design (recursion, space/time trade-offs, dynamic programming), and sample libraries. This is background info.
12. Initialize an n-by-m list of lists of empty strings using list multiplication, e.g. word_table = [[''] * n] * m. What happens when you set one of its values, e.g. word_table[1][2] = "hello"? Explain why this happens. Now write an expression using range() to construct a list of lists, and show that it does not have this problem.
word_table = [['']*5]*5
# word_table
# [['', '', '', '', ''], ['', '', '', '', ''],
# ['', '', '', '', ''], ['', '', '', '', ''], ['', '', '', '', '']]
word_table[3][2] = 'hello'
# word_table
# [['', '', 'hello', '', ''], ['', '', 'hello', '', ''],
# ['', '', 'hello', '', ''], ['', '', 'hello', '', ''], ['', '', 'hello', '', '']]
This happens because the outer multiplication does not copy the inner list. [['']*5]*5 builds one inner list and then stores five references to it, so all five rows are the same object and a change made through one row shows up in every row.
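The aliasing can be checked directly with the is operator. This is a minimal sketch; the names row, aliased and same_object are made up for the illustration and are not part of the exercise.

```python
# The outer multiplication repeats a reference to one inner list.
row = [''] * 3           # harmless: strings are immutable
aliased = [row] * 2      # two references to the same list object

same_object = aliased[0] is aliased[1]
aliased[0][1] = 'hello'  # written through row 0, visible through row 1 as well

print(same_object)
print(aliased)
```

The inner multiplication ([''] * 3) is not a problem, because the repeated elements are immutable strings; only the repeated mutable list causes the surprise.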
word_table2 = []
n = 5
for i in range(n):
    word_table2.append([''] * 5)  # a fresh inner list on every iteration
word_table2[3][2] = 'hello'
# word_table2
# [['', '', '', '', ''], ['', '', '', '', ''],
# ['', '', '', '', ''], ['', '', 'hello', '', ''], ['', '', '', '', '']]
This time each nested list is a separate object, so assigning to word_table2[3][2] changes only row 3 and leaves the other rows untouched.
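The exercise asks for an expression rather than a loop. One way to write it, as a sketch, is a list comprehension over range(); word_table3 and the dimensions n and m below are example names and values chosen to match the table above.

```python
n, m = 5, 5  # example dimensions

# The inner [''] * n is re-evaluated for each value from range(m),
# so every row is a new list object.
word_table3 = [[''] * n for _ in range(m)]
word_table3[3][2] = 'hello'

# Count the distinct row objects and the cells that were changed.
distinct_rows = len(set(id(r) for r in word_table3))
hello_cells = sum(r.count('hello') for r in word_table3)
print(distinct_rows, hello_cells)
```

Here distinct_rows equals m and only one cell holds 'hello', which shows that the rows are independent.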
a) Write a function gematria() that sums the numerical values of the letters of a word, according to the letter values in letter_vals:
letter_vals = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5, 'f':80, 'g':3, 'h':8, 'i':10,
'j':10, 'k':20, 'l':30, 'm':40, 'n':50, 'o':70, 'p':80, 'q':100,
'r':200, 's':300, 't':400, 'u':6, 'v':6, 'w':800, 'x':60, 'y':10, 'z':7}
b) Process a corpus (e.g. nltk.corpus.state_union) and for each document, count how many of its words have the number 666.
c) Write a function decode() to process a text, randomly replacing words with their Gematria equivalents, in order to discover the "hidden meaning" of the text.
a)
def gemetria(word):
    # Add up the value of each letter, printing the running total
    total = 0
    for letter in word:
        total += letter_vals[letter]
        print("%s: %d" % (letter, letter_vals[letter]))
        print("sum = %d" % total)
    return total
gemetria('hello')
h: 8
sum = 8
e: 5
sum = 13
l: 30
sum = 43
l: 30
sum = 73
o: 70
sum = 143
143
gemetria('elias')
e: 5
sum = 5
l: 30
sum = 35
i: 10
sum = 45
a: 1
sum = 46
s: 300
sum = 346
346
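Without the trace output, the same sum can be written more compactly with the built-in sum(). The sketch below uses a hypothetical name, gematria_total, and assumes that characters missing from letter_vals (uppercase is lowercased first; digits and punctuation count as 0) should simply be skipped.

```python
letter_vals = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5, 'f':80, 'g':3, 'h':8, 'i':10,
               'j':10, 'k':20, 'l':30, 'm':40, 'n':50, 'o':70, 'p':80, 'q':100,
               'r':200, 's':300, 't':400, 'u':6, 'v':6, 'w':800, 'x':60, 'y':10, 'z':7}

def gematria_total(word):
    """Sum the letter values of word; characters not in letter_vals count as 0."""
    return sum(letter_vals.get(ch, 0) for ch in word.lower())

print(gematria_total('hello'))  # 143, the same total as the trace above
print(gematria_total('elias'))  # 346
```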
b)
import pprint
import nltk

def gemetria_666(text):
    # Count the distinct tokens whose letter values sum to 666
    tokens = nltk.word_tokenize(text)
    count = 0
    seen = set()
    words = []
    for token in tokens:
        if token in seen:
            continue
        seen.add(token)
        # Characters outside letter_vals (digits, punctuation) contribute nothing
        total = sum(letter_vals.get(letter, 0) for letter in token)
        if total == 666:
            count += 1
            words.append(token)
    pprint.pprint(words)
    return count
gemetria_666(nltk.corpus.state_union.raw().lower())
['eloquent',
'outlook',
'market',
'retain',
'market.',
'extra',
'miraculous',
'philosophy',
'squander',
'papers',
'competency',
'retina',
'smallpox...and',
'papers.']
14
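The run above treats the whole corpus as one text, while the exercise asks for a count per document. A possible per-document version is sketched below. The helper names count_666 and count_666_per_document and the toy data are made up for the illustration; with NLTK, the documents dictionary could presumably be built from nltk.corpus.state_union.fileids() and nltk.corpus.state_union.words(fileid).

```python
letter_vals = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5, 'f':80, 'g':3, 'h':8, 'i':10,
               'j':10, 'k':20, 'l':30, 'm':40, 'n':50, 'o':70, 'p':80, 'q':100,
               'r':200, 's':300, 't':400, 'u':6, 'v':6, 'w':800, 'x':60, 'y':10, 'z':7}

def count_666(tokens):
    """Count the distinct (lowercased) tokens whose letter values sum to 666."""
    distinct = set(t.lower() for t in tokens)
    return sum(1 for t in distinct
               if sum(letter_vals.get(ch, 0) for ch in t) == 666)

def count_666_per_document(documents):
    """Map each document name to its own count; documents maps names to token lists."""
    return dict((name, count_666(tokens)) for name, tokens in documents.items())

# Toy data standing in for real corpus documents (an assumption for this sketch).
docs = {'doc1': ['The', 'market', 'will', 'retain', 'value', 'market'],
        'doc2': ['nothing', 'here']}
result = count_666_per_document(docs)
print(result)
```

As in the function above, a repeated word is counted once per document.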
Chapter 5
Tagging (POS)
NLTK has pre-tagged corpora and sample data.
14. Use sorted() and set() to get a sorted list of tags used in the Brown corpus, removing duplicates.
import pprint

def make_tag_set(tagged_words):
    # Collect each distinct tag once, then sort
    tag_s = set(t for (w, t) in tagged_words)
    tags = sorted(tag_s)
    pprint.pprint(tags) # tags.out
    return tags
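Since the exercise names sorted() and set() directly, the whole task also fits in one expression. The sketch below runs on a small hand-made list of (word, tag) pairs instead of the real corpus; with NLTK the input would be the tagged words of the Brown corpus.

```python
# Hand-made sample of (word, tag) pairs, standing in for the tagged corpus.
tagged_words = [('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'),
                ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD')]

# set() removes the duplicate 'NN-TL'; sorted() returns the tags as an ordered list.
tags = sorted(set(tag for (word, tag) in tagged_words))
print(tags)
```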
Chapter 6
Text classification, using POS-tagging and other features.
Chapter 7
Information extraction
Drawing syntactical trees...
import nltk
tree1 = nltk.Tree('NP', ['Alice'])
print(tree1) # (NP Alice)
tree2 = nltk.Tree('NP', ['the', 'rabbit'])
tree2 # Tree('NP', ['the', 'rabbit'])
print(tree2) # (NP the rabbit)
tree3 = nltk.Tree('VP', ['chased', tree2])
tree4 = nltk.Tree('S', [tree1, tree3])
print(tree4) # (S (NP Alice) (VP chased (NP the rabbit)))
tree4.draw() # alicetree.png