2011-11-22

Chapter 4

About programs: the basics. Writing functions, style conventions, passing functions as arguments to other functions, algorithm design (recursion, space/time trade-offs, dynamic programming), and sample libraries. This is background information.

12. Initialize an n-by-m list of lists of empty strings using list multiplication, e.g. word_table = [[''] * n] * m. What happens when you set one of its values, e.g. word_table[1][2] = "hello"? Explain why this happens. Now write an expression using range() to construct a list of lists, and show that it does not have this problem.

word_table = [['']*5]*5
# word_table
#    [['', '', '', '', ''], ['', '', '', '', ''], 
#     ['', '', '', '', ''], ['', '', '', '', ''], ['', '', '', '', '']]
word_table[3][2] = 'hello'
# word_table
#    [['', '', 'hello', '', ''], ['', '', 'hello', '', ''], 
#     ['', '', 'hello', '', ''], ['', '', 'hello', '', ''], ['', '', 'hello', '', '']]

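This happens because the outer multiplication does not copy the inner list. In Python, [x] * n builds a list holding n references to the same object x. All five rows of word_table are therefore one and the same list, and a change made through one row shows up in every row. The inner ['']*5 is harmless because strings are immutable.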
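The sharing can be checked directly with the is operator and id(). This is only a minimal sketch; the names shared and row_ids are illustrative and not part of the exercise.

```python
# Build the table the same way as above: the outer * repeats one inner list.
shared = [[''] * 5] * 5

# Every row is the very same object, so there is only one distinct id.
print(shared[0] is shared[1])            # True
row_ids = set(id(row) for row in shared)
print(len(row_ids))                      # 1

# Writing through one row is therefore visible through all of them.
shared[3][2] = 'hello'
print(shared[0][2])                      # hello
```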

word_table2 = []
n = 5
for i in range(n):
    # ['']*5 is evaluated again on every pass, so each row is a new list
    word_table2.append(['']*5)
word_table2[3][2] = 'hello'
# word_table2
#    [['', '', '', '', ''], ['', '', '', '', ''], 
#     ['', '', '', '', ''], ['', '', 'hello', '', ''], ['', '', '', '', '']]

This time each row is a separate list object, because ['']*5 is evaluated once per loop iteration. Assigning to word_table2[3][2] changes row 3 only and leaves the other rows untouched.
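The exercise asks for an expression using range(), and a list comprehension is one way to write the loop as a single expression. A minimal sketch, assuming 5 rows and 5 columns as in the example; the name word_table3 is illustrative.

```python
n, m = 5, 5   # rows and columns; sizes chosen to match the example

# The inner [''] * m is re-evaluated once per value produced by range(n),
# so each row is a fresh list.
word_table3 = [[''] * m for i in range(n)]
word_table3[3][2] = 'hello'

print(word_table3[3][2])                 # hello
print(word_table3[0][2] == '')           # True
print(word_table3[0] is word_table3[1])  # False
```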

16. Read up on Gematria, a method for assigning numbers to words, and for mapping between words having the same number to discover the hidden meaning of texts (http://en.wikipedia.org/wiki/Gematria, http://essenes.net/gemcal.htm).

a) Write a function gematria() that sums the numerical values of the letters of a word, according to the letter values in letter_vals:

 letter_vals = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5, 'f':80, 'g':3, 'h':8, 'i':10,
                'j':10, 'k':20, 'l':30, 'm':40, 'n':50, 'o':70, 'p':80, 'q':100,
                'r':200, 's':300, 't':400, 'u':6, 'v':6, 'w':800, 'x':60, 'y':10, 'z':7}

b) Process a corpus (e.g. nltk.corpus.state_union) and for each document, count how many of its words have the number 666.

c) Write a function decode() to process a text, randomly replacing words with their Gematria equivalents, in order to discover the "hidden meaning" of the text.

def gematria(word):
    # Running total of the letter values; each step is printed for inspection.
    total = 0
    for letter in word:
        total += letter_vals[letter]
        print("%s: %d" % (letter, letter_vals[letter]))
        print("sum = %d" % total)
    return total

 gematria('hello')
  h: 8
  sum = 8
  e: 5
  sum = 13
  l: 30
  sum = 43
  l: 30
  sum = 73
  o: 70
  sum = 143
  143

 gematria('elias')
   e: 5
   sum = 5
   l: 30
   sum = 35
   i: 10
   sum = 45
   a: 1
   sum = 46
   s: 300
   sum = 346
   346

import pprint
import nltk

def gematria_666(text):
    # Counts distinct tokens whose letters sum to 666 over the whole text.
    # Characters without a letter value (punctuation, digits) are skipped,
    # so 'market' and 'market.' are counted as separate tokens.
    tokens = nltk.word_tokenize(text)
    seen = set()
    words = []
    for token in tokens:
        if token in seen:
            continue
        seen.add(token)
        total = sum(letter_vals[c] for c in token if c in letter_vals)
        if total == 666:
            words.append(token)
    pprint.pprint(words)
    return len(words)

 gematria_666(nltk.corpus.state_union.raw().lower())
    ['eloquent',
     'outlook',
     'market',
     'retain',
     'market.',
     'extra',
     'miraculous',
     'philosophy',
     'squander',
     'papers',
     'competency',
     'retina',
     'smallpox...and',
     'papers.']
 14
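For part (c), one possible approach is to group a vocabulary by Gematria value and, with some probability, swap each word for a different word from the same group. The following is only a sketch under stated assumptions: the helper names gematria_value and build_index, the prob and rng parameters, and the five-word vocabulary are all hypothetical and chosen for illustration.

```python
import random

letter_vals = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5, 'f':80, 'g':3, 'h':8, 'i':10,
               'j':10, 'k':20, 'l':30, 'm':40, 'n':50, 'o':70, 'p':80, 'q':100,
               'r':200, 's':300, 't':400, 'u':6, 'v':6, 'w':800, 'x':60, 'y':10, 'z':7}

def gematria_value(word):
    """Sum the letter values of word, ignoring characters without a value."""
    return sum(letter_vals[c] for c in word.lower() if c in letter_vals)

def build_index(vocabulary):
    """Map each Gematria value to the sorted list of words having that value."""
    index = {}
    for word in set(vocabulary):
        index.setdefault(gematria_value(word), []).append(word)
    for words in index.values():
        words.sort()
    return index

def decode(tokens, index, prob=0.5, rng=random):
    """Replace each token, with probability prob, by another word of equal value."""
    result = []
    for token in tokens:
        candidates = [w for w in index.get(gematria_value(token), []) if w != token]
        if candidates and rng.random() < prob:
            result.append(rng.choice(candidates))
        else:
            result.append(token)
    return result

# Hypothetical toy vocabulary; 'market', 'retain', 'retina' and 'papers'
# all sum to 666, while 'the' has no equivalent here and is left alone.
index = build_index(['market', 'retain', 'retina', 'the', 'papers'])
print(decode(['the', 'market'], index, prob=1.0))
```

With a real corpus, the vocabulary passed to build_index would be the set of word tokens from that corpus, and the replacement chosen for a word varies from run to run.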

Chapter 5

Tagging (POS)

What follows what?

NLTK has pre-tagged corpora and sample data.

14. Use sorted() and set() to get a sorted list of tags used in the Brown corpus, removing duplicates.

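import pprint

def make_tag_set(tagged_words):
    # set() drops duplicate tags; sorted() returns them as an ordered list.
    tag_s = set(t for (w, t) in tagged_words)
    tags = sorted(tag_s)
    pprint.pprint(tags)    # tags.out
    return tags            # pprint.pprint() itself returns None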
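The same result can be written as a single expression with sorted() and set(). A minimal sketch using a hand-made list of (word, tag) pairs in place of the Brown corpus, so that it runs without the NLTK data; the names sample_tagged and tag_list are illustrative.

```python
# Hand-made (word, tag) pairs standing in for nltk.corpus.brown.tagged_words().
sample_tagged = [('The', 'AT'), ('jury', 'NN'), ('said', 'VBD'),
                 ('the', 'AT'), ('election', 'NN')]

# set() removes duplicate tags; sorted() returns them as an ordered list.
tag_list = sorted(set(tag for (word, tag) in sample_tagged))
print(tag_list)   # ['AT', 'NN', 'VBD']
```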

Chapter 6

Text classification, using POS-tagging and other features.

Chapter 7

Information extraction

Drawing syntactical trees...

import nltk

tree1 = nltk.Tree('NP', ['Alice'])
print(tree1)                      # (NP Alice)
tree2 = nltk.Tree('NP', ['the', 'rabbit'])
tree2                             # Tree('NP', ['the', 'rabbit'])
print(tree2)                      # (NP the rabbit)
tree3 = nltk.Tree('VP', ['chased', tree2])
tree4 = nltk.Tree('S', [tree1, tree3])
print(tree4)                      # (S (NP Alice) (VP chased (NP the rabbit)))
tree4.draw()                      # alicetree.png

http://cs.marlboro.edu/courses/fall2011/jims_tutorials/elias/2011-11-22
last modified Tuesday November 22 2011 3:40 pm EST
