""" hash.py A "hash function" is something that takes a key (e.g. a string) and produces a pseudo-random integer. Ideally it is * quick to calculate * probalistically evenly distributed in some range of values * unlikely to give same values for different strings The word 'hash' is in the sense of 'churn', or 'mix up'. One common scheme for strings, which is similar to the way "pseudo-random" numbers are produced, is to repeatedly do a short numeric calculation on the ascii value for character, folding in the next character in a loop. For cryptographic hashes (which is *not* what I'm doing here) you typically also want an 'unpredictability' so that for example similar strings don't give similar hash values. The function here does not have that property. (For example, 'abcd' and 'abce' will have myhash values that only differ by 1.) See http://effbot.org/zone/python-hash.htm for a description of python's own hash algorithm. This file will generate all the hash values for words in Moby Dick, lowercase, without punctuation, in a csv (comma separated value) form that's easy to plot. Probability of 19738 words having different myhash values if all of 'em have the same probability (approx true, I think) (D/D)*((D-1)/D)*((D-2)/D)*...*((D-19737)/D) = 0.95 so there (probably) aren't any collisions. (See http://en.wikipedia.org/wiki/Birthday_problem) These shell commands generate myhash.png and pythonhash.png $ python hash.py > hash.csv $ Rscript hash_plot.r Jim Mahoney | April 2013 """ A = 33 # 3 * 11 B = 5381 # prime C = 4294967296 # 2**32; do everything mod this def myhash(key): """ Return integer hash value given a string. """ # This is close djb2 , but I didn't want 'abc' and 'abcd' to be sequential. # see e.g. http://www.cse.yorku.ca/~oz/hash.html # and http://stackoverflow.com/questions/1579721/ result = 0 for char in key: result = (result + A * result + B + ord(char)) % C return result def moby_words(): """ Return list of words from Moby Dick """ filename = 'moby_dick.txt' file = open(filename, 'r') punc = [',', '.', '"', "'", ';', ':', '!', '(', ')', '*', '$', '?', '--', '&', '[', ']', '_'] result = set() while True: line = file.readline() if not line: result = list(result) result.sort() return result words = line.split() for word in words: word = word.lower().strip() for p in punc: word = word.replace(p, '') if word != '': result.add(word) def main(): i = 0 print " %8s, %12s, %24s, %s " % ('i', 'myhash', 'pythonhash', 'word') words = moby_words() for word in words: i += 1 print " %8i, %12i, %24i, '%s' " % (i, myhash(word), hash(word), word) prob = 1.0 for i in xrange(len(words)): prob = prob * (C - i)/C print print "## probability of no myhash collisions = ", prob main()