Jim's
Tutorials

Spring 2012
course
navigation

Levenshtein distance

This is the edit distance metric that I got really excited about over Spring Break. I also had hoped that there was an equivalent for strings, and it turns out that there is: Damerau-Levenshtein distance .
# S a t u r d a y # 0 1 2 3 4 5 6 7 8 S 1 0 1 2 3 4 5 6 7 u 2 1 1 2 2 3 4 5 6 n 3 2 2 2 3 3 4 5 6 d 4 3 3 3 3 4 3 4 5 a 5 4 3 4 4 4 4 3 4 y 6 5 4 4 5 5 5 4 3 (from Wikipedia)
## {{{ http://code.activestate.com/recipes/576874/ (r1) ''' Calculates the Levenshtein distance of 2 strings''' def printMatrix(m): print ' ' for line in m: spTupel = () breite = len(line) for column in line: spTupel = spTupel + (column, ) print "%3i"*breite % spTupel s1 = raw_input('first word: ') s2 = raw_input('second word: ') def levenshtein(s1, s2): l1 = len(s1) l2 = len(s2) matrix = [range(l1 + 1)] * (l2 + 1) for zz in range(l2 + 1): matrix[zz] = range(zz,zz + l1 + 1) for zz in range(0,l2): for sz in range(0,l1): if s1[sz] == s2[zz]: matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz]) else: matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1) print "That's the Levenshtein-Matrix:" printMatrix(matrix) return matrix[l2][l1] distance = levenshtein(s1, s2) print 'The Levenshtein-Distance of ',s1, ' and ', s2, ' is ', distance ## end of http://code.activestate.com/recipes/576874/ }}}
def dameraulevenshtein(seq1, seq2): """Calculate the Damerau-Levenshtein distance between sequences. This distance is the number of additions, deletions, substitutions, and transpositions needed to transform the first sequence into the second. Although generally used with strings, any sequences of comparable objects will work. Transpositions are exchanges of *consecutive* characters; all other operations are self-explanatory. This implementation is O(N*M) time and O(M) space, for N and M the lengths of the two sequences. >>> dameraulevenshtein('ba', 'abc') 2 >>> dameraulevenshtein('fee', 'deed') 2 It works with arbitrary sequences too: >>> dameraulevenshtein('abcd', ['b', 'a', 'c', 'd', 'e']) 2 """ # codesnippet:D0DE4716-B6E6-4161-9219-2903BF8F547F # Conceptually, this is based on a len(seq1) + 1 * len(seq2) + 1 matrix. # However, only the current and two previous rows are needed at once, # so we only store those. oneago = None thisrow = range(1, len(seq2) + 1) + [0] for x in xrange(len(seq1)): # Python lists wrap around for negative indices, so put the # leftmost column at the *end* of the list. This matches with # the zero-indexed strings and saves extra calculation. twoago, oneago, thisrow = oneago, thisrow, [0] * len(seq2) + [x + 1] for y in xrange(len(seq2)): delcost = oneago[y] + 1 addcost = thisrow[y - 1] + 1 subcost = oneago[y - 1] + (seq1[x] != seq2[y]) thisrow[y] = min(delcost, addcost, subcost) # This block deals with transpositions if (x > 0 and y > 0 and seq1[x] == seq2[y - 1] and seq1[x-1] == seq2[y] and seq1[x] != seq2[y]): thisrow[y] = min(thisrow[y], twoago[y - 2] + 1) return thisrow[len(seq2) - 1]
I was thinking that I could use this and/or some other similar distance metric for parts of my project.
http://cs.marlboro.edu/ courses/ spring2012/jims_tutorials/ elias/ Levenshtein_distance
last modified Wednesday March 28 2012 9:26 am EDT