Apr 26

string search

Looking for a word (length m) in a larger body of text (length n) is a *very* common task. It turns out there are many clever tricks for doing this.

1. naive approach

The idea:

left-to-right, shift word by 1 char on failure
O(n*m) in the worst case
simple; works. But ....
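
A minimal sketch of the naive approach, in Python. The function name `naive_search` and the convention of returning -1 on failure are choices made here for illustration, not something fixed by these notes.

```python
def naive_search(text, word):
    """Return the index of the first occurrence of word in text, or -1."""
    n, m = len(text), len(word)
    for start in range(n - m + 1):          # try every alignment of word
        k = 0
        while k < m and text[start + k] == word[k]:
            k += 1                          # compare left-to-right
        if k == m:                          # every char matched
            return start
        # on failure, the for loop shifts the word by just 1 char
    return -1
```

The inner loop can run up to m steps for each of roughly n alignments, which is where O(n*m) comes from. Text like "aaaa...a" with word "aa...ab" is the bad case.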

2. Knuth-Morris-Pratt

Our first attempt at something better.

The idea:

left-to-right again, but on a partial match use information from the word itself to skip ahead
need to pre-calculate a skip table: for each position in the word, how many of the characters just before it match the beginning of the word

Skip table example (from Wikipedia article):

   i   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
W[i]   P  A  R  T  I  C  I  P  A  T  E     I  N     P  A  R  A  C  H  U  T  E
T[i]  -1  0  0  0  0  0  0  0  1  2  0  0  0  0  0  0  1  2  3  0  0  0  0  0

(W[11] and W[14] are the spaces in "PARTICIPATE IN PARACHUTE".)

For example, if the text to be searched includes ...PARTICIPATE IN PARROTS... then the match fails at the 2nd R in PARROTS (position 18 of the word, where T is 3). We'd slide the word forward until its leading PAR lines up with the PAR in PARROTS that we've already seen (3 chars back from the current location), and carry on comparing from there. If instead the match failed at the C in PARACHUTE (position 19, where T is 0), we'd slide the word forward 19 characters, since no ending of the PARA just before matches the beginning of the word.
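
A sketch of both halves of Knuth-Morris-Pratt in Python, following the T[0] = -1 convention of the table above. The names `kmp_table` and `kmp_search` are made up for this example, and this is one common way to write the algorithm rather than the only one.

```python
def kmp_table(word):
    """Skip table: T[i] is the number of chars just before position i
    that match the beginning of word; T[0] is -1 by convention."""
    m = len(word)
    table = [0] * m
    if m == 0:
        return table
    table[0] = -1
    pos, cnd = 2, 0            # pos: entry being filled; cnd: current match length
    while pos < m:
        if word[pos - 1] == word[cnd]:
            cnd += 1           # the match with the start of word grows
            table[pos] = cnd
            pos += 1
        elif cnd > 0:
            cnd = table[cnd]   # fall back to a shorter match and retry
        else:
            table[pos] = 0
            pos += 1
    return table


def kmp_search(text, word):
    """Return the index of the first occurrence of word in text, or -1."""
    if not word:
        return 0
    table = kmp_table(word)
    start, i = 0, 0            # start: alignment in text; i: position in word
    while start + i < len(text):
        if text[start + i] == word[i]:
            i += 1
            if i == len(word):
                return start
        elif table[i] > -1:
            start += i - table[i]   # slide word so the known prefix lines up
            i = table[i]            # and don't re-compare that prefix
        else:
            start += 1              # mismatch on the very first char
            i = 0
    return -1
```

With the example word, a failure at position 18 gives a slide of 18 - T[18] = 15 and a failure at position 19 gives a slide of 19 - T[19] = 19. The position in the text never moves backwards, so the search is O(n) after an O(m) table build.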

Sources:

3. Boyer-Moore

One of the most popular methods in practice.

The idea:

the word still slides left-to-right along the text, but each comparison runs from the end of the word, not the start
so when the match goes wrong, the word can often be slid ahead more than 1 char
there are two shifts that can be made; take the larger. (Both are precalculated from the word.)
- a "bad character" table (how far to shift so the text character that caused the miss lines up with an occurrence of it in the word, if there is one)
- a "good suffix" table (how far to shift so the part of the word that already matched lines up with another copy of it in the word)
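
A simplified sketch in Python that uses only the "bad character" table; the "good suffix" table is left out to keep it short, so this is not the full Boyer-Moore algorithm. The names `bad_char_table` and `bm_bad_char_search` are invented for this example.

```python
def bad_char_table(word):
    """Map each character to the index of its last occurrence in word."""
    return {ch: idx for idx, ch in enumerate(word)}


def bm_bad_char_search(text, word):
    """Return the index of the first occurrence of word in text, or -1.
    Simplification: only the bad-character shift is used."""
    n, m = len(text), len(word)
    if m == 0:
        return 0
    last = bad_char_table(word)
    start = 0                               # current alignment of word in text
    while start <= n - m:
        k = m - 1
        while k >= 0 and text[start + k] == word[k]:
            k -= 1                          # compare from the end of the word
        if k < 0:
            return start                    # whole word matched
        # text[start + k] is the bad character. Shift so its last
        # occurrence in word (index -1 if it never occurs) lines up
        # with it, but always move forward by at least 1.
        shift = k - last.get(text[start + k], -1)
        start += max(1, shift)
    return -1
```

When the bad character doesn't occur in the word at all, the shift is k + 1, which for a miss on the last character is the whole word length m. That is how the search can get away with not even looking at most of the text.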

Sources:

Discussion

bioinformatics
complicated tricks

http://cs.marlboro.edu/ courses/ spring2011/algorithms/ notes/ Apr_26
last modified Monday April 25 2011 9:54 pm EDT