Apr 26
string search
Looking for a word (length m) in a larger body of text (length n) is a *very* common task. Turns out there are many clever tricks for doing this.
1. naive approach
The idea:
- left-to-right, shift word by 1 char on failure
- O(n*m)
- simple; works. But ....
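A minimal Python sketch of the naive approach, for reference. The function name `naive_search` and the convention of returning -1 when nothing is found are choices made for this sketch.

```python
def naive_search(text: str, word: str) -> int:
    """Return the index of the first occurrence of word in text, or -1."""
    n, m = len(text), len(word)
    # Try every alignment of the word against the text, left to right.
    for start in range(n - m + 1):
        k = 0
        # Compare character by character until a mismatch or a full match.
        while k < m and text[start + k] == word[k]:
            k += 1
        if k == m:
            return start
        # On failure, the loop simply shifts the word by 1 char.
    return -1
```

The two nested loops are where the O(n*m) comes from: up to n alignments, each comparing up to m characters.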
2. Knuth-Morris-Pratt
Our first attempt at something better.
The idea:
- left-to-right again, but on a partial match use information from the word itself to skip ahead
- need to precalculate a skip table: for each position in the word, the number of characters just before it that match the beginning of the word
Skip table example (from Wikipedia article; W[11] and W[14] are the spaces):
i     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
W[i]  P  A  R  T  I  C  I  P  A  T  E     I  N     P  A  R  A  C  H  U  T  E
T[i] -1  0  0  0  0  0  0  0  1  2  0  0  0  0  0  0  1  2  3  0  0  0  0  0
For example, if the text to be searched includes ...PARTICIPATE IN PARROTS... then the match fails at the 2nd R in PARROTS (word index 18, where the word has an A). Since T[18] = 3, we'd slide the word forward until its leading PAR lined up with the PAR in PARROTS (which starts 3 chars back from the current location) and keep comparing from there. If instead the C in PARACHUTE failed (index 19, T[19] = 0), we'd jump forward 19 characters, since no ending of the PARA just before it matches the beginning of the word.
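A Python sketch of Knuth-Morris-Pratt using the same table convention as the example (T[0] = -1, and T[i] counts the characters just before position i that match the beginning of the word). The names `build_skip_table` and `kmp_search` are made up for this sketch, and it returns only the first match.

```python
def build_skip_table(word: str) -> list[int]:
    """T[i] = how many chars just before word[i] match the start of word."""
    m = len(word)
    table = [0] * m
    if m == 0:
        return table
    table[0] = -1          # special marker: nothing to fall back to
    pos, cnd = 2, 0        # pos: entry being filled; cnd: current match length
    while pos < m:
        if word[pos - 1] == word[cnd]:
            cnd += 1       # the matching prefix grows by one char
            table[pos] = cnd
            pos += 1
        elif cnd > 0:
            cnd = table[cnd]   # fall back to a shorter matching prefix
        else:
            table[pos] = 0
            pos += 1
    return table


def kmp_search(text: str, word: str) -> int:
    """Return the index of the first occurrence of word in text, or -1."""
    if not word:
        return 0
    table = build_skip_table(word)
    start = 0   # where the current alignment of word begins in text
    i = 0       # index into word
    while start + i < len(text):
        if word[i] == text[start + i]:
            i += 1
            if i == len(word):
                return start
        elif table[i] > -1:
            # Slide the word forward, keeping the table[i] chars already matched.
            start += i - table[i]
            i = table[i]
        else:
            start += 1
            i = 0
    return -1
```

Note that the text index `start + i` never moves backwards, which is what brings the search down to O(n) after the O(m) table setup.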
Sources:
3. Boyer-Moore
One of the most popular methods in practice.
The idea:
- each comparison starts at the end of the word, not the start (the word still moves left-to-right through the text)
- so when the match goes wrong, the word can often be slid ahead by more than 1 char
- there are two shifts that can be made (both are precalculated); take the larger of the two
  - a "bad character" table (how far to shift to get the text character that missed to line up with an occurrence of it in the word)
  - a "good suffix" table (how far to shift to get the part of the word that already matched to fit again)
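A simplified Python sketch of the two-table idea. Assumptions worth flagging: the good suffix table here is built by brute force (trying each shift in turn) rather than by the usual linear-time preprocessing, and it uses the plain ("weak") form of the rule. All function names are made up for this sketch.

```python
def bad_character_table(word: str) -> dict[str, int]:
    """Last index at which each character occurs in word."""
    return {ch: idx for idx, ch in enumerate(word)}


def good_suffix_table(word: str) -> list[int]:
    """shifts[k] = smallest shift that keeps the matched suffix word[k+1:]
    consistent with the word, for a mismatch at word[k].
    Brute force for clarity, not the linear-time construction."""
    m = len(word)
    shifts = [1] * m
    for k in range(m):
        for s in range(1, m + 1):
            # After shifting by s, every matched char must still agree
            # (or fall off the left end of the word).
            if all(j - s < 0 or word[j - s] == word[j] for j in range(k + 1, m)):
                shifts[k] = s
                break
    return shifts


def boyer_moore_search(text: str, word: str) -> int:
    """Return the index of the first occurrence of word in text, or -1."""
    n, m = len(text), len(word)
    if m == 0:
        return 0
    last = bad_character_table(word)
    good = good_suffix_table(word)
    start = 0
    while start <= n - m:
        k = m - 1
        # Compare from the end of the word back toward its start.
        while k >= 0 and word[k] == text[start + k]:
            k -= 1
        if k < 0:
            return start
        # Bad character: line the missed text char up with its last
        # occurrence in the word (this can be <= 0, so good[k] >= 1 wins then).
        bad_shift = k - last.get(text[start + k], -1)
        start += max(bad_shift, good[k])
    return -1
```

When the text character that missed does not occur in the word at all, the bad character shift moves the word completely past it, which is why long words over large alphabets tend to search quickly.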
Sources:
Discussion
- bioinformatics
- complicated tricks