march 8
- point to midterm assignment
- overview of chapter 7 - time vs space tradeoffs and preprocessing
sorting by counting
Describe the basic idea: counting how many you have of each thing.
Works particularly well if you have a small-ish number of different
things, and you know this in advance. Takes a huge amount of space
if there are lots of possible things, most of which won't be in the list
under consideration, since you need one counter per possible thing.
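A minimal Python sketch of the counting idea, assuming the items are integers known in advance to lie between low and high. The function name counting_sort and its parameters are made up for illustration.

```python
def counting_sort(values, low, high):
    """Sort integers known to lie in low..high by counting occurrences."""
    counts = [0] * (high - low + 1)      # one counter per possible value
    for v in values:
        counts[v - low] += 1             # tally how many of each we have
    result = []
    for offset, count in enumerate(counts):
        result.extend([low + offset] * count)   # emit each value 'count' times
    return result

grades = [3, 1, 4, 1, 5, 2, 3, 1]
print(counting_sort(grades, 1, 5))       # [1, 1, 1, 2, 3, 3, 4, 5]
```

The counts array has one entry per possible value, which is where the space cost comes from when the range is large.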
hashes
The idea is to reduce searching for something to an O(1) operation by
just jumping directly to it. To do this you need to find a "hash function"
that tells you where to jump to. This function is typically a map
f(key)=integer which sends the keys you're searching over to fairly random
numbers in a range that's larger than your number of items, but small enough
to index your storage array.
example
Suppose you want to store information
on n=500 people, using their names as keys. You'd like to be able to
find a specific person quickly using a hash table.
Choose as a hash function
f(string) = product of ascii values of characters mod 7919
where I picked a prime number (the 1000th prime, actually)
much bigger than n (about 10 times bigger) but not so big that
I can't easily allocate storage space for my hash table, H[0..7918].
Then given anyone's name, say "Jim Mahoney", I apply the hash function
to calculate the corresponding integer.
Using Mathematica's programming language (just for kicks):
f[x_] := Mod[ Apply[Times, ToCharacterCode[x]], 7919];
f["Jim Mahoney"]
5228
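For comparison, a rough Python version of the same hash function, plus how the result might be used as an index into the table H. The names name_hash and TABLE_SIZE and the stored record are assumptions for illustration. Reducing mod 7919 after every multiplication gives the same answer as taking the whole product first, but keeps the intermediate numbers small.

```python
TABLE_SIZE = 7919                        # the prime chosen in the example

def name_hash(key):
    """Product of the character codes of key, mod TABLE_SIZE."""
    h = 1
    for ch in key:
        h = (h * ord(ch)) % TABLE_SIZE   # reduce at each step
    return h

H = [None] * TABLE_SIZE                  # the hash table H[0..7918]

# Store a (hypothetical) record, ignoring collisions for now.
H[name_hash("Jim Mahoney")] = {"name": "Jim Mahoney"}

# Lookup is one hash computation and one array access.
print(H[name_hash("Jim Mahoney")])
```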
The drawback to this method is that two different keys may give the same index,
which is called a "collision". The algorithm you use to search, store, and delete
items from the hash table needs to be able to deal with these collisions - this is
the price you pay for the speed of the O(1) lookup in most cases.
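A quick way to see a collision with this particular hash function: multiplication doesn't care about order, so any two keys made of the same characters land on the same index. A small sketch (name_hash is an illustrative name, and the two names are made up):

```python
def name_hash(key, size=7919):
    """Product of the character codes of key, mod size."""
    h = 1
    for ch in key:
        h = (h * ord(ch)) % size
    return h

# Same characters in a different order, so the products are equal.
print(name_hash("Amy Roland") == name_hash("Roland Amy"))   # True
```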
The two most common ways of dealing with collisions are
- external: if there's a collision, store all collided entries in a linked list, and search that list sequentially. This requires extra storage outside the hash table.
- internal: put that data somewhere else in the hash table, either in the next empty slot or in a quasi-random-but-deterministic spot. Either way, the insertion/deletion/search methods need to "do the right thing".
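Both approaches can be sketched in Python. This is only an illustration under some assumptions: the class and method names are invented, Python lists stand in for linked lists, and the internal version uses the "next empty slot" rule (usually called linear probing; the external version is usually called separate chaining).

```python
def name_hash(key, size):
    """Product of the character codes of key, mod size."""
    h = 1
    for ch in key:
        h = (h * ord(ch)) % size
    return h

class ChainedTable:
    """External: each slot holds a list of (key, value) pairs."""
    def __init__(self, size=7919):
        self.size = size
        self.slots = [[] for _ in range(size)]

    def store(self, key, value):
        bucket = self.slots[name_hash(key, self.size)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def search(self, key):
        for k, v in self.slots[name_hash(key, self.size)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        i = name_hash(key, self.size)
        self.slots[i] = [(k, v) for k, v in self.slots[i] if k != key]

class ProbingTable:
    """Internal: on a collision, step to the next slot, wrapping around."""
    def __init__(self, size=7919):
        self.size = size
        self.slots = [None] * size

    def store(self, key, value):
        i = name_hash(key, self.size)
        for _ in range(self.size):
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return
            i = (i + 1) % self.size
        raise RuntimeError("hash table is full")

    def search(self, key):
        i = name_hash(key, self.size)
        for _ in range(self.size):
            if self.slots[i] is None:    # empty slot: key was never stored
                return None
            if self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + 1) % self.size
        return None

for table in (ChainedTable(), ProbingTable()):
    table.store("Amy Roland", 1)
    table.store("Roland Amy", 2)         # same characters, so same index
    print(table.search("Amy Roland"), table.search("Roland Amy"))   # 1 2
```

Deletion is left out of ProbingTable on purpose. Simply emptying a slot would make search stop too early for keys stored past it, so a real version needs some kind of "deleted" marker.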