Apr 5
homework
Discuss state of open-ended LZW or Huffman
coding assignment. Easy? Hard? Finished?
Need more time?
As we move forward with gathering tools,
one of the keys is building up a tested,
stable, usable set of utilities. For example,
the graph algorithms use a priority queue,
LZW uses a bit-access buffer and some sort of
lookup table, and so on.
For this homework, one of the ways to make it
manageable is to decide which parts you're going
to code, and which parts you're just going to
find (and cite) from an online source.
Midterm project comments and grades are up: nice work.
sample implementation
Included utilities:
- bit processing
- command line processing
- binary tree (for string lookup table)
- other?
See attached for an example of both in action.
hash tables
From the discussion in section 5.1, pg 177 of Wirth :
Let the keys be 16-character names ; number of possible keys = 26<sup>16</sup>.
Let there be a thousand actual names ; number to store = 10<sup>3</sup>.
Then "hashing", "a hash table", a "Python dictionary",
and an "associative array" are all names for a way
to store each person's information in an array of about
that size, but still jump to the right person from their name.
The idea has two parts.
1. Find a function
Hash(key) = index
that turns a 16-character key into a number 0..999
(or perhaps a larger array, such as double size or more)
spread evenly and apparently randomly over that range.
Anything that "mixes up" the keys can be used
(hence the name "hash" in the first place); typically
something like "ord(key) mod some_prime" (with ord(key) being
the key read as one number) works pretty well.
Often some sort of XOR folding is used; see the discussion
at http://en.wikipedia.org/wiki/Hash_function
and http://en.wikipedia.org/wiki/List_of_hash_functions
Desired properties:
a) fast
b) uniform across array indices
2. Since many strings map to the same number (Quick quiz: how many?)
we need to deal with the situation when we get to the right
location but find the wrong string there. (This also means
we need to store the string as well as any other data at
that array index; typically there's a pointer there
to a data structure.)
Two collision-handling mechanisms are common :
i) put all items with same index into a linked list
(or other searchable thing) outside the hash's array.
ii) or use some other spot(s) in the array, looking
(in some deterministic way) for one that isn't used yet.
Typically we add an offset, mod the size of the table.
a) variation 1: linear probing ... but can cluster entries
b) variation 2: quadratic probing (offsets i<sup>2</sup>, computed with a recurrence)
These work best if the size of the table is prime,
since otherwise the offsets may well not be uniform.
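To make the two parts concrete, here's a minimal sketch of part 1
(a hash function) together with option ii.a (linear probing).
It is only an illustration under some assumptions, not the code from
Wirth or from this course: the names hash_key, find_slot, table_put
and table_get, the multiplier 31, and the prime table size 2003 are
all made up for this example, and there is no delete operation.

```c
#include <assert.h>
#include <string.h>

#define TABLE_SIZE 2003   /* assumed: a prime, about double the 1000 names */
#define KEY_LEN 16        /* names are at most 16 characters */

/* One slot. The key is stored next to its data so that a lookup can
   tell "right index, wrong string" apart from a real match. */
struct slot {
    char key[KEY_LEN + 1];
    int value;            /* stand-in for "any other data" */
    int used;
};

static struct slot table[TABLE_SIZE];   /* zero-initialized: all unused */

/* Part 1: mix the characters into one number, then reduce it mod the
   table size. The multiplier 31 is an arbitrary illustrative choice. */
unsigned int hash_key(const char *key) {
    unsigned int h = 0;
    while (*key != '\0') {
        h = h * 31u + (unsigned char)*key;
        key++;
    }
    return h % TABLE_SIZE;
}

/* Part 2, option ii.a: linear probing. Start at hash_key(key) and add
   an offset of 1, 2, 3, ... (mod TABLE_SIZE) until we reach either the
   key itself or an unused slot. Returns -1 if the table is full. */
static int find_slot(const char *key) {
    unsigned int start = hash_key(key);
    for (unsigned int i = 0; i < TABLE_SIZE; i++) {
        unsigned int idx = (start + i) % TABLE_SIZE;
        if (!table[idx].used || strcmp(table[idx].key, key) == 0)
            return (int)idx;
    }
    return -1;
}

/* Insert or update. Returns 0 on success, -1 if the table is full. */
int table_put(const char *key, int value) {
    assert(key != NULL && strlen(key) <= KEY_LEN);
    int idx = find_slot(key);
    if (idx < 0)
        return -1;
    strcpy(table[idx].key, key);
    table[idx].value = value;
    table[idx].used = 1;
    return 0;
}

/* Look up. Returns 1 and fills *value if found, 0 otherwise. */
int table_get(const char *key, int *value) {
    int idx = find_slot(key);
    if (idx < 0 || !table[idx].used)
        return 0;
    *value = table[idx].value;
    return 1;
}
```

For instance, after table_put("alice", 42), a call to
table_get("alice", &v) should return 1 and leave 42 in v, while
table_get on a name never stored should return 0. Switching to
quadratic probing would only change how idx is computed in find_slot.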
Here's some sample C code :