A '''hash''' is a data structure.
In Tcl, it is used for [array]s, [dict]s, and many other things.
For [C] programmers, the convenient and generic hash functions
are one of the advantages of using the [Tcl Portable Runtime Library].
A hash associates a ''value'' to each of zero or more ''keys'',
which are typically not sequential. The basic idea is to store the
pairs of key and value in a vector (like a Tcl list), at a position
computed by applying the ''hash function'' to the key. A good
hash function thoroughly mixes the bits in the key and thereby
produces a seemingly random number from each possible key.
Chances are fairly good that no two keys in a particular hash
get the same hash number, and then one can find the value
corresponding to a key by computing its hash number and looking
up the corresponding entry in the hash table vector.
Of course, sometimes two keys ''do'' get the same number, and
then one need some strategy for handling such situations. Older
literature often consider schemes for walking around the hash table
vector until an empty entry is found an putting key+value pairs there
when inserting; similarly one would when looking up a key have to
walk ahead until finding an empty entry before concluding that the
key wasn't in the hash. Newer implementations rather tend to have
each vector entry (a "bucket") be a pointer to a linked list of the
key+value pairs that should fall in this entry.
Efficiency in hashes depend on having sufficiently many buckets
that collisions don't happen too often, but having too many buckets
is a waste of memory. Tcl hashes automatically adjust to the number
of keys by occationally resizing themselves.
See also:
* [array statistics]
* [Judy arrays]
* [skiplist]
----
[DKF]: The traditional hashing function used in Tcl is (abbreviated):
======
unsigned result = 0, ch;
for (ch=*string++ ; ch ; ch=*string++) {
result += (result<<3) + ch;
}
return result;
======
That's not bad, because it both keeps the bits low down and mixes them up higher up.
In 8.6, we are trialling the [http://isthe.com/chongo/tech/comp/fnv/%|%FNV hash%|%] function instead:
======
unsigned result = 0x811c9dc5, ch;
#define FNV_32_PRIME ((unsigned) 0x01000193)
for (ch=*string++ ; ch ; ch=*string++) {
result = (result * FNV_32_PRIME) ^ ch;
}
return result;
======
''[Lars H]: Is that exactly what you're using, or a slightly simplified form for demonstration purposes? In FNV-1, one is not supposed to start at 0.''
<
>''[DKF]: That was a bug, subsequently corrected.''
It seems to be a bit slower (not that the hash function is a bottleneck before or after) but it has impeccable credentials as a hash function. (The better ones are all much more complex.) Distribution results (i.e., the pattern of bucket chain lengths reported by [array statistics]) are inconclusive.
One benefit of the new algorithm is that it makes it trickier to create hash collisions. This code (thanks to [Joe English]) demonstrates:
======
proc collide {level {accum ""}} {
global a
if {$level} {
incr level -1
collide $level "aj$accum"
collide $level "ba$accum"
} else {
set a($accum) 1
}
}
puts [time {collide 15}]
puts [array statistics a]
======
For 8.5, it produces this:
======
35404625 microseconds per iteration
32768 entries in table, 16384 buckets
number of buckets with 0 entries: 16383
number of buckets with 1 entries: 0
number of buckets with 2 entries: 0
number of buckets with 3 entries: 0
number of buckets with 4 entries: 0
number of buckets with 5 entries: 0
number of buckets with 6 entries: 0
number of buckets with 7 entries: 0
number of buckets with 8 entries: 0
number of buckets with 9 entries: 0
number of buckets with 10 or more entries: 1
average search distance for entry: 16384.5
======
With the new function:
======
978683 microseconds per iteration
32768 entries in table, 16384 buckets
number of buckets with 0 entries: 8345
number of buckets with 1 entries: 601
number of buckets with 2 entries: 1144
number of buckets with 3 entries: 1657
number of buckets with 4 entries: 1616
number of buckets with 5 entries: 1271
number of buckets with 6 entries: 848
number of buckets with 7 entries: 487
number of buckets with 8 entries: 237
number of buckets with 9 entries: 111
number of buckets with 10 or more entries: 67
average search distance for entry: 3.0
======
----
Could someone please explain the "buckets with 0 entries" item? ([RT] 10Feb2010)
[DKF]: Sure. Tcl uses a hash scheme which assigns each key being hashed to a “bucket”, which is an array field that points to a linked list of “hash entries”. (Each entry holds the mapping from one key to one value.) When a bucket has no entries attached to it, it's empty; no keys are actually mapped to it. Sure, it's inefficient that there are empty buckets but it's not as terrible as all that; insisting that there are none is going towards perfect hashing, a different hashing scheme that typically requires deep knowledge of the keys – both quantity and format – being hashed to generate the hash function since it's vital with perfect hashing to have no collisions. Perfect hashing is totally unsuitable for the Tcl library as we don't want to constrain the keys at all.
The genius of hashing is that, providing the hash function generates a uniformly random distribution of hashcodes for the input keys, you can get a very efficient lookup scheme that avoids having to do a lot of comparisons or manage complex data structures like balanced trees. We don't get perfect distributions in practice of course (they require cryptographic hashing, which is relatively expensive) but the results are still Good Enough™ in most cases.
<> Concept | Data Structure