A '''hash''' is a data structure. In Tcl, it is used for [array]s, [dict]s, and many other things. For [C] programmers, the convenient and generic hash functions are one of the advantages of using the [Tcl Portable Runtime Library]. A hash associates a ''value'' to each of zero or more ''keys'', which are typically not sequential. The basic idea is to store the pairs of key and value in a vector (like a Tcl list), at a position computed by applying the ''hash function'' to the key. A good hash function thoroughly mixes the bits in the key and thereby produces a seemingly random number from each possible key. Chances are fairly good that no two keys in a particular hash get the same hash number, and then one can find the value corresponding to a key by computing its hash number and looking up the corresponding entry in the hash table vector. Of course, sometimes two keys ''do'' get the same number, and then one need some strategy for handling such situations. Older literature often consider schemes for walking around the hash table vector until an empty entry is found an putting key+value pairs there when inserting; similarly one would when looking up a key have to walk ahead until finding an empty entry before concluding that the key wasn't in the hash. Newer implementations rather tend to have each vector entry (a "bucket") be a pointer to a linked list of the key+value pairs that should fall in this entry. Efficiency in hashes depend on having sufficiently many buckets that collisions don't happen too often, but having too many buckets is a waste of memory. Tcl hashes automatically adjust to the number of keys by occationally resizing themselves. See also: * [array statistics] * [Judy arrays] * [skiplist] ---- [DKF]: The traditional hashing function used in Tcl is (abbreviated): ====== unsigned result = 0, ch; for (ch=*string++ ; ch ; ch=*string++) { result += (result<<3) + ch; } return result; ====== That's not bad, because it both keeps the bits low down and mixes them up higher up. In 8.6, we are trialling the FNV hash function instead: ====== unsigned result = 0, ch; #define FNV_32_PRIME ((unsigned) 0x01000193) for (ch=*string++ ; ch ; ch=*string++) { result = (result * FNV_32_PRIME) ^ ch; } return result; ====== It seems to be a bit slower (not that the hash function is a bottleneck before or after) but it has impeccable credentials as a hash function. (The better ones are all much more complex.) Distribution results (i.e., the pattern of bucket chain lengths reported by [array statistics]) are inconclusive. One benefit of the new algorithm is that it makes it trickier to create hash collisions. This code (thanks to [Joe English]) demonstrates: ====== proc collide {level {accum ""}} { global a if {$level} { incr level -1 collide $level "aj$accum" collide $level "ba$accum" } else { set a($accum) 1 } } puts [time {collide 15}] puts [array statistics a] ====== For 8.5, it produces this: ====== 35404625 microseconds per iteration 32768 entries in table, 16384 buckets number of buckets with 0 entries: 16383 number of buckets with 1 entries: 0 number of buckets with 2 entries: 0 number of buckets with 3 entries: 0 number of buckets with 4 entries: 0 number of buckets with 5 entries: 0 number of buckets with 6 entries: 0 number of buckets with 7 entries: 0 number of buckets with 8 entries: 0 number of buckets with 9 entries: 0 number of buckets with 10 or more entries: 1 average search distance for entry: 16384.5 ====== With the new function: ====== 104840 microseconds per iteration 32768 entries in table, 16384 buckets number of buckets with 0 entries: 8337 number of buckets with 1 entries: 626 number of buckets with 2 entries: 1190 number of buckets with 3 entries: 1568 number of buckets with 4 entries: 1602 number of buckets with 5 entries: 1330 number of buckets with 6 entries: 826 number of buckets with 7 entries: 470 number of buckets with 8 entries: 252 number of buckets with 9 entries: 123 number of buckets with 10 or more entries: 60 average search distance for entry: 3.0 ====== <> Concept | Data Structure