Hash

Difference between version 26 and 27 - Previous - Next
A '''[https://en.wikipedia.org/wiki/Hash_function%|%hash]''' is a value,
deterministically produced by a '''hash function''' from its input value, that
serves to segment values into groups.  When there is nearly no
chance that there will ever be more than one value in a group, the hash is
called a '''signature'''.  When the same hash is produced from two values, it
is a '''collision'''.

A '''hash table''' is a [data structure].  In [Tcl] it is used for
[array%|%arrays], [dict%|%dictionaries], and many other things.

For [C] programmers, the [Tcl Portable Runtime Library] provides performant
generic hash functions that support various types of keys, and are useful even
if a project doesn't include a Tcl interpreter.



** See Also **

   [array statistics]:   

   [crypt]:   The *nix system for hashing passwords.

   [Judy arrays]:   

   Tcllib's [md5crypt]:   Creates MD5 hased passwords as used in more modern *nix systems. Also compatible with Apache's htpasswd command.

   [skiplist]:   

   http://web.eecs.umich.edu/~mckay/computer/index.html%|%hash table extension%|%:   

   [http://playcontrol.net/opensource/LuaHashMap/benchmarks.html%|%Hash Table Shootout 2: Rise of the Interpreter Machines] Eric Wing, 2012-12-23:   Eric tries to figure out why the Tcl hash table implementation is beating the pants off everything else.

   [https://code.activestate.com/lists/tcl-core/8719/%|%TCLCORE Upgrade of Tcl's String Hash Algorithm], 2010-10-07:   

   [http://code.activestate.com/lists/tcl-core/12949/%|%Re: TCLCORE A hash table benchmark - or why did we really discuss last year the hash table algorithm if tcl is that good]:   



** Documentation **

   [http://www.tcl.tk/man/tcl/TclLib/Hash.htm%|%official reference:  C API]:   



** Description **

The hash of a value can be used to group values.  A hash may have some clear
relationship to the value it was derived from, or it may appear to have no
relationship at all.  It generally requires more effort to produce a hash that
seems to be more random, and apparent lack of any relation to the input value
is a requirement, a '''cryptographic hash''', which must also be
'''collision-resistant''', and entirely unamenable to being deduced by any
other means.

A hash can be used to quickly look up a key for a value:  Hash each key and
store all keys with the same hash together.  When the value for a key is
requested, hash the key to determine its group, scan through the group to find
the stored key, and retrieve the value associated with it.  This is how
`[dict%|%dictionaries]` and `[array%|%arrays]` are implemented.  If the groups
are getting overpopulated, switch to a hash function that provides more groups,
and then move the existing data into the new groups.

A common way to store groups of values is as a linked list, a pointer to which
is stored in an array which is used as the '''hash table'''.   From time to
time the number of buckets is adjusted to maintain both performance and storage
efficiency. In Tcl, hash tables are automatically adjusted in this manner.

An older way of implementing a hash table is to store a value directly at the
position indicated by its hash, and if that position is already occupied, to
follow some strategy for handling the collision. Older literature often
consider schemes for walking around the hash table vector until an empty entry
is found and putting key+value pairs there when inserting; similarly one would
when looking up a key have to walk ahead until finding an empty entry before
concluding that the key wasn't in the hash.  This approach makes the delete
operation perilous, since replacing the deleted entry with an empty entry means
that like-hashed entries past it can no longer be found.  Thus there must be a
"tombstone" entry which is empty in the sense that it can be overwritten but
not empty in the sense that it doesn't terminate the chain.



** Cryptographic Hash Functions **
   * [https://core.tcl-lang.org/akupries/ankh/doc/trunk/doc/README.md%|%AnKH, a critcl wrapper around a collection of cryptographic hashes]
   * [blake3]:
   * [sha-256]
   * [sha-512]
   [sha-256]:   

   [sha-512]:   



** Less Secure Cryptographic Hash Functions **

   [RIPEMD]:   


** Broken Cryptographic Hash Functions **

   [MD4]:   

   [md5]:   

   [sha-1]:   


** Tcl Internal Hashing Function **

[DKF]: The traditional hashing function used in Tcl is (abbreviated):

======c
unsigned result = 0, ch;
for (ch=*string++ ; ch ; ch=*string++) {
    result += (result<<3) + ch;
}
return result;
======

That's not bad, because it both keeps the bits low down and mixes them up
higher up.

In 8.6, we are trialling the [http://isthe.com/chongo/tech/comp/fnv/%|%FNV
hash%|%] function instead:

======c
unsigned result = 0x811c9dc5, ch;
#define FNV_32_PRIME    ((unsigned) 0x01000193)
for (ch=*string++ ; ch ; ch=*string++) {
    result = (result * FNV_32_PRIME) ^ ch;
}
return result;
======

It seems to be a bit slower (not that the hash function is a bottleneck before
or after) but it has impeccable credentials as a hash function. (The better
ones are all much more complex.) Distribution results (i.e., the pattern of
bucket chain lengths reported by [array statistics]) are inconclusive.

[AMG]: The referenced site recommends the alternative FNV-1a which XORs before multiplying:

======c
    result = (result ^ ch) * FNV_32_PRIME;
======

[DKF]: One benefit of the new algorithm is that it makes it trickier to create hash
collisions. This code (thanks to [Joe English]) demonstrates:

======
proc collide {level {accum {}}} {
    global a
    if {$level} {
        incr level -1
        collide $level aj$accum
        collide $level ba$accum
    } else {
        set a($accum) 1
    }
}
puts [time {collide 15}]
puts [array statistics a]
======

For 8.5, it produces this:

======
35404625 microseconds per iteration
32768 entries in table, 16384 buckets
number of buckets with 0 entries: 16383
number of buckets with 1 entries: 0
number of buckets with 2 entries: 0
number of buckets with 3 entries: 0
number of buckets with 4 entries: 0
number of buckets with 5 entries: 0
number of buckets with 6 entries: 0
number of buckets with 7 entries: 0
number of buckets with 8 entries: 0
number of buckets with 9 entries: 0
number of buckets with 10 or more entries: 1
average search distance for entry: 16384.5
======

With the new function:

======
978683 microseconds per iteration
32768 entries in table, 16384 buckets
number of buckets with 0 entries: 8345
number of buckets with 1 entries: 601
number of buckets with 2 entries: 1144
number of buckets with 3 entries: 1657
number of buckets with 4 entries: 1616
number of buckets with 5 entries: 1271
number of buckets with 6 entries: 848
number of buckets with 7 entries: 487
number of buckets with 8 entries: 237
number of buckets with 9 entries: 111
number of buckets with 10 or more entries: 67
average search distance for entry: 3.0
======

----

[RT] 2010-02-10:  Could someone please explain the "buckets with 0 entries" item?

[DKF]: Sure. Tcl uses a hash scheme which assigns each key being hashed to a
“bucket”, which is an array field that points to a linked list of “hash
entries”. (Each entry holds the mapping from one key to one value.) When a
bucket has no entries attached to it, it's empty; no keys are actually mapped
to it. Sure, it's inefficient that there are empty buckets but it's not as
terrible as all that; insisting that there are none is going towards perfect
hashing, a different hashing scheme that typically requires deep knowledge of
the keys – both quantity and format – being hashed to generate the hash
function since it's vital with perfect hashing to have no collisions. Perfect
hashing is totally unsuitable for the Tcl library as we don't want to constrain
the keys at all.

The genius of hashing is that, providing the hash function generates a
uniformly random distribution of hashcodes for the input keys, you can get a
very efficient lookup scheme that avoids having to do a lot of comparisons or
manage complex data structures like balanced trees. We don't get perfect
distributions in practice of course (they require cryptographic hashing, which
is relatively expensive) but the results are still Good Enough™ in most cases.

[DKF]: Also, Tcl's hashing is actually very good for numeric-string keys.

[AMG]: Unless I misunderstand the actual hash function being used, "19" and "20" have the same hash value.  Many other collisions abound.  This doesn't impact correctness, only performance.  But consider also that "fixes" to this "problem" may have larger performance costs, so it's probably best to leave well enough alone.

----

[SBP]:  My apologies if this is the wrong place, but I didn't think it deserved its own page... I am using a `Tcl_HashTable` in C to manage a collection of binary blobs, and I wish to efficiently get the number of blobs stored within the hash table (as a C integer). The header instructs me to "never access any fields in this structure", so accessing the numEntries field seems off limits. What is the proper way to determine the size of the hash table from C? Parsing the 'stats' string seems wrong. Thank you for any insight you share!

[DKF]: It's a little-advertised field of the `Tcl_HashTable`, which is a public data structure. In fact, it's probably the ''only'' externally-useful field of that structure (unless you're keen on estimating loading factors and stuff like that, which you probably should ignore).

======c
Tcl_HashTable *hashTablePtr = ... /* get it from somewhere, OK? */

int size = hashTablePtr->numEntries;
======

Can't be much easier than that.



<<categories>> Concept | Data Structure