Adding a hashed datatype

This page is the result of a comment on the Better Arrays for Tcl9 page -- JCW


This demo illustrates an experimental "hash" command and datatype for Tcl, which uses the usual odd-even representation to store data, plus an internal integer "map" vector to represent a hashing layer placed on top.

The "hash" command has been renamed to "ihash" and is now part of CritLib [L1 ]


This sounds from an API point of view similar to TclX's keyed lists. The mapping of course improves performance.

Yes, but TclX's keyed lists are lists of 2-item lists, while the odd-even lists below are a single list... makes quite a difference in speed and mem-use --jcw


Differences between hash objects and arrays:

  • hash entries are not variables, this model does not support traces
  • order can be preserved, though inserts/deletes will take O(N) time
  • memory use is lower than arrays (2 ints per item, plus free slots)

Performance-wise, the output below seems to indicate the following:

  • access is similar (the 29 vs. 20 uSec is maybe just call overhead)
  • creation from an odd-even list is over 25x faster than for arrays
  • full traversal takes the same amount of time for both
  • full traversal of the odd-even list is obviously faster for hashes
  • set/unset performance has not been measured (not implemented in C)

The current code does not support set/unset yet. Setting would either add or replace, depending on whether the key already exists.

Correction: the most recent upload [L2 ] now DOES support set and unset (and exists), see new examples in [L3 ] -jcw

Script:


  puts "version = [package require jolt]"

  if {[info commands nop] != ""} {
    puts "10 nops = [time {nop;nop;nop;nop;nop;nop;nop;nop;nop;nop} 10000]"
  }

  proc hash_try {} {
    set a {1 one 2 two 3 three}
    puts "data = [hash a]"
    puts "length = [hash a length]"
    puts "keys = [hash a keys]"
    puts "values = [hash a values]"
    puts "map = [hash a map]"
    puts "get 1 = [hash a get 1]"
    puts "get 2 = [hash a get 2]"
    puts "get 3 = [hash a get 3]"

    foreach x {1 10 100 1000 10000 100000} {
      set v {}
      for {set y 0} {$y < $x} {incr y} {
        lappend v $y [expr {$y+$y}]
      }
      set n [expr {1000000/$x}]
      if {$n > 100} { set n 100 }
      set r [lindex [time {set a $v; hash a} ${n}0] 0]
      set s [lindex [time {array unset b; array set b $v} $n] 0]
      puts [format "create %6d = %5d hash, vs. %d array" $x $r $s]
    }

    foreach x {0 1000 abc} {
      catch {hash a get $x} e1
      catch {set b($x)} e2
      puts "get $x: $e1 <-> $e2"
    }

    set x 1
    puts "10x h = [time {
        hash a get $x; hash a get $x; hash a get $x; hash a get $x;
        hash a get $x; hash a get $x; hash a get $x; hash a get $x;
                                      hash a get $x; hash a get $x} 10000]"
    puts "10x a = [time {
        set b($x); set b($x); set b($x); set b($x);
        set b($x); set b($x); set b($x); set b($x);
                              set b($x); set b($x)} 10000]"

    puts "h all = [time {foreach x [hash a keys] { hash a get $x }}]"
    puts "a all = [time {foreach x [array names b] { set b($x) }}]"
    puts "same = [expr {[lsort [hash a]] == [lsort [array get b]]}]"
    puts "h list = [time {foreach {x y} [hash a] { }}]"
    puts "a list = [time {foreach {x y} [array get b] { }}]"
  }

Output (SuSE 7.1 Linux, PIII/650):


  version = 2001.11.06.170201
  10 nops = 7 microseconds per iteration

    HASH
    ====

  data = 1 one 2 two 3 three
  length = 3
  keys = 1 2 3
  values = one two three
  map = 0 -1 0 -1 0 -1 0 -1 2105051955 2 -2061914958 4 0 -1 1977051568 0
  get 1 = one
  get 2 = two
  get 3 = three
  create      1 =     5 hash, vs. 12 array
  create     10 =     5 hash, vs. 31 array
  create    100 =     6 hash, vs. 323 array
  create   1000 =    30 hash, vs. 2310 array
  create  10000 =  1370 hash, vs. 37315 array
  create 100000 = 15341 hash, vs. 415132 array
  get 0: 0 <-> 0
  get 1000: 2000 <-> 2000
  get abc: no such key <-> can't read "b(abc)": no such element in array
  10x h = 29 microseconds per iteration
  10x a = 20 microseconds per iteration
  h all = 557448 microseconds per iteration
  a all = 556954 microseconds per iteration
  same = 1
  h list = 216386 microseconds per iteration
  a list = 542398 microseconds per iteration                                      

This is pretty much the same as the new dictionary type that's in 8.5, non? Why rewrite it?

It's not a "rewrite" - it has been around since Nov 2001... -jcw

FW: Good point. ;) Come to think of it, frederic bonnet's interface (the one which I assume is the one that's going into the Tcl core) is pretty similar to yours. Is it (or at least the interface) based on your code?

I don't think so (ihash uses a Python-derived hash algorithm, not Tcl's), but that's not so important - just wanted to mention ihash as idea for forward-compatible dict. I really don't care to lobby for ihash, having it buried in wiki-dust is fine too! -jcw