Decoupling hashing from keys

This is a comment about managing data in Tcl. No code, just some thoughts for now... jcw

Tcl has arrays and, as of 8.5, dicts. Together, these let you manage tabular data. There are several ways to do this:

1) A list of lists (row-wise is the most common)

  • this is relatively compact (list overhead, no extra string rep if done carefully)
  • the main trade-off here is that positional access may not always be convenient
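A minimal sketch of the row-wise layout (the table contents here are invented for illustration):

```tcl
# Each row is a list; columns are addressed by position.
set people {
    {alice 30 london}
    {bob   25 paris}
}
# Column 1 holds the age; read it for row 0:
puts [lindex $people 0 1]   ;# -> 30
# Update row 1, column 2 (the city) in place:
lset people 1 2 berlin
```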

2) A list of dicts

  • each row is a dict with field names as keys
  • this can be made to share key strings
  • overhead is the dict hash for each row
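A sketch of this layout (field names and values invented); building every row from the same loop lets the key strings be shared:

```tcl
# Each row is a dict keyed by field name.
set rows {}
foreach {name age} {alice 30 bob 25} {
    lappend rows [dict create name $name age $age]
}
# Named access, per row:
puts [dict get [lindex $rows 1] age]   ;# -> 25
```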

3) An array of lists

  • row key is the array key (with commas, or as a list)
  • as with the first approach, this uses positional column access
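For example (the row keys and field values are invented):

```tcl
# The row key is the array key; fields are positional list elements.
set data(1,alice) {30 london}
set data(2,bob)   {25 paris}
# Column 1 holds the city:
puts [lindex $data(2,bob) 1]   ;# -> paris
```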

4) An array of dicts

  • map each row to an array entry
  • use the row key (or keys, with commas or as a list) as the key in the array
  • put the rest of the fields in a dict, keyed by field name
  • as with the second approach, the overhead is one dict hash per row
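For example (row keys and fields invented):

```tcl
# The row key is the array key; the remaining fields live in a dict.
set data(alice) [dict create age 30 city london]
set data(bob)   [dict create age 25 city paris]
# Named access; each row carries its own dict hash table:
puts [dict get $data(bob) city]   ;# -> paris
```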

Can we do better, yet stay in the Tcl realm?

I think we could. The refinement we would need is a change in the way dicts are hashed. Instead of hashing across a list of interleaved keys and values, we need a hash which takes a key, and returns its position in the list. Then, instead of having each row carry its internal hash table, we could have one hash for all rows. After all, there is not necessarily a need to use different row orders across a table.

The required change to dicts would be that they do not lose their hash table when an lset is done on a non-key (i.e. odd-indexed) item in the list (or is that already the case? I have not checked). Or, if lset is not used: when an unshared copy is made of a dict, the copy should continue to share the hash table as long as the set of keys is not changed (again, I have not examined dict's logic).

The other way to work around this issue and reduce memory use is to store tabular data as an array of lists, and to manually manage the field-name-to-column-position mapping. Access to row x, field y, would then be:

    puts [lindex $data(x) [dict get $fieldmap y]]

For comparison, the normal array-of-dict approach is:

    puts [dict get $data(x) y]
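Fleshed out with concrete (invented) names, the field-map variant might look like:

```tcl
# One shared field-name-to-position map for the whole table:
set fieldmap [dict create name 0 age 1 city 2]
# Rows are plain lists, so there is no per-row hash table:
set data(r1) {alice 30 london}
set data(r2) {bob   25 paris}
# Access row r2, field city:
puts [lindex $data(r2) [dict get $fieldmap city]]   ;# -> paris
```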

The point of this all is to minimize memory usage with large numbers of rows or fields. Rows cannot be "less" than a Tcl list, but this way we could at least avoid the overhead of (shared) field names and the dict hash on each row.

Using arrays as the main store (and dicts or lists inside) also plays nice with Tequila, I expect.


NEM Interesting stuff. I was thinking about a similar thing just yesterday (looking at R and some of your Vlerq stuff). If I understand you correctly, the key idea here is to be able to set an element in a multiply-nested data structure without requiring that each level of data structure be of the same type? So, for instance you can have a dict of lists of dicts of... and some operation can query/set to arbitrary depth without causing shimmering? That would be useful, but generally goes against Tcl's (non-)typing. You could maybe do something with TOOT. Need to think more about this.

As an aside, is the following situation covered in your cases above:

  • array of lists: the key of the array is the column name, and the lists are the columns. Either you require the lists to correspond positionally (which needs some explicit representation of NULL), or you form rows by (expensive) joins, with each list item being {key value}. The latter option essentially reduces the table to a series of binary relations attr(key,value). I think this is some normal form (6NF?).

Or perhaps I need some coffee to think about this more clearly.
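The second option above, each column as a list of {key value} pairs, might be sketched like this (names invented; the join is a naive linear scan):

```tcl
# Each column is an independent binary relation attr(key, value).
set col(age)  {{alice 30} {bob 25}}
set col(city) {{alice london}}   ;# bob's city is simply absent: no NULL needed
# Reassemble one field of one row by scanning its column:
foreach pair $col(city) {
    lassign $pair key value
    if {$key eq "alice"} { puts $value }   ;# -> london
}
```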


DKF: Learning from the experience of metakit, I'd suggest:

5) A dict of lists

  • Represent each column as a list
  • Dictionary allows using named columns
  • Can do efficient update of a value like this:
     dict update data $columnName col {
        dict set data $columnName {}   ;# Column refcount should now be 1 (i.e. efficient!)
        lset col $row $newValue
     }
     set col {}                        ;# Column refcount should now be 1 again
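A self-contained run of the trick (column names and values invented). Clearing the dict's own reference inside the body leaves col unshared, so lset can modify the list in place:

```tcl
# Columns stored as lists inside a dict, keyed by column name.
set data [dict create name {alice bob} age {30 25}]
set columnName age
set row 1
set newValue 26
dict update data $columnName col {
    dict set data $columnName {}   ;# drop the dict's reference: col is now unshared
    lset col $row $newValue        ;# in-place update, no copy of the column
}
set col {}                         ;# release our own reference again
puts [lindex [dict get $data age] 1]   ;# -> 26
```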

NEM: How do you handle missing data though?