Version 0 of Decoupling hashing from keys

Updated 2005-01-19 14:06:13 by jcw

This is a comment about managing data in Tcl. No code, just some thoughts for now... jcw

Tcl has arrays, and as of 8.5 dicts. Together, these let you manage tabular data:

There are several ways to do this:


* a list of lists (row-wise is the most common) * this is relatively compact (list overhead, no extra string rep if done carefully) * the main trade-off here is that positional access may not always be convenient


* a list of dicts * each row is a dict with field names as keys * this can be made to share key strings * overhead is the dict hash for each row


* an array of lists * row key is array key (with comma's or as list) * as in first approach, this uses positional column access


* an array of dicts * map each row to an array entry * use the row key (or keys, with commas or as list) as key in the array * put the rest of the fields as a dict, with field names * as with second approach, the overhead is one dict hash per row

Can we do better, yet stay in the Tcl realm?

I think we could. The refinement we would need is a change in the way dicts are hashed. Instead of hashing across a list of interleaved keys and values, we need a hash which takes a key, and returns its position in the list. Then, instead of having each row carry its internal hash table, we could have one hash for all rows. After all, there is not necessarily a need to use different row orders across a table.

The required change to dicts would be that they do not lose their hash table when an lset is done on a non-key (i.e. odd) item in the list (or is that already the case? I have not checked it).

The other way to work around this issue and reduce memory use, is to store tabular data as an array of lists, and to manually manage the field-name-to-column-position mapping. Then access to row x field y would be:

    puts [lindex $data($x) [dict get $fieldmap $y]]

For comparison, the normal array-of-dict approach is:

    puts [dict get $data(x) $y]

The point of this all is to minimize memory usage with large numbers of rows or fields. Rows cannot be "less" than a Tcl list, but this way we could at least avoid the overhead of (shared) field names and the dict hash on each row.

Using arrays as the main store (and dicts or lists inside) also plays nice with Tequila, I expect.