Compact data storage

Purpose: To collect and discuss programming techniques for avoiding unnecessary allocation of memory. Example: Decide whether a value is a string or a list, and treat it consistently as that. If you treat it as both, Tcl will keep both a string representation and a list representation, and your data is stored twice.
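A minimal sketch of the effect (the loop and variable names are only for illustration): building a value with list commands gives it just a list representation, but asking a string question about the same value forces Tcl to also generate and keep a string representation, roughly doubling the memory used for that value.

 # Built with lappend, $L has only a list representation.
 set L [list]
 for {set i 0} {$i < 1000} {incr i} {lappend L $i}
 llength $L          ;# uses the list representation

 # Treating the same value as a string generates a string
 # representation as well; both are now kept in memory.
 string length $L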


Some hard results on how many bytes Tcl uses for various kinds of things can be found on Memory costs with Tcl.

Some scripts for measuring how much memory is allocated by a sequence of commands can be found on Measuring memory usage; in particular, the growth and Delta procedures used below are defined on that page.


100000 random digits

is the name of a classical table of random numbers -- depending on one's point of view, it is either one of the most boring books ever published, or one of the least boring books ever published (since you never can tell what is going to happen next) -- but it also makes a nice example. Suppose you want to make a list of 100000 (pseudo-)random digits. The obvious way of doing this is

 set L [list]
 for {set n 0} {$n<100000} {incr n} {
    lappend L [expr {int(floor(rand()*10))}]
 }

and the test

 Delta [growth {
    set L [list]
    for {set n 0} {$n<100000} {incr n} {
       lappend L [expr {int(floor(rand()*10))}]
    }
 }]

will tell you that this causes Tcl to allocate 3207168 (or thereabouts) bytes. However, since equal values appear over and over again in this list, it is more memory-efficient to make sure that they really are the same Tcl objects, so that each distinct digit is stored only once. One way of doing this is

 set L [list]
 set digits [list]
 for {set n 0} {$n<10} {incr n} {lappend digits $n}
 for {set n 0} {$n<100000} {incr n} {
    lappend L [lindex $digits [expr {int(floor(rand()*10))}]]
 }

The test

 Delta [growth {
    set L [list]
    set digits [list]
    for {set n 0} {$n<10} {incr n} {lappend digits $n}
    for {set n 0} {$n<100000} {incr n} {
       lappend L [lindex $digits [expr {int(floor(rand()*10))}]]
    }
 }]

returns (for me) a mere 798720 bytes. The number of significant digits in these measurements is probably rather low, but the difference in magnitude is quite significant. -- Lars Hellström
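The same idea generalizes beyond digits: keep one canonical copy of each distinct value and always append that shared copy. A hedged sketch of such an "interning" helper (the proc name intern and the global array internTable are my own choices, not from this page):

 # Return a shared copy of $value; equal values passed in later
 # all end up referencing the one object stored in the array.
 proc intern {value} {
     global internTable
     if {![info exists internTable($value)]} {
         set internTable($value) $value
     }
     return $internTable($value)
 }

 set L [list]
 for {set n 0} {$n<100000} {incr n} {
     lappend L [intern [expr {int(floor(rand()*10))}]]
 }

This trades an array lookup per element for the same object sharing that the digits list above gets from lindex.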


11nov02 jcw - You could use Metakit - it's a database, but it also works with in-memory data. Here's an example that shows how it adaptively stores ints using 1..32 bits per value:

 package require Mk4tcl

 foreach x {2 4 10 100 200 30000 40000} {
   # Open an in-memory datafile and define a view "v"
   # with a single integer property "i".
   mk::file open db
   mk::view layout db.v i:I

   puts "100,000 ints in range \[0..$x) take:"

   for {set i 0} {$i < 100000} {incr i} {
     mk::row append db.v i [expr {int(rand()*$x)}]
   }

   # Serialize the datafile to a temporary file just to measure its size.
   set fd [open tempfile w]
   mk::file save db $fd
   close $fd

   puts "   [file size tempfile] bytes"

   file delete tempfile
   mk::file close db
 }

The output is:

 100,000 ints in range [0..2) take:
   12543 bytes
 100,000 ints in range [0..4) take:
   25045 bytes
 100,000 ints in range [0..10) take:
   50045 bytes
 100,000 ints in range [0..100) take:
   100045 bytes
 100,000 ints in range [0..200) take:
   200045 bytes
 100,000 ints in range [0..30000) take:
   200045 bytes
 100,000 ints in range [0..40000) take:
   400045 bytes

FWIW, writing the data to a file is done just to find out how much space it consumes. In-memory consumption will add a few KB (plus a per-row overhead of 0.1% ...).
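The in-memory figure can also be checked directly with the same growth and Delta helpers used earlier on this page; a hedged sketch, reusing only commands already shown above:

 Delta [growth {
     mk::file open db
     mk::view layout db.v i:I
     for {set i 0} {$i < 100000} {incr i} {
         mk::row append db.v i [expr {int(rand()*200)}]
     }
 }]
 mk::file close db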

Lars H: In this case, that is an absolutely essential piece of information. Being compact on file is, if not easy, then at least nothing that is harder in Tcl than in other languages; being compact in memory is what matters here. The need for a standard extension that provides near-optimal storage of information in memory (1 bit of useful information costing 1+epsilon bits of memory) is something I have felt a couple of times during the last year. From the figures above, it looks like Metakit could fill this need. Excellent! (Not that this means we shouldn't care about being efficient when storing ordinary Tcl objects too.)