if {0} {
Arjen Markus (26 may 2004) From time to time I need to deal with large timeseries, for instance hourly data of the water level at a number of stations in a river system extending over four years - roughly 30000 values. To handle the files and the data they contain I write little ad hoc scripts:
And then I start wondering ... wouldn't it be nice to have a package that will deal with many of these manipulations.
Such a package should be able to deal with timeseries and also with spatial data, it should be able to deal with simple and more complex computations - if possible in a way that the user does not have to program the inevitable loops, but rather can deal with the collection of data as a whole.
For instance:
set plot [create-a-plot] $plot add data_a $plot add data_b
set averages [average [group $temp 24]]
set totn [sum $kjdn $nitrate]
set total_nutrients [construct $table totn {$kdn+$nitrate} totp {$totp}]
describe $table
set new_table [slice $table 0 1000 10]
Ambitious? Perhaps.
The main problem is not the structure in which to keep the data, the main problem is the collection of commands! I can think of at least four different underlying structures:
If we define a comprehensive set of primitive operations, it should be irrelevant to the user of such a package what structure lies underneath.
But it is important to realise that the data sets that are created are volatile: they should not hang around in memory - they should behave as ordinary variables.
So, let us try a few such operations:
Well, these are just a few operations of course. Now let us try them
}
# Create a proper namespace for the data manipulations and # declare the public data # namespace eval ::dml { namespace export dataset sum filter slice contents setnames namespace export getnames getrow addrow print # Private namespace - for filter and others namespace eval v {} } # dataset -- # Create a new empty data set # Arguments: # names List of names # Result: # A new dataset # proc ::dml::dataset {names} { return [list $names {}] } # contents -- # Return a readable description of the dataset # Arguments: # dataset The dataset in question # Result: # String describing the dataset # proc ::dml::contents {dataset} { set string "Columns in the dataset: [join [getnames $dataset] ", "]\n\ Number of rows: [llength [lindex $dataset 1]]" return $string } # getnames -- # Get the column names of an existing dataset # Arguments: # dataset The dataset to be examined # Result: # List of column names # proc ::dml::getnames {dataset} { return [lindex $dataset 0] } # setnames -- # Set the column names of an existing dataset # Arguments: # dataset The dataset to be examined # newnames The new names for the columns - number must match # Result: # None # proc ::dml::setnames {dataset newnames} { upvar $dataset theset set names [lindex $theset 0] if { [[length $names] != [[length $newnames] } { return -code error "Number of names does not match the number of columns" } lset theset 0 $newnames } # addrow -- # Add a row of data to an existing dataset # Arguments: # dataset Name of the dataset # values The row to be added # Result: # None # Note: # The number of values must match the number of columns # proc ::dml::addrow {dataset values} { upvar $dataset theset set names [lindex $theset 0] if { [llength $names] != [llength $values] } { return -code error "Number of values does not match the number of columns" } set data [lindex $theset 1] lappend data $values lset theset 1 $data } # getrow -- # Get a row of data to an existing dataset # Arguments: # dataset The dataset to be examined # row The index of the row to be returned # Result: # None # Note: # The number of values must match the number of columns # proc ::dml::getrow {dataset row} { return [lindex [lindex $dataset 1] $row] } # slice -- # Select the rows of a new dataset by stepping through an existing # Arguments: # dataset Name of the dataset # start The first row to be added # stop The last row to be added # step The step size for stepping through the rows (optional) # Result: # New dataset # proc ::dml::slice {dataset start stop {step 1}} { set names [lindex $dataset 0] set data [lindex $dataset 1] if { $step <= 0 } { return -code error "Step size must be positive" } set newset [dataset $names] set row $start while { $row <= $stop } { addrow newset [getrow $dataset $row] incr row $step } return $newset } # sum -- # Sum two datasets column by column # Arguments: # data1 The first dataset # data2 The second dataset or a plain list # Result: # New dataset # proc ::dml::sum {data1 data2} { # # Determine the number of columns first # Tricky, though # set names1 [lindex $data1 0] set values1 [lindex $data1 1] set norows1 [llength $values1] if { [llength $data2] != 2 } { set norows2 1 set names2 $data2 set values2 $data2 } else { # # Note: this logic is _not_ complete! # It fails for a proper dataset with 1 column and 1 row # if { [llength [lindex $data2 1]] == 1 } { set norows2 1 set names2 $data2 set values2 $data2 } else { set names2 [lindex $data2 0] set values2 [lindex $data2 1] set norows2 [llength $values2] } } if { $norows1 != $norows2 && $norows2 != 1 } { return -code error "Numbers of rows do not match" } if { [llength $names1] != [llength $names2] } { return -code error "Numbers of columns do not match" } set newset [dataset $names1] if { $norows2 > 1 } { foreach row1 $values1 row2 $values2 { set newrow {} foreach c1 $row1 c2 $row2 { lappend newrow [expr {$c1+$c2}] } addrow newset $newrow } } else { foreach row1 $values1 { set newrow {} foreach c1 $row1 c2 $values2 { lappend newrow [expr {$c1+$c2}] } addrow newset $newrow } } return $newset } # filter -- # Filter the rows of a dataset # Arguments: # dataset The dataset to be filtered # condition The condition for keeping a row # Result: # New dataset # Note: # To avoid possible conflicts between local # variables and the column names, we use a # private namespace for the local variables # proc ::dml::filter {dataset condition} { set v::names [lindex $dataset 0] set v::newset [dataset $v::names] set v::values [lindex $dataset 1] set v::cond $condition foreach v::row $v::values { foreach $v::names $v::row {break} if $v::cond { addrow v::newset $v::row } } return $v::newset } # print -- # Print the contents of a dataset to stdout # Arguments: # dataset The dataset to be printed # Result: # None # Side effect: # Contents shown on screen # proc ::dml::print {dataset} { set row 0 puts "Columns: [join [getnames $dataset] \t]" while {1} { set values [getrow $dataset $row] if { [llength $values] == 0 } { break } puts "$row: [join $values \t]" incr row } } if {0} { Let us test this code: } namespace import ::dml::* set table [dataset {A B}] # # Hm, candidate for a new command? # for {set i 0} {$i < 10} {incr i} { addrow table [list [expr {$i+1}] [expr {2*$i}]] } puts "Contents: [contents $table]" puts "Columns: [getnames $table]" puts "Records with A > 2" set newtable [filter $table {$A>2}] print $newtable puts "Records with A > 2 and B < 10" set newtable [filter $table {$A>2 && $B<10}] print $newtable puts "Summed table:" print [sum $table $table] puts "Slice of a table:" print [slice $table 2 4]
if {0} {
A few remarks about the above code are appropriate:
}