if {0} {
[Arjen Markus] (26 May 2004) From time to time I need to deal with large timeseries, for instance hourly data of the water level at a number of stations in a river system extending over four years - roughly 30000 values. To handle the files and the data they contain I write little ad hoc scripts:

   * Read the data into one or more lists
   * Extract the data for a much shorter period of time
   * Determine the minimum and maximum values
   * Make a quick and dirty plot
   * Write some data in a suitable format to another file
   * ...

And then I start wondering ... wouldn't it be nice to have a package that deals with many of these manipulations? Such a package should be able to deal with timeseries and also with spatial data. It should handle both simple and more complex computations - if possible in such a way that the user does not have to program the inevitable loops, but can instead deal with the collection of data as a whole. For instance:

   * I want to plot the data for station A and station B in the same plot. This could be done by code like:

      set plot [create-a-plot]
      $plot add data_a
      $plot add data_b

   * I want to determine daily averages of the temperature:

      set averages [average [group $temp 24]]

   * I want to compute the total amount of nitrogen (present in the chemical forms "Kjeldahl-N" and "nitrate" in the water body):

      set totn [sum $kjdn $nitrate]

   * A more ambitious form: all the data are stored in the same table containing ammonia, kjeldahl-N, nitrate, ortho-phosphate and total-P, some typical water quality parameters:

      set total_nutrients [construct $table totn {$kdn+$nitrate} totp {$totp}]

   * Give a statistical summary of the data in the table:

      describe $table

   * Select a certain range of data (retain every 10th record):

      set new_table [slice $table 0 1000 10]

Ambitious? Perhaps. The main problem is ''not'' the structure in which to keep the data; the main problem is the collection of commands!

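To make the daily-averages wish above a little more concrete, here is a minimal sketch of what [[group]] and [[average]] might look like on plain Tcl lists. The command names come from the example above, but the bodies are only an assumption for illustration and are not part of any existing package.

```tcl
# group (hypothetical): split a flat list into consecutive chunks of "size" values
proc group {values size} {
    if { $size <= 0 } {
        return -code error "Group size must be positive"
    }
    set groups {}
    set n [llength $values]
    for {set i 0} {$i < $n} {incr i $size} {
        # lrange clamps the end index, so the last chunk may be shorter
        lappend groups [lrange $values $i [expr {$i + $size - 1}]]
    }
    return $groups
}

# average (hypothetical): return the mean of each chunk
proc average {groups} {
    set means {}
    foreach g $groups {
        set sum 0.0
        foreach v $g {
            set sum [expr {$sum + $v}]
        }
        lappend means [expr {$sum / [llength $g]}]
    }
    return $means
}

# Six values grouped by three; hourly data would use a group size of 24
set temp {10 12 14 20 22 24}
puts [average [group $temp 3]]   ;# prints: 12.0 22.0
```

The user never writes a loop here: the looping lives inside the two commands, which is the style of interface the wish list is after.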
I can think of at least four different underlying structures:

   * As a list or a list of lists - simple and straightforward
   * As a matrix (from the Tcllib matrix module) - provides a flexible API for manipulating rows and columns
   * As binary arrays - compact, easy to use by a compiled extension ([vkit] actually uses this approach)
   * As tables in a database - Metakit could serve very well here

If we define a comprehensive set of primitive operations, it should be irrelevant to the user of such a package what structure lies underneath. But it is important to realise that the data sets that are created are volatile: they should not hang around in memory - they should behave as ordinary variables.

So, let us try a few such operations:

   * [[dataset {list of names}]] creates an empty table with columns that can be addressed by name
   * [[sum $data1 $data2]] returns a new dataset whose columns consist of the sums of the two datasets (the second may also be a plain list). The number of columns and rows must match
   * [[filter $data1 {condition}]] returns a new dataset containing only the rows that satisfy the condition
   * [[slice $data1 $start $stop ?step?]] returns a new dataset which has only the rows that are requested
   * [[contents $data]] returns summary information about the dataset (printable)
   * [[setnames data {list of names}]] sets the names of the columns of a dataset
   * [[getnames $data]] returns the names of the columns of a dataset
   * [[getrow $data $row]] returns a list with the values at the given row
   * [[addrow data $values]] adds a new row of data at the end
   * [[print $data]] prints the contents of the dataset to screen

Well, these are just a few operations of course.

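The requirement that datasets "behave as ordinary variables" can be illustrated with plain Tcl lists, which have value semantics. A small sketch, assuming a dataset is stored as a two-element list of column names and rows (the layout used by the code further down this page):

```tcl
# Assumed layout: a list holding the column names and a list of rows
set a [list {A B} {{1 2} {3 4}}]

set b $a                ;# a copy by value, there is no handle to destroy later
lset b 1 0 0 99         ;# change row 0, column 0 of the copy only

puts [lindex $a 1 0 0]  ;# prints 1, the original is untouched
puts [lindex $b 1 0 0]  ;# prints 99
```

A handle-based structure, such as an object command, would instead need an explicit destroy step, which is exactly what the volatility requirement tries to avoid.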
Now let us try them

----
}

 # Create a proper namespace for the data manipulations and
 # export the public commands
 #
 namespace eval ::dml {
     namespace export dataset sum filter slice contents setnames
     namespace export getnames getrow addrow print

     # Private namespace - for filter and others
     namespace eval v {}
 }

 # dataset --
 #     Create a new empty data set
 # Arguments:
 #     names      List of names
 # Result:
 #     A new dataset
 #
 proc ::dml::dataset {names} {
     return [list $names {}]
 }

 # contents --
 #     Return a readable description of the dataset
 # Arguments:
 #     dataset    The dataset in question
 # Result:
 #     String describing the dataset
 #
 proc ::dml::contents {dataset} {
     set string "Columns in the dataset: [join [getnames $dataset] ", "]\n\
 Number of rows: [llength [lindex $dataset 1]]"
     return $string
 }

 # getnames --
 #     Get the column names of an existing dataset
 # Arguments:
 #     dataset    The dataset to be examined
 # Result:
 #     List of column names
 #
 proc ::dml::getnames {dataset} {
     return [lindex $dataset 0]
 }

 # setnames --
 #     Set the column names of an existing dataset
 # Arguments:
 #     dataset    Name of the dataset (the variable is modified in place)
 #     newnames   The new names for the columns - number must match
 # Result:
 #     None
 #
 proc ::dml::setnames {dataset newnames} {
     upvar $dataset theset

     set names [lindex $theset 0]
     if { [llength $names] != [llength $newnames] } {
         return -code error "Number of names does not match the number of columns"
     }
     lset theset 0 $newnames
 }

 # addrow --
 #     Add a row of data to an existing dataset
 # Arguments:
 #     dataset    Name of the dataset
 #     values     The row to be added
 # Result:
 #     None
 # Note:
 #     The number of values must match the number of columns
 #
 proc ::dml::addrow {dataset values} {
     upvar $dataset theset

     set names [lindex $theset 0]
     if { [llength $names] != [llength $values] } {
         return -code error "Number of values does not match the number of columns"
     }
     set data [lindex $theset 1]
     lappend data $values
     lset theset 1 $data
 }

 # getrow --
 #     Get a row of data from an existing dataset
 # Arguments:
 #     dataset    The dataset to be examined
 #     row        The index of the row to be returned
 # Result:
 #     List of values in that row (empty if the row does not exist)
 #
 proc ::dml::getrow {dataset row} {
     return [lindex [lindex $dataset 1] $row]
 }

 # slice --
 #     Select the rows of a new dataset by stepping through an existing one
 # Arguments:
 #     dataset    The dataset to take the rows from
 #     start      The first row to be added
 #     stop       The last row to be added
 #     step       The step size for stepping through the rows (optional)
 # Result:
 #     New dataset
 #
 proc ::dml::slice {dataset start stop {step 1}} {
     set names [lindex $dataset 0]
     set data  [lindex $dataset 1]

     if { $step <= 0 } {
         return -code error "Step size must be positive"
     }

     set newset [dataset $names]
     set row    $start
     while { $row <= $stop } {
         addrow newset [getrow $dataset $row]
         incr row $step
     }
     return $newset
 }

 # sum --
 #     Sum two datasets column by column
 # Arguments:
 #     data1      The first dataset
 #     data2      The second dataset or a plain list
 # Result:
 #     New dataset
 #
 proc ::dml::sum {data1 data2} {
     #
     # Determine the number of columns first
     # Tricky, though
     #
     set names1  [lindex $data1 0]
     set values1 [lindex $data1 1]
     set norows1 [llength $values1]

     if { [llength $data2] != 2 } {
         set norows2 1
         set names2  $data2
         set values2 $data2
     } else {
         #
         # Note: this logic is _not_ complete!
         #       It fails for a proper dataset with 1 column and 1 row
         #
         if { [llength [lindex $data2 1]] == 1 } {
             set norows2 1
             set names2  $data2
             set values2 $data2
         } else {
             set names2  [lindex $data2 0]
             set values2 [lindex $data2 1]
             set norows2 [llength $values2]
         }
     }

     if { $norows1 != $norows2 && $norows2 != 1 } {
         return -code error "Numbers of rows do not match"
     }
     if { [llength $names1] != [llength $names2] } {
         return -code error "Numbers of columns do not match"
     }

     set newset [dataset $names1]
     if { $norows2 > 1 } {
         foreach row1 $values1 row2 $values2 {
             set newrow {}
             foreach c1 $row1 c2 $row2 {
                 lappend newrow [expr {$c1+$c2}]
             }
             addrow newset $newrow
         }
     } else {
         foreach row1 $values1 {
             set newrow {}
             foreach c1 $row1 c2 $values2 {
                 lappend newrow [expr {$c1+$c2}]
             }
             addrow newset $newrow
         }
     }
     return $newset
 }

 # filter --
 #     Filter the rows of a dataset
 # Arguments:
 #     dataset    The dataset to be filtered
 #     condition  The condition for keeping a row
 # Result:
 #     New dataset
 # Note:
 #     To avoid possible conflicts between local
 #     variables and the column names, we use a
 #     private namespace for the local variables
 #
 proc ::dml::filter {dataset condition} {
     set v::names  [lindex $dataset 0]
     set v::newset [dataset $v::names]
     set v::values [lindex $dataset 1]
     set v::cond   $condition

     foreach v::row $v::values {
         foreach $v::names $v::row {break}
         if $v::cond {
             addrow v::newset $v::row
         }
     }
     return $v::newset
 }

 # print --
 #     Print the contents of a dataset to stdout
 # Arguments:
 #     dataset    The dataset to be printed
 # Result:
 #     None
 # Side effect:
 #     Contents shown on screen
 #
 proc ::dml::print {dataset} {
     set row 0
     puts "Columns: [join [getnames $dataset] \t]"
     while {1} {
         set values [getrow $dataset $row]
         if { [llength $values] == 0 } {
             break
         }
         puts "$row: [join $values \t]"
         incr row
     }
 }

if {0} {
Let us test this code:
}

 namespace import ::dml::*

 set table [dataset {A B}]
 #
 # Hm, candidate for a new command?
 #
 for {set i 0} {$i < 10} {incr i} {
     addrow table [list [expr {$i+1}] [expr {2*$i}]]
 }

 puts "Contents: [contents $table]"
 puts "Columns: [getnames $table]"

 puts "Records with A > 2"
 set newtable [filter $table {$A>2}]
 print $newtable

 puts "Records with A > 2 and B < 10"
 set newtable [filter $table {$A>2 && $B<10}]
 print $newtable

 puts "Summed table:"
 print [sum $table $table]

 puts "Slice of a table:"
 print [slice $table 2 4]

----
if {0} {
A few remarks about the above code are appropriate:

   * There is far less error checking than needed in a truly useful package, especially if it is to be used in a semi-interactive way.
   * The above code ignores the possibility of missing values, most naturally represented as an empty string or list ("" or {}).
   * The command [[filter]] does not deal with parameters, that is, a condition like {$A>$threshold}, where "threshold" is a local variable of the caller. One naive way of dealing with such conditions would be to have the user use a global or namespace variable instead, like: {$A>$::threshold}.
   * With the current set of commands you cannot manipulate columns.
}
----
[[ [Category Essay] | [Category Numerical Analysis] ]]