Thoughts about a data manipulation language

if {0} {

Arjen Markus (26 may 2004) From time to time I need to deal with large timeseries, for instance hourly data of the water level at a number of stations in a river system extending over four years - roughly 30000 values. To handle the files and the data they contain I write little ad hoc scripts:

Read the data into one or more lists
Extract the data for a much shorter period of time
Determine the minimum and maximum values
Make a quick and dirty plot
Write some data in a suitable format to another file
...

And then I start wondering ... wouldn't it be nice to have a package that will deal with many of these manipulations.

Such a package should be able to deal with timeseries and also with spatial data, it should be able to deal with simple and more complex computations - if possible in a way that the user does not have to program the inevitable loops, but rather can deal with the collection of data as a whole.

For instance:

I want to plot the data for station A and station B in the same plot. This could be done by code like:

    set plot [create-a-plot]
    $plot add data_a
    $plot add data_b

I want to determine daily averages of the temperature:

    set averages [average [group $temp 24]]

I want to compute the total amount of nitrogen (present in the chemical forms "Kjeldahl-N" and "nitrate" in the water body):

    set totn [sum $kjdn $nitrate]

A more ambitious form: all the data are stored in the same table containing ammonia, kjeldahl-N, nitrate, ortho-phosphate and total-P, some typical water quality parameters:

    set total_nutrients [construct $table totn {$kdn+$nitrate} totp {$totp}]

Give a statistical summary of the data in the table:

    describe $table

Select a certain range of data (retain every 10th record):

    set new_table [slice $table 0 1000 10]

Ambitious? Perhaps.

The main problem is not the structure in which to keep the data, the main problem is the collection of commands! I can think of at least four different underlying structures:

As a list or a list of lists - simple and straightforward
As a matrix (from the Tcllib matrix module) - provides a flexible API for manipulating rows and columns
As binary arrays - compact, easy to use by a compiled extension (vkit actually uses this approach)
As tables in a database - Metakit could serve here very very well

If we define a comprehensive set of primitive operations, it should be irrelevant to the user of such a package what structure lies underneath.

But it is important to realise that the data sets that are created are volatile: they should not hang around in memory - they should behave as ordinary variables.

So, let us try a few such operations:

[dataset {list of names}] creates an empty table with columns that can be addressed by name
[sum $data1 $data2] returns a new dataset whose columns consist of the sums of the two datasets (the second may also be a plain list). The number of columns and rows must match
[filter $data1 {condition}] returns a new dataset of which the rows have been filtered through the condition
[slice $data1 $start $stop ?step?] returns a new dataset which has only the rows that are requested
[contents $data] returns summary information about the dataset (printable)
[setnames data {list of names}] set the names of the columns of a dataset
[getnames $data] returns the names of the columns of a dataset
[getrow $data $row] returns a list with the values at the given row
[addrow data $values] adds a new row of data at the end
[print data] prints the contents of the dataset to screen

Well, these are just a few operations of course. Now let us try them

}

 # Create a proper namespace for the data manipulations and
 # declare the public data
 #
 namespace eval ::dml {
    namespace export dataset sum filter slice contents setnames
    namespace export getnames getrow addrow print

    # Private namespace - for filter and others
    namespace eval v {}
 }

 # dataset --
 #    Create a new empty data set
 # Arguments:
 #    names        List of names
 # Result:
 #    A new dataset
 #
 proc ::dml::dataset {names} {
    return [list $names {}]
 }

 # contents --
 #    Return a readable description of the dataset
 # Arguments:
 #    dataset      The dataset in question
 # Result:
 #    String describing the dataset
 #
 proc ::dml::contents {dataset} {
    set string "Columns in the dataset: [join [getnames $dataset] ", "]\n\
 Number of rows: [llength [lindex $dataset 1]]"
    return $string
 }

 # getnames --
 #    Get the column names of an existing dataset
 # Arguments:
 #    dataset      The dataset to be examined
 # Result:
 #    List of column names
 #
 proc ::dml::getnames {dataset} {
    return [lindex $dataset 0]
 }

 # setnames --
 #    Set the column names of an existing dataset
 # Arguments:
 #    dataset      The dataset to be examined
 #    newnames     The new names for the columns - number must match
 # Result:
 #    None
 #
 proc ::dml::setnames {dataset newnames} {
    upvar $dataset theset
    set names [lindex $theset 0]
    if { [[length $names] != [[length $newnames] } {
       return -code error "Number of names does not match the number of columns"
    }
    lset theset 0 $newnames
 }

 # addrow --
 #    Add a row of data to an existing dataset
 # Arguments:
 #    dataset      Name of the dataset
 #    values       The row to be added
 # Result:
 #    None
 # Note:
 #    The number of values must match the number of columns
 #
 proc ::dml::addrow {dataset values} {
    upvar $dataset theset
    set names [lindex $theset 0]
    if { [llength $names] != [llength $values] } {
       return -code error "Number of values does not match the number of columns"
    }
    set data [lindex $theset 1]
    lappend data $values
    lset theset 1 $data
 }

 # getrow --
 #    Get a row of data to an existing dataset
 # Arguments:
 #    dataset      The dataset to be examined
 #    row          The index of the row to be returned
 # Result:
 #    None
 # Note:
 #    The number of values must match the number of columns
 #
 proc ::dml::getrow {dataset row} {
    return [lindex [lindex $dataset 1] $row]
 }

 # slice --
 #    Select the rows of a new dataset by stepping through an existing
 # Arguments:
 #    dataset      Name of the dataset
 #    start        The first row to be added
 #    stop         The last row to be added
 #    step         The step size for stepping through the rows (optional)
 # Result:
 #    New dataset
 #
 proc ::dml::slice {dataset start stop {step 1}} {
    set names [lindex $dataset 0]
    set data  [lindex $dataset 1]

    if { $step <= 0 } {
       return -code error "Step size must be positive"
    }

    set newset [dataset $names]
    set row    $start
    while { $row <= $stop } {
       addrow newset [getrow $dataset $row]
       incr row $step
    }

    return $newset
 }

 # sum --
 #    Sum two datasets column by column
 # Arguments:
 #    data1        The first dataset
 #    data2        The second dataset or a plain list
 # Result:
 #    New dataset
 #
 proc ::dml::sum {data1 data2} {
    #
    # Determine the number of columns first
    # Tricky, though
    #
    set names1  [lindex $data1 0]
    set values1 [lindex $data1 1]
    set norows1 [llength $values1]

    if { [llength $data2] != 2 } {
       set norows2 1
       set names2  $data2
       set values2 $data2
    } else {
       #
       # Note: this logic is _not_ complete!
       # It fails for a proper dataset with 1 column and 1 row
       #
       if { [llength [lindex $data2 1]] == 1 } {
          set norows2 1
          set names2  $data2
          set values2 $data2
       } else {
          set names2  [lindex $data2 0]
          set values2 [lindex $data2 1]
          set norows2 [llength $values2]
       }
    }

    if { $norows1 != $norows2 && $norows2 != 1 } {
       return -code error "Numbers of rows do not match"
    }
    if { [llength $names1] != [llength $names2] } {
       return -code error "Numbers of columns do not match"
    }

    set newset [dataset $names1]
    if { $norows2 > 1 } {
       foreach row1 $values1 row2 $values2 {
          set newrow {}
          foreach c1 $row1 c2 $row2 {
             lappend newrow [expr {$c1+$c2}]
          }
          addrow newset $newrow
       }
    } else {
       foreach row1 $values1 {
          set newrow {}
          foreach c1 $row1 c2 $values2 {
             lappend newrow [expr {$c1+$c2}]
          }
          addrow newset $newrow
       }
    }

    return $newset
 }

 # filter --
 #    Filter the rows of a dataset
 # Arguments:
 #    dataset      The dataset to be filtered
 #    condition    The condition for keeping a row
 # Result:
 #    New dataset
 # Note:
 #    To avoid possible conflicts between local
 #    variables and the column names, we use a
 #    private namespace for the local variables
 #
 proc ::dml::filter {dataset condition} {

    set v::names   [lindex $dataset 0]
    set v::newset  [dataset $v::names]
    set v::values  [lindex $dataset 1]
    set v::cond    $condition

    foreach v::row $v::values {
       foreach $v::names $v::row {break}

       if $v::cond {
          addrow v::newset $v::row
       }
    }

    return $v::newset
 }

 # print --
 #    Print the contents of a dataset to stdout
 # Arguments:
 #    dataset      The dataset to be printed
 # Result:
 #    None
 # Side effect:
 #    Contents shown on screen
 #
 proc ::dml::print {dataset} {
    set row  0
    puts "Columns: [join [getnames $dataset] \t]"
    while {1} {
       set values [getrow $dataset $row]
       if { [llength $values] == 0 } {
          break
       }
       puts "$row: [join $values \t]"
       incr row
    }
 }

 if {0} {
 Let us test this code:
 }

 namespace import ::dml::*

 set table [dataset {A B}]

 #
 # Hm, candidate for a new command?
 #
 for {set i 0} {$i < 10} {incr i} {
    addrow table [list [expr {$i+1}] [expr {2*$i}]]
 }
 puts "Contents: [contents $table]"
 puts "Columns: [getnames $table]"

 puts "Records with A > 2"
 set newtable [filter $table {$A>2}]
 print $newtable

 puts "Records with A > 2 and B < 10"
 set newtable [filter $table {$A>2 && $B<10}]
 print $newtable

 puts "Summed table:"
 print [sum $table $table]

 puts "Slice of a table:"
 print [slice $table 2 4]

if {0} {

A few remarks about the above code are appropriate:

There is far less error checking than needed in a truly useful package, especially if it is to be used in a semi-interactive way.
The above code ignores the possibility of missing values, most naturally represented as an empty stirng or list ("" or {}).
The command [filter] does not deal with parameters (that is, a condition like {$A>$threshold}, where "threshold" is a local variable. One naive way of dealing with such conditions would be to have the user use a global or namespace variable instead, like: {$A>$::threshold}.
With the current set of commands you can not manipulate columns.

}

[ Category Essay | Category Numerical Analysis ]