Data analysis with Tcl

Data analysis with Tcl/Tk is a tutorial originally written for a website Torsten Berg was designing.

Description

This is an introduction to Tcl/Tk with an emphasis on data analysis. Familiarity with the basics of Tcl and programming in general is assumed: Variables, loops, procedures, conditional operations, and other programming constructs should be familiar already.

This tutorial is mostly concerned with numerical data, and the examples focus on gathering statistical information and plotting graphs from these data. There is no attempt to show the construction of a fully-featured data analysis package. That is certainly doable in Tcl, and there are already many options.

Tcl is a good choice for performing data analysis because it makes it easy to whip up a few dedicated computational procedures suited to a particular need.

A first example: dealing with text

The first task is to find a way to classify the contents of a given file, e.g. as a short story, a legal document, or a scientific article. One approach could be to take an inventory of the words in a text and then, based on word frequency, to use some scheme to classify the text. This tutorial explains how to take the inventory, leaving the classification scheme as an exercise for the reader.

The task at hand is almost trivial:

Read the file.
Split its contents into words, i.e. individual sequences of characters.
Count the occurrences of each word.
Print each word with its count in descending order, the most frequently used word first.

First, a suitable data structure is needed for the data. The relevant data structures built into Tcl are:

A list is similar to an array in Fortran, Perl, C, or Java: It is an ordered sequence of values. Each value in a list can be retrieved by the index of that value. List is the most ubiquitous data structure in Tcl, and in fact, each list has a syntax that makes it suitable for interpretation by Tcl as a command.

Here are examples of a few list commands that will be used later:

# Construct a list and assign it to $list
set list {x y z}

# Get the second value from $list
set second [lindex $list 1]

# Append a new element to the end of the list
lappend list w

An associative array on the other hand is an unordered collection of variables, similar to a hash in Perl. A dictionary is even more similar since keys in a dictionary are just keys, not variables. A variable in an array can be retrieved using its name. For example:

set count(occurrence) 1      ;# Set element "occurrence" in the
                             ;# array "count" to 1

parray count                 ;# Quick and dirty print of all array
                             ;# elements and their values

array names count            ;# Get a list of all array elements
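
Since dictionaries were mentioned as an alternative, here is a minimal sketch of the same bookkeeping with a dict, assuming Tcl 8.5 or later. The variable name wordCount is chosen for illustration only and does not appear elsewhere in this tutorial.

```tcl
# Hypothetical variable name; any name will do
set wordCount [dict create]

# dict incr treats a missing key as 0 before incrementing it
dict incr wordCount occurrence
dict incr wordCount occurrence
dict incr wordCount word

# Look up a single value
puts [dict get $wordCount occurrence]

# Iterate over all keys and their values
dict for {key value} $wordCount {
    puts "$key: $value"
}
```

Unlike an array, a dict is an ordinary value, so it can be passed to and returned from procedures directly.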

For the above counting problem associative arrays are suitable. The general idea is to do something like:

readfile contents input.txt
foreach word $contents {
    incr count($word)
}
parray count

This is of course only very rough code: it is necessary to first read the data and then split the contents into words. The first step is a procedure that reads the entire file and stores its contents in a variable whose name is passed in, here contents.

Refining the above code:

proc readfile {contentsvar filename} {
    upvar $contentsvar contents
    set infile [open $filename]
    try {
        set contents [read $infile]
    } finally {
        close $infile
    }
    return
}

Before splitting the content into words, remove the punctuation. To keep things simple, assume that punctuation in the content is limited to period, apostrophe, quotation mark, question mark, exclamation point, comma, semicolon and colon. Here, string map replaces each occurrence of the given punctuation characters with a space character, which is convenient later when the content is split into words:

set contents [string map {
    . { }
    ; { }
    : { }
    ' { }
    ! { }
    ? { }
    , { }
    \" { }
} $contents[set contents {}]]

This formatting of the command is only meant to make the structure of the list of replacements clearer. There is no special meaning to doing it this way.

The purpose of set contents {} is to make the value unshared so that Tcl doesn't have to make a copy of it before modifying it.

Since e.g. "The" and "the" represent the same word, all letters should be standardized to either lower case or upper case:

set contents [string tolower $contents]

And now for the grand finale: regsub reduces each run of whitespace characters to a single space character, split splits the content into words, and then foreach iterates through the words to count them:

regsub -all {\s+} $contents[set contents {}] { } contents
foreach word [split $contents { }] {
    if {[info exists count($word)]} {
        incr count($word)
    } else {
        set count($word) 1
    }
}

info exists indicates whether the variable already exists in the array so that it can be properly initialised.

As of Tcl 8.5, incr no longer returns an error when a variable does not yet exist. Instead, it initializes the variable to 0 and then increments it. The construct above is still valid of course, and could be useful if something other than initialisation to zero is desired.
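
With that behaviour of incr, the counting loop can be shortened. The following is a small sketch, assuming Tcl 8.5 or later; the sample text and the array name sampleCount are made up for this illustration.

```tcl
# Assumes Tcl 8.5 or later, where incr creates a missing variable
set sample {the cat and the dog and the bird}

foreach word [split $sample { }] {
    # No [info exists] test needed: a missing element starts at 0
    incr sampleCount($word)
}

parray sampleCount
```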

The last part of the task is to print the list of words in the correct order. That is a bit tricky: it would be easy to order the words alphabetically, but to order the words by frequency of occurrence the data should be "inverted". lsort can do this if the frequency data is put in a list, actually a list of lists:

set frequencies {}
foreach word [array names count] {
    lappend frequencies [list $word $count($word)]
}

#
# The -index option is used to make [lsort] look inside the list
# elements
# The -integer option is used because by default [lsort] regards
# list elements as strings
# The -decreasing option puts the most frequent word first
#
set sorted [lsort -index 1 -integer -decreasing $frequencies]

# Print them to the screen

foreach frequency $sorted {
    puts $frequency
}

Since different words may occur with the same frequency in a text, it might be nicer to sort words having the same frequency in alphabetical order. Since lsort performs a stable merge sort, this is easy:

set sorted [lsort -index 1 -integer -decreasing [
    lsort -index 0 $frequencies]]

The first lsort (the inner one) sorts on the word, which is item 0 in each sublist, and the second sorts on the frequency, which is item 1 in each sublist. Since the intermediate result is not needed elsewhere, both operations are combined into one command. The merge sort algorithm makes sure that any previous order among elements that have the same "sort value" is preserved.
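
To see the effect of the two passes separately, here is a small sketch on a made-up list of word/count pairs. The names sample, byWord and byCount are illustrative only, and -decreasing is used so that the most frequent word comes first.

```tcl
# Made-up word/count pairs, deliberately out of order
set sample {{the 3} {a 4} {tcl 3} {be 4} {for 5}}

# First pass: alphabetical order on the word (item 0)
set byWord [lsort -index 0 $sample]

# Second pass: numerical order on the count (item 1); words with the
# same count keep the alphabetical order established by the first pass
set byCount [lsort -index 1 -integer -decreasing $byWord]

puts $byWord
puts $byCount
```

In the second line of output the words with equal counts (a and be, tcl and the) remain in alphabetical order.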

The complete code:

#
# Read the entire file
#
set infile [open input.txt]
set contents [read $infile]
close $infile

# Polish the contents (could be combined)
set contents [string map {
    . { }
    ; { }
    : { }
    ' { }
    ! { }
    ? { }
    , { }
    \" { }
} $contents]
set contents [string tolower $contents]

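#
# Count the words; extract them with regexp rather than treating
# arbitrary text as a Tcl list, which fails on unbalanced braces
#
foreach word [regexp -all -inline {\S+} $contents] {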
    if {[info exists count($word)]} {
        incr count($word)
    } else {
        set count($word) 1
    }
}

#
# Convert the data into a form easily sorted
#
set frequencies {}
foreach word [array names count] {
    lappend frequencies [list $word $count($word)]
}

set sorted [lsort -index 1 -integer -decreasing [
    lsort -index 0 $frequencies]]

#
# Print them to the screen
#
foreach frequency $sorted {
    puts $frequency
}

The result, when used on the first three paragraphs of this text, is:

for 5
is 5
a 4
be 4
data 4
that 4
with 4
are 3
it 3
not 3
of 3
tcl 3
the 3
to 3
we 3
will 3
an 2
analysis 2
and 2
because 2
familiar 2
in 2
so 2
this 2
you 2
- 1
actually 1
all 1
assumed 1
at 1
attempt 1
basics 1
build 1
but 1
choice 1
computational 1
considerable 1
could 1
dealing 1
dedicated 1
do 1
doing 1
easy 1
emphasis 1
examples 1
few 1
fully-featured 1
functions 1
gathering 1
general 1
good 1
graphs 1
here 1
if-statements 1
indeed 1
information 1
introduction 1
kind 1
least 1
less 1
loops 1
many 1
more 1
mostly 1
needs 1
numerical 1
on 1
or 1
package 1
particular 1
plotting 1
possibilities 1
procedures 1
programming 1
should 1
shown 1
statistical 1
suitable 1
task 1
tcl/tk 1
there 1
these 1
they 1
up 1
variables 1
very 1
whip 1
why 1
work 1
would 1
your 1

We forgot one punctuation character: the dash, so this shows up as a word as well, and of course we need to take care of hyphens and numerous other details. But that is not the point: we managed to do this analysis in a mere 30 lines of code.
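
One way to sidestep the growing list of punctuation characters is to describe what a word looks like instead of what it does not. The sketch below is an alternative to the string map approach, not the method used above: it assumes a word is a run of letters that may contain single hyphens or apostrophes in the middle. The sample text and variable names are made up.

```tcl
set text "A well-known fact - isn't it? Indeed, 42 times!"

# A word: letters, optionally followed by groups of one hyphen or
# apostrophe plus more letters. A lone dash or a number does not match.
set words [regexp -all -inline \
    {[[:alpha:]]+(?:[-'][[:alpha:]]+)*} [string tolower $text]]

foreach word $words {
    puts $word
}
```

Note that this keeps the apostrophe inside a word such as isn't, whereas the string map version splits it in two.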

Note: This algorithm is very easy to parallelize. In a future version we might show how to do that using the Thread package.

The second example: a table of data

As promised at the beginning, our main focus is numeric data. So let us look at a simple table of fictitious data:

    X        Y       Z
  -3.52    -3.17   -3.68
  -2.09    -1.76   -3.37
  -2.12    -0.76    1.69
  -0.20    -1.99    0.54
  -0.65     0.92    6.67
   1.65     1.02    1.91
   1.09     2.75    6.42
   3.61     4.79    6.16
   4.73     4.84    6.43
   4.75     5.30    8.98

This data was produced by the following script:

for {set i 0} {$i < 10} {incr i} {
    set x [expr {$i-5.0+2.0*rand()}]
    set y [expr {$i-5.0+3.0*rand()}]
    set z [expr {$i-5.0+10.0*rand()}]
    puts [format "%5.2f %5.2f %5.2f" $x $y $z]
}

Operations one might perform on this data include:

Creating a graph of X versus Y or Z to see if there is any obvious relation.
Selecting those records that fulfill some relation (only positive data for instance).
Performing statistical analysis on the data: linear regression for instance.
Sorting the records with respect to X, assuming that is the independent variable.

Place the data into some data structure in order to manipulate this data programmatically. Although an associative array could work,

#
# A variable in an array can be named anything, even a real number
#
set array(-3.52) {-3.17 -3.68}
...

it does not appear to be convenient: associative arrays are unordered. For instance, it would be necessary to examine all element names to select only positive X-values. That is:

Get a list of element names
Select the positive values only
Create a new array

In this case using a list of lists would save code: no need to get a list of X-values first, we already have that.

Furthermore, we include the names of the columns, so that the table is self-describing:

set table {
    {  X        Y       Z   }
    {-3.52    -3.17   -3.68 }
    {-2.09    -1.76   -3.37 }
    {-2.12    -0.76    1.69 }
    {-0.20    -1.99    0.54 }
    {-0.65     0.92    6.67 }
    { 1.65     1.02    1.91 }
    { 1.09     2.75    6.42 }
    { 3.61     4.79    6.16 }
    { 4.73     4.84    6.43 }
    { 4.75     5.30    8.98 }
}

puts "Column names: [lindex $table 0]"
puts "Column 1:"
foreach record $table {
    puts [lindex $record 0]
}
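
With the table stored as a list of lists, selecting the records with positive X-values takes a single loop. A minimal sketch, using a shortened copy of the table under the made-up names sample and positive:

```tcl
# A shortened, self-describing table: the first row holds the column names
set sample {
    {  X        Y       Z   }
    {-3.52    -3.17   -3.68 }
    { 1.65     1.02    1.91 }
    { 4.75     5.30    8.98 }
}

# Start the new table with the row of column names
set positive [list [lindex $sample 0]]

# Keep only the records whose first column (X) is positive
foreach record [lrange $sample 1 end] {
    if {[lindex $record 0] > 0} {
        lappend positive $record
    }
}

puts $positive
```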

Now that the data is structured, define a format for storing large sets of data:

The file can contain comment lines, starting with a # in the first position
The first line that is not a comment line contains the names of the columns
All other lines contain the values per record (separated by one or more spaces)

Here is a procedure to create and fill a table:

proc loadTable filename {
    set infile [open $filename]
    set table {}

    #
    # Read the file line by line until the end
    # ([gets] will return a negative value then)
    #
    while {[gets $infile line] >= 0} {
        if {[string index $line 0] eq {#}} {
            continue
        } else {
            lappend table $line
        }
    }
    close $infile
    return $table
}

In the above procedure gets is used to read one line at a time from the file. In the form,

gets $infile line

it returns the number of characters read, or a negative number if it is at the end of the file. By checking for a positive value or zero (condition: >= 0), we read empty lines too. If an empty line signifies another section (for instance the beginning of the next table), then a condition like "> 0" could be used.

This procedure does not take care of:

non-existent files (Tcl returns an error when trying to open the file)
lines that do not contain a proper list of numbers

That may harm us later on, so let us revise this code a bit:

proc loadTable filename {
    #
    # Open the file, catch any errors
    #
    if {[catch {
        set infile [open $filename]
    } msg]} {
        return -code error "Error opening $filename: $msg"
    }

    set table {}
    set names 0

    while {[gets $infile line] >= 0} {
        if {[string index $line 0] eq {#}} {
            continue
        } else {
            #
            # We need to examine the lines:
            # The first one contains strings, all others numbers
            #
            if {[catch {
                     set number_data [llength $line]
            } msg]} {
                return -code error "Error reading $filename:\
                    line can not be properly interpreted\nLine: $line"
            }

            if {$names == 0} {
                set names 1
                set number_columns [llength $line]
            } else {
                if {$number_columns != [llength $line]} {
                    return -code error "Error reading $filename:\
                        incorrect number of columns\nLine: $line"
                }
                foreach v $line {
                    if {![string is double $v]} {
                        return -code error "Error reading $filename:\
                            invalid number encountered - $v\nLine: $line"
                    }
                }
            }
            lappend table $line
        }
    }
    close $infile
    return $table
}

Checking the data increases the length of the code, but we can be reasonably sure now that an ill-formed file will not surprise us later - for instance, when trying to determine the maximum value of a column the result is "A" instead of a number.

Auxiliary procedures

This procedure can easily be used in an interactive session with tclsh or wish:

% set table [loadTable "input.data"]
% puts "Columns: [lindex $table 0]"
% puts "Number of rows: [expr {[llength $table]-1}]"

The drawback of the last two statements is that the implemented structure of the table must be used directly. Instead, some auxiliary procedures can provide an abstraction:

proc columnNames table {
    return [lindex $table 0]
}


proc numberRows {table} {
    expr {[llength $table]-1}
}


proc row {table index} {
    if {$index < 0 || $index >= [llength $table] - 1} {
        return -code error "Row index out of range"
    } else {
        return [lindex $table [expr {$index+1}]]
    }
}


proc column {table index {withname 1}} {
    if {$index < 0 || $index >= [llength [columnNames $table]]} {
        return -code error "Column index out of range"
    } else {
        set result {}
        foreach row $table {
            lappend result [lindex $row $index]
        }
    }

    # Keep the name?
    if {$withname} {
        return $result
    } else {
        return [lrange $result 1 end]
    }
}

For convenience we added an option to the "column" procedure: we may or may not want the name of the column included in the result. We can use it as follows:

If we just want the data:

% column $table 0 0

If we want the name included (the result could actually be used as a table with just one column):

% column $table 0

To make withname optional, a default value of 1 is provided in the procedure definition.

Displaying the table

So now that we can load a table and examine its contents, it may be a good idea to view it in a window, just to get some feeling for the data. We will use the Tablelist package for this:

#
# Load the package and for convenience, import all procedures
#
package require tablelist
namespace import ::tablelist::*

#
# Variable to create a unique window name
#
set wcount 0

proc display table {
    global wcount

    #
    # Out of precaution, check if "wcount" has been set
    #
    if {![info exists wcount]} {
        set wcount 0
    }

    #
    # Create a list of columns for the tablelist widget
    # (the numbers will be right aligned)
    #
    set columns {}
    foreach name [columnNames $table] {
        lappend columns 0 $name right
    }

    #
    # A new toplevel window
    #
    incr wcount
    set t .t_$wcount
    toplevel $t

    #
    # The table list in that window and two scrollbars
    #
    set tbl $t.tbl
    set vsb $t.vsb
    set hsb $t.hsb

    tablelist $tbl -columns $columns -height 10 -width 100 \
        -xscrollcommand [list $hsb set] \
        -yscrollcommand [list $vsb set] \
        -stripebackground lightblue
    scrollbar $vsb -orient vertical   -command [list $tbl yview]
    scrollbar $hsb -orient horizontal -command [list $tbl xview]

    #
    # Put them in the toplevel window
    #
    grid $tbl $vsb -sticky news
    grid $hsb      -sticky news
    grid rowconfigure    $t 0 -weight 1
    grid rowconfigure    $t 1 -weight 0
    grid columnconfigure $t 0 -weight 1
    grid columnconfigure $t 1 -weight 0

    #
    # Now put the data in
    #
    for {set i 0} {$i < [numberRows $table]} {incr i} {
        $tbl insert end [row $table $i]
    }
}

Nothing very difficult, except perhaps the creation of the three widgets and their cooperation:

The -xscrollcommand option determines what command to run when the visible part of the Tablelist widget is changed programmatically. In this case the horizontal scrollbar is updated. Similarly for the vertical scrollbar.

Of course, something similar must happen when the user operates the scrollbars. So each scrollbar has a command option too, to help with this.

The grid geometry manager is used to place the three widgets side by side and the -weight options make sure that the scrollbars cannot get thicker (weight 0); they can only grow in the direction of their longer axis. The tablelist widget can grow in both directions.

Statistics

So far, so good: we can load a table, extract information and display the whole thing. What about statistical properties of the data? Well, we have the math::statistics package to deal with the raw data.

Let us write a small convenience procedure to create a table with some basic statistical parameters. (Note: the first column will be one with labels, rather than numbers, but that does not matter!):

package require math::statistics
namespace import ::math::statistics::*

proc descriptiveStats table {
    #
    # Determine the mean, maximum, minimum, standard deviation
    # etc per column
    #
    set index 0
    set means {Mean}
    set min   {Minimum}
    set max   {Maximum}
    set ndata {"Number of data"}
    set stdev {"Standard deviation"}
    set var   {Variance}

    foreach name [columnNames $table] {
        set data     [column $table $index 0] ;# Now the option is useful!
        set statdata [basic-stats $data]

        lappend means [lindex $statdata 0]
        lappend min   [lindex $statdata 1]
        lappend max   [lindex $statdata 2]
        lappend ndata [lindex $statdata 3]
        lappend stdev [lindex $statdata 4]
        lappend var   [lindex $statdata 5]
        incr index
    }

    #
    # Put them in a table ...
    #
    set result [list [concat "Parameter" [columnNames $table]]]
    lappend result $means $min $max $ndata $stdev $var

    return $result
}

And because Tablelist does not really care about the type of data we want to display:

% display [descriptiveStats $table]

will display the statistical information just as it does the complete table.

Manipulating the data

Sometimes, we will want to sort the data by column and proceed with a table sorted by one or more columns. For one column this is pretty easy, as we saw in the previous example:

proc sort {table column} {
    set names [columnNames $table]
    set index [lsearch -exact $names $column]
    if {$index < 0} {
        return -code error "No column by that name: $column"
    }
    # Sort the data rows numerically ([lsort] calls this -real),
    # then put the row of column names back in front
    set newtable [linsert [
        lsort -index $index -real [
        lrange $table 1 end]] 0 $names]
    return $newtable
}

The above is suitable for a single column. It is only slightly more complicated to write a procedure that will sort the table by two or more columns - we will leave this as an exercise.
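
For readers who want a starting point for that exercise, one possible approach relies on the stability of lsort: sort on the least significant column first and on the most significant column last. The procedure name sortBy is hypothetical and the sketch assumes Tcl 8.5 or later (for lreverse) and purely numerical columns.

```tcl
# Hypothetical helper: sort a table by one or more named columns,
# the first name given being the most significant
proc sortBy {table args} {
    set names [lindex $table 0]
    set rows  [lrange $table 1 end]

    # Apply the sorts in reverse order of significance; because lsort
    # is stable, earlier orderings survive among equal values
    foreach column [lreverse $args] {
        set index [lsearch -exact $names $column]
        if {$index < 0} {
            return -code error "No column by that name: $column"
        }
        set rows [lsort -index $index -real $rows]
    }

    # Put the row of column names back in front
    return [linsert $rows 0 $names]
}

set small {{X Y} {2 1} {1 2} {1 1}}
puts [sortBy $small X Y]
```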

A sorted table is for instance useful when you want to plot the data in a graph: lines through data points are less messy when the data are sorted along, say, the x axis. And that is our next task.

Plotting the data

More than anything else, you will gain insight into the data if you can make a suitable picture. Let us try to create an XY-plot then. Again, there are packages for that sort of task, in fact several (see the references at the end). Here we use Plotchart, part of Tklib.

The procedure we have in mind will plot some column against another column:

plot $table X Y

would plot the column with label Y against the column with label X in a suitably scaled graph. Here is the full code:

package require Plotchart
package require math::statistics
namespace import ::Plotchart::* ;# For convenience only

proc plot {table xvar yvar} {
    global wcount

    if {![info exists wcount]} {
        set wcount 0
    }

    # First find the column indices ...

    set xid [lsearch -exact [columnNames $table] $xvar]
    set yid [lsearch -exact [columnNames $table] $yvar]

    if { $xid < 0 || $yid < 0 } {
        return -code error \
            "Either $xvar or $yvar is not a valid column for this table"
    }

    # Determine the extremes

    set xdata [column $table $xid 0]
    set ydata [column $table $yid 0]
    set xmax  [::math::statistics::max $xdata]
    set xmin  [::math::statistics::min $xdata]
    set ymax  [::math::statistics::max $ydata]
    set ymin  [::math::statistics::min $ydata]

    # Set up the plot - similar to the display proc

    incr wcount
    set t .t_$wcount
    toplevel $t
    pack [canvas $t.c -width 500 -height 300] -fill both

    set plot [createXYPlot $t.c [
        determineScale $xmin $xmax] [
        determineScale $ymin $ymax]
    ]

    # Plot the data

    foreach x $xdata y $ydata {
        $plot plot xy $x $y
    }
}

In this procedure we used lsearch -exact because lsearch by default uses so-called glob patterns for identifying the element. It is just a precaution: column names containing ? or * are probably rare, but they may lead to surprises if this behaviour is not taken into consideration.

The command to add a data point to the plot uses a label ("xy" in the above code) to identify to which series the data point belongs. That way we can combine multiple series in one plot or change the colour or line style.

Our present procedure will create a new window with each call, but that can easily be changed, for instance by using an optional argument that indicates a new window should be created or the last one reused (but then the label should be changed too!).

Advanced facilities

As this is meant to be a tutorial on data analysis with Tcl, and not on data analysis per se, we will not develop a complete data analysis package. Filtering the data should, however, be part of this tutorial and will be the last topic.

Since Tcl is a dynamic language, it is easy to write a filter procedure that will allow us to do:

set newtable $table
filter newtable {X Y Z} {$X > 0 && 2.0*$Y < 1.5*$Z}

This particular condition may not be very useful, but it is almost trivial to implement in Tcl:

proc filter {tablename fields condition} {
    upvar $tablename table
    foreach row [lrange $table[
        set table [list [columnNames $table]]
        list
    ] 1 end] {
        foreach field $fields val $row {
            upvar 1 $field var
            set var $val
        }
        if {[uplevel 1 [list ::expr $condition]]} {
            lappend table $row
        }
    }
    return
}

OK, so not entirely trivial, but not too bad once one learns a few nuances of Tcl.

The first argument to filter is the name of the variable the table is stored in. filter reads the table from this variable, and also stores the resulting filtered table in this variable. The second argument is a list of names to give the fields in each row. Each of these names becomes a variable at the caller's level, so they should be chosen so as not to conflict with any existing variables. The final argument is a condition just like the first argument to if. upvar connects the table variable at the caller's level to a variable at the local level. foreach iterates through the rows of the table and evaluates the given condition at the level of the caller. A substitution trick is used to reset $table right after its contents are obtained so that it is ready to hold the rows that pass the condition:

$table[set table [
    list [columnNames $table]]
    list
]

The substitutions are processed from left to right: $table is first replaced with the contents of that variable. Then $table is reset to just a list containing the names of the fields. The final list command with no arguments is just used to ensure that the whole substitution results in the empty string, adding nothing to the table contents. This illustrates an underappreciated feature of Tcl: command substitution doesn't have to be a single command -- it can be an entire script, and the result of the substitution is the result of the last command in the script.
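
The same mechanism can be tried out in isolation. The following small sketch uses made-up variable names (sum, buffer, copy) and separates the commands inside the brackets with semicolons instead of newlines:

```tcl
# A command substitution may hold several commands; its value is the
# result of the last one
set sum [set a 1; set b 2; expr {$a + $b}]
puts $sum

# The reset trick: $buffer is substituted first, then the variable is
# given a new value, and the trailing [list] contributes an empty string
set buffer {x y z}
set copy $buffer[set buffer {header}; list]
puts "copy=$copy buffer=$buffer"
```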

Usage example:

set a 2.0
set newtable $table
filter newtable {X Y Z} {$X > $a}

Further possibilities

A procedure similar to filter could be used to compute the result of one or more expressions using the column variables. For instance, something like the following could lead to a new table with columns X, Sum and Cosfunc whose values are computed using the columns from the original table.

set newtable $table
compute newtable {X Y Z} {X Sum Cosfunc} {
    set row [list $X [expr {$X+$Y}] [expr {$X/$Y+cos($Z)}]]
}
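
The compute procedure is not defined anywhere in this tutorial. The following is one hypothetical way to write it, modelled on filter: it assumes the script is evaluated at the caller's level and leaves the new record in a variable called row there. Like filter, it creates the field variables in the caller's scope.

```tcl
# Hypothetical sketch of compute: replace the table stored in the
# variable named by tablename with a table of computed columns
proc compute {tablename fields newfields script} {
    upvar 1 $tablename table
    # The script is expected to set "row" at the caller's level
    upvar 1 row newrow

    set result [list $newfields]
    foreach record [lrange $table 1 end] {
        # Make each field of the record available as a variable
        foreach field $fields val $record {
            upvar 1 $field var
            set var $val
        }
        uplevel 1 $script
        lappend result $newrow
    }
    set table $result
    return
}

set small {{X Y Z} {1.0 2.0 0.0} {3.0 4.0 0.0}}
compute small {X Y Z} {X Sum} {
    set row [list $X [expr {$X+$Y}]]
}
puts $small
```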

We could also create commands to combine tables into new tables or remove rows directly. We will leave such commands as an exercise.

References

Manipulating data sets like the one above is a very common task, and many specialised extensions exist for such things. For instance:

Popular database extensions are:

SQLite: A light-weight database management system, supporting SQL, transactions and other advanced features.

Metakit: A database management system geared to managing timeseries.

Ratcl: A continuation of the work done on Metakit with an emphasis on high performance manipulation of large amounts of data.

Other extensions include:

NAP: For N-dimensional array processing. Useful for instance for gridded data, such as those arising in oceanographic or meteorological applications.

BLT: A collection of procedures for plotting data and manipulating vectors of data, among other things.

dataplot: A standalone application for analysing data using sophisticated statistical methods.

Page Authors

Arjen Markus: The original author.

pyk
