[Arjen Markus] 2007-12-20:  This page is meant as a tutorial for the website Torsten Berg is designing.

''[EKB] FYI the dataplot link is broken. Is it [Dataplot]?''


** Data analysis with Tcl/Tk **

This is an introduction to Tcl/Tk with an emphasis on ''data analysis''.
Familiarity with the basics of Tcl and programming in general is assumed:
variables, loops, functions, if-statements and other programming constructs
should be familiar already.

This tutorial is mostly concerned with with numerical data, and the examples
are of gathering statistical information and plotting graphs from these data.
This is no attempt to show the construction of a fully-featured data
analysis package:  That would be a considerable task indeed, not because it
couldn't be done in Tcl, but because there are so many possibilities.

Tcl is a good choice for doing this kind of work because it makes it easy to
whip up a few dedicated computational procedures suitable for a particular
needs.


** A first example: dealing with text **

Here is the first example task:

Given a file containing ordinary text, how could one classify its contents as a
short story, a legal document or a scientific article?  One approach could be
to take an inventory of the words in each article, and then, based on their
frequencies, to use some scheme to classify them.  This tutorial shows how to
take the inventory, leaving the classifiction scheme as an exercise for the
reader.

The task at hand is almost trivial:

   * Read the file.
   * Split its contents in words, i.e individual sequences of characters.
   * Count the occurrences for each word.
   * Print each in descending order, the most frequently-used word first.

First, a suitable [data structure] is needed for the data. The primary choices
in Tcl are:

   * [list%|%Lists]
   * [array%|%Associative arrays]
   * [dict%|%Dictionaries]

A list is similar to an ''array'' in [Fortran], [Perl], [C] or [Java]:  It is an
ordered collection of data.  Each element of a list can be retrieved by the
index of that element.  A list is the most ubiquitous data structure in Tcl,
and in fact, each list has a syntax that makes it suitable for interpretation
by Tcl as a command.

Here are examples of a few list commands that will be used later:

======
set alist {x y z}            ;# Construct a list and store it in
                             ;# variable "alist"
set second [lindex $alist 1] ;# Get the second element from the list
                             ;# "alist" - counting starts at 0

lappend alist w              ;# Append a new element to the end of the list
======

An [array%|%associative array] on the other hand is an '''unordered'''
collection of variables and is similar to a '''hash''' in Perl.  A
[dict%|%dictionary] is even more similar since keys in a dictionary are just
keys, not variables.  A variable in an array can be retrieved using its name:
For example:

======
set count(occurrence) 1      ;# Set element "occurrence" in the
                             ;# array "count" to 1

parray count                 ;# Quick and dirty print of all array
                             ;# elements and their values

array names count            ;# Get a list of all array elements
======

For the above counting problem, associative arrays are suitable.  we can then
do something like:

======
set contents [readfile input.txt]
foreach word $contents {
    incr count($word)
}
parray count
======

This is of course only very rough code:  It is necessary to first read the
data, splitting the contents first into rows and from there into words.
Therefore, the next step is to refine the above code:

======
proc readfile filename {
    set infile [open $filename]
    set data [read $infile]
    close $infile
    return $data
}
======

That is enough to read the entire file and store the contents in a variable
called `contents`.

To split the content into words, punctuation must be cleaned from the data.  To
keep things simple we assume that the only punctuation that occurs is period,
apostrophe, quotation mark, question mark, exclamation mark, comma, semicolon
and colon.  Here, `[string map]`, replaces each occurrence of the given
punctuation characters with a space character, which is convenient later when
the content is treated as a list:

======
set contents [string map {
    . { }
    ; { }
    : { }
    ' { }
    ! { }
    ? { }
    , { }
    \" { }
} $contents]
======


This formatting of the command is only meant to make the structure of
the list of replacements clearer. There is no special meaning to doing
it this way.

We should remember to convert the letters to lower-case (or
upper-case) as "We" and "we" clearly represent the same word.

======
set contents [string tolower $contents]
======

And now, the grand finale: $contents also be interpreted as a list of words,
and we can count them by iterating through them using `[foreach]`:

======
foreach word $contents {
    if {[info exists count($word)]} {
        incr count($word)
    } else {
        set count($word) 1
    }
}
======

`[info exists]` indicates whether the variable already
exists in the array so that it can be properly initialised.

As of [Changes in Tcl/Tk 8.5%|%Tcl8.5], `[incr]` no longer returns an error
when a variable does not yet exist.  Instead, it initializes the varible to
`0`, and then increments it.  The construct above is still valid of course, and
could be useful if something other than initialisation to zero is desired.

The last part of the task is to print the list of words in the correct order.
That is a bit tricky: It would be easy to order the words alphabetically, but
to order the words by frequency of occurrence, the data should be "inverted".
`[lsort]` can do this if put the frequency data in a list, actually a list of
lists:

======
set frequencies {}
foreach word [array names count] {
    lappend frequencies [list $word $count($word)]
}

#
# The -index option is used to make [lsort] look inside the list
# elements
# The -integer option is used because by default [lsort] regards
# list elements as strings
#
set sorted [lsort -index 1 -integer $frequencies]

# Print them to the screen

foreach frequency $sorted {
    puts $frequency
}
======

Of course, words can occur in the same frequency in the text we are
examining, it might therefore be nicer to sort the words with the
same frequency in alphabetical order. And that is actually very simple,
thanks to the sorting algorithm that is used - mergesort:

======
set sorted [lsort -index 1 -integer [
    lsort -index 0 $frequencies]]
======

We first sort with respect to the words (element 0 in each sublist) and
then to the frequency (element 1). As we do not need the
intermediate result, we combine the two actions in one command. The
mergesort algorithm makes sure that any previous order among elements
that have the same "sort value" is preserved.

The complete code:

======
#
# Read the entire file
#
set infile [open input.txt]
set contents [read $infile]
close $infile

# Polish the contents (could be combined)
set contents [string map {
    . { }
    ; { }
    : { }
    ' { }
    ! { }
    ? { }
    , { }
    \" { }
} $contents]
set contents [string tolower $contents]

#
# Count the words
#
foreach word $contents {
    if {[info exists count($word)]} {
        incr count($word)
    } else {
        set count($word) 1
    }
}

#
# Convert the data into a form easily sorted
#
set frequencies {}
foreach word [array names count] {
    lappend frequencies [list $word $count($word)]
}

set sorted [lsort -index 1 -integer -decreasing [
    lsort -index 0 $frequencies]]

#
# Print them to the screen
#
foreach frequency $sorted {
    puts $frequency
}
======

The result, when used on the first three paragraphs of this text is:

======none
for 5
is 5
a 4
be 4
data 4
that 4
with 4
are 3
it 3
not 3
of 3
tcl 3
the 3
to 3
we 3
will 3
an 2
analysis 2
and 2
because 2
familiar 2
in 2
so 2
this 2
you 2
- 1
actually 1
all 1
assumed 1
at 1
attempt 1
basics 1
build 1
but 1
choice 1
computational 1
considerable 1
could 1
dealing 1
dedicated 1
do 1
doing 1
easy 1
emphasis 1
examples 1
few 1
fully-featured 1
functions 1
gathering 1
general 1
good 1
graphs 1
here 1
if-statements 1
indeed 1
information 1
introduction 1
kind 1
least 1
less 1
loops 1
many 1
more 1
mostly 1
needs 1
numerical 1
on 1
or 1
package 1
particular 1
plotting 1
possibilities 1
procedures 1
programming 1
should 1
shown 1
statistical 1
suitable 1
task 1
tcl/tk 1
there 1
these 1
they 1
up 1
variables 1
very 1
whip 1
why 1
work 1
would 1
your 1
======

We forgot one punctuation character: the dash, so this shows up as a
word as well, and of course we need to take care of hyphens and numerous
other details. But that is not the point: we managed to do this analysis
in a mere 30 lines of code.

''Note:'' This algorithm is very easy to parallelize. In a future
version we might show how to do that using the Threads package.

** The second example: a table of data **

As promised at the beginning, our main focus is numeric data. So let us
look at some simple table of fictitious data:

    X        Y       Z
  -3.52    -3.17   -3.68
  -2.09    -1.76   -3.37
  -2.12    -0.76    1.69
  -0.20    -1.99    0.54
  -0.65     0.92    6.67
   1.65     1.02    1.91
   1.09     2.75    6.42
   3.61     4.79    6.16
   4.73     4.84    6.43
   4.75     5.30    8.98

The following script produced this data:

======
for {set i 0} {$i < 10} {incr i} {
    set x [expr {$i-5.0+2.0*rand()}]
    set y [expr {$i-5.0+3.0*rand()}]
    set z [expr {$i-5.0+10.0*rand()}]
    puts [format "%5.2f %5.2f %5.2f" $x $y $z]
}
======


Operations one might perform on this data include:

   * Creating a graph of X versus Y or Z to see if there is any obvious relation.
   * Selecting those records that fulfill some relation (only positive data for instance).
   * Performing statistical analysis on the data: linear regression for instance.
   * Sorting the records wrt X, assuming that is the independent variable.

In order to manipulate this data programmatically, it should placed into some
structure.  Alhtough an associative array could work,

======
#
# A variable in an array can be named anything, even a number a real number
#
set array(-3.52) {-3.17 -3.68}
...
======

it does not appear to be convenient: Associative
arrays are unordered.  It would be necessary for example to examine all element names
want to select only positive X-values for instance, that is:

   * Get a list of element names
   * Select the positive values only
   * Create a new array

In this case using a list of lists would save code: no need to
get a list of X-values first, we already have that.

Furthermore, if '''include the names of the columns''', we get a
self-describing table:

======
set table {
    {  X        Y       Z   }
    {-3.52    -3.17   -3.68 }
    {-2.09    -1.76   -3.37 }
    {-2.12    -0.76    1.69 }
    {-0.20    -1.99    0.54 }
    {-0.65     0.92    6.67 }
    { 1.65     1.02    1.91 }
    { 1.09     2.75    6.42 }
    { 3.61     4.79    6.16 }
    { 4.73     4.84    6.43 }
    { 4.75     5.30    8.98 }
}

puts "Column names: [lindex $table 0]"
puts "Column 1:"
foreach record $table {
    puts [lindex $record 0]
}
======

Now that the data is structured, we define a format for storing large sets of data:

   * The file can contain comment lines, starting with a # in the first position
   * The first line that is not a comment line contains the names of the columns
   * All other lines contain the values per record (separated by one or more spaces)

Here is a procedure to create and fill a table:

======
proc loadTable filename {
    set infile [open $filename]
    set table {}

    #
    # Read the file line by line until the end
    # ([gets] will return a negative value then)
    #
    while {[gets $infile line] >= 0} {
        if {[string index $line 0] eq {#}} {
            continue
        } else {
            lappend table $line
        }
    }
    close $infile
    return $table
}
======

''Note:'' in the above procedure `[gets]` is used to read one line at a time
from the file. In this form ("gets $infile line")
it returns the ''number of characters'' it read or a negative number if
it read past the end of the file. By checking for a positive value or
zero (condition: >= 0), we read empty lines too. If an empty line
signifies another section (for instance the beginning of the next
table), then a condition like "> 0" could be used.

This procedure does not take care of:

   * non-existent files (Tcl returns an error when trying to open the file)
   * lines that do not contain a proper list of numbers

That may harm us later on, so let us revise this code a bit:

======
proc loadTable filename {
    #
    # Open the file, catch any errors
    #
    if {[catch {
        set infile [open $filename]
    } msg]} {
        return -code error "Error opening $filename: $msg"
    }

    set table {}
    set names 0

    while {[gets $infile line] >= 0} {
        if {[string index $line 0] eq {#}} {
            continue
        } else {
            #
            # We need to examine the lines:
            # The first one contains strings, all others numbers
            #
            if {[catch {
                     set number_data [llength $line]
            } msg]} {
                return -code error "Error reading $filename:\
                    line can not be properly interpreted\nLine: $line"
            }

            if {$names == 0} {
                set names 1
                set number_columns [llength $line]
            } else {
                if {$number_columns != [llength $line]} {
                    return -code error "Error reading $filename:\
                        incorrect number of columns\nLine: $line"
                }
                foreach v $line {
                    if {![string is double $v]} {
                        return -code error "Error reading $filename:\
                            invalid number encountered - $v\nLine: $line"
                    }
                }
            }
            lappend table $line
        }
    }
    close $infile
    return $table
}
======

Checking the data increases the length of the code, but we can be
reasonably sure now that an ill-formed file will not surprise us later -
for instance, when trying to determine the maximum value of a column the
result is "A" instead of a number.

'''Auxiliary procedures'''

This procedure can easily be used in an interactive session with tclsh
or wish:

======none
% set table [loadTable "input.data"]
% puts "Columns: [lindex $table 0]"
% puts "Number of rows: [expr {[llength $table]-1}]"
======

The drawback of the last two statements is that the
implemented structure of the table must be used directly.  Instead, some
auxiliary procedures can provide an abstraction:

======
proc columnNames table {
    return [lindex $table 0]
}


proc numberRows {table} {
    expr {[llength $table]-1}
}


proc row {table index} {
    if {$index < 0 || $index >= [llength $table]} {
        return -code error "Row index out of range"
    } else {
        return [lindex $table [expr {$index+1}]]
    }
}


proc column {table index {withname 1}} {
    if {$index < 0 || $index >= [llength [columnNames $table]]} {
        return -code error "Column index out of range"
    } else {
        set result {}
        foreach row $table {
            lappend result [lindex $row $index]
        }
    }

    # Keep the name?
    if {$withname} {
        return $result
    } else {
        return [lrange $result 1 end]
    }
}
======

For convenience we added an option to the "column" procedure: we may or
may not want the name of the column included in the result. We can use
it as follows:

   * If we just want the data:

======none
% columnNames $table 0 0
======

   * If we want the name included (it could actually be used as a table
with just one column):

======none
% columnNames $table 0
======

To make $withname optional, a default value of `1` is provided in the procedure
definition.

''Displaying the table''

So now that we can load a table and examine its contents, it may be
a good idea to view it in a window, just to get some feeling for the
data. We will use the Tablelist package for this:

======
#
# Load the package and for convenience, import all procedures
#
package require tablelist
namespace import ::tablelist::*

#
# Variable to create a unique window name
#
set wcount 0

proc display table {
    global wcount

    #
    # Out of precaution, check if "wcount" has been set
    #
    if {![info exists wcount]} {
        set wcount 0
    }

    #
    # Create a list of columns for the tablelist widget
    # (the numbers will be right aligned)
    #
    set columns {}
    foreach name [columnNames $table] {
        lappend columns 0 $name right
    }

    #
    # A new toplevel window
    #
    incr wcount
    set t .t_$wcount
    toplevel $t

    #
    # The table list in that window and two scrollbars
    #
    set tbl $t.tbl
    set vsb $t.vsb
    set hsb $t.hsb

    tablelist $tbl -columns $columns -height 10 -width 100 \
        -xscrollcommand [list $hsb set] \
        -yscrollcommand [list $vsb set] \
        -stripebackground lightblue
    scrollbar $vsb -orient vertical   -command [list $tbl yview]
    scrollbar $hsb -orient horizontal -command [list $tbl xview]

    #
    # Put them in the toplevel window
    #
    grid $tbl $vsb -sticky news
    grid $hsb      -sticky news
    grid rowconfigure    $t 0 -weight 1
    grid rowconfigure    $t 1 -weight 0
    grid columnconfigure $t 0 -weight 1
    grid columnconfigure $t 1 -weight 0

    #
    # Now put the data in
    #
    for {set i 0} {$i < [numberRows $table]} {incr i} {
        $tbl insert end [row $table $i]
    }
}
======

Nothing very difficult, except perhaps the creation of the three
widgets and their cooperation:

   * the -xscrollcommand option determines what command to run when the
visible part of the Tablelist widget is changed programmatically. In
this case the horizontal scrollbar is updated. Similarly for the
vertical scrollbar.
   * of course, something similar must happen when the user operates the
scrollbars. So each scrollbar has a command option too, to help with
this.
   * the grid geometry manager is used to place the three widgets side
by side and the -weight options make sure that the scrollbars can not
get thicker (weight 0), they can only grow in the direction of their
longer axis. The tablelist widget can grow in both ways.

'''Statistics'''

So far, so good: we can load a table, extract information and display
the whole thing. What about statistical properties of the data? Well, we
have the math::statistics package to deal with the raw data.

Let us write a small convenience procedure to create a table with some
basic statistical parameters. (Note: the first column will be one with
labels, rather than numbers, but that does not matter!):

======
package require math::statistics
namespace import ::math::statistics::*

proc descriptiveStats table {
    #
    # Determine the mean, maximum, minimum, standard deviation
    # etc per column
    #
    set index 0
    set means {Mean}
    set min   {Minimum}
    set max   {Maximum}
    set ndata {"Number of data"}
    set stdev {"Standard deviation"}
    set var   {Variance}

    foreach name [columnNames $table] {
        set data     [column $table $index 0] ;# Now the option is useful!
        set statdata [basic-stats $data]

        lappend means [lindex $statdata 0]
        lappend min   [lindex $statdata 1]
        lappend max   [lindex $statdata 2]
        lappend ndata [lindex $statdata 3]
        lappend stdev [lindex $statdata 4]
        lappend var   [lindex $statdata 5]
        incr index
    }

    #
    # Put them in a table ...
    #
    set result [list [concat "Parameter" [columnNames $table]]]
    lappend result $means $min $max $ndata $var

    return $result
}
======

And because Tablelist does not really care about the type of data we
want to display:

======
% display [descriptiveStats $table]
======

will display the statistical information just as it the complete table.


''Manipulating the data''

Sometimes, we will want to sort the data by column and proceed with a
table sorted by one or more columns. For one column this is pretty easy,
as we saw in the previous example:

======
proc sort {table column} {

    set names [columnNames $table]
    set index [lsearch $names $column]
    if { $index < 0 } {
        return -code error "No column by that name: $column"
    }
    set newtable [list $names \
                     [lsort -index $index -double \
                         [lrange $table 1 end]]]
    return $newtable
}
======

The above is suitable for a single column. It is only slightly more
complicated to write a procedure that will sort the table to two or more
columns - we will leave this as an exercise.

A sorted table is for instance useful when you want to plot the data in
a graph: lines through data points are less messy when the data are
sorted along, say, the x axis. And that is our next task.


''Plotting the data''

More than anything else, you will gain insight in the data if you can
make a suitable picture. Let us try and create a XY-plot then. Again,
there is a package for that sort of tasks - in fact, there are several
(see the references at the end) - Plotchart, part of Tklib.

The procedure we have in mind will plot some column against another
column:

======
plot $table X Y
======

would plot the column with label Y against the column with label X in a
suitably scaled graph. Here is the full code:

======
package require Plotchart
package require math::statistics
namespace import ::Plotchart::* ;# For convenience only

proc plot {table xvar yvar} {
    global wcount

    if {![info exists wcount]} {
        set wcount 0
    }

    # First find the column indices ...

    set xid [lsearch -exact [columnNames $table] $xvar]
    set yid [lsearch -exact [columnNames $table] $yvar]

    if { $xid < 0 || $yid < 0 } {
        return -code error \
            "Either $xvar or $yvar is not a valid column for this table"
    }

    # Determine the extremes

    set xdata [column $table $xid 0]
    set ydata [column $table $yid 0]
    set xmax  [::math::statistics::max $xdata]
    set xmin  [::math::statistics::min $xdata]
    set ymax  [::math::statistics::max $ydata]
    set ymin  [::math::statistics::min $ydata]

    # Set up the plot - similar to the display proc

    incr wcount
    set t .t_$wcount
    toplevel $t
    pack [canvas $t.c -width 500 -height 300] -fill both

    set plot [createXYPlot $t.c [
        determineScale $xmin $xmax] [
        determineScale $ymin $ymax]
    ]

    # Plot the data

    foreach x $xdata y $ydata {
        $plot plot xy $x $y
    }
}
======

In this procedure we used `[lsearch] -exact]` because `[lsearch]` by
default uses so-called glob patterns for identifying the element. It is
just a precaution, column names containing ? or * are probably rare, but it
may lead to surprises if this behaviour is not taken into consideration.

The command to add a data point to the plot uses a label ("xy" in the
above code) to identify to which series the data point belongs. That way
we can combine multiple series in one plot or change the colour or
line style.

Our present procedure will create a new window with each call, but that
can easily be changed, for instance by using an optional argument that
indicates a new window should be created or the last one reused (but
then the label should be changed too!).


''Advanced facilities''

As this tutorial is meant to be a tutorial on data analysis with Tcl,
and not on data analysis perse, we will not develop a complete data
analysis package. Filtering the data should, however, be part of this
tutorial and will be the last topic.

Since Tcl is a dynamic language, it is easy to filter
procedure will allow us to do:

======
set newtable [filter $table {$X > 0 && 2.0*$Y < 1.5*$Z}]
======

This particular condition may not be very useful, but it is almost
trivial to implement in Tcl:

======
proc filter {table condition} {
    set names [columnNames $table]
    set body [string map [
        list @NAMES@ [list $names] @COND@ [list $condition]] {
            set v::result [list @NAMES@]
            foreach v::row [lrange $table 1 end] {
                foreach @NAMES@ $v::row break
                if @COND@ {
                    lappend v::result $v::row
                }
            }
            return $v::result
        }
    ]

    namespace eval v {}
    proc filter_private table $body
    filter_private $table
}
======

Okay, claiming this to be an almost trivial procedure is an
exaggeration. Some sophisticated things are going on here!

First, `[filter]` recreates `filter_private` having a custom body.  In Tcl,
`[proc]` can be used at any time to create a new procedure.  `[string map]`,
generates the desired body of by replacing `@NAMES@` and `@COND@` with the
values of the `$names` and `$condition`, which are each placed in a [list] so
that each is properly quoted as a single word in the body.

`$result` and `$row` are each located in the namespace "v" so that there are no
local variables to conflict with the names of the columns in the table, which
each become a local variable.  Otherwise, a column named "row" or "result"
would cause trouble.

''Note:'' There is one serious limitation with this fragment: we can not
use variables from the calling environment. For instance:

======
set a 1.0
set newtable [filter $tabel {$X > $a}]
======

does not work because in the private procedure the variable "a" is
unknown.

Several methods are available to circumvent this limitation, but they
are fairly sophisticated and we shall not discuss them here.


''Further possibilities''

The above method to create a temporary procedure can also be used to
compute the result of one or more expressions using the column
variables. For instance:

======
set newtable [compute $table X {$X} Sum {$X+$Y} Cosfunc {$X/$Y+cos($Z)}]
======

could lead to a new table with columns X, Sum and Cosfunc whose values
are computed using the columns from the original table.

We could also create commands to combine tables into new tables or
remove rows directly. We will leave such commands as an exercise.


''References''

Manipulating data sets like the one above is a very common task and many
specialised extensions exist. For instance:

Popular database extensions are:

   * SQLite, a light-weight database management system, supporting SQL, transactions and the like [http://wiki.tcl.tk/sqlite]
   * Metakit, a database management system geared to managing timeseries [http://wiki.tcl.tk/metakit]

A new extension is [http://wiki.tcl.tk/ratcl%|%Ratcl]. This is a
continuation of the work done on Metakit with an emphasis on high
performance manipulation of large amounts of data.

Other extensions we want to mention are:

   * NAP, N-dimensional array processing, useful for instance for gridded data, such as those arising in oceanographic or meteorologic applications, [http://wiki.tcl.tk/nap]
   * BLT, a collection of procedures for plotting data and manipulating vectors of data, among other things [http://wiki.tcl.tk/blt]
   * dataplot, a standalone application for analysing data using sophisticated statistical methods [http://www.nist.gov/dataplot]


** Page Authors **

   [Arjen Markus]:   


   [pyk]:   


<<categories>> Tutorial