Data analysis with Tcl

Arjen Markus (20 December 2007) This page is meant as a tutorial for the website Torsten Berg is designing.

EKB FYI the dataplot link is broken. Is it this one: [1 ]?

Data analysis with Tcl/Tk

This is an introduction to Tcl/Tk with an emphasis on data analysis. It is assumed that you are more or less familiar with the basics of Tcl and at the very least with programming in general - variables, loops, functions, if-statements, they should all be familiar for you.

We will be mostly dealing with numerical data, so the examples shown here are for gathering statistical information and for plotting graphs of these data. We will not attempt to build a fully-featured data analysis package: that would be a considerable task indeed. Not because we could not do it in Tcl, but because there are so many possibilities.

That is why Tcl is actually a good choice for doing this kind of work: it is easy to whip up a few dedicated computational procedures that will be suitable for your particular needs.

Note: We will assume Tcl 8.4 in this introduction with a number of packages. Some remarks will be made about Tcl 8.5, as that offers even more facilities.

A first example: dealing with text

Here is the first problem we want to look at:

We have a file containing ordinary text and we want to know how we can classify its contents: is it a short story? a legal document? a scientific article?

To decide this, we will look at the words that are being used. We count them and then, based on some classification scheme, we will be able to put it in a particular category. Well, in this introduction we will stop after counting them, as we do not have a classification scheme.

The task at hand is (almost) trivial:

• Read the file
• Split its contents in words (sequences of alphabetical characters)
• Count the occurrences for each word
• Print them in descending order, the most frequently used word first

To implement the steps above, we need to decide on a suitable data structure. We basically have two choices in Tcl:

• Lists
• Associative arrays

Lists work more or less like arrays work in Fortran, Perl, C or Java. They are ordered collections of data. You can refer to elements in a list by their index. Lists are the most ubiquitous data structure in Tcl, because simply put, every command is a list.

Here are a few list commands, we will need them later:

set alist {x y z}            ;# Construct a list and store it in
;# variable "alist"
set second [lindex \$alist 1] ;# Get the second element from the list
;# "alist" - counting starts at 0

lappend alist w              ;# Append a new element to the end of the list

Associative arrays on the other hand are unordered collections of data and work like hashes in Perl. Individual elements are referred to by a string, the name of the element. For instance:

set count(occurrence) 1      ;# Set element "occurrence" in the
;# array "count" to 1

parray count                 ;# Quick and dirty print of all array
;# elements and their values

array names count            ;# Get a list of the names of all array elements

Note: In Tcl 8.5 we will have a third basic data structure, the dictionary, which is a mixture between lists and associative arrays.
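
As a small illustration of the dictionary mentioned in this note, here is a minimal sketch of counting words with [dict incr]. It assumes Tcl 8.5 or later (the rest of this introduction sticks to Tcl 8.4), and the word list is made up for the example.

```tcl
# Assumes Tcl 8.5 or later, where the dict command is built in.
# The word list is invented for this illustration.
set words {the cat and the dog and the bird}

set count [dict create]
foreach word $words {
    dict incr count $word    ;# a key that does not exist yet starts at 0
}

puts [dict get $count the]   ;# prints 3
```

Unlike an associative array, the dictionary is an ordinary value, so it can be passed to and returned from procedures directly.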

For the above counting problem, associative arrays are most suitable: we can then do (roughly):

set contents [readfile "input.txt"]
foreach word \$contents {
incr count(\$word)
}

parray count

This is of course only very rough code: we still need to read the file and to split its contents into words. Also, increasing the count for a word that we have not encountered yet will result in an error.

The next step is therefore to refine the above code:

proc readfile { filename } {
set infile [open \$filename]
set data [read \$infile]
close \$infile
return \$data
}

That is enough to read the entire file. The rough code above stores the returned contents in a variable called "contents".

But to split it in words, we will need to get rid of punctuation characters like a period. Assume that the only ones we need to deal with are: period, apostrophe, quotation mark, question mark, exclamation mark, comma, semicolon and colon, the more usual punctuation marks.

We can do that most easily with the command "[string map]". This command replaces substrings in a larger string by other strings. For instance, the command:

set contents [string map {. " "} \$contents]

will replace all periods in the string contained in the variable contents by spaces. (Why spaces? Because they are nicely taken care of by the automatic conversion from string to list. See below). So, to get rid of all of them:

set contents [string map {. " "
; " "
: " "
' " "
! " "
? " "
, " "
\" " " } \$contents]

(This formatting of the command is only meant to make the structure of the list of replacements clearer. There is no special meaning to doing it this way.)
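
To see the effect of such a replacement list on a small scale, the following sketch applies it to a made-up sample sentence (the sentence and the variable names "sample" and "cleaned" are only for illustration) and then inspects the result as a list:

```tcl
# A made-up sample sentence, just for this illustration
set sample {Hello, world! Is this "Tcl"? Yes: it is.}

# Replace a few punctuation characters by spaces, then fold the case
set cleaned [string map {. " "  , " "  ! " "  ? " "  : " "  \" " "} $sample]
set cleaned [string tolower $cleaned]

puts [llength $cleaned]     ;# 8 words remain
puts [lindex $cleaned 4]    ;# tcl
```

The multiple spaces left behind by the replacements do no harm: they are ignored when the string is interpreted as a list.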

We should not forget to convert the letters to lower-case (or upper-case) as "We" and "we" clearly represent the same word.

set contents [string tolower \$contents]

And now, the grand finale: the string contained in "contents" can also be interpreted as a list of words! That means that via the [foreach] command we can loop over all words in the text and count them:

foreach word \$contents {
if { [info exists count(\$word)] } {
incr count(\$word)
} else {
set count(\$word) 1
}
}

We use the [info exists] command to check if the array element already exists or not. This way it can be properly initialised.
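
One caveat: interpreting arbitrary text as a list only works as long as the text is a well-formed list. Braces or unbalanced quotation marks in the text can make [foreach] fail. A possible alternative, sketched here with a made-up sentence and assuming that words consist of the letters a-z only, is to let [regexp] extract the words:

```tcl
# The braces in this invented sentence would break a plain [foreach] over the text
set text "We count {words}, and we count them well."

# [regexp -all -inline] returns every match of the pattern as a list
set words [regexp -all -inline {[a-z]+} [string tolower $text]]

foreach word $words {
    if { [info exists count($word)] } {
        incr count($word)
    } else {
        set count($word) 1
    }
}

puts "$count(we) $count(count) [llength $words]"   ;# 2 2 8
```

With this approach the punctuation does not need to be removed first, at the price of a stricter definition of what a word is.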

Note on Tcl 8.5: The [incr] command behaves slightly differently in Tcl 8.5. Rather than throwing an error when a variable has not been set yet, it will implicitly initialise it to zero (0) and then do the increment. The above construction is still valid of course and could be useful if an initialisation to zero is not what you need.

The last part of our task is to print the list of words in the correct order. That is a bit tricky: we can easily print an alphabetical list of words. But the most frequent word first means we need to "invert" the data. Let us put the frequency data in a list, actually a list of lists, so that we can use the [lsort] command to do the hard work for us:

set frequencies {}
foreach word [array names count] {
lappend frequencies [list \$word \$count(\$word)]
}

#
# The -index option is used to make [lsort] look inside the list
# elements
# The -integer option is used because by default [lsort] regards
# list elements as strings
# The -decreasing option puts the most frequent word first
#
set sorted [lsort -index 1 -integer -decreasing \$frequencies]

# Print them to the screen

foreach frequency \$sorted {
puts \$frequency
}

Of course, several words can occur with the same frequency in the text we are examining. It might therefore be nicer to sort the words with the same frequency in alphabetical order. That is actually very simple, thanks to the sorting algorithm that is used, mergesort, which is stable:

set sorted [lsort -index 1 -integer -decreasing \
[lsort -index 0 \$frequencies]]

We first sort with respect to the words (element 0 in each sublist) and then to the frequency (element 1). As we do not need the intermediate result, we combine the two actions in one command. The mergesort algorithm makes sure that any previous order among elements that have the same "sort value" is preserved.
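
The effect of the two sorting passes is easiest to see on a tiny list. The data below are invented for the illustration:

```tcl
# Invented word/frequency pairs
set frequencies {{pear 2} {apple 3} {fig 2} {kiwi 3}}

# Inner sort: alphabetical on the word (element 0)
# Outer sort: numerical on the frequency (element 1), highest first;
# ties keep the alphabetical order from the inner sort
set sorted [lsort -index 1 -integer -decreasing \
               [lsort -index 0 $frequencies]]

puts $sorted   ;# {apple 3} {kiwi 3} {fig 2} {pear 2}
```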

The complete code:

#
# Read the entire file
#
set infile [open "input.txt"]
set contents [read \$infile]
close \$infile

#
# Polish the contents (could be combined)
#
set contents [string map {. " "
; " "
: " "
' " "
! " "
? " "
, " "
\" " " } \$contents]
set contents [string tolower \$contents]

#
# Count the words
#
foreach word \$contents {
if { [info exists count(\$word)] } {
incr count(\$word)
} else {
set count(\$word) 1
}
}

#
# Convert the data into a form easily sorted
#
set frequencies {}
foreach word [array names count] {
lappend frequencies [list \$word \$count(\$word)]
}

set sorted [lsort -index 1 -integer -decreasing \
[lsort -index 0 \$frequencies]]

#
# Print them to the screen
#
foreach frequency \$sorted {
puts \$frequency
}

The result, when used on the first three paragraphs of this text is:

for 5
is 5
a 4
be 4
data 4
that 4
with 4
are 3
it 3
not 3
of 3
tcl 3
the 3
to 3
we 3
will 3
an 2
analysis 2
and 2
because 2
familiar 2
in 2
so 2
this 2
you 2
- 1
actually 1
all 1
assumed 1
at 1
attempt 1
basics 1
build 1
but 1
choice 1
computational 1
considerable 1
could 1
dealing 1
dedicated 1
do 1
doing 1
easy 1
emphasis 1
examples 1
few 1
fully-featured 1
functions 1
gathering 1
general 1
good 1
graphs 1
here 1
if-statements 1
indeed 1
information 1
introduction 1
kind 1
least 1
less 1
loops 1
many 1
more 1
mostly 1
needs 1
numerical 1
on 1
or 1
package 1
particular 1
plotting 1
possibilities 1
procedures 1
programming 1
should 1
shown 1
statistical 1
suitable 1
tcl/tk 1
there 1
these 1
they 1
up 1
variables 1
very 1
whip 1
why 1
work 1
would 1

We forgot one punctuation character: the dash, so this shows up as a word as well, and of course we need to take care of hyphens and numerous other details. But that is not the point: we managed to do this analysis in a mere 30 lines of code.

Note: This algorithm is very easy to parallelize. In a future version we might show how to do that using the Thread package.

The second example: a table of data

As promised at the beginning, our main focus is numeric data. So let us look at some simple table of fictitious data:

X        Y       Z
-3.52    -3.17   -3.68
-2.09    -1.76   -3.37
-2.12    -0.76    1.69
-0.20    -1.99    0.54
-0.65     0.92    6.67
1.65     1.02    1.91
1.09     2.75    6.42
3.61     4.79    6.16
4.73     4.84    6.43
4.75     5.30    8.98

(These were in fact produced by the following small program:

for {set i 0} {\$i < 10} {incr i} {
set x [expr {\$i-5.0+2.0*rand()}]
set y [expr {\$i-5.0+3.0*rand()}]
set z [expr {\$i-5.0+10.0*rand()}]
puts [format "%5.2f %5.2f %5.2f" \$x \$y \$z]
}

as it is very tough and tedious to produce "arbitrary" numbers by hand)

The kind of things we want to be able to do with such data include:

• Create a graph of X versus Y or Z to see if there is any obvious relation
• Select those records that fulfill some relation (only positive data for instance)
• Perform statistical analysis on the data: linear regression for instance
• Sort the records with respect to X, assuming that is the independent variable

First, however, we must be able to deal with the data. We can put them in an associative array:

#
# Element names can be anything, including a floating-point number
#
set array(-3.52) {-3.17 -3.68}
...

for instance, but that does not appear to be convenient. Associative arrays are unordered, so to select only the positive X-values, say, we would always have to examine the element names, that is:

• Get a list of element names
• Select the positive values only
• Create a new array

In this case using a list of lists would save code: no need to get a list of X-values first, we already have that.
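
A minimal sketch of that selection with a list of lists, using a few of the records from the table above (the variable names "rows" and "positive" are chosen for this illustration only):

```tcl
# A few records from the table: each record is a list {X Y Z}
set rows {
    {-3.52 -3.17 -3.68}
    { 1.65  1.02  1.91}
    {-0.20 -1.99  0.54}
    { 4.75  5.30  8.98}
}

# Keep only the records whose X-value (element 0) is positive
set positive {}
foreach record $rows {
    if { [lindex $record 0] > 0 } {
        lappend positive $record
    }
}

puts [llength $positive]   ;# 2
```

The selected records keep their original order, something an associative array would not guarantee.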

Furthermore, if we take the above table and include the names of the columns we get a self-describing table:

set table {
{  X        Y       Z   }
{-3.52    -3.17   -3.68 }
{-2.09    -1.76   -3.37 }
{-2.12    -0.76    1.69 }
{-0.20    -1.99    0.54 }
{-0.65     0.92    6.67 }
{ 1.65     1.02    1.91 }
{ 1.09     2.75    6.42 }
{ 3.61     4.79    6.16 }
{ 4.73     4.84    6.43 }
{ 4.75     5.30    8.98 }}

puts "Column names: [lindex \$table 0]"
puts "Column 1:"
foreach record \$table {
puts [lindex \$record 0]
}

Okay, we have a data structure now. Let's define a file format that allows us to get large amounts of data:

• The file can contain comment lines, starting with a # in the first position
• The first line that is not a comment line contains the names of the columns
• All other lines contain the values per record (separated by one or more spaces)

Here is a procedure to create and fill a table:

proc loadTable {filename} {
set infile [open \$filename]

set table {}

#
# Read the file line by line until the end
# ([gets] will return a negative value then)
#
while { [gets \$infile line] >= 0 } {
if { [string index \$line 0] == "#" } {
continue
} else {
lappend table \$line
}
}
close \$infile

return \$table
}

Note: in the above procedure the [gets] command is used to get a single line from the file at a time. In this form ("gets \$infile line") it returns the number of characters it read or a negative number if it read past the end of the file. By checking for a positive value or zero (condition: >= 0), we read empty lines too. If an empty line signifies another section (for instance the beginning of the next table), then a condition like "> 0" could be used.
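
The following sketch shows this behaviour of [gets] on a tiny scratch file. The file name "gets_demo.txt" and its contents are made up for the illustration; the file is written to the current directory and removed again at the end.

```tcl
# Hypothetical scratch file, created only for this demonstration
set filename "gets_demo.txt"
set out [open $filename w]
puts $out "# a comment"
puts $out "X Y"
puts $out ""
puts $out "1.0 2.0"
close $out

# Read it back line by line, skipping comment lines
set in [open $filename]
set lines {}
while { [gets $in line] >= 0 } {
    if { [string index $line 0] eq "#" } {
        continue
    }
    lappend lines $line
}
close $in
file delete $filename

puts [llength $lines]   ;# 3: the header, the empty line and the data line
```

With the condition "> 0" instead, the loop would stop at the empty line and only the header would be collected.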

This procedure does not take care of:

• non-existent files (Tcl will throw an error when trying to open the file)
• lines that do not contain a proper list of numbers

That may harm us later on, so let us revise this code a bit:

proc loadTable {filename} {

#
# Open the file, catch any errors
#
if { [catch {
set infile [open \$filename]
} msg] } {
return -code error "Error opening \$filename: \$msg"
}

set table {}
set names 0

while { [gets \$infile line] >= 0 } {
if { [string index \$line 0] == "#" } {
continue
} else {
#
# We need to examine the lines:
# The first one contains strings, all others numbers
#
if { [catch {
set number_data [llength \$line]
} msg] } {
return -code error "Error reading \$filename: line can not be properly interpreted\nLine: \$line"
}

if { \$names == 0 } {
set names 1
set number_columns [llength \$line]
} else {
if { \$number_columns != [llength \$line] } {
return -code error "Error reading \$filename: incorrect number of columns\nLine: \$line"
}
foreach v \$line {
if { ![string is double \$v] } {
return -code error "Error reading \$filename: invalid number encountered - \$v\nLine: \$line"
}
}
}
lappend table \$line
}
}
close \$infile

return \$table
}

Checking the data increases the length of the code, but we can be reasonably sure now that an ill-formed file will not surprise us later - for instance, when trying to determine the maximum value of a column the result is "A" instead of a number.

Auxiliary procedures

This procedure can easily be used in an interactive session with tclsh or wish:

% set table [loadTable "input.data"]
% puts "Columns: [lindex \$table 0]"
% puts "Number of rows: [expr {[llength \$table]-1}]"

The drawback of the last two statements is that we need to use the implemented structure of our table directly. Rather than that, let us write a few auxiliary procedures that hide these details:

proc columnNames {table} {
return [lindex \$table 0]
}
proc numberRows {table} {
expr {[llength \$table]-1}
}
proc row {table index} {
if { \$index < 0 || \$index >= [numberRows \$table] } {
return -code error "Row index out of range"
} else {
return [lindex \$table [expr {\$index+1}]]
}
}
proc column {table index {withname 1}} {
if { \$index < 0 || \$index >= [llength [columnNames \$table]] } {
return -code error "Column index out of range"
} else {
set result {}
foreach row \$table {
lappend result [lindex \$row \$index]
}
}

# Keep the name?
if {\$withname} {
return \$result
} else {
return [lrange \$result 1 end]
}
}

For convenience we added an option to the "column" procedure: we may or may not want the name of the column included in the result. We can use it as follows:

• If we just want the data:
% column \$table 0 0
• If we want the name included (it could actually be used as a table with just one column):
% column \$table 0

The construction "{withname 1}" in the argument list makes the argument "withname" optional.
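
The mechanism of optional arguments is easy to try out in isolation. The procedure "firstItems" below is a hypothetical helper that exists only for this illustration:

```tcl
# "firstItems" is a hypothetical helper, only meant to show the mechanism:
# the argument "count" gets the default value 1 if it is left out
proc firstItems {list {count 1}} {
    return [lrange $list 0 [expr {$count-1}]]
}

puts [firstItems {a b c d}]      ;# a
puts [firstItems {a b c d} 3]    ;# a b c
```

Arguments with a default value should come after the required ones, otherwise they can not be left out in a call.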

Displaying the table

So now that we can load a table and examine its contents, it may be a good idea to view it in a window, just to get some feeling for the data. We will use the Tablelist package for this:

#
# Load the package and for convenience, import all procedures
#
package require tablelist
namespace import ::tablelist::*

#
# Variable to create a unique window name
#
set wcount 0

proc display {table} {
global wcount

#
# Out of precaution, check if "wcount" has been set
#
if { ![info exists wcount] } {
set wcount 0
}

#
# Create a list of columns for the tablelist widget
# (the numbers will be right aligned)
#
set columns {}
foreach name [columnNames \$table] {
lappend columns 0 \$name right
}

#
# A new toplevel window
#
incr wcount
set t .t_\$wcount
toplevel \$t

#
# The table list in that window and two scrollbars
#
set tbl \$t.tbl
set vsb \$t.vsb
set hsb \$t.hsb

tablelist \$tbl -columns \$columns -height 10 -width 100 \
-xscrollcommand [list \$hsb set] \
-yscrollcommand [list \$vsb set] \
-stripebackground lightblue
scrollbar \$vsb -orient vertical   -command [list \$tbl yview]
scrollbar \$hsb -orient horizontal -command [list \$tbl xview]

#
# Put them in the toplevel window
#
grid \$tbl \$vsb -sticky news
grid \$hsb      -sticky news
grid rowconfigure    \$t 0 -weight 1
grid rowconfigure    \$t 1 -weight 0
grid columnconfigure \$t 0 -weight 1
grid columnconfigure \$t 1 -weight 0

#
# Now put the data in
#
for { set i 0 } { \$i < [numberRows \$table] } { incr i } {
\$tbl insert end [row \$table \$i]
}
}

Nothing very difficult, except perhaps the creation of the three widgets and their cooperation:

• the -xscrollcommand option determines what command to run when the visible part of the Tablelist widget is changed programmatically. In this case the horizontal scrollbar is updated. Similarly for the vertical scrollbar.
• of course, something similar must happen when the user operates the scrollbars. So each scrollbar has a command option too, to help with this.
• the grid geometry manager is used to place the three widgets side by side and the -weight options make sure that the scrollbars can not get thicker (weight 0), they can only grow in the direction of their longer axis. The tablelist widget can grow in both ways.

Statistics

So far, so good: we can load a table, extract information and display the whole thing. What about statistical properties of the data? Well, we have the math::statistics package to deal with the raw data.

Let us write a small convenience procedure to create a table with some basic statistical parameters. (Note: the first column will be one with labels, rather than numbers, but that does not matter!):

package require math::statistics
namespace import ::math::statistics::*

proc descriptiveStats {table} {
#
# Determine the mean, maximum, minimum, standard deviation
# etc per column
#
set index 0
set means {Mean}
set min   {Minimum}
set max   {Maximum}
set ndata {"Number of data"}
set stdev {"Standard deviation"}
set var   {Variance}

foreach name [columnNames \$table] {
set data     [column \$table \$index 0] ;# Now the option is useful!
set statdata [basic-stats \$data]

lappend means [lindex \$statdata 0]
lappend min   [lindex \$statdata 1]
lappend max   [lindex \$statdata 2]
lappend ndata [lindex \$statdata 3]
lappend stdev [lindex \$statdata 4]
lappend var   [lindex \$statdata 5]
incr index
}

#
# Put them in a table ...
#
set result [list [concat "Parameter" [columnNames \$table]]]
lappend result \$means \$min \$max \$ndata \$stdev \$var

return \$result
}

And because Tablelist does not really care about the type of data we want to display:

% display [descriptiveStats \$table]

will display the statistical information just as it did the complete table.
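
To make the numbers in such a table less of a black box, the following sketch computes the mean, the sample variance and the sample standard deviation by hand, without the math::statistics package. It assumes the usual definition of the sample variance (division by n-1) and uses invented data:

```tcl
# Invented data, chosen so that the results are easy to check
set data {1.0 2.0 3.0 4.0 5.0}
set n    [llength $data]

# Mean: the sum of the values divided by their number
set sum 0.0
foreach x $data {
    set sum [expr {$sum + $x}]
}
set mean [expr {$sum / $n}]

# Sample variance: sum of squared deviations divided by n-1
set sumsq 0.0
foreach x $data {
    set sumsq [expr {$sumsq + ($x-$mean)*($x-$mean)}]
}
set var   [expr {$sumsq / ($n-1)}]
set stdev [expr {sqrt($var)}]

puts "mean = $mean, variance = $var"   ;# mean = 3.0, variance = 2.5
```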

Manipulating the data

Sometimes, we will want to sort the data by column and proceed with a table sorted by one or more columns. For one column this is pretty easy, as we saw in the previous example:

proc sort {table column} {

set names [columnNames \$table]
set index [lsearch \$names \$column]
if { \$index < 0 } {
return -code error "No column by that name: \$column"
}
set newtable [linsert \
[lsort -index \$index -real \
[lrange \$table 1 end]] 0 \$names]
return \$newtable
}

The above is suitable for a single column. It is only slightly more complicated to write a procedure that will sort the table by two or more columns - we will leave this as an exercise.
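
For readers who want a hint, here is one possible approach to that exercise, not the only one. It relies on the fact that [lsort] is stable: sorting by the least significant column first and by the most significant column last gives the combined order. The procedure name "sortByColumns" and the small table are made up for this sketch, and all columns are assumed to be numeric.

```tcl
# "sortByColumns" is a hypothetical name; the table layout is the one used
# in this tutorial: the first element holds the column names
proc sortByColumns {table columns} {
    set names [lindex $table 0]
    set rows  [lrange $table 1 end]

    # Work backwards: the least significant column is sorted first
    for { set i [expr {[llength $columns]-1}] } { $i >= 0 } { incr i -1 } {
        set column [lindex $columns $i]
        set index  [lsearch -exact $names $column]
        if { $index < 0 } {
            return -code error "No column by that name: $column"
        }
        set rows [lsort -index $index -real $rows]
    }
    return [linsert $rows 0 $names]
}

set table {{X Y} {2 1.5} {1 3.0} {2 0.5} {1 1.0}}
puts [sortByColumns $table {X Y}]   ;# {X Y} {1 1.0} {1 3.0} {2 0.5} {2 1.5}
```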

A sorted table is for instance useful when you want to plot the data in a graph: lines through data points are less messy when the data are sorted along, say, the x axis. And that is our next task.

Plotting the data

More than anything else, you will gain insight into the data if you can make a suitable picture. Let us try and create an XY-plot then. Again, there is a package for this sort of task: Plotchart, part of Tklib. (In fact, there are several such packages, see the references at the end.)

The procedure we have in mind will plot some column against another column:

plot \$table X Y

would plot the column with label Y against the column with label X in a suitably scaled graph. Here is the full code:

package require Plotchart
package require math::statistics
namespace import ::Plotchart::* ;# For convenience only

proc plot {table xvar yvar} {
global wcount

if { ![info exists wcount] } {
set wcount 0
}

# First find the column indices ...

set xid [lsearch -exact [columnNames \$table] \$xvar]
set yid [lsearch -exact [columnNames \$table] \$yvar]

if { \$xid < 0 || \$yid < 0 } {
return -code error "Either \$xvar or \$yvar is not a valid column for this table"
}

# Determine the extremes

set xdata [column \$table \$xid 0]
set ydata [column \$table \$yid 0]
set xmax  [::math::statistics::max \$xdata]
set xmin  [::math::statistics::min \$xdata]
set ymax  [::math::statistics::max \$ydata]
set ymin  [::math::statistics::min \$ydata]

# Set up the plot - similar to the display proc

incr wcount
set t .t_\$wcount
toplevel \$t
pack [canvas \$t.c -width 500 -height 300] -fill both

set plot [createXYPlot \$t.c [determineScale \$xmin \$xmax] \
[determineScale \$ymin \$ymax]]

# Plot the data

foreach x \$xdata y \$ydata {
\$plot plot "xy" \$x \$y
}
}

In this procedure we used [lsearch -exact] because [lsearch] will by default use so-called glob patterns for identifying the element. It is just a precaution, column names with ? or * are probably rare, but it may lead to surprises if you are not aware of this behaviour.

The command to add a data point to the plot uses a label ("xy" in the above code) to identify to which series the data point belongs. That way we can combine multiple series in one plot or change the colour or line style.

Our present procedure will create a new window with each call, but that can easily be changed, for instance by using an optional argument that indicates a new window should be created or the last one reused (but then the label should be changed too!).

As this tutorial is meant to be a tutorial on data analysis with Tcl, and not on data analysis per se, we will not develop a complete data analysis package. Filtering the data should, however, be part of this tutorial and will be the last topic.

With Tcl being a dynamic language, we can very easily create a filter procedure that will allow us to do:

set newtable [filter \$table {\$X > 0 && 2.0*\$Y < 1.5*\$Z}]

This particular condition may not be very useful, but it is (almost) trivial to implement in Tcl:

proc filter {table condition} {

set names [columnNames \$table]
set body  \
[string map [list NAMES \$names COND \$condition] {
set v::result [list {NAMES}]
foreach v::row [lrange \$table 1 end] {
foreach {NAMES} \$v::row {break}
if {COND} {
lappend v::result \$v::row
}
}
return \$v::result
}]

namespace eval v {}
proc filter_private {table} \$body

filter_private \$table
}

Okay, claiming this to be an (almost) trivial procedure is an exaggeration. Some sophisticated things are going on here!

First of all, we create a private namespace and a new procedure when we run [filter]. This is possible because creating a procedure with [proc] is nothing but running yet another command. [proc] is not special in any way (not like the definition of a function in C or Java).

Then to get a procedure with the right body, we use the [string map] command; it simply replaces the strings NAMES and COND with the values of the variables "names" and "condition".

The namespace "v" allows us to use private variables (a conflict with the column names of the table becomes very improbable that way).

Note: There is one serious limitation with this fragment: we can not use variables from the calling environment. For instance:

set a 1.0
set newtable [filter \$table {\$X > \$a}]

will not work, because in the private procedure the variable "a" is unknown.

Several methods are available to circumvent this limitation, but they are fairly sophisticated and we shall not discuss them here.
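
Just to give an impression, here is a rough sketch of one such method, based on [uplevel]: the column variables are set in the scope of the caller and the condition is evaluated there, so that the caller's own variables are visible. The procedure name "filterInCaller" is hypothetical, and the approach has an obvious drawback: it overwrites any variables in the calling scope that happen to have the same names as the columns.

```tcl
# "filterInCaller" is a hypothetical name for this sketch.
# Warning: it sets variables named after the columns in the caller's scope.
proc filterInCaller {table condition} {
    set names  [lindex $table 0]
    set result [list $names]

    foreach row [lrange $table 1 end] {
        # Set the column variables in the caller's scope ...
        foreach name $names value $row {
            uplevel 1 [list set $name $value]
        }
        # ... and evaluate the condition there as well
        if { [uplevel 1 [list expr $condition]] } {
            lappend result $row
        }
    }
    return $result
}

set table {{X Y} {-1.0 2.0} {0.5 3.0} {2.0 -1.0}}
set a 1.0
puts [filterInCaller $table {$X > $a}]   ;# {X Y} {2.0 -1.0}
```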

Further possibilities

The above method to create a temporary procedure can also be used to compute the result of one or more expressions using the column variables. For instance:

set newtable [compute \$table X {\$X} Sum {\$X+\$Y} Cosfunc {\$X/\$Y+cos(\$Z)}]

could lead to a new table with columns X, Sum and Cosfunc whose values are computed using the columns from the original table.
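
A minimal sketch of what such a "compute" procedure might look like is given below. For brevity it does not create a temporary procedure but evaluates the expressions directly with [expr]. It assumes that the column names do not clash with the local variables of the procedure (names, newnames, result, row, newrow, newname, expression, table, args), and the small table is invented.

```tcl
# A simplified sketch: the column names must not clash with the local
# variables used inside this procedure
proc compute {table args} {
    set names [lindex $table 0]

    # The new column names are the odd-numbered arguments
    set newnames {}
    foreach {newname expression} $args {
        lappend newnames $newname
    }
    set result [list $newnames]

    foreach row [lrange $table 1 end] {
        # Assign the values in the row to variables named after the columns
        foreach $names $row {break}

        set newrow {}
        foreach {newname expression} $args {
            lappend newrow [expr $expression]
        }
        lappend result $newrow
    }
    return $result
}

set table {{X Y} {1.0 2.0} {3.0 4.0}}
puts [compute $table X {$X} Sum {$X+$Y}]   ;# {X Sum} {1.0 3.0} {3.0 7.0}
```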

We could also create commands to combine tables into new tables or remove rows directly. We will leave such commands as an exercise.

References

Manipulating data sets like the above is a very common task and many specialised extensions exist.

Popular database extensions are:

• SQLite, a light-weight database management system, supporting SQL, transactions and the like [2 ]
• Metakit, a database management system geared to managing timeseries [3 ]

A new extension is Ratcl [4 ]. This is a continuation of the work done on Metakit with an emphasis on high performance manipulation of large amounts of data.

Other extensions we want to mention are:

• NAP, N-dimensional array processing, useful for instance for gridded data, such as those arising in oceanographic or meteorological applications [5 ]
• BLT, a collection of procedures for plotting data and manipulating vectors of data, among other things [6 ]
• dataplot, a standalone application for analysing data using sophisticated statistical methods [7 ]
