[Arjen Markus] 2007-12-20: This page is meant as a tutorial for the website Torsten Berg is designing.

''[EKB] FYI the dataplot link is broken. Is it [Dataplot]?''

** Data analysis with Tcl/Tk **

This is an introduction to Tcl/Tk with an emphasis on ''data analysis''. Familiarity with the basics of Tcl and programming in general is assumed: variables, loops, functions, if-statements and other programming constructs should be familiar already.

This tutorial is mostly concerned with numerical data, and the examples deal with gathering statistical information and plotting graphs from such data. It is no attempt to show the construction of a fully-featured data analysis package: that would be a considerable task indeed, not because it couldn't be done in Tcl, but because there are so many possibilities. Tcl is a good choice for this kind of work because it makes it easy to whip up a few dedicated computational procedures suited to a particular need.

** A first example: dealing with text **

Here is the first example task: given a file containing ordinary text, how could one classify its contents as a short story, a legal document or a scientific article? One approach could be to take an inventory of the words in each article, and then, based on their frequencies, to use some scheme to classify them. This tutorial shows how to take the inventory, leaving the classification scheme as an exercise for the reader.

The task at hand is almost trivial:

   * Read the file.
   * Split its contents into words, i.e., individual sequences of characters.
   * Count the occurrences of each word.
   * Print the words in descending order, the most frequently-used word first.

First, a suitable [data structure] is needed for the data. The primary choices in Tcl are:

   * [list%|%Lists]
   * [array%|%Associative arrays]
   * [dict%|%Dictionaries]

A list is similar to an ''array'' in [Fortran], [Perl], [C] or [Java]: it is an ordered collection of data. Each element of a list can be retrieved by the index of that element. A list is the most ubiquitous data structure in Tcl; in fact, every list has a syntax that makes it suitable for interpretation by Tcl as a command. Here are a few list commands that will be used later:

======
set alist {x y z}            ;# Construct a list and store it in
                             ;# variable "alist"
set second [lindex $alist 1] ;# Get the second element from the list
                             ;# "alist" - counting starts at 0
lappend alist w              ;# Append a new element to the end of the list
======

An [array%|%associative array] on the other hand is an '''unordered''' collection of variables and is similar to a '''hash''' in Perl. A [dict%|%dictionary] is even more similar, since keys in a dictionary are just keys, not variables. A variable in an array can be retrieved using its name. For example:

======
set count(occurrence) 1 ;# Set element "occurrence" in the
                        ;# array "count" to 1
parray count            ;# Quick and dirty print of all array
                        ;# elements and their values
array names count       ;# Get a list of all array elements
======

For the above counting problem, an associative array is suitable. We can then do something like:

======
set contents [readfile input.txt]

foreach word $contents {
    incr count($word)
}
parray count
======

This is of course only very rough code: it is necessary to first read the data, splitting the contents into rows and from there into words.
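As an aside, the counting could equally be done with the third data structure mentioned above, a dictionary: `[dict incr]` initialises missing keys automatically (Tcl 8.5 or later). A minimal sketch, again assuming `contents` holds the list of words:

======
# Aside: the same counting with a dictionary (Tcl 8.5 or later)
set count [dict create]
foreach word $contents {
    dict incr count $word    ;# a missing key starts at 0 automatically
}
dict for {word n} $count {
    puts "$word occurs $n time(s)"
}
======

We will stick with the associative array below.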
Therefore, the next step is to refine the rough code above:

======
proc readfile filename {
    set infile [open $filename]
    set data   [read $infile]
    close $infile
    return $data
}
======

That is enough to read the entire file and store the contents in a variable called `contents`.

To split the content into words, punctuation must be cleaned from the data. To keep things simple we assume that the only punctuation that occurs is period, apostrophe, quotation mark, question mark, exclamation mark, comma, semicolon and colon. Here, `[string map]` replaces each occurrence of the given punctuation characters with a space character, which is convenient later when the content is treated as a list:

======
set contents [string map {
    .  { }
    ;  { }
    :  { }
    '  { }
    !  { }
    ?  { }
    ,  { }
    \" { }
} $contents]
======

This formatting of the command is only meant to make the structure of the list of replacements clearer. There is no special meaning to doing it this way.

We should remember to convert the letters to lower-case (or upper-case), as "We" and "we" clearly represent the same word:

======
set contents [string tolower $contents]
======

And now, the grand finale: `$contents` can also be interpreted as a list of words, and we can count them by iterating through them using `[foreach]`:

======
foreach word $contents {
    if {[info exists count($word)]} {
        incr count($word)
    } else {
        set count($word) 1
    }
}
======

`[info exists]` indicates whether the variable already exists in the array so that it can be properly initialised. As of [Changes in Tcl/Tk 8.5%|%Tcl 8.5], `[incr]` no longer returns an error when a variable does not yet exist. Instead, it initialises the variable to `0` and then increments it. The construct above is still valid of course, and could be useful if something other than initialisation to zero is desired.

The last part of the task is to print the list of words in the correct order. That is a bit tricky: it would be easy to order the words alphabetically, but to order the words by frequency of occurrence, the data should be "inverted". `[lsort]` can do this if we put the frequency data in a list, actually a list of lists:

======
set frequencies {}
foreach word [array names count] {
    lappend frequencies [list $word $count($word)]
}

#
# The -index option is used to make [lsort] look inside the list
# elements
# The -integer option is used because by default [lsort] regards
# list elements as strings
# The -decreasing option puts the most frequent words first
#
set sorted [lsort -index 1 -integer -decreasing $frequencies]

# Print them to the screen
foreach frequency $sorted {
    puts $frequency
}
======

Of course, words can occur with the same frequency in the text we are examining; it might therefore be nicer to sort words of the same frequency in alphabetical order. And that is actually very simple, thanks to the sorting algorithm that is used - mergesort:

======
set sorted [lsort -index 1 -integer -decreasing [lsort -index 0 $frequencies]]
======

We first sort with respect to the words (element 0 in each sublist) and then with respect to the frequency (element 1). As we do not need the intermediate result, we combine the two actions in one command. The mergesort algorithm makes sure that any previous order among elements that have the same "sort value" is preserved.
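As an aside, the same two-key ordering can be achieved in a single pass with `[lsort] -command` and a custom comparison procedure. A minimal sketch (the two-pass version above is usually both shorter and faster):

======
# Compare two {word count} pairs: by count, descending,
# then alphabetically by word
proc compareFreq {a b} {
    set diff [expr {[lindex $b 1] - [lindex $a 1]}]
    if {$diff != 0} {
        return $diff
    }
    return [string compare [lindex $a 0] [lindex $b 0]]
}
set sorted [lsort -command compareFreq $frequencies]
======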
{ } , { } \" { } } $contents] set contents [string tolower $contents] # # Count the words # foreach word $contents { if {[info exists count($word)]} { incr count($word) } else { set count($word) 1 } } # # Convert the data into a form easily sorted # set frequencies {} foreach word [array names count] { lappend frequencies [list $word $count($word)] } set sorted [lsort -index 1 -integer -decreasing [ lsort -index 0 $frequencies]] # # Print them to the screen # foreach frequency $sorted { puts $frequency } ====== The result, when used on the first three paragraphs of this text is: ======none for 5 is 5 a 4 be 4 data 4 that 4 with 4 are 3 it 3 not 3 of 3 tcl 3 the 3 to 3 we 3 will 3 an 2 analysis 2 and 2 because 2 familiar 2 in 2 so 2 this 2 you 2 - 1 actually 1 all 1 assumed 1 at 1 attempt 1 basics 1 build 1 but 1 choice 1 computational 1 considerable 1 could 1 dealing 1 dedicated 1 do 1 doing 1 easy 1 emphasis 1 examples 1 few 1 fully-featured 1 functions 1 gathering 1 general 1 good 1 graphs 1 here 1 if-statements 1 indeed 1 information 1 introduction 1 kind 1 least 1 less 1 loops 1 many 1 more 1 mostly 1 needs 1 numerical 1 on 1 or 1 package 1 particular 1 plotting 1 possibilities 1 procedures 1 programming 1 should 1 shown 1 statistical 1 suitable 1 task 1 tcl/tk 1 there 1 these 1 they 1 up 1 variables 1 very 1 whip 1 why 1 work 1 would 1 your 1 ====== We forgot one punctuation character: the dash, so this shows up as a word as well, and of course we need to take care of hyphens and numerous other details. But that is not the point: we managed to do this analysis in a mere 30 lines of code. ''Note:'' This algorithm is very easy to parallelize. In a future version we might show how to do that using the Threads package. ** The second example: a table of data ** As promised at the beginning, our main focus is numeric data. So let us look at some simple table of fictitious data: X Y Z -3.52 -3.17 -3.68 -2.09 -1.76 -3.37 -2.12 -0.76 1.69 -0.20 -1.99 0.54 -0.65 0.92 6.67 1.65 1.02 1.91 1.09 2.75 6.42 3.61 4.79 6.16 4.73 4.84 6.43 4.75 5.30 8.98 The following script produced this data: ====== for {set i 0} {$i < 10} {incr i} { set x [expr {$i-5.0+2.0*rand()}] set y [expr {$i-5.0+3.0*rand()}] set z [expr {$i-5.0+10.0*rand()}] puts [format "%5.2f %5.2f %5.2f" $x $y $z] } ====== Operations one might perform on this data include: * Creating a graph of X versus Y or Z to see if there is any obvious relation. * Selecting those records that fulfill some relation (only positive data for instance). * Performing statistical analysis on the data: linear regression for instance. * Sorting the records wrt X, assuming that is the independent variable. In order to manipulate this data programmatically, it should placed into some structure. Alhtough an associative array could work, ====== # # A variable in an array can be named anything, even a number a real number # set array(-3.52) {-3.17 -3.68} ... ====== it does not appear to be convenient: Associative arrays are unordered. It would be necessary for example to examine all element names want to select only positive X-values for instance, that is: * Get a list of element names * Select the positive values only * Create a new array In this case using a list of lists would save code: no need to get a list of X-values first, we already have that. 
Furthermore, if we '''include the names of the columns''', we get a self-describing table:

======
set table {
    {    X     Y     Z }
    {-3.52 -3.17 -3.68 }
    {-2.09 -1.76 -3.37 }
    {-2.12 -0.76  1.69 }
    {-0.20 -1.99  0.54 }
    {-0.65  0.92  6.67 }
    { 1.65  1.02  1.91 }
    { 1.09  2.75  6.42 }
    { 3.61  4.79  6.16 }
    { 4.73  4.84  6.43 }
    { 4.75  5.30  8.98 }
}

puts "Column names: [lindex $table 0]"
puts "Column 1:"
foreach record $table {
    puts [lindex $record 0]
}
======

Now that the data is structured, we define a format for storing large sets of data:

   * The file can contain comment lines, starting with a # in the first position.
   * The first line that is not a comment line contains the names of the columns.
   * All other lines contain the values per record (separated by one or more spaces).

Here is a procedure to create and fill a table:

======
proc loadTable filename {
    set infile [open $filename]
    set table  {}

    #
    # Read the file line by line until the end
    # ([gets] will return a negative value then)
    #
    while {[gets $infile line] >= 0} {
        if {[string index $line 0] eq {#}} {
            continue
        } else {
            lappend table $line
        }
    }
    close $infile
    return $table
}
======

''Note:'' in the above procedure `[gets]` is used to read one line at a time from the file. In this form ("gets $infile line") it returns the ''number of characters'' it read, or a negative number if it read past the end of the file. By checking for a positive value or zero (condition: >= 0), we read empty lines too. If an empty line signifies another section (for instance the beginning of the next table), then a condition like "> 0" could be used.

This procedure does not take care of:

   * non-existent files (Tcl returns an error when trying to open the file)
   * lines that do not contain a proper list of numbers

That may harm us later on, so let us revise this code a bit:

======
proc loadTable filename {
    #
    # Open the file, catch any errors
    #
    if {[catch {
        set infile [open $filename]
    } msg]} {
        return -code error "Error opening $filename: $msg"
    }

    set table {}
    set names 0

    while {[gets $infile line] >= 0} {
        if {[string index $line 0] eq {#}} {
            continue
        } else {
            #
            # We need to examine the lines:
            # the first one contains strings, all others numbers
            #
            if {[catch {
                set number_data [llength $line]
            } msg]} {
                return -code error "Error reading $filename:\
                    line can not be properly interpreted\nLine: $line"
            }
            if {$names == 0} {
                set names 1
                set number_columns [llength $line]
            } else {
                if {$number_columns != [llength $line]} {
                    return -code error "Error reading $filename:\
                        incorrect number of columns\nLine: $line"
                }
                foreach v $line {
                    if {![string is double $v]} {
                        return -code error "Error reading $filename:\
                            invalid number encountered - $v\nLine: $line"
                    }
                }
            }
            lappend table $line
        }
    }
    close $infile
    return $table
}
======

Checking the data increases the length of the code, but we can be reasonably sure now that an ill-formed file will not surprise us later - for instance, when trying to determine the maximum value of a column and getting "A" as the result instead of a number.
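Since the revised `loadTable` reports problems as Tcl errors, a caller can deal with them explicitly. A minimal sketch ("input.data" is just an example file name):

======
# A minimal sketch: load a table defensively
if {[catch {
    set table [loadTable "input.data"]
} msg]} {
    puts stderr "Could not load the table: $msg"
}
======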
'''Auxiliary procedures'''

This procedure can easily be used in an interactive session with tclsh or wish:

======none
% set table [loadTable "input.data"]
% puts "Columns: [lindex $table 0]"
% puts "Number of rows: [expr {[llength $table]-1}]"
======

The drawback of the last two statements is that they rely directly on how the table is implemented. Instead, some auxiliary procedures can provide an abstraction:

======
proc columnNames table {
    return [lindex $table 0]
}

proc numberRows {table} {
    expr {[llength $table]-1}
}

proc row {table index} {
    if {$index < 0 || $index >= [numberRows $table]} {
        return -code error "Row index out of range"
    } else {
        return [lindex $table [expr {$index+1}]]
    }
}

proc column {table index {withname 1}} {
    if {$index < 0 || $index >= [llength [columnNames $table]]} {
        return -code error "Column index out of range"
    } else {
        set result {}
        foreach row $table {
            lappend result [lindex $row $index]
        }
    }

    # Keep the name?
    if {$withname} {
        return $result
    } else {
        return [lrange $result 1 end]
    }
}
======

For convenience we added an option to the "column" procedure: we may or may not want the name of the column included in the result. We can use it as follows:

   * If we just want the data:

======none
% column $table 0 0
======

   * If we want the name included (the result could actually be used as a table with just one column):

======none
% column $table 0
======

To make $withname optional, a default value of `1` is provided in the procedure definition.

''Displaying the table''

So now that we can load a table and examine its contents, it may be a good idea to view it in a window, just to get some feeling for the data. We will use the Tablelist package for this:

======
#
# Load the package and, for convenience, import all procedures
#
package require tablelist
namespace import ::tablelist::*

#
# Variable to create a unique window name
#
set wcount 0

proc display table {
    global wcount

    #
    # Out of precaution, check if "wcount" has been set
    #
    if {![info exists wcount]} {
        set wcount 0
    }

    #
    # Create a list of columns for the tablelist widget
    # (the numbers will be right aligned)
    #
    set columns {}
    foreach name [columnNames $table] {
        lappend columns 0 $name right
    }

    #
    # A new toplevel window
    #
    incr wcount
    set t .t_$wcount
    toplevel $t

    #
    # The tablelist in that window and two scrollbars
    #
    set tbl $t.tbl
    set vsb $t.vsb
    set hsb $t.hsb
    tablelist $tbl -columns $columns -height 10 -width 100 \
        -xscrollcommand [list $hsb set] \
        -yscrollcommand [list $vsb set] \
        -stripebackground lightblue
    scrollbar $vsb -orient vertical   -command [list $tbl yview]
    scrollbar $hsb -orient horizontal -command [list $tbl xview]

    #
    # Put them in the toplevel window
    #
    grid $tbl $vsb -sticky news
    grid $hsb      -sticky news
    grid rowconfigure    $t 0 -weight 1
    grid rowconfigure    $t 1 -weight 0
    grid columnconfigure $t 0 -weight 1
    grid columnconfigure $t 1 -weight 0

    #
    # Now put the data in
    #
    for {set i 0} {$i < [numberRows $table]} {incr i} {
        $tbl insert end [row $table $i]
    }
}
======

Nothing very difficult, except perhaps the creation of the three widgets and their cooperation:

   * The -xscrollcommand option determines what command to run when the visible part of the tablelist widget changes. In this case the horizontal scrollbar is updated; similarly for the vertical scrollbar.
   * Of course, something similar must happen when the user operates the scrollbars, so each scrollbar has a -command option too, to help with this.
   * The grid geometry manager is used to place the three widgets side by side, and the -weight options make sure that the scrollbars can not get thicker (weight 0): they can only grow in the direction of their longer axis. The tablelist widget can grow in both directions.
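With the table from the earlier interactive session still loaded, viewing it now takes a single command (each call opens a new toplevel window):

======none
% display $table
======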
'''Statistics'''

So far, so good: we can load a table, extract information and display the whole thing. What about statistical properties of the data? Well, we have the math::statistics package to deal with the raw data. Let us write a small convenience procedure to create a table with some basic statistical parameters. (Note: the first column will be one with labels, rather than numbers, but that does not matter!)

======
package require math::statistics
namespace import ::math::statistics::*

proc descriptiveStats table {
    #
    # Determine the mean, maximum, minimum, standard deviation
    # etc. per column
    #
    set index 0
    set means {Mean}
    set min   {Minimum}
    set max   {Maximum}
    set ndata [list "Number of data"]
    set stdev [list "Standard deviation"]
    set var   {Variance}
    foreach name [columnNames $table] {
        set data     [column $table $index 0] ;# Now the option is useful!
        set statdata [basic-stats $data]
        lappend means [lindex $statdata 0]
        lappend min   [lindex $statdata 1]
        lappend max   [lindex $statdata 2]
        lappend ndata [lindex $statdata 3]
        lappend stdev [lindex $statdata 4]
        lappend var   [lindex $statdata 5]
        incr index
    }

    #
    # Put them in a table ...
    #
    set result [list [concat "Parameter" [columnNames $table]]]
    lappend result $means $min $max $ndata $stdev $var
    return $result
}
======

And because Tablelist does not really care about the type of data we want to display:

======none
% display [descriptiveStats $table]
======

will display the statistical information just as it did the complete table.

''Manipulating the data''

Sometimes we will want to sort the data by column and proceed with a table sorted by one or more columns. For one column this is pretty easy, as we saw in the previous example:

======
proc sort {table column} {
    set names [columnNames $table]
    set index [lsearch $names $column]
    if {$index < 0} {
        return -code error "No column by that name: $column"
    }

    set newtable [concat [list $names] \
        [lsort -index $index -double [lrange $table 1 end]]]
    return $newtable
}
======

The above is suitable for a single column. It is only slightly more complicated to write a procedure that will sort the table by two or more columns - we will leave this as an exercise.

A sorted table is for instance useful when you want to plot the data in a graph: lines through data points are less messy when the data are sorted along, say, the x-axis. And that is our next task.

''Plotting the data''

More than anything else, you will gain insight into the data if you can make a suitable picture. Let us try and create an XY-plot then. Again, there is a package for that sort of task - in fact, there are several (see the references at the end) - Plotchart, part of Tklib. The procedure we have in mind will plot one column against another:

======
plot $table X Y
======

would plot the column with label Y against the column with label X in a suitably scaled graph. Here is the full code:

======
package require Plotchart
package require math::statistics
namespace import ::Plotchart::*   ;# For convenience only

proc plot {table xvar yvar} {
    global wcount

    if {![info exists wcount]} {
        set wcount 0
    }

    # First find the column indices ...
    set xid [lsearch -exact [columnNames $table] $xvar]
    set yid [lsearch -exact [columnNames $table] $yvar]
    if {$xid < 0 || $yid < 0} {
        return -code error \
            "Either $xvar or $yvar is not a valid column for this table"
    }

    # Determine the extremes
    set xdata [column $table $xid 0]
    set ydata [column $table $yid 0]
    set xmax  [::math::statistics::max $xdata]
    set xmin  [::math::statistics::min $xdata]
    set ymax  [::math::statistics::max $ydata]
    set ymin  [::math::statistics::min $ydata]

    # Set up the plot - similar to the display proc
    incr wcount
    set t .t_$wcount
    toplevel $t
    pack [canvas $t.c -width 500 -height 300] -fill both

    set plot [createXYPlot $t.c \
        [determineScale $xmin $xmax] \
        [determineScale $ymin $ymax]]

    # Plot the data
    foreach x $xdata y $ydata {
        $plot plot xy $x $y
    }
}
======
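A quick look at the relations in our example table is then a matter of one command per pair of columns (a usage sketch; the labels are those of the example table):

======none
% plot $table X Y
% plot $table X Z
======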
In this procedure we used `[lsearch] -exact`, because `[lsearch]` by default uses so-called glob patterns to identify the element. This is just a precaution - column names containing ? or * are probably rare - but it may lead to surprises if this behaviour is not taken into consideration.

The command to add a data point to the plot uses a label ("xy" in the above code) to identify the series the data point belongs to. That way we can combine multiple series in one plot, or change the colour or line style per series. Our present procedure will create a new window with each call, but that can easily be changed, for instance by using an optional argument that indicates whether a new window should be created or the last one reused (but then the label should be changed too!).

''Advanced facilities''

As this tutorial is meant to be a tutorial on data analysis with Tcl, and not on data analysis per se, we will not develop a complete data analysis package. Filtering the data should, however, be part of this tutorial and will be the last topic.

Since Tcl is a dynamic language, it is easy to create a filter procedure that will allow us to do things like:

======
set newtable [filter $table {$X > 0 && 2.0*$Y < 1.5*$Z}]
======

This particular condition may not be very useful, but it is almost trivial to implement in Tcl:

======
proc filter {table condition} {
    set names [columnNames $table]
    set body [string map [list @NAMES@ [list $names] @COND@ [list $condition]] {
        set v::result [list @NAMES@]
        foreach v::row [lrange $table 1 end] {
            foreach @NAMES@ $v::row break
            if @COND@ {
                lappend v::result $v::row
            }
        }
        return $v::result
    }]

    namespace eval v {}
    proc filter_private table $body
    filter_private $table
}
======

Okay, claiming this to be an almost trivial procedure is an exaggeration - some sophisticated things are going on here!

First, `filter` (re)creates the helper procedure `filter_private` with a custom body; in Tcl, `[proc]` can be used at any time to create a new procedure. `[string map]` generates the desired body by replacing `@NAMES@` and `@COND@` with the values of `$names` and `$condition`, each placed in a [list] so that it is properly quoted as a single word in the body.

The variables `result` and `row` live in the namespace "v", so that the procedure has no local variables that could conflict with the names of the columns in the table, each of which becomes a local variable. Otherwise, a column named "row" or "result" would cause trouble.

''Note:'' there is one serious limitation with this fragment: we can not use variables from the calling environment. For instance:

======
set a 1.0
set newtable [filter $table {$X > $a}]
======

does not work, because in the private procedure the variable "a" is unknown.
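One simple workaround - a sketch only - is to substitute the value into the condition text before handing it to `filter`, using the same `[string map]` trick:

======
# A minimal sketch: expand the value of "a" into the condition
# before filter builds the private procedure
set a 1.0
set newtable [filter $table [string map [list @A@ $a] {$X > @A@}]]
======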
More general methods are available to circumvent this limitation, but they are fairly sophisticated and we shall not discuss them here.

''Further possibilities''

The above method of creating a temporary procedure can also be used to compute the result of one or more expressions using the column variables. For instance:

======
set newtable [compute $table X {$X} Sum {$X+$Y} Cosfunc {$X/$Y+cos($Z)}]
======

could lead to a new table with columns X, Sum and Cosfunc, whose values are computed using the columns from the original table. We could also create commands to combine tables into new tables, or to remove rows directly. We will leave such commands as an exercise.

''References''

Manipulating data sets like the one above is a very common task and many specialised extensions exist. Popular database extensions are:

   * SQLite, a light-weight database management system, supporting SQL, transactions and the like [http://wiki.tcl.tk/sqlite]
   * Metakit, a database management system geared to managing timeseries [http://wiki.tcl.tk/metakit]

A newer extension is [http://wiki.tcl.tk/ratcl%|%Ratcl]. This is a continuation of the work done on Metakit, with an emphasis on high-performance manipulation of large amounts of data.

Other extensions we want to mention are:

   * NAP, N-dimensional array processing, useful for instance for gridded data, such as those arising in oceanographic or meteorological applications [http://wiki.tcl.tk/nap]
   * BLT, a collection of procedures for plotting data and manipulating vectors of data, among other things [http://wiki.tcl.tk/blt]
   * dataplot, a standalone application for analysing data using sophisticated statistical methods [http://www.nist.gov/dataplot]

** Page Authors **

[Arjen Markus]:
[pyk]:

<<categories>> Tutorial