Arjen Markus (20 july 2009) The math::statistics package in Tcllib contains a procedure to determine basic statistical parameters, mean, maximum, standard deviation and so on. This procedure (like the rest of the procedures) assumes that missing values are given by empty strings, so it contains constructs like:
if { $elem != {} } { ... assume it is a valid number and do the computations ... }
It would be nicer to first filter the data. That way the code can be much simpler. However, that means extra work. How does this impact the performance?
Another aspect of the code is the very check itself: it forces all elements in the list to acquire a string representation. So, the first step is to investigate the merits of various ways to check whether a variable holds a number or not.
The script below uses three methods:
# notanumber.tcl -- # Try three ways of establishing something is not a number # # One way: # A missing value is represented by an empty string # # Second way: # A missing value is simply not representable as a number, # so arithmetic will fail # # Third way: # Can it be interpreted as a double value? proc noFiltering {size threshold} { # # Create a list of numbers and empty strings # set dummy [expr {srand(10000)}] set list {} for {set i 0} {$i < $size} {incr i} { set value [expr {rand()}] if { $value < $threshold } { lappend list $value } else { lappend list {} } } } proc emptyString {size threshold} { # # Create a list of numbers and empty strings # set dummy [expr {srand(10000)}] set list {} for {set i 0} {$i < $size} {incr i} { set value [expr {rand()}] if { $value < $threshold } { lappend list $value } else { lappend list {} } } # # Note: # Implicit conversion of numbers to strings ... # set filtered {} foreach elem $list { if { $elem != {} } { lappend filtered $elem } } } proc failingArithmetic {size threshold} { # # Create a list of numbers and empty strings # set dummy [expr {srand(10000)}] set list {} for {set i 0} {$i < $size} {incr i} { set value [expr {rand()}] if { $value < $threshold } { lappend list $value } else { lappend list {} } } # # Note: # No conversion, but catch may be expensive # set filtered {} foreach elem $list { catch { if { $elem+0 == $elem } { lappend filtered $elem } } } } proc stringIsDouble {size threshold} { # # Create a list of numbers and empty strings # set dummy [expr {srand(10000)}] set list {} for {set i 0} {$i < $size} {incr i} { set value [expr {rand()}] if { $value < $threshold } { lappend list $value } else { lappend list {} } } # # Note: # No conversion, but catch may be expensive # set filtered {} foreach elem $list { catch { if { [string is double -strict $elem } { lappend filtered $elem } } } } # main -- # Measure the performance # puts "Threshold: 1.1 - no missing elements" set threshold 1.1 foreach size {100 1000 10000 100000} { noFiltering $size $threshold ;# Try to eliminate initial effects puts "Size: $size" puts " No filtering: [time {noFiltering $size $threshold} 100]" puts " Empty string: [time {emptyString $size $threshold} 100]" puts " Arithmetic: [time {failingArithmetic $size $threshold} 100]" puts " IsDouble: [time {stringIsDouble $size $threshold} 100]" } puts "Threshold: 0.9 - 10% missing elements" set threshold 0.9 foreach size {100 1000 10000 100000} { noFiltering $size $threshold puts "Size: $size" puts " No filtering: [time {noFiltering $size $threshold} 100]" puts " Empty string: [time {emptyString $size $threshold} 100]" puts " Arithmetic: [time {failingArithmetic $size $threshold} 100]" puts " IsDouble: [time {stringIsDouble $size $threshold} 100]" }
The results for Tcl 8.4 are:
Threshold: 1.1 - no missing elements Size: 100 No filtering: 48.55 microseconds per iteration Empty string: 270.8 microseconds per iteration Arithmetic: 97.16 microseconds per iteration IsDouble: 776.25 microseconds per iteration Size: 1000 No filtering: 488.72 microseconds per iteration Empty string: 2821.12 microseconds per iteration Arithmetic: 967.46 microseconds per iteration IsDouble: 8123.06 microseconds per iteration Size: 10000 No filtering: 5295.6 microseconds per iteration Empty string: 28864.72 microseconds per iteration Arithmetic: 10243.75 microseconds per iteration IsDouble: 78343.68 microseconds per iteration Size: 100000 No filtering: 52861.96 microseconds per iteration Empty string: 290535.2 microseconds per iteration Arithmetic: 103031.38 microseconds per iteration IsDouble: 787822.18 microseconds per iteration Threshold: 0.9 - 10% missing elements Size: 100 No filtering: 49.97 microseconds per iteration Empty string: 252.07 microseconds per iteration Arithmetic: 140.11 microseconds per iteration IsDouble: 779.23 microseconds per iteration Size: 1000 No filtering: 490.99 microseconds per iteration Empty string: 2673.54 microseconds per iteration Arithmetic: 1415.15 microseconds per iteration IsDouble: 7788.95 microseconds per iteration Size: 10000 No filtering: 5181.49 microseconds per iteration Empty string: 26969.98 microseconds per iteration Arithmetic: 14954.56 microseconds per iteration IsDouble: 78017.08 microseconds per iteration Size: 100000 No filtering: 52016.61 microseconds per iteration Empty string: 268409.3 microseconds per iteration Arithmetic: 148799.84 microseconds per iteration IsDouble: 781611.65 microseconds per iteration
The results for Tcl 8.5 are:
Threshold: 1.1 - no missing elements Size: 100 No filtering: 71.16 microseconds per iteration Empty string: 2027.37 microseconds per iteration Arithmetic: 147.0 microseconds per iteration IsDouble: 761.81 microseconds per iteration Size: 1000 No filtering: 698.12 microseconds per iteration Empty string: 20335.29 microseconds per iteration Arithmetic: 1474.66 microseconds per iteration IsDouble: 7698.57 microseconds per iteration Size: 10000 No filtering: 7146.59 microseconds per iteration Empty string: 202507.52 microseconds per iteration Arithmetic: 14508.37 microseconds per iteration IsDouble: 76518.43 microseconds per iteration Size: 100000 No filtering: 71542.95 microseconds per iteration Empty string: 2026748.83 microseconds per iteration Arithmetic: 147370.6 microseconds per iteration IsDouble: 776199.29 microseconds per iteration Threshold: 0.9 - 10% missing elements Size: 100 No filtering: 72.4 microseconds per iteration Empty string: 1861.33 microseconds per iteration Arithmetic: 222.35 microseconds per iteration IsDouble: 756.02 microseconds per iteration Size: 1000 No filtering: 695.34 microseconds per iteration Empty string: 18853.98 microseconds per iteration Arithmetic: 2235.62 microseconds per iteration IsDouble: 7719.51 microseconds per iteration Size: 10000 No filtering: 7012.26 microseconds per iteration Empty string: 186777.61 microseconds per iteration Arithmetic: 25364.57 microseconds per iteration IsDouble: 78288.05 microseconds per iteration Size: 100000 No filtering: 74202.32 microseconds per iteration Empty string: 1891164.67 microseconds per iteration Arithmetic: 230277.45 microseconds per iteration IsDouble: 768754.55 microseconds per iteration
In both cases the arithmetic method is much faster than the other two, but the empty string method is very slow in Tcl 8.5.
Donal Fellows was able to boost the [string is double] method to give almost the same performance as the arithmetic method.
DKF: From my testing of the HEAD immediately prior to working on this:
% time {string is integer -strict 123} 100000 1.7875280199999999 microseconds per iteration % time {string is integer -strict 123; incr i} 100000 2.10694826 microseconds per iteration % time {string is integer -strict [incr i]} 100000 2.3593656 microseconds per iteration
And just before committing that fix (i.e., with it applied to my sandbox):
% set x 123; time {catch {incr x}} 1000 0.813359 microseconds per iteration % time {catch {expr {$x+0}}} 1000 0.869574 microseconds per iteration % time {catch {incr x 0}} 100000 0.54518436 microseconds per iteration % time {catch {expr {$x+0}}} 100000 0.58866935 microseconds per iteration % set x foo foo % time {catch {incr x 0}} 100000 5.1383236 microseconds per iteration % time {catch {expr {$x+0}}} 100000 6.09238139 microseconds per iteration % time {string is integer -strict $x} 100000 1.86606379 microseconds per iteration % time {string is integer -strict $x} 100000 1.85336441 microseconds per iteration % set x 123 123 % time {catch {expr {$x+0}}} 100000 0.58738074 microseconds per iteration % set x "" % time {catch {expr {$x+0}}} 100000 5.9340998 microseconds per iteration % time {string is integer -strict $x} 100000 1.87591083 microseconds per iteration
As you can see, the is-an-integer case is now much faster than before, but the not-an-integer case (when using catch with incr or a suitable expression is very slow) is, though slower than the is-an-integer case, is much more rapid. This is because it is not an error case; there is no building of an errorInfo trace.