if 0 {[Richard Suchenwirth] 2004-03-28 - This weekend I took some work home, and I'm even reporting about it in the Wiki :) But there are no disclosure problems involved: this is about the C4.5 method of "data mining" - creating a decision tree from learning data, as published by J.R. Quinlan in his book ''C4.5: Programs for Machine Learning'' (1993).

The idea is: given a sample set of learning cases, each having a number of attributes ("features") and labeled with the target class it belongs to, create a tree-shaped construct of conditions and outcomes (sounds like magic, but is basically a more or less deeply nested ''[if]...elseif...else'' construct) such that for a new case, the correct class is returned as often as possible (i.e., with a low error rate). Attributes can be ''discrete'', where you can test only for (in)equality or value subset membership, or numeric and ''continuous'', where you can test whether a certain threshold is surpassed.

[Decision trees] follow a "divide and conquer" strategy: at every condition (node in the tree), try to exclude as many possibilities as possible, so that you quickly reach either a single possibility (entropy = 0) or a probabilistic result (alternatives weighted with percentages) that cannot be subdivided further. The reduction of possibilities is measured here by entropy. For details, see the comments below.

When I'm faced with mathematical matter, my frequent reflex is to implement it in Tcl, to see whether I understood it correctly. So here's my Tcl implementation of central parts of the C4.5 system. C4.5 can also be downloaded as C sources from (*), but I didn't want to rewrite C code in Tcl - rather, I built it up from scratch and practiced minimalism...}

 #-- Entropy is a classic information measure (unit is bits): -sum p(class)*log2(p(class))
 proc entropy cases {
    set sum 0
    foreach class [lsort -unique $cases] {
        set freqN [expr {1.0*[freq $class $cases]/[llength $cases]}]
        set sum [expr {$sum + $freqN * -[log2 $freqN]}]
    }
    set sum
 }

 #-- How often does a class occur in a set?
 proc freq {class set} {llength [lsearch -all $set $class]}

 #-- Logarithm to base 2
 proc log2 x {expr {log($x) / log(2)}}

 #-- Entropy of a partition, a list of sublists of cases (weighted by part size)
 proc pEntropy partition {
    set sum 0.0
    set N [llength [join $partition]]
    foreach part $partition {
        set sum [expr {$sum + 1.*[llength $part]/$N * [entropy $part]}]
    }
    set sum
 }

if 0 {For measuring how well a partition reduces entropy over a case set, Quinlan first proposes the ''gain'' function: the gain of a given partition (the result of applying a condition to the learning cases in question) is simply the difference between "before" and "after":}

 proc gain partition {
    expr {[entropy [join $partition]] - [pEntropy $partition]}
 }

if 0 {However, on multi-valued features (the extreme being unique IDs), the gain measure acts as if a perfect partition had been found - one of little use, though, because it is over-adapted to the learning set. So Quinlan proposes the ''gain ratio'' as an alternative, which mostly represents the potential information of a partition better - except in very unbalanced cases:}

 proc gainRatio partition {
    if {[llength $partition] <= 1} {return 999} ;# guard against division by zero
    expr {[gain $partition] / [split_info $partition]}
 }

 proc split_info partition {
    set sum 0.0
    set N [llength [join $partition]]
    foreach part $partition {
        set share [expr {1.0*[llength $part]/$N}]
        set sum [expr {$sum + $share * -[log2 $share]}]
    }
    set sum
 }

 #-- Testing examples, should match the examples in the book
 foreach partition {
    {{1 0 0 0 1} {1 1 1 1} {0 0 1 1 1}}
    {{1 1 1 0 0 0} {1 1 1 1 1 1 0 0}}
 } {
    puts [list partition $partition]
    puts entropy=[entropy [join $partition]]
    puts pEntropy=[pEntropy $partition]
    puts gain=[gain $partition]
    puts gainRatio=[gainRatio $partition]
 }
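if 0 {To put numbers on the multi-valued problem mentioned above, here is a small demonstration with toy data of my own (not from the book): splitting four cases by something like a unique ID earns the same gain as the honest two-way split, but the gain ratio clearly prefers the latter:}

 puts [gain {{1} {1} {0} {0}}]       ;# 1.0 - an ID-like split looks "perfect"
 puts [gain {{1 1} {0 0}}]           ;# 1.0 - same gain as the real split
 puts [gainRatio {{1} {1} {0} {0}}]  ;# 0.5 - penalized for fragmenting
 puts [gainRatio {{1 1} {0 0}}]      ;# 1.0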
if 0 {So far, so good. The tools are at hand, and work just as the doctor ordered. Let's jump into the water with the first toy example in Quinlan's book, where a decision tree for "to play or not to play" (some unspecified ball game) is constructed from the following meteorological data (temperature obviously being in degrees Fahrenheit):}

 set features {outlook temp humidity windy Result}
 set sample {
    {sunny    75 70 yes Play}
    {sunny    80 90 yes Don't}
    {sunny    85 85 no  Don't}
    {sunny    72 95 no  Don't}
    {sunny    69 70 no  Play}
    {overcast 72 90 yes Play}
    {overcast 83 78 no  Play}
    {overcast 64 65 yes Play}
    {overcast 81 75 no  Play}
    {rain     71 80 yes Don't}
    {rain     65 70 yes Don't}
    {rain     75 80 no  Play}
    {rain     68 80 no  Play}
    {rain     70 96 no  Play}
 }

if 0 {We have the tri-state feature "outlook", two continuous numeric features, a binary one ("windy"), and finally the expected decision, which must not be tested as a feature because it is the result domain (having it in Title case helps to mark this distinction). In C4.5, each feature must be declared with its value range, or as 'continuous', and likewise the set of result values. In Tcl it's easy to deduce (or guess?) these from the sample itself. First, let's just extract a "column" from the sample - we'll need that more often:}

 proc column {sample index} {
    set res {}
    foreach case $sample {lappend res [lindex $case $index]}
    set res
 }

if 0 {With some heuristics, a column can be classified as being discrete or continuous. The following procedure returns a pair of ''type'' and ''range'', the latter being an enumeration of values for ''discrete'', and the lower and upper bounds for ''continuous'':}

 proc featureType column {
    set values [lsort -unique $column]
    if {[llength $values]>3 && [allNumeric $values]} {
        set values [lsort -real $values] ;# covers integers as well
        list continuous [list [lindex $values 0] [lindex $values end]]
    } else {list discrete $values}
 }

 proc allNumeric list {expr {![catch {expr [join $list +]}]}}

 #-- Test:
 set i -1
 foreach feature $features {
    puts $feature:[featureType [column $sample [incr i]]]
 }

if 0 {passes:

 outlook:discrete {overcast rain sunny}
 temp:continuous {64 85}
 humidity:continuous {65 96}
 windy:discrete {no yes}
 Result:discrete {Don't Play}

Now for building the tree. As one might imagine, this works recursively: given a non-trivial sample (i.e. one with more than one class and more than one case), rate all available conditions by gain or gainRatio, select the best, split the sample according to that condition, and recurse over the parts of the partition:}

 proc buildDecisionTree {sample features criterion} {
    #-- trivial samples (a single case, or a single class) become leaves
    if {[llength $sample]<=1} {
        return [list return [lindex $sample 0 end]]
    }
    if {[llength [lsort -unique [column $sample end]]]<=1} {
        return [list return [lindex $sample 0 end]]
    }
    #-- rate the best partition of every feature, keep the best overall
    set best -1
    set index 0
    foreach feature [lrange $features 0 end-1] {
        set ptn [bestPartition $sample $feature $index $criterion]
        set value [$criterion [slice $ptn end]]
        if {$value > $best && [llength $ptn]>1} {
            set best $value
            set bestPtn $ptn
        }
        incr index
    }
    #-- render the tree as well-formed Tcl...
    set word "if"
    set res ""
    foreach {cond part} $bestPtn {
        append res "$word $cond \{\n\t"
        append res [buildDecisionTree $part $features $criterion]
        set word "\} elseif"
    }
    append res \}
 }

if 0 {The generic routine for finding a best partition dispatches on feature type, ''discrete'' or ''continuous'', by constructing the appropriate proc name:}

 proc bestPartition {sample feature index criterion} {
    set col [column $sample $index]
    foreach {type range} [featureType $col] break
    best-$type $sample $feature $index $range $criterion
 }

if 0 {For "best-discrete", just the given values are used. This can make the tree more than binary, just as in C4.5:}

 proc best-discrete {sample feature index range criterion} {
    #-- fan out cases according to the index-th feature, into variables
    foreach case $sample {
        lappend _[lindex $case $index] $case
    }
    set res {}
    foreach value $range {
        lappend res "{$$feature eq {$value}}" [set _$value]
    }
    set res
 }

if 0 {For "best-continuous", all values in the range are tried, to see which threshold gives the best partition:}

 proc best-continuous {sample feature index range criterion} {
    set col [column $sample $index]
    foreach {from to} $range break
    #-- try all values in the range, in steps of 1 (might not be integer)
    set best -1
    for {set th $from} {$th < $to} {set th [expr {$th+1}]} {
        #-- the variables "1" and "0" collect the "yes" and "no" cases:
        set 1 {}; set 0 {}
        foreach case $sample element $col {
            lappend [expr {$element > $th}] $case
        }
        set ptn [list $1 $0]
        set value [$criterion [slice $ptn end]] ;# rate the class labels only
        if {$value > $best} {
            set best $value
            set bestPtn $ptn
            set bestTh $th
        }
    }
    list "{$$feature > $bestTh}" [lindex $bestPtn 0] \
         "{$$feature <= $bestTh}" [lindex $bestPtn 1]
 }

 #-- extract the index-th element of every case in every part of a partition
 proc slice {partition index} {
    set res {}
    foreach part $partition {
        set buf {}
        foreach case $part {lappend buf [lindex $case $index]}
        lappend res $buf
    }
    set res
 }

 #-- Test:
 puts [buildDecisionTree $sample $features gain]

if 0 {This produces the following string which, apart from the humidity threshold and the order of tests, matches Quinlan's example:

 if {$outlook eq {overcast}} {
	return Play} elseif {$outlook eq {rain}} {
	if {$windy eq {no}} {
	return Play} elseif {$windy eq {yes}} {
	return Don't}} elseif {$outlook eq {sunny}} {
	if {$humidity > 70} {
	return Don't} elseif {$humidity <= 70} {
	return Play}}

It could easily be completed to a [proc] that implements the decision tree.}
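if 0 {Here is one way to do that - a minimal sketch of my own (''makeClassifier'' and ''classify'' are names I made up; they are not part of C4.5): the generated string simply becomes the body of a new proc whose arguments are the feature names:}

 proc makeClassifier {name features tree} {
    #-- all features except the trailing Result column become arguments
    proc $name [lrange $features 0 end-1] $tree
 }
 makeClassifier classify $features [buildDecisionTree $sample $features gain]

 puts [classify sunny 75 80 no] ;# a made-up new case -> Don't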
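if 0 {As a cheap plausibility check (again my own addition, not from the book), one can re-classify the learning sample itself - a tree built this way should get all 14 training cases right, which of course says nothing about unseen cases:}

 set errors 0
 foreach case $sample {
    #-- the last element is the expected Result, the rest are feature values
    if {[eval classify [lrange $case 0 end-1]] ne [lindex $case end]} {
        incr errors
    }
 }
 puts "errors on training set: $errors" ;# expected: 0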
if 0 {Of course, much of C4.5's functionality is still lacking (such as the treatment of missing values, pruning of a tree, etc.), and the code was certainly not written for maximal speed. But it gave me interesting challenges for the weekend, and will make me feel more familiar with the real C4.5 back at work...

----
[Arts and crafts of Tcl-Tk programming]}