Richard Suchenwirth 2004-03-28 - This weekend I took some work home, and I'm even reporting about it in the Wiki :) But there are no disclosure problems involved: this is about the C4.5 method of "data mining" - creating a decision tree from learning data, as published in Quinlan, J.R. (1993): C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers.
The idea is: given a sample set of learning cases, each having a number of attributes ("features") and labeled with the target class it belongs to, create a tree-shaped construct of conditions and outcomes (sounds magical, but is basically a more or less deep nesting of if...elseif...else clauses) such that for a new case, the correct class is returned as often as possible (i.e., with a low error rate). Attributes can be discrete, where you can test only for (in)equality or value subset membership, or numeric and continuous, where you can test whether a certain threshold is exceeded.
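To make this concrete before any math, here is a hand-written, made-up one-node "tree" in Tcl (nothing to do with Quinlan's examples yet):

 if {$temp > 80} {
     return Don't ;# too hot to play
 } else {
     return Play
 }

C4.5's job is to find such conditions, and their nesting, automatically from the learning data.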
Decision trees follow a "divide and conquer" strategy: at every condition (node in the tree), try to exclude as many possibilities as possible, so that you quickly reach either a single possibility (entropy = 0) or a probabilistic result (alternatives weighted by percentage) that cannot be subdivided further. The reduction of possibilities is measured by entropy here (for details, see the comments below).
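For example, a set of 14 cases of which 9 belong to one class and 5 to the other has entropy -(9/14)*log2(9/14) - (5/14)*log2(5/14) ≈ 0.940 bits, while a single-class set has entropy 0 - nothing left to decide.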
When I'm faced with mathematical matter, my frequent reflex is to implement it in Tcl, to see whether I understood it correctly. So here's my Tcl implementation of central parts of the C4.5 system. C4.5 can also be downloaded as C sources from (*), but I didn't want to rewrite C code in Tcl - rather, build it up from scratch and practice minimalism...
 #-- Entropy is a classic information measure (unit is bits)
 proc entropy cases {
     set sum 0
     foreach class [lsort -unique $cases] {
         set freqN [expr {1.0*[freq $class $cases]/[llength $cases]}]
         set sum [expr {$sum + $freqN * -[log2 $freqN]}]
     }
     set sum
 }
 #-- How often does a class occur in a set?
 proc freq {class set} {llength [lsearch -all $set $class]}
 #-- Logarithm to base 2
 proc log2 x {expr {log($x) / log(2)}}
 #-- Entropy of a partition, a list of sublists of cases
 proc pEntropy partition {
     set sum 0.0
     set N [llength [join $partition]]
     foreach part $partition {
         set sum [expr {$sum + 1.*[llength $part]/$N * [entropy $part]}]
     }
     set sum
 }
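A quick smoke test in an interactive tclsh (my own figures, not from the book): a 50/50 mix of two classes measures exactly one bit, a pure set zero, and a partition into pure parts has no remaining entropy:

 % entropy {Play Play Don't Don't}
 1.0
 % entropy {Play Play Play Play}
 0.0
 % pEntropy {{Play Play} {Don't Don't}}
 0.0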
For measuring how well a partition reduces entropy over a case set, Quinlan first proposes the gain function: the gain of a given partition (the result of applying a condition to the learning cases in question) is simply the difference in entropy between "before" and "after":
 proc gain partition {
     expr {[entropy [join $partition]] - [pEntropy $partition]}
 }
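So, for instance, perfectly separating the 50/50 set from above gains a full bit:

 % gain {{Play Play} {Don't Don't}}
 1.0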
However, on multi-valued features (the extreme being unique IDs), the gain measure acts as if a perfect partition were found - one of little use, though, because it is over-adapted to the learning set. So Quinlan proposes the gain ratio as an alternative, which mostly represents the potential information of a partition better - except in very unbalanced cases:
 proc gainRatio partition {
     if {[llength $partition] <= 1} {return 999}
     expr {[gain $partition] / [split_info $partition]}
 }
 proc split_info partition {
     set sum 0.0
     set N [llength [join $partition]]
     foreach part $partition {
         set share [expr {1.0*[llength $part]/$N}]
         set sum [expr {$sum + $share * -[log2 $share]}]
     }
     set sum
 }
 #-- Testing examples, should match the examples in the book
 foreach partition {
     {{1 0 0 0 1} {1 1 1 1} {0 0 1 1 1}}
     {{1 1 1 0 0 0} {1 1 1 1 1 1 0 0}}
 } {
     puts [list partition $partition]
     puts entropy=[entropy [join $partition]]
     puts pEntropy=[pEntropy $partition]
     puts gain=[gain $partition]
     puts gainRatio=[gainRatio $partition]
 }
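For cross-checking: the first test partition is in fact the "outlook" split of the weather sample below (with 1 for "Play", 0 for "Don't"), so it should print roughly entropy=0.940, pEntropy=0.694, gain=0.247 and gainRatio=0.156; the second is the "windy" split, with a much weaker gain of about 0.048 - which should match the figures in the book.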
So far, so good. The tools are at hand and work just like the doctor ordered. Let's jump into the water with the first toy example in Quinlan's book, where a decision tree for "to play or not to play" (some unspecified ball game) is constructed from the following meteorological data (temperature obviously being in degrees Fahrenheit):
 set features {outlook temp humidity windy Result}
 set sample {
     {sunny    75 70 yes Play}
     {sunny    80 90 yes Don't}
     {sunny    85 85 no  Don't}
     {sunny    72 95 no  Don't}
     {sunny    69 70 no  Play}
     {overcast 72 90 yes Play}
     {overcast 83 78 no  Play}
     {overcast 64 65 yes Play}
     {overcast 81 75 no  Play}
     {rain     71 80 yes Don't}
     {rain     65 70 yes Don't}
     {rain     75 80 no  Play}
     {rain     68 80 no  Play}
     {rain     70 96 no  Play}
 }
We have the tri-state feature "outlook", two continuous numeric features, a binary one ("windy"), and finally the expected decision, which must not be tested as a feature because it is the result domain (having it in Title case helps to visualize this distinction). In C4.5, each feature must be declared with its value range, or as 'continuous', as well as the set of result values. In Tcl it's easy to deduce (or guess?) these from the sample itself. First, let's just extract a "column" from the sample - we'll need that more often:
 proc column {sample index} {
     set res {}
     foreach case $sample {lappend res [lindex $case $index]}
     set res
 }
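For example, the class column (lindex happily accepts end as an index):

 % column $sample end
 Play Don't Don't Don't Play Play Play Play Play Don't Don't Play Play Play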
With some heuristics, a column can be classified as discrete or continuous. The following procedure returns a pair of type and range, the latter being an enumeration of values for a discrete feature, or the lower and upper bounds for a continuous one:
 proc featureType column {
     set values [lsort -unique $column]
     if {[llength $values] > 3 && [allNumeric $values]} {
         set values [lsort -real $values] ;# covers integers as well
         list continuous [list [lindex $values 0] [lindex $values end]]
     } else {
         list discrete $values
     }
 }
 proc allNumeric list {expr {![catch {expr [join $list +]}]}}
 #-- Test:
 set i -1
 foreach feature $features {
     puts $feature:[featureType [column $sample [incr i]]]
 }
which prints:
 outlook:discrete {overcast rain sunny}
 temp:continuous {64 85}
 humidity:continuous {65 96}
 windy:discrete {no yes}
 Result:discrete {Don't Play}
Now for building the tree. As you might imagine, this works recursively: given a non-trivial sample (i.e., one with more than one class and more than one case), test all available conditions according to gain or gainRatio, select the best, split the sample according to that condition, and recurse over the parts of the partition:
 proc buildDecisionTree {sample features criterion} {
     #-- trivial cases: a single case, or all cases of one class
     if {[llength $sample] <= 1} {
         return [list return [lindex $sample 0 end]]
     }
     if {[llength [lsort -unique [column $sample end]]] <= 1} {
         return [list return [lindex $sample 0 end]]
     }
     #-- evaluate the best partition of every feature, keep the winner
     set best -1
     set index 0
     foreach feature [lrange $features 0 end-1] {
         set ptn [bestPartition $sample $feature $index $criterion]
         set value [$criterion [slice $ptn end]]
         if {$value > $best && [llength $ptn] > 1} {
             set best $value
             set bestPtn $ptn
         }
         incr index
     }
     #-- render the tree as well-formed Tcl...
     set word "if"
     set res ""
     foreach {cond part} $bestPtn {
         append res "$word $cond \{\n\t"
         append res [buildDecisionTree $part $features $criterion]
         set word "\} elseif"
     }
     append res \}
 }
The generic routine for finding the best partition dispatches on feature type, discrete or continuous, by constructing the appropriate proc name:
 proc bestPartition {sample feature index criterion} {
     set col [column $sample $index]
     foreach {type range} [featureType $col] break
     best-$type $sample $feature $index $range $criterion
 }
For "best-discrete", just use the given values. This makes the tree possibly more than binary, just as in C4.5:}
 proc best-discrete {sample feature index range criterion} {
     #-- fan out cases according to index-th feature into variables
     foreach case $sample {
         lappend _[lindex $case $index] $case
     }
     set res {}
     foreach value $range {
         lappend res "{$$feature eq {$value}}" [set _$value]
     }
     set res
 }
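Trying this on the "outlook" feature (column 0) shows the three-way fan-out; here I only print the size of each part:

 % foreach {cond part} [best-discrete $sample outlook 0 {overcast rain sunny} gain] {puts "$cond -> [llength $part] cases"}
 {$outlook eq {overcast}} -> 4 cases
 {$outlook eq {rain}} -> 5 cases
 {$outlook eq {sunny}} -> 5 cases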
For "best-continuous", all values are tried to see which gives the best partition:
 proc best-continuous {sample feature index range criterion} {
     set col [column $sample $index]
     set labels [column $sample end]
     foreach {from to} $range break
     #-- try all values in the range, in steps of 1 (might not be integer)
     set best -1
     for {set th $from} {$th < $to} {set th [expr {$th+1}]} {
         #-- lists for the "yes" and "no" cases:
         set 1 {}; set 0 {}
         foreach element $col label $labels {
             lappend [expr {$element > $th}] $label
         }
         set ptn [list $1 $0]
         set value [$criterion $ptn]
         if {$value > $best} {
             set best $value
             set bestPtn $ptn
             set bestTh $th
         }
     }
     list "{$$feature > $bestTh}" [lindex $bestPtn 0] \
         "{$$feature <= $bestTh}" [lindex $bestPtn 1]
 }
 proc slice {partition index} {
     set res {}
     foreach part $partition {
         set buf {}
         foreach case $part {lappend buf [lindex $case $index]}
         lappend res $buf
     }
     set res
 }
 #-- Test:
 puts [buildDecisionTree $sample $features gain]
produces the following string which, apart from the humidity threshold and the order of tests, matches Quinlan's example, and could easily be completed into a proc that implements the decision tree (see below):
 if {$outlook eq {overcast}} {
 	return Play} elseif {$outlook eq {rain}} {
 	if {$windy eq {no}} {
 	return Play} elseif {$windy eq {yes}} {
 	return Don't}} elseif {$outlook eq {sunny}} {
 	if {$humidity > 70} {
 	return Don't} elseif {$humidity <= 70} {
 	return Play}}
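For instance, pasting that string into a proc body yields a working classifier (classify is my name for it, not C4.5's):

 #-- hypothetical wrapper around the generated tree
 proc classify {outlook temp humidity windy} {
     if {$outlook eq {overcast}} {
         return Play
     } elseif {$outlook eq {rain}} {
         if {$windy eq {no}} {return Play} elseif {$windy eq {yes}} {return Don't}
     } elseif {$outlook eq {sunny}} {
         if {$humidity > 70} {return Don't} elseif {$humidity <= 70} {return Play}
     }
 }
 % classify sunny 75 70 yes
 Play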
Of course, much of C4.5's functionality is still lacking (like how to treat missing values, how to prune a tree, etc.), and the code was certainly not written for maximal speed. But it gave me interesting challenges for the weekend, and will make me feel more familiar with the real C4.5 back at work...