Arjen Markus (14 May 2019) Machine learning is a hot topic - at least in some circles. There is a wide array of techniques that you can apply: artificial neural networks in various guises, random forests, support vector machines and what not. The program below illustrates the use of decision trees.
Much of machine learning concerns the classification of data points. The idea is that a set of attributes should be suitable for characterising the observation (or for making some decision). For instance, the example in the program below uses four weather attributes, outlook, temperature, humidity and windiness, to decide whether some activity should be undertaken (P) or not (N).
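In the program below each observation is simply a Tcl list: the attribute values in a fixed order, with the class as the last element. A small sketch (the variable names are mine) of how such a record is taken apart:

```tcl
# One record from the training set below:
# outlook, temperature, humidity, windy, class
set record {sunny hot high false N}

set outlook [lindex $record 0]    ;# attributes are indexed from 0
set class   [lindex $record end]  ;# the class is always the last element

puts "outlook = $outlook, class = $class"
```

This convention - attributes by index, class at the end - is what DetermineGains and ClassifyByTree rely on.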
It is based on an article by J.R. Quinlan. While it implements only the very simplest technique for constructing the decision tree, the so-called ID3 method, with no provisions for noise or missing values, it does seem to work, at least for this (tiny) data set.
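ID3 is built on an information (entropy) measure: a set with p records of class P and n of class N carries I(p,n) = -p/(p+n)·log2(p/(p+n)) - n/(p+n)·log2(n/(p+n)) bits. As a concrete illustration, a small helper (not part of the program below) computing this for the 9 P / 5 N training set used later:

```tcl
# information --
#     Entropy of a class distribution, in bits
#     (illustrative helper, not part of the program below)
#
proc information {counts} {
    set total 0
    foreach n $counts { incr total $n }

    set info 0.0
    foreach n $counts {
        if { $n > 0 } {
            set p [expr {$n / double($total)}]
            set info [expr {$info - $p * log($p)}]
        }
    }
    return [expr {$info / log(2.0)}]
}

# The full training set below holds 9 P and 5 N records
puts [information {9 5}]   ;# roughly 0.940 bits, as in Quinlan's article
```

The gain of an attribute is this overall information minus the expected information remaining after splitting on that attribute; ID3 greedily picks the attribute with the largest gain at each node.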
Note: the gain for an attribute is the overall information minus the expected information of the split, where the entropy of each subset is weighted by the fraction of records carrying that attribute value. The procedure DetermineGains applies this weighting per attribute value before subtracting the result from the overall information.

# decision.tcl --
#     Implementation of decision trees
#
#     TODO:
#     - Allow for noise and missing values
#     - Implement the "chi-square" criterion
#
#     Based on:
#     Quinlan, J.R. "Induction of decision trees"
#

# DetermineGains --
#     Determine the gain for each attribute
#
# Arguments:
#     training        The training set
#
proc DetermineGains {training} {
    set numberAttrs [expr {[llength [lindex $training 0]] - 1}]
    set numberRecs  [llength $training]

    #
    # Determine the total information
    #
    foreach record $training {
        set classValue [lindex $record end]
        dict incr classValues $classValue 1
    }

    set overallInformation 0.0
    dict for {class classCount} $classValues {
        set overallInformation [expr {$overallInformation - \
            $classCount / double($numberRecs) * log($classCount / double($numberRecs))}]
    }
    set overallInformation [expr {$overallInformation / log(2.0)}]

    #
    # Now determine the gain per attribute
    #
    for {set attr 0} {$attr < $numberAttrs} {incr attr} {
        array unset count
        unset -nocomplain attValues
        unset -nocomplain classValues

        foreach record $training {
            set attValue   [lindex $record $attr]
            set classValue [lindex $record end]

            incr count($attValue,$classValue)
            dict incr attValues   $attValue   1
            dict incr classValues $classValue 1
        }

        #
        # Expected information: weight the entropy of each subset
        # by the frequency of the corresponding attribute value
        #
        set expectedInformation 0.0
        dict for {attValue attCount} $attValues {
            set subsetEntropy 0.0
            dict for {classValue classCount} $classValues {
                set c 0.0
                if { [info exists count($attValue,$classValue)] } {
                    set c [expr {double($count($attValue,$classValue))}]
                }
                if { $c > 0 } {
                    set subsetEntropy [expr {$subsetEntropy - ($c/$attCount) * log($c/$attCount)}]
                }
            }
            set expectedInformation [expr {$expectedInformation + \
                $subsetEntropy * $attCount / double($numberRecs)}]
        }

        lappend gains [expr {$overallInformation - $expectedInformation / log(2.0)}]
    }

    return $gains
}

# ClassifyByTree --
#     Determine the class
#
# Arguments:
#     tree            The tree as trained
#     observation     The list of observed attributes
#
# Result:
#     Either the "proper" classification or an empty string
#     (if the tree turns out to be incomplete)
#
proc ClassifyByTree {tree observation} {
    lassign $tree attr cases

    set attValue [lindex $observation $attr]

    set class ""
    foreach {caseValue subtree} $cases {
        if { $attValue eq $caseValue } {
            if { [lindex $subtree 0] == -1 } {
                set class [lindex $subtree 1]
                break
            } else {
                set class [ClassifyByTree $subtree $observation]
            }
        }
    }

    return $class
}

# GetAttributeValues --
#     Return a list of the values of an attribute from the training data
#
# Arguments:
#     training        The training set to be used
#     attribute       Index of the attribute
#
# Result:
#     List of attribute values
#
proc GetAttributeValues {training attribute} {
    set values {}
    foreach sample $training {
        lappend values [lindex $sample $attribute]
    }
    return [lsort -unique $values]
}

# GetSubsetTraining --
#     Return a subset of the training set based on a particular value
#     of an attribute
#
# Arguments:
#     training        The training set to be used
#     attribute       Index of the attribute
#     value           Value to filter on
#
# Result:
#     Subset of the training set
#
proc GetSubsetTraining {training attribute value} {
    set subset {}
    foreach sample $training {
        if { [lindex $sample $attribute] eq $value } {
            lappend subset $sample
        }
    }
    return $subset
}

# constructDecisionTree --
#     Construct the decision tree from the training data
#
# Arguments:
#     training        The training set to be used
#     prevAtt         Attribute chosen one level up (not currently used)
#
# Result:
#     Nested list representing the tree
#
proc constructDecisionTree {training {prevAtt {}}} {
    set tree {}

    #
    # Determine the "best" attribute
    #
    set gains [DetermineGains $training]

    set attribute -1
    set maxAtt    -1
    set maxGain    0.0
    foreach gain $gains {
        incr attribute
        if { $gain > $maxGain } {
            set maxGain $gain
            set maxAtt  $attribute
        }
    }

    #
    # Stopping criterion: if there is only one item left
    #
    if { $maxAtt == -1 || [llength $training] <= 1 } {
        return [list -1 [lindex $training 0 end]]
    } else {
        set subTrees {}
        foreach attValue [GetAttributeValues $training $maxAtt] {
            set partialTree [constructDecisionTree \
                                 [GetSubsetTraining $training $maxAtt $attValue] $maxAtt]
            set subTrees    [concat $subTrees [list $attValue $partialTree]]
        }
        return [list $maxAtt $subTrees]
    }
}

# test --
#     Simple test/demo
#

# Small training set - article by Quinlan
#
#    Outlook   Temperature  Humidity  Windy  Class
set training {
    {sunny     hot   high    false  N }
    {sunny     hot   high    true   N }
    {overcast  hot   high    false  P }
    {rain      mild  high    false  P }
    {rain      cool  normal  false  P }
    {rain      cool  normal  true   N }
    {overcast  cool  normal  true   P }
    {sunny     mild  high    false  N }
    {sunny     cool  normal  false  P }
    {rain      mild  normal  false  P }
    {sunny     mild  normal  true   P }
    {overcast  mild  high    true   P }
    {overcast  hot   normal  false  P }
    {rain      mild  high    true   N }
}

# Simple tree
#     Leaves are represented by attribute -1 and the class value
#     Note: this tree describes the above set perfectly
#
set simple_tree {
    0 {sunny    {2 {high   {-1 N}
                    normal {-1 P}
                   }
                }
       overcast {-1 P}
       rain     {3 {true  {-1 N}
                    false {-1 P}
                   }
                }
      }
}

set new_tree [constructDecisionTree $training]
puts $new_tree

puts "[ClassifyByTree $new_tree {sunny hot high false}] -- expected: N"
puts "[ClassifyByTree $new_tree {overcast hot normal false}] -- expected: P"
puts "[ClassifyByTree $new_tree {rain mild high true}] -- expected: N"
puts "[ClassifyByTree $new_tree {rain hot high true}] -- expected: ? (new data)"
puts "[ClassifyByTree $new_tree {rain hot high false}] -- expected: ? (new data)"
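To see the nested-list tree format in isolation, the following sketch classifies a few observations against the hand-written simple_tree (it repeats ClassifyByTree and the tree from above so it runs stand-alone):

```tcl
# ClassifyByTree -- walk the nested-list tree (copy of the proc above)
proc ClassifyByTree {tree observation} {
    lassign $tree attr cases
    set attValue [lindex $observation $attr]
    set class ""
    foreach {caseValue subtree} $cases {
        if { $attValue eq $caseValue } {
            if { [lindex $subtree 0] == -1 } {
                set class [lindex $subtree 1]   ;# leaf: {-1 class}
                break
            } else {
                set class [ClassifyByTree $subtree $observation]
            }
        }
    }
    return $class
}

# Hand-written tree: attribute 0 (outlook) at the root,
# attribute 2 (humidity) under "sunny", attribute 3 (windy) under "rain"
set simple_tree {
    0 {sunny    {2 {high {-1 N} normal {-1 P}}}
       overcast {-1 P}
       rain     {3 {true {-1 N} false {-1 P}}}}
}

puts [ClassifyByTree $simple_tree {sunny hot high false}]     ;# sunny + high humidity  -> N
puts [ClassifyByTree $simple_tree {overcast hot normal false}] ;# overcast               -> P
puts [ClassifyByTree $simple_tree {rain mild high true}]       ;# rain + windy           -> N
```

An observation whose path is missing from the tree yields an empty string, which is how an incomplete tree shows up.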