Constructing decision trees from data

Arjen Markus (14 May 2019) Machine learning is a hot topic, at least in some circles. There is a wide array of techniques that you can apply: artificial neural networks in various guises, random forests, support vector machines and so on. The program below illustrates the use of decision trees.

Much of machine learning concerns the classification of data points. The idea is that a set of attributes is suitable for characterising the observation (or for making some decision). For instance, the example in the program below uses four weather attributes, outlook, temperature, humidity and windiness, to decide whether some activity should be undertaken (class P) or not (class N).
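
Each observation is simply a Tcl list: the attribute values followed by the class. The first record of the training set used below reads, for instance:

    set record {sunny hot high false N}    ;# outlook, temperature, humidity, windy -> class N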

It is based on the article "Induction of decision trees" by J.R. Quinlan. While the program only implements the very simplest technique for constructing the tree, the so-called ID3 method with no provisions for noise or missing values, it does seem to work, at least for this (tiny) data set.

The gain for an attribute is the overall information of the training set minus the expected information that remains after partitioning the set on that attribute. The procedure DetermineGains computes this quantity, in bits, for each attribute.
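
As a quick check of the arithmetic (a tclsh session; the numbers agree with the article): the full training set below contains nine records of class P and five of class N, so the overall information is

    % format %.3f [expr {-(9/14.0)*log(9/14.0)/log(2.0) - (5/14.0)*log(5/14.0)/log(2.0)}]
    0.940

Partitioning on outlook gives the subsets sunny (2 P, 3 N), overcast (4 P, 0 N) and rain (3 P, 2 N), with an expected information of about 0.694 bits, so the gain for outlook is roughly 0.940 - 0.694 = 0.246 bits.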

# decision.tcl --
#     Implementation of decision trees
#
#     TODO:
#     - Allow for noise and missing values
#     - Implement the "chi-square" criterion
#
#     Based on:
#     Quinlan, J.R. "Induction of decision trees"
#

# DetermineGains --
#     Determine the gain for each attribute
#
# Arguments:
#     training          The training set
#
# Result:
#     List of gains (in bits), one per attribute
#
proc DetermineGains {training} {
    set numberAttrs [expr {[llength [lindex $training 0]] - 1}]
    set numberRecs  [llength $training]

    #
    # Determine the total information
    #
    foreach record $training {
        set classValue [lindex $record end]

        dict incr classValues $classValue 1

    }

    set overallInformation 0.0
    dict for {class classCount} $classValues {
        set overallInformation [expr {$overallInformation - $classCount / double($numberRecs) * log($classCount / double($numberRecs))}]
    }

    set overallInformation [expr {$overallInformation / log(2.0)}]

    #
    # Now determine the gain per attribute
    #
    for {set attr 0} {$attr < $numberAttrs} {incr attr} {
        array unset count
        unset -nocomplain attValues
        unset -nocomplain classValues

        foreach record $training {
            set attValue   [lindex $record $attr]
            set classValue [lindex $record end]

            incr count($attValue,$classValue)
            dict incr attValues   $attValue   1
            dict incr classValues $classValue 1

        }

        #
        # Expected information after partitioning on this attribute:
        # the information of each subset, weighted by the relative
        # size of that subset
        #
        set expectedInformation 0.0
        dict for {attValue attCount} $attValues {
            set subsetInformation 0.0
            dict for {classValue classCount} $classValues {
                set c 0.0
                if { [info exists count($attValue,$classValue)] } {
                    set c [expr {double($count($attValue,$classValue))}]
                }

                if { $c > 0 } {
                    set subsetInformation [expr {$subsetInformation - ($c/$attCount) * log($c/$attCount)}]
                }
            }

            set expectedInformation \
                [expr {$expectedInformation + $attCount * $subsetInformation / double($numberRecs)}]
        }

        set gain [expr {$overallInformation - $expectedInformation / log(2.0)}]

        lappend gains $gain
    }

    return $gains
}
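
For the training set defined further below the procedure reproduces the gains reported in the article, with outlook clearly the most informative attribute:

    puts [DetermineGains $training]
    # Prints approximately: 0.247 0.029 0.152 0.048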

# ClassifyByTree --
#     Determine the class
#
# Arguments:
#     tree             The tree as trained
#     observation      The list of observed attributes
#
# Result:
#     Either the "proper" classification or an empty string (if the tree turns out to be incomplete)
#
proc ClassifyByTree {tree observation} {
    lassign $tree attr cases

    set attValue [lindex $observation $attr]

    set class ""

    foreach {caseValue subtree} $cases {
        if { $attValue eq $caseValue } {
            if { [lindex $subtree 0] == -1 } {
                set class [lindex $subtree 1]
                break
            } else {
                set class [ClassifyByTree $subtree $observation]
            }
        }
    }

    return $class
}
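
A tree is a nested list: an interior node is an attribute index followed by a list of value/subtree pairs, a leaf is the marker -1 followed by the class value. A minimal hand-built illustration (this toy tree splits on windiness only):

    set tinyTree {3 {true {-1 N} false {-1 P}}}
    puts [ClassifyByTree $tinyTree {rain mild high true}]    ;# prints: N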

# GetAttributeValues --
#     Return a list of the values of an attribute from the training data
#
# Arguments:
#     training        The training set to be used
#     attribute       Index of the attribute
#
# Result:
#     List of attribute values
#
proc GetAttributeValues {training attribute} {
    set values {}

    foreach sample $training {
        lappend values [lindex $sample $attribute]
    }

    return [lsort -unique $values]
}
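
For instance, for the outlook attribute (index 0) of the training set below:

    puts [GetAttributeValues $training 0]    ;# prints: overcast rain sunny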

# GetSubsetTraining --
#     Return a subset of the training set based on a particular value of an attribute
#
# Arguments:
#     training        The training set to be used
#     attribute       Index of the attribute
#     value           Value to filter on
#
# Result:
#     Subset of the training set
#
proc GetSubsetTraining {training attribute value} {
    set subset {}

    foreach sample $training {
        if { [lindex $sample $attribute] eq $value } {
            lappend subset $sample
        }
    }

    return $subset
}
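
For instance, selecting the records with outlook equal to "overcast" yields the four records of class P:

    puts [llength [GetSubsetTraining $training 0 overcast]]    ;# prints: 4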

# constructDecisionTree --
#     Construct the decision tree from the training data
#
# Arguments:
#     training        The training set to be used
#     prevAtt         Attribute used for the split one level up (optional, currently unused)
#
# Result:
#     Nested list representing the tree
#
proc constructDecisionTree {training {prevAtt {}}} {
    set tree {}

    #
    # Determine the "best" attribute
    #
    set gains [DetermineGains $training]

    set attribute -1
    set maxAtt    -1
    set maxGain    0.0

    foreach gain $gains {
        incr attribute

        if { $gain > $maxGain } {
            set maxGain $gain
            set maxAtt  $attribute
        }
    }

    #
    # Stopping criterion: no attribute yields any gain (for instance
    # because all records belong to the same class) or only one
    # record is left
    #
    if { $maxAtt == -1 || [llength $training] <= 1 } {
        return [list -1 [lindex $training 0 end]]
    } else {
        set subTrees {}
        foreach attValue [GetAttributeValues $training $maxAtt] {
            set partialTree [constructDecisionTree [GetSubsetTraining $training $maxAtt $attValue] $maxAtt]
            lappend subTrees $attValue $partialTree
        }
        return [list $maxAtt $subTrees]
    }
}

# test --
#     Simple test/demo

# Small training set - article by Quinlan
#
#    Outlook  Temperature Humidity Windy Class
set training {
    {sunny    hot         high     false N     }
    {sunny    hot         high     true  N     }
    {overcast hot         high     false P     }
    {rain     mild        high     false P     }
    {rain     cool        normal   false P     }
    {rain     cool        normal   true  N     }
    {overcast cool        normal   true  P     }
    {sunny    mild        high     false N     }
    {sunny    cool        normal   false P     }
    {rain     mild        normal   false P     }
    {sunny    mild        normal   true  P     }
    {overcast mild        high     true  P     }
    {overcast hot         normal   false P     }
    {rain     mild        high     true  N     }
}

# Simple tree
#     Leaves are represented by attribute -1 and the class value
#     Note: this tree describes the above set perfectly
#
set simple_tree {
    0 {sunny    {2 {high   {-1 N}
                    normal {-1 P}
                   }
                }
       overcast {-1 P}
       rain     {3 {true {-1 N}
                    false {-1 P}
                   }
                }
       }
}

set new_tree [constructDecisionTree $training]
puts $new_tree

puts "[ClassifyByTree $new_tree {sunny hot high false}] -- expected: N"
puts "[ClassifyByTree $new_tree {overcast hot normal false}] -- expected: P"
puts "[ClassifyByTree $new_tree {rain mild high true}] -- expected: N"
puts "[ClassifyByTree $new_tree {rain hot high true}] -- expected: ? (new data)"
puts "[ClassifyByTree $new_tree {rain hot high false}] -- expected: ? (new data)"