Clustering data

Arjen Markus (22 march 2004) Cluster analysis is a statistical technique for identifying groups of observations that are somehow "close" together. If you look at arbitrary points in a plane, then most probably you will see small groups of points. To the human eye it is very easy to identify such groups or clusters. For a computer program it is a bit tougher (in the sense: how to precisely define how close two points are together).

Well, here is a first attempt at doing cluster analysis with Tcl. A lot needs to be done still to make it practically useful:

  • Preparing the data: use scaling techniques to normalize the various "coordinates"
  • Weighing factors: select the weight of the various coordinates in the determination of the distance between points
  • Reporting: a procedure is missing to (re)construct the clusters formed at the Nth step
  • Optimization: the computational loops have not been optimized at all (I was too lazy :)
  • Refactoring the computation: several possibilities to simplify the computational procedures so that custom methods can be introduced to compute the distance or the characteristics of the clusters.

 # cluster.tcl --
 #    Basic cluster analysis:
 #    - A dataset (a list of lists) consists of records describing the data
 #    - Each record is a list of one or more values assumed to be the
 #      coordinates in a vector space.
 #    - The clustering algorithm will successively merge individual
 #      data points into clusters, ending when all data points have been
 #      merged into a single cluster.
 #    - The output consists of a list of lists:
 #      - each list contains a list of data points (indices into the original
 #        list) forming the next cluster
 #      - the second element is the center of that cluster
 #      - the third element is the radius
 #
 #    Note:
 #    There are many variations possible with respect to cluster
 #    analysis. In this script each cluster is characterized by
 #    the center of gravity of the points making up the cluster
 #    and the maximum distance to that center is taken as the radius.
 #    By redefining the procedure that determines the distance
 #    between clusters and points and the procedure that determines
 #    the characteristics of the clusters you can change the clustering
 #    method.
 #
 #

 namespace eval ::Cluster {
    # Force it to exist
 }

 # cluster --
 #    Compute the succession of clusters
 #
 # Arguments:
 #    datapoints       List of lists, each a data point
 # Result:
 #    List of lists describing the clusters that have been formed
 #
 proc ::Cluster::cluster {datapoints} {
    #
    # First step: create clusters consisting of one single point
    #
    set resulting_clusters {}
    set clusters {}

    set idx 0
    foreach point $datapoints {
       set cluster_data [list $idx $point 0.0]
       lappend clusters $cluster_data
       incr idx
    }

    #
    # Second step: determine the minimum distances
    # Note:
    # The work could be halved, but the algorithm would
    # be a bit more complicated. Leave it for now
    #
    set idx 0
    set noclusters [llength $clusters]
    set closest_clusters {}
    foreach cluster $clusters {
       set mindist {}
       for { set i 0 } { $i < $noclusters } { incr i } {
          if { $i != $idx } {
             set other [lindex $clusters $i]
             set dist [distanceClusters $cluster $other]
             if { $mindist == {} || $mindist > $dist } {
                set mindist $dist
                set closest $i
             }
          }
       }
       lappend closest_clusters $idx $closest $mindist
       incr idx
    }

    #
    # Third step:
    # - Determine the minimum distance between two clusters
    # - Join them
    # - Determine the new minimum distances
    # - Continue until only one is left
    #
    while { [llength $clusters] > 1 } {
       set mindist    {}
       set minidx     {}
       set candidates {}
       set curr       0
       foreach {idx closest dist} $closest_clusters {
          if { $mindist == {} || $mindist > $dist } {
             set mindist    $dist
             set minidx     $idx
             set minclosest $closest
          }
          incr curr
       }

       set new_cluster [determineNewCluster $clusters $minidx $minclosest $datapoints]
       set clusters    [lreplace $clusters $minidx $minidx $new_cluster]
       set clusters    [lreplace $clusters $minclosest $minclosest]

       lappend resulting_clusters $new_cluster
       #puts $resulting_clusters
       #puts $clusters

       #
       # Now the new distances!
       # Note:
       # For now a lazy method - just reiterate over all pairs
       #
       set idx 0
       set noclusters [llength $clusters]
       set closest_clusters {}
       foreach cluster $clusters {
          set mindist {}
          for { set i 0 } { $i < $noclusters } { incr i } {
             if { $i != $idx } {
                set other [lindex $clusters $i]
                set dist [distanceClusters $cluster $other]
                if { $mindist == {} || $mindist > $dist } {
                   set mindist $dist
                   set closest $i
                }
             }
          }
          lappend closest_clusters $idx $closest $mindist
          incr idx
       }
    }

    return $resulting_clusters
 }

 # determineNewCluster --
 #    Compute the characteristics of the new cluster
 #
 # Arguments:
 #    clusters      All clusters
 #    idx1          Index of the first cluster
 #    idx2          Index of the second cluster
 #    datapoints    Original data points
 # Result:
 #    The new cluster
 #
 proc ::Cluster::determineNewCluster {clusters idx1 idx2 datapoints} {
    foreach {indices1 centre1 radius1} [lindex $clusters $idx1] {break}
    foreach {indices2 centre2 radius2} [lindex $clusters $idx2] {break}

    #
    # Determine the new centre
    #
    set new_centre {}
    foreach crd $centre1 {
       lappend new_centre 0.0
    }
    set count 0
    foreach idx [concat $indices1 $indices2] {
       set coords [lindex $datapoints $idx]
       set sumcrd {}

       foreach nc $new_centre c $coords {
          set nc [expr {$nc+$c}]
          lappend sumcrd $nc
       }
       set  new_centre $sumcrd
       incr count
    }

    set sumcrd     $new_centre
    set new_centre {}
    foreach nc $sumcrd {
       lappend new_centre [expr {$nc/double($count)}]
    }

    #
    # Determine the radius
    # Note:
    # Here is some room for improvement - other_cluster
    #
    set new_cluster [list [concat $indices1 $indices2] $new_centre 0.0]

    set maxdist 0.0
    foreach idx [lindex $new_cluster 0] {
       set other_cluster [list {} [lindex $datapoints $idx] 0.0]
       set dist [distanceClusters $new_cluster $other_cluster]
       if { $dist > $maxdist } {
          set maxdist $dist
       }
    }

    set new_cluster [lreplace $new_cluster 2 2 $maxdist]
 }

 # distanceCluster --
 #    Compute the distance between two clusters
 #
 # Arguments:
 #    cluster1      Data determining the first cluster
 #    cluster2      Data determining the second cluster
 # Result:
 #    Distance between the clusters
 # Note:
 #    Just passing the centres and the radii will improve
 #    the performance
 #
 proc ::Cluster::distanceClusters {cluster1 cluster2} {
    foreach {indices1 centre1 radius1} $cluster1 {break}
    foreach {indices2 centre2 radius2} $cluster2 {break}

    #
    # Use the Euclidean norm
    #
    set dist 0.0
    foreach crd1 $centre1 crd2 $centre2 {
       set dist [expr {$dist+($crd1-$crd2)*($crd1-$crd2)}]
    }
    set dist [expr {sqrt($dist)-$radius1-$radius2}]
 }

 # main --
 #    Simple tests
 #
 catch {
 console show
 }

 set datapoints { {1.0 0.0} {1.0 0.1} {0.0 0.0} }

 puts [::Cluster::cluster $datapoints]
 puts " "
 set datapoints { {1.0 0.0} {1.0 0.1} {0.0 0.0} {0.0 0.5} {0.01 0.5} }
 set clusters [::Cluster::cluster $datapoints]
 foreach cluster $clusters {
    puts $cluster
 }

 #
 # Visualise the process
 #
 catch {
 package require Tk
 canvas .c -bg white -width 200 -height 200
 pack   .c -fill both

 foreach data $datapoints {
    foreach {x y} $data {break}
    set xcrd [expr {50+100*$x}]
    set ycrd [expr {150-100*$y}]
    .c create oval [expr {$xcrd-2}] [expr {$ycrd-2}] \
                   [expr {$xcrd+2}] [expr {$ycrd+2}] -fill black
 }

 foreach cluster $clusters \
         colour [lrange {cyan green yellow orange red magenta} 0 [llength $clusters]] {
    foreach {ids coords radius} $cluster {break}
    set xcrd   [expr {50+100*[lindex $coords 0]}]
    set ycrd   [expr {150-100*[lindex $coords 1]}]
    set radius [expr {100*$radius}]
    .c lower \
       [.c create oval [expr {$xcrd-$radius}] [expr {$ycrd-$radius}] \
                       [expr {$xcrd+$radius}] [expr {$ycrd+$radius}] -fill $colour]
    #.c create text $xcrd $ycrd -text $ids
 }
 .c create line   0 150 200 150 -fill grey
 .c create line  50   0  50 200 -fill grey
 .c create line   0  50 200  50 -fill grey
 .c create line 150   0 150 200 -fill grey
 } ;# End catch

EB: My first Tcl/Tk application, back in 96 with Tcl 7.6, dealt with interactive interpretation of hierachical clustering (screenshot at [1 ]).

mkg: Running Cluster::cluster on 5 or more randomly generated 2-D points almost always ends with an error as distanceClusters tries to sqrt a negative number. Looks like your cluster-to-cluster centroid distance exceeds the sum of the radii (think of two large clusters overlapping each other almost entirely).

This algorithm is a form of agglomerative hierarchical clustering with, I believe, a time complexity of O(n^2) [2 ] and is widely used in the life sciences (e.g., gene expression analysis). Alternative algorithms (such as K-means) based on partitioning techniques also exist with much better time complexity behaviour and are typically applied to large data sets that hierarchical clustering can't handle (Yes, there are other tradeoffs to be considered too).

AM Thanks for mentioning the problem with overlapping clusters and the alternative methods. Yes, the whole thing is a mere toy at the moment. And the above implementation is O(n^3) because I re-compute all distances, rather than update them.

My main motivation was to see how it could be done (following a conversation with Peter de Rijk about numerical analysis in general).

AM Found the bug: it was in the expression with sqrt() - the parenthesis was too far to the left. Fixed it.

George Petasis: Here is another clustering facility that implements k-means clustering. BSD license...


 ##############################################################################
 ##  SimpleKMeans.tcl:                                                       ##
 ## ---------------------                                                    ##
 ##  This class implements a simple k-means clustering algorithm.            ##
 ##                                                                          ##
 ##  This file is part of the Ellogon Language Engineering Platform.         ##
 ##  Copyright 1998-2007 by:                                                 ##
 ##  Georgios Petasis,                                                       ##
 ##  Athens, Greece.                                                         ##
 ##############################################################################
 
 package require Itcl
 
 itcl::class ELEP::MachineLearning::Clustering::SimpleKMeans {
   
   variable clusterAssignments {}; # A list holding the cluster for each instance
   variable clusterCentroids   {}; # A list holding the cluster centroids
   variable iterations;            # Keep track of the number of iterations
                                   # completed before convergence
   public {
     variable distance         {}; # A command prefix for returning the distance
                                   # between two instances
     variable random_seed      10; # The seed we set to random
     variable K                 2; # Number of clusters to generate
     variable debug             0
   }
 
 
 
   constructor {args} {
     eval configure $args
   };# constructor
 
   # buildClusterer --
   #    Generates a clusterer from a set of training data. The produced
   #    clusterer can be used to classify new instances.
   #
   # Arguments:
   #    instances:    A list of instances serving as the training data
   # Result:
   #    -
   #
   method buildClusterer {instances} {
     set k $K
     expr {srand($random_seed)}
     set iterations 0
     #
     # Step 1: Select randomly k instances as the centroids of the k clusters
     #
     if {$debug} {puts "buildClusterer: K=$k"; update}
     set clusterCentroids {}
     initialiseClusterCentroids $instances
     set k [llength $clusterCentroids]
     if {$debug} {
       puts "clusterCentroids:"
       for {set i 0} {$i < $k} {incr i} {
         puts "  k=$i, centroid: [lindex $clusterCentroids $i]"
       }
       update
     }
     #
     # Step 2: Run the k-means algorithm
     #
     set clusterAssignments {}
     foreach instance $instances {lappend clusterAssignments 0}
     set converged false
     while {!$converged} {
       incr iterations
       set converged true
       #
       # Step A: Find the cluster of each individual
       #
       if {$debug} {
         puts "Step A: Finding the cluster of each individual..."; update
       }
       array unset cluster_instances
       set i 0
       foreach instance $instances assignment $clusterAssignments {
         set new_assignment [clusterInstance $instance]
         if {$new_assignment != $assignment} {
           lset clusterAssignments $i $new_assignment
           set converged false
         }
         lappend cluster_instances($new_assignment) $instance
         if {$debug} {
           puts "  Best cluster: $new_assignment (instance=$instance,\
                   centroid: [lindex $clusterCentroids $new_assignment])"; update
         }
         incr i
       }
       set k [array size cluster_instances]
       #
       # Step B: Update the centroids
       #
       if {$debug} {
         puts "Step B: Updating the centroids (K = $k)..."; update
       }
       set empty_clusters 0
       set clusterCentroids {}
       foreach cluster [lsort -integer [array names cluster_instances]] {
         lappend clusterCentroids [clusterCentroid $cluster_instances($cluster)]
       }
       if {$debug} {
         set i -1
         foreach centroid $clusterCentroids {
           puts "  Cluster [incr i] centroid: $centroid"; update
         }
       }
     };# while {!$converged}
   };# buildClusterer
 
   # initialiseClusterCentroids --
   #    Initialise (randomly) the centroid of each cluster.
   #
   # Arguments:
   #    instances:     A list of instances serving as the training data
   # Result:
   #    -
   #
   method initialiseClusterCentroids {instances} {
     for {set i [expr {[llength $instances]-1}]} {$i >= 0} {incr i -1} {
       set index [expr {int(rand()*($i+1))}]
       set instance [lindex $instances $index]
       if {[lsearch -exact $clusterCentroids $instance] == -1} {
         lappend clusterCentroids $instance
       }
       lset instances $index [lindex $instances $i]
       lset instances $i     $instance
       if {[llength $clusterCentroids] == $K} {break}
     }
   };# initialiseClusterCentroids
 
   # clusterInstance --
   #    Finds and returns the cluster the given instance belong to.
   #
   # Arguments:
   #    instance:     The instance whose cluster is wanted
   # Result:
   #    An integer specifying the cluster
   #
   method clusterInstance {instance} {
     if {$debug > 2} {
       puts "  + clusterInstance: $instance"; update
     }
     set best_cluster 0
     if {$debug > 2} {puts "      Cluster: 0"; update}
     set min_distance [distance $instance [lindex $clusterCentroids 0]]
     set k [llength $clusterCentroids]
     for {set i 1} {$i < $k} {incr i} {
       if {$debug > 2} {puts "      Cluster: $i"; update}
       set dist [distance $instance [lindex $clusterCentroids $i]]
       if {$dist < $min_distance} {
         set best_cluster $i
         set min_distance $dist
       }
     }
     if {$debug > 2} {
       puts "  + clusterInstance: best cluster = $best_cluster"; update
     }
     return $best_cluster
   };# clusterInstance
 
   # clusterCentroid --
   #    Finds and returns the instance that is in the "center" of all given
   #    instances.
   #
   # Arguments:
   #    instances:    The list of instances, one of which should be returned
   # Result:
   #    An instance from the given instances that is the cluster centroid
   #
   method clusterCentroid {instances} {
     set instances_number [llength $instances]
     set centroid [lindex $instances 0]
     if {$instances_number == 1} {return $centroid}
     set min [expr {100*[distance $centroid [lindex $instances end]]}]
     for {set i 0} {$i < $instances_number} {incr i} {
       set instance [lindex $instances $i]
       set mean 0.0
       for {set j 0} {$j < $instances_number} {incr j} {
         if {$i == $j} {continue}
         set mean [expr {$mean + [distance $instance [lindex $instances $j]]}]
       }
       set mean [expr {$mean/$instances_number}]
       if {$mean < $min} {
         set min $mean
         set centroid $instance
       }
     }
     return $centroid
   };# clusterCentroid
 
   # distance --
   #    Returns the distance between two instances.
   #
   # Arguments:
   #    instance1:    The first instance
   #    instance2:    The second instance
   # Result:
   #    A positive real number
   # Notes:
   #    The returned distance does not need to be in the range [0, 1]. The
   #    returned distance can be any positive real number
   #
   method distance {instance1 instance2} {
     if {$debug > 3} {
       puts -nonewline \
         "        Calling \"$distance {$instance1} {$instance2}\": "; update
       set d [uplevel #0 $distance [list $instance1] [list $instance2]]
       puts "Distance: $d"; update
       return $d
     } else {
       return [uplevel #0 $distance [list $instance1] [list $instance2]]
     }
   };# distance
 
   method getClusterAssignments {} {
     return $clusterAssignments
   };# getClusterAssignments
 
 };# class ELEP::MachineLearning::Clustering::SimpleKMeans
 
 package provide ELEP::MachineLearning::Clustering::SimpleKMeans 1.0

Test code:


 source SimpleKMeans.tcl
 catch {console show}

 proc distance {instance1 instance2} {
   foreach {x1 y1} $instance1 {break}
   foreach {x2 y2} $instance2 {break}
   return [expr {sqrt(pow($x1-$x2,2)+pow($y1-$y2,2))}]
 };# distance

 ELEP::MachineLearning::Clustering::SimpleKMeans clusterer \
   -distance distance -K 2 -debug 0

 foreach data { { {1.0 0.0} {1.0 0.1} {0.0 0.0} }
                { {1.0 0.0} {1.0 0.1} {0.0 0.0} {0.0 0.5} {0.01 0.5} } } {
   puts stderr "Performing clustering..."; update
   clusterer buildClusterer $data
   puts "Results:"; update
   foreach point $data cluster [clusterer getClusterAssignments] {
     puts "   Cluster $cluster : $point"
   }
   puts {}
 }

I really want to see a tcllib module with some basic machine learning algorithms.

EKB I'm not sure who wrote the sentence above, but for a starting point there's Playing C4.5 and a little learning decision tree, as well as this page.

AK: I agree, that clustering, bayesian and other machine-learning codes would be very nice to have.

Norbert I wonder if anybody knows a tcl algorithm for cluster analysis starting with a distance matrix (pairwise distances) and not with data points having same sort of coordinates.