Version 7 of Clustering data

Updated 2007-10-04 18:48:42 by petasis

Arjen Markus (22 march 2004) Cluster analysis is a statistical technique for identifying groups of observations that are somehow "close" together. If you look at aribtrary points in a plane, then most probably you will see small groups of points. To the human eye it is very easy to identify such groups or clusters. For a computer program it is a bit tougher (in the sense: how to precisely define how close two points are together).

Well, here is a first attempt at doing cluster analysis with Tcl. A lot needs to be done still to make it practically useful:

  • Preparing the data: use scaling techniques to normalise the various "coordinates"
  • Weighing factors: select the weight of the various coordinates in the determination of the distance between points
  • Reporting: a procedure is missing to (re)construct the clusters formed at the Nth step
  • Optimisation: the computational loops have not been optimised at all (I was too lazy :)
  • Refactoring the computation: several possibilities to simplify the computational procedures so that custom methods can be introduced to compute the distance or the characteristics of the clusters.

 # cluster.tcl --
 #    Basic cluster analysis:
 #    - A dataset (a list of lists) consists of records describing the data
 #    - Each record is a list of one or more values assumed to be the
 #      coordinates in a vector space.
 #    - The clustering algorithm will successively merge individual
 #      data points into clusters, ending when all data points have been
 #      merged into a single cluster.
 #    - The output consists of a list of lists:
 #      - each list contains a list of data points (indices into the original
 #        list) forming the next cluster
 #      - the second element is the centre of that cluster
 #      - the third element is the radius
 #
 #    Note:
 #    There are many variations possible with respect to cluster
 #    analysis. In this script each cluster is characterised by
 #    the centre of gravity of the points making up the cluster
 #    and the maximum distance to that centre is taken as the radius.
 #    By redefining the procedure that determines the distance
 #    between clusters and points and the procedure that determines
 #    the characteristics of the clusters you can change the clustering
 #    method.
 #
 #

 namespace eval ::Cluster {
    # Force it to exist
 }

 # cluster --
 #    Compute the succession of clusters
 #
 # Arguments:
 #    datapoints       List of lists, each a data point
 # Result:
 #    List of lists describing the clusters that have been formed
 #
 proc ::Cluster::cluster {datapoints} {
    #
    # First step: create clusters consisting of one single point
    #
    set resulting_clusters {}
    set clusters {}

    set idx 0
    foreach point $datapoints {
       set cluster_data [list $idx $point 0.0]
       lappend clusters $cluster_data
       incr idx
    }

    #
    # Second step: determine the minimum distances
    # Note:
    # The work could be halved, but the algorithm would
    # be a bit more complicated. Leave it for now
    #
    set idx 0
    set noclusters [llength $clusters]
    set closest_clusters {}
    foreach cluster $clusters {
       set mindist {}
       for { set i 0 } { $i < $noclusters } { incr i } {
          if { $i != $idx } {
             set other [lindex $clusters $i]
             set dist [distanceClusters $cluster $other]
             if { $mindist == {} || $mindist > $dist } {
                set mindist $dist
                set closest $i
             }
          }
       }
       lappend closest_clusters $idx $closest $mindist
       incr idx
    }

    #
    # Third step:
    # - Determine the minimum distance between two clusters
    # - Join them
    # - Determine the new minimum distances
    # - Continue until only one is left
    #
    while { [llength $clusters] > 1 } {
       set mindist    {}
       set minidx     {}
       set candidates {}
       set curr       0
       foreach {idx closest dist} $closest_clusters {
          if { $mindist == {} || $mindist > $dist } {
             set mindist    $dist
             set minidx     $idx
             set minclosest $closest
          }
          incr curr
       }

       set new_cluster [determineNewCluster $clusters $minidx $minclosest $datapoints]
       set clusters    [lreplace $clusters $minidx $minidx $new_cluster]
       set clusters    [lreplace $clusters $minclosest $minclosest]

       lappend resulting_clusters $new_cluster
       #puts $resulting_clusters
       #puts $clusters

       #
       # Now the new distances!
       # Note:
       # For now a lazy method - just reiterate over all pairs
       #
       set idx 0
       set noclusters [llength $clusters]
       set closest_clusters {}
       foreach cluster $clusters {
          set mindist {}
          for { set i 0 } { $i < $noclusters } { incr i } {
             if { $i != $idx } {
                set other [lindex $clusters $i]
                set dist [distanceClusters $cluster $other]
                if { $mindist == {} || $mindist > $dist } {
                   set mindist $dist
                   set closest $i
                }
             }
          }
          lappend closest_clusters $idx $closest $mindist
          incr idx
       }
    }

    return $resulting_clusters
 }

 # determineNewCluster --
 #    Compute the characteristics of the new cluster
 #
 # Arguments:
 #    clusters      All clusters
 #    idx1          Index of the first cluster
 #    idx2          Index of the second cluster
 #    datapoints    Original data points
 # Result:
 #    The new cluster
 #
 proc ::Cluster::determineNewCluster {clusters idx1 idx2 datapoints} {
    foreach {indices1 centre1 radius1} [lindex $clusters $idx1] {break}
    foreach {indices2 centre2 radius2} [lindex $clusters $idx2] {break}

    #
    # Determine the new centre
    #
    set new_centre {}
    foreach crd $centre1 {
       lappend new_centre 0.0
    }
    set count 0
    foreach idx [concat $indices1 $indices2] {
       set coords [lindex $datapoints $idx]
       set sumcrd {}

       foreach nc $new_centre c $coords {
          set nc [expr {$nc+$c}]
          lappend sumcrd $nc
       }
       set  new_centre $sumcrd
       incr count
    }

    set sumcrd     $new_centre
    set new_centre {}
    foreach nc $sumcrd {
       lappend new_centre [expr {$nc/double($count)}]
    }

    #
    # Determine the radius
    # Note:
    # Here is some room for improvement - other_cluster
    #
    set new_cluster [list [concat $indices1 $indices2] $new_centre 0.0]

    set maxdist 0.0
    foreach idx [lindex $new_cluster 0] {
       set other_cluster [list {} [lindex $datapoints $idx] 0.0]
       set dist [distanceClusters $new_cluster $other_cluster]
       if { $dist > $maxdist } {
          set maxdist $dist
       }
    }

    set new_cluster [lreplace $new_cluster 2 2 $maxdist]
 }

 # distanceCluster --
 #    Compute the distance between two clusters
 #
 # Arguments:
 #    cluster1      Data determining the first cluster
 #    cluster2      Data determining the second cluster
 # Result:
 #    Distance between the clusters
 # Note:
 #    Just passing the centres and the radii will improve
 #    the performance
 #
 proc ::Cluster::distanceClusters {cluster1 cluster2} {
    foreach {indices1 centre1 radius1} $cluster1 {break}
    foreach {indices2 centre2 radius2} $cluster2 {break}

    #
    # Use the Euclidean norm
    #
    set dist 0.0
    foreach crd1 $centre1 crd2 $centre2 {
       set dist [expr {$dist+($crd1-$crd2)*($crd1-$crd2)}]
    }
    set dist [expr {sqrt($dist)-$radius1-$radius2}]
 }

 # main --
 #    Simple tests
 #
 catch {
 console show
 }

 set datapoints { {1.0 0.0} {1.0 0.1} {0.0 0.0} }

 puts [::Cluster::cluster $datapoints]
 puts " "
 set datapoints { {1.0 0.0} {1.0 0.1} {0.0 0.0} {0.0 0.5} {0.01 0.5} }
 set clusters [::Cluster::cluster $datapoints]
 foreach cluster $clusters {
    puts $cluster
 }

 #
 # Visualise the process
 #
 catch {
 package require Tk
 canvas .c -bg white -width 200 -height 200
 pack   .c -fill both

 foreach data $datapoints {
    foreach {x y} $data {break}
    set xcrd [expr {50+100*$x}]
    set ycrd [expr {150-100*$y}]
    .c create oval [expr {$xcrd-2}] [expr {$ycrd-2}] \
                   [expr {$xcrd+2}] [expr {$ycrd+2}] -fill black
 }

 foreach cluster $clusters \
         colour [lrange {cyan green yellow orange red magenta} 0 [llength $clusters]] {
    foreach {ids coords radius} $cluster {break}
    set xcrd   [expr {50+100*[lindex $coords 0]}]
    set ycrd   [expr {150-100*[lindex $coords 1]}]
    set radius [expr {100*$radius}]
    .c lower \
       [.c create oval [expr {$xcrd-$radius}] [expr {$ycrd-$radius}] \
                       [expr {$xcrd+$radius}] [expr {$ycrd+$radius}] -fill $colour]
    #.c create text $xcrd $ycrd -text $ids
 }
 .c create line   0 150 200 150 -fill grey
 .c create line  50   0  50 200 -fill grey
 .c create line   0  50 200  50 -fill grey
 .c create line 150   0 150 200 -fill grey
 } ;# End catch

EB: My first Tcl/Tk application, back in 96 with Tcl 7.6, dealt with interactive interpretation of hierachical clustering (screenshot at [L1 ]).

mkg: Running Cluster::cluster on 5 or more randomly generated 2-D points almost always ends with an error as distanceClusters tries to sqrt a negative number. Looks like your cluster-to-cluster centroid distance exceeds the sum of the radii (think of two large clusters overlapping each other almost entirely).

This algorithm is a form of agglomerative hierarchical clustering with, I believe, a time complexity of O(n^2) [L2 ] and is widely used in the life sciences (e.g., gene expression analysis). Alternative algorithms (such as K-means) based on partitioning techniques also exist with much better time complexity behaviour and are typically applied to large data sets that hierarchical clustering can't handle (Yes, there are other tradeoffs to be considered too).

AM Thanks for mentioning the problem with overlapping clusters and the alternative methods. Yes, the whole thing is a mere toy at the moment. And the above implementation is O(n^3) because I re-compute all distances, rather than update them.

My main motivation was to see how it could be done (following a conversation with Peter de Rijk about numerical analysis in general).

AM Found the bug: it was in the expression with sqrt() - the parenthesis was too far to the left. Fixed it.


George Petasis: Here is another clustering facility that implements k-means clustering. BSD license... { ## SimpleKMeans.tcl: ## ## --------------------- ## ## This class implements a simple k-means clustering algorithm. ## ## ## ## This file is part of the Ellogon Language Engineering Platform. ## ## Copyright 1998-2007 by: ## ## Georgios Petasis, ## ## Athens, Greece. ##

package require Itcl

itcl::class ELEP::MachineLearning::Clustering::SimpleKMeans {

  variable clusterAssignments {}; # A list holding the cluster for each instance
  variable clusterCentroids   {}; # A list holding the cluster centroids
  variable iterations;            # Keep track of the number of iterations
                                  # completed before convergence
  public {
    variable distance         {}; # A command prefix for returning the distance
                                  # between two instances
    variable random_seed      10; # The seed we set to random
    variable K                 2; # Number of clusters to generate
    variable debug             0
  }



  constructor {args} {
    eval configure $args
  };# constructor

  # buildClusterer --
  #    Generates a clusterer from a set of training data. The produced
  #    clusterer can be used to clasify new instances.
  #
  # Arguments:
  #    instances:    A list of instances serving as the training data
  # Result:
  #    -
  #
  method buildClusterer {instances} {
    set k $K
    expr {srand($random_seed)}
    set iterations 0
    #
    # Step 1: Select randomly k instances as the centroids of the k clusters
    #
    if {$debug} {puts "buildClusterer: K=$k"; update}
    set clusterCentroids {}
    initialiseClusterCentroids $instances
    set k [llength $clusterCentroids]
    if {$debug} {
      puts "clusterCentroids:"
      for {set i 0} {$i < $k} {incr i} {
        puts "  k=$i, centroid: [lindex $clusterCentroids $i]"
      }
      update
    }
    #
    # Step 2: Run the k-means algorithm
    #
    set clusterAssignments {}
    foreach instance $instances {lappend clusterAssignments 0}
    set converged false
    while {!$converged} {
      incr iterations
      set converged true
      #
      # Step A: Find the cluster of each individual
      #
      if {$debug} {
        puts "Step A: Finding the cluster of each individual..."; update
      }
      array unset cluster_instances
      set i 0
      foreach instance $instances assignment $clusterAssignments {
        set new_assignment [clusterInstance $instance]
        if {$new_assignment != $assignment} {
          lset clusterAssignments $i $new_assignment
          set converged false
        }
        lappend cluster_instances($new_assignment) $instance
        if {$debug} {
          puts "  Best cluster: $new_assignment (instance=$instance,\
                  centroid: [lindex $clusterCentroids $new_assignment])"; update
        }
        incr i
      }
      set k [array size cluster_instances]
      #
      # Step B: Update the centroids
      #
      if {$debug} {
        puts "Step B: Updating the centroids (K = $k)..."; update
      }
      set empty_clusters 0
      set clusterCentroids {}
      foreach cluster [lsort -integer [array names cluster_instances]] {
        lappend clusterCentroids [clusterCentroid $cluster_instances($cluster)]
      }
      if {$debug} {
        set i -1
        foreach centroid $clusterCentroids {
          puts "  Cluster [incr i] centroid: $centroid"; update
        }
      }
    };# while {!$converged}
  };# buildClusterer

  # initialiseClusterCentroids --
  #    Initialise (randomly) the centroid of each cluster.
  #
  # Arguments:
  #    instances:     A list of instances serving as the training data
  # Result:
  #    -
  #
  method initialiseClusterCentroids {instances} {
    for {set i [expr {[llength $instances]-1}]} {$i >= 0} {incr i -1} {
      set index [expr {int(rand()*($i+1))}]
      set instance [lindex $instances $index]
      if {[lsearch -exact $clusterCentroids $instance] == -1} {
        lappend clusterCentroids $instance
      }
      lset instances $index [lindex $instances $i]
      lset instances $i     $instance
      if {[llength $clusterCentroids] == $K} {break}
    }
  };# initialiseClusterCentroids

  # clusterInstance --
  #    Finds and returns the cluster the given instance belong to.
  #
  # Arguments:
  #    instance:     The instance whose cluster is wanted
  # Result:
  #    An integer specifying the cluster
  #
  method clusterInstance {instance} {
    if {$debug > 2} {
      puts "  + clusterInstance: $instance"; update
    }
    set best_cluster 0
    if {$debug > 2} {puts "      Cluster: 0"; update}
    set min_distance [distance $instance [lindex $clusterCentroids 0]]
    set k [llength $clusterCentroids]
    for {set i 1} {$i < $k} {incr i} {
      if {$debug > 2} {puts "      Cluster: $i"; update}
      set dist [distance $instance [lindex $clusterCentroids $i]]
      if {$dist < $min_distance} {
        set best_cluster $i
        set min_distance $dist
      }
    }
    if {$debug > 2} {
      puts "  + clusterInstance: best cluster = $best_cluster"; update
    }
    return $best_cluster
  };# clusterInstance

  # clusterCentroid --
  #    Finds and returns the instance that is in the "center" of all given
  #    instances.
  #
  # Arguments:
  #    instances:    The list of instances, one of which should be returned
  # Result:
  #    An instance from the given instances that is the cluster centroid
  #
  method clusterCentroid {instances} {
    set instances_number [llength $instances]
    set centroid [lindex $instances 0]
    if {$instances_number == 1} {return $centroid}
    set min [expr {100*[distance $centroid [lindex $instances end]]}]
    for {set i 0} {$i < $instances_number} {incr i} {
      set instance [lindex $instances $i]
      set mean 0.0
      for {set j 0} {$j < $instances_number} {incr j} {
        if {$i == $j} {continue}
        set mean [expr {$mean + [distance $instance [lindex $instances $j]]}]
      }
      set mean [expr {$mean/$instances_number}]
      if {$mean < $min} {
        set min $mean
        set centroid $instance
      }
    }
    return $centroid
  };# clusterCentroid

  # distance --
  #    Returns the distance between two instances.
  #
  # Arguments:
  #    instance1:    The first instance
  #    instance2:    The second instance
  # Result:
  #    A positive real number
  # Notes:
  #    The returned distance does not need to be in the range [0, 1]. The
  #    returned distance can be any positive real number
  #
  method distance {instance1 instance2} {
    if {$debug > 3} {
      puts -nonewline \
        "        Calling \"$distance {$instance1} {$instance2}\": "; update
      set d [uplevel #0 $distance [list $instance1] [list $instance2]]
      puts "Distance: $d"; update
      return $d
    } else {
      return [uplevel #0 $distance [list $instance1] [list $instance2]]
    }
  };# distance

  method getClusterAssignments {} {
    return $clusterAssignments
  };# getClusterAssignments

};# class ELEP::MachineLearning::Clustering::SimpleKMeans

package provide ELEP::MachineLearning::Clustering::SimpleKMeans 1.0}


Test code: {source SimpleKMeans.tcl catch {console show}

proc distance {instance1 instance2} {

  foreach {x1 y1} $instance1 {break}
  foreach {x2 y2} $instance2 {break}
  return [expr {sqrt(pow($x1-$x2,2)+pow($y1-$y2,2))}]

};# distance

ELEP::MachineLearning::Clustering::SimpleKMeans clusterer \

  -distance distance -K 2 -debug 0

foreach data { { {1.0 0.0} {1.0 0.1} {0.0 0.0} }

               { {1.0 0.0} {1.0 0.1} {0.0 0.0} {0.0 0.5} {0.01 0.5} } } {
  puts stderr "Performing clustering..."; update
  clusterer buildClusterer $data
  puts "Results:"; update
  foreach point $data cluster [clusterer getClusterAssignments] {
    puts "   Cluster $cluster : $point"
  }
  puts {}

}}

I really want to see a tcllib module with some basic machine learning algorithms.


[ Category Mathematics

Category Numerical Analysis

]