Arjen Markus (22 March 2004) Cluster analysis is a statistical technique for identifying groups of observations that are in some sense "close" together. If you look at arbitrary points in a plane, you will most probably see small groups of points. The human eye identifies such groups or clusters very easily; for a computer program it is a bit tougher, because you have to define precisely what it means for two points to be close together.
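One precise, computer-friendly definition of "close" is the Euclidean distance between two points. A quick sketch (in Python rather than Tcl, with made-up points, purely to illustrate the idea):

```python
import math

def euclidean(p, q):
    """One precise definition of 'how close': the Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Three made-up points: the first two form a visual group, the third sits apart
points = [(1.0, 0.0), (1.0, 0.1), (0.0, 0.0)]
print(euclidean(points[0], points[1]))  # about 0.1: same group
print(euclidean(points[0], points[2]))  # 1.0: clearly separate
```

A clustering algorithm then only has to decide which of these distances count as "small".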
Well, here is a first attempt at doing cluster analysis with Tcl. A lot still needs to be done to make it practically useful:
 # cluster.tcl --
 #    Basic cluster analysis:
 #    - A dataset (a list of lists) consists of records describing the data
 #    - Each record is a list of one or more values assumed to be the
 #      coordinates in a vector space.
 #    - The clustering algorithm will successively merge individual
 #      data points into clusters, ending when all data points have been
 #      merged into a single cluster.
 #    - The output consists of a list of lists:
 #      - each list contains a list of data points (indices into the original
 #        list) forming the next cluster
 #      - the second element is the center of that cluster
 #      - the third element is the radius
 #
 # Note:
 #    There are many variations possible with respect to cluster
 #    analysis. In this script each cluster is characterized by
 #    the center of gravity of the points making up the cluster,
 #    and the maximum distance to that center is taken as the radius.
 #    By redefining the procedure that determines the distance
 #    between clusters and points and the procedure that determines
 #    the characteristics of the clusters you can change the clustering
 #    method.
 #

 namespace eval ::Cluster {
     # Force it to exist
 }

 # cluster --
 #    Compute the succession of clusters
 #
 # Arguments:
 #    datapoints      List of lists, each a data point
 # Result:
 #    List of lists describing the clusters that have been formed
 #
 proc ::Cluster::cluster {datapoints} {
     #
     # First step: create clusters consisting of one single point
     #
     set resulting_clusters {}
     set clusters           {}
     set idx                0
     foreach point $datapoints {
         set cluster_data [list $idx $point 0.0]
         lappend clusters $cluster_data
         incr idx
     }

     #
     # Second step: determine the minimum distances
     # Note:
     #    The work could be halved, but the algorithm would
     #    be a bit more complicated. Leave it for now
     #
     set idx 0
     set noclusters [llength $clusters]
     set closest_clusters {}
     foreach cluster $clusters {
         set mindist {}
         for { set i 0 } { $i < $noclusters } { incr i } {
             if { $i != $idx } {
                 set other [lindex $clusters $i]
                 set dist  [distanceClusters $cluster $other]
                 if { $mindist == {} || $mindist > $dist } {
                     set mindist $dist
                     set closest $i
                 }
             }
         }
         lappend closest_clusters $idx $closest $mindist
         incr idx
     }

     #
     # Third step:
     # - Determine the minimum distance between two clusters
     # - Join them
     # - Determine the new minimum distances
     # - Continue until only one is left
     #
     while { [llength $clusters] > 1 } {
         set mindist    {}
         set minidx     {}
         set candidates {}
         set curr       0
         foreach {idx closest dist} $closest_clusters {
             if { $mindist == {} || $mindist > $dist } {
                 set mindist    $dist
                 set minidx     $idx
                 set minclosest $closest
             }
             incr curr
         }

         set new_cluster [determineNewCluster $clusters $minidx $minclosest $datapoints]
         set clusters    [lreplace $clusters $minidx $minidx $new_cluster]
         set clusters    [lreplace $clusters $minclosest $minclosest]
         lappend resulting_clusters $new_cluster
         #puts $resulting_clusters
         #puts $clusters

         #
         # Now the new distances!
         # Note:
         #    For now a lazy method - just reiterate over all pairs
         #
         set idx 0
         set noclusters [llength $clusters]
         set closest_clusters {}
         foreach cluster $clusters {
             set mindist {}
             for { set i 0 } { $i < $noclusters } { incr i } {
                 if { $i != $idx } {
                     set other [lindex $clusters $i]
                     set dist  [distanceClusters $cluster $other]
                     if { $mindist == {} || $mindist > $dist } {
                         set mindist $dist
                         set closest $i
                     }
                 }
             }
             lappend closest_clusters $idx $closest $mindist
             incr idx
         }
     }
     return $resulting_clusters
 }

 # determineNewCluster --
 #    Compute the characteristics of the new cluster
 #
 # Arguments:
 #    clusters        All clusters
 #    idx1            Index of the first cluster
 #    idx2            Index of the second cluster
 #    datapoints      Original data points
 # Result:
 #    The new cluster
 #
 proc ::Cluster::determineNewCluster {clusters idx1 idx2 datapoints} {
     foreach {indices1 centre1 radius1} [lindex $clusters $idx1] {break}
     foreach {indices2 centre2 radius2} [lindex $clusters $idx2] {break}

     #
     # Determine the new centre
     #
     set new_centre {}
     foreach crd $centre1 {
         lappend new_centre 0.0
     }
     set count 0
     foreach idx [concat $indices1 $indices2] {
         set coords [lindex $datapoints $idx]
         set sumcrd {}
         foreach nc $new_centre c $coords {
             set nc [expr {$nc+$c}]
             lappend sumcrd $nc
         }
         set new_centre $sumcrd
         incr count
     }
     set sumcrd     $new_centre
     set new_centre {}
     foreach nc $sumcrd {
         lappend new_centre [expr {$nc/double($count)}]
     }

     #
     # Determine the radius
     # Note:
     #    Here is some room for improvement - other_cluster
     #
     set new_cluster [list [concat $indices1 $indices2] $new_centre 0.0]
     set maxdist 0.0
     foreach idx [lindex $new_cluster 0] {
         set other_cluster [list {} [lindex $datapoints $idx] 0.0]
         set dist [distanceClusters $new_cluster $other_cluster]
         if { $dist > $maxdist } {
             set maxdist $dist
         }
     }
     set new_cluster [lreplace $new_cluster 2 2 $maxdist]
 }

 # distanceClusters --
 #    Compute the distance between two clusters
 #
 # Arguments:
 #    cluster1        Data determining the first cluster
 #    cluster2        Data determining the second cluster
 # Result:
 #    Distance between the clusters
 # Note:
 #    Just passing the centres and the radii will improve
 #    the performance
 #
 proc ::Cluster::distanceClusters {cluster1 cluster2} {
     foreach {indices1 centre1 radius1} $cluster1 {break}
     foreach {indices2 centre2 radius2} $cluster2 {break}

     #
     # Use the Euclidean norm
     #
     set dist 0.0
     foreach crd1 $centre1 crd2 $centre2 {
         set dist [expr {$dist+($crd1-$crd2)*($crd1-$crd2)}]
     }
     set dist [expr {sqrt($dist)-$radius1-$radius2}]
 }

 # main --
 #    Simple tests
 #
 catch { console show }

 set datapoints { {1.0 0.0} {1.0 0.1} {0.0 0.0} }
 puts [::Cluster::cluster $datapoints]
 puts " "

 set datapoints { {1.0 0.0} {1.0 0.1} {0.0 0.0} {0.0 0.5} {0.01 0.5} }
 set clusters [::Cluster::cluster $datapoints]
 foreach cluster $clusters {
     puts $cluster
 }

 #
 # Visualise the process
 #
 catch {
     package require Tk
     canvas .c -bg white -width 200 -height 200
     pack .c -fill both
     foreach data $datapoints {
         foreach {x y} $data {break}
         set xcrd [expr {50+100*$x}]
         set ycrd [expr {150-100*$y}]
         .c create oval [expr {$xcrd-2}] [expr {$ycrd-2}] \
                        [expr {$xcrd+2}] [expr {$ycrd+2}] -fill black
     }
     foreach cluster $clusters \
             colour [lrange {cyan green yellow orange red magenta} 0 [llength $clusters]] {
         foreach {ids coords radius} $cluster {break}
         set xcrd   [expr {50+100*[lindex $coords 0]}]
         set ycrd   [expr {150-100*[lindex $coords 1]}]
         set radius [expr {100*$radius}]
         .c lower \
            [.c create oval [expr {$xcrd-$radius}] [expr {$ycrd-$radius}] \
                            [expr {$xcrd+$radius}] [expr {$ycrd+$radius}] -fill $colour]
         #.c create text $xcrd $ycrd -text $ids
     }
     .c create line 0 150 200 150 -fill grey
     .c create line 50 0 50 200   -fill grey
     .c create line 0 50 200 50   -fill grey
     .c create line 150 0 150 200 -fill grey
 } ;# End catch
EB: My first Tcl/Tk application, back in '96 with Tcl 7.6, dealt with interactive interpretation of hierarchical clustering (screenshot at [L1 ]).
mkg: Running ::Cluster::cluster on 5 or more randomly generated 2-D points almost always ends with an error, as distanceClusters tries to take the square root of a negative number. It looks like the sum of the radii exceeds your cluster-to-cluster centroid distance (think of two large clusters overlapping each other almost entirely).
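To make this failure mode concrete: when the radii are subtracted inside sqrt() (the bug AM later found), two heavily overlapping clusters produce a negative argument. A small sketch with made-up numbers (in Python rather than Tcl, for illustration only):

```python
import math

# Two hypothetical overlapping clusters: centres only 0.1 apart, radii 0.5 each
c1, r1 = (0.0, 0.0), 0.5
c2, r2 = (0.1, 0.0), 0.5

sq = sum((a - b) ** 2 for a, b in zip(c1, c2))  # squared centre distance

# Buggy grouping: sqrt(sq - r1 - r2) -- the argument is negative, sqrt() fails
print(sq - r1 - r2)                  # about -0.99

# Correct grouping: sqrt(sq) - r1 - r2 -- merely a negative distance,
# which is harmless: it just means the clusters overlap
print(math.sqrt(sq) - r1 - r2)       # about -0.9
```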
This algorithm is a form of agglomerative hierarchical clustering with, I believe, a time complexity of O(n^2) [L2 ], and it is widely used in the life sciences (e.g., gene expression analysis). Alternative algorithms based on partitioning techniques (such as k-means) also exist; they have much better time complexity behaviour and are typically applied to large data sets that hierarchical clustering can't handle (yes, there are other tradeoffs to be considered too).
AM Thanks for mentioning the problem with overlapping clusters and the alternative methods. Yes, the whole thing is a mere toy at the moment. And the above implementation is O(n^3), because I re-compute all distances rather than update them.
My main motivation was to see how it could be done (following a conversation with Peter de Rijk about numerical analysis in general).
AM Found the bug: it was in the expression with sqrt() - the parenthesis was misplaced, so the radii were subtracted inside the sqrt() call instead of from its result. Fixed it.
George Petasis: Here is another clustering facility, which implements k-means clustering (BSD license):
 ##############################################################################
 ##  SimpleKMeans.tcl:                                                      ##
 ##  ---------------------                                                  ##
 ##  This class implements a simple k-means clustering algorithm.           ##
 ##                                                                         ##
 ##  This file is part of the Ellogon Language Engineering Platform.        ##
 ##  Copyright 1998-2007 by:                                                ##
 ##  Georgios Petasis,                                                      ##
 ##  Athens, Greece.                                                        ##
 ##############################################################################

 package require Itcl

 itcl::class ELEP::MachineLearning::Clustering::SimpleKMeans {
   variable clusterAssignments {};# A list holding the cluster for each instance
   variable clusterCentroids   {};# A list holding the cluster centroids
   variable iterations;           # Keep track of the number of iterations
                                  # completed before convergence
   public {
     variable distance       {};  # A command prefix for returning the distance
                                  # between two instances
     variable random_seed    10;  # The seed we set to random
     variable K               2;  # Number of clusters to generate
     variable debug           0
   }

   constructor {args} {
     eval configure $args
   };# constructor

   # buildClusterer --
   #   Generates a clusterer from a set of training data. The produced
   #   clusterer can be used to classify new instances.
   #
   # Arguments:
   #   instances: A list of instances serving as the training data
   # Result:
   #   -
   method buildClusterer {instances} {
     set k $K
     expr {srand($random_seed)}
     set iterations 0
     #
     # Step 1: Select randomly k instances as the centroids of the k clusters
     #
     if {$debug} {puts "buildClusterer: K=$k"; update}
     set clusterCentroids {}
     initialiseClusterCentroids $instances
     set k [llength $clusterCentroids]
     if {$debug} {
       puts "clusterCentroids:"
       for {set i 0} {$i < $k} {incr i} {
         puts "  k=$i, centroid: [lindex $clusterCentroids $i]"
       }
       update
     }
     #
     # Step 2: Run the k-means algorithm
     #
     set clusterAssignments {}
     foreach instance $instances {lappend clusterAssignments 0}
     set converged false
     while {!$converged} {
       incr iterations
       set converged true
       #
       # Step A: Find the cluster of each individual
       #
       if {$debug} {
         puts "Step A: Finding the cluster of each individual..."; update
       }
       array unset cluster_instances
       set i 0
       foreach instance $instances assignment $clusterAssignments {
         set new_assignment [clusterInstance $instance]
         if {$new_assignment != $assignment} {
           lset clusterAssignments $i $new_assignment
           set converged false
         }
         lappend cluster_instances($new_assignment) $instance
         if {$debug} {
           puts "  Best cluster: $new_assignment (instance=$instance,\
                 centroid: [lindex $clusterCentroids $new_assignment])"; update
         }
         incr i
       }
       set k [array size cluster_instances]
       #
       # Step B: Update the centroids
       #
       if {$debug} {
         puts "Step B: Updating the centroids (K = $k)..."; update
       }
       set empty_clusters 0
       set clusterCentroids {}
       foreach cluster [lsort -integer [array names cluster_instances]] {
         lappend clusterCentroids [clusterCentroid $cluster_instances($cluster)]
       }
       if {$debug} {
         set i -1
         foreach centroid $clusterCentroids {
           puts "  Cluster [incr i] centroid: $centroid"; update
         }
       }
     };# while {!$converged}
   };# buildClusterer

   # initialiseClusterCentroids --
   #   Initialise (randomly) the centroid of each cluster.
   #
   # Arguments:
   #   instances: A list of instances serving as the training data
   # Result:
   #   -
   method initialiseClusterCentroids {instances} {
     for {set i [expr {[llength $instances]-1}]} {$i >= 0} {incr i -1} {
       set index    [expr {int(rand()*($i+1))}]
       set instance [lindex $instances $index]
       if {[lsearch -exact $clusterCentroids $instance] == -1} {
         lappend clusterCentroids $instance
       }
       lset instances $index [lindex $instances $i]
       lset instances $i $instance
       if {[llength $clusterCentroids] == $K} {break}
     }
   };# initialiseClusterCentroids

   # clusterInstance --
   #   Finds and returns the cluster the given instance belongs to.
   #
   # Arguments:
   #   instance: The instance whose cluster is wanted
   # Result:
   #   An integer specifying the cluster
   method clusterInstance {instance} {
     if {$debug > 2} {
       puts "  + clusterInstance: $instance"; update
     }
     set best_cluster 0
     if {$debug > 2} {puts "    Cluster: 0"; update}
     set min_distance [distance $instance [lindex $clusterCentroids 0]]
     set k [llength $clusterCentroids]
     for {set i 1} {$i < $k} {incr i} {
       if {$debug > 2} {puts "    Cluster: $i"; update}
       set dist [distance $instance [lindex $clusterCentroids $i]]
       if {$dist < $min_distance} {
         set best_cluster $i
         set min_distance $dist
       }
     }
     if {$debug > 2} {
       puts "  + clusterInstance: best cluster = $best_cluster"; update
     }
     return $best_cluster
   };# clusterInstance

   # clusterCentroid --
   #   Finds and returns the instance that is in the "center" of all given
   #   instances.
   #
   # Arguments:
   #   instances: The list of instances, one of which should be returned
   # Result:
   #   An instance from the given instances that is the cluster centroid
   method clusterCentroid {instances} {
     set instances_number [llength $instances]
     set centroid [lindex $instances 0]
     if {$instances_number == 1} {return $centroid}
     set min [expr {100*[distance $centroid [lindex $instances end]]}]
     for {set i 0} {$i < $instances_number} {incr i} {
       set instance [lindex $instances $i]
       set mean 0.0
       for {set j 0} {$j < $instances_number} {incr j} {
         if {$i == $j} {continue}
         set mean [expr {$mean + [distance $instance [lindex $instances $j]]}]
       }
       set mean [expr {$mean/$instances_number}]
       if {$mean < $min} {
         set min $mean
         set centroid $instance
       }
     }
     return $centroid
   };# clusterCentroid

   # distance --
   #   Returns the distance between two instances.
   #
   # Arguments:
   #   instance1: The first instance
   #   instance2: The second instance
   # Result:
   #   A positive real number
   # Notes:
   #   The returned distance does not need to be in the range [0, 1]. The
   #   returned distance can be any positive real number
   method distance {instance1 instance2} {
     if {$debug > 3} {
       puts -nonewline \
           "  Calling \"$distance {$instance1} {$instance2}\": "; update
       set d [uplevel #0 $distance [list $instance1] [list $instance2]]
       puts "Distance: $d"; update
       return $d
     } else {
       return [uplevel #0 $distance [list $instance1] [list $instance2]]
     }
   };# distance

   method getClusterAssignments {} {
     return $clusterAssignments
   };# getClusterAssignments

 };# class ELEP::MachineLearning::Clustering::SimpleKMeans

 package provide ELEP::MachineLearning::Clustering::SimpleKMeans 1.0
Test code:
 source SimpleKMeans.tcl
 catch {console show}

 proc distance {instance1 instance2} {
   foreach {x1 y1} $instance1 {break}
   foreach {x2 y2} $instance2 {break}
   return [expr {sqrt(pow($x1-$x2,2)+pow($y1-$y2,2))}]
 };# distance

 ELEP::MachineLearning::Clustering::SimpleKMeans clusterer \
     -distance distance -K 2 -debug 0

 foreach data {
   { {1.0 0.0} {1.0 0.1} {0.0 0.0} }
   { {1.0 0.0} {1.0 0.1} {0.0 0.0} {0.0 0.5} {0.01 0.5} }
 } {
   puts stderr "Performing clustering..."; update
   clusterer buildClusterer $data
   puts "Results:"; update
   foreach point $data cluster [clusterer getClusterAssignments] {
     puts "  Cluster $cluster : $point"
   }
   puts {}
 }
I really want to see a tcllib module with some basic machine learning algorithms.
EKB I'm not sure who wrote the sentence above, but for a starting point there's Playing C4.5 and a little learning decision tree, as well as this page.
AK: I agree that clustering, Bayesian, and other machine-learning code would be very nice to have.
Norbert I wonder if anybody knows a Tcl algorithm for cluster analysis that starts with a distance matrix (pairwise distances) rather than with data points that have some sort of coordinates.
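For what it's worth, agglomerative hierarchical clustering needs only the pairwise distances, not coordinates: at each step, merge the two clusters whose distance is smallest, where the cluster-to-cluster distance is derived from the matrix itself (here single linkage, i.e. the minimum over member pairs). A hypothetical sketch (in Python rather than Tcl, purely to illustrate the idea; the matrix is made up):

```python
def single_linkage(dist):
    """dist: symmetric n x n matrix of pairwise distances.
    Returns the merges performed, as (members_a, members_b, distance)."""
    clusters = [[i] for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest single-linkage distance
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[2]:
                    best = (a, b, d)
        a, b, d = best
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

# Example: 4 items for which only pairwise distances are known
dist = [
    [0.0, 0.1, 1.0, 1.1],
    [0.1, 0.0, 0.9, 1.0],
    [1.0, 0.9, 0.0, 0.2],
    [1.1, 1.0, 0.2, 0.0],
]
for a, b, d in single_linkage(dist):
    print(a, b, d)   # merges items 0+1, then 2+3, then the two pairs
```

The same recipe carries over to Tcl directly; only the distance lookup changes from "compute from coordinates" to "read from the matrix".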