[Arjen Markus] (22 march 2004) Cluster analysis is a statistical technique for identifying groups of observations that are somehow "close" together. If you look at aribtrary points in a plane, then most probably you will see small groups of points. To the human eye it is very easy to identify such groups or ''clusters''. For a computer program it is a bit tougher (in the sense: how to precisely define how close two points are together). Well, here is a first attempt at doing cluster analysis with Tcl. A lot needs to be done still to make it practically useful: * ''Preparing the data:'' use scaling techniques to normalise the various "coordinates" * ''Weighing factors:'' select the weight of the various coordinates in the determination of the distance between points * ''Reporting:'' a procedure is missing to (re)construct the clusters formed at the Nth step * ''Optimisation:'' the computational loops have not been optimised at all (I was too lazy :) * ''Refactoring the computation:'' several possibilities to simplify the computational procedures so that custom methods can be introduced to compute the distance or the characteristics of the clusters. ---- # cluster.tcl -- # Basic cluster analysis: # - A dataset (a list of lists) consists of records describing the data # - Each record is a list of one or more values assumed to be the # coordinates in a vector space. # - The clustering algorithm will successively merge individual # data points into clusters, ending when all data points have been # merged into a single cluster. # - The output consists of a list of lists: # - each list contains a list of data points (indices into the original # list) forming the next cluster # - the second element is the centre of that cluster # - the third element is the radius # # Note: # There are many variations possible with respect to cluster # analysis. In this script each cluster is characterised by # the centre of gravity of the points making up the cluster # and the maximum distance to that centre is taken as the radius. # By redefining the procedure that determines the distance # between clusters and points and the procedure that determines # the characteristics of the clusters you can change the clustering # method. # # namespace eval ::Cluster { # Force it to exist } # cluster -- # Compute the succession of clusters # # Arguments: # datapoints List of lists, each a data point # Result: # List of lists describing the clusters that have been formed # proc ::Cluster::cluster {datapoints} { # # First step: create clusters consisting of one single point # set resulting_clusters {} set clusters {} set idx 0 foreach point $datapoints { set cluster_data [list $idx $point 0.0] lappend clusters $cluster_data incr idx } # # Second step: determine the minimum distances # Note: # The work could be halved, but the algorithm would # be a bit more complicated. Leave it for now # set idx 0 set noclusters [llength $clusters] set closest_clusters {} foreach cluster $clusters { set mindist {} for { set i 0 } { $i < $noclusters } { incr i } { if { $i != $idx } { set other [lindex $clusters $i] set dist [distanceClusters $cluster $other] if { $mindist == {} || $mindist > $dist } { set mindist $dist set closest $i } } } lappend closest_clusters $idx $closest $mindist incr idx } # # Third step: # - Determine the minimum distance between two clusters # - Join them # - Determine the new minimum distances # - Continue until only one is left # while { [llength $clusters] > 1 } { set mindist {} set minidx {} set candidates {} set curr 0 foreach {idx closest dist} $closest_clusters { if { $mindist == {} || $mindist > $dist } { set mindist $dist set minidx $idx set minclosest $closest } incr curr } set new_cluster [determineNewCluster $clusters $minidx $minclosest $datapoints] set clusters [lreplace $clusters $minidx $minidx $new_cluster] set clusters [lreplace $clusters $minclosest $minclosest] lappend resulting_clusters $new_cluster #puts $resulting_clusters #puts $clusters # # Now the new distances! # Note: # For now a lazy method - just reiterate over all pairs # set idx 0 set noclusters [llength $clusters] set closest_clusters {} foreach cluster $clusters { set mindist {} for { set i 0 } { $i < $noclusters } { incr i } { if { $i != $idx } { set other [lindex $clusters $i] set dist [distanceClusters $cluster $other] if { $mindist == {} || $mindist > $dist } { set mindist $dist set closest $i } } } lappend closest_clusters $idx $closest $mindist incr idx } } return $resulting_clusters } # determineNewCluster -- # Compute the characteristics of the new cluster # # Arguments: # clusters All clusters # idx1 Index of the first cluster # idx2 Index of the second cluster # datapoints Original data points # Result: # The new cluster # proc ::Cluster::determineNewCluster {clusters idx1 idx2 datapoints} { foreach {indices1 centre1 radius1} [lindex $clusters $idx1] {break} foreach {indices2 centre2 radius2} [lindex $clusters $idx2] {break} # # Determine the new centre # set new_centre {} foreach crd $centre1 { lappend new_centre 0.0 } set count 0 foreach idx [concat $indices1 $indices2] { set coords [lindex $datapoints $idx] set sumcrd {} foreach nc $new_centre c $coords { set nc [expr {$nc+$c}] lappend sumcrd $nc } set new_centre $sumcrd incr count } set sumcrd $new_centre set new_centre {} foreach nc $sumcrd { lappend new_centre [expr {$nc/double($count)}] } # # Determine the radius # Note: # Here is some room for improvement - other_cluster # set new_cluster [list [concat $indices1 $indices2] $new_centre 0.0] set maxdist 0.0 foreach idx [lindex $new_cluster 0] { set other_cluster [list {} [lindex $datapoints $idx] 0.0] set dist [distanceClusters $new_cluster $other_cluster] if { $dist > $maxdist } { set maxdist $dist } } set new_cluster [lreplace $new_cluster 2 2 $maxdist] } # distanceCluster -- # Compute the distance between two clusters # # Arguments: # cluster1 Data determining the first cluster # cluster2 Data determining the second cluster # Result: # Distance between the clusters # Note: # Just passing the centres and the radii will improve # the performance # proc ::Cluster::distanceClusters {cluster1 cluster2} { foreach {indices1 centre1 radius1} $cluster1 {break} foreach {indices2 centre2 radius2} $cluster2 {break} # # Use the Euclidean norm # set dist 0.0 foreach crd1 $centre1 crd2 $centre2 { set dist [expr {$dist+($crd1-$crd2)*($crd1-$crd2)}] } set dist [expr {sqrt($dist)-$radius1-$radius2}] } # main -- # Simple tests # catch { console show } set datapoints { {1.0 0.0} {1.0 0.1} {0.0 0.0} } puts [::Cluster::cluster $datapoints] puts " " set datapoints { {1.0 0.0} {1.0 0.1} {0.0 0.0} {0.0 0.5} {0.01 0.5} } set clusters [::Cluster::cluster $datapoints] foreach cluster $clusters { puts $cluster } # # Visualise the process # catch { package require Tk canvas .c -bg white -width 200 -height 200 pack .c -fill both foreach data $datapoints { foreach {x y} $data {break} set xcrd [expr {50+100*$x}] set ycrd [expr {150-100*$y}] .c create oval [expr {$xcrd-2}] [expr {$ycrd-2}] \ [expr {$xcrd+2}] [expr {$ycrd+2}] -fill black } foreach cluster $clusters \ colour [lrange {cyan green yellow orange red magenta} 0 [llength $clusters]] { foreach {ids coords radius} $cluster {break} set xcrd [expr {50+100*[lindex $coords 0]}] set ycrd [expr {150-100*[lindex $coords 1]}] set radius [expr {100*$radius}] .c lower \ [.c create oval [expr {$xcrd-$radius}] [expr {$ycrd-$radius}] \ [expr {$xcrd+$radius}] [expr {$ycrd+$radius}] -fill $colour] #.c create text $xcrd $ycrd -text $ids } .c create line 0 150 200 150 -fill grey .c create line 50 0 50 200 -fill grey .c create line 0 50 200 50 -fill grey .c create line 150 0 150 200 -fill grey } ;# End catch ---- [EB]: My first Tcl/Tk application, back in 96 with Tcl 7.6, dealt with ''interactive interpretation of hierachical clustering'' (''screenshot at [http://eric.boudaillier.free.fr/vache.%70ng]''). [mkg]: Running Cluster::cluster on 5 or more randomly generated 2-D points almost always ends with an error as distanceClusters tries to sqrt a negative number. Looks like your cluster-to-cluster centroid distance exceeds the sum of the radii (think of two large clusters overlapping each other almost entirely). This algorithm is a form of agglomerative hierarchical clustering with, I believe, a time complexity of O(n^2) [http://www-csli.stanford.edu/~schuetze/completelink.html] and is widely used in the life sciences (e.g., gene expression analysis). Alternative algorithms (such as K-means) based on partitioning techniques also exist with much better time complexity behaviour and are typically applied to large data sets that hierarchical clustering can't handle (Yes, there are other tradeoffs to be considered too). [AM] Thanks for mentioning the problem with overlapping clusters and the alternative methods. Yes, the whole thing is a mere toy at the moment. And the above implementation is O(n^3) because I re-compute ''all'' distances, rather than update them. My main motivation was to see how it could be done (following a conversation with Peter de Rijk about numerical analysis in general). [AM] Found the bug: it was in the expression with sqrt() - the parenthesis was too far to the left. Fixed it. ---- [[ [Category Mathematics] | [Category Numerical Analysis] ]]