[Arjen Markus] (22 march 2004) Cluster analysis is a statistical technique for identifying groups of observations that are somehow "close" together. If you look at aribtrary points in a plane, then most probably you will see small groups of points. To the human eye it is very easy to identify such groups or ''clusters''. For a computer program it is a bit tougher (in the sense: how to precisely define how close two points are together). Well, here is a first attempt at doing cluster analysis with Tcl. A lot needs to be done still to make it practically useful: * ''Preparing the data:'' use scaling techniques to normalise the various "coordinates" * ''Weighing factors:'' select the weight of the various coordinates in the determination of the distance between points * ''Reporting:'' a procedure is missing to (re)construct the clusters formed at the Nth step * ''Optimisation:'' the computational loops have not been optimised at all (I was too lazy :) * ''Refactoring the computation:'' several possibilities to simplify the computational procedures so that custom methods can be introduced to compute the distance or the characteristics of the clusters. ---- # cluster.tcl -- # Basic cluster analysis: # - A dataset (a list of lists) consists of records describing the data # - Each record is a list of one or more values assumed to be the # coordinates in a vector space. # - The clustering algorithm will successively merge individual # data points into clusters, ending when all data points have been # merged into a single cluster. # - The output consists of a list of lists: # - each list contains a list of data points (indices into the original # list) forming the next cluster # - the second element is the centre of that cluster # - the third element is the radius # # Note: # There are many variations possible with respect to cluster # analysis. In this script each cluster is characterised by # the centre of gravity of the points making up the cluster # and the maximum distance to that centre is taken as the radius. # By redefining the procedure that determines the distance # between clusters and points and the procedure that determines # the characteristics of the clusters you can change the clustering # method. # # namespace eval ::Cluster { # Force it to exist } # cluster -- # Compute the succession of clusters # # Arguments: # datapoints List of lists, each a data point # Result: # List of lists describing the clusters that have been formed # proc ::Cluster::cluster {datapoints} { # # First step: create clusters consisting of one single point # set resulting_clusters {} set clusters {} set idx 0 foreach point $datapoints { set cluster_data [list $idx $point 0.0] lappend clusters $cluster_data incr idx } # # Second step: determine the minimum distances # Note: # The work could be halved, but the algorithm would # be a bit more complicated. Leave it for now # set idx 0 set noclusters [llength $clusters] set closest_clusters {} foreach cluster $clusters { set mindist {} for { set i 0 } { $i < $noclusters } { incr i } { if { $i != $idx } { set other [lindex $clusters $i] set dist [distanceClusters $cluster $other] if { $mindist == {} || $mindist > $dist } { set mindist $dist set closest $i } } } lappend closest_clusters $idx $closest $mindist incr idx } # # Third step: # - Determine the minimum distance between two clusters # - Join them # - Determine the new minimum distances # - Continue until only one is left # while { [llength $clusters] > 1 } { set mindist {} set minidx {} set candidates {} set curr 0 foreach {idx closest dist} $closest_clusters { if { $mindist == {} || $mindist > $dist } { set mindist $dist set minidx $idx set minclosest $closest } incr curr } set new_cluster [determineNewCluster $clusters $minidx $minclosest $datapoints] set clusters [lreplace $clusters $minidx $minidx $new_cluster] set clusters [lreplace $clusters $minclosest $minclosest] lappend resulting_clusters $new_cluster #puts $resulting_clusters #puts $clusters # # Now the new distances! # Note: # For now a lazy method - just reiterate over all pairs # set idx 0 set noclusters [llength $clusters] set closest_clusters {} foreach cluster $clusters { set mindist {} for { set i 0 } { $i < $noclusters } { incr i } { if { $i != $idx } { set other [lindex $clusters $i] set dist [distanceClusters $cluster $other] if { $mindist == {} || $mindist > $dist } { set mindist $dist set closest $i } } } lappend closest_clusters $idx $closest $mindist incr idx } } return $resulting_clusters } # determineNewCluster -- # Compute the characteristics of the new cluster # # Arguments: # clusters All clusters # idx1 Index of the first cluster # idx2 Index of the second cluster # datapoints Original data points # Result: # The new cluster # proc ::Cluster::determineNewCluster {clusters idx1 idx2 datapoints} { foreach {indices1 centre1 radius1} [lindex $clusters $idx1] {break} foreach {indices2 centre2 radius2} [lindex $clusters $idx2] {break} # # Determine the new centre # set new_centre {} foreach crd $centre1 { lappend new_centre 0.0 } set count 0 foreach idx [concat $indices1 $indices2] { set coords [lindex $datapoints $idx] set sumcrd {} foreach nc $new_centre c $coords { set nc [expr {$nc+$c}] lappend sumcrd $nc } set new_centre $sumcrd incr count } set sumcrd $new_centre set new_centre {} foreach nc $sumcrd { lappend new_centre [expr {$nc/double($count)}] } # # Determine the radius # Note: # Here is some room for improvement - other_cluster # set new_cluster [list [concat $indices1 $indices2] $new_centre 0.0] set maxdist 0.0 foreach idx [lindex $new_cluster 0] { set other_cluster [list {} [lindex $datapoints $idx] 0.0] set dist [distanceClusters $new_cluster $other_cluster] if { $dist > $maxdist } { set maxdist $dist } } set new_cluster [lreplace $new_cluster 2 2 $maxdist] } # distanceCluster -- # Compute the distance between two clusters # # Arguments: # cluster1 Data determining the first cluster # cluster2 Data determining the second cluster # Result: # Distance between the clusters # Note: # Just passing the centres and the radii will improve # the performance # proc ::Cluster::distanceClusters {cluster1 cluster2} { foreach {indices1 centre1 radius1} $cluster1 {break} foreach {indices2 centre2 radius2} $cluster2 {break} # # Use the Euclidean norm # set dist 0.0 foreach crd1 $centre1 crd2 $centre2 { set dist [expr {$dist+($crd1-$crd2)*($crd1-$crd2)}] } set dist [expr {sqrt($dist-$radius1-$radius2)}] } # main -- # Simple tests # catch { console show } set datapoints { {1.0 0.0} {1.0 0.1} {0.0 0.0} } puts [::Cluster::cluster $datapoints] puts " " set datapoints { {1.0 0.0} {1.0 0.1} {0.0 0.0} {0.0 0.5} {0.01 0.5} } set clusters [::Cluster::cluster $datapoints] foreach cluster $clusters { puts $cluster } # # Visualise the process # catch { package require Tk canvas .c -bg white -width 200 -height 200 pack .c -fill both foreach data $datapoints { foreach {x y} $data {break} set xcrd [expr {50+100*$x}] set ycrd [expr {150-100*$y}] .c create oval [expr {$xcrd-2}] [expr {$ycrd-2}] \ [expr {$xcrd+2}] [expr {$ycrd+2}] -fill black } foreach cluster $clusters \ colour [lrange {cyan green yellow orange red magenta} 0 [llength $clusters]] { foreach {ids coords radius} $cluster {break} set xcrd [expr {50+100*[lindex $coords 0]}] set ycrd [expr {150-100*[lindex $coords 1]}] set radius [expr {100*$radius}] .c lower \ [.c create oval [expr {$xcrd-$radius}] [expr {$ycrd-$radius}] \ [expr {$xcrd+$radius}] [expr {$ycrd+$radius}] -fill $colour] #.c create text $xcrd $ycrd -text $ids } .c create line 0 150 200 150 -fill grey .c create line 50 0 50 200 -fill grey .c create line 0 50 200 50 -fill grey .c create line 150 0 150 200 -fill grey } ;# End catch ---- [[ [Category Mathematics] | [Category Numerical Analysis] ]]