Version 0 of Clustering data

Updated 2004-03-22 08:25:26

Arjen Markus (22 march 2004) Cluster analysis is a statistical technique for identifying groups of observations that are somehow "close" together. If you look at aribtrary points in a plane, then most probably you will see small groups of points. To the human eye it is very easy to identify such groups or clusters. For a computer program it is a bit tougher (in the sense: how to precisely define how close two points are together).

Well, here is a first attempt at doing cluster analysis with Tcl. A lot needs to be done still to make it practically useful:

  • Preparing the data: use scaling techniques to normalise the various "coordinates"
  • Weighing factors: select the weight of the various coordinates in the determination of the distance between points
  • Reporting: a procedure is missing to (re)construct the clusters formed at the Nth step
  • Optimisation: the computational loops have not been optimised at all (I was too lazy :)
  • Refactoring the computation: several possibilities to simplify the computational procedures so that custom methods can be introduced to compute the distance or the characteristics of the clusters.

 # cluster.tcl --
 #    Basic cluster analysis:
 #    - A dataset (a list of lists) consists of records describing the data
 #    - Each record is a list of one or more values assumed to be the
 #      coordinates in a vector space.
 #    - The clustering algorithm will successively merge individual
 #      data points into clusters, ending when all data points have been
 #      merged into a single cluster.
 #    - The output consists of a list of lists:
 #      - each list contains a list of data points (indices into the original
 #        list) forming the next cluster
 #      - the second element is the centre of that cluster
 #      - the third element is the radius
 #
 #    Note:
 #    There are many variations possible with respect to cluster
 #    analysis. In this script each cluster is characterised by
 #    the centre of gravity of the points making up the cluster
 #    and the maximum distance to that centre is taken as the radius.
 #    By redefining the procedure that determines the distance
 #    between clusters and points and the procedure that determines
 #    the characteristics of the clusters you can change the clustering
 #    method.
 #
 #

 namespace eval ::Cluster {
    # Force it to exist
 }

 # cluster --
 #    Compute the succession of clusters
 #
 # Arguments:
 #    datapoints       List of lists, each a data point
 # Result:
 #    List of lists describing the clusters that have been formed
 #
 proc ::Cluster::cluster {datapoints} {
    #
    # First step: create clusters consisting of one single point
    #
    set resulting_clusters {}
    set clusters {}

    set idx 0
    foreach point $datapoints {
       set cluster_data [list $idx $point 0.0]
       lappend clusters $cluster_data
       incr idx
    }

    #
    # Second step: determine the minimum distances
    # Note:
    # The work could be halved, but the algorithm would
    # be a bit more complicated. Leave it for now
    #
    set idx 0
    set noclusters [llength $clusters]
    set closest_clusters {}
    foreach cluster $clusters {
       set mindist {}
       for { set i 0 } { $i < $noclusters } { incr i } {
          if { $i != $idx } {
             set other [lindex $clusters $i]
             set dist [distanceClusters $cluster $other]
             if { $mindist == {} || $mindist > $dist } {
                set mindist $dist
                set closest $i
             }
          }
       }
       lappend closest_clusters $idx $closest $mindist
       incr idx
    }

    #
    # Third step:
    # - Determine the minimum distance between two clusters
    # - Join them
    # - Determine the new minimum distances
    # - Continue until only one is left
    #
    while { [llength $clusters] > 1 } {
       set mindist    {}
       set minidx     {}
       set candidates {}
       set curr       0
       foreach {idx closest dist} $closest_clusters {
          if { $mindist == {} || $mindist > $dist } {
             set mindist    $dist
             set minidx     $idx
             set minclosest $closest
          }
          incr curr
       }

       set new_cluster [determineNewCluster $clusters $minidx $minclosest $datapoints]
       set clusters    [lreplace $clusters $minidx $minidx $new_cluster]
       set clusters    [lreplace $clusters $minclosest $minclosest]

       lappend resulting_clusters $new_cluster
       #puts $resulting_clusters
       #puts $clusters

       #
       # Now the new distances!
       # Note:
       # For now a lazy method - just reiterate over all pairs
       #
       set idx 0
       set noclusters [llength $clusters]
       set closest_clusters {}
       foreach cluster $clusters {
          set mindist {}
          for { set i 0 } { $i < $noclusters } { incr i } {
             if { $i != $idx } {
                set other [lindex $clusters $i]
                set dist [distanceClusters $cluster $other]
                if { $mindist == {} || $mindist > $dist } {
                   set mindist $dist
                   set closest $i
                }
             }
          }
          lappend closest_clusters $idx $closest $mindist
          incr idx
       }
    }

    return $resulting_clusters
 }

 # determineNewCluster --
 #    Compute the characteristics of the new cluster
 #
 # Arguments:
 #    clusters      All clusters
 #    idx1          Index of the first cluster
 #    idx2          Index of the second cluster
 #    datapoints    Original data points
 # Result:
 #    The new cluster
 #
 proc ::Cluster::determineNewCluster {clusters idx1 idx2 datapoints} {
    foreach {indices1 centre1 radius1} [lindex $clusters $idx1] {break}
    foreach {indices2 centre2 radius2} [lindex $clusters $idx2] {break}

    #
    # Determine the new centre
    #
    set new_centre {}
    foreach crd $centre1 {
       lappend new_centre 0.0
    }
    set count 0
    foreach idx [concat $indices1 $indices2] {
       set coords [lindex $datapoints $idx]
       set sumcrd {}

       foreach nc $new_centre c $coords {
          set nc [expr {$nc+$c}]
          lappend sumcrd $nc
       }
       set  new_centre $sumcrd
       incr count
    }

    set sumcrd     $new_centre
    set new_centre {}
    foreach nc $sumcrd {
       lappend new_centre [expr {$nc/double($count)}]
    }

    #
    # Determine the radius
    # Note:
    # Here is some room for improvement - other_cluster
    #
    set new_cluster [list [concat $indices1 $indices2] $new_centre 0.0]

    set maxdist 0.0
    foreach idx [lindex $new_cluster 0] {
       set other_cluster [list {} [lindex $datapoints $idx] 0.0]
       set dist [distanceClusters $new_cluster $other_cluster]
       if { $dist > $maxdist } {
          set maxdist $dist
       }
    }

    set new_cluster [lreplace $new_cluster 2 2 $maxdist]
 }

 # distanceCluster --
 #    Compute the distance between two clusters
 #
 # Arguments:
 #    cluster1      Data determining the first cluster
 #    cluster2      Data determining the second cluster
 # Result:
 #    Distance between the clusters
 # Note:
 #    Just passing the centres and the radii will improve
 #    the performance
 #
 proc ::Cluster::distanceClusters {cluster1 cluster2} {
    foreach {indices1 centre1 radius1} $cluster1 {break}
    foreach {indices2 centre2 radius2} $cluster2 {break}

    #
    # Use the Euclidean norm
    #
    set dist 0.0
    foreach crd1 $centre1 crd2 $centre2 {
       set dist [expr {$dist+($crd1-$crd2)*($crd1-$crd2)}]
    }
    set dist [expr {sqrt($dist-$radius1-$radius2)}]
 }

 # main --
 #    Simple tests
 #
 catch {
 console show
 }

 set datapoints { {1.0 0.0} {1.0 0.1} {0.0 0.0} }

 puts [::Cluster::cluster $datapoints]
 puts " "
 set datapoints { {1.0 0.0} {1.0 0.1} {0.0 0.0} {0.0 0.5} {0.01 0.5} }
 set clusters [::Cluster::cluster $datapoints]
 foreach cluster $clusters {
    puts $cluster
 }

 #
 # Visualise the process
 #
 catch {
 package require Tk
 canvas .c -bg white -width 200 -height 200
 pack   .c -fill both

 foreach data $datapoints {
    foreach {x y} $data {break}
    set xcrd [expr {50+100*$x}]
    set ycrd [expr {150-100*$y}]
    .c create oval [expr {$xcrd-2}] [expr {$ycrd-2}] \
                   [expr {$xcrd+2}] [expr {$ycrd+2}] -fill black
 }

 foreach cluster $clusters \
         colour [lrange {cyan green yellow orange red magenta} 0 [llength $clusters]] {
    foreach {ids coords radius} $cluster {break}
    set xcrd   [expr {50+100*[lindex $coords 0]}]
    set ycrd   [expr {150-100*[lindex $coords 1]}]
    set radius [expr {100*$radius}]
    .c lower \
       [.c create oval [expr {$xcrd-$radius}] [expr {$ycrd-$radius}] \
                       [expr {$xcrd+$radius}] [expr {$ycrd+$radius}] -fill $colour]
    #.c create text $xcrd $ycrd -text $ids
 }
 .c create line   0 150 200 150 -fill grey
 .c create line  50   0  50 200 -fill grey
 .c create line   0  50 200  50 -fill grey
 .c create line 150   0 150 200 -fill grey
 } ;# End catch

[ Category Mathematics

Category Numerical Analysis

]