Arjen Markus (24 july 2018) Harm Olthoff asked about a Tcl implementation of the so-called Wasserstein distance or "Earth Mover's Distance" as a measure for the discrepancy between two probability distributions. While the full definition is rather mathematical and technical, if you limit yourself to one-dimensional probability distributions in the form of histograms based on uniformly sized bins, the algorithm is almost trivial (see for instance: https://math.stackexchange.com/questions/714476/how-do-you-compute-numerically-the-earth-movers-distance-emd )
Here is an implementation - note that in two or more dimensions a somewhat similar approach is possible:
# stat-wasserstein.tcl -- # Determine the Wasserstein distance between two probability distributions # # Note: # This is an implementation for one-dimensional distributions (or better: # non-negative patterns) # # LastNonZero -- # Auxiliary procedure to find the last non-zero entry # # Arguments: # prob Probability distribution # # Result: # Index in the list of the last non-zero entry # # Note: # To avoid numerical problems any value smaller than 1.0e-10 is considered to # be zero # proc LastNonZero {prob} { set maxidx [expr {[llength $prob] - 1}] for {set idx $maxidx} {$idx >= 0} {incr idx -1} { if { [lindex $prob $idx] > 1.0e-10 } { return $idx } } return -1 ;# No non-zero entry } # Normalise -- # Auxiliary procedure to normalise the probability distribution # # Arguments: # prob Probability distribution # # Result: # Normalised distribution (i.e. the entries sum to 1) # # Note: # To avoid numerical problems any value smaller than 1.0e-10 is set to zero # proc Normalise {prob} { set newprob {} set sum 0.0 foreach p $prob { set sum [expr {$sum + $p}] } if { $sum == 0.0 } { return -code error "Probability distribution should not consist of only zeroes" } foreach p $prob { lappend newprob [expr {$p > 1.0e-10? ($p/$sum) : 0.0}] } return $newprob } # wasserstein-distance -- # Determine the Wasserstein distance using a "greedy" algorithm. # # Arguments: # prob1 First probability distribution, interpreted as a histogram # with uniform bin width # prob2 Second probability distribution # # Result: # Distance between the two distributions # proc wasserstein-distance {prob1 prob2} { # # First step: make sure the histograms have the same length and the # same cumulative weight. # if { [llength $prob1] != [llength $prob2] } { return -code error "Lengths of the probability histograms must be the same" } set prob1 [Normalise $prob1] set prob2 [Normalise $prob2] set distance 0.0 # # Determine the last non-zero bin - this bin will be shifted to the second # distribution # while {1} { set idx1 [LastNonZero $prob1] set idx2 [LastNonZero $prob2] if { $idx1 < 0 } { break ;# We are done } set bin1 [lindex $prob1 $idx1] set bin2 [lindex $prob2 $idx2] if { $bin1 <= $bin2 } { lset prob1 $idx1 0.0 lset prob2 $idx2 [expr {$bin2 - $bin1}] set distance [expr {$distance + abs($idx2-$idx1) * $bin1}] } else { lset prob1 $idx1 [expr {$bin1 - $bin2}] lset prob2 $idx2 0.0 set distance [expr {$distance + abs($idx2-$idx1) * $bin2}] } } return $distance } # tests -- # # Almost trivial set prob1 {0.0 0.0 0.0 1.0} set prob2 {0.0 0.0 1.0 0.0} puts "Expected distance: 1" puts "Calculated: [wasserstein-distance $prob1 $prob2]" puts "Symmetric: [wasserstein-distance $prob2 $prob1]" # Less trivial set prob1 {0.0 0.75 0.25 0.0} set prob2 {0.0 0.0 1.0 0.0} puts "Expected distance: 0.75" puts "Calculated: [wasserstein-distance $prob1 $prob2]" puts "Symmetric: [wasserstein-distance $prob2 $prob1]" # Shift trivial set prob1 {0.0 0.1 0.2 0.4 0.2 0.1 0.0 0.0} set prob2 {0.0 0.0 0.0 0.1 0.2 0.4 0.2 0.1} puts "Expected distance: 2" puts "Calculated: [wasserstein-distance $prob1 $prob2]" puts "Symmetric: [wasserstein-distance $prob2 $prob1]"