Version 6 of similarity

Updated 2002-03-02 20:55:58

It strikes me that "similarity scoring" is the sort of gadget that attracts/inspires RS, Arjen, ... If I leave a mention of [L1 ] here, will they exhibit Tcl examples?

RS can't withstand a challenge... Indeed, I have often been wishing for such a measuring device - thanks for the link! Here's a plump translation to Tcl of the Python version of the Levenshtein algorithm given there (where it hurt to have to do all index arithmetics with expr, so I introduced a helper subtractor), plus an application of stringDistance to compute stringSimilarity, where the only little gag is that we have to determine the sum of the string lengths only once, as they're concatenated:

 proc stringDistance {a b} {
        set n [string length $a]
        set m [string length $b]
        for {set i 0} {$i<=$n} {incr i} {set c($i,0) $i}
        for {set j 0} {$j<=$m} {incr j} {set c(0,$j) $j}
        for {set i 1} {$i<=$n} {incr i} {
           for {set j 1} {$j<=$m} {incr j} {
                set x [expr {$c([- $i 1],$j)+1}]                
                set y [expr {$c($i,[- $j 1])+1}]
                set z $c([- $i 1],[- $j 1])
                if {[string index $a [- $i 1]]!=[string index $b [- $j 1]]} {
                        incr z
                }
                set c($i,$j) [min $x $y $z]
            }
        }
        set c($n,$m)
 }
 # some little helpers:
 proc min args {lindex [lsort -real $args] 0}
 proc max args {lindex [lsort -real $args] end}
 proc - {p q} {expr {$p-$q}}

 proc stringSimilarity {a b} {
        set totalLength [string length $a$b]
        max [expr {double($totalLength-2*[stringDistance $a $b])/$totalLength}] 0.0
 }

# Testing...

 % stringSimilarity hello hello
 1.0
 % stringSimilarity hello hallo
 0.8
 % stringSimilarity hello Hallo
 0.6
 % stringSimilarity hello world
 0.2
 % stringSimilarity hello helplo
 0.818181818182
 % stringSimilarity hello again
 0.0

Additional string functions - Arts and crafts of Tcl-Tk programming