Version 11 of similarity

Updated 2002-03-04 07:52:14

It strikes me that "similarity scoring" is the sort of gadget that attracts/inspires RS, Arjen, ... If I leave a mention of [L1 ] here, will they exhibit Tcl examples?

RS can't withstand a challenge... Indeed, I have often been wishing for such a measuring device - thanks for the link! Here's a plump translation to Tcl of the Python version of the Levenshtein algorithm given there (where it hurt to have to do all index arithmetics with expr, so I introduced a helper subtractor), plus an application of stringDistance to compute stringSimilarity, where the only little gag is that we have to determine the sum of the string lengths only once, as they're concatenated:

 proc stringDistance {a b} {
        set n [string length $a]
        set m [string length $b]
        for {set i 0} {$i<=$n} {incr i} {set c($i,0) $i}
        for {set j 0} {$j<=$m} {incr j} {set c(0,$j) $j}
        for {set i 1} {$i<=$n} {incr i} {
           for {set j 1} {$j<=$m} {incr j} {
                set x [expr {$c([- $i 1],$j)+1}]                
                set y [expr {$c($i,[- $j 1])+1}]
                set z $c([- $i 1],[- $j 1])
                if {[string index $a [- $i 1]]!=[string index $b [- $j 1]]} {
                        incr z
                }
                set c($i,$j) [min $x $y $z]
            }
        }
        set c($n,$m)
 }
 # some little helpers:
 proc min args {lindex [lsort -real $args] 0}
 proc max args {lindex [lsort -real $args] end}
 proc - {p q} {expr {$p-$q}}

 proc stringSimilarity {a b} {
        set totalLength [string length $a$b]
        max [expr {double($totalLength-2*[stringDistance $a $b])/$totalLength}] 0.0
 }

# Testing...

 % stringSimilarity hello hello  ;# identity implies perfect similarity
 1.0
 % stringSimilarity hello hallo  ;# changed one out of five letters
 0.8
 % stringSimilarity hello Hallo  ;# case matters
 0.6
 % stringSimilarity hello world  ;# one match of five (l or o)
 0.2
 % stringSimilarity hello helplo ;# insert costs slightly less
 0.818181818182
 % stringSimilarity hello again  ;# total dissimilarity
 0.0

[Nice work, of course; I particularly applaud the example evaluations.]


Both string* functions may be tuned to better fit the needs of the application. In stringDistance, the cost for inequality (presently constant 1, done by the incr z) could be derived from the characters in question, e.g. 0/O or I/1 could cost only 0.1, etc.; in stringSimilarity one could, if the strings are qualified as being either standard (like from a dictionary) or (possible) deviation, divide the distance by the length of the standard string (this would prevent the above effect that an insert consts slightly less, because it increases the total length.


Additional string functions - Arts and crafts of Tcl-Tk programming