
It strikes me that "similarity scoring" is the sort of gadget that attracts/inspires RS, Arjen, ... If I leave a mention of [L1 ] here, will they exhibit Tcl examples?

[LV that google URL doesn't seem to point to anything for me...]

Note Tcllib feature request [ 514486 ], asking for sound-like comparisons in textutil [L2 ]

Levenshtein's algorithm for edit distance is also the foundation of diff in Tcl, which compares files line-by-line instead of comparing strings character-by-character.

RS can't withstand a challenge... Indeed, I have often been wishing for such a measuring device - thanks for the link! Here's a straightforward translation to Tcl of the Python version of the Levenshtein algorithm given there (where it hurt to have to do all the index arithmetic with expr, so I introduced a helper subtractor), plus an application of stringDistance to compute stringSimilarity, where the only little gag is that we determine the sum of the two string lengths in one step, by taking the length of their concatenation:

 proc stringDistance {a b} {
     # Levenshtein edit distance: c($i,$j) holds the distance between
     # the first $i characters of $a and the first $j characters of $b.
     set n [string length $a]
     set m [string length $b]
     for {set i 0} {$i<=$n} {incr i} {set c($i,0) $i}
     for {set j 0} {$j<=$m} {incr j} {set c(0,$j) $j}
     for {set i 1} {$i<=$n} {incr i} {
         for {set j 1} {$j<=$m} {incr j} {
             set x [expr {$c([- $i 1],$j)+1}]  ;# deletion
             set y [expr {$c($i,[- $j 1])+1}]  ;# insertion
             set z $c([- $i 1],[- $j 1])       ;# substitution (or match)
             if {[string index $a [- $i 1]]!=[string index $b [- $j 1]]} {
                 incr z
             }
             set c($i,$j) [min $x $y $z]
         }
     }
     set c($n,$m)
 }
 # some little helpers:
 proc min args {lindex [lsort -real $args] 0}
 proc max args {lindex [lsort -real $args] end}
 proc - {p q} {expr {$p-$q}}

 proc stringSimilarity {a b} {
     # Normalize the distance by the combined length, clipping at 0:
     # identical strings score 1.0, totally different ones 0.0.
     set totalLength [string length $a$b]
     max [expr {double($totalLength-2*[stringDistance $a $b])/$totalLength}] 0.0
 }

# Testing...

 % stringSimilarity hello hello  ;# identity implies perfect similarity
 1.0
 % stringSimilarity hello hallo  ;# changed one out of five letters
 0.8
 % stringSimilarity hello Hallo  ;# case matters
 0.6
 % stringSimilarity hello world  ;# one match of five (l or o)
 0.2
 % stringSimilarity hello helplo ;# insert costs slightly less
 0.818181818182
 % stringSimilarity hello again  ;# total dissimilarity
 0.0

[Nice work, of course; I particularly applaud the example evaluations.]


Both string* functions may be tuned to better fit the needs of the application. In stringDistance, the cost for inequality (presently the constant 1 added by incr z) could be derived from the characters in question: for instance, 0/O or 1/I could cost only 0.1. In stringSimilarity, if the strings are qualified as being either standard (like from a dictionary) or (possible) deviation, one could divide the distance by the length of the standard string alone (this would prevent the above effect that an insert costs slightly less, because it increases the total length).
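A minimal sketch of both tunings, building on the code above (charCost, stringDistanceW and stringSimilarityTo are hypothetical names, and the table of cheap pairs is just an example):

 proc charCost {x y} {
     # Hypothetical substitution cost: visually confusable pairs are
     # cheap, everything else costs 1 as before.
     if {$x eq $y} {return 0}
     set cheap {{0 O} {O 0} {1 I} {I 1}}
     if {[lsearch -exact $cheap [list $x $y]] >= 0} {return 0.1}
     return 1
 }
 proc stringDistanceW {a b} {
     # Like stringDistance, but with per-character substitution costs
     set n [string length $a]
     set m [string length $b]
     for {set i 0} {$i<=$n} {incr i} {set c($i,0) $i}
     for {set j 0} {$j<=$m} {incr j} {set c(0,$j) $j}
     for {set i 1} {$i<=$n} {incr i} {
         for {set j 1} {$j<=$m} {incr j} {
             set x [expr {$c([- $i 1],$j)+1}]
             set y [expr {$c($i,[- $j 1])+1}]
             set z [expr {$c([- $i 1],[- $j 1]) + [charCost \
                 [string index $a [- $i 1]] [string index $b [- $j 1]]]}]
             set c($i,$j) [min $x $y $z]
         }
     }
     set c($n,$m)
 }
 proc stringSimilarityTo {standard deviation} {
     # Divide the distance by the standard's length alone, so inserts
     # into the deviation no longer look cheaper than substitutions.
     set d [stringDistance $standard $deviation]
     max [expr {1.0 - double($d)/[string length $standard]}] 0.0
 }

 % stringDistanceW 0K OK           ;# confusable 0/O costs only 0.1
 0.1
 % stringSimilarityTo hello helplo ;# insert now costs like a substitution
 0.8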


Speed tuning: The subtractor helper "-" above makes the code look nicer than if an explicit expr were thrown in for each occurrence; however, keeping second iterators (i0, j0) whose values are one less than the original iterators brings runtime from 5.7 ms down to 4.2 ms (Sun Solaris; tested both "hello hello" and "hello again"):

 proc stringDistance2 {a b} {
     set n [string length $a]
     set m [string length $b]
     for {set i 0} {$i<=$n} {incr i} {set c($i,0) $i}
     for {set j 0} {$j<=$m} {incr j} {set c(0,$j) $j}
     for {set i 1; set i0 0} {$i<=$n} {incr i; incr i0} {
         for {set j 1; set j0 0} {$j<=$m} {incr j; incr j0} {
             set x [expr {$c($i0,$j)+1}]
             set y [expr {$c($i,$j0)+1}]
             set z $c($i0,$j0)
             if {[string index $a $i0]!=[string index $b $j0]} {
                 incr z
             }
             set c($i,$j) [min $x $y $z]
         }
     }
     set c($n,$m)
 } ;# RS

IL This is a topic I've been dealing with a lot lately, and I'm finding (of course) that the nature of the math really depends on the data you're trying to score. In the field of genetic sequence matching you might be looking for common subsequences; in editorial fields you might be looking for misspellings. (escargo - Is that a joke?) IL - No, I'm not that corny :(, thanks for the heads-up.

I've found that below a certain number of characters in a string, a percentage measurement really doesn't do much good. CAT and CAR (especially in an editorial context) are entirely different concepts, but you can make the argument that they differ by only one character; in this case that difference just ends up being 33%. It raises the question: can you ever really assign a percentage of relevancy by sheer numbers alone? Probably not; in most cases I regard string matching as highly domain specific.
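For concreteness, the stringSimilarity above rates them as two-thirds similar (one substitution against a combined length of six):

 % stringSimilarity CAT CAR
 0.666666666667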

Also, there is the notion that A has a relevancy towards B while the reverse might not be true; i.e., you can say THE has a 100% match against THE FLYING CIRCUS.
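One way to capture that asymmetry is the classic approximate-matching variant of the distance above: zero the first row and take the minimum over the last row, so that a may match anywhere inside b. A sketch under that assumption, reusing the helpers above (substringDistance is a hypothetical name):

 proc substringDistance {a b} {
     # Edit distance of $a against the best-matching substring of $b:
     # c(0,$j)=0 lets the match start anywhere in $b, and taking the
     # minimum over the last row lets it end anywhere.
     set n [string length $a]
     set m [string length $b]
     for {set i 0} {$i<=$n} {incr i} {set c($i,0) $i}
     for {set j 0} {$j<=$m} {incr j} {set c(0,$j) 0}
     for {set i 1} {$i<=$n} {incr i} {
         for {set j 1} {$j<=$m} {incr j} {
             set x [expr {$c([- $i 1],$j)+1}]
             set y [expr {$c($i,[- $j 1])+1}]
             set z $c([- $i 1],[- $j 1])
             if {[string index $a [- $i 1]]!=[string index $b [- $j 1]]} {
                 incr z
             }
             set c($i,$j) [min $x $y $z]
         }
     }
     set best $n
     for {set j 0} {$j<=$m} {incr j} {set best [min $best $c($n,$j)]}
     set best
 }

 % substringDistance THE {THE FLYING CIRCUS} ;# THE occurs verbatim
 0
 % substringDistance {THE FLYING CIRCUS} THE ;# but not the other way round
 14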

[L3 ] Here is a good summary I found on the topic of genetic sequence matching algorithms. I was using a variation of the Smith-Waterman algorithm to run tests against some supposedly related datasets.
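For illustration only (a minimal sketch of the basic algorithm, not IL's actual variation): the Smith-Waterman local-alignment score with match +1, mismatch -1 and gap -1 fits in a few lines on top of the helpers above; smithWaterman is a hypothetical name:

 proc smithWaterman {a b} {
     # h($i,$j) is the best score of a local alignment ending at
     # position $i of $a and $j of $b; scores are clipped at 0, and
     # the overall best cell is the Smith-Waterman score.
     set n [string length $a]
     set m [string length $b]
     for {set i 0} {$i<=$n} {incr i} {set h($i,0) 0}
     for {set j 0} {$j<=$m} {incr j} {set h(0,$j) 0}
     set best 0
     for {set i 1} {$i<=$n} {incr i} {
         for {set j 1} {$j<=$m} {incr j} {
             if {[string index $a [- $i 1]] eq [string index $b [- $j 1]]} {
                 set s 1   ;# match
             } else {
                 set s -1  ;# mismatch
             }
             set h($i,$j) [max 0 \
                 [expr {$h([- $i 1],[- $j 1])+$s}] \
                 [expr {$h([- $i 1],$j)-1}] \
                 [expr {$h($i,[- $j 1])-1}]]
             set best [max $best $h($i,$j)]
         }
     }
     set best
 }

 % smithWaterman FLYING {THE FLYING CIRCUS} ;# exact local hit scores its length
 6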


For faster (though not as exact) string comparisons, see Fuzzy string search.


To compare how similar two strings sound, try the soundex page.


Additional string functions - Arts and crafts of Tcl-Tk programming