It strikes me that "similarity scoring" is the sort of gadget that attracts/inspires [RS], Arjen, ... If I leave a mention of [http://groups.google.com/groups?hl=en&frame=right&th=4c0bf2d2c3fb160a] here, will they exhibit [Tcl] examples?

[[[LV] that google URL doesn't seem to point to anything for me...]

Note Tcllib FR [[ 514486 ]], requesting sound-like comparisons for textutil [http://sourceforge.net/tracker/index.php?func=detail&aid=514486&group_id=12883&atid=362883]

Levenshtein's edit-distance algorithm (a generalization of the Hamming distance to strings of different lengths) is also the foundation of [diff in Tcl], which compares files line by line instead of comparing strings character by character.

[RS] can't withstand a challenge... Indeed, I have often been wishing for such a measuring device - thanks for the link! Here's a direct translation to Tcl of the [Python] version of the Levenshtein algorithm given there (it hurt to do all the index arithmetic with [expr], so I introduced a helper subtractor), plus an application of ''stringDistance'' to compute ''stringSimilarity''. The only little gag is that we need the sum of the string lengths only once, so we measure the two strings concatenated:

   proc stringDistance {a b} {
       set n [string length $a]
       set m [string length $b]
       for {set i 0} {$i<=$n} {incr i} {set c($i,0) $i}
       for {set j 0} {$j<=$m} {incr j} {set c(0,$j) $j}
       for {set i 1} {$i<=$n} {incr i} {
           for {set j 1} {$j<=$m} {incr j} {
               set x [expr {$c([- $i 1],$j)+1}]
               set y [expr {$c($i,[- $j 1])+1}]
               set z $c([- $i 1],[- $j 1])
               if {[string index $a [- $i 1]] != [string index $b [- $j 1]]} {
                   incr z
               }
               set c($i,$j) [min $x $y $z]
           }
       }
       set c($n,$m)
   }

   # some little helpers:
   proc min args {lindex [lsort -real $args] 0}
   proc max args {lindex [lsort -real $args] end}
   proc -   {p q} {expr {$p-$q}}

   proc stringSimilarity {a b} {
       set totalLength [string length $a$b]
       max [expr {double($totalLength-2*[stringDistance $a $b])/$totalLength}] 0.0
   }

Testing:
 % stringSimilarity hello hello  ;# identity implies perfect similarity
 1.0
 % stringSimilarity hello hallo  ;# changed one out of five letters
 0.8
 % stringSimilarity hello Hallo  ;# case matters
 0.6
 % stringSimilarity hello world  ;# one match of five (l or o)
 0.2
 % stringSimilarity hello helplo ;# an insert costs slightly less
 0.818181818182
 % stringSimilarity hello again  ;# total dissimilarity
 0.0

[[Nice work, of course; I particularly applaud the example evaluations.]]

----

Both ''string*'' functions may be tuned to better fit the needs of the application. In ''stringDistance'', the cost for inequality (presently the constant 1 added by ''incr z'') could be derived from the characters in question, e.g. 0/O or I/1 could cost only 0.1, etc. In ''stringSimilarity'', if the strings are qualified as being either standard (like from a dictionary) or (possible) deviation, one could divide the distance by the length of the standard string (this would prevent the above effect that an insert costs slightly less because it increases the total length).
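As a sketch of the character-dependent cost idea: the flat ''incr z'' can be replaced by a lookup that charges visually confusable pairs a reduced cost. ''charCost'' and ''stringDistanceWeighted'' are invented names for this illustration, not commands from the code above, and only the pairs 0/O and 1/I get the reduced cost 0.1 here:

```tcl
# Hypothetical per-character substitution cost: confusable pairs are cheap.
proc charCost {x y} {
    foreach pair {{0 O} {O 0} {1 I} {I 1}} {
        if {$x eq [lindex $pair 0] && $y eq [lindex $pair 1]} {return 0.1}
    }
    return 1 ;# any other mismatch costs the usual 1
}
proc min args {lindex [lsort -real $args] 0}

# Same dynamic program as stringDistance, but with weighted substitution
proc stringDistanceWeighted {a b} {
    set n [string length $a]
    set m [string length $b]
    for {set i 0} {$i<=$n} {incr i} {set c($i,0) $i}
    for {set j 0} {$j<=$m} {incr j} {set c(0,$j) $j}
    for {set i 1} {$i<=$n} {incr i} {
        for {set j 1} {$j<=$m} {incr j} {
            set i0 [expr {$i-1}]; set j0 [expr {$j-1}]
            set x [expr {$c($i0,$j)+1}]
            set y [expr {$c($i,$j0)+1}]
            set z $c($i0,$j0)
            if {[string index $a $i0] ne [string index $b $j0]} {
                set z [expr {$z + [charCost [string index $a $i0] \
                                            [string index $b $j0]]}]
            }
            set c($i,$j) [min $x $y $z]
        }
    }
    set c($n,$m)
}
```

With this, ''stringDistanceWeighted 0K OK'' yields 0.1 instead of 1, while unrelated mismatches still cost a full point. Note that ''min'' must keep its ''-real'' sort, since distances are now fractional.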
----

'''Speed tuning:''' The subtractor helper "-" above makes the code look nicer than if an explicit [expr] were thrown in for each occurrence; however, keeping a second pair of iterators whose values are one less than the originals brings runtime from 5.7 ms down to 4.2 ms (Sun Solaris; tested with both "hello hello" and "hello again"):

   proc stringDistance2 {a b} {
       set n [string length $a]
       set m [string length $b]
       for {set i 0} {$i<=$n} {incr i} {set c($i,0) $i}
       for {set j 0} {$j<=$m} {incr j} {set c(0,$j) $j}
       for {set i 1; set i0 0} {$i<=$n} {incr i; incr i0} {
           for {set j 1; set j0 0} {$j<=$m} {incr j; incr j0} {
               set x [expr {$c($i0,$j)+1}]
               set y [expr {$c($i,$j0)+1}]
               set z $c($i0,$j0)
               if {[string index $a $i0] != [string index $b $j0]} {
                   incr z
               }
               set c($i,$j) [min $x $y $z]
           }
       }
       set c($n,$m)
   } ;# RS

----

[Artur Trzewik] 2006-03-31: Here is another implementation, which I found in the OmegaT Java program and rewrote in Tcl; it seems to be about 30% faster.

   # author Vladimir Levenshtein
   # author Michael Gilleland, Merriam Park Software
   # author Chas Emerick, Apache Software Foundation
   # author Maxym Mykhalchuk
   proc levenshteinDistance {s t} {
       set n [string length $s]
       set m [string length $t]
       if {$n == 0} {
           return $m
       } elseif {$m == 0} {
           return $n
       }
       # d is the current row of distance counts, p the previous row;
       # i iterates through s, j iterates through t
       for {set i 0} {$i<=$n} {incr i} { lappend d 0 }
       for {set i 0} {$i<=$n} {incr i} { lappend p $i }
       for {set j 1} {$j<=$m} {incr j} {
           set t_j [string index $t [expr {$j-1}]]
           lset d 0 $j
           for {set i 1} {$i<=$n} {incr i} {
               set s_i [string index $s [expr {$i-1}]]
               if {$s_i eq $t_j} {set cost 0} else {set cost 1}
               # minimum of cell to the left+1, cell above+1,
               # and diagonally left and up +cost
               lset d $i [min [expr {[lindex $d [expr {$i-1}]]+1}] \
                              [expr {[lindex $p $i]+1}] \
                              [expr {[lindex $p [expr {$i-1}]]+$cost}]]
           }
           # copy current distance counts to 'previous row' distance counts
           set _d $p
           set p $d
           set d $_d
       }
       # our last action in the above loop was to switch d and p,
       # so p now actually has the most recent cost counts
       lindex $p $n
   }

----

[IL]: This is a topic I've been dealing with a lot lately, and I'm finding (of course) that the nature of the math really depends on the data you're trying to score. In the field of genetic sequence matching you might be looking for common subsequences; in editorial fields you might be looking for misspellings. ''([escargo] - Is that a joke?) [IL] - no, I'm not that corny :(, thanks for the heads-up''

I've found that below a certain number of characters in a string, any percentage measurement really doesn't do much good. CAT and CAR (especially in an editorial context) are entirely different concepts, but you can make the argument that they differ by only one character; in this case that difference ends up being just 33%. It raised the question for me: can you ever really assign a percentage of relevancy by sheer numbers alone? Probably not; in most cases I regard string matching as highly domain-specific. Also, there is the notion that A has a relevancy towards B while the reverse might not be true, i.e. you can say THE has a 100% match against THE FLYING CIRCUS.

Here is a good summary I found on the topic of genetic sequence matching algorithms: [http://www.seas.gwu.edu/~simhaweb/cs151/lectures/module12/align.html]. I was using a variation of the Smith-Waterman algorithm to run tests against some supposedly related datasets.

----

For faster (though not as exact) string comparisons, see [Fuzzy string search]

----

To compare how similar sounding two strings are, try the [soundex] page.

----

[Additional string functions] - [Arts and crafts of Tcl-Tk programming]
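----

The asymmetric-relevancy observation above (THE matching 100% against THE FLYING CIRCUS, but not vice versa) can be sketched by scoring A against B as the fraction of A covered by their longest common subsequence. ''lcsLength'' and ''relevancy'' are invented names for this illustration, not established commands; this is a sketch of the idea, not a tuned implementation:

```tcl
# Hypothetical asymmetric relevancy score based on the
# longest common subsequence (LCS), computed by dynamic programming.
proc lcsLength {a b} {
    set n [string length $a]
    set m [string length $b]
    # l($i,$j): LCS length of the first i chars of a and first j chars of b
    for {set i 0} {$i<=$n} {incr i} {set l($i,0) 0}
    for {set j 0} {$j<=$m} {incr j} {set l(0,$j) 0}
    for {set i 1} {$i<=$n} {incr i} {
        for {set j 1} {$j<=$m} {incr j} {
            set i0 [expr {$i-1}]; set j0 [expr {$j-1}]
            if {[string index $a $i0] eq [string index $b $j0]} {
                set l($i,$j) [expr {$l($i0,$j0) + 1}]
            } elseif {$l($i0,$j) > $l($i,$j0)} {
                set l($i,$j) $l($i0,$j)
            } else {
                set l($i,$j) $l($i,$j0)
            }
        }
    }
    set l($n,$m)
}

# How much of a is accounted for by b; not symmetric in its arguments
proc relevancy {a b} {
    expr {double([lcsLength $a $b]) / [string length $a]}
}
```

Scoring this way, ''relevancy THE "THE FLYING CIRCUS"'' gives 1.0, while the reverse call gives only about 0.18: the direction of the comparison matters, as [IL] notes.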