similarity

It strikes me that "similarity scoring" is the sort of gadget that attracts/inspires RS, Arjen, ... If I leave a mention of [L1 ] here, will they exhibit Tcl examples?

LV A definition of similarity, on the page, would be worthwhile for novices as they read along...

[LV: that google groups URL in the first sentence doesn't point to anything useful for me...; however, Wikipedia has [L2 ], which seems to be related to the following discussion. Also, check out http://www.dcs.shef.ac.uk/~sam/stringmetrics.html for string metrics pointers.]

Note that Tcllib FR [ 514486 ], which requested sound-alike comparisons for textutil [L3 ], is now closed, thanks to the inclusion of tcllib's soundex module.
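A quick usage sketch of that module (assuming a tcllib installation that provides the soundex package and its ::soundex::knuth command):

 package require soundex
 # Two sound-alike names map to the same Knuth soundex code:
 puts [::soundex::knuth Robert]   ;# R163
 puts [::soundex::knuth Rupert]   ;# R163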

Levenshtein's algorithm for computing edit distance is also the foundation of diff in Tcl, which compares files line by line instead of comparing strings character by character.
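The same recurrence works on any sequence of comparable items, not just characters. Here is a minimal sketch operating on lists of lines (the proc name listDistance is just for illustration, and expr's min() function assumes Tcl 8.5):

 proc listDistance {a b} {
     # Levenshtein distance over list elements, e.g. the lines of two files
     set n [llength $a]
     set m [llength $b]
     for {set i 0} {$i<=$n} {incr i} {set c($i,0) $i}
     for {set j 0} {$j<=$m} {incr j} {set c(0,$j) $j}
     for {set i 1} {$i<=$n} {incr i} {
         set i0 [expr {$i-1}]
         for {set j 1} {$j<=$m} {incr j} {
             set j0 [expr {$j-1}]
             set cost [expr {[lindex $a $i0] ne [lindex $b $j0]}]
             set c($i,$j) [expr {min($c($i0,$j)+1, $c($i,$j0)+1, $c($i0,$j0)+$cost)}]
         }
     }
     set c($n,$m)
 }
 # e.g. [listDistance [split $text1 \n] [split $text2 \n]] counts how many
 # lines would have to be inserted, deleted or changed.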

RS can't withstand a challenge... Indeed, I have often wished for such a measuring device - thanks for the link! Here's a plump translation to Tcl of the Python version of the Levenshtein algorithm given there (it hurt to do all the index arithmetic with expr, so I introduced a helper subtractor), plus an application of stringDistance to compute stringSimilarity, where the only little gag is that we determine the sum of the string lengths only once, by measuring the concatenation of the two strings:

 proc stringDistance {a b} {
     set n [string length $a]
     set m [string length $b]
     for {set i 0} {$i<=$n} {incr i} {set c($i,0) $i}
     for {set j 0} {$j<=$m} {incr j} {set c(0,$j) $j}
     for {set i 1} {$i<=$n} {incr i} {
         for {set j 1} {$j<=$m} {incr j} {
             set x [expr {$c([- $i 1],$j)+1}]
             set y [expr {$c($i,[- $j 1])+1}]
             set z $c([- $i 1],[- $j 1])
             if {[string index $a [- $i 1]]!=[string index $b [- $j 1]]} {
                 incr z
             }
             set c($i,$j) [min $x $y $z]
         }
     }
     set c($n,$m)
 }
 # some little helpers:
 if {[catch {
    # DKF - these things (or rather improved versions) are provided by the 8.5 core
    package require Tcl 8.5
    namespace path {tcl::mathfunc tcl::mathop}
 }]} then {
    proc min args {lindex [lsort -real $args] 0}
    proc max args {lindex [lsort -real $args] end}
    proc - {p q} {expr {$p-$q}}
 }

 proc stringSimilarity {a b} {
        set totalLength [string length $a$b]
        max [expr {double($totalLength-2*[stringDistance $a $b])/$totalLength}] 0.0
 }
Testing...
 % stringSimilarity hello hello  ;# identity implies perfect similarity
 1.0
 % stringSimilarity hello hallo  ;# changed one out of five letters
 0.8
 % stringSimilarity hello Hallo  ;# case matters
 0.6
 % stringSimilarity hello world  ;# one match of five (l or o)
 0.2
 % stringSimilarity hello helplo ;# insert costs slightly less
 0.818181818182
 % stringSimilarity hello lohel  ;# same letters but all in different positions
 0.2
 % stringSimilarity hello again  ;# total dissimilarity
 0.0

[Nice work, of course; I particularly applaud the example evaluations.]


Both string* functions may be tuned to better fit the needs of the application. In stringDistance, the cost for inequality (presently the constant 1, applied by the incr z) could be derived from the characters in question, e.g. 0/O or I/1 could cost only 0.1, etc.; in stringSimilarity one could, if the strings are qualified as being either standard (like from a dictionary) or a (possible) deviation, divide the distance by the length of the standard string (this would prevent the above effect that an insert costs slightly less because it increases the total length).
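A minimal sketch of the first idea, assuming Tcl 8.5 (for the in operator and expr's min()); the charCost helper and its table of cheap pairs are made up for illustration:

 # Hypothetical per-character substitution cost: visually confusable
 # pairs are cheap, everything else costs 1.
 proc charCost {x y} {
     set cheap {{0 O} {O 0} {1 I} {I 1} {1 l} {l 1}}
     expr {([list $x $y] in $cheap) ? 0.1 : 1}
 }
 proc stringDistanceWeighted {a b} {
     set n [string length $a]
     set m [string length $b]
     for {set i 0} {$i<=$n} {incr i} {set c($i,0) $i}
     for {set j 0} {$j<=$m} {incr j} {set c(0,$j) $j}
     for {set i 1} {$i<=$n} {incr i} {
         set i0 [expr {$i-1}]
         for {set j 1} {$j<=$m} {incr j} {
             set j0 [expr {$j-1}]
             set x [string index $a $i0]
             set y [string index $b $j0]
             set cost [expr {$x eq $y ? 0 : [charCost $x $y]}]
             set c($i,$j) [expr {min($c($i0,$j)+1, $c($i,$j0)+1, $c($i0,$j0)+$cost)}]
         }
     }
     set c($n,$m)
 }

The second idea amounts to replacing $totalLength in stringSimilarity with the length of whichever string is the dictionary form.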


jruv 2009-12-21: Testing with the parameters below.

% stringSimilarity hello h
0.0

Obviously, these two strings are not totally dissimilar. I made a small change to stringSimilarity (using a levenshteinDistance proc as defined further down this page):

proc stringSimilarity {a b} {
    set totalLength [max [string length $a] [string length $b]]
    max [expr {double($totalLength-[levenshteinDistance $a $b])/$totalLength}] 0.0
}

Testing...

% stringSimilarity hello h
0.2

Here is another version of stringSimilarity

proc stringSimilarity {s t} {
    set sl [string length $s]
    set tl [string length $t]

    set ml [max $sl $tl]
    set dn [levenshteinDistance $s $t]

    # -- number of matching characters
    set mn [expr {$ml - $dn}]

    # -- if match number != 0: (mn-1)/tl + (1/tl)*(mn/sl)
    return [expr {$mn == 0 ? 0 : ($mn - 1 + double($mn)/$sl) / $tl}]
}

Speed tuning: The subtractor helper "-" above makes the code look nicer than if an explicit expr were thrown in for each occurrence; however, keeping a second iterator whose value is one less than the original iterator brings runtime from 5.7 ms down to 4.2 ms (Sun Solaris; tested both "hello hello" and "hello again"):

 proc stringDistance2 {a b} {
     set n [string length $a]
     set m [string length $b]
     for {set i 0} {$i<=$n} {incr i} {set c($i,0) $i}
     for {set j 0} {$j<=$m} {incr j} {set c(0,$j) $j}
     for {set i 1; set i0 0} {$i<=$n} {incr i; incr i0} {
         for {set j 1; set j0 0} {$j<=$m} {incr j; incr j0} {
             set x [expr {$c($i0,$j)+1}]
             set y [expr {$c($i,$j0)+1}]
             set z $c($i0,$j0)
             if {[string index $a $i0]!=[string index $b $j0]} {
                 incr z
             }
             set c($i,$j) [min $x $y $z]
         }
     }
     set c($n,$m)
 } ;# RS

Artur Trzewik 2006-03-31: Here is another implementation, which I found in the OmegaT Java program and rewrote in Tcl; it seems to be a little faster (about 30%).

  # author Vladimir Levenshtein
  # author Michael Gilleland, Merriam Park Software
  # author Chas Emerick, Apache Software Foundation
  # author Maxym Mykhalchuk
  proc levenshteinDistance {s t} {
    set n [string length $s]
    set m [string length $t]
    if {$n==0} {
        return $m
    } elseif {$m==0} {
        return $n
    }
    for {set i 0} {$i<=$n} {incr i} {
        lappend d 0
    }
    # i iterates through s, j iterates through t; cost is the per-character cost
    for {set i 0} {$i<=$n} {incr i} {
        lappend p $i
    }
    for {set j 1} {$j<=$m} {incr j} {
        set t_j [string index $t [expr {$j-1}]]
        lset d 0 $j
        for {set i 1} {$i<=$n} {incr i} {
            set s_i [string index $s [expr {$i-1}]]
            if {$s_i eq $t_j} {
                set cost 0
            } else {
                set cost 1
            }
            # minimum of cell to the left+1, to the top+1, diagonally left and up +cost
            lset d $i [min [expr {[lindex $d [expr {$i-1}]]+1}]  [expr {[lindex $p $i]+1}] [expr {[lindex $p [expr {$i-1}]]+$cost}]]
        }
        # copy current distance counts to 'previous row' distance counts
        set _d $p
        set p $d
        set d $_d
    }
    # our last action in the above loop was to switch d and p, so p now
    # actually has the most recent cost counts
    lindex $p $n
  }

aricb 2008-11-11: Here is another implementation of the Levenshtein Distance algorithm. This one is based on the Wikipedia pseudocode at http://en.wikipedia.org/wiki/Levenshtein_distance . It economizes a bit by only remembering two rows of distances at a time. It runs a bit faster than stringDistance above, and (I think) is also a bit more Tclish.

(A few hours later) As I look at Artur's implementation more closely, I realize that his proc and mine are very similar, despite using different mechanisms for accessing the characters of the strings being compared. Not surprisingly, the two procs run at about the same speed; Artur's seems to be just a hair ahead of mine :)

  proc levenshteinDistance {s t} {
    # special case: $s is an empty string
    if {$s eq ""} {
      return [string length $t]
    }

    # initialize first row in table
    for {set i 0} {$i <= [string length $t]} {incr i} {
      lappend prevrow $i
    }

    set i 0
    foreach schar [split $s ""] {
      incr i
      set j 0
      set row [list $i]
      foreach tchar [split $t ""] {
        incr j
        set cost [expr {$schar ne $tchar}] ;# $cost is 0 if same, 1 if different
        # difference between $schar and $tchar is minimum of:
        #   [lindex $prevrow $j]   + 1     = cost of deletion
        #   [lindex $row     $j-1] + 1     = cost of insertion
        #   [lindex $prevrow $j-1] + $cost = cost of substitution (zero if same char)
        lappend row [expr {min([lindex $prevrow $j] + 1, [lindex $row [expr {$j-1}]] + 1, [lindex $prevrow [expr {$j-1}]] + $cost)}]
      }
      set prevrow $row
    }
    # Levenshtein distance is value at last cell of last row
    return [lindex $row end]
  }

DKF: An optimized version of Artur's code above is this:

proc levenshteinDistance {s t} {
    if {![set n [string length $t]]} {
        return [string length $s]
    } elseif {![set m [string length $s]]} {
        return $n
    }
    for {set i 0} {$i <= $m} {incr i} {
        lappend d 0
        lappend p $i
    }
    for {set j 0} {$j < $n} {} {
        set tj [string index $t $j]
        lset d 0 [incr j]
        for {set i 0} {$i < $m} {} {
            set a [expr {[lindex $d $i]+1}]
            set b [expr {[lindex $p $i]+([string index $s $i] ne $tj)}]
            set c [expr {[lindex $p [incr i]]+1}]
            lset d $i [expr {$a<$b ? $c<$a ? $c : $a : $c<$b ? $c : $b}]
        }
        set nd $p; set p $d; set d $nd
    }
    return [lindex $p end]
}

Note that this is 100% bytecoded in Tcl 8.5.
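A quick way to compare the variants on your own data is the time command (the iteration count and test strings here are arbitrary, and levenshteinDistance will be whichever definition was loaded last):

 foreach impl {stringDistance stringDistance2 levenshteinDistance} {
     puts "$impl: [time {$impl hello world} 1000]"
 }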


IL This is a topic I've been dealing with a lot lately, and I'm finding (of course) that the nature of the math really depends on the data you're trying to score. In the field of genetic sequence matching you might be looking for common subsequences; in editorial fields you might be looking for misspellings. (escargo - Is that a joke?) IL - no, I'm not that corny :(, thanks for the heads-up

I've found that below a certain number of characters in a string, any percentage measurement really doesn't do much good. CAT and CAR (especially in an editorial context) are entirely different concepts, but you can make the argument that they only differ by one character; in this case that difference just ends up being 33%. It raised the question for me: can you ever really assign a percentage of relevancy by sheer numbers alone? Probably not; in most cases I regard string matching as highly domain-specific.
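For instance, with RS's stringSimilarity from the top of the page (output shown in the same format as the earlier examples):

 % stringSimilarity CAT CAR   ;# one of three letters differs: 67% similar, i.e. 33% different
 0.666666666667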

You should build a semantic network of networks simulating a semiotic network, constructed from nontrivial symbol knots - you could do this with really simple statistical analysis and dictionary definitions from some commonly available source. From the dictionary you will get a definition, a list of synonyms and antonyms, and a list of related terms and phrases (it *doesn't* *matter* *how* they are related). Select a word A and a word B you are interested in (like CAT and CAR). Get the definitions. Now find uncommon terms in the definitions. Assign a number telling us how probable its appearance by chance alone is, calculated over some large body of language (preferably documents related in any way to the document we took the words from). Go from the beginning of the definition to the end (just as we read material), selecting word-nodes.

Calculate the p of a relation forming an arc from term a to term b by multiplying probabilities of appearance - call that relation "retrospection" or "retroaction" or "implication by chance" - it tells how much one of the words reflects the semantic connections of the other, thus increasing the chance that an intelligent agent will put it there, in that order (we think backwards, like Google). Lappend it to a vector.

Calculate the p of the relation "precognition", which tells us that a word b has a chance to appear after word a by reverse creation of sense (like the afterthought of a writer), making it more probable that an intelligent agent will backtrack and write up a paragraph that contains it - this time add the probabilities, as those events are not really connected causally and should get some bonus for being so lonely. Lappend it to a vector.

Calculate the probability of the relation "negation" - subtract the probabilities, just don't get a negative number - negation of nodes works both ways (it is reflective) - it tells us how much they don't like each other - so if they are both here in the text we are analyzing, it's an instance of unusual writing style: they should be far away from each other and they are not. Lappend it to a vector.

Calculate the probability of the relation "self-propagation" by mutating the p of the node, inverting it (subtracting it from 1 - the MUST in the RFC sense) - if a node is unlikely to show up at random, it will get a large number for self-propagation because it did show up - this relation is symmetrical, obviously. Lappend it to a vector.

If there are several identical terms that are also chosen as a node, it means we have an unlikely event of heavy semantic coupling - divide the p of the node by the number of occurrences plus one (one less degree of freedom). Otherwise simply lappend the p of the singular node to a vector. You get a vector consisting of five relations and connections, with word a at the beginning and word b at the end. This gives you a way to build a matrix that looks like a picture of a face for word a and word b - add those vectors, one after another, until you are out of features (rare words, which are likely to be in turn defined somewhere using a lot of common, related words) to extract from the definition itself. Analyze the matrix - calculate the connection of relations of the same type from two vectors that lie one after another (the whole structure is really a faerie ring ;-}), and don't worry if you get funny numbers - we do not really deal with a simple probabilistic space, so the 0..1 range does not apply; you can normalize them later to fit.

Now you could train your network by finding relations of higher order, but that's a long story and you can skip this step if you want. Take what you have - a list of lists, like a matrix of pixels showing some picture. Use some image recognition algorithms to calculate various degrees of similarity between the matrix-image complex symbolic representations, analogous to real images showing us a CAT and a CAR (numbers converted to colors of pixels and other values important in image processing). If you are not sure that is close enough, make such images for every mutation of the words CAT and CAR you can think of to get more accurate results - warning - mutated words must be connected to some real-world objects, that is, words you can find defined in a dictionary. If you want, you can replace this with finding definitions for hypernodes in the dictionary definitions of our words - the first level is the level of an image reflected truly in a mirror, the second level is the level of the reflection of the image from the other side of the mirror (the relation matrix is inverted), the third level is the level of metalanguage - that is, images that are about the image and reflections of it (the relation matrix is transposed, rows become columns, columns rows), the fourth level is a level of disconnection - think of it like a secret society layer in a neural network: it affects the metalanguage level, but is not seen - the nodes considered there are the words that are popular, not those that are rare (another way to invert without inverting anything in substance).

The fifth level is the level of primordial chaos and love, or desire to connect no matter what - it is connected back to itself and propagates random changes back to level one to create some lively noise, simulating several people reading the same words and then talking leisurely to each other about their contents, naturally using those rare words they found out and a lot of common ones to keep the conversation lively. Think about five mirrors reflecting the image of some entity from one to another and at the same time reflecting themselves in a second set of mirrors that serves as memory, which changes constantly, reflecting reality realized. Easy to remember - you have two hands and each one has five fingers; now imagine yourself weaving something. It's you, weaving the reality, and the story cannot ever end, obviously. You can get sense out of nonsense. It's a kind of magic. The similarity between the words CAT and CAR is a dynamic process too - in 50 years those words will mean something completely different, the spelling might change... I gave you a way to capture a gestalt of this process. The calculations are really simple, and I hope my poetic zeal did not offend anyone. Milk and kisses...

Also there is the notion that A has a relevancy towards B, but the reverse might not be true; i.e. you can say THE has a 100% match against THE FLYING CIRCUS.
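One crude way to capture that asymmetry is a directional score: the length of the longest common subsequence divided by the length of the query string only. This is just a sketch, not anything the comment above prescribes (the proc name relevance is made up; expr's max() assumes Tcl 8.5):

 # How much of $a appears, in order, within $b: LCS length / length of $a
 proc relevance {a b} {
     set n [string length $a]
     set m [string length $b]
     if {$n == 0} {return 0.0}
     for {set j 0} {$j <= $m} {incr j} {lappend prev 0}
     for {set i 1} {$i <= $n} {incr i} {
         set ai [string index $a [expr {$i-1}]]
         set row [list 0]
         for {set j 1} {$j <= $m} {incr j} {
             set j0 [expr {$j-1}]
             if {$ai eq [string index $b $j0]} {
                 lappend row [expr {[lindex $prev $j0] + 1}]
             } else {
                 lappend row [expr {max([lindex $prev $j], [lindex $row $j0])}]
             }
         }
         set prev $row
     }
     expr {double([lindex $prev end]) / $n}
 }
 # relevance THE "THE FLYING CIRCUS"  -> 1.0
 # relevance "THE FLYING CIRCUS" THE  -> about 0.18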

[L4 ] Here is a good summary I found on the topic of genetic sequence matching algorithms. I was using a variation of the Smith-Waterman algorithm to run tests against some supposedly related datasets.


For faster (though not as exact) string comparisons, see Fuzzy string search


To compare how similar-sounding two strings are, try the soundex page.