Version 13 of soundex

Updated 2003-04-03 14:21:29

Purpose: a pattern-matching algorithm for judging how similar two strings sound to one another.


Apparently tcllib has a new module for soundex. I wonder if any of the following code on this page has been considered for addition as well?

AK: I got myself essentially two soundexes: the one from this page and the Knuth one, implemented by Evan Rempel. When I found out that the algorithm here differs from the Knuth one in just a minor detail (see later on this page), I decided to use the Knuth one as the seed for the module. The other algorithm here, metaphone, is completely unknown to me, and thus I am hesitant to just add it, especially given the comments about first pass / draft and the unusual equivalences.

Feel free to add more algorithms here, polish the existing ones, and/or give references to papers, implementations, etc. I.e. make this a staging area for things which can go into the soundex module of tcllib.


Michael Schlenker: Should the tcllib soundex module be reserved for such sounds-like pattern matching algorithms, or could more linguistic tools be added (like stemmers; see for example Tclsnowball for a Tcl binding to some stemmers)?

AK: Hm. ... This could be a question for the tcllib-devel mailing list in general.

Question: What does a stemmer do?

Answer: A stemmer tries to find the stem of words. This is useful for mapping plural and singular forms of words, and other forms of basically the same word to a common stem.
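
To make the idea concrete, here is a deliberately naive sketch of suffix stripping. It is not the Snowball or Porter algorithm and not anything from Tclsnowball; the proc name toyStem and the suffix list are made up for illustration, and real stemmers use far more careful rules.

```tcl
# Hypothetical toy stemmer: strips a few common English suffixes.
# This is an illustration only, not a real stemming algorithm.
proc toyStem {word} {
    set word [string tolower $word]
    # Suffix/replacement pairs, longest suffixes first
    foreach {suffix replacement} {ies y ing {} ed {} s {}} {
        set n [string length $suffix]
        # Require at least three letters to remain, so short words survive
        if {[string length $word] > $n + 2 &&
            [string range $word end-[expr {$n - 1}] end] eq $suffix} {
            return [string range $word 0 end-$n]$replacement
        }
    }
    return $word
}

foreach w {cats ponies walking walked bus} {
    puts "$w -> [toyStem $w]"
}
```

With this sketch, "walking" and "walked" both map to "walk", which is the kind of common stem the answer above describes.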


DKF: This code has been greatly tightened and should be clearer too. Some idioms work better in Tcl than they do in other languages, so transcribing an algorithm from C is not always straight-forward...

  ## Be nice and friendly with namespaces
  namespace eval ::soundex {namespace export soundex}

  ## Set up some static data only once
  array set ::soundex::soundexcode {
      a 0   b 1   c 2   d 3   e 0   f 1   g 2
      h 0   i 0   j 2   k 2   l 4   m 5   n 5
      o 0   p 1   q 2   r 6   s 2   t 3   u 0
      v 1   w 0   x 2   y 0   z 2
  }

  proc ::soundex::soundex {string} {
      variable soundexcode

      ## force lowercase and strip out all non-alphabetic characters
      regsub -all {[^a-z]} [string tolower $string] {} letters

      ## the null string is code Z000
      if {![string length $letters]} { return Z000 }

      set last -1
      set key  {}

      ## scan until end of string or until the key is built
      foreach char [split $letters {}] {
          set code $soundexcode($char)
          ## Fold together adjacent letters with the same code
          if {$last != $code} {
              set last $code
              ## Ignore code==0 letters except as separators
              if {$last} {
                  append key $last
                  ## Only need the first four
                  if {[string length $key] >= 4} {break}
              }
          }
      }

      ## normalise by adding zeros to get four characters
      string range "[string toupper $key]0000" 0 3
  }

DKF: Are soundex codes all numeric except for the one for the empty string? Or should the append really be an append of $char instead?

AK: Donal, the algorithm above is very, very near to the soundex algorithm by Knuth. The Z000 is possibly a remnant of that. The Knuth algorithm keeps the first letter of the word (in uppercase), whereas the algorithm here converts this letter to a soundex code too. I noticed this when I ran this one over the examples provided for the Knuth soundex: it came out identical for all the examples, except for the first position of the result.

You can find an implementation of the Knuth soundex in Tcl at Evan Rempel's page, http://web.uvic.ca/~erempel/tcl/Soundex/Soundex.html
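
The difference AK describes can be sketched by changing the proc from this page so that the first letter is kept instead of encoded. This is only an illustration under that one assumption: the namespace name ::soundex2 is made up, this is neither Evan Rempel's code nor the tcllib module, and some published Soundex variants treat h and w differently from vowels, which is not attempted here.

```tcl
# Hypothetical variant of the proc on this page: keep the first letter
# (uppercased) and encode only the letters that follow it.
namespace eval ::soundex2 {
    variable code
    array set code {
        a 0   b 1   c 2   d 3   e 0   f 1   g 2
        h 0   i 0   j 2   k 2   l 4   m 5   n 5
        o 0   p 1   q 2   r 6   s 2   t 3   u 0
        v 1   w 0   x 2   y 0   z 2
    }
}

proc ::soundex2::soundex {string} {
    variable code

    regsub -all {[^a-z]} [string tolower $string] {} letters
    if {$letters eq ""} { return Z000 }

    ## The first letter is kept as-is; its code still suppresses
    ## an immediately following letter with the same code
    set first [string index $letters 0]
    set key   [string toupper $first]
    set last  $code($first)

    foreach char [split [string range $letters 1 end] {}] {
        set c $code($char)
        if {$c != $last} {
            set last $c
            if {$c} {
                append key $c
                if {[string length $key] >= 4} {break}
            }
        }
    }

    ## pad with zeros to one letter plus three digits
    string range "${key}000" 0 3
}

puts [::soundex2::soundex Robert]   ;# R163
puts [::soundex2::soundex Rupert]   ;# R163
puts [::soundex2::soundex Knuth]    ;# K530
```

For comparison, the proc on this page gives 6163 for Robert, so the two results differ only in the first position, as noted above.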

LV: There are some alternative algorithms to soundex that attempt to achieve similar functionality. I recall seeing several coded for Perl. The benefit was that they achieved varying degrees of better matches. Anyone familiar with alternatives?


 ## ******************************************************** 
 ##
 ## Name: metaphone.tcl
 ##
 ## Description:
 ## A better soundex type algorithm
 ##
 ## Usage:
 ##
 ## Comments:
 ## The idea here is not to match some existing standard.
 ##
 ## The idea *is* to try to reduce *english* to a sound
 ## based structure while preserving readability.
 ##
 ## This results in output that *can* be used for the same
 ## purpose as soundex.
 ##
 ## No question, this is a first pass and needs polish!
 ##

 proc metaphone { string } {

     set patterns {
        (ough|igh|a|e|i|o|u|y) {}
        (gth) g
        (th|t|tt) t
        (sch|sh|ss|s) s
        (ghn|gn|nn) n
        (ch|zh|gh|x|j) x
        (ph|ff|f) f
        (ck|kk|k) k
        (gg|gh) g
        (ll|lh) l
        (mm|mn) m
        (dd|dh) d
        (zz|sz) z
         wh w
         h {}
     }

     foreach [ list pattern replacement ] $patterns {
        regsub -all -nocase $pattern $string $replacement string
     }

     regsub -all {\s+} $string { } string

     return $string
 }
 ## ******************************************************** 

if 0 {

DKF - I see what is meant by first pass, given that it treats sought, satay and asti as equivalent... :^)

}

You write that as if it were pregnant with significance. Why should it not do so?
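
For anyone wondering where that equivalence comes from, the first rule of the metaphone proc (the vowel-stripping pattern) is already enough to collapse the three words. The following minimal sketch applies only that one rule; the proc name stripVowels is made up for illustration.

```tcl
# Only the first (vowel-stripping) rule from the metaphone proc above.
proc stripVowels {word} {
    regsub -all -nocase {(ough|igh|a|e|i|o|u|y)} $word {} word
    return $word
}

foreach w {sought satay asti} {
    puts "$w -> [stripVowels $w]"
}
```

All three come out as "st", and none of the later rules changes that result.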


A reference site for Lawrence Philips' Metaphone and Double Metaphone algorithms can be found here: http://aspell.sourceforge.net/metaphone/


[ Category Command (a part of tcllib) ]