Version 5 of soundex

Updated 2003-03-21 21:46:53

Purpose: a pattern matching algorithm to how close two strings sound similar to one another.


DKF: This code has been greatly tightened and should be clearer too. Some idioms work better in Tcl than they do in other languages, so transcribing an algorithm from C is not always straight-forward...

  ## Be nice and friendly with namespaces
  namespace eval ::soundex {namespace export soundex}

  ## Set up some static data only once
  array set ::soundex::soundexcode {
      a 0   b 1   c 2   d 3   e 0   f 1   g 2
      h 0   i 0   j 2   k 2   l 4   m 5   n 5
      o 0   p 1   q 2   r 6   s 2   t 3   u 0
      v 1   w 0   x 2   y 0   z 2
  }

  proc ::soundex::soundex {string} {
      variable soundexcode

      ## force lowercase and strip out all non-alphabetic characters
      regsub -all {[^a-z]} [string tolower $string] {} letters

      ## the null string is code Z000
      if {![string length $letters]} { return Z000 }

      set last -1
      set key  {}

      ## scan until end of string or until the key is built
      foreach char [split $letters {}] {
          set code $soundexcode($char)
          ## Fold together adjacent letters with the same code
          if {$last != $code} {
              set last $code
              ## Ignore code==0 letters except as separators
              if {$last} {
                  append key $last
                  ## Only need the first four
                  if {[string length $key] >= 4} {break}
              }
          }
      }

      ## normalise by adding zeros to get four characters
      string range "[string toupper $key]0000" 0 3
  }

DKF: Are soundex codes all numeric except for the one for the empty string? Or should the append really be an append of $char instead?

LV: There are some alternative algorithms to soundex that attempt to achive similar functionality. I recall seeing several coded for Perl. The benefit was that they achieved varying degrees of better matches. Anyone familar with alternatives?


 ## ******************************************************** 
 ##
 ## Name: metaphone.tcl
 ##
 ## Description:
 ## A better soundex type algorithm
 ##
 ## Usage:
 ##
 ## Comments:
 ## The idea here is not to match some existing standard.
 ##
 ## The idea *is* to try to reduce *english* to a sound
 ## based structure while preserving readability.
 ##
 ## This results in output that *can* be used for the same
 ## purpose as soundex.
 ##
 ## No question, this is a first pass and needs polish!
 ##

 proc metaphone { string } {

     set patterns {
        (ough|igh|a|e|i|o|u|y) {}
        (gth) g
        (th|t|tt) t
        (sch|sh|ss|s) s
        (ghn|gn|nn) n
        (ch|zh|gh|x|j) x
        (ph|ff|f) f
        (ck|kk|k) k
        (gg|gh) g
        (ll|lh) l
        (mm|mn) m
        (dd|dh) d
        (zz|sz) z
         wh w
         h {}
     }

     foreach [ list pattern replacement ] $patterns {
        regsub -all -nocase $pattern $string $replacement string
     }

     regsub -all {\s+} $string { } string

     return $string
 }
 ## ******************************************************** 

if 0 {

DKF - I see what is meant by first pass, given that it equivalences sought, satay and asti... :^)

}

You write that as if it were pregnant with significance. Why should it not do so?


Category Command