Version 1 of soundex

Updated 2003-03-20 16:16:47

Purpose: a pattern matching algorithm to how close two strings sound similar to one another.


DKF: This code has been greatly tightened and should be clearer too. Some idioms work better in Tcl than they do in other languages, so transcribing an algorithm from C is not always straight-forward...

  ## Be nice and friendly with namespaces
  namespace eval ::soundex {namespace export soundex}

  ## Set up some static data only once
  array set ::soundex::soundexcode {
      a 0   b 1   c 2   d 3   e 0   f 1   g 2
      h 0   i 0   j 2   k 2   l 4   m 5   n 5
      o 0   p 1   q 2   r 6   s 2   t 3   u 0
      v 1   w 0   x 2   y 0   z 2
  }

  proc ::soundex::soundex {string} {
      variable soundexcode

      ## force lowercase and strip out all non-alphabetic characters
      regsub -all {[^a-z]} [string tolower $string] {} letters

      ## the null string is code Z000
      if {![string length $letters]} { return Z000 }

      set last -1
      set key  {}

      ## scan until end of string or until the key is built
      foreach char [split $letters {}] {
          set code $soundexcode($char)
          ## Fold together adjacent letters with the same code
          if {$last != $code} {
              set last $code
              ## Ignore code==0 letters except as separators
              if {$last} {
                  append key $last
                  ## Only need the first four
                  if {[string length $key] >= 4} {break}
              }
          }
      }

      ## normalise by adding zeros to get four characters
      string range "[string toupper $key]0000" 0 3
  }

DKF: Are soundex codes all numeric except for the one for the empty string? Or should the append really be an append of $char instead?

LV: There are some alternative algorithms to soundex that attempt to achive similar functionality. I recall seeing several coded for Perl. The benefit was that they achieved varying degrees of better matches. Anyone familar with alternatives?

PSE: OK. I'm working on a pure Tcl metaphone engine.


Category Command