soundex

Soundex is an algorithm designed by Knuth. It is a kind of hash for words, where words that sound similar will have hash to similar values. It can be used in a pattern matching algorithm for similar words which could be misspellings of one another, or mistaken for each other in speech (especially the output from speech recognition).

Official Documentation

There is an implementation of Knuth's soundex in tcllib:

https://core.tcl-lang.org/tcllib/doc/trunk/embedded/md/tcllib/files/modules/soundex/soundex.md

See also

  • similarity includes several implementation of Levenshtein (edit) Distance, and links to related algorithms
  • Lawrence Philips' Metaphone and Double Metaphone algorithms can be found here: http://aspell.sourceforge.net/metaphone/
  • Another implementation of Knuth's soundex in Tcl is at Evan Rempel's page, [L1 ]. AK consulted this implementation when developing the tcllib module.
  • Tclsnowball contains a binding to some stemmers

Discussion

AK: Feel free to add more algorithms here, polish existing one, and/or give references to papers, implementations, etc. I.e. make this a staging are for things which can go into the soundex module of tcllib.

I got myself essentially two soundexes, the one from this page and the Knuth one, implemented by Evan Rempel. When I found out that the algorithm here differs just in a minor detail from the Knuth (see later on this page) I decided to use the Knuth one as seed for the module. The other algorithm here, metaphone is a completely unknown to me and thus I am hesitant to just add it. Especially given the comments about first pass / draft and the unusual equivalences.


Michael Schlenker: Should the tcllib soundex module be reserved for such sounds-like pattern matching algorithms or could more linguistic tools be added (like stemmers, see for example Tclsnowball for a Tcl binding to some stemmers.)?

AK: Hm. ... This could be a question for the tcllib-devel mailing list in general.

Question: What does a stemmer do ?

Answer: A stemmer tries to find the stem of words. This is useful for mapping plural and singular forms of words, and other forms of basically the same word to a common stem.


DKF: This code has been greatly tightened and should be clearer too. Some idioms work better in Tcl than they do in other languages, so transcribing an algorithm from C is not always straight-forward...

## Be nice and friendly with namespaces
namespace eval ::soundex {namespace export soundex}

## Set up some static data only once
array set ::soundex::soundexcode {
    a 0   b 1   c 2   d 3   e 0   f 1   g 2
    h 0   i 0   j 2   k 2   l 4   m 5   n 5
    o 0   p 1   q 2   r 6   s 2   t 3   u 0
    v 1   w 0   x 2   y 0   z 2
}

proc ::soundex::soundex {string} {
    variable soundexcode

    ## force lowercase and strip out all non-alphabetic characters
    regsub -all {[^a-z]} [string tolower $string] {} letters

    ## the null string is code Z000
    if {![string length $letters]} { return Z000 }

    set last -1
    set key  {}

    ## scan until end of string or until the key is built
    foreach char [split $letters {}] {
        set code $soundexcode($char)
        ## Fold together adjacent letters with the same code
        if {$last != $code} {
            set last $code
            ## Ignore code==0 letters except as separators
            if {$last} {
                append key $last
                ## Only need the first four
                if {[string length $key] >= 4} {break}
            }
        }
    }

    ## normalise by adding zeros to get four characters
    string range "[string toupper $key]0000" 0 3
}

DKF: Are soundex codes all numeric except for the one for the empty string? Or should the append really be an append of $char instead?

AK: Donal, the algorithm above is very very near to the soundex algorithm by Knuth. The Z000 is possibly a remnant of that. The Knuth algorithm keeps the the first letter of the word (in uppercase) whereas the algorithm here converts this letter to a soundex code too. I noticed when I ran this one over the examples provided for the Knuth soundex and it came out identical for all the examples, except for the first position of the result.

LV: There are some alternative algorithms to soundex that attempt to achive similar functionality. I recall seeing several coded for Perl. The benefit was that they achieved varying degrees of better matches. Anyone familar with alternatives?


MG Apr 29th 2004 - I wrote a soundex package myself about a year ago, shortly before I found out Tcllib had one included. Just found it again, and thought I'd share it. It uses the same algorithm as Knuth, I believe, though is a fair bit slower. The only 'advantage' is the code is smaller, in terms of numbers of characters.

proc soundex-mg {word} {

    if {![string is alpha -strict $word]} {
        error "argument must contain only Unicode alphabet characters";
    }

    set word [string toupper $word]
        if {[string match "PH*" $word]} {
            set first "F"
            set rest [string range $word 2 end]
        } else {
            set first [string range $word 0 0]
            set rest [string range $word 1 end]
        }

    set map {A "" E "" I "" O "" U "" H "" W "" Y "" B 1 P 1 F 1 V 1 C 2 G 2 J 2 K 2 Q 2 S 2 X 2 Z 2 D 3 T 3 L 4 M 5 N 5 R 6}

    regsub -all {(.)\1+} [string map $map $rest] {\1} rest

    return [string range "$first${rest}000" 0 3];
 
};# soundex-mg