Soundex is an algorithm designed by Knuth. It is a kind of [hash] for words, where words that ''sound'' similar will have hash to similar values. It can be used in a pattern matching algorithm for similar words which could be misspellings of one another, or mistaken for each other in speech (especially the output from speech recognition). ** Official Documentation ** There is an implementation of Knuth's soundex in [tcllib]: http://core.tcl.tk/tcllib/doc/trunk/embedded/www/tcllib/files/modules/soundex/soundex.html: ** See also ** * [similarity] includes several implementation of Levenshtein (edit) Distance, and links to related algorithms * Lawrence Philips' Metaphone and Double Metaphone algorithms can be found here: http://aspell.sourceforge.net/metaphone/ * Another implementation of Knuth's soundex in Tcl is at [Evan Rempel]'s page, [http://web.uvic.ca/~erempel/tcl/Soundex/Soundex.html]. [AK] consulted this implementation when developing the [tcllib] module. * [Tclsnowball] contains a binding to some stemmers ---- ** Discussion ** [AK]: Feel free to add more algorithms here, polish existing one, and/or give references to papers, implementations, etc. I.e. make this a staging are for things which can go into the soundex module of [tcllib]. I got myself essentially two soundexes, the one from this page and the [Knuth] one, implemented by [Evan Rempel]. When I found out that the algorithm here differs just in a minor detail from the Knuth (see later on this page) I decided to use the Knuth one as seed for the module. The other algorithm here, metaphone is a completely unknown to me and thus I am hesitant to just add it. Especially given the comments about first pass / draft and the unusual equivalences. ---- [Michael Schlenker]: Should the tcllib soundex module be reserved for such sounds-like pattern matching algorithms or could more linguistic tools be added (like stemmers, see for example [Tclsnowball] for a Tcl binding to some stemmers.)? [AK]: Hm. ... This could be a question for the tcllib-devel mailing list in general. Question: What does a stemmer do ? Answer: A stemmer tries to find the stem of words. This is useful for mapping plural and singular forms of words, and other forms of basically the same word to a common stem. ---- [DKF]: This code has been greatly tightened and should be clearer too. Some idioms work better in Tcl than they do in other languages, so transcribing an algorithm from C is not always straight-forward... ====== ## Be nice and friendly with namespaces namespace eval ::soundex {namespace export soundex} ## Set up some static data only once array set ::soundex::soundexcode { a 0 b 1 c 2 d 3 e 0 f 1 g 2 h 0 i 0 j 2 k 2 l 4 m 5 n 5 o 0 p 1 q 2 r 6 s 2 t 3 u 0 v 1 w 0 x 2 y 0 z 2 } proc ::soundex::soundex {string} { variable soundexcode ## force lowercase and strip out all non-alphabetic characters regsub -all {[^a-z]} [string tolower $string] {} letters ## the null string is code Z000 if {![string length $letters]} { return Z000 } set last -1 set key {} ## scan until end of string or until the key is built foreach char [split $letters {}] { set code $soundexcode($char) ## Fold together adjacent letters with the same code if {$last != $code} { set last $code ## Ignore code==0 letters except as separators if {$last} { append key $last ## Only need the first four if {[string length $key] >= 4} {break} } } } ## normalise by adding zeros to get four characters string range "[string toupper $key]0000" 0 3 } ====== [DKF]: Are soundex codes all numeric except for the one for the empty string? Or should the append really be an append of ''$char'' instead? [AK]: Donal, the algorithm above is very very near to the soundex algorithm by Knuth. The Z000 is possibly a remnant of that. The Knuth algorithm keeps the the first letter of the word (in uppercase) whereas the algorithm here converts this letter to a soundex code too. I noticed when I ran this one over the examples provided for the Knuth soundex and it came out identical for all the examples, except for the first position of the result. [LV]: There are some alternative algorithms to soundex that attempt to achive similar functionality. I recall seeing several coded for Perl. The benefit was that they achieved varying degrees of better matches. Anyone familar with alternatives? ---- [MG] Apr 29th 2004 - I wrote a soundex package myself about a year ago, shortly before I found out Tcllib had one included. Just found it again, and thought I'd share it. It uses the same algorithm as Knuth, I believe, though is a fair bit slower. The only 'advantage' is the code is smaller, in terms of numbers of characters. ====== proc soundex-mg {word} { if {![string is alpha -strict $word]} { error "argument must contain only Unicode alphabet characters"; } set word [string toupper $word] if {[string match "PH*" $word]} { set first "F" set rest [string range $word 2 end] } else { set first [string range $word 0 0] set rest [string range $word 1 end] } set map {A "" E "" I "" O "" U "" H "" W "" Y "" B 1 P 1 F 1 V 1 C 2 G 2 J 2 K 2 Q 2 S 2 X 2 Z 2 D 3 T 3 L 4 M 5 N 5 R 6} regsub -all {(.)\1+} [string map $map $rest] {\1} rest return [string range "$first${rest}000" 0 3]; };# soundex-mg ====== ---- <> Command | Tcllib | String Processing