Purpose: a pattern matching algorithm for judging how closely two strings ''sound'' like one another.

----

Apparently [tcllib] has a new module for soundex. I wonder if any of the following code on this page has been considered for addition as well?

[AK]: I got myself essentially two soundexes, the one from this page and the [Knuth] one, implemented by [Evan Rempel]. When I found out that the algorithm here differs from the Knuth one in just a minor detail (see later on this page), I decided to use the Knuth one as the seed for the module. The other algorithm here, metaphone, is completely unknown to me and thus I am hesitant to just add it, especially given the comments about first pass / draft and the unusual equivalences. Feel free to add more algorithms here, polish the existing ones, and/or give references to papers, implementations, etc. I.e. make this a staging area for things which can go into the soundex module of [tcllib].

[LV] Also take note of the [similarity] wiki page.

----

[Michael Schlenker]: Should the tcllib soundex module be reserved for such sounds-like pattern matching algorithms, or could more linguistic tools be added (like stemmers; see for example [Tclsnowball] for a Tcl binding to some stemmers)?

[AK]: Hm. ... This could be a question for the tcllib-devel mailing list in general.

Question: What does a stemmer do?

Answer: A stemmer tries to find the stem of a word. This is useful for mapping plural and singular forms, and other forms of basically the same word, to a common stem.

----

[DKF]: This code has been greatly tightened and should be clearer too. Some idioms work better in Tcl than they do in other languages, so transcribing an algorithm from C is not always straightforward...

    ## Be nice and friendly with namespaces
    namespace eval ::soundex {namespace export soundex}

    ## Set up some static data only once
    array set ::soundex::soundexcode {
        a 0 b 1 c 2 d 3 e 0 f 1 g 2 h 0 i 0 j 2 k 2 l 4 m 5
        n 5 o 0 p 1 q 2 r 6 s 2 t 3 u 0 v 1 w 0 x 2 y 0 z 2
    }

    proc ::soundex::soundex {string} {
        variable soundexcode

        ## force lowercase and strip out all non-alphabetic characters
        regsub -all {[^a-z]} [string tolower $string] {} letters

        ## the null string is code Z000
        if {![string length $letters]} {
            return Z000
        }

        set last -1
        set key {}

        ## scan until end of string or until the key is built
        foreach char [split $letters {}] {
            set code $soundexcode($char)

            ## Fold together adjacent letters with the same code
            if {$last != $code} {
                set last $code

                ## Ignore code==0 letters except as separators
                if {$last} {
                    append key $last

                    ## Only need the first four
                    if {[string length $key] >= 4} {break}
                }
            }
        }

        ## normalise by adding zeros to get four characters
        string range "[string toupper $key]0000" 0 3
    }

[DKF]: Are soundex codes all numeric except for the one for the empty string? Or should the append really be an append of ''$char'' instead?

[AK]: Donal, the algorithm above is very, very near to the soundex algorithm by Knuth. The Z000 is possibly a remnant of that. The Knuth algorithm keeps the first letter of the word (in uppercase), whereas the algorithm here converts this letter to a soundex code too. I noticed this when I ran this one over the examples provided for the Knuth soundex: it came out identical for all the examples, except for the first position of the result. You can find an implementation of the Knuth soundex in Tcl at [Evan Rempel]'s page, http://web.uvic.ca/~erempel/tcl/Soundex/Soundex.html

[LV]: There are some alternative algorithms to soundex that attempt to achieve similar functionality. I recall seeing several coded for Perl. The benefit was that they achieved varying degrees of better matches. Anyone familiar with alternatives?
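
The first-position difference that AK describes above can be sketched roughly as follows. This is only an illustration, not the tcllib code and not [Evan Rempel]'s implementation: the `::sketch` namespace and the procs `digits`, `numeric` and `letterFirst` are hypothetical names made up for this example. It reuses the code table from this page, so ''h'' and ''w'' act as separators just like vowels, which may differ from other soundex descriptions.

```tcl
## Hypothetical sketch: all names in ::sketch are invented for illustration
namespace eval ::sketch {
    variable code
    ## Same letter-to-code table as the soundex proc on this page
    array set code {
        a 0 b 1 c 2 d 3 e 0 f 1 g 2 h 0 i 0 j 2 k 2 l 4 m 5
        n 5 o 0 p 1 q 2 r 6 s 2 t 3 u 0 v 1 w 0 x 2 y 0 z 2
    }
}

## Folded digit string for a lowercase, purely alphabetic word
proc ::sketch::digits {letters} {
    variable code
    set last -1
    set key {}
    foreach char [split $letters {}] {
        set c $code($char)
        ## Fold adjacent equal codes; code 0 only separates
        if {$c != $last} {
            set last $c
            if {$c} {append key $c}
        }
    }
    return $key
}

## All-numeric variant, as on this page
proc ::sketch::numeric {word} {
    regsub -all {[^a-z]} [string tolower $word] {} letters
    if {$letters eq ""} {return Z000}
    string range "[digits $letters]0000" 0 3
}

## Letter-first variant: keep the first letter instead of its code
proc ::sketch::letterFirst {word} {
    variable code
    regsub -all {[^a-z]} [string tolower $word] {} letters
    if {$letters eq ""} {return Z000}
    set first [string index $letters 0]
    set key [digits $letters]
    ## Drop the digit that stands for the first letter, if it produced one
    if {$code($first)} {set key [string range $key 1 end]}
    string range "[string toupper $first]${key}000" 0 3
}

puts [::sketch::numeric Robert]       ;# 6163
puts [::sketch::letterFirst Robert]   ;# R163
puts [::sketch::numeric Euler]        ;# 4600
puts [::sketch::letterFirst Euler]    ;# E460
```

In this sketch the two variants agree after the first position only when the word starts with a letter that has a non-zero code; for a word starting with a vowel (or ''h'', ''w'', ''y'') the numeric variant has no digit for that letter, so the remaining digits shift left by one place.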

----

[MG] Apr 29th 2004 - I wrote a soundex package myself about a year ago, shortly before I found out Tcllib had one included. Just found it again, and thought I'd share it. It uses the same algorithm as Knuth, I believe, though it is a fair bit slower. The only 'advantage' is that the code is smaller, in terms of number of characters.

    proc soundex-mg {word} {
        ## Reject empty strings and anything that is not purely alphabetic
        if { ![string is alpha -strict $word] } {
            error "argument must contain only Unicode alphabet characters";
        }
        set word [string toupper $word]
        ## Treat a leading PH as F
        if { [string match "PH*" $word] } {
            set first "F"
            set rest [string range $word 2 end]
        } else {
            set first [string range $word 0 0]
            set rest [string range $word 1 end]
        }
        ## Drop vowels, H, W and Y; map the remaining letters to their codes
        set map {A "" E "" I "" O "" U "" H "" W "" Y "" B 1 P 1 F 1 V 1 C 2 G 2 J 2 K 2 Q 2 S 2 X 2 Z 2 D 3 T 3 L 4 M 5 N 5 R 6}
        ## Collapse runs of the same code into a single digit
        regsub -all {(.)\1+} [string map $map $rest] {\1} rest
        ## Pad with zeros and keep four characters
        return [string range "$first${rest}000" 0 3];
    };# soundex-mg

----

A reference site for Lawrence Philips' Metaphone and Double Metaphone algorithms can be found here: http://aspell.sourceforge.net/metaphone/

----

[[ [Category Command] (a part of [tcllib]) - [Category String Processing] ]]