Purpose: a pattern matching algorithm to how close two strings ''sound'' similar to one another. ---- Apparently [tcllib] has a new module for soundex. I wonder if any of the following code on this page has been considered for addition as well? [AK]: I got myself essentially two soundexes, the one from this page and the Knuth one, implemented by [Evan Rempel]. When I found out that the algorithm here differs just in a minor detail from the Knuth (see later on this page) I decided to use the Knuth one as seed for the module. The other algorithm here, metaphone is a completely unknown to me and thus I am hesitant to just add it. Especially given the comments about first pass / draft and the unusual equivalences. Feel free to add more algorithms here, polish existing one, and/or give references to papers, implementations, etc. I.e. make this a staging are for things which can go into the soundex module of [tcllib]. ---- [DKF]: This code has been greatly tightened and should be clearer too. Some idioms work better in Tcl than they do in other languages, so transcribing an algorithm from C is not always straight-forward... ## Be nice and friendly with namespaces namespace eval ::soundex {namespace export soundex} ## Set up some static data only once array set ::soundex::soundexcode { a 0 b 1 c 2 d 3 e 0 f 1 g 2 h 0 i 0 j 2 k 2 l 4 m 5 n 5 o 0 p 1 q 2 r 6 s 2 t 3 u 0 v 1 w 0 x 2 y 0 z 2 } proc ::soundex::soundex {string} { variable soundexcode ## force lowercase and strip out all non-alphabetic characters regsub -all {[^a-z]} [string tolower $string] {} letters ## the null string is code Z000 if {![string length $letters]} { return Z000 } set last -1 set key {} ## scan until end of string or until the key is built foreach char [split $letters {}] { set code $soundexcode($char) ## Fold together adjacent letters with the same code if {$last != $code} { set last $code ## Ignore code==0 letters except as separators if {$last} { append key $last ## Only need the first four if {[string length $key] >= 4} {break} } } } ## normalise by adding zeros to get four characters string range "[string toupper $key]0000" 0 3 } [DKF]: Are soundex codes all numeric except for the one for the empty string? Or should the append really be an append of ''$char'' instead? [AK]: Donal, the algorithm above is very very near to the soundex algorithm by Knuth. The Z000 is possibly a remnant of that. The Knuth algorithm keeps the the first letter of the word (in uppercase) whereas the algorithm here converts this letter to a soundex code too. I noticed when I ran this one over the examples provided for the Knuth soundex and it came out identical for all the examples, except for the first position of the result. You can can find an implementation of the Knuth soundex in Tcl at [Evan Rempel]'s page, http://web.uvic.ca/~erempel/tcl/Soundex/Soundex.html [LV]: There are some alternative algorithms to soundex that attempt to achive similar functionality. I recall seeing several coded for Perl. The benefit was that they achieved varying degrees of better matches. Anyone familar with alternatives? ---- ## ******************************************************** ## ## Name: metaphone.tcl ## ## Description: ## A better soundex type algorithm ## ## Usage: ## ## Comments: ## The idea here is not to match some existing standard. ## ## The idea *is* to try to reduce *english* to a sound ## based structure while preserving readability. ## ## This results in output that *can* be used for the same ## purpose as soundex. ## ## No question, this is a first pass and needs polish! ## proc metaphone { string } { set patterns { (ough|igh|a|e|i|o|u|y) {} (gth) g (th|t|tt) t (sch|sh|ss|s) s (ghn|gn|nn) n (ch|zh|gh|x|j) x (ph|ff|f) f (ck|kk|k) k (gg|gh) g (ll|lh) l (mm|mn) m (dd|dh) d (zz|sz) z wh w h {} } foreach [ list pattern replacement ] $patterns { regsub -all -nocase $pattern $string $replacement string } regsub -all {\s+} $string { } string return $string } ## ******************************************************** if 0 { [DKF] - I see what is meant by ''first pass'', given that it equivalences ''sought'', ''satay'' and ''asti''... :^) } You write that as if it were pregnant with significance. Why should it not do so? ---- A reference site for Lawrence Philips Metaphone and Double Metaphone algorithms can be found here: http://aspell.sourceforge.net/metaphone/ ---- [[ [Category Command] | [tcllib] ]]