Purpose: a pattern matching algorithm to how close two strings ''sound'' similar to one another. ---- Apparently [tcllib] has a new module for soundex. I wonder if any of the following code on this page has been considered for addition as well? ---- [DKF]: This code has been greatly tightened and should be clearer too. Some idioms work better in Tcl than they do in other languages, so transcribing an algorithm from C is not always straight-forward... ## Be nice and friendly with namespaces namespace eval ::soundex {namespace export soundex} ## Set up some static data only once array set ::soundex::soundexcode { a 0 b 1 c 2 d 3 e 0 f 1 g 2 h 0 i 0 j 2 k 2 l 4 m 5 n 5 o 0 p 1 q 2 r 6 s 2 t 3 u 0 v 1 w 0 x 2 y 0 z 2 } proc ::soundex::soundex {string} { variable soundexcode ## force lowercase and strip out all non-alphabetic characters regsub -all {[^a-z]} [string tolower $string] {} letters ## the null string is code Z000 if {![string length $letters]} { return Z000 } set last -1 set key {} ## scan until end of string or until the key is built foreach char [split $letters {}] { set code $soundexcode($char) ## Fold together adjacent letters with the same code if {$last != $code} { set last $code ## Ignore code==0 letters except as separators if {$last} { append key $last ## Only need the first four if {[string length $key] >= 4} {break} } } } ## normalise by adding zeros to get four characters string range "[string toupper $key]0000" 0 3 } [DKF]: Are soundex codes all numeric except for the one for the empty string? Or should the append really be an append of ''$char'' instead? [AK]: Donal, the algorithm above is very very near to the soundex algorithm by Knuth. The Z000 is possibly a remnant of that. The Knuth algorithm keeps the the first letter of the word (in uppercase) whereas the algorithm here converts this letter to a soundex code too. I noticed when I ran this one over the examples provided for the Knuth soundex and it came out identical for all the examples, except for the first position of the result. You can can find an implementation of the Knuth soundex in Tcl at [Evan Rempel]'s page, http://web.uvic.ca/~erempel/tcl/Soundex/Soundex.html [LV]: There are some alternative algorithms to soundex that attempt to achive similar functionality. I recall seeing several coded for Perl. The benefit was that they achieved varying degrees of better matches. Anyone familar with alternatives? ---- ## ******************************************************** ## ## Name: metaphone.tcl ## ## Description: ## A better soundex type algorithm ## ## Usage: ## ## Comments: ## The idea here is not to match some existing standard. ## ## The idea *is* to try to reduce *english* to a sound ## based structure while preserving readability. ## ## This results in output that *can* be used for the same ## purpose as soundex. ## ## No question, this is a first pass and needs polish! ## proc metaphone { string } { set patterns { (ough|igh|a|e|i|o|u|y) {} (gth) g (th|t|tt) t (sch|sh|ss|s) s (ghn|gn|nn) n (ch|zh|gh|x|j) x (ph|ff|f) f (ck|kk|k) k (gg|gh) g (ll|lh) l (mm|mn) m (dd|dh) d (zz|sz) z wh w h {} } foreach [ list pattern replacement ] $patterns { regsub -all -nocase $pattern $string $replacement string } regsub -all {\s+} $string { } string return $string } ## ******************************************************** if 0 { [DKF] - I see what is meant by ''first pass'', given that it equivalences ''sought'', ''satay'' and ''asti''... :^) } You write that as if it were pregnant with significance. Why should it not do so? ---- [Category Command]