Purpose: a pattern matching algorithm to how close two strings ''sound'' similar to one another. ---- DKF: This code has been greatly tightened and should be clearer too. Some idioms work better in Tcl than they do in other languages, so transcribing an algorithm from C is not always straight-forward... ## Be nice and friendly with namespaces namespace eval ::soundex {namespace export soundex} ## Set up some static data only once array set ::soundex::soundexcode { a 0 b 1 c 2 d 3 e 0 f 1 g 2 h 0 i 0 j 2 k 2 l 4 m 5 n 5 o 0 p 1 q 2 r 6 s 2 t 3 u 0 v 1 w 0 x 2 y 0 z 2 } proc ::soundex::soundex {string} { variable soundexcode ## force lowercase and strip out all non-alphabetic characters regsub -all {[^a-z]} [string tolower $string] {} letters ## the null string is code Z000 if {![string length $letters]} { return Z000 } set last -1 set key {} ## scan until end of string or until the key is built foreach char [split $letters {}] { set code $soundexcode($char) ## Fold together adjacent letters with the same code if {$last != $code} { set last $code ## Ignore code==0 letters except as separators if {$last} { append key $last ## Only need the first four if {[string length $key] >= 4} {break} } } } ## normalise by adding zeros to get four characters string range "[string toupper $key]0000" 0 3 } DKF: Are soundex codes all numeric except for the one for the empty string? Or should the append really be an append of ''$char'' instead? LV: There are some alternative algorithms to soundex that attempt to achive similar functionality. I recall seeing several coded for Perl. The benefit was that they achieved varying degrees of better matches. Anyone familar with alternatives? ---- ## ******************************************************** ## ## Name: metaphone.tcl ## ## Description: ## A better soundex type algorithm ## ## Usage: ## ## Comments: ## The idea here is not to match some existing standard. ## ## The idea *is* to try to reduce *english* to a sound ## based structure while preserving readability. ## ## This results in output that *can* be used for the same ## purpose as soundex. ## ## No question, this is a first pass and needs polish! ## proc metaphone { string } { set patterns { (ough|igh|a|e|i|o|u|y) {} (gth) g (th|t|tt) t (sch|sh|ss|s) s (ghn|gn|nn) n (ch|zh|gh|x|j) x (ph|ff|f) f (ck|kk|k) k (gg|gh) g (ll|lh) l (mm|mn) m (dd|dh) d (zz|sz) z wh w h {} } foreach [ list pattern replacement ] $patterns { regsub -all -nocase $pattern $string $replacement string } regsub -all {\s+} $string { } string return $string } ## ******************************************************** if 0 { [DKF] - I see what is meant by ''first pass'', given that it equivalences ''sought'', ''satay'' and ''asti''... :^) } You write that as if it were pregnant with significance. Why should it not do so? ---- [Category Command]