With a given word list (such as from the 12-dicts [L1 ] project) formatted as a Tcl procedure that returns a list [L2 ], it is interesting to speculate on the fastest procedure for obtaining anagrams of a given set of letters.
The following procedure is my effort. It has some refinements: (1) it takes an optional template, as for string match and (2) it allows for a wildcard character (such as the blank in Scabble or the joker in card games), written as "_".
Code:
proc anagram {style letters {template *}} { set anagrams [list] set usingTemplate [string match "template" $style] foreach word [wordList] { if {$usingTemplate} { if {![string match $template $word]} {continue} } set discard $letters set flag true for {set i 0} {$i<[string length $word]} {incr i} { set match [string first [string index $word $i] $discard] if {$match>-1} { set discard [string replace $discard $match $match] } else { #-- We don't hold the char; but we may have a blank set match [string first _ $discard] if {$match>-1} { #-- We do have a blank, so we use that instead set discard [string replace $discard $match $match] } else { set flag false break } } } if {$flag} { lappend anagrams $word } } return $anagrams } # Examples: catch {console show} puts [anagram anyof RETAINS] puts [anagram template RETAINS *A] puts [anagram template TO_UE ?????]
Alastair Davies - 17 January 2006
HJG I get the error 'Invalid command name "wordList"'. Maybe a Tcl 8.5 feature ? - RS No - looks like a function that reads the word list file, and returns them as a list. Left as an exercise for the reader :^)
Alastair responds: HJG and others can download [L3 ] the source code that defines "wordList". The start and end of this look like...
proc wordList {{lang en}} { switch $lang { en { return { AARDVARK AARDVARKS ABACI ABACK ABACUS ZYGOTE ZYGOTES ZYGOTIC ZYMURGY } } } }
...but the whole file is 1359 kb. English only at present, but does anyone have a similar French or German word list that could be added to the source?
KPV By far the fastest way to do this is to preprocess the word list and create what's known as an anagram dictionary[L4 ]. Once that's done, figuring out anagrams takes O(1) time.
Over lunchtime sandwiches, Alastair created an anagram dictionary:
set anagramDict [dict create] foreach word [wordList] { dict lappend anagramDict [join [lsort [split $word ""]] ""] $word }
In anagramDict, the key "AEINRST", for example, lists the words "ANTSIER", "NASTIER", "RETAINS", "RETINAS" and "RETSINA", and it takes (approximately) no time at all to access them. Thanks, KPV! Incidentally, there are 69894 keys in the dictionary and 75273 words in the original list.
Thinking of anagrams in the looser sense (using any, rather than all, of the letters), I came up with this recursive procedure to reduce a word letter by letter to find all possible keys:
proc reduce {word} { lappend reduction $word if {[string length $word]>2} { for {set i 0} {$i<[string length $word]} {incr i} { set reduction [concat $reduction [reduce [string replace $word $i $i]]] } } return [lsort -uniq $reduction] }
These keys can be used to search the anagramDict (as defined above):
proc anagram {word} { global anagramDict set word [join [lsort [split $word ""]] ""] set result [list] foreach key [reduce $word] { if {[dict exists $anagramDict $key]} { set result [concat $result [dict get $anagramDict $key]] } } return $result }
KPV Speaking of anagram dictionaries, there are a couple of interesting questions you can ask. See if you answer them before looking at the dictionary.
Alastair is amused: Ha! Funny. Makes me look at words in a new light, to be considering the letters in alphabetical order.
Unfortunately (for me), recursion (see proc reduce {word} above) turns out not to be a good solution for identifying all the possible groups of letters within a word. The time taken blows up for words of length somewhere around nine or ten letters. Has anyone got a better suggestion to share?
Lars H: I'm not surprised, considering that you generate every subword of length k from a word of length n as many as (n-k)! times. Changing reduce so that it branches on the two cases (i) subword contains first letter of $word and (ii) subword does not contain first letter of $word should improve this considerably; see power set of a list for some relevant procedures.
There is also a regexp-based alternative, which may be very fast:
proc anagram2 {word} { global anagramDict array set anagramArray $anagramDict set RE ^[join [lsort [split $word ""]] ?]?\$ set result [list] foreach key [array names anagramArray -regexp $RE] { lappend result {*}$anagramArray($key) } return $result }; # Not tested, but should work.
It's necessary to put the dictionary in an array for this, because the -regexp option of array names was not remembered (or simply discarded as feature creep) when dict keys was designed.
Alastair found the pointer to enumerating the power set of a list very helpful, and snaffled the subsets procedure for the following (speedy) anagrammer:
proc subsets {l} { set subsets [list [list]] foreach e $l { foreach subset $subsets { lappend subsets [lappend subset $e] } } return $subsets } proc anagram3 {word {template *}} { global anagramDict set letters [lsort [split $word ""]] set result [list] foreach subset [subsets $letters] { set key [join $subset ""] if {[dict exists $anagramDict $key]} { foreach word [dict get $anagramDict $key] { if {[string match $template $word]} { lappend result $word } } } } return $result }
This version doesn't handle wildcard characters - as far as I can see these would need to be inserted in each possible position within each of the returned subsets, to be matched using a glob pattern by dict keys. Incidentally, using regular expressions to match the keys (of an array) proved to be rather slow.