CombinationCapitalization

Larger Problem: Generating all possible combinations of all possible capitalizations of six words chosen from a list of N words.

To illustrate, consider the reduced problem of combinations of two words from a list of three word list: dog cat bear all combinations: dogdog dogcat dogbear catdog catcat catbear beardog bearcat bearbear all possible capitalizations would start: dogdog Dogdog DOgdog DOGdog ... dogDog ... dogDOG ...

I had the goal to avoid use of recursion.

First I started with a Sub-Problem: I want to have all combinations of capitalization of the letters of a word.

Again the goal was to not use recursion. Here was what I first came up with:

proc binaryadd {bits} {
  set carry 1
  set l [llength $bits]
  set m [expr $l - 1]
  for {} {$m != -1} {incr m -1} {
    if {$carry == 1} {
      if {[lindex $bits $m] == 0} {
        set bits [lreplace $bits $m $m 1]
        set carry 0
      } else {
        set bits [lreplace $bits $m $m 0]
        set carry 1
      }
    }
  }
  if {$carry == 1} { return {} 
  } else {           return $bits }
}

proc capmask {wordlist mask} {
  set n 0
  foreach m $mask {
    if {$m == 1} {
      set ch [lindex $wordlist $n]
      if { [string is upper $ch] } {
        set wordlist [lreplace $wordlist $n $n [string tolower $ch] ]
      } else {
        set wordlist [lreplace $wordlist $n $n [string toupper $ch] ]
      }
    }
    incr n
  }
  return [join $wordlist {}]
}

proc allcap {word} {
  set len [string length $word]
  for {set n 0} {$n < $len} {incr n} { lappend wordlist [string index $word $n] }
  for {set n 1} {$n <= $len} {incr n} { lappend bits 0 }
  while {[llength $bits] > 0} {
    lappend results [capmask $wordlist $bits]
    set bits [binaryadd $bits]
  }
  return $results
}

One might run this by: allcap dog

and get the output: dog Dog dOg DOg doG DoG dOG DOG

This works by creating a binary counter that counts through all possible values for the number of bits equal to the number of letters in the word. The counter returns the empty list when it overflows a full count. The bits indicate which letters in the word ought to be capitalized.

Now, maybe someone has a better implementation of the counter, but I thought this was a reasonable way to do it.

So, the Sub-Problem is solved.

Now back to the problem. Here was my initial implementation of this, which used the above code for allcap, capmask and binaryadd.

set fp [open "dict.txt" r]; # file contains a list of words
set data [read $fp]; close $fp
set words [split $data "\n"]
set wordsnull $words; lappend wordsnull {};  # add on the empty entry to handle reduced number of words
set numwords [llength $words]
set numwordsnull [expr $numwords + 1]

set ofp [open "expanded-dict.txt" w]

# Combine up to six words, using the null at the end of the list to make it less than 6.
# In other words, when you hit the null word in the list wordsnull you will have no entry for that word so 
# it could result in just 1, 2, 3, 4 or 5 words combined, instead of 6.

for {set n1 0} {$n1 < $numwords} {incr n1} {
  for {set n2 0} {$n2 < $numwordsnull} {incr n2} {
    for {set n3 0} {$n3 < $numwordsnull} {incr n3} {
      for {set n4 0} {$n4 < $numwordsnull} {incr n4} {
        for {set n5 0} {$n5 < $numwordsnull} {incr n5} {
          for {set n6 0} {$n6 < $numwordsnull} {incr n6} {
            set phrase [lindex $words $n1]
            append phrase [lindex $wordsnull $n2]
            append phrase [lindex $wordsnull $n3]
            append phrase [lindex $wordsnull $n4]
            append phrase [lindex $wordsnull $n5]
            append phrase [lindex $wordsnull $n6]
            puts $ofp [join [allcap $phrase] "\n"]
          }
        }
      }
    }
  }
}

As is par for me, I underestimated the computational resources this would require in CPU time. However, I also encountered some sort of memory leak. On Windows XP, wish put out some sort of message saying it couldn't get any more memory. On Linux, on a machine with 64 GB of memory, tclsh eventually complained that it could not allocate 384 bytes. So, there is a some sort of leak.

I'm in the process of optimizing the code to reduce variable allocation by having everything be global and not passing anything into procs. I've also optimized it to reduce the total number of capital letters allowed and the total word length, as I had the flexibility in my application to do that. I'll post an update here when I get that fully running.

buchs 3/18/2010

Here is the code that is optimized:

#!/usr/bin/tclsh
proc binaryadd {} {
  global bitmask carry m
  set carry 1
  set m [expr [llength $bitmask] - 1]
  for {} {$m != -1} {incr m -1} {
    if {$carry == 1} {
      if {[lindex $bitmask $m] == 0} {
        set bitmask [lreplace $bitmask $m $m 1]
        set carry 0
      } else {
        set bitmask [lreplace $bitmask $m $m 0]
        set carry 1
      }
    }
  }
  if {$carry == 1} { set bitmask {} }
}

proc capmask {} {
  global wordlist bitmask m bit ch results
  set m 0
  foreach bit $bitmask {
    if {$bit == 1} {
      set ch [lindex $wordlist $m]
      if { [string is upper $ch] } {
        set wordlist [lreplace $wordlist $m $m [string tolower $ch] ]
      } else {
        set wordlist [lreplace $wordlist $m $m [string toupper $ch] ]
      }
    }
    incr m
  }
  lappend results [join $wordlist {}]
}

proc maxcaps {num} {
  global bitmask
  if { [string length [string map {0 {}} [join $bitmask {}]] ] > $num } {
    return true
  } else { return false }
}

proc allcap {} {
  global results wordlist bitmask phrase
  set results {}; set wordlist {}; set bitmask {}
  set len [string length $phrase]
  for {set n 0} {$n < $len} {incr n} { 
    lappend wordlist [string index $phrase $n]
    lappend bitmask 0
  }
  while {[llength $bitmask] > 0} {
    if {[maxcaps 6]} { continue }
    capmask
    binaryadd
  }
}

set fp [open "dict.txt" r]
set data [read $fp]; close $fp
set words [split $data "\n"]
set wordsnull $words; lappend wordsnull {};  # add on the empty entry to handle reduced number of words
set numwords [llength $words]
set numwordsnull [expr $numwords + 1]

set ofp [open "expanded-dict.txt" w]

# Combine up to six words, using the null at the end of the list to make it less than 6.
# In other words, when you hit the null word in the list wordsnull you will have no entry for that word
for {set n1 0} {$n1 < $numwords} {incr n1} {
  for {set n2 0} {$n2 < $numwordsnull} {incr n2} {
    for {set n3 0} {$n3 < $numwordsnull} {incr n3} {
      for {set n4 0} {$n4 < $numwordsnull} {incr n4} {
        for {set n5 0} {$n5 < $numwordsnull} {incr n5} {
          close $ofp; set ofp [open "expanded-dict.txt" a]
          for {set n6 0} {$n6 < $numwordsnull} {incr n6} {
            if { $n2 == $n1 || $n3 == $n1 || $n4 == $n1 || $n5 == $n1 || $n6 == $n1 || \
                               $n3 == $n2 || $n4 == $n2 || $n5 == $n2 || $n6 == $n2 || \
                                             $n4 == $n3 || $n5 == $n3 || $n6 == $n3 || \
                                                           $n5 == $n4 || $n6 == $n4 || \
                                                                         $n6 == $n5   } \
                { continue }
            set phrase [lindex $words $n1];
            append phrase [lindex $wordsnull $n2];
            append phrase [lindex $wordsnull $n3];
            append phrase [lindex $wordsnull $n4];
            append phrase [lindex $wordsnull $n5];
            append phrase [lindex $wordsnull $n6];
            if { [string length $phrase] > 15 } { continue }
            allcap
            puts $ofp [join $results "\n"]
          }
        }
      }
    }
  }
}

This script runs for hours and hours without encountering the out of memory error. This, I believe, was a result of making all variables global. I restricted the phrase size and the number of case changes is up to 6 (continue in the loop if the number of "1"s is more than 6). Still, I gave the computer way too much work to do! ;-). I'll have to re-scope the problem.

buchs 3/19/2010

AM (24 march 2010) It may help considerably to put the loops in a procedure - global code is not byte-code compiled, so Tcl is interpreting the code. (At least that was true for a long time)

Category Mathematics

Category Algorithm

Category Performance

Category Word and Text Processing

Category String Processing