[Richard Suchenwirth] 2002-10-31 - In [Huffman coding], it was described how to en- and decode strings with bit sequences of varying length - the frequent ones short, the rarer long. In order to play with the [binary] command, here is a second take which starts from frequency-counting a particular text: proc ccount string { # character frequency count, e.g. # ccount Tennessee => {e 4} {n 2} {s 2} {T 1} foreach char [split $string ""] { if [catch {incr a($char)}] {set a($char) 1} } set res {} foreach char [array names a] { lappend res [list [list $char] $a($char)] } lsort -integer -decr -index 1 $res } if 0 {Such a tally can well be used to construct a Huffman code tree, as I've learned from [SICP], where the two most rare occurences are merged into a node, until the tree is complete:} proc huffmantree string { set res [ccount $string] while {[llength $res]>1} { foreach {c1 n1} [lindex $res end] break foreach {c2 n2} [lindex $res end-1] break set res [lrange $res 0 end-2] lappend res [list [list $c1 $c2] [expr {$n1+$n2}]] set res [lsort -integer -decr -index 1 $res] } lindex [lindex $res 0] 0 } if 0 {For [string map] use, such a tree has to be turned into a paired list of character and bit sequence, by a recursive procedure that traverses the code tree and adds 0 or 1 until a terminal node is reached:} proc tree2map {tree {prefix ""}} { if {[llength $tree]==1} { list $tree $prefix } else { concat [tree2map [lindex $tree 0] ${prefix}0] \ [tree2map [lindex $tree 1] ${prefix}1] } } if 0 {The resulting list has to be [join]ed once and is then ready for doing the encoding into a binary string. The following proc returns both the used map and the binary string:} proc encode string { set map [join [tree2map [huffmantree $string//]]] list $map [binary format b* [string map $map $string]] } if 0 {Decoding is easy with the map (which has to be inverted, so that {a b c d} makes {b a d c}), and the binary string as input:} proc decode {mapstring} { set rmap {} foreach {char code} [lindex $mapstring 0] { lappend rmap $code $char } binary scan [lindex $mapstring 1] b* bits string map $rmap $bits } # Self-referential test data... set fp [open [info script]] set data [read $fp] close $fp if 0 {You can now experiment with short strings: puts [decode [encode "Hello, world!"]] or longer strings, namely this page: puts [decode [encode $data]] and notice that the original string comes back well, except for maybe some stray characters added at the end. This is because ''[binary] format'' produces a sequence of bytes, so in cases where the bits string's length is not a multiple of eight, it is zero-padded in ''encode''. A simple way to get around this is to append a certain sequence (e.g. //) to the original and let the receiver discar the rest after that sequence. (Of course it should not appear in the original text - so don't use this for C++ code sources..) By comparing the string length of the original with the binary string, you get an idea how much compression is possible with this approach: even the custom Huffman map together with the binary string comes out shorter than the original, e.g. 2885/3428 = 84.2% You can use this for compressed encryption if both sender and receiver know the Huffman map, but then you have to take care that all characters in the input appear in the map - for this page (at some point of writing) the compression ratio (counting the binary string only) is 1991/3428 = 58.1% }