RHS 09Spt2004
LZ77 compression is a precursor to LZW compression (An LZW-compressing virtual filesystem), I believe. Lars H: Mostly no; the precursor to LZW is LZ78, which apart from (i) being invented by the same Lempel and Ziv and (ii) belongs to same general group of compression algorithms (dictionary-based), is quite different from LZ77 -- for one, LZ78 (implicitly) constructs a dictionary of phrases. When working on the Binary image compression challenge, I became interested in writing a pure-tcl implementation of gzip. As gzip uses LZ77 (and Huffman encoding), I decided to start by implementing such an algorithm. Sadly, it's spectacularly slow, so it's not worthwhile in its present form. I figured it would be worth posting, however, in case anyone wanted to play with it.
Random Note: The new binary search maxSubstring code takes the encoding of cat.gif from 7.5 hours to 3.5 minutes... yay :)
For a very simple explanation of LZ77 compression... It works iterating from the beginning to the end of the string. For each substring [string range $data $currentIndex end] it checks to see if it has seen that substring (or any shorter version of it with the same starting point) previously. If it hasn't, it outputs the char at $index. If it has, it outputs a code to say how long ago it saw the substring, and how long it is. As an example:
Now, for an interesting twist, the match can include the current substring. For example:
I chose to use the following data/constraints (since I'm not sure how gzip does it):
MS See also LZ77 Compression - take 2 for a different version.
RHS Finally found the bug... turns out (and this is obvious now that I noticed it) that \x02 cannot be escaped by a \x01 following it. It uses two bytes to encode the length, and \x02 \x01 \x.. is perfectly valid. I changed the "Escape of 02 escape" to be \x02 \x00 \x01, which is NOT valid. Code now passes all tests. MS: I chose instead to encode Escape1 as \x01\x01 and Escape2 as \x01\x02 - as maxSubstring never returns a length < 3, this is OK and simpler.
# ############################################# # FILE: lz77.tcl - the code for the compression namespace eval lz77 { variable lookback 255 variable Escape1 "\x01" variable Escape2 "\x02" variable EscEsc "\x01" set K32 [expr {32 * 1024}] } proc ::lz77::K {a b} { return $a } proc ::lz77::encode {data} { variable Escape1 variable Escape2 variable EscEsc variable K32 set output "" set dataLength [string length $data] append output [string index $data 0] for {set i 1} {$i < $dataLength} {incr i} { #puts -nonewline "\n$i: " set foundAt 0 set foundLen 0 # Get the first 32k bytes of the current string set found [maxSubstring $i $data length offset] if { $found && $length > 0 } { if { ($offset < 255) && ($foundLen < 255) } { set offChar [format %c $offset] set lenChar [format %c $length] append output $Escape1$lenChar$offChar incr i $length incr i -1 } else { append output $Escape2 append output [binary format S $length] append output [binary format S $offset] incr i $length incr i -1 } } else { set char [string index $data $i] if { [string equal $char "$Escape1"] } { append output "$char$EscEsc" } elseif { [string equal $char "$Escape2"] } { append output "$char\x00$EscEsc" } else { append output $char } } } return $output } proc ::lz77::decode {data {original ""}} { variable Escape1 variable Escape2 variable EscEsc set output "" for {set i 0} {$i < [string length $data]} {incr i} { set char [string index $data $i] if { ($char ne "$Escape1") && ($char ne "$Escape2") } { append output $char } elseif { ($char eq "$Escape1") && [string index $data [expr {$i + 1}]] eq "$EscEsc" } { append output $char incr i } elseif { ($char eq "$Escape2") && ([string index $data [expr {$i + 1}]] eq "\x00") && ([string index $data [expr {$i + 2}]] eq "\x01") } { append output $char incr i 2 } else { if { $char eq $Escape1 } { scan [string index $data [incr i]] %c length scan [string index $data [incr i]] %c offset set index [expr {[string length $output] - $offset}] for {set j 0} {$j < $length} {incr j} { set char [string index $output $index] append output $char incr index } } else { binary scan [string range $data [incr i] [incr i]] S length binary scan [string range $data [incr i] [incr i]] S offset set index [expr {[string length $output] - $offset}] for {set j 0} {$j < $length} {incr j} { set char [string index $output $index] append output $char incr index } } } } return $output } proc ::lz77::maxSubstring {index data &matchLength &matchOffset} { upvar ${&matchLength} returnLength upvar ${&matchOffset} returnOffset variable K32 set matchLength 0 set matchOffset 0 if { $index > $K32 } { set data [string range $data [expr {$index - $K32}] end] set index $K32 } set upperBound [expr {1 + [string length $data] - $index}] if { $upperBound >= $K32 } { set upperBound $K32 } set lowerBound 0 while { ($upperBound > 3) && (($upperBound - $lowerBound) > 1) } { set currentLength [expr {($upperBound + $lowerBound)/2}] if { $currentLength < 3 } { set currentLength 3 } set currString [string range $data $index \ [expr {$index + $currentLength -1}]] set currData [string range $data 0 [expr {$index + $currentLength -2}]] set found 0 if { ([set matchIndex [string last $currString $currData]] >= 0) } { set len [string length $currString] if { ($index - $matchIndex) > 255 && $len < 5 } { if { 5 > $matchLength } { set currString2 [string range $data $index [expr {$index +4}]] set matchIndex2 [string first $currString2 $currData $matchIndex] if { $matchIndex2 >= 0 } { set matchOffset [expr {$index - $matchIndex2}] set matchLength 5 set found 1 } } } else { if { $len > $matchLength } { set matchOffset [expr {$index - $matchIndex}] set matchLength $len set found 1 } } } if { $found } { set lowerBound $currentLength } else { set upperBound $currentLength } } if { $matchLength > 0 } { set returnOffset $matchOffset set returnLength $matchLength return 1 } else { set returnOffset 0 set returnLength 0 return 0 } }
And, the tests...
# ############################################# # FILE: lz77.test - tests for lz77 package require tcltest namespace import tcltest::* source [file join [file dirname [info script]] lz77.tcl] # ======================================== test encode-1.1 { encode a simple string in using LZ77 } -body { set encoded [::lz77::encode {abcdebcdef}] set expect "abcde\x01\x04\x04" append expect f if { [string length $expect] != [string length $encoded] } { binary scan $expect c* exList binary scan $encoded c* enList puts "ENCODED: $exList -> $enList" return "String lengths were not the same: \ [string length $expect] != [string length $encoded]" } if { ![string equal $expect $encoded] } { binary scan $expect c* exList binary scan $encoded c* enList puts "EXPECTED: $exList\nENCODED : $enList" return "Strings were not equal" } } -result {} test encode-1.2 { encode a simple string in using LZ77 } -body { set encoded [::lz77::encode {Blah blah blah blah blah!}] set expect "Blah b\x01\x12\x05!" if { [string length $expect] != [string length $encoded] } { return "String lengths were not the same: \ [string length $expect] != [string length $encoded]" } if { ![string equal $expect $encoded] } { return "Strings were not equal" } } -result {} test encode-1.3 { encode a string with multiple matches } -body { set string {This is a string with multiple strings within it} set expect "This \x01\x03\x03" append expect "a string with multiple\x01\x07\x15" append expect "s\x01\x05\x16" append expect "in it" set encoded [::lz77::encode $string] if { [string length $expect] != [string length $encoded] } { binary scan $expect c* exList binary scan $encoded c* enList puts "EXPECT: $exList" puts "ENCODE: $enList" return "String lengths were not the same: \ [string length $expect] != [string length $encoded]" } if { ![string equal $expect $encoded] } { binary scan $expect c* exList binary scan $encoded c* enList puts "EXPECT: $exList" puts "ENCODE: $enList" return "Strings were not equal" } } -result {} test encode-1.4 { Encode a long (>255 <255*255) string, to use second escape } -setup { set original "abcdefghij" for {set i 0} {$i < 254} {incr i} { append original [expr {$i % 10}] } append original "abcdefg" set expect "abcdefghij0123456789" append expect "\x01\xF4\x0a" append expect "\x02\x00\x07\x01\x08" } -body { set encoded [::lz77::encode $original] if { ![string equal $expect $encoded] } { binary scan $expect c* exList binary scan $encoded c* enList puts "EXPECT: $exList" puts "ENCODE: $enList" return "Strings were not equal" } } -result {} # ======================================== test decode-1.1 { decode a simple string using LZ77 } -body { set decoded [::lz77::decode "Blah b\x01\x12\x05!"] set expect {Blah blah blah blah blah!} if { [string length $expect] != [string length $decoded] } { puts "$expect != $decoded" return "String lengths were not the same: \ [string length $expect] != [string length $decoded]" } if { ![string equal $expect $decoded] } { puts "$expect != $decoded" return "Strings were not equal" } } -result {} # ======================================== test cycle-1.1 { cycle a string through encode and decode } -body { set original "This is a string with multiple strings within it" set encoded [::lz77::encode $original] set changed [::lz77::decode $encoded] if { ![string equal $original $changed] } { puts "Not Equal:\nORIGINAL: $original\nENCODED : $changed" binary scan $encoded c* enList puts $enList return "The strings were not equal" } return } -result {} test cycle-1.2 { cycle a string through encode and decode } -body { set original "This is a string of text, \ whereherehereherehe parts of the string\ have text that is in other parts of the string" set encoded [::lz77::encode $original] set changed [::lz77::decode $encoded] if { ![string equal $original $changed] } { puts "Not Equal:\nORIGINAL: $original\nENCODED : $changed" binary scan $encoded c* enList puts $enList return "The strings were not equal" } return } -result {} test cycle-1.3 { Encode a long (>255 <255*255) string, to use second escape\ and decode it } -setup { set original "abcdefghij" for {set i 0} {$i < 254} {incr i} { append original [expr {$i % 10}] } append original "abcdefg" } -body { set encoded [::lz77::encode $original] set decoded [::lz77::decode $encoded] if { ![string equal $original $decoded] } { binary scan $original c* orList binary scan $decoded c* deList puts "ORIGINAL: $orList" puts "DECODED : $deList" return "Strings were not equal" } } -result {} test cycle-1.4 { For a very long string (>32k), remove everything 32k \ past the index } -constraints { slow } -setup { unset -nocomplain length offset set original "abcdef" for {set i 0} {$i < 1024} {incr i} { append original "0123456789012345678901234567890123456789" } append original "abcdef" } -body { set decoded [::lz77::decode [::lz77::encode $original]] string equal $decoded $original } -result {1} test cycle-1.5 { For a very long string (>32k), remove anything more than 32k\ before the index } -constraints { slow } -setup { unset -nocomplain length offset set original "abcdef" for {set i 0} {$i < 1024} {incr i} { append original "0123456789012345678901234567890123456789" append original "0123456789012345678901234567890123456789" append original "0123456789012345678901234567890123456789" } append original "abcdef" } -body { set decoded [::lz77::decode [::lz77::encode $original]] string equal $decoded $original } -result {1} test cycle-1.6 { } -setup { set data "abcdefghij012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123abcdefg" set index 274 } -body { set decoded [::lz77::decode [::lz77::encode $original]] string equal $decoded $original } -result {1} test cycle-1.7 { cycle a string through encode and decode } -setup { set f [open [file join [file dirname [info script]] lz77.tcl]] set original [read $f] close $f } -body { set encoded [::lz77::encode $original] set changed [::lz77::decode $encoded] if { ![string equal $original $changed] } { set oList [split $original \n] set cList [split $changed \n] puts "original has [llength $oList] lines...\ changed has [llength $cList] lines" set line 0 set chars 0 foreach oLine $oList cLine $cList { incr line if { ![string equal $oLine $cLine] } { return "Line $line was difference\ \n\tOriginal: '$oLine'\ \n\tChanged : '$cLine'\ \n\t($chars from preceding lines)" } incr chars [string length $oLine] } #puts "Not Equal:\nORIGINAL: $original\nENCODED : $changed" return "The strings were not equal" } return } -result {} test cycle-1.8 { cycle a string through encode and decode } -setup { set original { scan [string index $data [incr i]] %c offset set index [expr {[string length $output] - $offset}] for {set j 0} {$j < $length} {incr j} - append output [string index $output $index] incr index - - else - binary scan [string range $data [incr i] [incr i]] S length binary scan [string range $data [incr i] [incr i]] S offset set index [expr {[string length $output] - $offset}] for {set j 0} {$j < $length} {incr j} - append output [string index $output $index] incr index - - } } -body { set encoded [::lz77::encode $original] set changed [::lz77::decode $encoded $original] if { ![string equal $original $changed] } { set oList [split $original \n] set cList [split $changed \n] puts "original has [llength $oList] lines...\ changed has [llength $cList] lines" set line 0 foreach oLine $oList cLine $cList { incr line if { ![string equal $oLine $cLine] } { return "Line $line was difference\ \n\tOriginal: '$oLine'\ \n\tChanged : '$cLine'" } } #puts "Not Equal:\nORIGINAL: $original\nENCODED : $changed" return "The strings were not equal" } return } -result {} # ======================================== # Special cases # When there's an escape in the input text # Encode it as \x01\x01, since we can't have a repeat length of 1 test escape-1.1 { An escape in the input data is coded as the escape, followed\ by \x01 } -setup { unset -nocomplain original expect encoded exList enList } -body { set original "ab\x01" ; append original cd set expect "ab\x01\x01" ; append expect cd set encoded [::lz77::encode $original] if { ![string equal $expect $encoded] } { binary scan $expect c* exList binary scan $encoded c* enList puts "$exList ->\n$enList" return "The strings were not equal" } set decoded [::lz77::decode $encoded] if { ![string equal $decoded $original] } { return "Decoded != Original" } } -result {} test escape-1.2 { An escape in the input data is coded as the escape, followed\ by \x01 } -setup { unset -nocomplain original expect encoded exList enList } -body { set original "ab\x01" ; append original cd set encoded [::lz77::encode $original] set decoded [::lz77::decode $encoded] if { ![string equal $original $decoded] } { binary scan $original c* orList binary scan $decoded c* deList puts "$orList ->\n$deList" return "The strings were not equal" } } -result {} test escape-1.3 { An escaped escape should not interfere with runs surrounding it } -setup { unset -nocomplain original expect encoded exList enList } -body { set original "abcdebcde\x01" ; append original cd set expect "abcde\x01\x04\x04\x01\x01" ; append expect cd set encoded [::lz77::encode $original] if { ![string equal $expect $encoded] } { binary scan $expect c* exList binary scan $encoded c* enList puts "$exList ->\n$enList" return "The strings were not equal" } set decoded [::lz77::decode $encoded] if { ![string equal $decoded $original] } { return "Decoded != Original" } } -result {} test escape-1.4 { An escape in the input data is coded as the escape, followed\ by \x01 } -setup { unset -nocomplain original expect encoded exList enList } -body { set original "ab\x02" ; append original cd set expect "ab\x02\x00\x01" ; append expect cd set encoded [::lz77::encode $original] if { ![string equal $expect $encoded] } { binary scan $expect c* exList binary scan $encoded c* enList puts "$exList ->\n$enList" return "The strings were not equal" } set decoded [::lz77::decode $encoded] if { ![string equal $decoded $original] } { return "Decoded != Original" } } -result {} test escape-1.5 { An escape in the input data is coded as the escape, followed\ by \x01 } -setup { unset -nocomplain original expect encoded exList enList } -body { set original "ab\x02" ; append original cd set encoded [::lz77::encode $original] set decoded [::lz77::decode $encoded] if { ![string equal $original $decoded] } { binary scan $original c* orList binary scan $decoded c* deList puts "$orList ->\n$deList" return "The strings were not equal" } } -result {} test escape-1.6 { Both escapes } -setup { unset -nocomplain original expect encoded exList enList } -body { set original "ab\x02\x01\x01\x02cd" ; append original cd set encoded [::lz77::encode $original] set decoded [::lz77::decode $encoded] if { ![string equal $original $decoded] } { binary scan $original c* orList binary scan $decoded c* deList puts "$orList ->\n$deList" return "The strings were not equal" } } -result {} # When the input text is > 255 chars long, since that's # the max we can look back, or have in a run # Check that the string we're looking for is no longer than 255 # Check that we don't look back more than 255 # If there's multiple matches, get the longest one possible # "These blah is blah blah blah!" # ^ the match for here # ^ should start here # ^ not here test longest-1.1 { Get the longest match possible } -setup { unset -nocomplain original expect encoded exList enList set original {These blah is blah blah blah!} set expect "These blah is\x01\x06\x08" append expect "\x01\x09\x05!" } -body { set encoded [::lz77::encode $original] if { ![string equal $expect $encoded] } { binary scan $expect c* exList binary scan $encoded c* enList puts "Expect: $exList ->\nEncode: $enList" puts "DECODE: [::lz77::decode $encoded]" return "The strings were not equal" } set decoded [::lz77::decode $encoded] if { ![string equal $decoded $original] } { return "Decoded != Original" } } -result {} # ======================================== test maxSubstring-1.1 { Find the max substring for a string with only one match } -setup { unset -nocomplain length offset set string {abcdefcdefg} set index 6 } -body { set flag [::lz77::maxSubstring $index $string length offset] list $flag $length $offset } -result {1 4 4} test maxSubstring-1.2 { Make sure we can get matches of the min length } -setup { unset -nocomplain length offset set string {This is a} set index 5 } -body { set flag [::lz77::maxSubstring $index $string length offset] list $flag $length $offset } -result {1 3 3} test maxSubstring-1.3 { Return 0 if no match } -setup { unset -nocomplain length offset set string {abcdefghijk} set index 6 } -body { ::lz77::maxSubstring $index $string length offset } -result {0} test maxSubstring-1.4 { Find the max substring for a string with multiple matches } -setup { unset -nocomplain length offset set string {aaaabbbbaaaaaaaa} set index 9 } -body { set flag [::lz77::maxSubstring $index $string length offset] list $flag $length $offset } -result {1 7 1} test maxSubstring-1.5 { For a very long string (>32k), remove everything 32k \ past the index } -setup { unset -nocomplain length offset set original "abcdef" for {set i 0} {$i < 1024} {incr i} { append original "0123456789012345678901234567890123456789" } append original "abcdef" set index 17 } -body { set flag [::lz77::maxSubstring $index $original length offset] list $flag $length $offset } -result {1 32767 10} test maxSubstring-1.6 { For a very long string (>32k), remove anything more than 32k\ before the index } -constraints { slow } -setup { unset -nocomplain length offset set original "abcdef" for {set i 0} {$i < 1024} {incr i} { append original "0123456789012345678901234567890123456789" append original "0123456789012345678901234567890123456789" append original "0123456789012345678901234567890123456789" } append original "abcdef" set index 40000 } -body { set flag [::lz77::maxSubstring $index $original length offset] list $flag $length $offset } -result {1 32767 10} test maxSubstring-1.7 { } -setup { set data "abcdefghij012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123abcdefg" set index 274 } -body { set flag [::lz77::maxSubstring $index $data length offset] list $flag $length $offset } -result {1 7 274} test maxSubstring-1.8 { cycle a string through encode and decode } -setup { set original { scan [string index $data [incr i]] %c offset set index [expr {[string length $output] - $offset}] for {set j 0} {$j < $length} {incr j} - append output [string index $output $index] incr index - - else - binary scan [string range $data [incr i] [incr i]] S length binary scan [string range $data [incr i] [incr i]] S offset set index [expr {[string length $output] - $offset}] for {set j 0} {$j < $length} {incr j} - append output [string index $output $index] incr index - - } set index 466 } -body { set flag [::lz77::maxSubstring $index $original length offset] list $flag $length $offset } -result {1 258 411} # ======================================== cleanupTests return # ======================================== # Some timings puts SHORT:[time {::lz77::maxSubstring 9 aaaabbbbaaaaaaaa v1 v2} 10000] set original "abcdef" for {set i 0} {$i < 1024} {incr i} { append original "01234567890123456789012345678901" } append original "abcdef" puts LONG:[time {::lz77::maxSubstring 3 $original v1 v2} 1]
PS 10Sep04: This brings us terribly close to pure tcl deflate/gzip. Reading http://www.gzip.org/deflate.html suggests that the Huffman coding is not terribly hard to do.
DKF: I'd estimate that the easiest way to speed this up would be to make [maxSubstring] return a list of its three result values instead of using upvar. Splitting the result list with foreach or lassign is much faster...
MS The usage of [K] can also be sped up noticeably: use the idiom [K $x [set x {}]]] instead of [K $x [unset x]]]. Even faster is the bytecompiled variant [lindex [list $x [set x{}]]] 0]. Which reminds me ... soon (?) to be the fastest is the plain $x[set x {}] - as soon as I get around to implementing FRE 924097 (optimise {}-appends [L1 ]) This optimisation has been committed to HEAD on 2004-09-10 (8.5a2)
Peter Spjuth: Is it intentional that maxSubstring prefers the earliest match? I would think that minimizing offset by prefering the latest match would be better. That would make a more efficient solution possible using [string last]. Also, is K really useful with [string range]?
RHS To add some comments...
AK: Here is another reference going into more detail of how zlib works: http://www.gzip.org/zlib/feldspar.html Together with the spec for the output format we should be able to go far.