Version 0 of LZ78 Compression

Lars H: I suspect whoever added [LZ78 Compression] to New Pages really meant to link to LZ77 Compression, but since it now exists it might as well get some content. The following is a Tcl implementation of LZ78 encoding of the data in a file:

 proc LZ78_encode {F} {
     set old ""
     set Dict() 0
     set res [list]
     while {![eof $F]} {
         set ch [read $F 1]
         if {[info exists Dict($old$ch)]} then {
             append old $ch
         } else {
             lappend res $Dict($old) $ch
             set Dict($old$ch) [array size Dict]
             set old ""
         }
     }
     if {[llength $old]} then {lappend res $Dict($old)}
     return $res
 }

The idea is to build a dictionary of character sequences and only output a "token number" for those phrases that are in this dictionary. The longest matching phrase found in the dictionary is chosen. The list element after a token is always a character. The dictionary (the Dict array) is extended with a new entry for every old token + 1 character sequence that hasn't been seen before.

As for compression rate of the above: the 92659 bytes file tclObj.c gets encoded as a list of 31863 elements. Achieving actual compression thus also requires finding a good binary format for encoding this list. But there are also ways of improving the LZ78 algorithm, which are used in LZW compression.

Category Compression