Lars H: I suspect whoever added [LZ78 Compression] to New Pages really meant to link to LZ77 Compression, but since it now exists it might as well get some content. The following is a Tcl implementation of LZ78 encoding of the data in a file:
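
    proc LZ78_encode {F} {
        set old ""          ;# phrase matched so far
        set Dict() 0        ;# the empty phrase is token 0
        set res [list]
        while {![eof $F]} {
            set ch [read $F 1]
            if {[info exists Dict($old$ch)]} then {
                # Still a known phrase: keep extending it.
                append old $ch
            } else {
                # Emit token of the longest match plus the new character,
                # then register the extended phrase under the next number.
                lappend res $Dict($old) $ch
                set Dict($old$ch) [array size Dict]
                set old ""
            }
        }
        # Flush a pending match. $old is a string, not a list, so test it
        # with string length (llength can fail on unbalanced braces).
        if {[string length $old]} then {lappend res $Dict($old)}
        return $res
    }
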
The idea is to build a dictionary of character sequences (phrases) and to output only a "token number" for each phrase that is already in this dictionary. At every step the longest matching phrase found in the dictionary is chosen. The list element after a token is always the character that followed that phrase in the input; only the very last token may lack one, if the input ends in the middle of a known phrase. The dictionary (the Dict array) is extended with a new entry for every "known phrase + 1 character" sequence that hasn't been seen before.
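To make the token/character structure concrete, here is a minimal sketch of the inverse operation. The name LZ78_decode is only an illustration, not part of the code above. It assumes the list was produced by LZ78_encode, so phrase numbers are handed out consecutively starting from 1, with 0 meaning the empty phrase.

```tcl
# Hypothetical inverse of LZ78_encode: rebuild the text from the token list.
proc LZ78_decode {tokens} {
    set phrases [list ""]   ;# index 0 is the empty phrase, as in Dict()
    set out ""
    foreach {tok ch} $tokens {
        # A trailing lone token leaves ch empty, which is harmless here.
        set phrase [lindex $phrases $tok]$ch
        append out $phrase
        # The encoder numbers new phrases consecutively, so the list
        # index of each stored phrase equals its token number.
        lappend phrases $phrase
    }
    return $out
}
```

For instance, the input abababa is encoded as the list {0 a 0 b 1 b 3 a} (phrases a, b, ab, aba), and LZ78_decode {0 a 0 b 1 b 3 a} gives back abababa.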
As for the compression rate of the above: the 92659-byte file tclObj.c gets encoded as a list of 31863 elements. Achieving actual compression thus also requires a compact binary format for encoding this list. There are also ways of improving the LZ78 algorithm itself, which are used in LZW compression.
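One simple (and certainly not optimal) binary format, as a sketch: store each token as a 16-bit big-endian integer followed by the character as one byte. The name LZ78_pack is made up for this illustration, and the sketch assumes that every character is a single byte (e.g. the channel was configured with -translation binary) and that no token exceeds 65535.

```tcl
# Hypothetical helper: pack the list from LZ78_encode into a byte string,
# using 2 bytes per token and 1 byte per character.
proc LZ78_pack {tokens} {
    set bin ""
    foreach {tok ch} $tokens {
        if {$tok > 65535} {error "token $tok does not fit in 16 bits"}
        append bin [binary format S $tok]
        # The final token may come without a character; write nothing then.
        if {$ch ne ""} {append bin [binary format a $ch]}
    }
    return $bin
}
```

Under those assumptions the 31863-element list above (15931 token/character pairs plus one trailing token) would take 15931*3 + 2 = 47795 bytes. A format that uses only as many bits per token as the current dictionary size requires would do better.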