Lars H, 13 Sep 2004: The following is a Tcl implementation of LZ78 encoding of the data in a file:
proc LZ78_encode {F} { set old "" set Dict() 0 set res [list] while {![eof $F]} { set ch [read $F 1] if {[info exists Dict($old$ch)]} then { append old $ch } else { lappend res $Dict($old) $ch set Dict($old$ch) [array size Dict] set old "" } } if {[string length $old]} then {lappend res $Dict($old)} return $res }
The idea is to build a dictionary of character sequences and only output a "token number" for those phrases that are in this dictionary. The longest matching phrase found in the dictionary is chosen. The list element after a token is always a character. The dictionary (the Dict array) is extended with a new entry for every {old token + 1 character} sequence that hasn't been seen before -- that way there isn't any overhead for transmitting the dictionary.
Decoding is almost embarrasingly simple:
proc LZ78_decode {L} { set res "" set RevDict [list ""] foreach {token ch} $L { set new [lindex $RevDict $token]$ch append res $new lappend RevDict $new } return $res }
The dictionary produced can be viewed by instead using
proc LZ78_dictdump {L} { set RevDict [list ""] foreach {token ch} $L { lappend RevDict [lindex $RevDict $token]$ch } return $RevDict }
join [LZ78_dictdump] "" returns pretty much the entire text compressed (only the text from the last token is missing), so the dictionary is a bit extreme in that respect. Not all of it is used, though, and the chunks that the text are chopped up into look rather random.
As for compression rate of the above: the 92659 bytes file tclObj.c gets encoded as a list of 31863 elements. Achieving actual compression thus also requires finding a good binary format for encoding this list. But there are also ways of improving the LZ78 algorithm, which are used in LZW compression.
The main difference in LZW is that instead of transmitting a {token character token character ...} sequence, only a list of tokens are transmitted. The new entries for the dictionary is formed by appending the first character of the next token to the text of the current token. This requires however that every character has a token to start with, so the dictionary must be pre-filled with an alphabet. In the next example, that alphabet defaults to characters \x00 to \xFF inclusive.
set binary [list] for {set n 0} {$n < 256} {incr n} {lappend binary [format %c $n]} proc LZW_encode [list text [list alphabet $binary]] { foreach ch $alphabet {set Dict($ch) [array size Dict]} set old "" set res [list] foreach ch [split $text ""] { set new $old$ch if {[info exists Dict($new)]} then { set old $new } elseif {[string length $old]} then { lappend res $Dict($old) set Dict($new) [array size Dict] set old $ch } } lappend res $Dict($old) }
There is a catch with decoding in LZW: The "next token" may turn out to be the very token that is being defined! The encoder doesn't notice when it does this, because it adds the new token to its dictionary before it has seen more than the first character of the next token. This can however be detected, and the first character one needs is then simply the same as the first character of the current token.
proc LZW_decode [list L [list alphabet $binary]] { foreach ch $alphabet {lappend RevDict $ch} set old [lindex $RevDict [lindex $L 0]] set res $old foreach token [lrange $L 1 end] { if {$token < [llength $RevDict]} then { set new [lindex $RevDict [lindex $token 0]] append old [string index $new 0] lappend RevDict $old set old $new } else { append old [string index $old 0] lappend RevDict $old } append res $old } return $res }
In this case, tclObj.c gets encoded as only 18905 tokens. As another small test, we might consider encoding encoding names.
% string length [encoding names] 658 % set L [LZW_encode [encoding names]] 99 112 56 54 48 32 256 258 49 261 257 54 50 265 258 51 32 116 105 115 45 267 260 262 54 52 269 54 53 282 54 32 103 98 49 50 51 52 284 288 291 290 45 114 97 119 265 57 52 57 302 53 278 266 305 100 105 110 288 97 116 115 32 107 115 99 53 259 264 109 97 99 67 101 110 116 69 117 114 111 269 55 281 325 99 85 107 299 312 101 32 106 274 48 50 48 264 295 51 290 32 101 117 99 45 99 110 356 358 45 106 112 32 339 84 104 97 105 32 274 111 56 56 53 57 45 49 260 347 115 349 48 56 374 115 111 350 50 50 365 367 339 73 99 101 108 97 110 100 389 391 349 268 375 377 379 381 271 384 349 355 409 378 380 49 281 417 411 49 284 256 55 51 55 405 410 419 286 98 105 103 284 357 359 342 368 326 82 111 325 110 105 97 441 99 84 333 107 274 104 287 289 57 377 405 392 394 440 339 71 114 101 101 107 32 97 320 105 373 256 52 428 449 443 445 430 418 381 481 411 408 390 431 45 271 339 67 334 315 447 362 107 111 105 56 298 484 380 281 101 98 99 311 99 502 45 425 112 290 307 449 67 121 114 105 108 108 105 509 422 380 286 256 514 324 326 68 312 314 316 318 498 500 117 510 429 528 50 53 486 376 482 388 542 53 271 525 45 305 549 281 549 512 378 308 514 527 112 57 51 408 100 329 273 116 121 265 514 541 257 544 449 74 97 112 402 572 543 388 115 104 105 102 116 384 32 117 116 102 45 548 575 512 564 286 115 121 109 98 111 108 265 55 55 284 117 446 99 111 567 269 53 55 % llength $L 363 % LZW_decode $L cp860 cp861 cp862 cp863 tis-620 cp864 cp865 cp866 gb12345 gb2312-raw cp949 cp950 cp869 dingbats ksc5601 macCentEuro cp874 macUkraine jis0201 gb2312 euc-cn euc-jp macThai iso8859-10 jis0208 iso2022-jp macIceland iso2022 iso8859-13 jis0212 iso8859-14 iso8859-15 cp737 iso8859-16 big5 euc-kr macRomania macTurkish gb1988 iso2022-kr macGreek ascii cp437 macRoman iso8859-1 iso8859-2 iso8859-3 macCroatian koi8-r iso8859-4 ebcdic iso8859-5 cp1250 macCyrillic iso8859-6 cp1251 macDingbats koi8-u iso8859-7 cp1252 iso8859-8 cp1253 iso8859-9 cp1254 cp1255 cp850 cp1256 cp932 identity cp1257 cp852 macJapan cp1258 shiftjis utf-8 cp855 cp936 symbol cp775 unicode cp857