Version 4 of vfs::zip

Updated 2004-06-22 21:20:12

NJG Jun 22, 2004

Recently I fetched close to a hundred zip archives from the Internet, each containing several tens of small ASCII files. (All of them describe sokoban levels, if you want to know. What sokoban is you will have to find out for yourself.) Having eyed tclvfs for some time, I wrote a small Tcl program intended to do the job with the vfs::zip package.

I immediately got a laconic "bad header" message. Now, I don't give up so easily, especially because all the other unzippers I have could handle the archive. I took a zip format specification, opened the archive in a hex editor and set about deciphering its content alongside the operation of zipvfs.tcl (in Tcl/lib/vfs1.3/). Soon enough I located the cause of the problem: the code assumes that the "end of archive" record starts exactly 22 bytes before the end of the file, which is not necessarily true even according to the format spec. My files all contained a trailing 0x0A (end of line), which is how the original zipper represented the zero-length string specified in the "end of archive" record. I think the correction is obvious (locate the record by searching for its signature), but I don't think I am the proper one to make it. I did, however, write a simple script that removed these trailing linefeeds.
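The signature search suggested above could look roughly like the following. This is only a sketch in Python, not the code of zipvfs.tcl; the function names are made up for illustration, and the sample archive is built in memory with a trailing linefeed to imitate the files described here. It assumes the record layout from the zip format spec: a 22-byte fixed part starting with the signature "PK\x05\x06", followed by a comment of at most 65535 bytes.

```python
import io
import struct
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"   # 0x06054b50 as stored on disk (little-endian)
EOCD_MIN_SIZE = 22               # fixed part of the record, without the comment
MAX_COMMENT = 0xFFFF             # the comment length is a 16-bit field


def find_eocd(data: bytes) -> int:
    """Return the offset of the "end of archive" record (hypothetical helper)."""
    if len(data) < EOCD_MIN_SIZE:
        raise ValueError("bad header: file too short for a zip archive")
    # The record cannot start earlier than 22 + 65535 bytes before the end,
    # and its signature cannot start later than 22 bytes before the end.
    lowest = max(0, len(data) - EOCD_MIN_SIZE - MAX_COMMENT)
    highest = len(data) - EOCD_MIN_SIZE + len(EOCD_SIGNATURE)
    pos = data.rfind(EOCD_SIGNATURE, lowest, highest)
    if pos < 0:
        raise ValueError("bad header: no end-of-archive record found")
    return pos


def read_eocd(data: bytes) -> dict:
    """Decode the fixed part of the record found by find_eocd."""
    pos = find_eocd(data)
    fields = struct.unpack("<4sHHHHIIH", data[pos:pos + EOCD_MIN_SIZE])
    return {
        "offset": pos,
        "total_entries": fields[4],
        "central_directory_size": fields[5],
        "central_directory_offset": fields[6],
        "comment_length": fields[7],
    }


def make_sample_archive() -> bytes:
    """Build a one-file archive in memory (stand-in for a downloaded file)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("level1.txt", "#####\n#@$.#\n#####\n")
    return buf.getvalue()


clean = make_sample_archive()
padded = clean + b"\n"           # imitate the trailing 0x0A
record = read_eocd(padded)
```

With the fixed-offset assumption the record of `padded` would be looked for one byte too late; the backward search finds it regardless of the extra linefeed.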

My small Tcl program then happily ran to completion and I was rather pleased with myself ... until I found the first file that contained what looked like binary garbage. (In fact I found 22 of them.) I copy it here in hex form:

 2dcc310a80301044d13e903b7c582bad6cedc4ca2244d41b684041118201bdbdc638d563d81d9018ad400a4814c9858c8fe4bc792991423c10be3689f4ffcf8cebb9b90ad336bd6dec60ec40596a558773397c457defceaf1373a073db112ead1e

(Let me not forget to mention that the compression method is deflate (#8).)

Back to debugging. The first thing I found was that the content was not garbage at all; it was the compressed form of the file, copied unchanged from the archive. My second finding was that by default Trf (and eventually some version of zlib) was called to do the decompression.
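For readers who want to experiment, here is a small Python sketch of the distinction that may matter here. It assumes (the text above does not say so) that a zip member with method 8 holds a raw deflate stream, without the header and trailer that zlib normally adds; Python's zlib module selects that raw format with a negative `wbits` value. The helper names and the sample data are illustrative only.

```python
import zlib


def deflate_raw(data: bytes) -> bytes:
    """Compress to a raw deflate stream, as stored in a zip member."""
    # A negative wbits value means: no zlib header, no Adler-32 trailer.
    comp = zlib.compressobj(9, zlib.DEFLATED, -15)
    return comp.compress(data) + comp.flush()


def inflate_raw(data: bytes) -> bytes:
    """Decompress a raw deflate stream."""
    return zlib.decompress(data, -15)


def inflate_zlib(data: bytes) -> bytes:
    """Decompress a zlib-wrapped stream (the zlib default)."""
    return zlib.decompress(data)


sample = b"#####\n#@$.#\n#####\n" * 4
raw = deflate_raw(sample)

restored = inflate_raw(raw)          # succeeds
try:
    inflate_zlib(raw)                # the raw stream has no zlib header
    zlib_accepts_raw = True
except zlib.error:
    zlib_accepts_raw = False
```

The same `inflate_raw` call could be tried on the hex dump above via `bytes.fromhex(...)`; whether it decodes cleanly is something to check rather than assume.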

Then I ran the program again, modified to collect all these cases. My third observation was that the first octet was always 0x2d, and that the second octets were very similar and seemed to follow some pattern. This is not by chance: according to the PKzip spec these two octets carry flags describing the representation of the content that follows.
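One possible reading of those two octets, offered as an assumption rather than a diagnosis: if the data is a raw deflate stream as described in RFC 1951, the stream starts with a block header whose bits are packed from the least significant bit upward. The Python sketch below (function name invented for illustration) decodes the first two octets of the hex dump above on that assumption.

```python
BLOCK_TYPES = ("stored", "fixed Huffman", "dynamic Huffman", "reserved")


def describe_deflate_start(first: int, second: int) -> dict:
    """Decode the leading bits of a raw deflate stream (RFC 1951 layout)."""
    bfinal = first & 1                  # bit 0: last block of the stream?
    btype = (first >> 1) & 3            # bits 1-2: block type
    info = {"final_block": bool(bfinal), "block_type": BLOCK_TYPES[btype]}
    if btype == 2:
        # Dynamic Huffman blocks continue with the code counts.
        info["literal_length_codes"] = ((first >> 3) & 0x1F) + 257  # HLIT
        info["distance_codes"] = (second & 0x1F) + 1                # HDIST
    return info


header = describe_deflate_start(0x2D, 0xCC)
# header == {"final_block": True, "block_type": "dynamic Huffman",
#            "literal_length_codes": 262, "distance_codes": 13}
```

Read this way, a constant first octet of 0x2d would simply mean that each small file was compressed as a single, final, dynamic Huffman block with the same number of literal/length codes, and the second octet would vary only slightly with the number of distance codes.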

That’s it. I am finished. Someone more knowledgeable in this area could do something about it. That would be really nice.