Hard Archiving Strategy using Tcl as master control

In April of 2011 I read that the Getty valued Mapplethorpe's work archive (slides, etc.) at 30 million, 20 years after his death. Maybe a year earlier I read that somebody claimed to have a sheet of Ansel Adams's slides (a claim experts have since judged untrue) - however, IF the slides were authentic they would have had an estimated value of 200 million, about 30 years after his death. I realized that modern photographers who work in digital format have a problem - CDs, DVDs and hard drives might last ten years. Also, backups created by software applications may not be in readable formats in the future. For example, I read an article about the US Gov't "losing" years of engineering/research information due to 'rapid digital aging'.

My work over the past several months has been to develop a strategy for 'digital work preservation', or 'hard archiving'. In this project I am not concerned with "prints", or finished product, but with the "work" - the original files used/created by the artist. My initial concept is based on the method the 1980s magazine A.N.A.L.O.G. (Atari computer mag, not sci fi) used to distribute binary software on paper. Example: issue 21 (1984), "Avalanche" listing.

My experiments show it is possible to use base64 encoding to store the image/work files, which can be stored in a "durable" format. Paper might last 50 years, microfiche might last 500. It is technically possible for a human to re-enter base64 encoded data, much like the software in the magazine.
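As a rough illustration of the encoding step, here is a minimal sketch in pure Tcl. It assumes Tcl 8.6 or later (for binary encode / binary decode). The proc names encodeLines and decodeLines and the 16-character line width are made up for this example; this is not the modified b64 program used in the steps below.

```tcl
# Sketch: turn binary data into short base64 text lines and back again.
# Assumes Tcl 8.6+; the proc names are hypothetical.
proc encodeLines {bytes {width 64}} {
    # -maxlen wraps the base64 text; split it into a list of lines
    return [split [binary encode base64 -maxlen $width $bytes] \n]
}

proc decodeLines {lines} {
    # Joining the lines again removes the wrapping before decoding
    return [binary decode base64 [join $lines ""]]
}

# The first 16 bytes of a PNG file serve as sample binary data
set sample [binary format H* 89504e470d0a1a0a0000000d49484452]
set lines [encodeLines $sample 16]
puts [join $lines \n]
```

Short lines are easier to re-type or to OCR one at a time, at the cost of more lines per file.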

With 'untrained' open source OCR software we achieve approximately 70% accuracy in scanning this base64 encoded data. With 'training' the accuracy improves. A CRC24 checksum calculated for each line allows the programmer to validate input and, when a line fails, to 'brute force' guess the correct sequence. (This is the 'out.txt' file, below; see the example pdf.) However, encoding the data as Quick Response (QR) Codes may increase accuracy and reduce storage requirements.
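The per-line checksum idea can be sketched as follows. Assumptions: the CRC-24 variant is the one defined for OpenPGP (initial value 0xB704CE, polynomial 0x1864CFB), and the line layout "number payload checksum" is invented for this example. The modified b64 program may use a different variant and layout.

```tcl
# Sketch: CRC-24 (OpenPGP variant, an assumption) over one line of text.
proc crc24 {data} {
    set crc 0xB704CE
    binary scan $data cu* octets
    foreach b $octets {
        set crc [expr {$crc ^ ($b << 16)}]
        for {set i 0} {$i < 8} {incr i} {
            set crc [expr {$crc << 1}]
            if {$crc & 0x1000000} {
                set crc [expr {$crc ^ 0x1864CFB}]
            }
        }
    }
    return [expr {$crc & 0xFFFFFF}]
}

# Hypothetical line layout: 5-digit line number, payload, 6 hex digits of CRC
proc tagLine {n payload} {
    return [format "%05d %s %06X" $n $payload [crc24 $payload]]
}

# Returns 1 when the checksum at the end of the line matches the payload
proc checkLine {line} {
    lassign [split $line " "] n payload sum
    return [expr {[scan $sum %x] == [crc24 $payload]}]
}

set line [tagLine 1 "QUJDREVGR0g="]
puts $line
puts [checkLine $line]
```

A line that fails checkLine after OCR can be retried with likely character substitutions (for example 0/O or 1/l) until the checksum matches, which is the 'brute force' step mentioned above.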

It is perhaps a task similar to the "Voyager Golden Record " project, but we must communicate with humans in the future (who have unknown capabilities). We are concerned with defining the information in such a way that someone with intelligence and some sort of computer technology can make use of the data.

This is a work in progress, and of course there are numerous issues to consider. I believe most people think it's a silly waste of time, but whatever, I'm still fascinated by the project. :-)

Arguments I've heard against:

  • the archive may be valuable to some collector in the future, but the expense to the artist will not be compensated - the artist will (probably) never see any return from the effort
  • why bother, just give it all up to the established data collectors (as we are doing, anyway). (dump your terabytes on the cloud)
  • it would be "easier and cheaper" just to make archival prints of all files.

(in progress)

Step 1. Scan local path and select files for archival. Create cover sheet.

hard_archive.tcl

set master "\n
Hard Archive \"[pwd]\"
Path Time: [file mtime [pwd]]
Time this script was executed: [clock seconds]
[clock format [clock seconds]]
Epoch time ([clock format 0])

Legend: 
\"Filename\" \[ModifyTime\] \[FileSize\] \"File Type\"
Please refer to documentation for information regarding
file types, times and sizes

"

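# -nocomplain returns an empty list instead of raising an error in an empty directory
set files [glob -nocomplain *]

foreach x $files {
        # Only regular files are listed and encoded; directories and links are skipped
        if { [file type $x] eq "file" } {
                append master "\"$x\" \[[file mtime $x]\] \[[file size $x]\] \"[exec file -b $x]\"\n"

                # NOTE: out.txt is overwritten on every pass through the loop
                exec ./a.exe -e $x out.txt

        }
}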

puts $master

Unless I am missing something, your foreach x $files loop appears to overwrite out.txt with each new file, such that at the end of the loop, the only data in out.txt is the data from the final filename in $files.

waitman (2011-07-27) true. I need to make step 2 a procedure that runs after out.txt is created, and I don't need out.txt anymore after it's run.


Note: a.exe is my hacked version of b64 (C) Copr Bob Trower 1986-01. (I added line numbering and CRC24 checksum to each line) source code here: http://www.nefyou.com/b64-modified.c.txt

Step 2. Generate Quick Response Code images for out.txt

testout.tcl waitman (2011-07-28) update added montage/pdf. NOTE: still have some issues to resolve.

# Start from empty work/ and book/ directories
file delete -force work
file mkdir work
file delete -force book
file mkdir book
set fp [open "out.txt" r]
set mc 10       ;# numeric name of the next QR code png
set lc 1        ;# lines collected for the current QR code
set pageno 1
set pageco 1    ;# QR codes placed on the current page
set work {}     ;# png files for the current page, in order
set process ""

# Read to end of file; comparing the data against "" would stop at the first blank line
while {[gets $fp data] >= 0} {
        # "\\n" appends a literal backslash and n as the separator, not a newline
        append process "$data\\n"
        incr lc
        if { $lc>5 } {

                exec ./qrencode-3.1.1/qrencode.exe -o work/$mc.png -s 3 -8 $process
                lappend work "./work/$mc.png"
                incr mc 10
                set process ""
                set lc 1

                incr pageco
                if { $pageco>130 } {
                        # {*} expands the list so montage receives one argument per file
                        exec montage -background white -density 300x300 -geometry +4+4 -tile 10x13 {*}$work ./work/master-out.png
                        exec convert ./work/master-out.png -background white -gravity center -extent 2550x3300 ./work/master-out.png
                        # exec does not use a shell, so the page number must not be wrapped in single quotes
                        exec convert ./work/master-out.png -gravity North -pointsize 10 -annotate +0+40 $pageno -density 300 -depth 8 ./work/master-out.png
                        exec convert ./work/master-out.png [format "book/%08d.pdf" $pageno]
                        set work {}
                        set pageco 1
                        incr pageno
                }
        }
}
close $fp
# Encode any leftover lines; these are not yet placed on a page
if { $process ne "" } {
        exec ./qrencode-3.1.1/qrencode.exe -o work/$mc.png -s 3 -8 $process
}

I am curious as to why you did not use file delete for "exec rm -Rf work" (file delete -force work), and why you did not use file mkdir for "exec mkdir work", thereby avoiding two exec calls. Also, if you want your "$mc.png" files to be processed in order, you might consider [format "work/%08d.png" $mc] within your exec call to produce zero-filled names, which will sort in numeric order. At the moment, as soon as you hit 100.png your sort order of files will be wrong.

waitman (2011-07-27) Thank you, very good points. The PNG files are temporary and will be removed. I am still working out the montage bit, but at the moment I don't believe the names need padding.

RLE (2011-07-28) If you are not storing a sequence number with each chunk of data, and if your "convert to pdf" is performed anything like so: (png2pdf -montage *.png) you certainly do need to pad the names. Take a subset of the files named 1 to 100. Without padding, the sort order will be:

 1.png
 10.png
 100.png
 11.png
 2.png
 20.png

With padding, the sort order will be correct:

 001.png
 002.png
 010.png
 011.png
 020.png
 100.png

In the unpadded case, unless your montage program sequentially counts from one, or sorts incoming filenames by numerical value, your montage will contain chunks in this order: 1 10 100 11 2 20. That will produce an additional puzzle for a future archivist attempting to reassemble things, which is probably not what you want in the end.

In the padded case, whether the sort is by character string (which is usually the default for file names), or by numerical value (ignoring the issue of leading zero sometimes means octal), the order will always be: 1 2 10 11 20 100 (correct sequence).
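The two orderings above can be reproduced with a short Tcl sketch. The variable names are made up for the example; lsort with no options compares names as character strings, and the -dictionary option compares embedded numbers by value.

```tcl
# Sketch: compare how unpadded and zero-padded file names sort.
set unpadded {}
set padded {}
foreach n {1 2 10 11 20 100} {
    lappend unpadded $n.png
    lappend padded [format "%03d.png" $n]
}

# Plain string sort: 1.png 10.png 100.png 11.png 2.png 20.png
puts [lsort $unpadded]

# Padded names sort correctly even as plain strings:
# 001.png 002.png 010.png 011.png 020.png 100.png
puts [lsort $padded]

# Unpadded names only sort correctly if the tool compares numbers by value
puts [lsort -dictionary $unpadded]
```

Padding makes the result independent of how a given tool (shell glob, montage, a future archivist's software) happens to sort names.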

waitman (2011-07-28) Thank you. Of course this is true of sorting; however, I specify the filenames explicitly in the montage command, using lappend for each filename. (At the moment I am putting 130 PNGs per page.) Therefore the filename padding is not necessary in that specific case. Padding will be used in assembling the final PDF, though.


Note: qrencode is http://fukuchi.org/works/qrencode/index.en.html

Step 3. Put the PNGs together in a PDF

Working... I need to merge steps 1 and 2. Compressing the data with 7z is possible; it is a well-documented compression format, but it adds another level of complexity.

RLE (2011-07-27) Tcl 8.6 includes built in zlib functionality, so you could "gzip" the data completely from within Tcl, reducing the complexity when compressing the data.
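A minimal sketch of the zlib suggestion, assuming Tcl 8.6 or later. The proc names packData and unpackData and the sample data are invented for the example.

```tcl
# Sketch: gzip-compress data inside Tcl (8.6+), with no external program.
proc packData {bytes} {
    # -level 9 asks for the strongest (slowest) compression
    return [zlib gzip $bytes -level 9]
}

proc unpackData {packed} {
    return [zlib gunzip $packed]
}

# Highly repetitive sample data compresses well; real image data will vary
set raw [string repeat "sample image data " 200]
set packed [packData $raw]
puts "original: [string length $raw] bytes, gzip: [string length $packed] bytes"
```

The output is in gzip format, so any description of the archive would also need to cover that format (or point to its published specification).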

waitman (2011-07-27) Thanks, I'll check it out. Compression makes it more difficult to describe the data. I imagine that gzip is well documented :) In this project I can only assume that the person receiving the information has some form of computer, and is intelligent enough to put the puzzle together.

But here's the problem: JPEG is a "worthless" format (8-bit color stores approx. 1% of the possible information, there are numerous versions and subformats of JPEG, and besides, JPEG is on its way to obsolescence), so a serious photographer is going to shoot RAW. RAW is proprietary to the manufacturer, impossible to define, and changes with each new camera model. Therefore RAW images must be converted to TIFF at 16 bits per channel. TIFF is a static/stable format and well documented. The problem is large files. Compressing a TIFF file will get you down to around 2% - 10% of the original size, but we must be able to adequately define the reverse algorithm.

RLE (2011-07-28) Now that you have better explained what you are going after, I have a few more questions. You may have pondered these already, but if not, you may want to think about them some as well:

You are storing images (binary data). You are converting the binary images into base64 (with some useful extras). Base64 data can be directly entered into a computer by a human using only a keyboard. For the moment I'll ignore the need to type upwards of 16MB of base64 encoded data for one image, which while arduous is indeed possible.

But then, you are converting the base64 data into QR 2D barcodes. QR codes themselves are not directly human readable (i.e., I can't "type" a QR code at my keyboard), and manually/visually decoding the data stored in a QR code, while possibly feasible, is definitely going to be a very difficult task, far more difficult than simply typing in upwards of 16MB of base64 data on a keyboard.

Which brings me to my first question. By encoding to QR codes, you presuppose a future archivist has a computer and a QR code decoding program. In which case, why perform the base64 encoding (human readable, but producing a 4:3 expansion of your data) only to wrap the base64 in something a human will likely not be able to decode without a QR reading program? Considering that QR codes are supposed to directly store binary data [1], why not simply chunk your binary data, maybe wrap each chunk in a checksum/sequence wrapper, and then directly encode that into the QR codes?
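One way the checksum/sequence wrapper might look is sketched below. Everything about the record layout is an assumption made for illustration: a 4-byte sequence number, a 2-byte length and a 4-byte CRC-32 (all big-endian) in front of each chunk, with a 512-byte chunk size. It assumes Tcl 8.6 or later for zlib crc32.

```tcl
# Sketch: split binary data into chunks, each with a sequence/length/CRC header.
# The header layout is hypothetical: 4-byte sequence, 2-byte length, 4-byte CRC-32.
proc wrapChunks {bytes {size 512}} {
    set out {}
    set seq 0
    for {set i 0} {$i < [string length $bytes]} {incr i $size} {
        set chunk [string range $bytes $i [expr {$i + $size - 1}]]
        lappend out [binary format ISIa* $seq [string length $chunk] \
            [zlib crc32 $chunk] $chunk]
        incr seq
    }
    return $out
}

# Validates one record and returns a list: sequence number, chunk data
proc unwrapChunk {rec} {
    binary scan $rec IuSuIua* seq len crc chunk
    if {$len != [string length $chunk] || $crc != [zlib crc32 $chunk]} {
        error "chunk $seq failed validation"
    }
    return [list $seq $chunk]
}

set sample [string repeat "ABCDEFGHIJ" 130]
set records [wrapChunks $sample]
puts "[llength $records] records"
```

Because each record carries its own sequence number, the QR codes could be scanned in any order and sorted afterwards, and a damaged code is detected by its checksum rather than silently corrupting the image.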

By using QR codes, you increase the ease of decoding the data (no typing of 16MB of base64 data), but you presuppose the ability to decode QR codes. It is easier to believe that maybe, 500 years into the future (your microfiche example), that archivists will be able to understand our English character set of today than to believe that 500 years into the future they will understand what is a QR code. Have you considered that while QR codes increase the ease of accessing the printed material, they may be detrimental to actually decoding the printed material at some point in the future?

Of course, this question may be rendered moot if that future archivist has no knowledge of the TIFF image format or of how to decode that format. I.e., even if they were to type in 16MB+ of base64 data and successfully decode it (again presupposing they still understand what base64 encoding is), if the result is a bunch of bytes of a TIFF image, and they no longer know anything of the TIFF format, then the result is the same as if they knew nothing of the QR barcode format.

I think here I am starting to delve into the unsolved issue with digital archiving: understanding today's data formats in the land of tomorrow.

waitman (2011-07-28) It is true that archiving is a difficult challenge. I suppose I am counting on future humans to understand English. My strategy is to communicate the data format. For example, Adobe has published the TIFF specification, which has remained unchanged since the 1990s. The strategy is to define the data so that some intelligent person could understand how it is organized, and read it. Base64 encode/decode is also fairly simple to explain. Compression possibly is too; I need to research it. QR Codes are an experiment - of course they make things a bit more complicated. But maybe it will work.

Sample Image