Version 9 of indexing a flat file, and a readBytes proc

Updated 2005-04-24 19:03:56

<jkock, 2005-04-20>

This is in fact a suggestion for new core functionality: perhaps a new flag to [read] that allows reading a specified number of bytes instead of a specified number of chars. I suggest this because I don't know how to implement it in pure Tcl. The solution below uses [pipe] from TclX, and it is not a good solution because of what appears to be a bug in [pipe].

Of course the best would be to be proven wrong -- some expert showing how it should be done...

AF: If you put the channel in binary encoding, then bytes == chars.

The problem

Suppose you have a huge flat text file whose content actually comes in blocks. For example, this could be a Berkeley mbox, where a "From " pattern delimits the individual mails in the file. Now you want to create an index to this file, telling where each block starts and how big it is, and then later be able to read in only a specified block. At first sight this is an easy problem: first you create the index, and then you do seek-and-read.

However, if the file is multi-byte encoded you run into more serious problems, because [seek] measures in bytes whereas [read] measures in chars. (You may also run into trouble in the seemingly trivial situation where the file has DOS line endings, since a two-byte CR LF pair is read as a single newline char.)
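
To make the mismatch concrete, here is a small sketch using only core Tcl. The sample string is made up for illustration. It compares the char count that [read] works with to the byte count that [seek] and [tell] work with:

    # Five chars, but the e-acute occupies two bytes in utf-8.
    set text "caf\u00e9\n"
    set bytes [encoding convertto utf-8 $text]
    puts "chars: [string length $text]"     ;# the unit [read] counts in
    puts "bytes: [string length $bytes]"    ;# the unit [seek] and [tell] count in

For this string the script reports 5 chars and 6 bytes, so a char index taken from the decoded text already points one byte too early for anything that follows the accented letter.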

For example, to create the index, just read in the entire file (using fconfigure appropriately) and find what you want with some [regexp -indices]. But these indices won't help you to seek, because [seek] needs a pure byte measurement... Bad luck. (Of course your difficulty in producing a byte measurement is this: you would have to read the entire text chunk up to each index to count how many bytes each char occupies. The expense of doing this is of course also the reason why [seek] cannot measure in chars...)
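
For comparison, if the file is read as raw bytes instead, every char in the resulting string is exactly one byte, so the indices from [regexp -indices] are byte offsets that [seek] can use directly. The following is only a sketch under some assumptions: the file fits in memory, and the encoding is ASCII-compatible (utf-8 or an iso8859 variant, say), so that the delimiter bytes cannot occur inside a multi-byte char. The proc name indexBlocks and the default pattern are hypothetical.

    # Return a list of {byteOffset byteLength} pairs, one per block.
    proc indexBlocks {fileName {pattern {^From }}} {
        set fd [open $fileName r]
        # Raw bytes: no decoding and no end-of-line translation.
        fconfigure $fd -translation binary
        set data [read $fd]
        close $fd
        # Collect the byte offset at which each block starts.
        set starts {}
        foreach pair [regexp -all -inline -indices -line -- $pattern $data] {
            lappend starts [lindex $pair 0]
        }
        # A block runs up to the start of the next one, or to end of file.
        set index {}
        set n [llength $starts]
        for {set i 0} {$i < $n} {incr i} {
            set start [lindex $starts $i]
            if {$i + 1 < $n} {
                set end [lindex $starts [expr {$i + 1}]]
            } else {
                set end [string length $data]
            }
            lappend index [list $start [expr {$end - $start}]]
        }
        return $index
    }

This only settles the indexing half of the problem. Reading a block back with the right encoding still needs a byte-counted read.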

Towards a solution

Alright, then let us do everything in bytes. To set up the index in bytes you can perhaps get away with using TclX [scanmatch], taking the byte offsets from $matchInfo(offset). Very well, now we've got the indices in bytes, and we can seek. But now we cannot easily read! In order to read a specified number of bytes, a trick is needed: we set up a translation pipe using the TclX command [pipe]. First read from the file in binary mode, then put the bytes into the pipe in raw mode, and finally pick them out of the pipe with the appropriate encoding. E.g.

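    package require Tclx
    set fd [open $bigFlatFile r]
    # We know this file is utf-8 encoded, but we want to read a
    # certain number of bytes, not chars...
    fconfigure $fd -encoding binary -translation binary
    # [pipe] stores the read end in its first variable and the
    # write end in its second:
    pipe out in
    fconfigure $in -encoding binary -translation binary -blocking 0 -buffering none
    fconfigure $out -encoding utf-8 -blocking 0 -buffering none
    seek $fd $offset
    # Push the raw bytes into the pipe, then pick them up decoded as utf-8:
    puts -nonewline $in [read $fd $numBytes]
    set chunk [read $out]
    close $fd
    close $in
    close $out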

Although this is doable and works well for short chunks of text, I think we are witnessing a shortcoming in Tcl: it would be really nice (and quite logical, in view of the functionality provided by [seek] and [tell]) if [read] could accept a -bytes flag. The only thing needed is a convention for the situation where the specified number of bytes does not end on a char boundary. One convention could be: finish the char in that case. Another convention: discard the incomplete char. Or finally, just leave the fractional char as binary debris; it is then up to the caller to make sure this does not happen, and in examples like the one above this comes about naturally.
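
As an illustration of the second convention (discard the incomplete char), here is a sketch for the utf-8 case only, applied to a chunk that was read in binary mode. The proc name trimIncompleteUtf8 is hypothetical and not part of Tcl or TclX. The sketch assumes the chunk starts on a char boundary and is otherwise valid utf-8.

    # Drop a trailing utf-8 sequence that was cut off by the byte limit.
    proc trimIncompleteUtf8 {bytes} {
        set len [string length $bytes]
        # The last char starts within the final four bytes, if it starts at all.
        for {set i [expr {$len - 1}]} {$i >= 0 && $i >= $len - 4} {incr i -1} {
            scan [string index $bytes $i] %c b
            if {($b & 0xC0) == 0x80} {
                continue    ;# continuation byte, keep looking for the lead byte
            }
            # Lead byte found: work out how long its sequence should be.
            if {($b & 0xE0) == 0xC0} {
                set need 2
            } elseif {($b & 0xF0) == 0xE0} {
                set need 3
            } elseif {($b & 0xF8) == 0xF0} {
                set need 4
            } else {
                set need 1
            }
            if {$len - $i < $need} {
                # The sequence is cut short: discard it.
                return [string range $bytes 0 [expr {$i - 1}]]
            }
            return $bytes
        }
        return $bytes
    }

The first convention (finish the char) would need access to the channel in order to read the missing bytes, which is one reason it fits better inside [read] itself than in a helper like this.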

Here is how I think it should work:

    # This proc is supposed to work just like [read $fileHandle $numChars],
    # except that the size of the chunk to read is specified in bytes, not in
    # chars.  This is useful in connection with [seek] and [tell] which always
    # measure in bytes.
    package require Tclx
    proc readBytes { fileHandle numBytes } {
        # Record the original configuration:
        set fconf [fconfigure $fileHandle]
        # If anything goes wrong, at least we should restore the
        # original configuration, hence the catch:
        if { [catch {

            # Configure for binary read:
            fconfigure $fileHandle -encoding binary -translation binary
            # Set up a translation pipe:
            pipe out in
            # The input end should be as neutral as possible:
            fconfigure $in -encoding binary -translation binary -buffering none
            # The output part should mimic the original configuration...
            eval fconfigure {$out} $fconf
            # (Beware that it is up to the caller to ensure that the original
            # channel is nonblocking if this is required.)
            # Now read the bytes and put them into the translation pipe:
            puts -nonewline $in [read $fileHandle $numBytes]
            set content [read $out]
            # Clean up:
            close $in 
            close $out
            # And restore the original configuration:
            eval fconfigure {$fileHandle} $fconf

        } err] } {
            eval fconfigure {$fileHandle} $fconf
            # Re-raise the error without losing the original stack trace:
            error $err $::errorInfo $::errorCode
        } else {
            return $content
        }
    }

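Unfortunately, on big chunks of text (more than 8192 bytes), this solution fails: the Tcl interpreter hangs. It looks like a bug in pipe, although a likely explanation is that the blocking [puts] fills the operating system's pipe buffer and then waits forever, because nothing reads from the other end until [puts] has returned.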
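
Following AF's remark at the top of the page, a pure-Tcl alternative might avoid the pipe altogether: read the bytes in binary mode and decode them afterwards with [encoding convertfrom]. What follows is a minimal sketch, not a drop-in replacement. The name readBytesPure is hypothetical, and the chunk is assumed to start and end on char boundaries. Unlike the pipe version, it does not apply the channel's end-of-line translation to the result.

    proc readBytesPure {fileHandle numBytes} {
        # Remember the two settings that are about to change.
        set enc   [fconfigure $fileHandle -encoding]
        set trans [fconfigure $fileHandle -translation]
        fconfigure $fileHandle -translation binary
        # Read raw bytes; restore the configuration whether or not this fails.
        set failed [catch {read $fileHandle $numBytes} bytes]
        fconfigure $fileHandle -translation $trans -encoding $enc
        if {$failed} {
            error $bytes $::errorInfo $::errorCode
        }
        # A channel that was binary to begin with needs no decoding.
        if {$enc eq "binary"} {
            return $bytes
        }
        return [encoding convertfrom $enc $bytes]
    }

It would be called right after a [seek] to a byte offset taken from the index, on a channel that has been configured with the file's real encoding. Since nothing is written to a pipe, the chunk size is limited only by memory.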



Category Example Category Suggestions