'''The problem'''

Suppose you have a huge flat text file whose content actually comes in blocks. For example, this could be a Berkeley mbox, where a "From " pattern delimits the individual mails in the file. Now you want to create an index to this file, recording where each block starts and how big it is, so that later you can read in only a specified block.

At first sight this is an easy problem: first you create the index, and then you seek and read. However, if the file is multi-byte encoded you run into more serious problems, because [seek] measures in bytes whereas [read] measures in chars. (You may also run into trouble in the seemingly trivial situation where the file has DOS line endings.)

For example, to create the index you could read in the entire file (using fconfigure appropriately) and find what you want with [regexp -indices]. But these indices are char offsets, and they won't help you to seek, because [seek] needs a pure byte measurement... Bad luck. (The difficulty in producing a byte measurement is that you would have to read the entire text up to each index and count how many bytes each char occupies. The expense of doing this is also the reason why [seek] cannot measure in chars...)

'''Towards a solution'''

Alright, then let us do everything in bytes. To set up the index in bytes you can perhaps get away with using TclX [scanmatch]es and take the byte offsets from $matchInfo(offset). Very well, now we've got the indices in bytes, and we can seek. But now we cannot easily read! To read a specified number of bytes, my first solution was a trick with a translation pipe using the TclX command [pipe]: first read from the file in binary mode, then put the data into the pipe in raw mode, and finally pick it out of the pipe with the appropriate encoding. E.g.

    package require Tclx

    set fd [open $bigFlatFile r]
    # We know this file is utf-8 encoded, but we want to read a
    # certain number of bytes, not chars...
    fconfigure $fd -encoding binary
    # out is the read end of the pipe, in is the write end:
    pipe out in
    fconfigure $in -encoding binary -blocking 0 -buffering none
    fconfigure $out -encoding utf-8 -blocking 0 -buffering none
    seek $fd $offset
    puts $in [read $fd $numBytes]
    set txt [read -nonewline $out]
    close $fd
    close $in
    close $out

Unfortunately, on big chunks of text (>8192) there seems to be a [bug in pipe] that obstructs this solution; in fact, it makes the Tcl interpreter hang.

In any case, [Lars H] pointed out that this could be done in a much cleaner way using [encoding]. Here is the final solution (so far):

----

    # This proc is supposed to work just like [read $fileHandle $numChars],
    # except that the size of the chunk to read is specified in bytes, not in
    # chars. This is useful in connection with [seek] and [tell], which always
    # measure in bytes. The proc respects the fileHandle's configuration
    # w.r.t. encoding. Only -encoding is switched, so the channel's
    # -translation setting stays as configured; with DOS line endings the
    # count may then not match raw bytes.
    proc readBytes { fileHandle numBytes } {
        # Record the original configuration:
        set enc [fconfigure $fileHandle -encoding]

        # Special treatment of encoding "binary", since this encoding is not
        # accepted by [encoding convertfrom]. But this case is trivial:
        if { $enc eq "binary" } {
            return [read $fileHandle $numBytes]
        }

        # We are going to reconfigure the channel. If anything goes wrong, at
        # least we should restore the original configuration, hence the catch:
        if { [catch {
            # Configure for binary read:
            fconfigure $fileHandle -encoding binary
            set binaryData [read $fileHandle $numBytes]
            set txt [encoding convertfrom $enc $binaryData]
            # And restore the original configuration:
            fconfigure $fileHandle -encoding $enc
        } err] } {
            fconfigure $fileHandle -encoding $enc
            # Re-raise, keeping the original stack trace:
            error $err $::errorInfo $::errorCode
        } else {
            return $txt
        }
    }

----

Older remark: it would be really nice (and quite logical, in view of the functionality provided by [seek] and [tell]) if [read] could accept a -bytes flag.
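
Until [read] grows such a flag, the whole cycle of indexing in bytes and reading one block back can be sketched in plain Tcl, without TclX. This is only an illustration under some assumptions: the proc names buildIndex and readBlock are made up for this page, the file is assumed to be utf-8 with every block starting at a line that begins with "From ", and [file tempfile] in the demo needs Tcl 8.6 or later. The channel is put in -translation binary so that [gets] and [tell] both work on raw bytes.

```tcl
# Sketch only: buildIndex and readBlock are hypothetical helper names.

# Return a list of {byteOffset byteLength} pairs, one per block. A block
# starts at every line that begins with $prefix and runs to the next one.
proc buildIndex {path {prefix "From "}} {
    set fd [open $path r]
    fconfigure $fd -translation binary    ;# raw bytes, no EOL translation
    set starts {}
    set pos 0
    while {[gets $fd line] >= 0} {
        if {[string first $prefix $line] == 0} {
            lappend starts $pos
        }
        set pos [tell $fd]                ;# byte offset of the next line
    }
    lappend starts $pos                   ;# sentinel: end of file
    close $fd

    set index {}
    for {set i 0} {$i < [llength $starts] - 1} {incr i} {
        set start [lindex $starts $i]
        set next  [lindex $starts [expr {$i + 1}]]
        lappend index [list $start [expr {$next - $start}]]
    }
    return $index
}

# Read numBytes bytes starting at byte offset, then decode them as $enc.
proc readBlock {path offset numBytes {enc utf-8}} {
    set fd [open $path r]
    fconfigure $fd -translation binary
    seek $fd $offset
    set bytes [read $fd $numBytes]
    close $fd
    return [encoding convertfrom $enc $bytes]
}

# Demo: two blocks containing multi-byte utf-8 characters.
set fd [file tempfile path]
fconfigure $fd -encoding utf-8 -translation lf
puts -nonewline $fd "From a\nsm\u00f8rrebr\u00f8d\nFrom b\nna\u00efve caf\u00e9\n"
close $fd

set index [buildIndex $path]              ;# byte-based: {0 20} {20 20}
lassign [lindex $index 1] offset size
set text [readBlock $path $offset $size]  ;# second block, decoded
puts $text
file delete $path
```

In this demo a char-based index would place the second block at offset 18; the byte offset is 20 because each of the two ø characters in the first block occupies two bytes in utf-8. Line endings are not translated by this sketch, so a file with DOS line endings comes back with its CR characters intact.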
The only thing needed for such a flag is a convention about how to handle the situation where the requested number of bytes does not end on a complete char. One convention could be: finish the char in that case. Another convention: discard the incomplete char. Or finally, just leave the fractional char as binary debris; it is up to the caller to make sure this does not happen, and in examples like the one above this comes about naturally.

----

'''See Also'''

   * [Building a file index]
   * [Indexed file reading]

----

[Category Example] - [Category File]