This is in fact a suggestion for new core functionality, perhaps a new flag to [[read]] to allow for reading a specified number of bytes instead of a specified number of chars. I suggest this because I don't know how to implement this in pure Tcl. The solution below uses [pipe] from [TclX], and in fact it is not a good solution because of a supposed [bug in pipe]. Of course the best would be to be proven wrong -- some expert showing how it should be done...

[AF] If you put the channel in binary encoding then bytes == chars.

That's exactly what the solution below proposes. If the channel is a [pipe] then you can configure the output end in the true encoding and you are done. If you haven't got [pipe] at your disposal, what do you do with the binary chunk then? Is there some trick with [scan] to allow you to re-read the content of your text chunk in another encoding?

However, it is an idea for a workaround of the [bug in pipe]: first read in the number of bytes you need using a binary encoded [open]. Then fill it into a [pipe] as explained below, but in chunks of length <8192... Not particularly elegant, but better than having Tcl hang...

[Lars H]: Joachim, to "re-read the content of your text chunk in another encoding" one uses [encoding convertfrom].

Thanks Lars -- it was just this sort of expert help I hoped for. Now I think the final proc looks like this:

    # This proc is supposed to work just like [read $fileHandle $numChars],
    # except that the size of the chunk to read is specified in bytes, not in
    # chars.  This is useful in connection with [seek] and [tell] which always
    # measure in bytes.  The proc is supposed to respect the fileHandle's
    # configuration w.r.t. encoding, but it will not respect the configuration
    # w.r.t. eol convention, I think.
    proc readBytes { fileHandle numBytes } {
        # Record the original configuration:
        set enc [fconfigure $fileHandle -encoding]
        # Special treatment of encoding "binary", since this encoding is not
        # accepted by [encoding convertfrom].  But this case is trivial:
        if { $enc eq "binary" } {
            return [read $fileHandle $numBytes]
        }
        # We are going to reconfigure the channel.  If anything goes wrong, at
        # least we should restore the original configuration, hence the catch:
        if { [catch {
            # Configure for binary read:
            fconfigure $fileHandle -encoding binary
            set binaryData [read $fileHandle $numBytes]
            set txt [encoding convertfrom $enc $binaryData]
            # And restore the original configuration:
            fconfigure $fileHandle -encoding $enc
        } err] } {
            fconfigure $fileHandle -encoding $enc
            error $err
        } else {
            return $txt
        }
    }

If this looks right, I will rewrite this wiki page for future reference.

----

'''The problem'''

Suppose you have a huge flat text file, but the content actually comes in blocks. For example, this could be a Berkeley mbox, where some "From " pattern delimits the individual mails in the file. Now you want to create an index to this file, telling where each block starts and how big it is, and then later be able to read in only a specified block.

At first sight this is an easy problem: first you create the index and then you do seek-and-read. However, if the file is multi-byte encoded you run into some more serious problems due to the fact that [[seek]] measures in bytes whereas [[read]] measures in chars. (You may also run into trouble just in the trivial situation where the file has DOS line endings.)

For example, to create the index, just read in the entire file (using fconfigure appropriately) and find what you want with some [[regexp -indices]]. But these indices won't help you to seek, because [[seek]] needs a pure byte measurement... Bad luck.
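To make the mismatch concrete, here is a minimal sketch in pure Tcl. The helper name `charIndexToByteOffset` and the sample text are made up for illustration, and the sketch assumes the file uses plain LF line endings, so that no end-of-line translation shifts the count.

```tcl
# Hypothetical helper (not part of Tcl): convert a character index within
# $text into the byte offset it would have in a file stored in $encodingName.
proc charIndexToByteOffset {text charIndex encodingName} {
    # Encode everything before the index and count the resulting bytes.
    set prefix [string range $text 0 [expr {$charIndex - 1}]]
    return [string length [encoding convertto $encodingName $prefix]]
}

# Sample text "naïve café": 10 characters, but 12 bytes in utf-8.
set text "na\u00efve caf\u00e9"
set charIndex [string first "caf" $text]
set byteOffset [charIndexToByteOffset $text $charIndex utf-8]
puts "char index: $charIndex, byte offset: $byteOffset"
```

Here "caf" starts at character index 6 but at byte offset 7, because the "ï" before it occupies two bytes in utf-8. Only the byte offset is usable with [[seek]], and obtaining it this way means re-encoding everything that precedes the index.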
(Of course your difficulty in producing a byte measurement is this: you would have to read the entire text chunk up to each index to count how many bytes each char occupies. The expense of doing this is of course also the reason why [[seek]] cannot measure in chars...)

'''Towards a solution'''

Alright, then let us do everything in bytes. To set up the index in bytes you can perhaps get away with using TclX [[scanmatch]]es, and get the byte offsets from $matchIndex(offset). Very well, now we've got the indices in bytes, and we can seek. But now we cannot easily read!

In order to read a specified number of bytes, a trick is needed: we set up a translation pipe using the TclX command [[pipe]]. Now first read from the file in binary mode, then put it into the pipe in raw mode, and finally pick it out of the pipe with the appropriate encoding. E.g.

    package require Tclx
    set fd [open $bigFlatFile r]
    # We know this file is utf-8 encoded, but we want to read a
    # certain number of bytes, not chars...
    fconfigure $fd -encoding binary
    pipe out in
    fconfigure $in -encoding binary -blocking 0 -buffering none
    fconfigure $out -encoding utf-8 -blocking 0 -buffering none
    seek $fd $offset
    puts $in [read $fd $numBytes]
    read -nonewline $out
    close $fd
    close $in
    close $out

Although this is doable and works well for short chunks of text, I think we are witnessing a shortcoming in Tcl: it would be really nice (and quite logical, in view of the functionality provided by [[seek]] and [[tell]]) if [[read]] could accept a -bytes flag. The only thing needed is a convention about how to handle the situation where the number of bytes does not constitute a complete char. One convention could be: finish the char in that case. Another convention: discard the non-complete char. Or finally, just leave the fractional char as binary debris -- it is up to the caller to make sure this does not happen, and in examples like the above this comes about naturally.
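As an illustration of the "discard the non-complete char" convention, here is a rough sketch for the utf-8 case only. The helper `utf8CompletePrefixLength` is hypothetical (nothing like it exists in the core), and it assumes the chunk is otherwise well-formed utf-8; other multi-byte encodings would need their own boundary rule.

```tcl
# Hypothetical helper: return the length in bytes of the longest prefix of
# $bytes that does not end in the middle of a utf-8 character.
# Assumes $bytes is binary data that is otherwise well-formed utf-8.
proc utf8CompletePrefixLength {bytes} {
    set n [string length $bytes]
    # A utf-8 character is at most 4 bytes long, so the lead byte of the
    # last character is among the final 4 bytes.
    for {set back 1} {$back <= 4 && $back <= $n} {incr back} {
        binary scan [string index $bytes end-[expr {$back - 1}]] c b
        set b [expr {$b & 0xFF}]
        if {($b & 0xC0) == 0x80} {
            # Continuation byte: keep looking for the lead byte.
            continue
        }
        # Lead byte found: work out how many bytes its character needs.
        if {$b < 0x80} {
            set need 1
        } elseif {($b & 0xE0) == 0xC0} {
            set need 2
        } elseif {($b & 0xF0) == 0xE0} {
            set need 3
        } elseif {($b & 0xF8) == 0xF0} {
            set need 4
        } else {
            # Not a valid lead byte; leave it to the caller.
            set need 1
        }
        if {$back >= $need} {
            return $n
        }
        return [expr {$n - $back}]
    }
    # Empty input, or no lead byte among the last 4 bytes.
    return $n
}

# Usage: "a" followed by the first two bytes (E2 82) of a three-byte char.
set chunk [binary format H* 61e282]
set keep [utf8CompletePrefixLength $chunk]
set text [encoding convertfrom utf-8 [string range $chunk 0 [expr {$keep - 1}]]]
puts $text
```

In the usage example only the leading "a" survives, since the trailing two bytes are an unfinished three-byte character. The "finish the char" convention could be built on the same scan by reading the missing bytes from the channel instead of trimming.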
Here is how I think it should work:

    # This proc is supposed to work just like [read $fileHandle $numChars],
    # except that the size of the chunk to read is specified in bytes, not in
    # chars.  This is useful in connection with [seek] and [tell] which always
    # measure in bytes.
    package require Tclx
    proc readBytes { fileHandle numBytes } {
        # Record the original configuration:
        set fconf [fconfigure $fileHandle]
        # If anything goes wrong, at least we should restore the
        # original configuration, hence the catch:
        if { [catch {
            # Configure for binary read:
            fconfigure $fileHandle -encoding binary -translation binary
            # Set up a translation pipe:
            pipe out in
            # The input end should be as neutral as possible:
            fconfigure $in -encoding binary -translation binary -buffering none
            # The output part should mimic the original configuration...
            eval fconfigure {$out} $fconf
            # (Beware that it is up to the caller to ensure that the original
            # channel is nonblocking if this is required.)
            # Now read the bytes and put them into the translation pipe:
            puts -nonewline $in [read $fileHandle $numBytes]
            set content [read $out]
            # Clean up:
            close $in
            close $out
            # And restore the original configuration:
            eval fconfigure {$fileHandle} $fconf
        } err] } {
            eval fconfigure {$fileHandle} $fconf
            error $err
        } else {
            return $content
        }
    }

Unfortunately, on big chunks of text (>8192), there seems to be a [bug in pipe] that obstructs this solution. In fact, it makes the Tcl interpreter hang...

----

'''See Also'''

   * [Building a file index]
   * [Indexed file reading]
   * [pipe]

----

[Category Example] [Category Suggestions]