Version 19 of Working with binary data

Updated 2004-07-28 17:37:06

Purpose of this page: To collate our knowledge about the facilities provided by Tcl to work with binary data, for example to talk to other applications using a binary protocol for exchanging information and commands.


Definition of binary data in Tcl

"Everything is a string." But what is a string and what's in a string?

The Tcl language views strings as sequences of Unicode characters. That is, each character can have any value from 0 to 0xFFFF (strictly speaking, a UTF-16 code unit). Values that are unassigned or reserved by Unicode are nominally excluded, though Tcl does not check for them.

Binary data in Tcl is just strings consisting of pseudo-characters (bytes) with code points anywhere in the more limited range 0 to 255. Also there are no excluded characters with binary data.

On the C level the Tcl library uses UTF-8 encoded strings for compatibility with the pre-8.0 interfaces. In addition we have the byte array objects (Tcl_NewByteArrayObj and friends) to deal with binary data directly, without the re-interpretation that UTF-8 entails.


The main facility for working with binary data is the binary command, whose subcommands dissect (scan) binary data into standard Tcl values (strings, integers, lists, et cetera) and assemble (format) binary data from them.
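As a small illustration of the two subcommands, here is a sketch that builds a made-up 7-byte message and takes it apart again. The message layout (a 16-bit little-endian length, an 8-bit type and a 4-character tag) is invented for this example and does not belong to any real protocol.

```tcl
# Build the message: s = 16-bit little-endian, c = 8-bit, a4 = 4 characters.
set msg [binary format sca4 258 7 PING]

# Every "character" of the result is a byte, so string length counts bytes.
puts [string length $msg]   ;# 7

# Take the message apart again with the same format string.
binary scan $msg sca4 len type tag
puts "$len $type $tag"      ;# 258 7 PING

# A hex view of the raw bytes is handy for debugging.
binary scan $msg H* hex
puts $hex                   ;# 02010750494e47
```

Note that the s and c codes scan signed values; that is harmless for the small numbers used here.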

For dealing with binary data on C level see also http://www.tcl.tk/man/tcl8.4/TclLib/ByteArrObj.htm

To exchange the binary information with other applications all of the facilities of the I/O system are at our fingertips and ready to be used. But note:

  • When writing data with puts do not forget to use the option -nonewline or else puts will write an additional end-of-line character after the data you actually wanted to send out.
  • Speaking of end-of-line characters I should note that another common error when creating a channel destined for exchange of binary information is forgetting to use [fconfigure channel -translation binary]. This command reconfigures the channel to leave the characters \n and \r untouched. Without this Tcl will treat them as end-of-line characters and mangle them during input and output.
  • In most cases, binary data also needs to be input and output to and from channels in a raw form. Translating from (presumed) UTF-8 to your system's character set can be a disaster. You therefore almost certainly need [fconfigure $channel -encoding binary]. DKF sez: Note that setting the -translation to binary also sets the -encoding to binary, so you can usually ignore this one.
  • When reading binary information from a channel only read should be used. Avoid gets! The latter command will try to recognize end-of-line characters no matter what the channel is configured to. You can never be sure that such a character will not crop up in the middle of your packet.
  • When spawning an application which returns binary data via stdout do not use exec, but the [open "|..."] idiom as only the latter allows you to change the pipe channel to binary. exec hides the pipe channel and may use the wrong encoding and translation settings when reading the information from the external application.
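The points above, applied to a pipe, might look like the following sketch. The child program is a stand-in written on the fly: the file name emit_bytes.tcl is made up for illustration, and the example assumes that [info nameofexecutable] names a Tcl shell that can run a script file.

```tcl
# Hypothetical child program: a tiny script that emits four raw bytes,
# including \r and \n, which a text-mode channel would mangle.
set script [file join [pwd] emit_bytes.tcl]   ;# illustrative file name
set fh [open $script w]
puts $fh {fconfigure stdout -translation binary}
puts $fh {puts -nonewline stdout [binary format H* 0d0a00ff]}
close $fh

# Use open "|..." rather than exec so that the pipe channel can be configured.
set pipe [open "|[list [info nameofexecutable] $script]" r]
fconfigure $pipe -translation binary   ;# also switches the encoding to binary
set data [read $pipe]                  ;# read, not gets, for binary data
close $pipe
file delete $script

binary scan $data H* hex
puts $hex                              ;# 0d0a00ff
```

The child has to configure its own stdout as well; the settings on the reading side cannot undo a translation that was already applied on the writing side.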

On news:comp.lang.tcl, Mac Cody and Jeff David write:

Mac Cody wrote:

 > Here is a simple example that
 > first writes binary data to a file and then reads back the
 > binary data:
 > 
 > set outBinData [binary format s2Sa6B8 {100 -2} 100 foobar 01000001]
 > puts "Format done: $outBinData"
 > set fp [open binfile w]

Important safety tip. When dealing with binary files you should always do:

 fconfigure $fp -translation binary

I got bit hard on this one once when my \x0a and \x0d bytes got translated.

 > puts -nonewline $fp $outBinData
 > close $fp
 > set fp [open binfile r]

 fconfigure $fp -translation binary

 > set inBinData [read $fp]
 > close $fp
 > binary scan $inBinData s2Sa6B8 val1 val2 val3 val4
 > puts "Scan done: $val1 $val2 $val3 $val4"
 > 
 Jeff David

A post to comp.lang.tcl asks how best to embed binary data into a Tcl script. kennykb has this summary of the answer:

  • for the occasional non-printing character embedded in a string, use \xNN.
  • for binary data embedded in a script and simply processed as a unit, use base64 (examples include image files, files for foreign applications that you just want to write from the script, and blocks of cyphertext).
  • for byte sequences where it's important to preserve transparency, use hexadecimal and binary format.
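A short sketch of the first and third options follows; the variable names are arbitrary. The base64 option is left out because decoding needs either the base64 package from tcllib or, in Tcl 8.6 and later, binary decode base64.

```tcl
# Occasional non-printing characters: \xNN escapes in an ordinary string.
# Keep the character after \xNN from being a hex digit, since older Tcl
# versions read more than two digits after \x.
set greeting "Hello\x00World\x1a"

# Byte sequence where every byte should stay visible in the source:
# hexadecimal digits plus binary format (here the 8-byte PNG signature).
set header [binary format H* 89504e470d0a1a0a]

puts [string length $greeting]   ;# 12
puts [string length $header]     ;# 8
```

Both forms keep the script itself in plain 7-bit ASCII.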

In particular, you should avoid typing binary data directly into strings. While Tcl is able to handle binary data, there are places where you can run into problems. For example, if a Tcl script contains a literal control-Z character, Tcl 8.4 and later will typically report a syntax error. This is because beginning with 8.4, \u001a is treated as an end-of-file character in scripts, so the script is read only up to that point. See source (in particular the reference page) for more details.

Please note that the issue with control-Z is just a special case of a more general bit of advice for writing portable Tcl scripts. Whenever Tcl_EvalFile() (or the source command) reads in and evaluates the contents of a file, the reading in is done according to the system encoding. System encodings may be different on different systems. If your file of Tcl code is going to move from system to system, you should be sure that all characters in it are valid in all system encodings. This essentially means you should limit yourself to 7-bit ASCII. You can represent characters outside 7-bit ASCII using the \u quoting supported by the Tcl parser.

Hmmmm.... after a bit more reflection, it dawns on me that control-Z is part of 7-bit ASCII, so it's not a special case after all. Never mind.


Another tip (it's also mentioned on the string page, but I think it's worth repeating):

[string bytelength] should not be used with binary data. That command measures how long the UTF-8 representation of a string is in bytes. For binary data you don't want conversion to UTF-8, so you don't want [string bytelength] either. Use [string length] instead. It's confusing but probably logical.
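The difference can be made visible without calling [string bytelength] at all, by converting a few bytes to UTF-8 explicitly and measuring the result. This is only a sketch of the effect described above; the three byte values are arbitrary.

```tcl
set data [binary format H* 41ff80]          ;# three raw bytes
puts [string length $data]                  ;# 3, the number of bytes

# In UTF-8 every byte value above 0x7f takes two bytes, so a measure
# based on the UTF-8 form over-counts binary data.
set utf8 [encoding convertto utf-8 $data]
puts [string length $utf8]                  ;# 5
```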


See Binary representation of numbers and Dump a file in hex and ASCII for examples of usage.


What would be a way that one could read and write C structures in Tcl in a manner that would make the intents appear obvious to the reader?
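One possible approach, offered as a sketch rather than an established idiom, is to describe the structure once as a list of field names and binary format codes and to derive both directions from that list. Everything here is hypothetical: the struct layout, the names recordSpec, packStruct and unpackStruct, and the assumption that the C side uses a packed little-endian layout without padding.

```tcl
# Hypothetical C layout, assumed packed and little-endian:
#   struct record { int16_t x; int16_t y; uint8_t flags; char name[8]; };
# Each field appears once, as a name followed by its binary format code.
set recordSpec {
    x     s
    y     s
    flags c
    name  a8
}

# Pack a flat name/value list into bytes, field by field, in spec order.
proc packStruct {spec values} {
    array set v $values
    set out ""
    foreach {field code} $spec {
        append out [binary format $code $v($field)]
    }
    return $out
}

# Unpack bytes into a flat name/value list using the same spec.
proc unpackStruct {spec data} {
    set fmt ""
    set vars {}
    foreach {field code} $spec {
        append fmt $code
        lappend vars r($field)     ;# scan into array elements, not locals
    }
    eval [linsert $vars 0 binary scan $data $fmt]
    return [array get r]
}

set bytes [packStruct $recordSpec {x 10 y -20 flags 3 name origin}]
puts [string length $bytes]        ;# 13

array set rec [unpackStruct $recordSpec $bytes]
# a8 pads with NUL bytes, as a C char array would; trim them for display.
puts "$rec(x) $rec(y) $rec(flags) [string trimright $rec(name) \x00]"   ;# 10 -20 3 origin
```

Because the c code scans a signed value, a flags byte above 127 would need masking with & 0xff after unpacking. Real C compilers may also insert padding between fields, which the spec would have to spell out with x codes.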


Can anyone put the above page's recommendations together to form a best practices example?

