Working with binary data

Purpose of this page: To collate our knowledge about the facilities provided by Tcl to work with binary data, for example to talk to other applications using a binary protocol for exchanging information and commands.


Definition of binary data in Tcl

"Everything is a string." But what is a string and what's in a string?

The Tcl language views strings as sequences of Unicode characters. That is, each character can have any Unicode value from 0 to 0xFFFF. (Surrogate pairs can be used to express characters beyond this Basic Multilingual Plane, but Tcl itself gives them no special treatment.) Not all of these characters are assigned by Unicode, and some of them are reserved and should not be used, but Tcl doesn't check the Unicode validity of characters.

Binary data in the Tcl language is just strings consisting of characters with code points anywhere in the more limited range 0 to 255. All Tcl commands that operate on binary data can in principle be reimplemented using the ordinary string commands (e.g. format with %c conversion), but the standard implementations utilize various optimizations to be more efficient (both in memory and time).
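As a small illustration of that equivalence, the sketch below builds the same three bytes once with format and once with binary format, then takes them apart again with ordinary string commands. The byte values are arbitrary and chosen only for the example.

```tcl
# Build the same three-byte value (0x00, 0x41, 0xFF) two ways.
set viaFormat [format %c%c%c 0 65 255]
set viaBinary [binary format c3 {0 65 255}]

# Both are the same Tcl string of three characters.
puts [string equal $viaFormat $viaBinary]   ;# prints 1
puts [string length $viaBinary]             ;# prints 3

# Recover the code points with ordinary string commands.
set codes [list]
foreach ch [split $viaBinary ""] {
    lappend codes [scan $ch %c]
}
puts $codes                                 ;# prints 0 65 255
```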

On the C level the Tcl library may maintain a string representation, an internal representation, or both, for a value. The string representation uses UTF-8 encoded strings for compatibility with the pre-8.0 interfaces. For binary data, there is also an internal representation as byte array objects (Tcl_NewByteArrayObj and friends) which stores binary data in a machine native representation, thus avoiding the extra encoding level of the UTF-8 string representation.

For dealing with binary data on C level see also https://www.tcl-lang.org/man/tcl/TclLib/ByteArrObj.htm

To exchange the binary information with other applications all of the facilities of the I/O system are at our fingertips and ready to be used. But note:

  • When writing data with puts do not forget to use the option -nonewline or else puts will write an additional end-of-line character after the data you actually wanted to send out.
  • Speaking of end-of-line characters: another common error when creating a channel for binary data is forgetting [fconfigure $channel -translation binary]. This command reconfigures the channel to leave the characters \n and \r untouched. Without it, Tcl treats them as end-of-line characters and may translate them during input and output.
  • In most cases, binary data also needs to be input and output to and from channels in a raw form. Translating from (presumed) UTF-8 to your system's character set can be a disaster. You therefore almost certainly need [fconfigure $channel -encoding binary]. DKF sez: Note that setting the -translation to binary also sets the -encoding to binary, so you can usually ignore this one.
  • When reading binary information from a channel only read should be used. Avoid gets! The latter command will try to recognize end-of-line characters no matter how the channel is configured. You can never be sure that such a character will not crop up in the middle of your packet.
  • When spawning an application that returns binary data via stdout, do not use exec; use the [open "|..."] idiom instead, as only the latter allows you to change the pipe channel to binary. exec hides the pipe channel and may use the wrong encoding and translation settings when reading the information from the external application. (TIP #259 addresses this issue.)
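The points in the list above could be combined roughly as follows. This is only a sketch: the proc names writeBinaryFile, readBinaryFile and readBinaryFromCommand are made up for this example, and error handling is left out.

```tcl
# Write $data to $path without any translation or trailing newline.
proc writeBinaryFile {path data} {
    set chan [open $path w]
    fconfigure $chan -translation binary
    puts -nonewline $chan $data
    close $chan
}

# Read the whole file back as raw bytes; read is used, never gets.
proc readBinaryFile {path} {
    set chan [open $path r]
    fconfigure $chan -translation binary
    set data [read $chan]
    close $chan
    return $data
}

# Capture the binary stdout of an external command. $cmd is a Tcl list
# such as {someprog arg1 arg2}; the pipe is opened explicitly so that
# it can be switched to binary, which exec does not allow.
proc readBinaryFromCommand {cmd} {
    set chan [open |$cmd r]
    fconfigure $chan -translation binary
    set data [read $chan]
    close $chan
    return $data
}
```

Bytes that are easily damaged on a text-mode channel, such as \r, \n and \x1a, should come back unchanged from a writeBinaryFile / readBinaryFile round trip.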

On news:comp.lang.tcl, Mac Cody and Jeff David write:

Mac Cody wrote:

 > Here is a simple example that
 > first writes binary data to a file and then reads back the
 > binary data:
 > 
 > set outBinData [binary format s2Sa6B8 {100 -2} 100 foobar 01000001]
 > puts "Format done: $outBinData"
 > set fp [open binfile w]

Important safety tip. When dealing with binary files you should always do:

 fconfigure $fp -translation binary

I got bit hard on this one once when my \x0a and \x0d bytes got translated.

 > puts -nonewline $fp $outBinData
 > close $fp
 > set fp [open binfile r]

 fconfigure $fp -translation binary

 > set inBinData [read $fp]
 > close $fp
 > binary scan $inBinData s2Sa6B8 val1 val2 val3 val4
 > puts "Scan done: $val1 $val2 $val3 $val4"
 > 
 Jeff David

A post to comp.lang.tcl asks how best to embed binary data into a Tcl script. kennykb has this summary of the answer:

  • for the occasional non-printing character embedded in a string, use \xNN.
  • for binary data embedded in a script and simply processed as a unit, use base64 (examples include image files, files for foreign applications that you just want to write from the script, and blocks of cyphertext).
  • for byte sequences where it's important to preserve transparency, use hexadecimal and binary format.
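A rough sketch of the three styles, using the first four bytes of a PNG file as sample data. It assumes Tcl 8.6 for binary decode base64; on older versions the base64 package from tcllib would be the usual substitute.

```tcl
# 1. An occasional non-printing character: a backslash escape.
set bell "warning\x07"

# 2. Data processed as a unit: base64 text decoded at run time.
#    (binary decode is assumed to be available, i.e. Tcl 8.6 or later.)
set magicFromBase64 [binary decode base64 iVBORw==]

# 3. A byte sequence that should stay readable: hex digits via binary format.
set magicFromHex [binary format H* 89504e47]

puts [string equal $magicFromBase64 $magicFromHex]   ;# prints 1

# The hex form can be recovered for inspection with binary scan.
binary scan $magicFromHex H* hex
puts $hex                                            ;# prints 89504e47
```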

In particular, you should avoid typing binary data directly into strings. While Tcl is able to handle binary data, there are places where you can run into problems. In particular, if you happen to have a Tcl script containing the literal character for a control-Z, you will find, as of Tcl 8.4, that you get a syntax error from Tcl. This is because beginning with 8.4, \u001a is an end-of-file character in scripts. See source (in particular the reference page) for more details.

Please note that the issue with control-Z is just a special case of a more general bit of advice for writing portable Tcl scripts. Whenever Tcl_EvalFile() (or the source command) reads in and evaluates the contents of a file, the reading in is done according to the system encoding. System encodings may be different on different systems. If your file of Tcl code is going to move from system to system, you should be sure that all characters in it are valid in all system encodings. This essentially means you should limit yourself to 7-bit ASCII. You can represent characters outside 7-bit ASCII using the \u quoting supported by the Tcl parser.

Hmmmm.... after a bit more reflection, it dawns on me that control-Z is part of 7-bit ASCII, so it's not a special case after all. Never mind.

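Remark: It is safer to use \u00NN instead of \xNN when the escape is immediately followed by a character that is itself a hexadecimal digit. Before Tcl 8.6, \x consumes every hex digit that follows it and keeps only the low 8 bits, whereas \u reads at most four digits. So to get the byte 0x01 followed by the digit 1, write \u00011 rather than \x011.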


Another tip (it's also mentioned on the string page, but I think it's worth repeating):

[string bytelength] should not be used with binary data. That command measures how long the UTF-8 representation of a string is in bytes. For binary data you don't want conversion to UTF-8, so you don't want [string bytelength] either. Use [string length] instead. It's confusing but probably logical.

SEH The money quote appeared in the docs sometime between 8.3 and 8.4: "If the object is a ByteArray object (such as those returned from reading a binary encoded channel), then this will return the actual byte length of the object." Until this quote appeared it was by no means clear that [string length] was a reliable tool for measuring data block lengths in bytes.

Of course this seems to contradict the concept that "everything is a string," since sometimes the thing is a byte array.

BR Well, [string length] has always been the correct way to measure the length AFAIK, so if this didn't work before that was a bug.
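The difference can be seen without [string bytelength] by converting to UTF-8 explicitly. In this sketch the two byte values are arbitrary; 200 was picked because it lies above 127 and therefore needs two bytes in UTF-8.

```tcl
# Two bytes of binary data: 0xC8 and 0x41.
set data [binary format c2 {200 65}]

# string length counts the bytes of the binary value.
puts [string length $data]    ;# prints 2

# The UTF-8 encoding of the same two characters is one byte longer,
# which is the kind of number a UTF-8 based measurement would report.
set utf8 [encoding convertto utf-8 $data]
puts [string length $utf8]    ;# prints 3
```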


See Binary representation of numbers, bitstrings, and Dump a file in hex and ASCII for examples of usage.


What would be a way that one could read and write C structures in Tcl in a manner that would make the intents appear obvious to the reader?

Lars H: One approach is to use something like this proc

 proc Cstruct_scan {data formatL} {
    set formatStr ""
    set varL [list]
    foreach {code var} $formatL {append formatStr $code; lappend varL $var}
    uplevel 1 [linsert $varL 0 ::binary scan $data $formatStr]
 }

A usage example is

 Cstruct_scan $OTF_cmap_head {
    I   Table_version_number
    I   fontRevision
    I   checkSumAdjustment
    I   magicNumber
    B16 flags
    S   unitsPerEm
    W   created
    W   modified
    S   xMin
    S   yMin
    S   xMax
    S   yMax
    B16 macStyle
    S   lowestRecPPEM
    S   fontDirectionHint
    S   indexToLocFormat
    S   glyphDataFormat
 }

It may however be better (in cases such as this) to come up with a higher level proc than this Cstruct_scan. Even though the above decodes the binary data, it is still not quite in a natural Tcl format. Unsigned numbers may still have incorrect sign extensions, for one.
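The sign issue can be shown with a single 16-bit field. The sketch below masks the scanned value by hand, and also uses the u modifier of binary scan, which assumes Tcl 8.5 or later.

```tcl
# A 16-bit big-endian field holding 0xFFFF.
set raw [binary format S 65535]

# Plain S scans it as a signed value.
binary scan $raw S signed
puts $signed                        ;# prints -1

# Masking restores the unsigned interpretation.
set unsigned [expr {$signed & 0xFFFF}]
puts $unsigned                      ;# prints 65535

# With Tcl 8.5 or later (an assumption here), Su does this directly.
binary scan $raw Su direct
puts $direct                        ;# prints 65535
```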

JAK I really like this approach but would the following be a little more efficient?

 proc Cstruct_scan {data formatL} {
    set formatStr ""
    foreach {code var} $formatL {append formatStr $code; lappend varL $var} ;# varL created on first lappend
    uplevel 1 [concat [list ::binary scan $data $formatStr] $varL] ;# creates and appends without copy
 }

Lars H: My primary concern was correctness and avoiding shimmering -- what if $formatL is empty? Also $varL probably has to be rather long if shifting its elements (a tight C loop copying pointers) takes longer time than handling the extra list command. OTOH, you don't need that concat.

JAK If $formatL is empty then $varL will be too - $formatStr will be null and the concat puts the command together into one list for eval to, well, eval... Without the concat, eval will get two lists. But I could be misinterpreting the internals... The reason I was worried about efficiency is that I need to use this in a tight, time-bound loop. It seems to be the ticket. I originally thought I would need to use Swig or Critcl to do this...

Lars H: eval/uplevel always concats its arguments (which is a big reason why concat is not a pure list command). As for gaining more speed... Have you considered writing instead a typedef_Cstruct command that defines a proc that does parsing such as above? The format string would get constructed only once, so the run-time overhead should be minimal. You'd probably need to experiment to see whether upvar (of specified variables) or uplevel (of the entire binary scan command) is most efficient.
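One possible shape for such a typedef_Cstruct command is sketched below. It is only an illustration of the idea, not a tuned implementation: it takes the uplevel route, and the name scan_msg as well as the @FMT@ and @VARS@ placeholders are invented for the example.

```tcl
# Define a proc called $name that scans its argument into variables in
# the caller's scope. The format string is assembled once, at definition
# time, so each call only pays for linsert, uplevel and binary scan.
proc typedef_Cstruct {name formatL} {
    set formatStr ""
    set varL [list]
    foreach {code var} $formatL {
        append formatStr $code
        lappend varL $var
    }
    # Substitute the precomputed pieces into the body as literal words.
    set body [string map \
        [list @FMT@ [list $formatStr] @VARS@ [list $varL]] \
        {uplevel 1 [linsert @VARS@ 0 ::binary scan $data @FMT@]}]
    proc $name {data} $body
}

# Usage: define the struct once, then call the generated proc repeatedly.
typedef_Cstruct scan_msg {
    a type
    a cmd
    c data
    c resp
}
set count [scan_msg [binary format aacc L C 3 0]]
puts "$count $type $cmd $data $resp"   ;# prints 4 L C 3 0
```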

JAK Now we need the format...

 # write binary struct
 proc Cstruct_format {formatL} {
    set formatStr ""
    foreach {code var} $formatL { append formatStr $code; lappend varL \$$var }
    uplevel 1 [concat [list ::binary format $formatStr] [join $varL]]
 }

Now to test:

 % Cstruct_scan [binary format aacc L C 3 0] {
     a type
     a cmd
     c data
     c resp
 }
 4
 % set type
 L
 % set data
 3
 % set x [Cstruct_format {
     a type
     a cmd
     c data
     c resp
     }]
 LC??
 % set type x
 x
 % set cmd x
 x
 % set data x
 x
 % set resp x
 x
 %  Cstruct_scan $x {
     a type
     a cmd
     c data
     c resp
 }
 4
 % set type
 L
 % set data
 3

I need to make more tests...


Can anyone put the above page's recommendations together to form a best practices example?
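As a starting point, here is one way the recommendations might fit together for a simple packet exchange. Everything about the wire format is an assumption made for this sketch: a 2-byte big-endian length followed by that many payload bytes. The proc names sendPacket and recvPacket are likewise made up. The caller is expected to have configured the channel with fconfigure $chan -translation binary and to use it in blocking mode.

```tcl
# Send one packet: 16-bit length header, then the payload, no newline.
proc sendPacket {chan payload} {
    set len [string length $payload]   ;# byte count of the binary data
    if {$len > 0xFFFF} {
        error "payload too long for a 16-bit length field"
    }
    puts -nonewline $chan [binary format S $len]$payload
    flush $chan
}

# Receive one packet using read with explicit byte counts, never gets.
proc recvPacket {chan} {
    set header [read $chan 2]
    if {[string length $header] != 2} {
        error "channel closed before a full header arrived"
    }
    binary scan $header S len
    set len [expr {$len & 0xFFFF}]     ;# undo sign extension
    set payload [read $chan $len]
    if {[string length $payload] != $len} {
        error "channel closed in the middle of a packet"
    }
    return $payload
}
```

Because the payload length is carried in the header, bytes such as \r, \n or \x00 inside the payload need no special treatment.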


In the old days of Tcl (pre-Tcl 8.0), one needed special extensions for binary data access - tclbin (Demailly)