Version 49 of Working with binary data

Updated 2020-07-22 12:56:49 by pooryorick

Working With Binary Data provides an overview of strategies and tools available for manipulating data in units of bytes rather than characters.

See Also

Binary representation of numbers
bitstrings
Dump a file in hex and ASCII
binary data access - tclbin (Demailly)
An extension for working with binary data and structures.

Description

Binary data is data that represents something other than text. Various data formats and communication protocols employ binary data. The key observation for working with binary data in Tcl, where every value is a string, is that any sequence of bytes can be represented and manipulated using a subset of Unicode composed of the first 256 characters. This subset happens to be equivalent to the iso8859-1 character encoding. When binary data is read using this encoding and without any end-of-line translation, the difference between byte and character units melts away, and standard string operations can be used to manipulate the data. With this one-to-one correspondence between a byte and a character, routines like chan seek, which deal in bytes, are in harmony with routines like string index, which deal in characters.

binary format and binary scan are provided for convenience and performance, but they don't add any new functionality.
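The correspondence described above can be checked with a short script. This is only a sketch: the variable names are made up for illustration, and it assumes nothing beyond the core format, scan, string and binary commands.

```tcl
# Build a string containing every byte value 0..255, one character per byte.
set bytes {}
for {set i 0} {$i < 256} {incr i} {
    append bytes [format %c $i]
}

# Ordinary string commands now work in units of bytes.
set count [string length $bytes]
scan [string index $bytes 200] %c code

# binary format yields the same value as the plain string construction.
set same [string equal [binary format c 200] [format %c 200]]

puts "$count $code $same"
```

The script should print 256 200 1: the string holds 256 characters, the character at index 200 has code point 200, and the value built by binary format compares equal to the one built by format.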

Guidelines for working with binary data:

  • Use -nonewline to tell puts not to append a newline character.
  • -encoding binary is usually insufficient because it doesn't disable end-of-line translation. Instead use -translation binary, which, in addition to disabling end-of-line translation, sets the encoding to iso8859-1.
  • gets swallows end-of-line characters. Use read instead.
  • exec provides no way to configure the channel it uses with -translation binary and instead uses the system encoding, which may corrupt the data. Use open instead and then configure the channel with -translation binary. TIP#259 addresses this issue.
  • If binary scan is used on a value that contains characters beyond the first 256 Unicode characters, it only looks at the lower byte of the code point for each character, resulting in data loss. When this happens, it's usually because the data was read in from some channel that wasn't configured with -translation binary.
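The file-related guidelines above can be combined into a round trip like the following. It is a minimal sketch, assuming Tcl 8.6 or later for file tempfile; the sample bytes and the variable names are chosen only for illustration.

```tcl
# Sample payload with bytes that translation would normally disturb:
# NUL, CR, LF, control-Z and 0xFF.
set data "\x00\x0d\x0a\x1a\xff"

# Write the bytes untouched: binary translation, no trailing newline.
set chan [file tempfile path]
fconfigure $chan -translation binary
puts -nonewline $chan $data
close $chan

# Read them back with read (not gets), again with binary translation.
set chan [open $path r]
fconfigure $chan -translation binary
set readBack [read $chan]
close $chan
file delete $path

puts [string equal $data $readBack]
```

If either fconfigure call is left out, the comparison may fail on platforms where end-of-line translation or an end-of-file character applies by default.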

Internals

Regardless of internal representations, at the script level all values are simply strings.

At the C level every value is stored as a Tcl_Obj which may contain a string representation, an internal representation, or both. The string representation is a string encoded in utf-8 and the internal representation is a particular interpretation of the string. A byte array internal representation is a fixed-width representation for strings where each character can be stored in a single byte. It is more efficient than the utf-8 string encoding, which for some characters uses multiple bytes. binary scan creates a byte array internal representation for the scanned value.

Examples

Mac Cody and Jeff David on comp.lang.tcl:

Mac Cody wrote:

 > Here is a simple example that
 > first writes binary data to a file and then reads back the
 > binary data:
 > 
 > set outBinData [binary format s2Sa6B8 {100 -2} 100 foobar 01000001]
 > puts "Format done: $outBinData"
 > set fp [open binfile w]

Important safety tip. When dealing with binary files you should always do:

 fconfigure $fp -translation binary

I got bit hard on this one once when my \x0a and \x0d bytes got translated.

 > puts -nonewline $fp $outBinData
 > close $fp
 > set fp [open binfile r]

 fconfigure $fp -translation binary

 > set inBinData [read $fp]
 > close $fp
 > binary scan $inBinData s2Sa6B8 val1 val2 val3 val4
 > puts "Scan done: $val1 $val2 $val3 $val4"
 > 
 Jeff David
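For readability, the exchange above can be merged into a single script. This is a consolidation of the quoted code with Jeff David's fconfigure lines in place. The extra H* scan, which shows the raw byte layout, is an addition of this sketch.

```tcl
# Pack two little-endian 16-bit values, one big-endian 16-bit value,
# a 6-byte string and one byte given as 8 binary digits.
set outBinData [binary format s2Sa6B8 {100 -2} 100 foobar 01000001]

set fp [open binfile w]
fconfigure $fp -translation binary
puts -nonewline $fp $outBinData
close $fp

set fp [open binfile r]
fconfigure $fp -translation binary
set inBinData [read $fp]
close $fp

# Show the raw bytes as hex, then decode with the same format string.
binary scan $inBinData H* hex
binary scan $inBinData s2Sa6B8 val1 val2 val3 val4
puts "Bytes: $hex"
puts "Scan done: $val1 $val2 $val3 $val4"
```

The hex line should read 6400feff0064666f6f62617241, which is 13 bytes, and the scan line should read Scan done: 100 -2 100 foobar 01000001.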

Binary Data in a Script

A post to comp.lang.tcl asks how best to embed binary data into a Tcl script. kennykb has this summary of the answer:

  • for the occasional non-printing character embedded in a string, use \xNN.
  • for binary data embedded in a script and simply processed as a unit, use base64 (examples include image files, files for foreign applications that you just want to write from the script, and blocks of cyphertext).
  • for byte sequences where it's important to preserve transparency, use the hexadecimal representation and then binary format H* at runtime.
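The three techniques in the list can be compared side by side. The sketch below assumes Tcl 8.6 or later for binary decode base64 (older versions would need something like the base64 package from tcllib). The four-byte payload, the start of a PNG signature, is chosen only for illustration.

```tcl
# 1. Backslash escape for the one non-printing byte.
set viaEscape "\x89PNG"

# 2. base64 text decoded at runtime.
set viaBase64 [binary decode base64 iVBORw==]

# 3. Hexadecimal text converted with binary format H*.
set viaHex [binary format H* 89504e47]

# All three yield the same four bytes.
puts [expr {$viaEscape eq $viaBase64 && $viaBase64 eq $viaHex}]
```

In all three cases the script source itself contains only ASCII characters.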

Avoid literal binary data in scripts that may be interpreted using the system encoding, which can vary by platform. tclsh by default uses the system encoding to read the script it is given, and literal binary data might be incorrectly interpreted as some character special to Tcl, such as whitespace, newline, or \u1a (control-Z), which as of Tcl 8.4 signifies the end of the script.

For the greatest portability, use only ASCII characters in scripts, and use some form of escaping or encoding for characters outside the ASCII range.

With \u or \U character substitution Tcl consumes up to 4 or up to 8 hexadecimal digits, respectively. To ensure that it doesn't consume characters that weren't intended as part of the hexadecimal representation of the character, pad \u and \U on the left with enough 0 characters to total 4 and 8 hexadecimal digits, respectively.
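A small illustration of why the padding matters, using \u only. The strings are made up for the example, and \U behaves analogously with 8 digits in versions that support it.

```tcl
# Intended: the letter A followed by the literal text "BC".
set padded "\u0041BC"

# Without padding, \u also swallows "BC" as hexadecimal digits.
set unpadded "\u41BC"

puts [string length $padded]
puts [string length $unpadded]
```

The padded form is three characters long (A, B, C), while the unpadded form is the single character U+41BC.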

Do Not Use string bytelength

string bytelength is the wrong tool for getting the length of some binary data since it counts the number of bytes in the utf-8 representation of a string. Use string length instead, as the number of characters is equal to the number of bytes when -translation binary is used.
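The difference can be made visible without calling string bytelength at all. The sketch below uses encoding convertto utf-8 to approximate what string bytelength counts; that stand-in is an assumption of this example, and the two-byte sample is arbitrary.

```tcl
# Two bytes of binary data, 0xFF and 0xFE.
set data [binary format H* fffe]

# Number of bytes in the data: one character per byte.
set byteCount [string length $data]

# Size of the utf-8 encoding of the same two characters.
set utf8Count [string length [encoding convertto utf-8 $data]]

puts "$byteCount $utf8Count"
```

The data is 2 bytes long, but its utf-8 encoding takes 4 bytes because each character above 127 needs two bytes in utf-8.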

Representing C structures

How could C structures be represented in a Tcl manner such that the structure and intent is clear to the reader?

Lars H PYK: One approach is to use something like this proc

proc Cstruct_scan {data formatL} {
    set formatStr {} 
    set varL [list]
    foreach {code var} $formatL {append formatStr $code; lappend varL $var}
    uplevel 1 [linsert $varL 0 ::binary scan $data $formatStr]
}

A usage example, based on the OpenType font header, is

Cstruct_scan $OTF_cmap_head {
    I   Table_version_number
    I   fontRevision
    I   checkSumAdjustment
    I   magicNumber
    B16 flags
    S   unitsPerEm
    W   created
    W   modified
    S   xMin
    S   yMin
    S   xMax
    S   yMax
    B16 macStyle
    S   lowestRecPPEM
    S   fontDirectionHint
    S   indexToLocFormat
    S   glyphDataFormat
}

It may however be better in cases such as this to come up with a higher-level routine than Cstruct_scan. Even though the above decodes the binary data, it is still not quite in a natural Tcl format. Also, unsigned numbers may still have incorrect sign extensions.
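The sign-extension problem mentioned above can be handled in at least two ways, sketched here for a 16-bit field. The u modifier assumes Tcl 8.5 or later; masking works in older versions too. The sample bytes are arbitrary.

```tcl
# Two bytes that represent 65534 when read as an unsigned big-endian value.
set raw [binary format H* fffe]

# A plain S conversion treats the field as signed.
binary scan $raw S signed

# Option 1: mask off the sign extension afterwards.
set masked [expr {$signed & 0xFFFF}]

# Option 2: ask binary scan for an unsigned value directly.
binary scan $raw Su unsigned

puts "$signed $masked $unsigned"
```

This should print -2 65534 65534. In the font header example, fields such as unitsPerEm would be candidates for the unsigned form.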

JAK PYK: I really like this approach but would the following be a little more efficient?

proc Cstruct_scan {data formatL} {
    set formatStr {}
    foreach {code var} $formatL {
        append formatStr $code
        # varL created on first lappend
        lappend varL $var
    }
    # creates and appends without copy
    uplevel 1 [list ::binary scan $data $formatStr] $varL
}

Lars H: My primary concern was correctness and avoiding shimmering -- what if $formatL is empty? Also $varL probably has to be rather long if shifting its elements (a tight C loop copying pointers) takes longer time than handling the extra list command.

JAK PYK: If $formatL is empty then $varL will be too. $formatStr will be empty. The reason I was worried about efficiency is I need to use this in a tight, time bound loop. It seems to be the ticket. I originally thought I would need to use Swig or Critcl to do this...

Lars H PYK: As for gaining more speed... Have you considered writing instead a typedef_Cstruct routine that defines a procedure that does parsing such as above? The format string would get constructed only once, so the run-time overhead should be minimal. You'd probably need to experiment to see whether upvar (of specified variables) or uplevel of the entire binary scan command is most efficient.
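One possible shape for such a typedef_Cstruct routine is sketched below. It is not code from the discussion. The generated procedure uses uplevel of the whole binary scan command, the {*} syntax assumes Tcl 8.5 or later, and the name scan_msg is hypothetical.

```tcl
# Define a procedure $name that scans its argument according to formatL
# and sets the listed variables in the caller's scope.
proc typedef_Cstruct {name formatL} {
    set formatStr {}
    set varL [list]
    foreach {code var} $formatL {
        append formatStr $code
        lappend varL $var
    }
    # The format string and variable list are baked into the body once,
    # so each call only runs a single fixed binary scan command.
    set body [format {uplevel 1 [list ::binary scan $data %s {*}%s]} \
        [list $formatStr] [list $varL]]
    proc $name {data} $body
}

# Usage: define the scanner once, then call it as often as needed.
typedef_Cstruct scan_msg {a type a cmd c data c resp}
set fieldCount [scan_msg [binary format aacc L C 3 0]]
puts "$fieldCount: $type $cmd $data $resp"
```

The last line should print 4: L C 3 0. Whether this is actually faster than building the format string on every call would need to be measured, for example with time.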

JAK: Now we need the format...

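# write binary struct
proc Cstruct_format {formatL} {
    set formatStr {}
    set valL [list]
    foreach {code var} $formatL {
        append formatStr $code
        # Read each value from the caller's scope here, rather than
        # building "$var" words for uplevel to substitute later, which
        # is not reliable when uplevel concatenates list arguments.
        lappend valL [uplevel 1 [list set $var]]
    }
    return [eval [linsert $valL 0 ::binary format $formatStr]]
}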

Now to test:

% Cstruct_scan [binary format aacc L C 3 0] {
    a type
    a cmd
    c data
    c resp
}
4
% set type
L
% set data
3
% set x [Cstruct_format {
    a type
    a cmd
    c data
    c resp
    }]
LC??
% set type x
x
% set cmd x
x
% set data x
x
% set resp x
x
% Cstruct_scan $x {
    a type
    a cmd
    c data
    c resp
}
4
% set type
L
% set data
3

I need to make more tests...
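A few more tests could take the form of a round-trip check like the one below. This is only a sketch. Lars H's scanner is restated so the script stands alone, the formatter shown here reads each caller variable with set (an assumption of this sketch), and roundtrip_ok is a hypothetical helper.

```tcl
# Lars H's scanner from above, restated so the script stands alone.
proc Cstruct_scan {data formatL} {
    set formatStr {}
    set varL [list]
    foreach {code var} $formatL {append formatStr $code; lappend varL $var}
    uplevel 1 [linsert $varL 0 ::binary scan $data $formatStr]
}

# Assumption of this sketch: the formatter reads each caller variable with set.
proc Cstruct_format {formatL} {
    set formatStr {}
    set valL [list]
    foreach {code var} $formatL {
        append formatStr $code
        lappend valL [uplevel 1 [list set $var]]
    }
    return [eval [linsert $valL 0 ::binary format $formatStr]]
}

# Hypothetical helper: pack four values, clear them, scan them back, compare.
proc roundtrip_ok {type cmd data resp} {
    set fields {a type a cmd c data c resp}
    set expected [list $type $cmd $data $resp]
    set packed [Cstruct_format $fields]
    unset type cmd data resp
    Cstruct_scan $packed $fields
    return [expr {[list $type $cmd $data $resp] eq $expected}]
}

puts [roundtrip_ok L C 3 0]
puts [roundtrip_ok x y -128 127]
puts [roundtrip_ok x y 200 0]
```

The first two checks should print 1. The third should print 0, because 200 does not fit in a signed 8-bit field and comes back as -56, which is the sign-extension issue noted earlier on this page.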

Page Authors

anonymous
Wrote the original page.
dkf
Some essential comments.
pyk
Rewrote much of the page for brevity and clarity.