Working With Binary Data provides an overview of the strategies and tools available for manipulating data in units of bytes rather than characters.
Binary data is data that represents something other than text. Various data formats and communication protocols employ binary data. The key observation for working with binary data in Tcl, where every value is a string, is that any sequence of bytes can be represented and manipulated using a subset of Unicode composed of the first 256 characters. This subset happens to be equivalent to the iso8859-1 character encoding. When binary data is read using this encoding and without any end-of-line translation, the difference between byte and character units melts away, and standard string operations can be used to manipulate the data. With this one-to-one correspondence between a byte and a character, routines like chan seek, which deal in bytes, are in harmony with routines like string index, which deal in characters.
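The byte/character correspondence described above can be checked with a short script. This is only a sketch: it assumes Tcl 8.6 or later for file tempfile, and the five sample bytes are arbitrary.

```tcl
# file tempfile (Tcl 8.6+) returns a channel open for reading and writing.
set chan [file tempfile path]
fconfigure $chan -translation binary

# Write five arbitrary bytes, including CR, LF and a byte above 127.
puts -nonewline $chan [binary format c* {0 13 10 255 65}]
flush $chan

# Read everything back: one character per byte.
seek $chan 0
set data [read $chan]
set length [string length $data]          ;# 5

# Byte offset 3 on the channel lines up with character index 3 in the string.
seek $chan 3
set fromChannel [read $chan 1]
set fromString [string index $data 3]
scan $fromString %c code                  ;# 255

close $chan
file delete $path
```

Because the channel is in binary mode, the CR and LF bytes come back unchanged and chan seek offsets can be used directly as string indices.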
binary format and binary scan are provided for convenience and performance, but they don't add any new functionality.
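As a small illustration of that claim, the sketch below builds and decodes a 16-bit little-endian value once with binary format and binary scan, and once with nothing but format, scan and expr. The value 258 is an arbitrary example.

```tcl
# 258 is 0x0102; as a little-endian 16-bit value it is the bytes 0x02 0x01.
set viaBinary [binary format s 258]
set viaFormat [format %c%c 2 1]
set same [string equal $viaBinary $viaFormat]   ;# 1

# Decoding by hand: take the two character codes and recombine them.
scan $viaFormat %c%c low high
set decoded [expr {$low | ($high << 8)}]        ;# 258

# The same decoding with binary scan.
binary scan $viaBinary s viaScan                ;# 258
```

The hand-written version gets longer quickly for wider fields and signed values, which is where the convenience of binary format and binary scan shows.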
Guidelines for working with binary data:
Regardless of internal representations, at the script level all values are simply strings.
At the C level every value is stored as a Tcl_Obj which may contain a string representation, an internal representation, or both. The string representation is a string encoded in utf-8 and the internal representation is a particular interpretation of the string. A byte array internal representation is a fixed-width representation for strings where each character can be stored in a single byte. It is more efficient than the utf-8 string encoding, which uses multiple bytes for some characters. binary scan creates a byte array internal representation for the scanned value.
Mac Cody and Jeff David on comp.lang.tcl:
Mac Cody wrote:
> Here is a simple example that
> first writes binary data to a file and then reads back the
> binary data:
>
> set outBinData [binary format s2Sa6B8 {100 -2} 100 foobar 01000001]
> puts "Format done: $outBinData"
> set fp [open binfile w]
Important safety tip. When dealing with binary files you should always do:
fconfigure $fp -translation binary
I got bit hard on this one once when my \x0a and \x0d bytes got translated.
> puts -nonewline $fp $outBinData
> close $fp
> set fp [open binfile r]

fconfigure $fp -translation binary

> set inBinData [read $fp]
> close $fp
> binary scan $inBinData s2Sa6B8 val1 val2 val3 val4
> puts "Scan done: $val1 $val2 $val3 $val4"
>

Jeff David
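The effect Jeff David describes can be reproduced on any platform by forcing the crlf translation explicitly. In this sketch the proc name writtenSize is made up for the illustration, and file tempfile assumes Tcl 8.6 or later.

```tcl
# Hypothetical helper: write a single \x0a byte under the given
# -translation setting and report how many bytes reached the file.
proc writtenSize {translation} {
    set chan [file tempfile path]
    fconfigure $chan -translation $translation
    puts -nonewline $chan "\x0a"
    close $chan
    set size [file size $path]
    file delete $path
    return $size
}

set crlfSize [writtenSize crlf]       ;# 2: the byte was expanded to \x0d\x0a
set binarySize [writtenSize binary]   ;# 1: the byte was written untouched
```

On Windows the default translation for files is typically crlf, which is why the problem tends to surface only there.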
A post to comp.lang.tcl asks how best to embed binary data into a Tcl script. kennykb has this summary of the answer:
Avoid literal binary data in scripts that may be interpreted using the system encoding, which can vary by platform. By default, tclsh uses the system encoding to read the script it is given, and literal binary data might be misinterpreted as a character that is special to Tcl, such as whitespace, newline, or \u1a (control-Z), which as of Tcl 8.4 signifies the end of the script.
For the greatest portability, use only ASCII characters in scripts, and use some form of escaping or encoding for characters outside the ASCII range.
With \u and \U character substitution, Tcl consumes up to 4 and 8 hexadecimal digits, respectively. To ensure that it doesn't consume characters that weren't intended as part of the hexadecimal representation of the character, pad the value on the left with enough 0 characters to total 4 hexadecimal digits for \u and 8 for \U.
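A sketch of both pieces of advice follows. The first part keeps binary data in the script as plain ASCII hex and converts it with binary format H*; the sample bytes are the eight-byte PNG signature, chosen because it contains CR, LF and control-Z. The second part shows what padding a \u escape prevents.

```tcl
# ASCII-only hex text in the script, converted to bytes at run time.
set signature [binary format H* 89504e470d0a1a0a]
set signatureLength [string length $signature]    ;# 8

# binary scan H* recovers the hex text from the bytes.
binary scan $signature H* hex                     ;# 89504e470d0a1a0a

# Padded: \u0041 is "A", and the following "BC" stays literal text.
set padded "\u0041BC"
set paddedLength [string length $padded]          ;# 3

# Unpadded: all four hex digits are consumed, giving the one character U+41BC.
set unpadded "\u41BC"
set unpaddedLength [string length $unpadded]      ;# 1
```

Base64 is another ASCII-only option for larger payloads, at the cost of a decoding step.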
string bytelength is the wrong tool for getting the length of some binary data, since it counts the number of bytes in the utf-8 representation of a string. Use string length instead, as the number of characters is equal to the number of bytes when -translation binary is used.
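The difference is easy to see with bytes above 127. The sketch below deliberately avoids calling string bytelength itself, which may not be available in every Tcl release, and uses encoding convertto utf-8 as a stand-in to count utf-8 bytes. The three sample bytes are arbitrary.

```tcl
# Three bytes of binary data, two of them above 127.
set data [binary format c* {255 200 65}]

# One character per byte: this is the length of the binary data.
set byteCount [string length $data]                              ;# 3

# The utf-8 encoding of the same string needs two bytes for each
# character above 127, so counting utf-8 bytes overstates the size.
set utf8Count [string length [encoding convertto utf-8 $data]]   ;# 5
```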
How could C structures be represented in a Tcl manner such that the structure and intent is clear to the reader?
Lars H PYK: One approach is to use something like this proc
    proc Cstruct_scan {data formatL} {
        set formatStr {}
        set varL [list]
        foreach {code var} $formatL {
            append formatStr $code
            lappend varL $var
        }
        uplevel 1 [linsert $varL 0 ::binary scan $data $formatStr]
    }
A usage example, based on the OpenType font header, is

    Cstruct_scan $OTF_cmap_head {
        I   Table_version_number
        I   fontRevision
        I   checkSumAdjustment
        I   magicNumber
        B16 flags
        S   unitsPerEm
        W   created
        W   modified
        S   xMin
        S   yMin
        S   xMax
        S   yMax
        B16 macStyle
        S   lowestRecPPEM
        S   fontDirectionHint
        S   indexToLocFormat
        S   glyphDataFormat
    }
It may, however, be better in cases such as this to come up with a higher-level routine than Cstruct_scan. Even though the above decodes the binary data, the result is still not quite in a natural Tcl format. Also, unsigned numbers may still have incorrect sign extensions.
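The sign-extension problem mentioned above shows up whenever a field's high bit is set. A minimal sketch of two common remedies, masking with expr and the u modifier (which assumes Tcl 8.5 or later):

```tcl
# Two bytes 0xff 0xff, meant as the unsigned 16-bit value 65535.
set raw [binary format S 65535]

# Plain S treats the field as signed.
binary scan $raw S signed                       ;# -1

# Masking recovers the unsigned value.
set masked [expr {$signed & 0xffff}]            ;# 65535

# With Tcl 8.5 or later the u modifier does this directly.
binary scan $raw Su unsigned                    ;# 65535

# The same applies to 32-bit fields.
binary scan [binary format I 4294967295] I signed32      ;# -1
set masked32 [expr {$signed32 & 0xffffffff}]              ;# 4294967295
```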
JAK PYK: I really like this approach but would the following be a little more efficient?
    proc Cstruct_scan {data formatL} {
        set formatStr {}
        foreach {code var} $formatL {
            append formatStr $code
            # varL created on first lappend
            lappend varL $var
        }
        # creates and appends without copy
        uplevel 1 [list ::binary scan $data $formatStr] $varL
    }
Lars H: My primary concern was correctness and avoiding shimmering -- what if $formatL is empty? Also $varL probably has to be rather long if shifting its elements (a tight C loop copying pointers) takes longer time than handling the extra list command.
JAK PYK: If $formatL is empty then $varL will be too. $formatStr will be empty. The reason I was worried about efficiency is I need to use this in a tight, time bound loop. It seems to be the ticket. I originally thought I would need to use Swig or Critcl to do this...
Lars H PYK: As for gaining more speed... Have you considered writing instead a typedef_Cstruct routine that defines a procedure that does parsing such as above? The format string would get constructed only once, so the run-time overhead should be minimal. You'd probably need to experiment to see whether upvar (of specified variables) or uplevel of the entire binary scan command is most efficient.
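One possible reading of that suggestion is sketched below. The name typedef_Cstruct comes from the comment above, but the implementation, the @ARGS@ placeholder and the example proc name scanHeader are assumptions made for illustration, and no timing has been done.

```tcl
# Define a proc called $name that scans its data argument into the
# caller's variables. The format string and variable names are
# computed once, when the struct is declared.
proc typedef_Cstruct {name formatL} {
    set formatStr {}
    set varL [list]
    foreach {code var} $formatL {
        append formatStr $code
        lappend varL $var
    }
    # Substitute the fixed arguments into the body of the new proc.
    set body [string map [list @ARGS@ [linsert $varL 0 $formatStr]] {
        uplevel 1 [list ::binary scan $data @ARGS@]
    }]
    uplevel 1 [list proc $name {data} $body]
}

# Declare the struct once ...
typedef_Cstruct scanHeader {
    a type
    a cmd
    c data
    c resp
}

# ... then scan as often as needed without rebuilding the format string.
set count [scanHeader [binary format aacc L C 3 0]]   ;# 4
```

After the call, type, cmd, data and resp are set in the calling scope, just as with Cstruct_scan. Whether this is measurably faster than Cstruct_scan would need to be checked with time.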
JAK: Now we need the format...
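    # write binary struct
    proc Cstruct_format {formatL} {
        set formatStr {}
        set valL [list]
        foreach {code var} $formatL {
            append formatStr $code
            # Fetch the current value of the caller's variable. Appending
            # a literal "$var" would be brace-quoted by lappend and never
            # substituted.
            lappend valL [uplevel 1 [list set $var]]
        }
        eval [linsert $valL 0 ::binary format $formatStr]
    }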
Now to test:
    % Cstruct_scan [binary format aacc L C 3 0] {
        a type
        a cmd
        c data
        c resp
    }
    4
    % set type
    L
    % set data
    3
    % set x [Cstruct_format {
        a type
        a cmd
        c data
        c resp
    }]
    LC??
    % set type x
    x
    % set cmd x
    x
    % set data x
    x
    % set resp x
    x
    % Cstruct_scan $x {
        a type
        a cmd
        c data
        c resp
    }
    4
    % set type
    L
    % set data
    3
I need to make more tests...