'''Working With Binary Data''' provides an overview of strategies and tools for manipulating data in units of [byte%|%bytes] rather than [character%|%characters].

** See Also **

[Binary representation of numbers]:
[bitstrings]:
[Dump a file in hex and ASCII]:
[binary data access - tclbin (Demailly)]: An [extension] for working with binary data and structures.

** Description **

'''Binary data''' is data that represents something other than text. Various [data format%|%data formats] and [category protocol%|%communication protocols] employ binary data.

The key observation for working with binary data in Tcl, where [eias%|%every value is a string], is that any sequence of bytes can be represented and manipulated using a subset of [Unicode] composed of the first 256 characters. This subset happens to be equivalent to the `iso8859-1` [encoding%|%character encoding]. When binary data is read using this encoding and without any end-of-line translation, the difference between byte and character units melts away, and standard [string] operations can be used to manipulate the data. With this one-to-one correspondence between a byte and a character, routines like `[chan seek]`, which deal in bytes, are in harmony with routines like `[string index]`, which deal in characters. `[binary format]` and `[binary scan]` are provided for convenience and performance, but they don't add any new functionality.

Guidelines for working with binary data:

* Use `-nonewline` to tell `[puts]` not to append a newline character.
* `[chan configure%|%-encoding binary]` is usually insufficient because it doesn't disable end-of-line translation. Instead use `[chan configure%|%-translation binary]`, which in addition to disabling end-of-line translations sets the encoding to `iso8859-1`.
* `[gets]` swallows end-of-line characters.
Use `[read]` instead.
* `[exec]` doesn't provide a way to configure the channel it uses with `-translation binary`. It uses the [encoding system%|%system encoding] instead, which may corrupt the data. Use `[open]` to run the command as a pipeline instead and then configure the channel with `-translation binary`. [TIP][http://tip.tcl-lang.org/259.html%|%#259] addresses this issue.
* If `[binary scan]` is used on a value that contains characters beyond the first 256 Unicode characters, it only looks at the lower byte of the code point for each character, resulting in data loss. When this happens, it's usually because the data was read in from some [chan%|%channel] that wasn't configured with `[chan configure%|%-translation binary]`.

** Internals **

Regardless of internal representations, at the script level all values are simply [eias%|%strings]. At the [C] level every value is stored as a [Tcl_Obj] which may contain a '''string representation''', an '''internal representation''', or both. The string representation is a string encoded in [utf-8] and the internal representation is a particular interpretation of the string.

A [https://www.tcl-lang.org/man/tcl/TclLib/ByteArrObj.htm%|%byte array] internal representation is a fixed-width representation for strings where each character can be stored in a single byte. It is more efficient than the [utf-8] string encoding, which uses multiple bytes for some characters. `[binary scan]` creates a byte array internal representation for the scanned value.

** Examples **

[Mac Cody] and [Jeff David] on [comp.lang.tcl]:

Mac Cody wrote:

> Here is a simple example that
> first writes binary data to a file and then reads back the
> binary data:
>
> set outBinData [binary format s2Sa6B8 {100 -2} 100 foobar 01000001]
> puts "Format done: $outBinData"
> set fp [open binfile w]

Important safety tip. When dealing with binary files you should always do:

fconfigure $fp -translation binary

I got bit hard on this one once when my `\x0a` and `\x0d` bytes got translated.
> puts -nonewline $fp $outBinData
> close $fp
> set fp [open binfile r]

fconfigure $fp -translation binary

> set inBinData [read $fp]
> close $fp
> binary scan $inBinData s2Sa6B8 val1 val2 val3 val4
> puts "Scan done: $val1 $val2 $val3 $val4"
>

Jeff David

** Binary Data in a Script **

A post to [comp.lang.tcl] asks how best to embed binary data into a Tcl script. [kennykb] has this summary of the answer:

* for the occasional non-printing character embedded in a string, use `\x`''`NN`''.
* for binary data embedded in a script and simply processed as a unit, use [base64] (examples include image files, files for foreign applications that you just want to write from the script, and blocks of cyphertext).
* for byte sequences where it's important to preserve transparency, use the hexadecimal representation and then [binary format%|%binary format H*] at runtime.

Avoid literal binary data in scripts that may be interpreted using the [encoding system%|%system encoding], which can vary by platform. [tclsh] by default uses the system encoding to read the script it is given, and literal binary data might be incorrectly interpreted as some character special to Tcl, such as whitespace, newline, or `[\u001a is an end-of-file character in scripts%|%\u1a]` (control-Z), which as of Tcl [Changes in Tcl/Tk 8.4%|%Tcl 8.4] signifies the end of the script. For the greatest portability, use only [ASCII] characters in scripts, and use some form of escaping or encoding for characters outside the ASCII range.

With `\u` and `\U` character substitution, Tcl consumes up to 4 and up to 8 hexadecimal digits, respectively. To ensure that it doesn't consume characters that weren't intended as part of the hexadecimal representation of the character, pad the value after `\u` or `\U` on the left with enough `0` characters to total 4 or 8 hexadecimal digits, respectively.
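The three options summarized by [kennykb], together with the padding advice for `\u`, might look like this in practice. This is only a minimal sketch: it assumes Tcl 8.6 or later (for `[binary decode]` and `[binary encode]`), and the variable names and sample byte values are chosen purely for illustration.

```tcl
# Sketch only: assumes Tcl 8.6+; all names and sample values are illustrative.

# 1. Occasional non-printing bytes: \xNN escapes inside a string.
set header "PK\x03\x04"

# 2. Transparent byte sequence: keep hexadecimal text in the script and
#    convert it at run time. These eight bytes include 0x1a and 0x0a,
#    which would be hazardous as literal bytes in a script file.
set magic [binary format H* 89504e470d0a1a0a]

# 3. Opaque blob processed as a unit: base64 text decoded at run time.
set blob [binary decode base64 SGVsbG8sIHdvcmxkIQ==]

# 4. Pad \u to four hexadecimal digits so that following text is not
#    consumed as part of the character code.
set padded   "\u00e9ab"   ;# U+00E9 followed by the letters "ab"
set unpadded "\ue9ab"     ;# the single character U+E9AB

binary scan $magic H* hex
puts $hex                          ;# 89504e470d0a1a0a
puts [string length $magic]        ;# 8
puts [binary encode hex $header]   ;# 504b0304
puts $blob                         ;# Hello, world!
puts [string length $padded]       ;# 3
puts [string length $unpadded]     ;# 1
```

The script itself contains only ASCII characters, so it reads the same regardless of the system encoding.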
** Do Not Use `[string bytelength]` **

`[string bytelength]` is the wrong tool for getting the length of some binary data since it counts the number of bytes in the [utf-8] representation of a string. Use `[string length]` instead, as the number of characters is equal to the number of bytes when `-translation binary` is used.

** Representing [C] structures **

How could [C] structures be represented in a Tcl manner such that the structure and intent are clear to the reader?

[Lars H] [PYK]: One approach is to use something like this procedure:

======
proc Cstruct_scan {data formatL} {
    set formatStr {}
    set varL [list]
    foreach {code var} $formatL {append formatStr $code; lappend varL $var}
    uplevel 1 [linsert $varL 0 ::binary scan $data $formatStr]
}
======

A usage example, based on the [https://web.archive.org/web/20041022194341/http://partners.adobe.com/asn/tech/type/opentype/head.jsp%|%OpenType font header], is

======
Cstruct_scan $OTF_cmap_head {
    I Table_version_number
    I fontRevision
    I checkSumAdjustment
    I magicNumber
    B16 flags
    S unitsPerEm
    W created
    W modified
    S xMin
    S yMin
    S xMax
    S yMax
    B16 macStyle
    S lowestRecPPEM
    S fontDirectionHint
    S indexToLocFormat
    S glyphDataFormat
}
======

It may however be better in cases such as this to come up with a higher-level routine than `Cstruct_scan`. Even though the above decodes the binary data, it is still not quite in a natural Tcl format. Also, unsigned numbers may still have incorrect sign extensions.

[JAK] [PYK]: I really like this approach but would the following be a little more efficient?

======
proc Cstruct_scan {data formatL} {
    set formatStr {}
    foreach {code var} $formatL {
        append formatStr $code
        # varL created on first lappend
        lappend varL $var
    }
    # creates and appends without copy
    uplevel 1 [list ::binary scan $data $formatStr] $varL
}
======

[Lars H]: My primary concern was correctness and avoiding shimmering -- what if `$formatL` is empty?
Also `$varL` probably has to be rather long if shifting its elements (a tight C loop copying pointers) takes longer than handling the extra [list] command.

[JAK] [PYK]: If `$formatL` is empty then `$varL` will be too. `$formatStr` will be empty. The reason I was worried about efficiency is that I need to use this in a tight, time-bound loop. It seems to be the ticket. I originally thought I would need to use [Swig] or [Critcl] to do this...

[Lars H] [PYK]: As for gaining more speed... Have you considered instead writing a '''typedef_Cstruct''' routine that defines a [proc%|%procedure] that does parsing such as above? The format string would get constructed only once, so the run-time overhead should be minimal. You'd probably need to experiment to see whether `[upvar]` (of specified variables) or `[uplevel]` of the entire `[binary scan]` command is more efficient.

[JAK]: Now we need the format...

======
# write binary struct
proc Cstruct_format {formatL} {
    set formatStr {}
    foreach {code var} $formatL {
        append formatStr $code
        lappend varL \$$var
    }
    uplevel 1 [list ::binary format $formatStr] $varL
}
======

Now to test:

======none
Cstruct_scan [binary format aacc L C 3 0] {
    a type
    a cmd
    c data
    c resp
}
4
% set type
L
% set data
3
set x [Cstruct_format {
    a type
    a cmd
    c data
    c resp
}]
LC??
% set type x
x
% set cmd x
x
% set data x
x
% set resp x
x
% Cstruct_scan $x {
    a type
    a cmd
    c data
    c resp
}
4
% set type
L
% set data
3
======

I need to make more tests...

** Page Authors **

anonymous: Wrote the original page.
[dkf]: Some essential comments.
[pyk]: Rewrote much of the page for brevity and clarity.

<> Binary Data | Characters | File