Purpose of this page: To collate our knowledge about the facilities provided by Tcl to work with binary data, for example to talk to other applications using a binary protocol for exchanging information and commands.

-----

'''Definition of binary data in Tcl'''

"Everything is a string." But what is a string and what's ''in'' a string?

The Tcl language views strings as sequences of Unicode characters. That is, each character can have any Unicode value from 0 to 0xFFFF. (Surrogate pairs can be used to express characters beyond this Basic Multilingual Plane, but Tcl itself gives them no special treatment.) Not all of these characters are assigned by Unicode, and some of them are reserved and should not be used, but Tcl doesn't check the Unicode validity of characters.

Binary data in the Tcl language is just strings consisting of characters with code points anywhere in the more limited range 0 to 255. All Tcl commands that operate on binary data can in principle be reimplemented using the ordinary string commands (e.g. [format] with %c conversion), but the standard implementations utilize various optimizations to be more efficient (both in memory and time).

On the C level the Tcl library may maintain a ''string representation'', an ''internal representation'', or both, for a value. The string representation uses UTF-8 encoded strings for compatibility with the pre-8.0 interfaces. For binary data, there is also an internal representation as byte array objects (Tcl_NewByteArrayObj and friends) which stores binary data in a machine native representation, thus avoiding the extra encoding level of the UTF-8 string representation.

The return values from [binary format], most [encoding convertto], and [read] on a binary channel can be considered to be "pure binary data" (only a byte array internal representation, no UTF-8 string representation). It may be a good idea to treat these with the same kind of care one would treat a [pure list] to keep this purity.
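As a small illustration of the definition above, here is a sketch (the variable names are chosen only for this example) that builds the same three bytes once with [binary format] and once with [format] and %c, then checks that the two values compare equal and that every character has a code point in the range 0 to 255.

```tcl
# Build the same three bytes in two different ways.
set viaBinary [binary format c3 {65 200 255}]
set viaFormat [format %c%c%c 65 200 255]

# Collect the code point of every character in the binary value.
set codes {}
foreach ch [split $viaBinary ""] {
    lappend codes [scan $ch %c]
}

puts [string equal $viaBinary $viaFormat]   ;# prints 1
puts [string length $viaBinary]             ;# prints 3
puts $codes                                 ;# prints 65 200 255
```

At the script level the two values are indistinguishable; they may differ only in which internal representation Tcl keeps for them.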
-----

The main facility for working with binary data is the [binary] command with its subcommands to dissect (scan) and join (format) binary data into/from standard Tcl values (strings, integers, lists, et cetera). For dealing with binary data on the C level see also http://www.tcl.tk/man/tcl8.4/TclLib/ByteArrObj.htm

To exchange the binary information with other applications all of the facilities of the I/O system are at our fingertips and ready to be used. But note:

   * When writing data with [puts] do not forget to use the option -nonewline, or else ''puts'' will write an additional end-of-line character after the data you actually wanted to send out.
   * Speaking of end-of-line characters I should note that another common error when creating a channel destined for exchange of binary information is forgetting to use '''[[[fconfigure] $channel -translation binary]]'''. This command reconfigures the channel to leave the characters \n and \r untouched. Without this Tcl will treat them as end-of-line characters and mangle them during input and output.
   * In most cases, binary data also needs to be input and output to and from channels in a raw form. Translating from (presumed) UTF-8 to your system's character set can be a disaster. You therefore almost certainly need '''[[[fconfigure] $channel -encoding binary]]'''. '''[DKF] sez:''' Note that setting the ''-translation'' to binary also sets the ''-encoding'' to binary, so you can usually ignore this one.
   * When reading binary information from a channel only [read] should be used. Avoid [gets]! The latter command will try to recognize end-of-line characters no matter what the channel is configured to. You can never be sure that such a character will not crop up in the middle of your packet.
   * When spawning an application which returns binary data via stdout do '''not''' use [exec], but the [[open "|..."]] idiom, as only the latter allows you to change the pipe channel to binary. [exec] hides the pipe channel and may use the wrong encoding and translation settings when reading the information from the external application.

----

On news:comp.lang.tcl, [Mac Cody] and [Jeff David] write:

 Mac Cody wrote:
 > Here is a simple example that
 > first writes binary data to a file and then reads back the
 > binary data:
 >
 > set outBinData [binary format s2Sa6B8 {100 -2} 100 foobar 01000001]
 > puts "Format done: $outBinData"
 > set fp [open binfile w]

 Important safety tip.  When dealing with binary files you should always do:

 fconfigure $fp -translation binary

 I got bit hard on this one once when my \x0a and \x0d bytes got translated.

 > puts -nonewline $fp $outBinData
 > close $fp
 > set fp [open binfile r]

 fconfigure $fp -translation binary

 > set inBinData [read $fp]
 > close $fp
 > binary scan $inBinData s2Sa6B8 val1 val2 val3 val4
 > puts "Scan done: $val1 $val2 $val3 $val4"
 >

 Jeff David

----

A post to comp.lang.tcl asks how best to embed binary data into a Tcl script. [kennykb] has this summary of the answer:

   * for the occasional non-printing character embedded in a string, use \xNN.
   * for binary data embedded in a script and simply processed as a unit, use [base64] (examples include image files, files for foreign applications that you just want to write from the script, and blocks of cyphertext).
   * for byte sequences where it's important to preserve transparency, use hexadecimal and [binary] format.

In particular, you should avoid typing binary data directly into strings. While Tcl is able to handle binary data, there are places where you can run into problems. For example, if you happen to have a Tcl script containing the literal character for a control-Z, you will find, as of Tcl 8.4, that you get a syntax error from Tcl. This is because beginning with 8.4, [\u001a is an end-of-file character in scripts]. See [source] (in particular the reference page) for more details.
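The \xNN and hexadecimal suggestions can be sketched as follows. This is only a minimal illustration, not [kennykb]'s own code: the variable names are made up for the example, and the byte sequence is the eight-byte PNG file signature, picked because it happens to contain a control-Z (0x1a) byte.

```tcl
# An occasional non-printing character: write it as a backslash escape
# instead of typing the raw control-Z into the script.
set eofByte "\x1a"

# A transparent byte sequence: spell it in hexadecimal and let
# [binary format] turn it into bytes (here, the PNG signature).
set magic [binary format H* 89504e470d0a1a0a]

# Going the other way, [binary scan] with H* recovers the hex digits.
binary scan $magic H* hexBack

puts [string length $magic]   ;# prints 8
puts $hexBack                 ;# prints 89504e470d0a1a0a
```

Written this way the script file itself contains nothing but printable 7-bit ASCII, even though the values it builds contain arbitrary bytes.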
Please note that the issue with control-Z is just a special case of a more general bit of advice for writing portable Tcl scripts. Whenever Tcl_EvalFile() (or the [source] command) reads in and evaluates the contents of a file, the reading in is done according to the system encoding. System encodings may be different on different systems. If your file of Tcl code is going to move from system to system, you should be sure that all characters in it are valid in all system encodings. This essentially means you should limit yourself to 7-bit ASCII. You can represent characters outside 7-bit ASCII using the \u quoting supported by the Tcl parser.

Hmmmm.... after a bit more reflection, it dawns on me that control-Z is part of 7-bit ASCII, so it's not a special case after all. Never mind.

----

Another tip (it's also mentioned on the [string] page, but I think it's worth repeating): [[string bytelength]] should not be used with binary data. That command measures how long the UTF-8 representation of a string is in bytes. For binary data you don't want conversion to UTF-8, so you don't want [[string bytelength]] either. Use [[string length]] instead. It's confusing but probably logical.

[SEH] The money quote appeared in the docs sometime between 8.3 and 8.4: "If the object is a ByteArray object (such as those returned from reading a binary encoded channel), then this will return the actual byte length of the object." Until this quote appeared it was by no means clear that [[string length]] was a reliable tool for measuring data block lengths in bytes. Of course this seems to contradict the concept that "everything is a string," since sometimes the thing is a byte array.

----

See [Binary representation of numbers] and [Dump a file in hex and ASCII] for examples of usage.

----

What would be a way that one could read and write C structures in Tcl in a manner that would make the intents appear obvious to the reader?
[Lars H]: One approach is to use something like this proc

 proc Cstruct_scan {data formatL} {
    set formatStr ""
    set varL [list]
    # formatL alternates between a binary scan code and a variable name
    foreach {code var} $formatL {
       append formatStr $code
       lappend varL $var
    }
    # Run the scan in the caller's scope so the variables are set there
    uplevel 1 [linsert $varL 0 ::binary scan $data $formatStr]
 }

A usage example, based on [http://partners.adobe.com/asn/tech/type/opentype/head.jsp], is

 Cstruct_scan $OTF_cmap_head {
    I   Table_version_number
    I   fontRevision
    I   checkSumAdjustment
    I   magicNumber
    B16 flags
    S   unitsPerEm
    W   created
    W   modified
    S   xMin
    S   yMin
    S   xMax
    S   yMax
    B16 macStyle
    S   lowestRecPPEM
    S   fontDirectionHint
    S   indexToLocFormat
    S   glyphDataFormat
 }

It may however be better (in cases such as this) to come up with a higher level proc than this '''Cstruct_scan'''. Even though the above decodes the binary data, it is still not quite in a natural Tcl format. Unsigned numbers may still have incorrect sign extensions, for one.

-----

Can anyone put the above page's recommendations together to form a best practices example?

----

[Category Binary Data] - [Category Characters]