Version 53 of binary scan

Updated 2013-12-01 02:26:05 by AMG

Summary

[binary scan] parses fields from a binary string, returning the number of conversions performed.

See Also

binary
binary format
format
scan
IEEE binary float to string conversion
NaN
Tcl syntax

Synopsis

binary scan string formatString ?varName varName ...?

Description

String gives the input to be parsed and formatString indicates how to parse it. Each varName gives the name of a variable; when a field is scanned from string the result is assigned to the corresponding variable.

As with [binary format], formatString consists of a sequence of zero or more field specifiers separated by zero or more spaces. Each field specifier is a single type character followed by an optional numeric count. Most field specifiers consume one argument to obtain the variable into which the scanned values should be placed. The type character specifies how the binary data is to be interpreted. The count typically indicates how many items of the specified type are taken from the data.

If present, the count is a non-negative decimal integer or *, which normally indicates that all of the remaining items in the data are to be used. If there are not enough bytes left after the current cursor position to satisfy the current field specifier, then the corresponding variable is left untouched and [binary scan] returns immediately with the number of variables that were set. If there are not enough arguments for all of the fields in the format string that consume arguments, then an error is generated.

An example similar to the one for [binary format] shows how field specifiers relate to arguments in the case of [binary scan]:

binary scan $bytes s3s first second

Provided the binary string in $bytes is long enough, this assigns a list of three integers to first and a single value to second. If $bytes contains fewer than 8 bytes (i.e. four 2-byte integers), no assignment to second will be made, and if $bytes contains fewer than 6 bytes (i.e. three 2-byte integers), no assignment to first will be made. Hence:

puts [binary scan abcdefg s3s first second]
puts $first
puts $second

will, assuming neither variable is set previously, print:

1
25185 25699 26213
can't read "second": no such variable

When the fixed-width integer types (c, s, S, i, and I) are scanned, the high bit is sign-extended to the width of the type that is actually used to store the scanned value, so values that have their high bit set (0x80 for chars, 0x8000 for shorts, 0x80000000 for ints) come out negative. Thus the following will occur:

% set signShort [binary format s1 0x8000]
% binary scan $signShort s1 val
1
% set val
-32768
% format %x $val
ffff8000

If you want to produce an unsigned value, then you can mask the return value to the desired size. For example, to produce an unsigned short value:

% set val [expr {$val & 0xFFFF}]
32768
% format %x $val
8000

Since Tcl 8.5, one may add the u modifier after the type character to get an unsigned interpretation:

% set signShort [binary format s1 0x8000]
% binary scan $signShort su1 val
1
% set val
32768
% format %x $val
8000

Each type-count pair moves an imaginary cursor through the binary data, reading bytes from the current position. The cursor is initially at position 0 at the beginning of the data.
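The cursor movement can be made visible with the x (skip forward), X (move back), and @ (absolute position) specifiers. The following is a small sketch; the six-byte sample data is made up for illustration.

```tcl
# Six bytes: 0x01 0x02 0x03 0x04 0x05 0x06 (illustrative sample data)
set data [binary format c* {1 2 3 4 5 6}]

# c   reads byte 0, cursor moves to 1
# x2  skips forward two bytes, cursor moves to 3
# c   reads byte 3, cursor moves to 4
# X4  moves back four bytes, cursor returns to 0
# c   reads byte 0 again, cursor moves to 1
# @5  jumps to absolute offset 5
# c   reads byte 5
binary scan $data cx2cX4c@5c a b c d
puts [list $a $b $c $d]
# 1 4 1 6
```

Only the four c specifiers consume variables; x, X, and @ just reposition the cursor.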


DMG 2003-12-02: It is also important to note that the f and d float types are always scanned in the native byte order of the machine doing the scanning. IEEE binary float to string conversion provides one way of converting floats stored in the other byte order. Another way is to [binary scan] the individual bytes, [binary format] them in the proper order, and [binary scan] the now-native order.

DKF: Tcl 8.5 includes additional format types (due to TIP #129) that allow float types of specified endian-ness to be handled.
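As a brief illustration of those Tcl 8.5 types, R scans a big-endian single-precision float and r a little-endian one (Q and q do the same for doubles). The sample bytes below are chosen for the example: 1.5 encodes as 0x3FC00000 in IEEE 754 single precision.

```tcl
# Requires Tcl 8.5 or later.
# 1.5 as an IEEE 754 single-precision value is 0x3FC00000.
set big    [binary format H* 3fc00000]   ;# bytes in big-endian order
set little [binary format H* 0000c03f]   ;# same value, little-endian order

binary scan $big    R fromBig      ;# R: big-endian single float
binary scan $little r fromLittle   ;# r: little-endian single float
puts [list $fromBig $fromLittle]
# 1.5 1.5
```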

Example: Padding Bytes

Arjen Markus: Suppose you have a sequence of bytes that are read from a file. Two of these, say at position 6 and 7 (counting from 0), make up an integer number. If the original data are in big-endian order, then

set two_bytes [string range $str 6 7]
binary scan "\0\0$two_bytes" I intvalue

will turn these two bytes into an integer (consisting of 4 bytes, hence the two leading nulls).

If they are in little-endian order, use:

set two_bytes [string range $str 6 7]
binary scan "${two_bytes}\0\0" i intvalue
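An alternative sketch that avoids building a padded string: position the cursor with @ and scan a 16-bit integer directly, then mask it to get the same unsigned result the padding trick produces. The sample contents of str are made up for illustration.

```tcl
# Hypothetical sample data: bytes 6 and 7 hold 0x12 0x34.
set str [binary format H* 0001020304051234]

# @6 moves the cursor to byte 6; S reads a big-endian 16-bit integer
# (use s instead for little-endian data).
binary scan $str @6S intvalue

# S is sign-extended, so mask to keep the value unsigned.
# On Tcl 8.5 or later, "@6Su" would make the mask unnecessary.
set intvalue [expr {$intvalue & 0xFFFF}]
puts $intvalue
# 4660
```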

Handling Null Characters

DMG, 2003-12-02: Question: Does anyone know of a way/hack to scan in null-terminated strings? I was somewhat surprised to see they were not part of the formatString set as they naturally fall into how Tcl works (well, how it used to). For example, I'm trying to read a file that has a 30-byte space allocated to hold 2 null-terminated strings.

2004-10-19: Try

set null_term_string [lindex [split $string \000 ] 0]
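Applied to the 30-byte field from the question, the same idea might look like the sketch below. The field contents are invented for the example, and it assumes the two strings sit back to back with null padding after them.

```tcl
# Hypothetical 30-byte field: "foo", NUL, "barbaz", NUL, then NUL padding.
# (binary format a30 pads the value with nulls up to 30 bytes.)
set data [binary format a30 "foo\0barbaz\0"]

binary scan $data a30 field
# Splitting on NUL yields the two strings followed by empty elements
# for the padding; keep only the first two.
set parts [lrange [split $field \0] 0 1]
puts $parts
# foo barbaz
```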

Unsigned Types

TIP #275, which was implemented for Tcl 8.5, adds support for scanning unsigned data types.
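A short comparison of signed and unsigned byte scanning, as a sketch with made-up sample bytes:

```tcl
# Requires Tcl 8.5 or later for the u modifier.
set data [binary format H* ff8001]   ;# bytes 0xFF 0x80 0x01

binary scan $data c*  signed     ;# signed bytes
binary scan $data cu* unsigned   ;# unsigned bytes
puts $signed
# -1 -128 1
puts $unsigned
# 255 128 1
```

The same modifier applies to the other integer types, e.g. su, Su, iu, and Iu.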


sbron 2005-09-27: I typically need unsigned results from [binary scan]. I created my own proc that adds a few new field specifiers that return unsigned values: C - unsigned byte, u - unsigned little-endian short, U - unsigned big-endian short, l - 32-bit unsigned little-endian integer, and L - 32-bit unsigned big-endian integer.

proc binscan {str fmtstr args} {
    # Create a format string using the built-in signed versions
    set format [string map {C c u s U S l i L I} $fmtstr]
    # Split the formatstring into the separate terms
    set i 0; set vars ""; set fmtlist ""
    foreach n [regexp -all -inline {[a-wA-W][0-9* ]*} $fmtstr] {
        lappend fmtlist $n
        lappend vars term([incr i])
    }

    # Execute the signed binary scan
    eval [linsert $vars 0 binary scan $str $format]
    #binary scan $str $format {*}$vars

    # Define the mask values to apply to the special format specifiers
    array set mask {C 0xff u 0xffff U 0xffff l 0xffffffff L 0xffffffff}
    # Apply the mask and assign the results to the specified variables
    set i 0
    foreach n $fmtlist v $args {
        set type [string index $n 0]
        # Link to the variable in the calling stack frame
        upvar 1 $v var
        if {[info exists mask($type)]} {
            set list ""
            foreach t $term([incr i]) {
                lappend list [expr {$t & $mask($type)}]
            }
            set var $list
        } else {
            set var $term([incr i])
        }
    }
}

kostix 2007-06-19: another wrapper for [binary scan] that mimics TIP #275 for handling of unsigned integers (so note that other enhancements introduced in 8.5 aren't emulated):

proc bscan {data f args} {
    set c 0xFF
    set s 0xFFFF
    set S 0xFFFF
    set i 0xFFFFFFFF
    set I 0xFFFFFFFF

    set outf ""
    set upos {}
    set pos -1
    set last ?

    foreach a [split $f ""] {
        if {[string equal $a u] && [string first $last csSiI] >= 0} {
            lappend upos $pos [set $last]
        } else {
            append outf $a

            if {[string first $a aAbBhHcsSiIwWfdxX@] >= 0} {
                set last $a
                incr pos
            }
        }
    }

    set count [uplevel 1 [list binary scan $data $outf] $args]

    foreach {pos mask} $upos {
        upvar 1 [lindex $args $pos] v
        set v [expr {$v & $mask}]
    }

    set count
}

It runs about 10 times slower than [binary scan] itself.

May be used to provide 8.5 compatibility like this:

if {[package vsatisfies $tcl_version 8.5]} {
    interp alias {} bscan {} binary scan
} else {
    # define the above proc here
}

Note also that this proc isn't 100% compatible with the real [binary scan], since it doesn't do any formal syntax checking of the format string.

The Secret of the B's

The following example illustrates behaviour of the B and b field specifiers that may not be obvious:

% binary scan \x07\x87\x05 b5b* var1 var2
2
% puts $var1
11100
% puts $var2
1110000110100000

The first thing to note is that the bits are "backwards", because that's what b does. B would have produced the left-to-right results humans are more used to.

The critical thing to catch is that b5 consumes at least one entire byte. This is documented by the phrase, "Any extra bits in the last byte are ignored." So $var1 contains as much of \x07 as you asked for, and the remaining three bits are simply discarded. That leaves \x87\x05 in its entirety for $var2.

No bits are lost in the following example:

% binary scan \x07\x87\x05 b* var3
1
% puts $var3
111000001110000110100000

1110 0000 1110 0001 1010 0000
7    0    7    8    5    0

In the following example, three bits of the first byte are once again lost, leading to 21 bits in the result instead of 24:

% binary scan \xff\xff\xff b5b* var7 var8
2
% puts $var7
11111
% puts $var8
1111111111111111
% string length $var8
16

Of course, using b8 for the first byte captures all its bits:

% binary scan \xff\xff\xff b8b* var3 var4
2
% string length $var3
8
% string length $var4
16

In the following example, 9 bits are consumed by the first field specifier:

% binary scan \xff\xff\xff b9b* var3 var4
2
% string length $var3
9
% string length $var4
8

The last 7 bits of the second byte are lost, as b* starts scanning at the beginning of the third byte.
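For comparison, here is a small sketch of the same byte scanned with b and with B; the variable names are arbitrary.

```tcl
set byte [binary format H2 07]   ;# the single byte 0x07

binary scan $byte b8 lowFirst    ;# b: low-to-high bit order within the byte
binary scan $byte B8 highFirst   ;# B: high-to-low bit order within the byte
puts $lowFirst
# 11100000
puts $highFirst
# 00000111
```

Both specifiers round the number of bytes consumed up to a whole byte, so the discarding behaviour described above applies to B as well.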

Proposal: Query Cursor Position

NickH: We frequently need to make several calls to binary scan to fully parse input, usually because of discriminators or counters. While it is obviously possible in principle to keep track of where we are in the scan, in practice it is difficult and/or ugly and the command already knows.

I would like to see the formats for both [binary scan] and [binary format] extended to allow the extraction of the current byte position. If this were done by adding meaning to @ with no count (currently an error), I can't see how anyone would object. An alternative would be a new format character, possibly = or ?.

Typical use would then be something like:

set offset 0

binary scan $data @${offset}<lots of stuff>@ <lots of vars> offset
if {????} {
    binary scan $data @${offset}<lots of stuff>@ <lots of vars> offset
} else {
    binary scan $data @${offset}<lots of other stuff>@ <lots of vars> offset
}
}

puts "bytes read = $offset"

For [binary format], the use would be contrary to other parameters, since the argument would be written rather than read, but it can sometimes be useful for back-patching lengths instead of using [string length] and [expr].
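For contrast, this is roughly what the manual bookkeeping looks like today. It is only a sketch: the length-prefixed record layout and the variable names are invented for the example.

```tcl
# Hypothetical input: two length-prefixed records, "abc" and "de".
set data [binary format ca3ca2 3 abc 2 de]

set offset 0
set records {}
# Each pass reads a one-byte length, then that many payload bytes.
while {[binary scan $data @${offset}c len] == 1} {
    set len [expr {$len & 0xFF}]   ;# treat the length byte as unsigned
    incr offset                    ;# account for the length byte
    if {[binary scan $data @${offset}a${len} payload] != 1} break
    incr offset $len               ;# account for the payload
    lappend records $payload
}
puts $records
# abc de
puts "bytes read = $offset"
# bytes read = 7
```

Every scan needs a matching incr that restates how many bytes the format consumed, which is exactly the duplication the proposal aims to remove.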

Functional Style

AMG: It bothers me that [binary scan] insists on writing to variables rather than simply returning a list. This prevents it from being directly used in a functional context. Well, I just discovered a handy workaround. The key insight is that [brackets] perform script substitution, not merely command substitution. This means it's legal to put more than one command in a pair of brackets; since it's a script, the commands are delimited by semicolons or newlines. The substituted value is the result of the last command. Here, see for yourself:

set bytes Hello!
puts [binary scan $bytes B* tempvar; set tempvar]
# 010010000110010101101100011011000110111100100001

More than one conversion? Instead of [set], use [list] if a list is desired. Or use [format] to concatenate the values and format them further. Single-argument [lindex] may also be useful; it simply returns its argument.

set bytes Tcl
puts [binary scan $bytes B*X*H* a b; list $a $b]
# 010101000110001101101100 54636c
puts [binary scan $bytes B*X*H* a b; format "%s = %8s" $a $b]
# 010101000110001101101100 =   54636c
puts [binary scan $bytes B*X*H* a b; lindex "$a = $b"]
# 010101000110001101101100 = 54636c
puts [binary scan $bytes B*X*H* a b; lindex [string map {0 o 1 i} $a]$b]
# oioioioooiioooiioiioiioo54636c

Obviously, [puts] can be replaced by other commands. I'm only using it for demonstration purposes.
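The same idea can be wrapped in a proc that returns a list. The sketch below is not part of Tcl; bscanl is a hypothetical helper name, and the caller has to say how many variable-consuming specifiers the format contains.

```tcl
# Hypothetical helper (requires Tcl 8.5+ for {*}): scan bytes with fmt
# and return the scanned values as a list instead of setting variables.
# nfields must equal the number of variable-consuming specifiers in fmt.
proc bscanl {bytes fmt nfields} {
    set names {}
    for {set i 0} {$i < $nfields} {incr i} {
        lappend names v$i
    }
    # count is the number of variables actually set; scanning stops
    # at the first specifier that cannot be satisfied.
    set count [binary scan $bytes $fmt {*}$names]
    set result {}
    for {set i 0} {$i < $count} {incr i} {
        lappend result [set v$i]
    }
    return $result
}

puts [bscanl Tcl B*X*H* 2]
# 010101000110001101101100 54636c
```

If the data runs out early, the returned list is simply shorter than nfields.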

Misc