[[Created Dec 9 2005 by [lexfiend].]] [netstrings] is a string encoding invented by D. J. Bernstein (formal definition here: [http://cr.yp.to/proto/netstrings.txt]). In a nutshell: :, It's 8-bit clean, easy to implement, interoperable and "safe" (at least when used correctly). Netstrings can also be trivially nested, so the list {This is a test} can be represented as 23:4:This,2:is,1:a,4:test,, More usefully, netstrings can be used to safely send/receive chunked ASCII/binary data between any two arbitrary platforms (implemented in any language) with trivial framing and length checks, and none of that nonsense about having to escape special characters in the datastream. The Quick Mail Queuing Protocol (QMQP) [http://cr.yp.to/proto/qmqp.html] that qmail [http://cr.yp.to/qmail.html] implements is an example of real-life netstrings use. # readNetString # Purpose: Reads and returns a netstring from proc readNetString {chan} { set nslen 0 set nstr "" while {![eof $chan]} { set nschar [read $chan 1] switch -glob $nschar { [0-9] { # Next char in length - add it in set nslen [expr {$nslen * 10 + $nschar}] } : { # End of length - read in $nslen bytes set nstr [read $chan $nslen] # Check to see if we've got there set readlen [string length $nstr] if {$readlen == $nslen} { # Now we must have a trailing comma if {[read $chan 1] == ","} { # We're done return $nstr } else { error "Missing delimiter at end of netstring" } } else { error "Netstring length ($nslen) does not match read length ($readlen)" } } default { error "Illegal character '$nschar' in netstring length" } } } } # writeNetString ? ...? # Purpose: Writes all s to as netstrings proc writeNetString {chan args} { foreach str $args { set writelen [string length $str] puts -nonewline $chan "${writelen}:${str}," } flush $chan } ---- [AMG], 9 Dec 2005: I changed the "Illegal character" switch arm pattern specifier from "-" to "default". Note that the above code differs from djb's netstring specification which forbids superfluous leading 0's in length prefixes, such that there is exactly one legal encoding for any string (e.g. empty string is "0:,"). [lexfiend]'s code doesn't help much with picking apart nested netstrings because it only accepts characters from a [[[read]]]able channel. Here's some code that works with Tcl strings rather than channels. proc ns_decode {data args} { switch -- [llength $args] { 0 {set pos 0} 1 {upvar 1 [lindex $args 0] pos} default {error "wrong # args: should be \"ns_decode data ?pos_var?\""} } if {![regexp -start $pos {\A(0|[1-9]\d*):} $data header length]} { error "missing or corrupt netstring header" } incr pos [string length $header] if {![regexp -start $pos "\\A(.{$length})," $data body payload]} { error "missing, corrupt, or truncated netstring body" } incr pos [string length $body] return $payload } proc ns_encode {data} { return [string length $data]:$data, } My code isn't good for data arriving from a channel since the channel reader doesn't know in advance how many bytes to read, but it works for data that's already read, as in the case of nested netstrings. It would be pretty nice (and semantically WEIRD!) if we could unify these two cases, that is, somehow have Tcl objects backed by channels rather than simple RAM, and be able to [[[regexp]]] over them and such. The following works with lists in the format given by [lexfiend], that is, netstring wrapping a simple concatenation of netstrings: proc list_ns_decode {data} { set pos 0 set result [list] while {$pos < [string length $data]} { lappend result [ns_decode $data pos] } return $result } proc list_ns_encode {args} { set result "" foreach arg $args { append result [ns_encode $arg] } return $result } % ns_encode [list_ns_encode This is a test] 23:4:This,2:is,1:a,4:test,, % list_ns_decode [ns_decode 23:4:This,2:is,1:a,4:test,,] This is a test In practice I wouldn't use this "tree" (nested list) format since it doesn't clearly express the difference between interior and leaf nodes; given arbitrary data, you don't know how many levels deep you need to decode. Perhaps I should illustrate: % interp alias {} unknown "" list_ns_encode % ns_encode [a [b [c d]] e f [g] [] [h []]] 51:1:a,15:1:b,8:1:c,1:d,,,1:e,1:f,4:1:g,,0:,7:1:h,0:,,, (That's possibly the strangest use of [[[unknown]]] I've ever seen.) Anyway, this string of bits could legitimately be an encoding of any of the following lists: {1:a,15:1:b,8:1:c,1:d,,,1:e,1:f,4:1:g,,0:,7:1:h,0:,,} { a 1:b,8:1:c,1:d,, e f 1:g, { } 1:h,0:, } { a { b 1:c,1:d,} e f {g} { } { h { }}} { a { b { c d}} e f {g} { } { h { }}} Plus other variations. So the recipient of encoded data must know in advance how it is supposed to be structured; this encoding is not (reliably, unambiguously) self-describing. Perhaps use type characters, for instance '''d''' for leaf nodes and '''l''' for interior nodes. Also, in keeping with the netstring spirit, for interior nodes give the number of child nodes (i.e. the [[[llength]]]) in the encoding. I'll call this nl for netlist. proc nl_data_encode {data} { return [string length $data]d:$data, } proc nl_list_encode {args} { return [string length [join $args ""]]l[llength $args]:[join $args ""], } proc nl_decode {data args} { switch -- [llength $args] { 0 {set pos 0} 1 {upvar 1 [lindex $args 0] pos} default {error "wrong # args: should be \"nl_decode data ?pos_var?\""} } if {![regexp -start $pos {\A(0|[1-9]\d*)(?:(d)|(l)(0|[1-9]\d*)):} $data\ header length d_type l_type l_length]} { error "missing or corrupt netlist header" } incr pos [string length $header] set end [expr {$pos + $length - 1}] if {[string index $data [expr {$end + 1}]] ne ","} { error "missing or corrupt netlist trailer" } if {$d_type eq "d"} { set payload [string range $data $pos $end] } elseif {$l_type eq "l"} { set payload [list] while {[llength $payload] < $l_length} { lappend payload [nl_decode [string range $data 0 $end] pos] } if {$pos != $end + 1} { error "extra characters after final netlist element" } } set pos [expr {$end + 2}] return $payload } Considerably more complex... Differentiating between "data" elements and "list" elements is neither easy nor comfortable. Oh well, let's test. % interp alias {} d {} nl_data_encode % interp alias {} l {} nl_list_encode % d foobar 6d:foobar, % l [d foobar] [d quux] 18l2:6d:foobar,4d:quux,, % l [d foobar] [l [d basby] [l]] [d quux] 38l3:6d:foobar,14l2:5d:basby,0l0:,,4d:quux,, % nl_decode 38l3:6d:foobar,14l2:5d:basby,0l0:,,4d:quux,, foobar {basby {}} quux Seems to work. If you haven't noticed, there's an impedance mismatch here between Tcl and, well, the rest of the world. Tcl has indistinguishable encodings for any data object and a single element list containing that same data object. (See [K] for '''K*'''.) % K* a % ns_encode a a 1:a, % list a % ns_encode [list_ns_encode a] a 4:1:a,, % list [list a] % ns_encode [list_ns_encode [list_ns_encode a]] a 7:4:1:a,,, As a consequence, for all the complex unambiguous list representation infrastructure I put into netlist, nl_decode can't always communicate the list structure to its caller. Watch: % d frobozz 7d:frobozz, % nl_decode 7d:frobozz, frobozz % l [l [l [l [l [d frobozz]]]]] 35l1:29l1:23l1:17l1:11l1:7d:frobozz,,,,,, % nl_decode 35l1:29l1:23l1:17l1:11l1:7d:frobozz,,,,,, frobozz Aw, that's a shame. ---- [[ [Category Security] ]] Why did I ([AMG]) put this in Security? Even though Tcl is thought not to have buffer overflows and other such problems, many other programs do, especially those written for ad-hoc network protocols. Netstrings were developed to make it easy to design protocols for which it's difficult to write buffer overflow-ridden software. Apparently protocols need to be hardened against programmer stupidity. :^)