[[Started Dec 9 2005 by [lexfiend], who's quite astonished that ''anyone'' took any notice of this obscure encoding.]] [AMG]: s/obscure/awesome/ [netstrings] is a string encoding invented by D. J. Bernstein (formal definition here: [http://cr.yp.to/proto/netstrings.txt]). In a nutshell: :, It's 8-bit clean, easy to implement, interoperable and "safe" (at least when used correctly). Netstrings can also be trivially nested, so the list {This is a test} can be represented as 23:4:This,2:is,1:a,4:test,, More usefully, netstrings can be used to safely send/receive chunked ASCII/binary data between any two arbitrary platforms (implemented in any language) with trivial framing and length checks, and none of that nonsense about having to escape special characters in the datastream. The Quick Mail Queuing Protocol (QMQP) [http://cr.yp.to/proto/qmqp.html] that qmail [http://cr.yp.to/qmail.html] implements is an example of real-life netstrings use. # readNetString # Purpose: Reads and returns a netstring from proc readNetString {chan} { set nslen 0 set nstr "" while {![eof $chan]} { set nschar [read $chan 1] switch -glob $nschar { [0-9] { # Next char in length - add it in set nslen [expr {$nslen * 10 + $nschar}] } : { # End of length - read in $nslen bytes set nstr [read $chan $nslen] # Check to see if we've got there set readlen [string length $nstr] if {$readlen == $nslen} { # Now we must have a trailing comma if {[read $chan 1] == ","} { # We're done return $nstr } else { error "Missing delimiter at end of netstring" } } else { error "Netstring length ($nslen) does not match read length ($readlen)" } } default { error "Illegal character '$nschar' in netstring length" } } } } # writeNetString ? ...? # Purpose: Writes all s to as netstrings proc writeNetString {chan args} { foreach str $args { set writelen [string length $str] puts -nonewline $chan "${writelen}:${str}," } flush $chan } ---- [AMG], 9 Dec 2005: I changed the "Illegal character" switch arm pattern specifier from "-" to "default". ''[lexfiend], 10 Dec 2005: Oof! Talk about a PEBKAC error! Thanks for catching that -- it was very late when I wrote that code.'' [AMG]: Note that the above code differs from djb's netstring specification which forbids superfluous leading 0's in length prefixes, such that there is exactly one legal encoding for any string (e.g. empty string is "0:,"). ''[lexfiend]: Good point. That's in fact a not-so-well-known aspect of netstrings, and I notice even djb's sample code doesn't handle this case correctly. Will fix it some other time, when it's not so late...'' [lexfiend]'s code doesn't help much with picking apart nested netstrings because it only accepts characters from a [[[read]]]able channel. Here's some code that works with Tcl strings rather than channels. proc ns_decode {data args} { switch -- [llength $args] { 0 {set pos 0} 1 {upvar 1 [lindex $args 0] pos} default {error "wrong # args: should be \"ns_decode data ?pos_var?\""} } if {![regexp -start $pos {\A(0|[1-9]\d*):} $data header length]} { error "missing or corrupt netstring header" } incr pos [string length $header] if {![regexp -start $pos "\\A(.{$length})," $data body payload]} { error "missing, corrupt, or truncated netstring body" } incr pos [string length $body] return $payload } proc ns_encode {data} { return [string length $data]:$data, } My code isn't good for data arriving from a channel since the channel reader doesn't know in advance how many bytes to read, but it works for data that's already read, as in the case of nested netstrings. It would be pretty nice (and semantically WEIRD!) if we could unify these two cases, that is, somehow have Tcl objects backed by channels rather than simple RAM, and be able to [[[regexp]]] over them and such. ''[lexfiend]: Thanks for writing the equivalent string-based implementation. The use-case that I envisioned for my code simply to read/write non-nested netstrings as part of a simple application protocol like PIRP [http://cr.yp.to/proto/pirp.txt], but your code would certainly come in handy for, say, implementing a QMQP client or server.'' [AMG]: The following works with lists in the format given by [lexfiend], that is, netstring wrapping a simple concatenation of netstrings: proc list_ns_decode {data} { set pos 0 set result [list] while {$pos < [string length $data]} { lappend result [ns_decode $data pos] } return $result } proc list_ns_encode {args} { set result "" foreach arg $args { append result [ns_encode $arg] } return $result } % ns_encode [list_ns_encode This is a test] 23:4:This,2:is,1:a,4:test,, % list_ns_decode [ns_decode 23:4:This,2:is,1:a,4:test,,] This is a test In practice I wouldn't use this "tree" (nested list) format since it doesn't clearly express the difference between interior and leaf nodes; given arbitrary data, you don't know how many levels deep you need to decode. Perhaps I should illustrate: % interp alias {} unknown "" list_ns_encode % ns_encode [a [b [c d]] e f [g] [] [h []]] 51:1:a,15:1:b,8:1:c,1:d,,,1:e,1:f,4:1:g,,0:,7:1:h,0:,,, (That's possibly the strangest use of [[[unknown]]] I've ever seen.) Anyway, this string of bits could legitimately be an encoding of any of the following lists: {1:a,15:1:b,8:1:c,1:d,,,1:e,1:f,4:1:g,,0:,7:1:h,0:,,} { a 1:b,8:1:c,1:d,, e f 1:g, { } 1:h,0:, } { a { b 1:c,1:d,} e f {g} { } { h { }}} { a { b { c d}} e f {g} { } { h { }}} Plus other variations. So the recipient of encoded data must know in advance how it is supposed to be structured; this encoding is not (reliably, unambiguously) self-describing. ''[lexfiend]: Personally, I thought an underlying assumption behind netstrings was that both sides already knew what to expect, so a self-describing format was neither needed nor desired (along that path lies XML), though it's certainly an interesting mental exercise. 8-)'' Perhaps use type characters, for instance '''d''' for leaf nodes and '''l''' for interior nodes. Also, in keeping with the netstring spirit, for interior nodes give the number of child nodes (i.e. the [[[llength]]]) in the encoding. I'll call this nl for netlist. proc nl_data_encode {data} { return [string length $data]d:$data, } proc nl_list_encode {args} { return [string length [join $args ""]]l[llength $args]:[join $args ""], } proc nl_decode {data args} { switch -- [llength $args] { 0 {set pos 0} 1 {upvar 1 [lindex $args 0] pos} default {error "wrong # args: should be \"nl_decode data ?pos_var?\""} } if {![regexp -start $pos {\A(0|[1-9]\d*)(?:(d)|(l)(0|[1-9]\d*)):} $data\ header length d_type l_type l_length]} { error "missing or corrupt netlist header" } incr pos [string length $header] set end [expr {$pos + $length - 1}] if {[string index $data [expr {$end + 1}]] ne ","} { error "missing or corrupt netlist trailer" } if {$d_type eq "d"} { set payload [string range $data $pos $end] } elseif {$l_type eq "l"} { set payload [list] while {[llength $payload] < $l_length} { lappend payload [nl_decode [string range $data 0 $end] pos] } if {$pos != $end + 1} { error "extra characters after final netlist element" } } set pos [expr {$end + 2}] return $payload } Considerably more complex... Differentiating between "data" elements and "list" elements is neither easy nor comfortable. Oh well, let's test. % interp alias {} d {} nl_data_encode % interp alias {} l {} nl_list_encode % d foobar 6d:foobar, % l [d foobar] [d quux] 18l2:6d:foobar,4d:quux,, % l [d foobar] [l [d basby] [l]] [d quux] 38l3:6d:foobar,14l2:5d:basby,0l0:,,4d:quux,, % nl_decode 38l3:6d:foobar,14l2:5d:basby,0l0:,,4d:quux,, foobar {basby {}} quux Seems to work. If you haven't noticed, there's an impedance mismatch here between Tcl and, well, the rest of the world. Tcl has indistinguishable encodings for any data object and a single element list containing that same data object. (See [K] for '''K*'''.) % K* a % ns_encode a a 1:a, % list a % ns_encode [list_ns_encode a] a 4:1:a,, % list [list a] % ns_encode [list_ns_encode [list_ns_encode a]] a 7:4:1:a,,, - [jmn] 2005-12-11. Regarding this 'impedance mismatch'. One way to deal with this is with 'tagged' lists in Tcl. Nested empty lists/tuples, or nested lists/tuples where the inner list/tuple has one element are reasonably common in Erlang and manipulating such lists as ordinary Tcl lists can be tricky to say the least. I use a tagged list mechanism when transferring data back & forth between Tcl & [Erlang]. e.g %package require erlish %erlish::term tcl {"somestring",someatom,6,[],[[]],[x],[[x]],{{{}}}} TERMLIST {STRING mystring} {ATOM someatom} {INT 6} LIST {LIST LIST} {LIST {ATOM x}} {LIST {LIST {ATOM x}}} {TUPLE {TUPLE TUPLE}} With this package you can even preserve whitespace by using 'compound' tags. 'lindex 0 0' will always give you the 'tag' i.e the type of list you're dealing with, and the rest of the tag contains a whitespace encoding pointing to the positions in the data where it occurs. This is similar in concept to transmitting a list's 'nesting traversal' talked about below... but of course both these mechanisms require you have all the data prior to transmitting.. so it can make streaming a little difficult. anyway.. sorry for the sidetrack - I'll make a page for the erlish package some time. As a consequence, for all the complex unambiguous list representation infrastructure I put into netlist, nl_decode can't always communicate the list structure to its caller. Watch: % d frobozz 7d:frobozz, % nl_decode 7d:frobozz, frobozz % l [l [l [l [l [d frobozz]]]]] 35l1:29l1:23l1:17l1:11l1:7d:frobozz,,,,,, % nl_decode 35l1:29l1:23l1:17l1:11l1:7d:frobozz,,,,,, frobozz Aw, that's a shame. ---- ''[lexfiend]: Actually, I'd embed the type signature as a netstring rather than shoehorning it into the length field, like this:'' ::,:,...:,, ''For example, if '{n}' is the type signature for a list of length :'' 26:2:{1},15:2:{1},7:frobozz,,, ''would represent a list of a list of 1 element (frobozz). In fact, with type signature compression, your last example:'' {{{{{frobozz}}}}} ''could be encoded thusly:'' 26:11:{{{{{1}}}}},7:frobozz,, ([AMG] interrupts: how does one programmatically construct a list such as {{{{{frobozz}}}}}?) ''and:'' {{foo {bar {baz 321} this {is {a}}}} test} ''would be:'' 70:19:{{1{1{2}1{1{1}}}}1},3:foo,3:bar,3:baz,3:321,4:this,2:is,1:a,4:test,, ''This method has the advantage of interoperating with any standard netstring implementation, yet any program that can't understand or wasn't expecting the (because, for instance, it was a misrouted message) can safely junk the entire data structure without losing sync with the other side.'' [AMG]: That's very clever. I like! Eschewing nesting/recursion in favor of transmitting the infix traversal... radical. Maybe I'll write an implementation for it here, if someone doesn't beat me to it. I called my little protocol netlist instead of netstring to emphasize the incompatibility, but about said incompatibility I wasn't happy at the time and I'm not happy now because I'd like to be able to use good, existing netstring-aware code, for instance execline [http://www.skarnet.org/software/execline/] (see [http://www.skarnet.org/software/execline/el_transform.html#netstrings] for the specific usage). Hmm, [execline] is interesting stuff; we should have a page on it! From a Tcl perspective I say that for self-describing data, list structure is the only type information I think is worth advertising: everything else is ordinary strings (i.e. isn't further encoded*). If this was wiki.[python].tk, we probably wouldn't even be talking about netstrings since Pickle does everything we want anyway, and we wouldn't be concerned about compatibility with inferior languages. :^) Okay, actually I don't know what Pythoneers think but isn't this what Pickle is for? * Of course you can argue that in netstrings every data type is encoded as strings, for instance the string "50" is an encoding of the integer that is equal to the number of stars on the circa-2005 US flag. But Tcl'ers don't care since Tcl performs the same encoding but does so without bothering the programmer about it. ---- Why did I ([AMG]) put this in Security? Even though Tcl is thought not to have buffer overflows and other such problems, many other programs do, especially those written for ad-hoc network protocols. Netstrings were developed to make it easy to design protocols for which it's difficult to write buffer overflow-ridden software. Apparently protocols need to be hardened against programmer stupidity. :^) [lexfiend]: Netstrings are also designed to make it easy to design '''robust''' protocols, in that the simple combination of explicit length and framing characters help ensure that ''any'' traffic that obeys netstring conventions can be read and dealt with reliably and predictably, unlike (say) line-oriented protocols whose implementors/users occasionally forget to escape line-breaks in the actual data... [AMG]: djb says that there are two kinds of interfaces: '''user''' interfaces and '''good''' interfaces. Line-oriented network protocols (file formats, too!) fall in the first category, and they're plagued with all sorts of funny problems. But they're easy for puny hu-mans to read and write directly. Netstring falls in the second category; it's great for computers to talk to each other, but users need an intermediary tool in order to correctly read and write. Maybe that's not so bad, as long as the encoding protocol is simple enough that the intermediary tool itself is unlikely to be buggy. In other words, direct compatibility with [[[telnet]]] isn't a worthwhile feature. I'm spoiled on Tcl's nice buffering, so I usually wind up making my protocols line-oriented. Maybe I should break this habit and stop trying to handle packet arrival events directly in my [[[fileevent] readable]] callback. Instead I disable buffering and handle characters as they come; the readable callback slices things into packets and calls the appropriate registered high-level handler. (By the way, is it possible to [[[event generate]]] without a widget?) Okay, that gives me an idea for how to unify code that reads netstrings from the wire and from inside of already-read netstrings. Actually I've done this before, with a few minor variations, and it works but looks complex. Yet I also find it to be inevitable: does anyone have a simpler idea that'll still yield a correct, full-fledged I/O handler? The readable callback [[[read]]]s all available data, then passes it to a netstring decoder function capable of returning all the full packets it sees in the data, telling the caller how many input characters it consumed and will never need again, reporting encoding errors, and attempting to resynchronize after error. The readable callback stuffs read data into one end of a ring buffer and the netstring decoder pulls it out at the other end, leaving behind bits that comprise partially received packets. This netstring decoder function doesn't know or care where its data comes from, so it can be used to get at nested netstrings. ---- [[ [Category Interprocess Communication] | [Category Security] ]]