[[Started Dec 9 2005 by [lexfiend], who's quite astonished that ''anyone'' took any notice of this obscure encoding.]] [AMG]: s/obscure/awesome/ [netstrings] is a string encoding invented by D. J. Bernstein (formal definition here: [http://cr.yp.to/proto/netstrings.txt]). In a nutshell: :, It's 8-bit clean, easy to implement, interoperable and "safe" (at least when used correctly). Netstrings can also be trivially nested, so the list {This is a test} can be represented as 23:4:This,2:is,1:a,4:test,, More usefully, netstrings can be used to safely send/receive chunked ASCII/binary data between any two arbitrary platforms (implemented in any language) with trivial framing and length checks, and none of that nonsense about having to escape special characters in the datastream. The Quick Mail Queuing Protocol (QMQP) [http://cr.yp.to/proto/qmqp.html] that qmail [http://cr.yp.to/qmail.html] implements is an example of real-life netstrings use. # readNetString # Purpose: Reads and returns a netstring from proc readNetString {chan} { set nslen 0 set nstr "" while {![eof $chan]} { set nschar [read $chan 1] switch -glob $nschar { [0-9] { # Next char in length - add it in set nslen [expr {$nslen * 10 + $nschar}] } : { # End of length - read in $nslen bytes set nstr [read $chan $nslen] # Check to see if we've got there set readlen [string length $nstr] if {$readlen == $nslen} { # Now we must have a trailing comma if {[read $chan 1] == ","} { # We're done return $nstr } else { error "Missing delimiter at end of netstring" } } else { error "Netstring length ($nslen) does not match read length ($readlen)" } } default { error "Illegal character '$nschar' in netstring length" } } } } # writeNetString ? ...? # Purpose: Writes all s to as netstrings proc writeNetString {chan args} { foreach str $args { set writelen [string length $str] puts -nonewline $chan "${writelen}:${str}," } flush $chan } ---- [AMG], 9 Dec 2005: I changed the "Illegal character" switch arm pattern specifier from "-" to "default". ''[lexfiend], 10 Dec 2005: Oof! Talk about a PEBKAC error! Thanks for catching that -- it was very late when I wrote that code.'' [AMG]: Note that the above code differs from djb's netstring specification which forbids superfluous leading 0's in length prefixes, such that there is exactly one legal encoding for any string (e.g. empty string is "0:,"). ''[lexfiend]: Good point. That's in fact a not-so-well-known aspect of netstrings, and I notice even djb's sample code doesn't handle this case correctly. Will fix it some other time, when it's not so late...'' [lexfiend]'s code doesn't help much with picking apart nested netstrings because it only accepts characters from a [[[read]]]able channel. Here's some code that works with Tcl strings rather than channels. proc ns_decode {data args} { switch -- [llength $args] { 0 {set pos 0} 1 {upvar 1 [lindex $args 0] pos} default {error "wrong # args: should be \"ns_decode data ?pos_var?\""} } if {![regexp -start $pos {\A(0|[1-9]\d*):} $data header length]} { error "missing or corrupt netstring header" } incr pos [string length $header] if {![regexp -start $pos "\\A(.{$length})," $data body payload]} { error "missing, corrupt, or truncated netstring body" } incr pos [string length $body] return $payload } proc ns_encode {data} { return [string length $data]:$data, } My code isn't good for data arriving from a channel since the channel reader doesn't know in advance how many bytes to read, but it works for data that's already read, as in the case of nested netstrings. It would be pretty nice (and semantically WEIRD!) if we could unify these two cases, that is, somehow have Tcl objects backed by channels rather than simple RAM, and be able to [[[regexp]]] over them and such. ''[lexfiend]: Thanks for writing the equivalent string-based implementation. The use-case that I envisioned for my code simply to read/write non-nested netstrings as part of a simple application protocol like PIRP [http://cr.yp.to/proto/pirp.txt], but your code would certainly come in handy for, say, implementing a QMQP client or server.'' [AMG]: The following works with lists in the format given by [lexfiend], that is, netstring wrapping a simple concatenation of netstrings: proc list_ns_decode {data} { set pos 0 set result [list] while {$pos < [string length $data]} { lappend result [ns_decode $data pos] } return $result } proc list_ns_encode {args} { set result "" foreach arg $args { append result [ns_encode $arg] } return $result } % ns_encode [list_ns_encode This is a test] 23:4:This,2:is,1:a,4:test,, % list_ns_decode [ns_decode 23:4:This,2:is,1:a,4:test,,] This is a test In practice I wouldn't use this "tree" (nested list) format since it doesn't clearly express the difference between interior and leaf nodes; given arbitrary data, you don't know how many levels deep you need to decode. Perhaps I should illustrate: % interp alias {} unknown "" list_ns_encode % ns_encode [a [b [c d]] e f [g] [] [h []]] 51:1:a,15:1:b,8:1:c,1:d,,,1:e,1:f,4:1:g,,0:,7:1:h,0:,,, (That's possibly the strangest use of [[[unknown]]] I've ever seen.) Anyway, this string of bits could legitimately be an encoding of any of the following lists: {1:a,15:1:b,8:1:c,1:d,,,1:e,1:f,4:1:g,,0:,7:1:h,0:,,} { a 1:b,8:1:c,1:d,, e f 1:g, { } 1:h,0:, } { a { b 1:c,1:d,} e f {g} { } { h { }}} { a { b { c d}} e f {g} { } { h { }}} Plus other variations. So the recipient of encoded data must know in advance how it is supposed to be structured; this encoding is not (reliably, unambiguously) self-describing. ''[lexfiend]: Personally, I thought an underlying assumption behind netstrings was that both sides already knew what to expect, so a self-describing format was neither needed nor desired (along that path lies XML), though it's certainly an interesting mental exercise. 8-)'' Perhaps use type characters, for instance '''d''' for leaf nodes and '''l''' for interior nodes. Also, in keeping with the netstring spirit, for interior nodes give the number of child nodes (i.e. the [[[llength]]]) in the encoding. I'll call this nl for netlist. proc nl_data_encode {data} { return [string length $data]d:$data, } proc nl_list_encode {args} { return [string length [join $args ""]]l[llength $args]:[join $args ""], } proc nl_decode {data args} { switch -- [llength $args] { 0 {set pos 0} 1 {upvar 1 [lindex $args 0] pos} default {error "wrong # args: should be \"nl_decode data ?pos_var?\""} } if {![regexp -start $pos {\A(0|[1-9]\d*)(?:(d)|(l)(0|[1-9]\d*)):} $data\ header length d_type l_type l_length]} { error "missing or corrupt netlist header" } incr pos [string length $header] set end [expr {$pos + $length - 1}] if {[string index $data [expr {$end + 1}]] ne ","} { error "missing or corrupt netlist trailer" } if {$d_type eq "d"} { set payload [string range $data $pos $end] } elseif {$l_type eq "l"} { set payload [list] while {[llength $payload] < $l_length} { lappend payload [nl_decode [string range $data 0 $end] pos] } if {$pos != $end + 1} { error "extra characters after final netlist element" } } set pos [expr {$end + 2}] return $payload } Considerably more complex... Differentiating between "data" elements and "list" elements is neither easy nor comfortable. Oh well, let's test. % interp alias {} d {} nl_data_encode % interp alias {} l {} nl_list_encode % d foobar 6d:foobar, % l [d foobar] [d quux] 18l2:6d:foobar,4d:quux,, % l [d foobar] [l [d basby] [l]] [d quux] 38l3:6d:foobar,14l2:5d:basby,0l0:,,4d:quux,, % nl_decode 38l3:6d:foobar,14l2:5d:basby,0l0:,,4d:quux,, foobar {basby {}} quux Seems to work. If you haven't noticed, there's an impedance mismatch here between Tcl and, well, the rest of the world. Tcl has indistinguishable encodings for any data object and a single element list containing that same data object. (See [K] for '''K*'''.) % K* a % ns_encode a a 1:a, % list a % ns_encode [list_ns_encode a] a 4:1:a,, % list [list a] % ns_encode [list_ns_encode [list_ns_encode a]] a 7:4:1:a,,, As a consequence, for all the complex unambiguous list representation infrastructure I put into netlist, nl_decode can't always communicate the list structure to its caller. Watch: % d frobozz 7d:frobozz, % nl_decode 7d:frobozz, frobozz % l [l [l [l [l [d frobozz]]]]] 35l1:29l1:23l1:17l1:11l1:7d:frobozz,,,,,, % nl_decode 35l1:29l1:23l1:17l1:11l1:7d:frobozz,,,,,, frobozz Aw, that's a shame. [jmn] 2005-12-11. Regarding this 'impedance mismatch'. One way to deal with this is with 'tagged' lists in Tcl. Nested empty lists/tuples, or nested lists/tuples where the inner list/tuple has one element are reasonably common in Erlang and manipulating such lists as ordinary Tcl lists can be tricky to say the least. I use a tagged list mechanism when transferring data back & forth between Tcl & [Erlang]. e.g. %package require erlish %erlish::term tcl {"somestring",someatom,6,[],[[]],[x],[[x]],{{{}}}} TERMLIST {STRING somestring} {ATOM someatom} {INT 6} LIST {LIST LIST} {LIST {ATOM x}} {LIST {LIST {ATOM x}}} {TUPLE {TUPLE TUPLE}} With this package you can even preserve whitespace by using 'compound' tags. 'lindex 0 0' will always give you the 'tag' i.e the type of list you're dealing with, and the rest of the tag contains a whitespace encoding pointing to the positions in the data where it occurs. This is similar in concept to transmitting a list's 'nesting traversal' talked about below... but of course both these mechanisms require you have all the data prior to transmitting.. so it can make streaming a little difficult. Anyway.. sorry for the sidetrack - I'll make a page for the erlish package some time. ---- ''[lexfiend]: Actually, I'd embed the type signature as a netstring rather than shoehorning it into the length field, like this:'' ::,:,...:,, ''For example, if '{n}' is the type signature for a list of length :'' 26:2:{1},15:2:{1},7:frobozz,,, ''would represent a list of a list of 1 element (frobozz). In fact, with type signature compression, your last example:'' {{{{{frobozz}}}}} ''could be encoded thusly:'' 26:11:{{{{{1}}}}},7:frobozz,, ([AMG] interrupts: how does one programmatically construct a list such as {{{{{frobozz}}}}}? - [RS]: See below) ''and:'' {{foo {bar {baz 321} this {is {a}}}} test} ''would be:'' 70:19:{{1{1{2}1{1{1}}}}1},3:foo,3:bar,3:baz,3:321,4:this,2:is,1:a,4:test,, ''This method has the advantage of interoperating with any standard netstring implementation, yet any program that can't understand or wasn't expecting the (because, for instance, it was a misrouted message) can safely junk the entire data structure without losing sync with the other side.'' [AMG]: That's very clever. I like! Eschewing nesting/recursion in favor of transmitting the infix traversal... radical. Maybe I'll write an implementation for it here, if someone doesn't beat me to it. I called my little protocol netlist instead of netstring to emphasize the incompatibility, but about said incompatibility I wasn't happy at the time and I'm not happy now because I'd like to be able to use good, existing netstring-aware code, for instance execline [http://www.skarnet.org/software/execline/] (see [http://www.skarnet.org/software/execline/el_transform.html#netstrings] for the specific usage). Hmm, [execline] is interesting stuff; we should have a page on it! From a Tcl perspective I say that for self-describing data, list structure is the only type information I think is worth advertising: everything else is ordinary strings (i.e. isn't further encoded*). If this was wiki.[python].tk, we probably wouldn't even be talking about netstrings since Pickle does everything we want anyway, and we wouldn't be concerned about compatibility with inferior languages. :^) Okay, actually I don't know what Pythoneers think but isn't this what Pickle is for? * Of course you can argue that in netstrings every data type is encoded as strings, for instance the string "50" is an encoding of the integer that is equal to the number of stars on the circa-2005 US flag. But Tcl'ers don't care since Tcl performs the same encoding but does so without bothering the programmer about it. [Lars H]: Is that "Pickle" the same as the Bencoding used by BitTorrent [http://www.bittorrent.com/protocol.html]? Bencoding is pretty close to the "netstring with type characters" described above. [AMG]: I refer to [http://docs.python.org/lib/module-pickle.html]. The BitTorrent document's choice of metasyntactic variables (spam and eggs) strongly suggests that it's Python-based, so yes it's quite likely that the two encodings are the same. I'm having trouble finding documentation for the official Python pickling protocols (there are several numbered versions plus binary and textual variants), but I remember reading about it in a book I loaned away once upon a time. It was a good book: the Visual Quick Start guide to Python. Check it out if you want to learn Python. [Lars H]: Well, browsing a bit of documentation about Pickle quickly killed off any inclinations I might have had towards learning [Python]. (Is this what these people have to go through to serialize data?? The horror, the horror!!!) This only strengthens my conviction that Tcl has gotten it right -- [everything is a string]. ---- Why did I ([AMG]) put this in Security? Even though Tcl is thought not to have buffer overflows and other such problems, many other programs do, especially those written for ad-hoc network protocols. Netstrings were developed to make it easy to design protocols for which it's difficult to write buffer overflow-ridden software. Apparently protocols need to be hardened against programmer stupidity. :^) [lexfiend]: Netstrings are also designed to make it easy to design '''robust''' protocols, in that the simple combination of explicit length and framing characters help ensure that ''any'' traffic that obeys netstring conventions can be read and dealt with reliably and predictably, unlike (say) line-oriented protocols whose implementors/users occasionally forget to escape line-breaks in the actual data... [AMG]: djb says that there are two kinds of interfaces: '''user''' interfaces and '''good''' interfaces. Line-oriented network protocols (file formats, too!) fall in the first category, and they're plagued with all sorts of funny problems. But they're easy for puny hu-mans to read and write directly. Netstring falls in the second category; it's great for computers to talk to each other, but users need an intermediary tool in order to correctly read and write. Maybe that's not so bad, as long as the encoding protocol is simple enough that the intermediary tool itself is unlikely to be buggy. In other words, direct compatibility with [[[telnet]]] and soft, fleshy brains isn't a worthwhile feature. I'm spoiled on Tcl's nice buffering, so I usually wind up making my protocols line-oriented. Maybe I should break this habit and stop trying to handle packet arrival events directly in my [[[fileevent] readable]] callback. Instead I disable buffering and handle characters as they come; the readable callback slices things into packets and calls the appropriate registered high-level handler. (By the way, is it possible to [[[event generate]]] without a widget? [lexfiend]: Do you mean [Custom event notification]?) Okay, that gives me an idea for how to unify code that reads netstrings from the wire and from inside of already-read netstrings. Actually I've done this before, with a few minor variations, and it works but looks complex. Yet I also find it to be inevitable: does anyone have a simpler idea that'll still yield a correct, full-fledged I/O handler? The readable callback [[[read]]]s all available data, then passes it to a netstring decoder function capable of returning all the full packets it sees in the data, telling the caller how many input characters it consumed and will never need again, reporting encoding errors, and attempting to resynchronize after error. The readable callback stuffs read data into one end of a ring buffer and the netstring decoder pulls it out at the other end, leaving behind bits that comprise partially received packets. This netstring decoder function doesn't know or care where its data comes from, so it can be used to get at nested netstrings. ---- [RS] How to make deep nestings programmatically? By constructing a suitable string :) proc deeplist {n content} { return [string repeat \{ $n]$content[string repeat \} $n] } % deeplist 5 frobozz {{{{{frobozz}}}}} [Lars H] thinks the problem was rather about how to make [list] generate these deep nestings, like so: proc repeatlist {n content} { for {set k 1} {$k<=$n} {incr k} { set content [list $content] } return $content } The trick there is merely to pick a nontrivial content: % repeatlist 5 "fro bozz" {{{{{fro bozz}}}}} % repeatlist 5 "{frobozz}" {{{{{{frobozz}}}}}} [AMG]: Is there an easy, reliable way to transform "trivial" content into a "nontrivial" form without changing its meaning? [Lars H]: Not in general, because if that content is a string then encoding and meaning is the same. However if the content is itself encoded then there are often loopholes that can be exploited. E.g. if the content is the ''list'' "frobozz", then it can be encoded also as the string "\146robozz", which due to the backslash will be braced by [list]. ---- ''[escargo]'' - This ''netstrings'' seems like a way of representing ''values'' without much notion of ''structure''. Some of the encodings used above try to encode structure as well as value. The notion of ''netstrings'' seems to be almost the opposite of [XML]. If one does not want the overhead of [XML], than imposing structure could be done with [JSON] or [YAML]. [AMG]: Netstring doesn't know anything about structure, but that doesn't mean its payload can't be self-describing. [lexfiend] and I have been bandying about with methods of encoding ''lists'' in netstring style. Sure, we could just put the Tcl string representation in a netstring, but generating and dissecting Tcl-style lists is comparatively hard to do correctly and securely. These list encodings, particularly the two [lexfiend] came up with, are much better I think. ([lexfiend]'s first is for when both sides know the structure, and his second is self-describing for sending arbitrary lists and trees. My proposal probably shouldn't be used because it's incompatible with existing code, fairly complicated, and doesn't offer any benefit over the alternatives.) ---- About the netstring spec... I'm probably missing the point (in a big way), but I don't see how Bernstein's netstring protocol is any better than (something a little easier to process, such as) this: <,> where NNNNNNNN is always 9 bytes (ASCII encoded length of DATA), immediately followed by DATA and ending with a sanity byte ",". His protocol, due to the variable length of N, trades CPU for bandwidth since you have to make up to 9 reads (or iterative checks if system I/O reads are buffered) just to get the length. By requiring all nine bytes, you can safely do a single read. Nine bytes is not that a lot out of today's bandwidth. And, the ":" seems to be there in case we get an incorrect protocol stream, but there is NO defined recover strategy. The same can be said for the trailing ",". If you really wanted simple framing, what about something like <~> <~> ? On a slightly related note, has anyone looked at this http://www.sics.se/~joe/ubf/site/home.html ? (Erlang) -- [Todd Coram] ''feeling slow this afternoon after a large lunch'' [AMG]: djb doesn't like setting arbitrary limits like this (999,999,999). After all, he's the guy who's responsible for RFC 2822 supporting five-digit years! Joking aside, it's not unreasonable to transfer data trillions of bytes at a type; after all, this is a high-level protocol intended for use over reliable stream-oriented transports. And the CPU time involved in reading nine bytes isn't really worth considering, especially since relatively few netstring headers will exceed four characters in length, and those that do arrive infrequently because they are (of course) separated by more data bytes, with the ratio of data bytes to header bytes growing exponentially. With a variable-length length prefix, the colon is necessary as a field separator, since the first character of DATA may be a digit. The comma serves as a check on the string length. If the comma's not there, the packet was generated incorrectly, or the underlying transport has failed (which is extremely rare these days?). In the first case, aborting is the only sane "recovery" method. In the second case, it may be possible to scan ahead for the next occurrence of ('''0:'''|'''[[1-9]]\d*:''') and provisionally try resynchronizing, again dropping the packet if the terminating comma isn't there. But this is a lot of work and is unlikely to give good results, so aborting again seems reasonable. Your schemes, while workable, suffer from the same problem: unless characters in DATA are "escaped" somehow (something netstring doesn't do), it's always possible to resync at the wrong place. For instance, you might be tunneling, and valid packets can appear ''inside'' of DATA. Seriously, when was the last time TCP or even UDP delivered mutant packets to userland? At any rate, netstring is intended for error detection, not error recovery. If you want error recovery at the userland level, encode your netstring-formatted data with some ECC, and then hope you don't run into dropped/inserted bytes (or worse, bits). Lastly, the characters used in the netstring framing were chosen because they are compatible with 6-bit links. ~ requires a 7-bit link. But since 8-bit links are ubiquitous for everything except email (whaa?!), I doubt this matters much. ---- [[ [Category Interprocess Communication] | [Category Security] | [Category Data Structure] | [Category Binary Data] ]]