[Started Dec 9 2005 by lexfiend, who's quite astonished that anyone took any notice of this obscure encoding.]
AMG: s/obscure/awesome/
netstrings is a string encoding invented by D. J. Bernstein (formal definition here: [L1 ]). In a nutshell:
<length_of_data_in_bytes>:<data>,
It's 8-bit clean, easy to implement, interoperable and "safe" (at least when used correctly). Netstrings can also be trivially nested, so the list
{This is a test}
can be represented as
23:4:This,2:is,1:a,4:test,,
More usefully, netstrings can be used to safely send/receive chunked ASCII/binary data between any two arbitrary platforms (implemented in any language) with trivial framing and length checks, and none of that nonsense about having to escape special characters in the datastream.
The Quick Mail Queuing Protocol (QMQP) [L2 ] that qmail [L3 ] implements is an example of real-life netstrings use.
# readNetString <chan> # Purpose: Reads and returns a netstring from <chan> proc readNetString {chan} { set nslen 0 set nstr "" while {![eof $chan]} { set nschar [read $chan 1] switch -glob $nschar { [0-9] { # Next char in length - add it in set nslen [expr {$nslen * 10 + $nschar}] } : { # End of length - read in $nslen bytes set nstr [read $chan $nslen] # Check to see if we've got there set readlen [string length $nstr] if {$readlen == $nslen} { # Now we must have a trailing comma if {[read $chan 1] == ","} { # We're done return $nstr } else { error "Missing delimiter at end of netstring" } } else { error "Netstring length ($nslen) does not match read length ($readlen)" } } default { error "Illegal character '$nschar' in netstring length" } } } } # writeNetString <chan> ?<str> ...? # Purpose: Writes all <str>s to <chan> as netstrings proc writeNetString {chan args} { foreach str $args { set writelen [string length $str] puts -nonewline $chan "${writelen}:${str}," } flush $chan }
AMG, 9 Dec 2005: I changed the "Illegal character" switch arm pattern specifier from "-" to "default".
lexfiend, 10 Dec 2005: Oof! Talk about a PEBKAC error! Thanks for catching that -- it was very late when I wrote that code.
AMG: Note that the above code differs from djb's netstring specification which forbids superfluous leading 0's in length prefixes, such that there is exactly one legal encoding for any string (e.g. empty string is "0:,").
lexfiend: Good point. That's in fact a not-so-well-known aspect of netstrings, and I notice even djb's sample code doesn't handle this case correctly. Will fix it some other time, when it's not so late...
lexfiend's code doesn't help much with picking apart nested netstrings because it only accepts characters from a [read]able channel. Here's some code that works with Tcl strings rather than channels.
proc ns_decode {data args} { switch -- [llength $args] { 0 {set pos 0} 1 {upvar 1 [lindex $args 0] pos} default {error "wrong # args: should be \"ns_decode data ?pos_var?\""} } if {![regexp -start $pos {\A(0|[1-9]\d*):} $data header length]} { error "missing or corrupt netstring header" } incr pos [string length $header] if {![regexp -start $pos "\\A(.{$length})," $data body payload]} { error "missing, corrupt, or truncated netstring body" } incr pos [string length $body] return $payload } proc ns_encode {data} { return [string length $data]:$data, }
My code isn't good for data arriving from a channel since the channel reader doesn't know in advance how many bytes to read, but it works for data that's already read, as in the case of nested netstrings. It would be pretty nice (and semantically WEIRD!) if we could unify these two cases, that is, somehow have Tcl objects backed by channels rather than simple RAM, and be able to [regexp] over them and such.
lexfiend: Thanks for writing the equivalent string-based implementation. The use-case that I envisioned for my code simply to read/write non-nested netstrings as part of a simple application protocol like PIRP [L4 ], but your code would certainly come in handy for, say, implementing a QMQP client or server.
AMG: The following works with lists in the format given by lexfiend, that is, netstring wrapping a simple concatenation of netstrings:
proc list_ns_decode {data} { set pos 0 set result [list] while {$pos < [string length $data]} { lappend result [ns_decode $data pos] } return $result } proc list_ns_encode {args} { set result "" foreach arg $args { append result [ns_encode $arg] } return $result } % ns_encode [list_ns_encode This is a test] 23:4:This,2:is,1:a,4:test,, % list_ns_decode [ns_decode 23:4:This,2:is,1:a,4:test,,] This is a test
In practice I wouldn't use this "tree" (nested list) format since it doesn't clearly express the difference between interior and leaf nodes; given arbitrary data, you don't know how many levels deep you need to decode. Perhaps I should illustrate:
% interp alias {} unknown "" list_ns_encode % ns_encode [a [b [c d]] e f [g] [] [h []]] 51:1:a,15:1:b,8:1:c,1:d,,,1:e,1:f,4:1:g,,0:,7:1:h,0:,,,
(That's possibly the strangest use of [unknown] I've ever seen.) Anyway, this string of bits could legitimately be an encoding of any of the following lists:
{1:a,15:1:b,8:1:c,1:d,,,1:e,1:f,4:1:g,,0:,7:1:h,0:,,} { a 1:b,8:1:c,1:d,, e f 1:g, { } 1:h,0:, } { a { b 1:c,1:d,} e f {g} { } { h { }}} { a { b { c d}} e f {g} { } { h { }}}
Plus other variations. So the recipient of encoded data must know in advance how it is supposed to be structured; this encoding is not (reliably, unambiguously) self-describing.
lexfiend: Personally, I thought an underlying assumption behind netstrings was that both sides already knew what to expect, so a self-describing format was neither needed nor desired (along that path lies XML), though it's certainly an interesting mental exercise. 8-)
Perhaps use type characters, for instance d for leaf nodes and l for interior nodes. Also, in keeping with the netstring spirit, for interior nodes give the number of child nodes (i.e. the [llength]) in the encoding. I'll call this nl for netlist.
proc nl_data_encode {data} { return [string length $data]d:$data, } proc nl_list_encode {args} { return [string length [join $args ""]]l[llength $args]:[join $args ""], } proc nl_decode {data args} { switch -- [llength $args] { 0 {set pos 0} 1 {upvar 1 [lindex $args 0] pos} default {error "wrong # args: should be \"nl_decode data ?pos_var?\""} } if {![regexp -start $pos {\A(0|[1-9]\d*)(?:(d)|(l)(0|[1-9]\d*)):} $data\ header length d_type l_type l_length]} { error "missing or corrupt netlist header" } incr pos [string length $header] set end [expr {$pos + $length - 1}] if {[string index $data [expr {$end + 1}]] ne ","} { error "missing or corrupt netlist trailer" } if {$d_type eq "d"} { set payload [string range $data $pos $end] } elseif {$l_type eq "l"} { set payload [list] while {[llength $payload] < $l_length} { lappend payload [nl_decode [string range $data 0 $end] pos] } if {$pos != $end + 1} { error "extra characters after final netlist element" } } set pos [expr {$end + 2}] return $payload }
Considerably more complex... Differentiating between "data" elements and "list" elements is neither easy nor comfortable. Oh well, let's test.
% interp alias {} d {} nl_data_encode % interp alias {} l {} nl_list_encode % d foobar 6d:foobar, % l [d foobar] [d quux] 18l2:6d:foobar,4d:quux,, % l [d foobar] [l [d basby] [l]] [d quux] 38l3:6d:foobar,14l2:5d:basby,0l0:,,4d:quux,, % nl_decode 38l3:6d:foobar,14l2:5d:basby,0l0:,,4d:quux,, foobar {basby {}} quux
Seems to work.
If you haven't noticed, there's an impedance mismatch here between Tcl and, well, the rest of the world. Tcl has indistinguishable encodings for any data object and a single element list containing that same data object. (See K for K*.)
% K* a % ns_encode a a 1:a, % list a % ns_encode [list_ns_encode a] a 4:1:a,, % list [list a] % ns_encode [list_ns_encode [list_ns_encode a]] a 7:4:1:a,,,
As a consequence, for all the complex unambiguous list representation infrastructure I put into netlist, nl_decode can't always communicate the list structure to its caller. Watch:
% d frobozz 7d:frobozz, % nl_decode 7d:frobozz, frobozz % l [l [l [l [l [d frobozz]]]]] 35l1:29l1:23l1:17l1:11l1:7d:frobozz,,,,,, % nl_decode 35l1:29l1:23l1:17l1:11l1:7d:frobozz,,,,,, frobozz
Aw, that's a shame.
jmn 2005-12-11. Regarding this 'impedance mismatch'. One way to deal with this is with 'tagged' lists in Tcl. Nested empty lists/tuples, or nested lists/tuples where the inner list/tuple has one element are reasonably common in Erlang and manipulating such lists as ordinary Tcl lists can be tricky to say the least. I use a tagged list mechanism when transferring data back & forth between Tcl & Erlang. e.g.
%package require erlish %erlish::term tcl {"somestring",someatom,6,[],[[]],[x],[[x]],{{{}}}} TERMLIST {STRING somestring} {ATOM someatom} {INT 6} LIST {LIST LIST} {LIST {ATOM x}} {LIST {LIST {ATOM x}}} {TUPLE {TUPLE TUPLE}}
With this package you can even preserve whitespace by using 'compound' tags. 'lindex 0 0' will always give you the 'tag' i.e the type of list you're dealing with, and the rest of the tag contains a whitespace encoding pointing to the positions in the data where it occurs. This is similar in concept to transmitting a list's 'nesting traversal' talked about below... but of course both these mechanisms require you have all the data prior to transmitting.. so it can make streaming a little difficult. Anyway.. sorry for the sidetrack - I'll make a page for the erlish package some time.
lexfiend: Actually, I'd embed the type signature as a netstring rather than shoehorning it into the length field, like this:
<overall_len>:<typesig_len>:<typesig>,<chunk1_len>:<chunk1>,...<chunkN_len>:<chunkN>,,
For example, if '{n}' is the type signature for a list of length <n>:
26:2:{1},15:2:{1},7:frobozz,,,
would represent a list of a list of 1 element (frobozz). In fact, with type signature compression, your last example:
{{{{{frobozz}}}}}
could be encoded thusly:
26:11:{{{{{1}}}}},7:frobozz,,
(AMG interrupts: how does one programmatically construct a list such as {{{{{frobozz}}}}}? - RS: See below)
and:
{{foo {bar {baz 321} this {is {a}}}} test}
would be:
70:19:{{1{1{2}1{1{1}}}}1},3:foo,3:bar,3:baz,3:321,4:this,2:is,1:a,4:test,,
This method has the advantage of interoperating with any standard netstring implementation, yet any program that can't understand or wasn't expecting the <typesig> (because, for instance, it was a misrouted message) can safely junk the entire data structure without losing sync with the other side.
AMG: That's very clever. I like! Eschewing nesting/recursion in favor of transmitting the infix traversal... radical. Maybe I'll write an implementation for it here, if someone doesn't beat me to it.
I called my little protocol netlist instead of netstring to emphasize the incompatibility, but about said incompatibility I wasn't happy at the time and I'm not happy now because I'd like to be able to use good, existing netstring-aware code, for instance execline [L5 ] (see [L6 ] for the specific usage). Hmm, execline is interesting stuff; we should have a page on it!
From a Tcl perspective I say that for self-describing data, list structure is the only type information I think is worth advertising: everything else is ordinary strings (i.e. isn't further encoded*). If this was wiki.python.tk, we probably wouldn't even be talking about netstrings since Pickle does everything we want anyway, and we wouldn't be concerned about compatibility with inferior languages. :^) Okay, actually I don't know what Pythoneers think but isn't this what Pickle is for?
* Of course you can argue that in netstrings every data type is encoded as strings, for instance the string "50" is an encoding of the integer that is equal to the number of stars on the circa-2005 US flag. But Tcl'ers don't care since Tcl performs the same encoding but does so without bothering the programmer about it.
Lars H: Is that "Pickle" the same as the Bencoding used by BitTorrent [L7 ]? Bencoding is pretty close to the "netstring with type characters" described above.
AMG: I refer to [L8 ]. The BitTorrent document's choice of metasyntactic variables (spam and eggs) strongly suggests that it's Python-based, so yes it's quite likely that the two encodings are the same. I'm having trouble finding documentation for the official Python pickling protocols (there are several numbered versions plus binary and textual variants), but I remember reading about it in a book I loaned away once upon a time. It was a good book: the Visual Quick Start guide to Python. Check it out if you want to learn Python.
Lars H: Well, browsing a bit of documentation about Pickle quickly killed off any inclinations I might have had towards learning Python. (Is this what these people have to go through to serialize data?? The horror, the horror!!!) This only strengthens my conviction that Tcl has gotten it right -- everything is a string.
Why did I (AMG) put this in Security? Even though Tcl is thought not to have buffer overflows and other such problems, many other programs do, especially those written for ad-hoc network protocols. Netstrings were developed to make it easy to design protocols for which it's difficult to write buffer overflow-ridden software. Apparently protocols need to be hardened against programmer stupidity. :^)
lexfiend: Netstrings are also designed to make it easy to design robust protocols, in that the simple combination of explicit length and framing characters help ensure that any traffic that obeys netstring conventions can be read and dealt with reliably and predictably, unlike (say) line-oriented protocols whose implementors/users occasionally forget to escape line-breaks in the actual data...
AMG: djb says that there are two kinds of interfaces: user interfaces and good interfaces. Line-oriented network protocols (file formats, too!) fall in the first category, and they're plagued with all sorts of funny problems. But they're easy for puny hu-mans to read and write directly. Netstring falls in the second category; it's great for computers to talk to each other, but users need an intermediary tool in order to correctly read and write. Maybe that's not so bad, as long as the encoding protocol is simple enough that the intermediary tool itself is unlikely to be buggy. In other words, direct compatibility with [telnet] and soft, fleshy brains isn't a worthwhile feature.
I'm spoiled on Tcl's nice buffering, so I usually wind up making my protocols line-oriented. Maybe I should break this habit and stop trying to handle packet arrival events directly in my [fileevent readable] callback. Instead I disable buffering and handle characters as they come; the readable callback slices things into packets and calls the appropriate registered high-level handler. (By the way, is it possible to [event generate] without a widget? lexfiend: Do you mean Custom event notification?)
Okay, that gives me an idea for how to unify code that reads netstrings from the wire and from inside of already-read netstrings. Actually I've done this before, with a few minor variations, and it works but looks complex. Yet I also find it to be inevitable: does anyone have a simpler idea that'll still yield a correct, full-fledged I/O handler?
The readable callback [read]s all available data, then passes it to a netstring decoder function capable of returning all the full packets it sees in the data, telling the caller how many input characters it consumed and will never need again, reporting encoding errors, and attempting to resynchronize after error. The readable callback stuffs read data into one end of a ring buffer and the netstring decoder pulls it out at the other end, leaving behind bits that comprise partially received packets. This netstring decoder function doesn't know or care where its data comes from, so it can be used to get at nested netstrings.
RS How to make deep nestings programmatically? By constructing a suitable string :)
proc deeplist {n content} { return [string repeat \{ $n]$content[string repeat \} $n] } % deeplist 5 frobozz {{{{{frobozz}}}}}
Lars H thinks the problem was rather about how to make list generate these deep nestings, like so:
proc repeatlist {n content} { for {set k 1} {$k<=$n} {incr k} { set content [list $content] } return $content }
The trick there is merely to pick a nontrivial content:
% repeatlist 5 "fro bozz" {{{{{fro bozz}}}}} % repeatlist 5 "{frobozz}" {{{{{{frobozz}}}}}}
AMG: Is there an easy, reliable way to transform "trivial" content into a "nontrivial" form without changing its meaning?
Lars H: Not in general, because if that content is a string then encoding and meaning is the same. However if the content is itself encoded then there are often loopholes that can be exploited. E.g. if the content is the list "frobozz", then it can be encoded also as the string "\146robozz", which due to the backslash will be braced by list.
escargo - This netstrings seems like a way of representing values without much notion of structure. Some of the encodings used above try to encode structure as well as value. The notion of netstrings seems to be almost the opposite of XML. If one does not want the overhead of XML, than imposing structure could be done with JSON or YAML.
AMG: Netstring doesn't know anything about structure, but that doesn't mean its payload can't be self-describing. lexfiend and I have been bandying about with methods of encoding lists in netstring style. Sure, we could just put the Tcl string representation in a netstring, but generating and dissecting Tcl-style lists is comparatively hard to do correctly and securely. These list encodings, particularly the two lexfiend came up with, are much better I think. (lexfiend's first is for when both sides know the structure, and his second is self-describing for sending arbitrary lists and trees. My proposal probably shouldn't be used because it's incompatible with existing code, fairly complicated, and doesn't offer any benefit over the alternatives.)
About the netstring spec...
I'm probably missing the point (in a big way), but I don't see how Bernstein's netstring protocol is any better than (something a little easier to process, such as) this:
<NNNNNNNNN> <DATA> <,>
where NNNNNNNN is always 9 bytes (ASCII encoded length of DATA), immediately followed by DATA and ending with a sanity byte ",". His protocol, due to the variable length of N, trades CPU for bandwidth since you have to make up to 9 reads (or iterative checks if system I/O reads are buffered) just to get the length. By requiring all nine bytes, you can safely do a single read. Nine bytes is not that a lot out of today's bandwidth. And, the ":" seems to be there in case we get an incorrect protocol stream, but there is NO defined recover strategy. The same can be said for the trailing ",".
If you really wanted simple framing, what about something like <~> <NNNNNNNNN> <DATA> <~> ?
On a slightly related note, has anyone looked at this http://www.sics.se/~joe/ubf/site/home.html ? (Erlang)
-- Todd Coram feeling slow this afternoon after a large lunch
AMG: djb doesn't like setting arbitrary limits like this (999,999,999). After all, he's the guy who's responsible for RFC 2822 supporting five-digit years! Joking aside, it's not unreasonable to transfer data trillions of bytes at a time; after all, this is a high-level protocol intended for use over reliable stream-oriented transports. And the CPU time involved in reading nine bytes isn't really worth considering, especially since relatively few netstring headers will exceed four characters in length, and those that do arrive infrequently because they are (of course) separated by more data bytes, with the ratio of data bytes to header bytes growing exponentially. (Todd Coram -- A naive implementation of a reader for this protocol may make extra read() calls, which can be quite expensive. I deal with soft-realtime software -- with lots of short messages, where too many kernel calls to read() data add up to poorer performance :-( ) AMG: short messages don't require many reads since they are short! :^) Also, a correct netstring scanner will forbid packets with leading zeros, thus enforcing minimal header length. Otherwise you could tie up a bad scanner indefinitely by sending it a constant string of ASCII 0 bytes.
With a variable-length length prefix, the colon is necessary as a field separator, since the first character of DATA may be a digit.
The comma serves as a check on the string length. If the comma's not there, the packet was generated incorrectly, or the underlying transport has failed (which is extremely rare these days?). In the first case, aborting is the only sane "recovery" method. In the second case, it may be possible to scan ahead for the next occurrence of (0:|[1-9]\d*:) and provisionally try resynchronizing, again dropping the packet if the terminating comma isn't there. But this is a lot of work and is unlikely to give good results, so aborting again seems reasonable. Your schemes, while workable, suffer from the same problem: unless characters in DATA are "escaped" somehow (something netstring doesn't do), it's always possible to resync at the wrong place. For instance, you might be tunneling, and valid packets can appear inside of DATA. (Todd Coram -- Possible, but still recoverable -- on resyncing at the wrong place, you look for nine more 0-9 ASCII digits and if they don't appear or add up incorrectly, continue consuming bytes until you reach the next "~" and start over.) AMG: For tunneled data even this won't work: let's say the uncorrupted message is ~000000016~000000005abcde~~, but the 6 mutates into a ß. It would be incorrect to resync on ~000000005 because that's the payload and not the beginning of the next packet.
Seriously, when was the last time TCP or even UDP delivered mutant packets to userland? (Todd Coram -- Right. Which is why I question the usefulness of the check byte in the protocol. TCP has gone through all of that trouble of making sure that you get the data intact...) AMG: I suppose nowadays the comma is more useful to check the implementation and to give human readers a hint where the packet ends. For example, if I manually craft a netstring packet but screw up my character count, most netstring scanners will be able to spot and report my error rather than blindly truncating my data.
At any rate, netstring is intended for error detection, not error recovery. If you want error recovery at the userland level, encode your netstring-formatted data with some ECC, and then hope you don't run into dropped/inserted bytes (or worse, bits).
Lastly, the characters used in the netstring framing were chosen because they are compatible with 6-bit links. ~ requires a 7-bit link. But since 8-bit links are ubiquitous for everything except email (whaa?!), I doubt this matters much. (Todd Coram -- djb is a smart dude. But, sometimes I wonder about the motivation he has behind solving problems I thought were already solved... Thanks for the explanation! :-) ) AMG: netstring is the first named encoding I've encountered that's designed to be robust against bad programmers (i.e. has a length prefix to help avoid buffer overflows, doesn't muck about with escaping characters, doesn't have multibyte sequences like CRLF as packet separators, etc.), and that's worth something.
DKF: If you're under malicious attack, malformed packets are very likely and the defences of TCP/UDP checksums aren't that great for anything other than preventing accidental problems (and even then they're not great if you're shipping very large amounts of data; a 32-bit CRC isn't a wonderful safeguard when the amount of data gets over a few gigabytes). Typically, you'd really want to use a stronger checksum at the application level (even md5 is much better that straight crc, and for serious use you'd be up at sha256 or something like that) or even a full encrypted channel.
Of course, I'd go for a binary version of the protocol with four bytes (network endian 32-bit int of course) giving a 32-bit length of data to follow. The big advantage is that it is easier to code in Tcl (fixed-size initial read) though some attention needs to be paid to encodings to make it all hang together correctly. The only problem is that it doesn't let you handle more than 4GB, but that's not a problem if either end is a 32-bit machine; you just can't have Tcl strings long enough to be a significant problem due to shortage of addressable memory. ;^)
AMG: djb's approach (ASCII digits) supports communication channels with as few as six bits per character. If you can guarantee that you have an eight-bit link, and you know that you'll never send more than four gigabytes per message, go ahead and use a binary protocol.
jmn 2008-03 There is netstring package for Tcl here: http://vectorstream.com/tcl/packages/tarpacks/netstring-0.1.tm It includes a simple echo server which demonstrates reading netstrings from a channel using 2 different idioms,
There are also encode & decode functions to operate on entire netstrings or concatenated sequences thereof.
Documentation consists only of comments in the code for now.
rmadilo 2010-07 If anyone is interested I long ago extended the concept of netstrings to structured data, and slightly simplified the syntax.
There are two formats: one called rhf or rmadilo hand format (hand written) and the binary/compiled version called rdl or rmadilo data language.
See http://rmadilo.com/files/rmadilo/ for more information. (Link broken 2013-04-02)
I have notes on extending this simple format to support xml-like element attributes.
The package contains simple script pipelines which convert from rhf to rdl.
And I have a number of applications which use the format.
AMG: See also http://en.wikipedia.org/wiki/Netstring . This Tcl Wiki page on netstrings came into being several years before the Wikipedia article. :^)