Version 24 of Tool Protocol Language

Updated 2014-04-09 18:18:52 by pooryorick

Tool Protocol Language, or TPL, is not a description of a new language, but an essay on thinking about Tcl in a different way.

CMcC 2008-10-28 17:11 AEST:

The world has many protocols for communicating between processes (IPC). Some are binary, some are ASCII. I will call ASCII protocols for IPC "protocol languages".

Lars H: The distinction between "ASCII" and "binary" in this sense isn't necessarily what characters may occur, but rather whether the format requires encoding explicit lengths or positions. For example PDF documents consist mostly of ASCII text (unless you compress data), but it is a "binary" format because it is hopeless to edit as text; doing that would throw off the exact file positions listed in the cross-reference table.

Examples of protocol languages and binary protocols

  • XML is used in XML-RPC type IPC.
  • FCGI, SCGI, CGI have aspects of protocol languages, and some of binary protocols.
  • HTTP is used in REpresentational State Transfer, REST
  • JSON is used in AJAX
  • several data representation languages YAML, ...
  • XDR is a binary protocol used in SUN RPC
  • command-line protocol - used to invoke commands from a process, per the unix system() command, e.g.
  • Google's Protocol Buffers are a binary protocol.
  • The OpenMath specification defines both an XML encoding and a binary encoding.

Tcl (arguably) has many characteristics desirable in a protocol language:

The purpose of this page is to argue this assertion, to explore counterarguments, to deepen understanding of Tcl as a protocol language, and to explore the prerequisites for its use in this role.

It seems to me (CMcC) that a cut-down Tcl syntax, without special meaning attributed to {*}, $ and [], would serve well in this role. Such a restricted syntax could be called TPL, or perhaps TDL (Tcl Data Language).

PYK notes that when CMcC wrote this, TDL must not have existed.

Many applications and systems provide a plugin API. Such an API will necessarily have an expression of the form command+args->result, and will therefore be well suited to a TPL representation.

Examples of plugin APIs:

In many cases, wrappers for library APIs have a similar form, command+args->result, and in fact many critcl and other such wrappers translate C APIs into this form.

The advantage and virtue of TPL is that it formalises the useful Tcl syntax subset in a way which might be of interest to people outside the Tcl community.

A precedent for TPL may be found in JSON. All command+arg->result forms can be represented as valid Tcl lists, and these are clearly and obviously trivial to interpret in Tcl. Providing a C library which translates a meaningful set of C data types into and out of TPL would enable conversion of any plugin API into wire-ready TPL protocol language.

One thing which would be useful in this endeavour is to identify a subset of the Tcl syntax *dekalog rules that is sufficient to define TPL completely; perhaps the full Tcl syntax could then be expressed in terms of TPL.

Dynamics of protocol interaction:

There are several styles of protocol interaction:

  • simple request/confirmation - RPC
  • pipelined request/confirmation - HTTP
    • in-order confirmation pipelining
    • out-of-order confirmation
  • multiplexed streams of the above - FCGI

Consideration must be given to each of these styles.
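
As a rough illustration of the out-of-order style, here is a minimal sketch in which every request carries an id and the reply echoes it back, so replies can be matched to requests in whatever order they arrive. The message shapes ({id command} for a request, {id status value} for a reply) and the proc name handle are assumptions made for this example only; nothing on this page prescribes them.

```tcl
# Hypothetical framing: request = {id command}, reply = {id status value}.
proc handle {request} {
    lassign $request id cmd
    set words [lassign $cmd name]
    switch -- $name {
        add     {set value [expr {[lindex $words 0] + [lindex $words 1]}]}
        upper   {set value [string toupper [lindex $words 0]]}
        default {return [list $id error "unknown command: $name"]}
    }
    return [list $id ok $value]
}

# The client remembers what it sent, keyed by id.
set pending [dict create 1 {add 2 3} 2 {upper tcl}]

# Simulate a server that answers request 2 before request 1.
set replies {}
foreach id {2 1} {
    lappend replies [handle [list $id [dict get $pending $id]]]
}

# Match each reply to its request by id, whatever the arrival order.
set results [dict create]
foreach reply $replies {
    lassign $reply id status value
    dict set results $id $value
    dict unset pending $id
}
```

In-order pipelining needs no id at all, since the n-th reply belongs to the n-th request; the id only pays for itself once replies may overtake each other.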


RS 2008-10-28: One Tcl-based "data language" is the one that tDOM produces with the asList method:

% dom parse "<foo><bar><grill>42</grill><grill>345</grill><qux a='b' /></bar></foo>"
domDoc01301840
% [domDoc01301840 documentElement] asList
foo {} {{bar {} {{grill {} {{#text 42}}} {grill {} {{#text 345}}} {qux {a b} {}}}}}

It may not be the prettiest sight, but it can be parsed with Tcl very easily, and it can represent the same complexity that XML has, with nested structures and attributes.

  • Each element is a triplet of {name attributes children}, where
  • attributes is {key val key val ...} and
  • children is {element ...}
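
To show how little code such a structure needs, here is a small sketch in plain Tcl (no tDOM required) that walks the triplets recursively and collects the text inside every element with a given name. The proc name collectText is made up for this example, and the tree is the asList output shown above, pasted in as a literal.

```tcl
# Return the #text content found directly inside every element named $tag.
proc collectText {node tag} {
    set result {}
    # Text nodes are {#text value} pairs and have no children to descend into.
    if {[lindex $node 0] eq "#text"} {
        return $result
    }
    lassign $node name attrs children
    foreach child $children {
        if {$name eq $tag && [lindex $child 0] eq "#text"} {
            lappend result [lindex $child 1]
        } else {
            lappend result {*}[collectText $child $tag]
        }
    }
    return $result
}

set tree {foo {} {{bar {} {{grill {} {{#text 42}}} {grill {} {{#text 345}}} {qux {a b} {}}}}}}
set values [collectText $tree grill]   ;# the list: 42 345
```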

Lars H: Indeed a very useful format, that natively supports data is code, and for which a pure-Tcl translator from XML is available: A little XML parser (whose only failing is that it doesn't translate XML entities to ordinary characters in e.g. attribute values and #text data). I'm presently using it for a project, and have encountered no problems with it (even though it sometimes feels a bit prolix when one has to go [list [list foo {} {}]] in order to make a list with the only child of a node). I suspect this is a slightly higher level format than the TPL suggested here, though; while pretty much anything can be encoded in XML, it might be that some applications of TPL would benefit from not adhering to the strict type–attributes–children structure of the XML-asList format. In XML-specification-speak, I think what I'm getting at is that XML-asList would be an important application of TPL, but not necessarily the only one.

In terms of Tcl syntax subsets, there are at least three useful levels:

  • Full Tcl syntax (*dekalog).
  • Tcl list syntax (as briefly explained on the lindex manpage): $, brackets, #, semicolon, and newline have no special powers, but backslash, braces, quote, and whitespace have their normal meaning.
  • Tcl list syntax, plus command separators and comments. I think this is roughly what CMcC is proposing for TPL, but no direct support exists for it (that wouldn't also support full Tcl syntax).

The extended list syntax is superior to list syntax in that it supports comments and is better at catching syntax errors — when forgetting an argument of some command it doesn't grab the name of the next command as that argument — while still keeping the nesting of braces at a tolerable level. A downside for internal processing is that it (presently) cannot preserve the internal representation of data, since everything shimmers to a string when you join separate commands into a script.
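
A rough sketch of what a reader for the extended list syntax might look like, under some loud assumptions: only newline is treated as a command separator (no semicolons), a comment must start its own line, and info complete is borrowed to find command boundaries even though it also counts brackets, which plain list syntax does not care about. The proc name parseCommands is invented for this example.

```tcl
# Split text into commands; each command is returned as a canonical list.
proc parseCommands {text} {
    set commands {}
    set buffer {}
    foreach line [split $text \n] {
        if {$buffer eq {}} {
            # Between commands: skip blank lines and comment lines.
            set trimmed [string trim $line]
            if {$trimmed eq {} || [string index $trimmed 0] eq "#"} continue
            set buffer $line
        } else {
            # Inside a braced word that spans several lines.
            append buffer \n$line
        }
        if {[info complete $buffer\n]} {
            # lrange raises an error if the command is not a valid list.
            lappend commands [lrange $buffer 0 end]
            set buffer {}
        }
    }
    if {$buffer ne {}} {
        error "incomplete command: $buffer"
    }
    return $commands
}

set config "# settings\nname {Tool Protocol}\nsize\ncolour red\n"
set commands [parseCommands $config]
```

Here the size command has lost its argument, and the parse shows it: the second command has one word instead of two, and colour is still recognised as a command of its own. Read as one flat list, the same text would silently pair size with colour.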

CMcC: I hadn't thought about command separators and comments. I think they might unnecessarily complicate the TPL usage of Tcl syntax, although the use of command separators to represent pipelining is an intriguing possibility (seems to me that comment is completely useless in the TPL context). I think TPL needs backslash, braces, quote and whitespace. So it looks like Tcl list syntax is equivalent to TPL. This is roughly analogous to JSON, which is a useful parallel to keep in mind, I think.

In general, APIs and RPC are functional applications, so are directly representable by Tcl's command syntax + lists (and since command syntax is list syntax, this reduces to being lists.) Clearly anything can be represented as a string (that's a Tcl mantra) and it's useful to interpret a subset of strings as lists, and those lists can be used to represent anything a protocol language can be expected to represent.
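
A minimal sketch of that idea: a request arrives as a string, is read as a list (so $ and [] are just characters), and is dispatched against a whitelist. The names greet, allowed and dispatch are hypothetical, chosen only for this example.

```tcl
# A command the receiver is willing to run on behalf of a peer.
proc greet {who greeting} {
    return "$greeting, $who!"
}

# Whitelist of command names that may be invoked over the wire.
set allowed {greet}

proc dispatch {request} {
    global allowed
    # First list element is the command name, the rest are its arguments.
    set arguments [lassign $request name]
    if {$name ni $allowed} {
        error "unknown command \"$name\""
    }
    return [$name {*}$arguments]
}

# List parsing performs no substitution, so $user and [exit] stay literal.
set request {greet {$user [exit]} "good day"}
set reply [dispatch $request]
```

The request is never passed to eval as a script; only the list commands look at it, which is what makes the restricted syntax safe to accept from an untrusted peer.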

The virtues of formalising this approach, and giving the Tcl syntax subset a distinct name, are:

(a) marketing - we have a new way to think about Tcl, we have a new way to provide utility to the wider world, and to give people a reason to think about Tcl for applications or retrofits,

(b) support - we can produce a series of C/JS/etc language functions which will interconvert between the host language's data types and Tcl's. Given such a library for a given language, the process of interfacing applications and systems written in that language to Tcl is significantly simplified, and of course the processing of the API/RPC protocol language in Tcl is *vastly* simplified. This allows Tcl to better fulfill its function as a Tool Control Language, by supporting Tool Protocol Language as a protocol language.

Lars H: I got the impression that you wanted TPL to be a natural language for config files; in that setting command separators and comments are highly recommended, but I can imagine use-cases also for command substitution (e.g. binary decode of some blob might be more convenient than the \x counterpart) and variable substitution there, so it is probably better served by full Tcl syntax. For communication exclusively between two pieces of software, command separators and comments are pretty useless and a significant complication, so if that is the niche for TPL then the Tcl list syntax is indeed the natural fit.

Wire protocol

Lars H: TPL, like XML, is ultimately about how one encodes data as a string of Unicode characters, but one might also think about giving recommendations on how to transport a stream of TPL messages over a binary channel (octet-stream). The following is such a "wire protocol" I'm using in a project. It is based on the wire protocol of comm, but has some tweaks to improve robustness. Basic principles are:

  • Use utf-8 for character encoding — this is efficient with respect to ASCII characters (of which there are going to be a lot just for the TPL syntax) and it is capable of encoding everything.
  • Provide a message separator, so that the receiver can start processing a message as soon as it has been completely received.
  • Make it so that the complete message stream can afterwards be viewed as the list of all the messages, and viewed using standard tools (e.g. more) — this simplifies debugging.

Naturally, we also want it to be easy to implement in Tcl.

Transmitting is very simple. To encode a message msg, do

encoding convertto utf-8 [list $msg]

To prepare a channel for a stream of messages do

fconfigure $channel -translation binary
puts $channel \xC0\x8D

To write an encoded_msg to such a channel do

puts $channel "$encoded_msg\xC0\x8D"

To read the list of all messages on a channel (assuming they are well-formed) and loop over them, do

fconfigure $channel -translation lf -encoding utf-8
foreach msg [read $channel] {
    # Do stuff
}

Reading messages as they come in (still assuming well-formedness):

fconfigure $channel -translation binary -blocking no
fileevent $channel readable [list Receive $channel]
proc Receive {channel} {
    variable buffer
    append buffer [read $channel]
    # For now, I also skip checking for EOF.
    while {[
        set pos [string first \xC0\x8D $buffer]
    ] >= 0} {
        set chunk [string range $buffer 0 [expr {$pos-1}]]
        set buffer [string range $buffer [expr {$pos+2}] end]
        set msg [lindex [encoding convertfrom utf-8 $chunk] 0]
        # Process $msg
    }
}

So how does it work?

  • The separator \xC0\x8D is an overlong, and therefore malformed, utf-8 encoding of carriage return (similar to the \xC0\x80 that Tcl uses internally for NUL in stringReps), so it cannot occur in any correctly encoded message.
  • When Tcl's "encoding convertfrom utf-8" encounters \xC0\x8D, it still sees it as a carriage return, which is just ordinary whitespace as far as TPL syntax is concerned. If reading with -encoding utf-8 -translation auto, the whole \xC0\x8D\n sequence will be seen as just a crlf and thus translated to a single \n.
  • Taking (the string representations of) two lists and putting some whitespace between them produces a string representation of the concatenated list, so the whole thing will be a valid list.
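
The framing can be tried out without any channel at all. The following sketch builds a byte stream in memory and takes it apart again; frame and unframe are names invented for this example. It cuts the stream at each separator before decoding, so it does not depend on the decoder accepting \xC0\x8D, which a stricter utf-8 decoder might well refuse.

```tcl
# Encode one message and terminate it the way puts would on a binary channel.
proc frame {msg} {
    return [encoding convertto utf-8 [list $msg]]\xC0\x8D\n
}

# Split a byte stream at each separator and decode the messages.
proc unframe {bytes} {
    set msgs {}
    while {[set pos [string first \xC0\x8D $bytes]] >= 0} {
        set chunk [string range $bytes 0 [expr {$pos-1}]]
        set bytes [string range $bytes [expr {$pos+2}] end]
        # The leftover newline from the previous frame is plain list whitespace.
        lappend msgs [lindex [encoding convertfrom utf-8 $chunk] 0]
    }
    return $msgs
}

set stream [frame {hello world}][frame "caf\u00e9 {x y}"]
set msgs [unframe $stream]
```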

How it looks:

  • From within Tcl (and assuming suitable channel configuration), just like a list of messages.
  • At a unix prompt, each message starts on a new line, begins with a left brace, and ends at the end of a line with a right brace followed by two bytes of binary gunk. This makes message boundaries easy to spot.

Things that could be better:

  • I'd like to specify that there should be a left brace before each message and a right brace after it, but as it turns out one cannot 100% rely on this if all valid string representations of lists are allowed. Consider:
% set a "b \"\{\""
b "{"
% llength $a
2
% lindex $a 0
b
% lindex $a 1
{
% list $a
b\ \"\{\"

This is a valid list of a (multi-element) list, but it isn't brace-delimited. Switching to the canonical list representation would however make it so:

% lrange $a 0 1
b \{
% list [lrange $a 0 1]
{b \{}

A conclusion of this could be that TPL should tighten the Tcl list syntax somewhat.

PYK 2014-04-09: Lars H, you've overthunk it! This "wire protocol" is exactly equivalent to transmitting a normal valid Tcl script encoded in UTF-8, with each message being the syntactic equivalent of a command in the script. The \xC0\x8D strategy is entirely unnecessary. The transmission procedure can simply be:

fconfigure $channel -encoding utf-8 
puts $channel $msg

and the receiving procedure can then use info complete to separate the items which, syntactically, are commands. -translation binary is also unnecessary, as utf-8 handles that naturally:

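fconfigure $channel -encoding utf-8 -blocking no
fileevent $channel readable [list Receive $channel]
proc Receive channel {
    variable buffer
    if {![info exists buffer]} {set buffer {}}
    if {[gets $channel line] < 0} {
        if {[eof $channel]} {
            # Wrap things up.
            # If the buffer isn't empty, the last message is incomplete.
            close $channel
        }
        # Otherwise no complete line has arrived yet.
        return
    }
    if {$buffer eq {}} {
        # Skip blank lines between messages.
        if {$line eq {}} return
        append buffer $line
    } else {
        # Inside a message, blank lines are part of the data.
        append buffer \n$line
    }
    if {[info complete $buffer\n]} {
        process $buffer[set buffer {}]
    }
}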

The moral of the story is that Tcl has another data format that doesn't get enough airtime: the script.
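
For anyone who wants to try the script-as-wire-format idea without a network, here is a self-contained sketch that sends two messages through an in-process pipe and recovers them with info complete. It assumes Tcl 8.6 or later (for chan pipe) and reads synchronously rather than from a fileevent handler, purely to keep the example short.

```tcl
lassign [chan pipe] rd wr          ;# in-process pipe, Tcl 8.6+
fconfigure $wr -encoding utf-8
fconfigure $rd -encoding utf-8

# Each message is a list, written as one command of the "script".
puts $wr [list greet "two\nlines"]
puts $wr [list add 1 2]
close $wr

set buffer {}
set msgs {}
while {[gets $rd line] >= 0} {
    append buffer $line\n
    # A message that contains a newline stays incomplete until its brace closes.
    if {[info complete $buffer]} {
        lappend msgs [string range $buffer 0 end-1]
        set buffer {}
    }
}
close $rd
```

The first message spans two lines on the wire, yet comes back as a single command whose second word still contains the newline.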