Interpreting TOOT

NEM 21June2004: Some thoughts on TOOT, values and types in general. In particular, how TOOT allows interpretations to be associated with representations (values) in a general, but flexible manner.

Tcl is "untyped" or mono-typed - there is only the string. The "type" of a value is determined by its usage; if it is used as an integer (successfully) then it is one, but it may also be something else (e.g., a string, variable name etc).

Other languages are usually typed; values have some type that determines the operations that can be performed on them, helps disambiguate some syntactic forms, and allows for polymorphic command/operator overloading.

In most languages, "type" is assumed to be an intrinsic property of a value -- e.g., "2" is an integer.

Tcl takes the view that "type" is an extrinsic property -- "2" is a string, which may be used as an integer. TOOT expands on this principle.

TOOT's view is that the only thing you ever store in a computer is a "representation". For instance "2" is a string representation consisting of a single character. Note that you never actually store the number 2 in the computer -- that's physically impossible, as numbers are abstract concepts and have no physical presence that could be stored. Even at the lowest machine level, integers are stored as a series of bytes (which in turn are a series of bits, and they themselves are manifest as streams of electrons in the underlying circuitry) that represent a given number.

When you give a representation a type, you are applying an "interpretation" to that representation. In some languages, you are only allowed to manipulate values that have some interpretation attached. Some languages even require you to declare this interpretation in advance:

int a = 1;
String foo = new String("Hello, World!");
Object bar = foo;
String bars = (String)bar;

Note that you can up-cast the value to a weaker (more general) interpretation, and later down-cast (with the possibility of error) to a stronger interpretation, but only if the new interpretation is compatible with the original interpretation that the value was given. So, the value always has a "type" (an interpretation) associated with it, and you are forced to comply with that interpretation, even if you later change your mind, or disagree with whoever wrote the code that created the value in the first place.

In Tcl, values have no intrinsic interpretation. They just are values. Actually, they are strings, but as strings are just sequences of bytes, this representation covers all possible data that a computer can actually represent, and it's useful to have *some* base representation to build on. Now, Tcl commands can impose whatever interpretation they like on this representation, independent of any other commands that operate on the same value. So, for instance, you can do:

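set a "24"
string length $a   ;# 2: interpreted as a string of two characters
llength $a         ;# 1: interpreted as a list with one element
expr {$a + 2}      ;# 26: interpreted as an integer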

Everything there works. Values are interpreted in different ways by different commands, without affecting the behaviour of other commands operating on the same value -- "type" is an extrinsic property, and this notion is used to good effect in plenty of Tcl code.

The problem with Tcl's approach is that it requires each command to know what interpretation it wants to give a value. This may seem obvious, but there are occasions where you want to perform an operation on a value where the behaviour of that operation (and the interpretation of the value) is defined elsewhere. For instance, suppose I have a value that represents a resource on the internet, which I want to fetch. Some example code might be:

set url1 "http://www.foo.com/index.html"
set url2 "ftp://ftp.tcl.tk/blah/foo.tar.gz"
# Get both URLs
http::get $url1
ftp::get $url2

(Assuming implementations of http::get and ftp::get exist.) Now, if we want to wrap this into a general proc that can handle any URL then we might do something like:

proc get {url} {
    regexp {([^:]+):(.*)$} $url -> proto rest
    switch $proto {
        http    { # Fetch via http }
        ftp     { # Fetch via ftp }
    }
}

This shows the problem - we need special purpose code that dissects the value and determines which protocol it refers to -- the "type" of URL it is -- and based on this interpretation calls the correct bit of code. In a typed language that supports runtime polymorphic dispatch, such as many OO languages, you might instead code this as:

URL url1 = URL.createURL("http://www.foo.com/index.html");
URL url2 = URL.createURL("ftp://ftp.tcl.tk/blah/foo.tar.gz");
url1.get();
url2.get();

In this example, different "types" of URL subclass would be returned by the factory method in each case (e.g., HttpURL and FtpURL classes), which provide the appropriate get() method. So, the interpretation of the URL is done once at creation time and then subsequent operations do not need to determine this for themselves.

In TOOT, you can have the best of both worlds. Values (representations) are not typed by default. However, you can associate a type with a value, to create a representation of an interpretation (in other words, a new value that associates a type with some other value), which can then be passed around. For instance:

set url "http://www.foo.com/index.html" ;# Un-typed representation
http::geturl $url ;# Command interprets $url as an HTTP URL
set http_url [Url create $url] ;# Returns {HttpUrl: http://www.foo.com/index.html}
$http_url get ;# Uses the interpretation

So TOOT allows you to take arbitrary representations, and package them up with an indication of how the value should be interpreted. But, crucially, this package becomes a new value with the type explicitly becoming part of the representation, rather than an intrinsic, behind-the-scenes property. This allows you to do clever things, like totally ignore the interpretation given. In TOOT, the interpretation (type) is a prefix that is a command name unique for each type you create.

In this scheme, the Tcl interpreter really is just that - it executes commands that apply operations to a value under some interpretation. For instance:

$http_url get
# Becomes
HttpUrl: http://www.foo.com/index.html get

Which applies the operation "get" to the value "http://www.foo.com/index.html" under the interpretation that the value is a URL using the HTTP protocol.
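As a rough illustration of that dispatch (a sketch, not TOOT's actual implementation), the type tag can be modelled as an ordinary proc whose name ends in a colon. The body of HttpUrl: and its placeholder return string are made up for this example, and the sketch assumes Tcl 8.5 or later so that {*} can expand the typed value into a command, rather than relying on whatever machinery TOOT itself uses to make a bare "$http_url get" work.

```tcl
# Hypothetical type command: the command name "HttpUrl:" is the type tag.
# It receives the wrapped value, a method name, and any further arguments.
proc HttpUrl: {value method args} {
    switch -- $method {
        get {
            # Placeholder: a real version would fetch the page here.
            return "fetching $value over HTTP"
        }
        default {
            error "HttpUrl: has no method \"$method\""
        }
    }
}

# A typed value is an ordinary two-element list: {type value}.
set http_url [list HttpUrl: http://www.foo.com/index.html]

# Expanding the list turns it into the command invocation
#   HttpUrl: http://www.foo.com/index.html get
puts [{*}$http_url get]

# The tag is explicit, so it can simply be ignored.
puts [lindex $http_url 1]
```

Because the typed value is just a list, the second puts recovers the untyped string without asking the type for permission.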

Comments, criticism, etc welcome.

[string repeat nod 1000] ! -jcw

One more comment from me: REBOL seems to have gone in this direction. It has Tcl's minimalism in syntax (well... almost), but it does associate types with strings. I think it's a bit like Tcl without its dual-rep & shimmering - or to put it in context: REBOL appears to do what TOOT does behind the scenes. The interesting aspect in REBOL is that type also seems to drive the parsing and precedence of parsing, somehow. REBOL's types are in C and not as explicit as TOOT, so perhaps harder to play tricks with it. One thing that does seem to come out of this approach is conciseness. You get the ability to say "$foo length", without having to specify $foo's type at every turn.

So the lesson so far seems to be: we need to base everything around <type,rep> tuples, right? Funny how - coming from a different way - this is OOP again!

NEM Interesting. I've never looked at REBOL (honest!), but it sounds similar. I'll have a look at what they do there. Regarding <type,rep> tuples: yup, that is the key of TOOT - packaging up a type with a value representation to create an interpretation. Yes, it is OOP in a way (hence the "OO" in TOOT). I must confess to having a soft-spot for OO as an idea. The problem with discussing OO though tends to be the mess of different (and often orthogonal) concepts that are associated with that term. This page, for instance, makes no reference to inheritance (or delegation), mutability of state, or a host of other concepts. Instead, I'm concentrating on how operations on values are interpreted in a given context. Of course, I have ideas for most of the others too, but they'll have to wait for other days!

Lars H: I like the discussion of intrinsic versus extrinsic interpretations. Sounds like something one should keep in mind when explaining Tcl to people coming from other languages. (But then I'm mostly in favour of the extrinsic approach, so I suppose I would place intrinsic interpretations under "prejudices you should let go of when you program in Tcl".) One place where Tcl does rely on intrinsic interpretations is in expr, for the distinction between "integer /" and "float /". I find that mostly a bad thing, since it means forgetting a .0 in some other part of the program can have rather unexpected consequences.
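The expr behaviour mentioned above is easy to reproduce. In this small sketch (the variable names are arbitrary) the only difference between the two operands is the trailing .0:

```tcl
set a 7
puts [expr {$a / 2}]   ;# integer division: prints 3

set b 7.0
puts [expr {$b / 2}]   ;# floating-point division: prints 3.5
```

Both values are still strings as far as the rest of Tcl is concerned; it is only expr that reads the form of the operand as a type.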

The URL example got me thinking, though: there seems to be more than one kind of intrinsicality. The scheme part (http, ftp, etc.) of an URL provides information about how the rest of the URL should be interpreted, so it serves as a "type". This type does come with the value, and is thus intrinsic, but it is also rather different from the TOOT interpretations and datatypes in other languages. Whereas URL schemes and the integer/float status of numbers are explicit parts of such values, classical data types rather tend to be (more or less) hidden. (Perhaps easily accessible to the typing system, but not part of its public information.) Thus one could say that there are three possibilities: extrinsic, explicit intrinsic, and implicit intrinsic.

jcw - Tcl == "everything is a string". TOOT == "everything is an interpretation". TOOT seems to come down to the recipe "look at the first (or second) list item, use it as type to define what to do with the rest". The "HttpUrl: http://www.foo.com/index.html get" example illustrates that there is room for ambiguity and redundancy still. It might be better to use "Url: http://www.foo.com/index.html get". Types can be complex beasts (nested, recursive even), but as far as TOOT is concerned, all that matters is a standard way of extracting a type from a complete value. That type determines what calls/methods there are. In the case of an URL, these may well decide to take the http: etc prefix first, and re-apply the method to the resulting subtype.

NEM The URL example was probably a bad choice, given that there is a "type" associated with a URL to begin with. Actually, TOOT is beginning to look a bit like URLs, with the new {type: data} syntax I've been leaning towards, so the URL could be represented as {http: //foo.com/blah.html}. In other words, simply introducing a space between the protocol and the path. Of course, you could break it down further, as jcw suggests with {url: {http: $path}} and probably further still (hostname, port, path, query etc). I'd disagree that TOOT == "everything is an interpretation". TOOT allows you to create interpretations, but doesn't force it; you can still pass around untyped values (strings), but you can also add arbitrary nestings of type identifiers to encapsulate an interpretation. You can also dynamically rearrange, add to, or ignore completely the type annotations attached to a value, as they are just normal values themselves (which happen to be command names). So, I guess, to use Lars's terminology, TOOT provides a method to convert extrinsically typed values to explicitly intrinsically typed values (what a mouthful!). But you can still discard the type information (because it is explicit) and treat the underlying value as extrinsically typed.
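The nested {url: {http: $path}} form can be sketched with two small procs. This is only an illustration, assuming Tcl 8.5+ for {*}; the procs url: and http: and the placeholder return string are hypothetical and not part of TOOT.

```tcl
# Hypothetical inner type: does the protocol-specific work.
proc http: {path method args} {
    switch -- $method {
        get     { return "HTTP GET of $path" }
        default { error "http: has no method \"$method\"" }
    }
}

# Hypothetical outer type: knows nothing about protocols and simply
# re-applies the method to the nested typed value.
proc url: {inner method args} {
    return [{*}$inner $method {*}$args]
}

set u [list url: [list http: //foo.com/blah.html]]
puts [{*}$u get]        ;# dispatches url: -> http:

# The annotations are plain list elements, so they can be peeled off.
puts [lindex $u 1]      ;# the inner typed value
puts [lindex $u 1 1]    ;# the bare path, with no type attached
```

Rearranging or dropping a tag is an ordinary list operation here, which is the sense in which the type is explicit rather than hidden.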

Lars H: Another aspect I would like to point out is the problem with binary operations. With extrinsic interpretations, this is no harder than unary operations, because it is the operation which decides how the operands should be interpreted. With intrinsic interpretations, any binary operation will have to deal with two interpretations, which may in principle be distinct. When the number of possible interpretations is small and fixed (e.g. number types in Tcl) it is usually possible to tabulate each pair of interpretations that may occur, but when that is not the case then things get messy.

NEM: This is interesting and brings up another point that I have been thinking about. When you have binary (or n-ary) operations, the key problem is usually finding a common interpretation for each argument. In the case of arithmetic operations, where one argument is a real number (float, double) and the other is an integer, you usually use a real number representation, and produce a real number result. This should be handled by an appropriate polymorphic type hierarchy, as integers strictly are a subset of real numbers, and so should be a sub-type (any integer IS-A real number, but the reverse is not true). Thus you could define a division operation as:

(Real a) / (Real b) -> Real

(to use a made-up type notation) and there should be no need to tabulate the different situations. It seems most languages' type systems are designed around implementation details though (so, there is an abstract Number type with distinct Integer and Real sub-types, or similar). However, things are rarely this simple, and even in this example, someone might point out that real numbers are themselves a strict sub-set of complex numbers, and so my operators would need to be redefined again. I think this illustrates the general principle that designing type hierarchies is hard, whereas defining operations (functions) is relatively simple. So, being able to delay classification (interpretation) of a value is useful, as is being able to reinterpret a value, and indeed to reinterpret types themselves (i.e. change the type hierarchy), as these are things unlikely to be got right first time. Statically typed languages tend to assume that the programmer always gets the type hierarchy correct from the start, which is rarely true, and may not even be possible.
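In plain Tcl, the made-up signature (Real a) / (Real b) -> Real corresponds roughly to an operation that imposes the real-number interpretation on both operands itself. A minimal sketch, where the proc name div is chosen just for this example:

```tcl
# The operation decides the interpretation: both operands are read as
# reals, regardless of whether they look like integers.
proc div {a b} {
    return [expr {double($a) / double($b)}]
}

puts [div 7 2]     ;# prints 3.5
puts [div 7.0 2]   ;# prints 3.5
```

No table of operand-type pairs is needed, because the caller's choice of "7" versus "7.0" no longer affects the result.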

PWQ 18 Oct 05: Going back to the original example: even though NEM acknowledges it was not the best example, it does show the fallacy in his thinking.

The claim is that the following:

URL url1 = URL.createURL("http://www.foo.com/index.html");
URL url2 = URL.createURL("ftp://ftp.tcl.tk/blah/foo.tar.gz");
url1.get();
url2.get();

Avoids the inevitable:

proc get {url} {
    regexp {([^:]+):(.*)$} $url -> proto rest
    switch $proto {
        http    { # Fetch via http }
        ftp     { # Fetch via ftp }
    }
}

Which is how we dumb procedural programmers would have to code the example.

Firstly, inside the URL class there must be the exact same code as shown above so that it can choose to return either a HTTP type url or a FTP type url.

Secondly, if I want to add a https: or news: protocol I know where it has to be placed: in the get proc. With URL it could be in there, or maybe it is inherited from some other class. Maybe ftp has overridden the base class of http.

Thirdly, and most importantly, the above example is not how the example should have been coded in this context. We have numerous options, some of them are:

regexp {([^:]+):(.*)$} $url -> proto rest
set data [$proto $rest]

The advantage of the above dispatching is that to make another protocol available you just create a proc:

proc https {url} { ....}
proc news {url} {.....}
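A slightly fuller sketch of this dispatch-by-proc-name idea is given here. It is an illustration only: the fetch namespace, the placeholder return strings, and the error messages are assumptions added for this example, and the handlers do no real network access. Keeping the handlers in their own namespace means a URL cannot name an arbitrary global command.

```tcl
namespace eval fetch {
    # One proc per protocol; adding a protocol means adding a proc here.
    proc http {rest} { return "http fetch of $rest" }
    proc ftp  {rest} { return "ftp fetch of $rest" }
}

proc get {url} {
    # Split "proto:rest"; the protocol name selects the handler.
    if {![regexp {^([^:]+):(.*)$} $url -> proto rest]} {
        error "not a URL: \"$url\""
    }
    set cmd ::fetch::$proto
    if {[namespace which -command $cmd] eq ""} {
        error "unsupported protocol \"$proto\""
    }
    return [$cmd $rest]
}

puts [get http://www.foo.com/index.html]
puts [get ftp://ftp.tcl.tk/blah/foo.tar.gz]
```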

Take the code from Let unknown know,

know {urlregexp} {create url dispatch proc ....}

set data [http://no.need.to/decode.url]

There may be a justification for the expense of TOOT, but the examples to date have not shown it.

NEM - 18 Oct 2005 - Yes, you could do things this way. There are essentially two ways of achieving data-directed programming (or polymorphism): one is to do pattern matching in individual operations; and the other is to do dispatch on the type. These correspond to the two alternatives:

 # Pattern matching:
 proc get {url} {
    switch -regexp $url {
        http://(.*)   { ...HTTP handler... }
        ftp://(.*)    { ...FTP handler...  }
        ...
    }
 }
 proc put {url data} {
    switch -regexp $url {
        http:// ... etc
    }
 }

You can match against strings, or you can do algebraic pattern matching or any other pattern matching you want. The advantage of this is that it is easy to add new operations. The other alternative is the type-oriented version:

 type define http {
    method get {url} { ... }
    method put {url data} { ... }
 }
 type define ftp {
    method get ... etc
 }
 proc display {url} {
    set html [$url get]
    .html parse $html
 }

The advantage of this is that it is easy to add new types (protocols in this case). All TOOT does is provide a simple uniform way of doing the latter type of dispatch without resorting to opaque handles (as in most OO systems). It does this by tagging data with a "type" which happens to be an ensemble that implements some interface of operations for that data. That's all. Now, the URL example was poorly chosen because URLs are already tagged with the protocol, and they have a standard format that describes how this protocol name can be extracted from the data. What TOOT does is generalise this so that the standard format is just Tcl's usual command format (i.e. [type data args..]) and so dispatch becomes easier and more robust. A more useful example might be to consider two different ways of implementing an abstract data type, for example an environment mapping names to values (as might be used in an interpreter implementation):

 # Stores mapping as a dictionary
 type define DictEnv data {
     proc create {} { return [dict create] }
     method set {name value} {
         dict set data $name $value
     }
     method get {name} {
         dict get $data $name
     }
 }
 # Stores mapping as two lists {names values} (i.e. column-oriented)
 type define ListEnv names values {
     proc create {} { return [list {} {}] }
     method set {name value} {
         lappend names $name
         lappend values $value
         return [list $names $values]
     }
     method get {name} {
         set idx [lsearch -exact $names $name]
         return [lindex $values $idx]
     }
 }
 set denv [DictEnv: {name "Neil" age 24}]
 set lenv [ListEnv: {name age} {"Neil" 24}]
 proc Eval {term env} {
    type switch $term {
       match Var: name  { return [$env get $name] }
       ...
    }
 }

Here "Eval" can be defined only in terms of the interface that it requires of the environment instead of the actual representation, leading to greater flexibility. Sure, there are plenty of other ways that you could do this, but I think that the OO message-passing way, where data can be wrapped up as commands, is a very flexible and useful way. You couldn't distinguish the above two types with a simple regexp, so you would need to tag the values somehow to distinguish the types in the procedure. Also, using regexp or other string matching for type-dispatch is (a) inefficient, and (b) exposes implementation details. (The code above is based on a currently unreleased version of TOOT I'm playing with, similar in some ways to that at Monadic TOOT. Note that it supports both forms of dispatch I described earlier, and is indeed based on algebraic data types. The current version is also much more efficient, due to some simple rearrangements that avoid a trip through unknown.)
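Since the "type define" code above depends on an unreleased TOOT, the same idea can be approximated with plain procs. This is only a sketch under stated assumptions: Tcl 8.5+ (for dict, lassign and {*}), one hand-written proc per type in place of "type define", and a hypothetical helper called lookup standing in for the part of Eval that consults the environment.

```tcl
# Each type command takes its representation first, then a method name.

# Representation: a single dictionary.
proc DictEnv: {data method args} {
    switch -- $method {
        get {
            return [dict get $data [lindex $args 0]]
        }
        set {
            lassign $args name value
            dict set data $name $value
            return [list DictEnv: $data]
        }
        default { error "DictEnv: has no method \"$method\"" }
    }
}

# Representation: two parallel lists, names and values.
proc ListEnv: {names values method args} {
    switch -- $method {
        get {
            set idx [lsearch -exact $names [lindex $args 0]]
            if {$idx < 0} { error "no such name \"[lindex $args 0]\"" }
            return [lindex $values $idx]
        }
        set {
            # Appends only; like the listing above, it does not replace
            # an existing binding.
            lassign $args name value
            lappend names $name
            lappend values $value
            return [list ListEnv: $names $values]
        }
        default { error "ListEnv: has no method \"$method\"" }
    }
}

# Uses only the shared get interface, never the representation.
proc lookup {env name} {
    return [{*}$env get $name]
}

set denv [list DictEnv: {name Neil age 24}]
set lenv [list ListEnv: {name age} {Neil 24}]
puts [lookup $denv age]
puts [lookup $lenv age]
```

Both calls go through the same lookup proc, and each set method returns a new tagged value rather than modifying anything in place.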

You seem to have a chip on your shoulder about procedural vs. OO (or even FP) programming. I'm not sure where this comes from. I have certainly never called procedural programming "dumb" and resent the implication. I, and others, find enjoyment and enlightenment in mastering a range of different styles of programming, including procedural, OOP, FP, logic programming, constraint programming, concurrent dataflow programming and a whole host of other styles. Each has added to my understanding of the craft of software engineering. If you don't like OO, or don't like TOOT, or any other technique, then feel free not to use them. I would like to point out, though, that your final example using "unknown" is basically a recreation of half of TOOT tailored for URLs, so how TOOT is an "expense" but that isn't, is beyond me.

Category Concept

Category Object Orientation