data is code

(Page begun by Lars H, 11 March 2007; feel free to expand.)

Data is code is a powerful programming technique for working with complicated data, especially tree-like data. The idea is that the data structure itself can also be regarded as a command, so that traversing a data structure can simply be to evaluate it! By providing different definitions (of the basic node types) in different namespaces, it is possible to implement a variety of different operations as evaluations of the same data structure.

The idea supports polymorphism to about the same degree as Object-oriented Programming, but does so in a quite different way. While it in OOP is the data (objects) that have to know all the operations (methods) that may act upon it, data-is-code requires that the operations (in this setting: namespaces) know all the types of data that they may encounter. Both support extensions to about the same degree:

  • In OOP, a new type can be implemented as a new class, whereas a new operation can be implemented as a new method on the ancestor class.
  • With data-is-code, a new type is implemented by defining commands for it in all operation namespaces, whereas a new operation is implemented as a new namespace.

Where OOP can use inheritance to share some operation definitions (make them the same as for some other type), data-is-code can do the same by creating aliases. The main difference is that objects, being commands even in cases where they're primarily used as containers for data, need to be explicitly destroyed and therefore objects add the burden of lifetime management. Pure data that is sometimes treated as code does not have that problem.

TOOT is a bit of a data is code/OO hybrid. Its basic mechanism is to treat data as code (and thus doesn't have the lifetime management problem), but the only operation implemented is OO-style method dispatch.


An example: Algebraic expressions

An example which illustrates many of these points can be provided by a data structure for simple algebraic expressions. Suppose an expression is a list on one of the forms

+ expression expression
- expression expression
* expression expression
/ expression expression
var name
const value

With the obvious interpretation of +, -, etc., we find that the expression (2x-1)/(x+3) is represented by the data structure

  / {- {* {const 2} {var x}} {const 1}} {+ {var x} {const 3}}

So far, this is just a data structure; there are no operations which act on it. (The +, -, and friends are node types in the general terminology above.) A very natural operation would however be to evaluate the expression, which can be implemented as follows:

  namespace eval algexpr::eval {
     proc + {term1 term2} {
        expr {[eval $term1] + [eval $term2]}
     }
     proc - {term1 term2} {
        expr {[eval $term1] - [eval $term2]}
     }
     proc * {factor1 factor2} {
        expr {[eval $factor1] * [eval $factor2]}
     }
     proc / {numer denom} {
        expr {[eval $numer] / double([eval $denom])}
     }
     proc const {val} {return $val}
     proc var {name} {
        upvar #0 $name val
        return $val
     }
  }

The command to actually evaluate an expression $datum is then

  namespace inscope algexpr::eval $datum

(or [namespace eval algexpr::eval $datum], but the syntax of namespace inscope is often more appropriate).

  % set datum {/ {- {* {const 2} {var x}} {const 1}} {+ {var x} {const 3}}}
  % set x 1
  % namespace inscope algexpr::eval $datum
  0.25
  % set x 2
  % namespace inscope algexpr::eval $datum
  0.6
  % set x 3
  % namespace inscope algexpr::eval $datum
  0.833333333333

Works fine, but it is probably not all that practical to fetch the variable values from global variables. A more natural construction is to provide them in the evaluation call, but that requires a different set of definitions, and is thus a different operation.

  namespace eval algexpr::eval2 {
     proc + {term1 term2 args} {
        expr {[eval $term1 $args] + [eval $term2 $args]}
     }
     proc - {term1 term2 args} {
        expr {[eval $term1 $args] - [eval $term2 $args]}
     }
     proc * {factor1 factor2 args} {
        expr {[eval $factor1 $args] * [eval $factor2 $args]}
     }
     proc / {numer denom args} {
        expr {[eval $numer $args] / double([eval $denom $args])}
     }
     proc const {val args} {return $val}
     proc var {name args} {
        foreach {var val} $args {
           if {$name eq $var} then {return $val}
        }
        error "No value specified for '$name'"
     }
  }

Commands corresponding to the above are now

  % namespace inscope algexpr::eval2 $datum x 1
  0.25
  % namespace inscope algexpr::eval2 $datum x 2
  0.6
  % namespace inscope algexpr::eval2 $datum x 3
  0.833333333333

and it's not much harder to handle expressions involving several variables

  % set datum2 {/ {- {* {const 2} {var x}} {const 1}} {+ {var y} {const 3}}}
  % namespace inscope algexpr::eval2 $datum2 x 1 y 2
  0.2

(Using namespace eval instead would require list-quoting the extra arguments, which is a bit artificial.)

Evaluating expressions is a rather simple operation however, so what about something trickier, like differentiation? This would mean that

  namespace inscope algexpr::diff $datum $var

returns the derivative of $datum with respect to $var. The basic differentiation rules are straightforward to implement:

  namespace eval algexpr::diff {
     proc + {term1 term2 var} {
        list + [eval $term1 [list $var]] [eval $term2 [list $var]]
     }
     proc - {term1 term2 var} {
        list - [eval $term1 [list $var]] [eval $term2 [list $var]]
     }
     proc * {factor1 factor2 var} {
        list + [
           list * [eval $factor1 [list $var]] $factor2
        ] [
           list * $factor1 [eval $factor2 [list $var]]
        ]

     }
     proc / {factor1 factor2 var} {
        list / [
           list - [
              list * [eval $factor1 [list $var]] $factor2
           ] [
              list * $factor1 [eval $factor2 [list $var]]
           ]
        ] [
           list * $factor2 $factor2
        ]
     }
     proc const {val var} {
        list const 0
     }
     proc var {name var} {
        list const [expr {$name eq $var}]
     }
  }

So from this one may get

  % set ddx_datum [namespace inscope algexpr::diff $datum x]
  / {- {* {- {+ {* {const 0} {var x}} {* {const 2} {const 1}}} {const 0}} {+ {var x} {const 3}}} {* {- {* {const 2} {var x}} {const 1}} {+ {const 1} {const 0}}}} {* {+ {var x} {const 3}} {+ {var x} {const 3}}}
  % namespace inscope algexpr::eval2 $ddx_datum x 2
  0.28

which fits well with the values computed before.

What does the $ddx_datum expression look like in more traditional notation, though? Conversion to expr syntax is of course another operation, so now that we're warmed up, it is probably easier to implement it than to decipher the $ddx_datum by hand. The only issue one needs to solve is how to know when to insert parentheses, but that's easily handled by specifying the operation to be to

  • convert an algexpr to expr form such that it can safely be made an operand of an expr-operation with priority priority, where 0 means no operation, 1 means +, 2 means *, and 3 is as denominator.

The following will then do the trick

  namespace eval algexpr::toexpr {
     proc + {term1 term2 priority} {
        set res "[eval $term1 [list 1]]+[eval $term2 [list 1]]"
        if {$priority>1} then {
           return ($res)
        } else {
           return $res
        }
     }
     proc - {term1 term2 priority} {
        set res "[eval $term1 [list 1]]-[eval $term2 [list 2]]"
        if {$priority>1} then {
           return ($res)
        } else {
           return $res
        }
     }
     proc * {factor1 factor2 priority} {
        set res "[eval $factor1 [list 2]]*[eval $factor2 [list 2]]"
        if {$priority>2} then {
           return ($res)
        } else {
           return $res
        }
     }
     proc / {numer denom priority} {
        set res "[eval $numer [list 2]]/[eval $denom [list 3]]"
        if {$priority>2} then {
           return ($res)
        } else {
           return $res
        }
     }
     proc var {name priority} {
        if {[regexp -nocase -- {^[a-z_][a-z0-9_]*$} $name]} then {
           return "\$$name"
        } else {
           return "\[[list set $name]\]"
        }
     }
     proc const {val priority} {return $val}
  }

The right command to reformat our first $datum is thus

  % namespace inscope algexpr::toexpr $datum 0
  (2*$x-1)/($x+3)

Works like a charm! Now what about that derivative?

  % namespace inscope algexpr::toexpr $ddx_datum 0
  ((0*$x+2*1-0)*($x+3)-(2*$x-1)*(1+0))/(($x+3)*($x+3))

Hmmm... It's correct, but not very nice -- there are a lot of constant subexpressions that any human would have evaluated during the differentiation operation. One way to get rid of these would be to improve the differentiation operation to do these simplifications on the fly, but another is to implement a more generic operation for partial evaluation of expressions. The actual situations one has to deal with are the same in both cases, so the more generic operation seems preferable.

  namespace eval algexpr::parteval {
     proc const {val args} {list const $val}
     proc var {name args} {
        foreach {var val} $args {
           if {$name eq $var} then {return [list const $val]}
        }
        return [list var $name]
     }
     # These two were the central ones, but most of the actual 
     # work remains, since there are many small cases where
     # partial evaluation is possible.
     proc + {term1 term2 args} {
        set res1 [eval $term1 $args]
        foreach {t1 v1} $res1 break
        set res2 [eval $term2 $args]
        foreach {t2 v2} $res2 break
        if {$t1 eq "const"} then {
           if {$t2 eq "const"} then {
              list const [expr {$v1 + $v2}]
           } elseif {$v1==0} then {
              set res2
           } else {
              list + $res2 $res1
           }
        } elseif {$t2 eq "const" && $v2==0} then {
           set res1
        } else {
           list + $res1 $res2
        }
     }
     proc - {term1 term2 args} {
        set res1 [eval $term1 $args]
        foreach {t1 v1} $res1 break
        set res2 [eval $term2 $args]
        foreach {t2 v2} $res2 break
        if {$t1 eq "const"} then {
           if {$t2 eq "const"} then {
              list const [expr {$v1 - $v2}]
           } else {
              list - $res1 $res2
           }
        } elseif {$t2 eq "const" && $v2==0} then {
           set res1
        } else {
           list - $res1 $res2
        }
     }
     proc * {factor1 factor2 args} {
        set res1 [eval $factor1 $args]
        foreach {t1 v1} $res1 break
        set res2 [eval $factor2 $args]
        foreach {t2 v2} $res2 break
        if {$t1 eq "const"} then {
           if {$t2 eq "const"} then {
              list const [expr {$v1 * $v2}]
           } elseif {$v1==0} then {
              list const 0
           } elseif {$v1==1} then {
              set res2
           } else {
              list * $res1 $res2
           }
        } elseif {$t2 eq "const"} then {
           if {$v2==0} then {
              list const 0
           } elseif {$v2==1} then {
              set res1
           } else {
              list * $res2 $res1
           }
        } else {
           list * $res1 $res2
        }
     }
     proc / {numer denom args} {
        set res1 [eval $numer $args]
        foreach {t1 v1} $res1 break
        set res2 [eval $denom $args]
        foreach {t2 v2} $res2 break
        if {$t2 eq "const"} then {
           if {$t1 eq "const"} then {
              list const [expr {$v1 / double($v2)}]
           } elseif {$v2==1} then {
              set res1
           } else {
              list / $res1 $res2
           }
        } else {
           list / $res1 $res2
        }
     }
  }

Some examples:

  % namespace inscope algexpr::parteval $datum x 2
  const 0.6
  % namespace inscope algexpr::parteval $datum2 x 2
  / {const 3} {+ {var y} {const 3}}

So, what about that ugly derivative?

  % set ddx_datum_simplified [namespace inscope algexpr::parteval $ddx_datum]
  / {- {* {const 2} {+ {var x} {const 3}}} {- {* {const 2} {var x}} {const 1}}} {* {+ {var x} {const 3}} {+ {var x} {const 3}}}
  % namespace inscope algexpr::toexpr $ddx_datum_simplified 0
  (2*($x+3)-(2*$x-1))/(($x+3)*($x+3))

Better, but still redundant. (Simplifying algebraic expressions to about the degree expected from a math student is really rather complicated, once you have to instruct a computer to do it, and this representation with binary +, -, *, and / is not very practical for the task. Better solutions are beyond the simple example, however.)

Let us instead consider how to extend the algexpr data structure with a new node type, say exp for the (natural base) exponential function. This requires defining exp procedures in all the namespaces where this node type may be expected:

  namespace eval algexpr::eval {
     proc exp {arg} {expr {exp([eval $arg])}}
  }
  namespace eval algexpr::eval2 {
     proc exp {arg args} {expr {exp([eval $arg $args])}}
  }
  namespace eval algexpr::diff {
     proc exp {arg var} {
        list * [list exp $arg] [eval $arg [list $var]]
     }
  }
  namespace eval algexpr::toexpr {
     proc exp {arg priority} {return "exp([eval $arg [list 0]])"}
  }
  namespace eval algexpr::parteval {
     proc exp {arg args} {
        set ares [eval $arg $args]
        if {[lindex $ares 0] eq "const"} then {
           list const [expr {exp([lindex $ares 1])}]
        } else {
           list exp $ares
        }
     }
  }

That's all there is to it:

  % set datum3 {exp {/ {var x} {var y}}}
  % namespace inscope algexpr::eval2 $datum3 x 2 y 1
  7.38905609893
  % set ddy_datum3 [namespace inscope algexpr::diff $datum3 y]
  * {exp {/ {var x} {var y}}} {/ {- {* {const 0} {var y}} {* {var x} {const 1}}} {* {var y} {var y}}}
  % namespace inscope algexpr::toexpr $ddy_datum3 0
  exp($x/$y)*(0*$y-$x*1)/($y*$y)
  % set ddy_datum3 [namespace inscope algexpr::parteval $ddy_datum3]
  * {exp {/ {var x} {var y}}} {/ {- {const 0} {var x}} {* {var y} {var y}}}
  % namespace inscope algexpr::toexpr $ddy_datum3 0
  exp($x/$y)*(0-$x)/($y*$y)

Sharing commands

The examples above never gave an opportunity to use the same command for more than one thing, but using interp alias or namespace import to make one command appear in several places is a standard Tcl programming technique that is explained elsewhere. There are however some special issues one should take note of:

Avoid colons in node types
Colons interact strangely with the namespace separator ::, so you may be unable to make use of, or even speak about, the command you want.
Be aware of the namespace context
If you import or alias a proc into a namespace, then the namespace context of this proc does not follow. For set-ups such as the above, where the namespace context is an essential foundation for the interpretation, this can be disastrous.

There is a rather simple solution to the namespace problem, however: use uplevel instead of eval! This works because it causes a child node to be interpreted as a command in the same context as the parent node was, so if for example the child and parent nodes are of the same type then they will both be handled by the same command.

If this still seem mysterious to you, then remember that namespace eval and kin (e.g. namespace inscope) adds a level visible to uplevel. The following provides an example:

  namespace eval ::foo {}
  proc ::a {} {
     puts "- I'm ::a, called as [info level 0]."
     return ::a
  }
  proc ::foo::a {} {
     puts "- I'm ::foo::a, called as [info level 0]."
     return ::foo::a
  }
  proc ::b {} {
     puts -nonewline "- I'm ::b, called as [info level 0]; "
     puts "my namespace is [namespace current]."
     puts "::b says: I've just called [uplevel 1 a]."
  }
  interp alias {} ::foo::b {} ::b

The call [b] prints

  - I'm ::b, called as b; my namespace is ::.
  - I'm ::a, called as a.
  ::b says: I've just called ::a.

and the call [namespace inscope foo b] prints

  - I'm ::b, called as ::b; my namespace is ::.
  - I'm ::foo::a, called as a.
  ::b says: I've just called ::foo::a.

because b calls the a visible in the namespace from which b itself was called, even if that isn't where b resides.

An alternative approach to the namespace problem, which can be useful for generic utility commands (that via aliases with some extra arguments can implement a whole score of commands), is to supply the intended namespace as an extra argument. Then if one defines

  proc exprbinop {ns op left right args} {
     set leftval [namespace eval $ns $left $args]
     set rightval [namespace eval $ns $right $args]
     expr \$leftval $op \$rightval
  }

the algexpr::eval and algexpr::eval2 implementations of +, -, and * can all be given as aliases to this exprbinop:

  interp alias {} algexpr::eval::+ {} exprbinop algexpr::eval +
  interp alias {} algexpr::eval::- {} exprbinop algexpr::eval -
  interp alias {} algexpr::eval::* {} exprbinop algexpr::eval *
  interp alias {} algexpr::eval2::+ {} exprbinop algexpr::eval2 +
  interp alias {} algexpr::eval2::- {} exprbinop algexpr::eval2 -
  interp alias {} algexpr::eval2::* {} exprbinop algexpr::eval2 *

Using uplevel is however an easier solution, as long as one can predict what level to uplevel to.


Handling unknown node types

In some cases, the set of possible node types is not known from start, but all except a known set of nodes are to be given default processing (which is typically to do nothing here, but recursively process all the children). This is a problem for the namespace inscope technique, since there is no way to stop Tcl from looking outside the namespace for command definitions: namespace path can be used to make Tcl look in additional namespaces, and namespace unknown can be used to override the generic unknown, but there is no way to prevent commands in the :: namespace from being found as legitimate implementations of a the operation on a node. If list, set, expr, or the like can occur as node types then this means you're in trouble.

One way to get around this could be to make sure that the :: namespace is empty, by using an empty interpreter like so:

  interp create -safe dispatch
  dispatch hide namespace
  dispatch invokehidden namespace delete ::

after which you can create your node types as aliases from dispatch to your main interpreter, and process a node by doing

dispatch invokehidden namespace inscope namespace node ?arg ...?

but this is rather complicated.

As of Tcl 8.5, a better approach is to use an ensemble to do the dispatch, since this does not look outside its given namespace. The set-up closest to that of namespace inscope is to do

  namespace eval $operation {
      namespace export *
      namespace ensemble create -prefix 0
  }

after which the counterpart of

  namespace inscope $operation $node $arg

is

  $operation {*}$node $arg

e.g.

  algexpr::eval2 {*}$datum x 1

instead of

  namespace inscope algexpr::eval2 $datum x 1

This will throw an error when the node type is unknown; to do something else one must supply an -unknown handler for the ensemble.

A problem with this set-up is however that it is less obvious how to make recursive calls for children; an eval or uplevel on a child will not reinvoke the ensemble (thus losing the protection against unknown node types). Hardcoding the fully qualified name of the ensemble where one needs to make a recursive call is one option, but that gets inconvenient for nested namespaces. When the ensemble (as is the default) has the same name as its namespace, its name can be obtained as [namespace current], e.g.

     proc + {term1 term2 args} {
        expr {[[namespace current] {*}$term1 {*}$args] + [[namespace current] {*}$term2 {*}$args]}
     }

but this is prolix too, and leads to shimmering (command look-up cannot be cached, as the name is regenerated whenever it is used). A better approach is often to place the ensemble command inside its own namespace, as this makes it callable without namespace qualification. Using the empty string as name of the ensemble command, this can look as follows:

  namespace eval algexpr::eval2 {
     proc + {term1 term2 args} {
        expr {["" {*}$term1 {*}$args] + ["" {*}$term2 {*}$args]}
     }
     proc - {term1 term2 args} {
        expr {["" {*}$term1 {*}$args] - ["" {*}$term2 {*}$args]}
     }
     proc * {factor1 factor2 args} {
        expr {["" {*}$factor1 {*}$args] * ["" {*}$factor2 {*}$args]}
     }
     proc / {numer denom args} {
        expr {["" {*}$numer {*}$args] / double(["" {*}$denom {*}$args])}
     }
     proc const {val args} {return $val}
     proc var {name args} {
        foreach {var val} $args {
           if {$name eq $var} then {return $val}
        }
        error "No value specified for '$name'"
     }
     namespace export ?*
     namespace ensemble create -command [namespace current]:: \
        -prefix 0 -unknown [list ::default_to_type +]
     interp alias {} [namespace current] {} [namespace current]::
  }
  proc default_to_type {type ensemble args} {list $ensemble $type}

The interp alias is so that the ensemble can still be called by external users as if it had the same name as its namespace. Using ?* as export pattern means all commands in the namespace except the ensemble command itself are available as node types. The -unknown handler means all unknown node types are treated as if they were +:

  % algexpr::eval2 + {const 1} {var x} x 2
  3
  % algexpr::eval2 - {const 1} {var x} x 2
  -1
  % algexpr::eval2 / {const 1} {var x} x 2
  0.5
  % algexpr::eval2 & {const 1} {var x} x 2
  3

Traversals with state

Sometimes, it is important to maintain a state when traversing a data structure, and update that state for each node processed. That each node is processed in a separate local context makes it trickier to maintain such a state than it would be if a single procedure call did all the work, but there are some approaches one can use:

  1. Use a global or namespace variable. Inelegant, but often easy and good enough.
  2. Use a local variable that is accessed using upvar. Gets into trouble if uplevel or namespace eval is being used as shown in the previous section.
  3. Pass the previous state as a command argument and return it as the command result. This is the functional programming idiom.

Collecting children in a sublist or placing them directly the node list

The data in a node can be categorised as:

  1. Node type.
  2. Annotations (data not in the form of a node).
  3. Children (data in node format).

In the algebraic examples above, the const and var nodes have annotations (value and variable names respectively) but no children, whereas the other node types have children but no annotations. The grammar_fa example (see bullet list below) is similar, but adds the complication that some nodes can have an arbitrary number of children. In the parsetcl example, all nodes have two pieces of annotation and in addition many have a variable number of children. Most general is the data structure employed by a little XML parser, since a typical node in that case can have an arbitrary number of annotations and an arbitrary number of children.

The issue considered here is how to best design the list structure for the various node types. If the number of annotations and children is determined by the node type, then it probably seems natural to just lay them out in a flat structure

  nodetype annot1 annot2 ... annotM child1 child2 ... childN

since it is then trivial to name all the arguments in the proc definition

  proc $nodetype {annot1 annot2 ... annotM child1 child2 ... childN} { # ...

-- no need to lassign or such, and you get good error messages for free.

It is possible to use this design also for node types with a variable number of children, since the proc definition can just as easily be

  proc $nodetype {annot1 annot2 ... annotN args} { # ...

(this was very much done in parsetcl), but in these cases it is better to have just one argument which is the list of children, like so:

  proc $nodetype {annot1 annot2 ... annotN children} { # ...

Why is that? Because it simplifies operations which take arguments (like eval2, parteval, diff, and toexpr above)! Consider what would happen if + nodes looked like

+ ?term ...?

(variadic addition is actually more natural for algebra). The eval operation is straightforward

  proc algexpr::eval::+ {args} {
     set sum 0
     foreach term $args {
        set sum [expr {$sum + [eval $term]}]
     }
     return $sum
  }

but the diff operation is much trickier due to the exceptional final argument holding the differentiation variable name

  proc algexpr::diff::+ {args} {
     set var [lindex $args end]
     set res [list +]
     foreach term [lrange $args 0 end-1] {
        lappend res [eval $term [list $var]]
     }
     return $res
  }

and the eval2 operation is simply impossible -- there is no sure way to distinguish variable-value arguments from child node arguments.

This problem does not arise if the syntax had instead been

sum term-list

(the + node type is already assigned to binary addition) since the implementations are then

  proc algexpr::diff::sum {terms var} {
     set res [list sum]
     foreach term $terms {
        lappend res [eval $term [list $var]]
     }
     return $res
  }
  proc algexpr::eval2::sum {terms args} {
     set sum 0
     foreach term $terms {
        set sum [expr {$sum + [eval $term $args]}]
     }
     return $sum
  }

A little XML parser has got this right, with one argument that is a dictionary of annotations and another that is the list of children. After that can follow any number of operation arguments, and it is easy to see which is which.

Still, it sometimes happens that one has to implement operations on data in a legacy format, with the list of children expanded into the node list. What can one do then? One possibility is to first implement an operation converting the data to a child-list format, because that operation doesn't need arguments, and then perform the wanted operation on the converted data. Another possibility is to design the argument structure for the operation so that there is an explicit end-of-children argument, like -- to signal end of switches. In this case, the right form of an end-of-children argument is an empty list, since every child must be a list of length at least 1. In that case, one could define

  proc algexpr::eval3::+ {args} {
     set n 0
     foreach term $args {
        if {[llength $term] < 1} then {break}
        incr n
     }
     set sum 0
     foreach term [lreplace $args $n end] {
        set sum [expr {$sum + [eval $term [lrange $args $n end]]}]
     }
     return $sum
  }

and make calls like

  namespace inscope algexpr::eval3 $datum {} x 1

but the need for the extra end-of-children argument means this is not the same operation as eval2.


Discussion

male - 2007-03-12:

Thanks for your very detailed data is code page!

My best example for data is code was my way to read VRML files, to import their structure and their geometrical data:

  1. Read the file data
  2. Convert the file data to a valid list
  3. Reorganize the file data list to have all VRML node attributes in sub lists
  4. Evaluate this list with the scheme, that each VRML node name could be the name of a tcl command or procedure. If such a tcl procedure exists, than evaluate it with the VRML node attributes as arguments, if not ignore this VRML node.

Such becomes VRML data executable code.


NEM 12 Mar 2007: Thanks, Lars, this is a useful technique to document. I've only skimmed at present, so apologies if some of these remarks are covered. An alternative to using a namespace would be to just use a switch (or some form of pattern matching) inside a proc. For instance, the first expression evaluator might be written:

 proc algexpr::calc expr {         
    lassign $expr op a b
    switch -exact -- $op {
        +     { expr {[calc $a] + [calc $b]} }
        -     { expr {[calc $a] - [calc $b]} }
        *     { expr {[calc $a] * [calc $b]} }
        /     { expr {[calc $a] / double([calc $b])} }
        const { return $a }
        var   { upvar #0 $a v; return $v }
    }     
 }

A sugared version of this appears in a lambda calculus interpreter with arithmetic.

Lars H: The only catch in that is that it doesn't really do data is code, as the data isn't treated as code; in a sense, it is precisely the more traditional technique that data-is-code provides an alternative to!

NEM: Right, but what are the advantages of manipulating data as Tcl code as opposed to just treating it as a bit of data? Speed? Leveraging the existing Tcl interpreter code? The above proc is a bare-bones interpreter for a minimal language of arithmetic expressions, so its data is also code. Code is just data, after all. (The "real" dichotomy would be between data and processes, where code could be seen as data that describes a process).

Lars H: Speed and leveraging existing mechanisms are indeed reasons why this is useful. Another reason is extensionality (how do you tell algexpr::calc about the exp node type?). As for "code is data"... well, that would be the subject of another page (every now and then I mix the two up too).

NEM: Code really is just data... But anyway, to the other points: I'm not so sure about the speed, although I'm prepared to believe it for more complex examples. For example, transforming XML data into Tcl code could result in a quite fast parser, but it's tricky to do correctly, and there are of course issues with trust -- Tcl is a much more powerful language than a custom little interpreter. As to the question of extensibility, it is actually fairly simple to extend a switch-based interpreter without altering the original with a little forethought. All you need to do is to use open-recursion and an explicit fixpoint operator (here we add support for environments too; dict required):

 proc calc {self expr env} {
     lassign $expr op a b
     switch -exact -- $op {
         +       { expr {[call $self $a $env] + [call $self $b $env]} }
         -       { expr {[call $self $a $env] - [call $self $b $env]} }
         *       { expr {[call $self $a $env] * [call $self $b $env]} }
         /       { expr {[call $self $a $env] / 
                             double([call $self $b $env])} }
         const   { return $a }
         var     { dict get $env $a }
         default { error "unknown op \"$op\"" }
     }
 }
 # fixpoint operator (similar to Y combinator):
 proc call {cmd args} { uplevel #0 $cmd [linsert $args 0 $cmd] }

We can then call our command using the fixpoint operator (which is quite similar in some ways to TOOT dispatch):

 % set datum {/ {- {* {const 2} {var x}} {const 1}} {+ {var x} {const 3}}}
 / {- {* {const 2} {var x}} {const 1}} {+ {var x} {const 3}}
 % call calc $datum {x 1}
 0.25

Extending the language to support an exp operation is now a simple task, much like extending an OO program:

 proc calcexp {self expr env} {
     lassign $expr op arg
     switch -exact -- $op {
         exp     { expr {exp([call $self $arg $env])} }
         default { calc $self $expr $env }
     }
 }

And to test:

 % set datum3 {exp {/ {var x} {var y}}}
 exp {/ {var x} {var y}}
 % call calcexp $datum3 {x 2 y 1}
 7.38905609893065

Note that our original language is still available, so only those callers who want the extended language have to get it:

 % call calc $datum3 {x 2 y 1}
 unknown op "exp"

NEM (A few minutes later): It's also quite simple to extend the language in different directions. For example (and because it's fun!), here's how to add lexically-scoped lambda expressions to the language (something Tcl itself doesn't have built-in):

 proc calclam {self expr env} {
     lassign $expr op a b
     switch -exact -- $op {
         lambda  { list closure $a $b $env }
         app     { calcapp $self [call $self $a $env] [call $self $b $env] }
         default { calc $self $expr $env }
     }
 }
 proc calcapp {self f x} {
     lassign $f op arg body env
     switch -exact -- $op {
         closure { call $self $body [dict replace $env $arg $x] }
         default { error "not a closure: \"$f\"" }
     }
 }

And the test:

 % set f {lambda x {lambda y {* {var x} {var y}}}}
 lambda x {lambda y {* {var x} {var y}}}
 % set exp [list app [list app $f {const 3}] {const 4}]
 app {app {lambda x {lambda y {* {var x} {var y}}}} {const 3}} {const 4}
 % call calclam $exp {} 
 12

NEM (cont.) As you mention at the start, an OO approach makes it easy to add new types (just add a new sub-class and implement the operations), but trickier to add new operations, whereas a pattern matching approach makes it easier to add new operations but harder to add new data types (as you need to update all the operations). This is apparently known as the expression problem in programming language design, and there are a number of ways in which it is tackled. The usual OO way is the Visitor pattern, which essentially implements pattern matching over OO dispatch. In this case you'd have something like the following (psuedo-code):

 class Expr {
     abstract method visit {visitor}
 }
 class Num val {
     inherit Expr
     method visit {visitor} { $visitor visitNum $val }
 }
 class AddOp a b {
     inherit Expr
     method visit {visitor} { $visitor visitAddOp $a $b }
 }
 ...
 class ExprVisitor {
     abstract method visitNum {val}
     abstract method visitAddOp {a b}
     ...
 }
 class ExprInterp {
     inherit ExprVisitor
     method visitNum {val} { return $val }
     method visitAddOp {a b} { expr {[$a visit $self] + [$b visit $self]} }
     ...
 }
 [AddOp [AddOp [Num 2] [Num 5]] [Num 4]] visit [ExprInterp] ;# => 11

Pattern matching can also be made more flexible by including default cases.


LV One of the examples I've read about, in terms of this topic, has to do with saving the state of a Tcl program. The idea is that one would write procs which would go through and save off the values of any variable, proc, namespace, etc. At first, I thought well, only the values which are dynamically set are necessary - the procs defined in the program's file itself need not be saved, nor variables which are set within that file. However, given that someone might come along and modify the program, or for that matter, given that you would have to write specialized code to avoid getting those values, it is much easier just to save everything. If done properly, the "save file" should be able to be invoked and the application would be back the way things were (with some possible exceptions, depending on the code, extensions, etc.) when last started.


See also

Examples found "in the wild" of data structures which support the data-is-code paradigm:

parsetcl
Represents the syntactic structure of Tcl code as a list-tree. A wiki user contributed a set of procs that made the list-tree convert itself back to Tcl code when evaluated!
grammar_fa
Has a list-tree representation for regular expressions. Node types are S, ., |, &, *, ?, +, and !.
A little XML parser
Has tree-list representation (inherited from tdom) for XML data. Node types are the tags that may occur, and #text for text content.
Geometry
Represents geometric objects as list-trees, with POINT and VECTOR as leaf node types.

Going one step further: code is data is code


escargo 4 Nov 2008 - And here I thought the word "data" is plural.

NEM: Singular or plural. "Data are codes" sounds weird.