Version 3 of Cloverfield - References

Updated 2008-01-23 16:49:13 by FB

Cloverfield references using the {ref} word modifier is a solution to the problem of defining and serializing structures with circular references. Let's consider a tree whose nodes have to reference their root or parent node in some way (for instance, a DOM tree from XML data). E.g:

root
|
+-node1
|
+-node2
| |
| +-node21
| |
| +-node22
| |
| +-node23
|
+-node3

Each node (including the root) is a structure that holds a value (here root, node1...), a list of children, and a reference to its root parent. Expressing this structure in plain Tcl gives:

set tree {root {
    {node1 {}}
    {node2 {
        {node21 {}}
        {node22 {}}
        {node23 {}}
    }
    {node3 {}}
}}

This is a tree expressed as a recursive list. Each node holds 2 elements: a name and a list of children. Nodes "know" their root and parent implicitly because they are part of the same Tcl variable. If we want to extract a given node, we lose the reference information so we have to retrieve this metadata somewhere. Typically one would have to maintain this information in variables, with some in-band signaling convention, or with specific naming conventions. For example, we could store the names of the root and parent nodes in the node structure (in-band signaling).

set tree {root root root {
    {node1 root root {}}
    {node2 root root {
        {node21 root node2 {}}
        {node22 root node2 {}}
        {node23 root node2 {}}
    }
    {node3 root root {}}
}}

Extracting node22 gives:

{node22 root node2 {}}

Knowing that the data is stored in variable tree, one could find the root and parent nodes by searching the tree. But this is a potentially expensive operation and needs explicit coding.

Now consider the same structure with Cloverfield's {ref} word modifier:

set tree {ref root}(root {ref root}{} {ref root}{} (
    {ref node1}(node1 {ref root}{} {ref root}{} {})
    {ref node2}(node2 {ref root}{} {ref root}{} (
        {ref node21}(node21 {ref root}{} {ref node2}{} {})
        {ref node22}(node22 {ref root}{} {ref node2}{} {})
        {ref node23}(node23 {ref root}{} {ref node2}{} {})
    ))
    {ref node3}(node3 {ref root}{} {ref root}{} {})
)

(Note that this is the string representation of the tree. Behind the curtains, the structure maintains pointers to the referenced data.)

Here we chose to use the node name as the reference ID, but IDs are arbitrary. The tree is now expressed as a recursive list with all nodes tagged by a unique reference ID. Each node holds 4 elements: a name, a reference to the root node, a reference to the parent node, and a list of children. This string representation is similar to the above Tcl example with in-band signaling. The difference is that elements 1 and 2 respectively hold references to the root and parent nodes, whereas the Tcl version required an explicit naming convention.

Some example use:

% set node [lindex $tree 3 2 3 2]; # 3's are for each node's children list.
# Note: indentation and inline comments added to the below string representation for clarity.
{ref node22}(
    # Value:
    node22
    
    # Root: first reference to root is serialized. 
    {ref root}(
        root
        
        # Subsequent references to root aren't serialized.
        {ref root}{}
        {ref root}{}
    
        # Children:
        (
            {ref node1}(node1 {ref root}{} {ref root}{} {})
            {ref node2}(node2 {ref root}{} {ref root}{} (
                # Siblings of node22 get serialized...
                {ref node21}(node21 {ref root}{} {ref node2}{} {})
                
                # ... but not node21.
                {ref node22}{}
                
                {ref node23}(node23 {ref root}{} {ref node2}{} {})
                
            ))
            {ref node3}(node3 {ref root}{} {ref root}{} {})
        )
    )
    
    # Parent: reference was already serialized above.
    {ref node2}{}
    
    # Childen:
    {}
)

# Get node name.
% lindex $node 0
node22

# Get node parent name.
% lindex $node 2 0 ; # 2 is for parent.
node2

# Get siblings.
% lindex $node 2 3 ; # 2 for parent, 3 for children.
# Note: indentation and inline comments added to the below string representation for clarity.
(
    # First sibling is the first to reference root.
    {ref node21}(
        # Value:
        node21 
        
        # Root: first reference to root is serialized. 
        {ref root}(
            root
            
            # Subsequent references to root aren't serialized.
            {ref root}{}
            {ref root}{}
        
            # Children:
            (
                {ref node1}(node1 {ref root}{} {ref root}{} {})
                {ref node2}(node2 {ref root}{} {ref root}{} (
                    # First sibling is currently being serialized.
                    {ref node21}{}
        
                    # Subsequent siblings need serialization.
                    {ref node22}(node22 {ref root}{} {ref node2}{} {})
                    {ref node23}(node23 {ref root}{} {ref node2}{} {})
                    
                ))
                {ref node3}(node3 {ref root}{} {ref root}{} {})
            )
        )
    )
    
    # Subsequent siblings were already serialized.
    {ref node22}{}
    {ref node23}{}
)

Referenced structures are never serialized more than once (the referenced value is serialized at the first occurrence). That way, structures with circular references can be serialized in finite space, expressed as strings, and compared for equality. For a simpler example, consider this:

set l {ref a}(a {ref a}{})

The above structure is a list whose 2nd element is a reference to itself. Hence:

% lindex $l 1; # Or $l[1] using Cloverfield conventions.
{ref a}(a {ref a}{})
% expr {[lindex $l 1] == $l}
1

References can be combined with argument expansion. For example:

{ref root}(a {*}{ref root}{})

gives an infinite (flat) list of "a". The following code loops indefinitely:

foreach v {ref root}(a {*}{ref root}{}) {
    puts $v
}

The argument expansion word modifier must be placed before the reference so that the referenced value be expanded in-place in the containing list. Placing {*} after {ref} has no effect since the reference word modifier tags a single word.


Notice also the use of parentheses in place of braces, in order for the reference IDs to share the same scope. For example:

{ref a}(a {ref a}{}) 
{ref a}{a {ref a}{}}

In the first case, both references a designate the same reference, contrary to the second case because words are not evaluated at the same time. The first case is a self-referencing structure, whereas the second is equal to the list {a {}} value-wise.

The consequence is that the following works as expected:

% set l1 ({ref a}1 2 {ref a}{})
% set l2 ($l1 {ref a}3 {ref a}{})
{{ref a}1 2 {ref a}{}} {ref a}3 {ref a}{}
# The above equals {{1 2 1} 3 3} value-wise.

l1's references don't pollute l2's unexpectedly.

In summary, references don't cross brace boundaries, because braces delay the evaluation of their content. That is the main reason why parentheses are necessary for references to work.


See also: