Brush

AMG: Brush is a proposed language derived from Tcl. It currently exists in the form of a paper I presented at the Nineteenth Annual Tcl/Tk Conference (2012). Before I take up any detailed design or implementation tasks, I invite feedback on the proposal. Brush has similarities to Jim and Cloverfield.

The paper is available online here: http://andy.junkdrome.org/devel/brush/brush.pdf

Abstract

This paper proposes a new programming language similar in structure to Tcl. The new language embeds several functional programming concepts which are currently awkward or difficult to realize in Tcl. A reference system serves as the foundation for several other new features. The Tcl object model is refined to address performance concerns encountered in the Wibble project. Syntax is fine-tuned to streamline preferred usage. Despite similarity to Tcl, backwards compatibility is broken. The paper discusses the motivation, design, and semantics of the language and provides a preliminary specification. Included code samples contrast how various tasks are done in Tcl and the new language.

Development notes

AMG: Below this point I will jot down errata and other things that will contribute to the next iteration of the Brush project. Please bear with my rambling, for I wish to retain not only my conclusions but the illogic and missteps needed to reach them.

Summary of possible changes

AMG: Since this page is long, I'll go ahead and list here very briefly all the substantial changes to the paper I'm considering. I'm not committed to anything yet though.

  • Allow $ in math expressions to be omitted for variables starting with non-ASCII characters
  • Fix the precedence of & and | relative to && and ||
  • int() behaves like Tcl entier()
  • real() behaves like Tcl double()
  • Support for complex numbers and vector math
  • / is real division, // is integer division, regardless of operand type
    • Though / won't force result to be real if it's a whole number and both arguments are integers
  • % works for real numbers as well as integers
  • [incr] works for reals
  • {x:} is the empty list immediately after index x (was before in the paper)
  • {:x} is the empty list immediately before index x ({:x} was not allowed in the paper)
  • New [loop] and [collect] commands take the place of [for], [foreach], [while], and [dict for].
  • Change the string representation of references so that late-bound indexes say $... instead of &...@ or any such complicated thing
  • Have $&...^ notation instead of &...^ for late-bound indexes
    • &...^ is incorrect because it is (eventually) taking a value, hence should start with $
    • $...^ is incorrect because it treats the variable as containing the reference, rather than making a reference to the variable
    • $&...^ is ugly, but it's in fact shorthand for...
    • $[: &...]^ which is even uglier and unreasonable to type
  • Create a [lambda] command which compiles its lambda argument and binds its local variables to the locals of its caller
  • Drop the superfluous $ from reference composition and allow ^ in more places
  • Turn garbage collection inside-out
  • Rename the create subcommands to new, e.g. [list new]
  • Change [set] to handle sets (was lots) rather than assign variables (probably won't actually do this)
  • Cache the hash values in the Tcl_Obj rather than the hash table (probably won't actually do this)
  • Get constant-time dict element removal via an improved, lazy algorithm
  • Require parentheses around expressions containing spaces used in list indexes when more than one index is given
  • Allow end prefixes in math expressions, but retain end in the string representation of the result
  • Index arguments aren't arbitrary expressions, instead revert to pre-TIP 176 [L1 ] Tcl behavior
  • Add - for don't-care assignments in extended [set]
    • No, call it / because Reasons.
  • Offer customizable behavior for multi-variable [set] being given too few values
    • Actually this becomes a powerful, general-purpose data structure unpacking mini-language which also can be used with the [loop] and [collect] commands
    • Let [proc]'s parameter list syntax accept a very similar notation
  • Tweak [proc] parameter list syntax to make special characters be a separate list element
  • Be strict about metacharacters in contexts where Tcl currently treats them as literal
  • Be lenient about whitespace between backslash and newline (maybe not)
  • The [take] command, perhaps
  • Fix creative writing problem with namespace variables
  • Typographic fixes

Math improvements

Automatic $ in math expressions

AMG: The restrictions on page 10 can be loosened up a bit. Right now the dollar sign on variables in math expressions can be skipped if the name follows C identifier rules (alphanumerics and underscores only, doesn't start with a number), doesn't use indexing or dereferencing, is named literally, and isn't named "eq", "ne", "in", "ni" (or "end"... need to add that to the list).

The "alphanumerics and underscores" part can be relaxed to be any characters other than significant metacharacters and operators. So if your variable is called مینڈک, that would work too. (Assuming Brush supports right-to-left script, that is.)

Fixing operator precedence

AMG: There's no good reason to keep the precedence of & and | near && and ||. Put the bitwise operators somewhere after comparison. This was just for C compatibility, which is not required. Tcl compatibility would be nice, but compatibility is an anti-requirement of Brush.
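
For reference, here's the gotcha this precedence inherits from C, shown in present-day Tcl (not Brush syntax):

expr {6 & 2 == 2}     ;# 0, parsed as 6 & (2 == 2)
expr {(6 & 2) == 2}   ;# 1, what was almost certainly meant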

Generalization to real numbers

AMG: I'm mystified about why certain things in Tcl are limited to integers until you switch to a different, more complicated syntax. Brush ought to clean this up.

int() and double()

AMG: I don't like the name double() because it's tied too closely to the implementation. Instead I prefer real() which describes what's happening at a higher level and doesn't require the programmer to know C to understand the name. Seriously, double() sounds like it should be returning double (two times) its argument!

As for int(), I don't like how Tcl masks to the machine word size. The non-masking Tcl function is entier() which is a bizarre name. So Brush will adopt entier()'s behavior but name it int(). Want to mask to the machine word size? Use & to do it.
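
For comparison, today's Tcl behavior; this is only an illustration, and the wrapped value depends on the word size of the build:

expr {entier(1e20)}   ;# 100000000000000000000, the full integer part
expr {int(1e20)}      ;# low bits only, e.g. 7766279631452241920 with a 64-bit word
expr {entier(1e20) & 0xFFFFFFFFFFFFFFFF}   ;# the same masking, spelled out with &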

Hmm, renaming double() to real() puts me in mind of complex numbers. I hadn't considered that before. I don't see why complex number notation couldn't be added, along with imag() and conj() and other such operations.

Maybe vector math would be nice too.

Integer and real division

AMG: My main reason for needing double() (or real(), as it may now be known) is to force real division even when the numbers may appear to be integers. I think a better solution is to make integer and real division be distinct operators, like Python has done. I'm already breaking compatibility with the Tcl [expr] notation in several ways, why not one more? So let's have / be real division and // be integer division.

Real division isn't quite the same thing as forcing either argument to be a real: when both arguments are integers, the result is promoted to a real only if the numerator is not a multiple of the denominator; an exact quotient stays an integer.

Expression  Result     Expression  Result
5/2         2.5        5//2        2
10/5        2          10//5       2
10.0/5      2.0        10.0//5     2

And yes, this gotcha just bit me for the millionth time. That's why I want this change.
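
For reference, the Tcl behavior that keeps causing it:

expr {5 / 2}           ;# 2, silently truncated
expr {double(5) / 2}   ;# 2.5, but only if you remember to force a real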

Real modulus

AMG: In the same vein, % should work the same as fmod when at least one argument is not an integer.
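
In current Tcl the operator flatly rejects non-integers, and you have to switch to the function syntax instead:

expr {7.5 % 2}        ;# error, % accepts integers only
expr {fmod(7.5, 2)}   ;# 1.5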

[incr] and reals

AMG: [incr] should be defined to support real numbers as well as integers.
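
Again for comparison, Tcl's [incr] insists on integers, forcing a detour through [expr]:

set x 1
incr x 0.5                ;# error, the increment must be an integer
set x [expr {$x + 0.5}]   ;# the detour Brush would make unnecessary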

Empty list range notation

First revision

AMG: On page 29, I never was happy with the fact that [set &x{end+1:} z] is required to append to the list in x. (By the way, that should be (z) not z.) I'd much rather be able to just say "end".

But what I have is necessary for consistency because the stride denotes the range including just before the first element of the stride (end, let's say) up to just after the second element of the stride (omitted in this case, indicating an empty stride). That empty stride (consisting only of the space before the final element) is then replaced with a list consisting of z.

Here's the superior solution. Allow the first stride index to be omitted, keeping the second. Since one of the two indexes is omitted, the stride is empty, so that's one important property maintained. Since the second index is specified, the stride denotes the space after the indexed element, so that's the other property I want.

How does it look? [set &x{:end} (z)]. I'll take it!

Also, I think I'll change the example to append two elements, not just one. That shows the utility of this feature, since appending a single element is already accomplished by [set &x{end+1} a], as shown in a previous example on the same page.

Second revision

AMG: When I first wrote the Brush paper, I envisioned only leaving out the second index to signify an empty list range, with the second value defaulting to -1. Thus, &a{5:} is a reference to the empty list range immediately before the element with index 5 in variable a.

Later I became annoyed at having to write &a{end+1:} which refers to the empty list range immediately before the element after the last. Too convoluted. I wanted to be able to instead refer to the empty list range immediately after the last element. This is of course the same list range, but explained more easily. So I came up with &a{:end}.

Now that I look at it, I think it would make more sense to reverse the behavior of omitting indexes so that the colon serves as a visual mnemonic as to whether the referenced empty list range is before or after the stated index.

Notation   Interpretation
&a{:0}     Before index 0, i.e. start of list
&a{4:}     After index 4
&a{:5}     Before index 5
&a{:end}   Before last index
&a{end:}   After last index, i.e. end of list

This looks much better to me.

Here's a simple example:

set &a (c e g)     # c e g
set &a{0:} (d)     # put d after c
set &a{:0} (a b)   # put a b before c
set &a{:end} (f)   # put f before g
set &a{end:} (h i) # put h i after g
: $a               # a b c d e f g h i

AMG: When using this new empty list notation, it's no longer the case that the list begins immediately before or ends immediately after the first and second indexes, respectively. It doesn't matter how the omitted index is defined. So instead describe empty list notation as a special case. {x:} is the empty list immediately after index x, and {:x} is the empty list immediately before index x. It's that simple.

Rethinking loops

AMG: I like [foreach] a lot, whereas [for] bothers me as being unintuitive and nothing more than a copy of C. While thinking this over last night, I took inspiration from [lcomp] and Ada and came up with a way to aggregate all looping into a general form: the [loop] command.

Here's a table summarizing and comparing the possible syntax. You may be familiar with some of these forms from the [lcomp] page. Combining everything like this is a bit more than simple sugar because it allows the various forms to be combined in new ways, including those not demonstrated in the table.

The do "keyword" in the following is optional when it is the second-to-last argument. I include it to be explicit, but it can be left out the same way then and else are optional in Tcl's [if] command.

Description                  | Tcl version                                                   | Proposed Brush version
Infinite loop                | while {1} {code}                                              | loop do {code}
While loop                   | while {cond} {code}                                           | loop while {cond} do {code}
While loop, negated test     | while {!cond} {code}                                          | loop until {cond} do {code}
While, minimum once          | code; while {cond} {code}                                     | loop do {code} while {cond}
While, min. once, neg. test  | code; while {!cond} {code}                                    | loop do {code} until {cond}
Maximum iteration count      | for {set i 0} {$i < $count} {incr i} {code}                   | loop count {limit} do {code}
C-style for loop             | for {init} {cond} {step} {loop}                               | loop init {init} while {cond} step {step} do {loop}
C-style for loop, neg. test  | for {init} {!cond} {step} {loop}                              | loop init {init} until {cond} step {step} do {loop}
Simple list iteration        | foreach var $list {code}                                      | loop for &var in $list do {code}
Striding list iteration      | foreach {var1 var2} $list {code}                              | loop for (&var1 &var2) in $list do {code}
Unpacking list iteration     | foreach _ $list {lassign $_ var1 var2; code}                  | loop for (&var1 &var2) unpack $list do {code}
Combinatorial iteration      | foreach var1 $list1 {foreach var2 $list2 {code}}              | loop for &var1 in $list1 for &var2 in $list2 do {code}
Parallel iteration           | foreach var1 $list1 var2 $list2 {code}                        | loop for &var1 in $list1 and &var2 in $list2 do {code}
Conditional iteration        | foreach var1 $list1 {if {cond} {code}}                        | loop for &var1 in $list1 if {cond} do {code}
Conditional combinatorial    | foreach var1 $list1 {if {cond} {foreach var2 $list2 {code}}}  | loop for &var1 in $list1 if {cond} for &var2 in $list2 do {code}
Simple w/ references         | (see below)                                                   | loop for &var over &listvar do {code}
Unpacking w/ references      | (see below)                                                   | loop for (&var1 &var2) unpackover &listvar do {code}

The over and unpackover forms are very special. Instead of putting each list value in the variable, they store references to the list values. This allows in-place modification of the list without having to keep track of indexes or generate a new list. I really look forward to this feature.

Now take this a step further. I also envision a [collect] command that takes all the above forms plus takes one or more initial arguments (before the [loop]-style arguments) which are expressions to be evaluated after each iteration of the loop (before the step clause, if there is one). The results of these evaluations are collected into a list which is returned as the result. For [collect], it'll often make sense to leave off the do clause entirely, though it may occasionally be useful to set variables used in the result expressions.

Also I should mention that Brush expressions allow the leading dollar sign of variable names to be omitted in simple cases.

For example, here are all the Python PEP 202 [L2 ] and PEP 274 [L3 ] examples written in Brush:

set &nums (1 2 3 4)
set &fruit (Apples Peaches Pears Bananas)
proc &range (count) {collect i init {set &i -1} count $count {incr i}}
proc &zip (l1 l2) {collect {(i1, i2)} for &i1 in $l1 and &i2 in $l2}

collect i for &i in [range 10]
collect i for &i in [range 20] if {i % 2 == 0}
collect {(i, f)} for &i in $nums for &f in $fruit
collect {(i, f)} for &i in $nums for &f in $fruit if {[string index $f 0] eq "P"}
collect {(i, f)} for &i in $nums for &f in $fruit if {[string index $f 0] eq "P"} if {i % 2 == 0}
collect i for &i in [zip $nums $fruit] if {$i{0} % 2 == 0}

collect i {[format %c $i]} for &i in [range 4]
collect k v for (&k &v) in $dict
collect {[string tolower $x]} 1 for &x in $list_of_email_addrs
proc &invert (d) {collect v k for (&k &v) in $d}
collect {(k, v)} {k + v} for &k in [range 4] for &v in [range 4]

I intend to allow nesting of C-style for loops within a single invocation of [loop] or [collect]:

collect ($i $j) init {set &i 0} while {i < 3} step {incr &i}\
                init {set &j 0} while {j < 3} step {incr &j}

will return {0 0} {0 1} {0 2} {1 0} {1 1} {1 2} {2 0} {2 1} {2 2}.

If the while or until clause is left out of a C-style for loop, it will loop forever or until interrupted. However, if it is iterating in parallel with a Tcl-style foreach loop via the and clause, it will stop when its parallel loop does.

collect ($i $j) init {set &i 0} step {incr &i} and &j in (0 1 2)

will return {0 0} {1 1} {2 2}.

Another possible feature is the else clause which supplies code to execute if [break] is never called. This is useful for searches to implement the failing case, thereby avoiding the need for a success/failure flag or double-checking the iteration counter against the limit. But perhaps else isn't the best name for it (that's what Python uses) since that name would easily confuse its interaction with the if clause whose purpose is merely to conditionally skip nested loops. last might be better because it's only executed after the last iteration, upon failure of the while or until conditional or running out of items in the in or over or unpack or unpackover lists. I'd prefer to avoid finally because, in the [try] command, the finally clause is executed no matter what, even in the face of break or analogous.

Here's an example that prints "match" if two lists have at least one corresponding element that's numerically equal:

loop for &x in $list1 and &y in $list2 do {
    if {$x == $y} {
        puts match
        break
    }
} last {
    puts mismatch
}

Not recommended, but this code could instead be written:

loop for &x in $list1 and &y in $list2 if {$x == $y} do {
    puts match
    break
} last {
    puts mismatch
}

If I were to change last back to else, you might expect it to print "mismatch" repeatedly rather than only at the end of the loop. That's why I want to avoid the term else.

I could further reduce the opportunity for confusion by forbidding if immediately before do so that it is only legal for conditionally skipping nested loops. Likewise I could forbid it as the first clause. Thus it would only be valid when sandwiched between loops. For example, the above table would no longer show "Conditional iteration" except on the "Conditional combinatorial" line.

On second thought, in the case of [collect], if can indeed be worthwhile after the last loop, before the optional do. If the if fails, no result entry is generated, which is useful for filtering. For example, this would return a list of all even numbers returned by the [numbers] command:

collect x for &x in [numbers] if {!(x & 1)}

Does last make any sense for [collect]? I don't think so. It's not going to contribute to the result. But perhaps allow it anyway in case someone needs to put in side effects that only happen when [break] isn't used. I doubt anyone ever will want this.

Example

AMG: Here's Tcl code that sets all real number elements in a table, which is a list of lists of lists of cell values, to have a certain [format]-style precision.

for {set i 0} {$i < [llength $table]} {incr i} {
    for {set j 0} {$j < [llength [lindex $table $i]]} {incr j} {
        for {set k 0} {$k < [llength [lindex $table $i $j]]} {incr k} {
            if {[string is double -strict [lindex $table $i $j $k]]} {
                lset table $i $j $k [format $precision\
                        [lindex $table $i $j $k]]
            }
        }
    }
}

Here's the equivalent Brush code:

loop for &section over &table for &row over &section^ for &cell over &row^ {
    if {[string is real $cell@]} {
        set $cell [string format $precision $cell@]
    }
}

Notes:

  • Use of over avoids all use of [lindex], [lset], [llength], [incr], and loop counter variables.
    • This is my major motivation.
  • Only one invocation of [loop] is needed since it permits nesting by repeated for arguments.
    • This cuts down on indentation.
    • Nested invocation is permitted if style or other considerations demand it, but it's not required.
  • &section^ and &row^ use late-bound reference composition.
    • Saying $section and $row would fail because these variables don't have values before [loop] executes.
    • Saying &section and &row would fail because the iteration is over the referenced lists, not the references themselves.
    • &section^ composes a reference to whatever $section refers to at the moment that &section^ is dereferenced. Likewise &row^.
  • $cell doesn't contain the cell value but rather a reference to the element.
    • $cell@ dereferences to obtain the value.
    • The first argument to [set] is $cell not &cell because [set] is changing the referenced cell value, not changing the reference itself.
  • [string is real] instead of [string is double -strict].
    • I think "real" is a better term than "double" for checking for real numbers.
    • I want -strict to be the default mode of operation because empty string is not a number even though appending characters could make it into one.
    • Heck, [string is double 1.2e] returns 0 even though appending a digit would make it valid, so non-strict doesn't even satisfy the use case of partial input validation.
    • Get rid of it.
  • [string format] instead of [format].
    • Might not do this, but I think that would be the correct ensemble for [format].
    • We have [binary format] and [binary scan], why not [string format] and [string scan]?

Here is a more literal translation that retains the C-style loop counting:

loop init {set &i 0} while {i < [list length $table]} step {incr &i} {
    loop init {set &j 0} while {j < [list length $table{i}]} step {incr &j} {
        loop init {set &k 0} while {k < [list length $table{i j}]} step {incr &k} {
            if {[string is real $table{i j k}]} {
                set &table{i j k} [string format $precision $table{i j k}]
            }
        }
    }
}

This version, while verbose and not preferred, demonstrates the fact that inside expressions and list indexes, many variable substitutions don't require dollar signs. Context makes it clear that a variable's value is being taken. This version also shows off that [lindex] and [lset] are not needed for reading and writing list elements.

Alternative to unpack

AMG: Instead of the unpack and unpackover keywords, I wish to expand the syntax of the variable reference list. Let the variable reference argument (the one preceding in or over) be a single variable reference, or a list, like before. But change what each element of the list can be. Let them either be a reference (as in the existing proposal), or be any of the other forms allowed by the [=] command. (Except pass, or whatever I end up calling it.) This will allow for unpacking more complex structures.

First, compare the existing examples:

With unpack                                          | Without unpack
loop for (&var1 &var2) unpack $list do {code}        | loop for ((&var1 &var2)) in $list do {code}
loop for (&var1 &var2) unpackover &listvar do {code} | loop for ((&var1 &var2)) over &listvar do {code}

Now consider new possibilities:

= (&w &h) (2 4)
= &rects{end+1} ((20 30) red -outline orange)
= &rects{end+1} ((25 12) green -stipple gray12)
loop for (((&x &y) &color (* &options))) in $rects {
    .canvas create rectangle $(x-w/2) $(y-h/2) $(x+w/2) $(y+h/2)\
            -fill $color {*}$options
}

Here's how it would be done in Brush without this change:

# ...
loop for &elem in $rects {
    = ((&x &y) &color (* &options)) $elem
    # ...
}

Literal Tcl 8.5+ transcription:

lassign {2 4} w h
lappend rects {{20 30} red -outline orange}
lappend rects {{25 12} green -stipple gray12}
foreach rect $rects {
    lassign [lindex $rect 0] x y
    set options [lassign [lrange $rect 1 end] color]
    .canvas create rectangle [expr {$x-$w/2}] [expr {$y-$h/2}] [expr {$x+$w/2}] [expr {$y+$h/2}]\
            -fill $color {*}$options
}

This feature gets even fancier in combination with the other assignment modes I'm adding to the [=] command. Basically, the argument following for will accept the same syntax as the first argument to [=], so complex data structure unpacking operations become possible far beyond what the now-rejected unpack keyword would have accomplished.

Single-command nested loops

AMG: There's a complication when nesting loops within a single command, e.g. collect {(a, b)} for &a in (1 2) for &b in (3 4). This case is fine, but what if the inner loop operates on the variable set by the outer loop? It can't be done directly because the substitution is performed before the command is executed, not upon each iteration of the outer loop.

One solution is to use references so that the dereferencing is deferred:

% collect {(a, $b@)} for &a in ((1 2) (3 4)) for &b over &a
{{1 2} 1} {{1 2} 2} {{3 4} 3} {{3 4} 4}

Some cases can be written in terms of the expanded assignment capabilities:

% collect {$b@} for &a in ((1 2) (3 4)) for &b over &a
1 2 3 4
% collect b for ((* &b)) in ((1 2) (3 4))
1 2 3 4

Of course, traditional nesting is still available. The syntax is very clumsy for [collect], though. [lmap] does it better.

% collect {[collect {(a, b)} for &b in $a]} for &a in ((1 2) (3 4))
{{{1 2} 1} {{1 2} 2}} {{{3 4} 3} {{3 4} 4}}

Compare with Tcl:

% lmap a {{1 2} {3 4}} {lmap b $a {list $a $b}}
{{{1 2} 1} {{1 2} 2}} {{{3 4} 3} {{3 4} 4}}

For [loop] there's no issue:

% loop for &a in ((1 2) (3 4)) {loop for &b in $a {= &result{end+1} ($a $b)}}
% : $result
{{1 2} 1} {{1 2} 2} {{3 4} 3} {{3 4} 4}

Though do be aware these examples give subtly different results depending on whether the output of each inner loop is collected into a single list element or if it's all flattened into a single list.

Numeric ranges

AMG: The major use case driving C-style loops is iterating over numeric ranges. I think this should be made more straightforward. Previously I was thinking about a range command that can be used to generate the list being iterated over, but making the list would be a waste of time and memory. It makes more sense to incorporate the iteration parameters directly into the loop/collect command.

% collect x for &x from 0 to 10
0 1 2 3 4 5 6 7 8 9 10
% collect x for &x from 0 to 10 step 2
0 2 4 6 8 10
% collect x for &x from 0 to 9 step 2
0 2 4 6 8
% collect x for &x from 0 until 10
0 1 2 3 4 5 6 7 8 9
% collect x for &x from 10 to 0
% collect x for &x from 10 to 0 step -1
10 9 8 7 6 5 4 3 2 1 0
% collect x for &x from 10 until 0 step -1
10 9 8 7 6 5 4 3 2 1
% collect x y for &x from 0 to 2 for &y from 6 to 8
0 6 0 7 0 8 1 6 1 7 1 8 2 6 2 7 2 8
% collect x y for &x from 0 to 2 and &y from 6 to 8
0 6 1 7 2 8

Of course I intend the loop forms to work too, but collect makes for easier examples.

Real numbers present a challenge. Stepping by a number whose denominator is not a power of two likely won't land exactly on the final to value. However, this is far from a new problem.

As discussed above, it's tricky to nest loops in a single command if the inner iteration parameters depend on the outer iteration variables. Late-bound references won't help because the arguments to from, to, until, and step are expected to be numbers. The fix is to make them not be numbers but rather expressions which are evaluated as late as possible. This change makes all the above examples continue to work, plus makes this work:

% collect x y for &x from 0 to 2 for &y from 0 to x
0 0 1 0 1 1 2 0 2 1 2 2

But on the downside, this becomes dangerous due to double substitution:

% collect x for &x from $start to $end

The solution is to ensure all substitution is in the hands of the expression engine. Don't let the interpreter do any substitution of its own. Either protect $ with quotes (as with Tcl [expr]) or avoid it entirely (using Brush expression syntax's implied variable substitution):

% collect x for &x from {$start} to {$end}
% collect x for &x from start to end

I wonder if the to/until and step expressions should be evaluated in each iteration of the loop (like they would in the C-like loops) or if they should be evaluated once at the beginning of each loop, at the same time as the from expression.

Concerns about references

AMG: After thinking about it, these two concerns are not new problems. They exist in Tcl already but in different forms. They're fundamental, and we're already happily ignoring them or successfully working around them.

Security

AMG: References (as defined by Brush), just like C pointers, break the security of the language, allowing unrestricted access to any variable from anywhere else in that same interpreter. You can't obtain such a raw reference by writing &123, where 123 is the numeric ID of the variable, since that would be interpreted as legally making a reference to a (presumably new) variable that's actually named 123. But you can use quoting: \&123. The resultant string won't initially be seen as a reference, but it can easily shimmer to reference: [set \&123 value].

Numeric IDs could be randomized to make this kind of thing far harder to pull off in a useful sort of way, but that's expensive and not guaranteed to work. This would be like having C map variables far apart from each other and in random order, different every time a program is run, with no predictable layouts like the stack. C would have to keep array and struct elements adjacent, similar to how Brush would (of course) keep list and dict elements together in a single variable.

While initially designing Brush, I had the thought that confining access to be within the same interpreter was enough security. After all, Tcl is actually the same way: at every point in the program, you have unrestricted read/write access (modulo traces) to all variables in your local stack frame, in all your callers' frames, and all global and namespace variables within the interpreter. Just use [uplevel] or [global] or [variable] or :: notation.

So I guess this isn't a new problem after all, and I shouldn't feel too badly that my alternative for [uplevel]/[global]/[variable]/:: has one of the same problems they do. I should probably point this out in the paper.

But let's say I wanted to fix this anyway. I could do so by forbidding shimmering to reference. This is an apparent violation of EIAS, but it is theoretically possible. Having this restriction would greatly simplify reference counting of variables, so there is some temptation to do this.

Fragility

AMG: Even though Brush references have string representations in EIAS fashion, those string representations don't have meaning outside the interpreter in which they were created. They can't be written to disk then loaded later to pick up where the interpreter left off. Tcl's analog of Brush references (names of variables) has the same limitation (only make sense inside the interpreter) but to a lesser extent (that interpreter does have the opportunity to resume execution using data it read from disk).

Not sure how big an issue this is, or if it's an issue at all. I mean, Brush doesn't stop you from putting variable names, as opposed to references, in data structures. However, those names won't have the effect of keeping the variables from being destroyed by the garbage collector, but that is true of Tcl as well. And it's not worth complaining about an overactive gc when we're already talking about terminating and restarting the program, thereby invoking the second biggest garbage collector of them all (process termination... the first biggest being a system reboot).

Compare with C. Brush references are analogous to C pointers. C pointers are never suitable for data interchange formats. The same goes for Brush references. Take the case of shared memory segments. In C, it's preferable to put array indices in shared memory rather than pointers. This is like putting variable names rather than references into Brush data structures which are shared with the outside world.

So I guess it's not that big a deal, or at least it's not a new problem.

Late-bound indexes

Round one

AMG: Page 27 appears to have an error.

set (&x &i) ((a b c) 1)
set &j &i       # &124
set &r1 &x{&i^} # &123(&124@)
set &r2 &x{$j^} # &123(&124@)
: ($r1@ $r2@)   # b b
set &i 2
: ($r1@ $r2@)   # c c

The string representations show dict indexes (parentheses), but list indexes (braces) were used to construct.

Also, more explanation is needed here that in the string representation of a reference, list and dict indexes can contain references involving @ and other indexing, and they will automatically be dereferenced.

Is that really the notation I want to use? Yes, because all references start with &, and the references in the example are to two variables, not just one. No, because it's not merely a reference but also an instruction to perform substitution. How confusing! But since this is the generated string representation rather than what the programmer types, I can bend rules if I must. However I would prefer as close a parallel between the string representation and the source script as possible.

So look at set &r1 &x{&i^}. The programmer types this directly, yet it uses a reference as an index. Guess that establishes precedent, but I really ought to have explained it. The idea was to make it look at the value of i at time of dereference, not immediately, without having to go through the trouble of putting the reference into another variable, like in the set &r2 &x{$j^} line.

I'm tempted to write $&i^ instead, but this has the problem of not making it clear which indexing and dereferencing operators are bound to the construction of the inline reference &i and which are bound to the substitution $&i. Does it matter? I don't believe it does, since either way every layer of indexing that gets built up when constructing the reference has to be traversed again when dereferencing it. Rewriting the example:

set (&x &i) ((a b c) 1)
set &j &i        # &124
set &r1 &x{$&i^} # &123{$&124@}
set &r2 &x{$j^}  # &123{$&124@}
: ($r1@ $r2@)    # b b
set &i 2
: ($r1@ $r2@)    # c c

This looks less like a typo, since it's clear references aren't being used directly as list indexes. That's bad enough because references aren't legal integers and therefore shouldn't be legal indexes. But it's worse when using dict indexes because references are valid strings, therefore valid dict indexes, though I guess backslashes would help.

Now we have a new $&name notation. What does it mean? It's the same as &name but, when used in a reference index and in combination with ^ (both of which should be mandatory), is clearly taking a value, not just making a reference.

It's a sort of shorthand. This:

set &r1 &x{$&i^} # &123{$&124@}

is functionally equivalent to this:

set &ir &i       # &124
set &r1 &x{$ir^} # &123{$&124@}

without the need for an intermediate variable. Or write:

set &r1 &x{$[: &i]^}

which I guess is the most straightforward way to demonstrate that it is shorthand. The question I brought up earlier can now be posed as follows: for the indexing between the variable name and the first ^, does it matter which indexes go immediately before the ] and which go immediately after? Still, I don't think it does.

Here's a crazy example designed to tease out whether or not there's a difference:

set &x ((a b c) (d e f))
set &y ((A B C) (D E F))
set &iv (1 0)
set &jv (2 1 0)
set &xy (lower &x upper &y)
set &i 1
set &j 2
set &k lower
set &r &xy($&k^){$&iv{$&i^}^ $&jv{$&j^}^}
: $r@     # a
set (&i &k) (0 upper)
: $r@     # D

Are all three of the following lines equivalent?

set &r &xy($&k^){$&iv{$&i^}^ $&jv{$&j^}^}
set &r &xy($[: &k]^){$[: &iv{$[: &i]^}]^ $[: &jv{$[: &j]^}]^}
set &r &xy($[: &k]^){$[: &iv]{$[: &i]^}^ $[: &jv]{$[: &j]^}^}

Yeah, I'd say so. So long as sequencing is maintained, it shouldn't matter whether a particular index is baked into a reference or applied upon dereference.

Something else came to mind. What does this mean?

set &r &xy($&k^){$iv{$&i^} $jv{$&j^}}

In the construction of the reference, it's trying to index into the current value of iv and jv, but the index is late-bound. In this case I think "late" means "now" since late binding means it's deferred until the index is actually needed, which is right away. In a compiled language, this would be a good thing to warn about. But Tcl and Brush have no facility for warnings. I don't see how it could be an error, so save it for Nagelfar.

This line of thought needs more time to simmer.

What's the good of late-bound indexes, anyway? Well, even though the Brush paper says otherwise, late binding is needed elsewhere in the language, namely for reference composition, so we have it. But I originally specified it only for indexes in reference so that you could make a reference which automatically looks at an iterator to find the value being considered for the current iteration of the loop.

Round two

AMG: I'm still on the fence about whether or not I even want this feature. The motivating use case was being able to modify a list while iterating over it without having to reference the list by name all the time:

set &alphabet [list split abcdefghijklmnopqrstuvwxyz {}]
set &ref &alphabet{$&i^}
for {set &i 0} {i < [list size $alphabet]} {incr &i} {
    set $ref [string toupper $ref@]
}

But maybe there's a more direct way: a [foreach] whose iterator variables reference, instead of copy, the list elements. That would do the job. Dereference the iterators to get their values, or pass them directly to [set] (or [unset]) to modify the list.

There's a design problem! [foreach] takes one or more list values as arguments, so there's no requirement for the lists to be in variables. It's impossible to make references to a value, only to a variable. Though consider: what is the sense of modifying a value that's not in a variable? In such a case, this [foreach] variant would not be used, and the values would be placed directly in the iterator variables, no references involved.

It does, however, make sense to mix and match. You might want to iterate by reference over one list variable in parallel with iterating by value over another list value, maybe for the purpose of incrementing the elements of the first by the elements of the second.

How would this be specified? I wish I could just say that passing a reference to [foreach] where a list value is expected causes it to iterate by reference, or just pass a list value directly to iterate by value. However, this doesn't work because every reference can be interpreted as a single-element list.

The Brush [set] command relies on the fact that lists with non-unit length are distinct from all possible references. This trick doesn't help here though since it's certainly valid to pass [foreach] a list with length one. How unfortunate. Consequently, an option is needed to explicitly say that a particular list is being iterated by reference. I don't want to complicate the common, iterate-by-value case though.

How about iterating over a subset of a list? I think reference composition can be used for this purpose. Rework the above example to only capitalize the first five letters:

set &alphabet [list split abcdefghijklmnopqrstuvwxyz {}]
foreach -reference &i &alphabet@{0:4} {
    set $i [string toupper $i@]
}

This actually looks a lot clearer than the late-bound index method. So maybe I keep this and drop the other, thereby eliminating the need for $&...^ notation as well as reference nesting inside of the string representations. That simplifies everything quite a lot.

But I still have late-bound dereferencing. Funny though, it's now used for a completely different purpose than was envisioned in the Brush paper.

AMG, later: These dilemmas are resolved by the introduction of the [loop] and [collect] commands.

Round three

AMG: I keep going back and forth on this. Forbidding late-bound dereferencing in indexes isn't as simple as disallowing this $&...^ syntax I mentioned, since they can certainly be had by other means. Like I said, it's just shorthand for $[: ...]^ which is legal. I would have to check for and restrict any use of late-bound dereferencing other than reference composition. But that seems artificial to me, and I am shooting for generality. Maybe keep it after all, even though it's complicated and rarely needed.

I also do have the option of changing the string representation further. Don't say $&...@ but rather simply $... (leave out the & and the @), where ... is the numeric ID corresponding to the variable being referenced. It's fine to make this change because it'll still be recognized as a reference on account of living inside a reference value (a value starting with &... and otherwise conforming to this specification). That sounds good to me.

set &r1 &x{$&i^} # &123{$124}

In addition, I have half a mind to not support writing $&i^ to get this behavior but rather forcing the programmer to write it out long: $[: &i]^.

Automatic closures

AMG: Brush, as specified, implements closures by manually bringing them into scope as references.

This is impossible, but I'll describe it anyway.

proc &accum_gen ((val? 0)) {
    : (lambda ((valref= &val) (inc? 0)) {
        set $valref $($valref@ + inc)
    })
}

Here, the returned lambda captures the caller's &val as a hidden argument. That's all fine and dandy, but what about:

proc &accum_gen_automatic ((val? 0)) {
    : (lambda ((inc? 0)) {
        incr &val $inc
    })
}

This maps very closely onto Paul Graham's Common Lisp version:

(defun foo (n) (lambda (i) (incf n i)))

Using his argument names and ditching the defaulted values:

proc &foo (n) {: (lambda (i) {incr &n $i})}

Very cool! But why is it impossible? It's because by the time Brush discovers that the return value of [foo] is in fact a lambda, the enclosing scope is long gone. All it'll do is return $i with no memory that there once was a variable called n.

How to fix? Eagerly compile the lambda, at least sometime before [foo] returns. But this is contrary to the design of Tcl, and therefore of Brush, since everything is a string and the types of values are manifest in their use, not their declaration. And use happens too late.

How to fix anyway? The reference to n absolutely must be created while [foo] is still running, which is why [accum_gen] is written the way it is. And there you have it.
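
For contrast, here is roughly how close present-day Tcl gets using [apply]: the value of n can be baked into the command, but not the variable itself, which is exactly what the reference capture above provides. (Plain Tcl for illustration; the names are arbitrary.)

proc foo {n} {
    # capture n by value; every invocation starts from the same copy
    list apply {{n i} {incr n $i}} $n
}
set acc [foo 10]
{*}$acc 5   ;# 15
{*}$acc 5   ;# still 15: nothing accumulates, because n was captured as a value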

Damn it, you say, let's have automatic closure anyway! There could be a [lambda] command that eagerly compiles its argument as a script body, capturing its calling environment immediately, then returning the equivalent lambda value that contains references to variables in the enclosing scope along with some magic to bind them to local names. And should [proc] also be given this property? It would virtually eliminate the need for [variable] and [global]. I'm not sure. I actually kind of like how Tcl forces you to explicitly bring in globals.

How would that look? Let's put it side-by-side with Paul Graham's version:

(defun foo (n) (lambda (i) (incf n i)))
proc &foo (n) {lambda (i) {incr &n $i}}

Holy cow, that's a close resemblance! Plus, the parentheses around n and i can be omitted because their listhood is dictated by context.

More thought needed. I must admit that I am sorely tempted.

One thing I can do is add a & notation to the parameter list specification. The currently allowed forms are:

Notation    Interpretation
var         Required argument
var?        Optional argument
(var? val)  Defaulted argument
var*        Catchall argument
(var= val)  Bound argument

But I can add & to required, defaulted, and bound arguments to create reference arguments. (It doesn't make sense for optional and catchall arguments.) For reference arguments, the local variable will be linked to the variable referenced by the value, with no need to explicitly dereference using @.

proc &foo (n) {: (lambda ((n&= &n) i) {incr &n $i})}

That's a step in the right direction. All that's needed is a [lambda] command that compiles the script argument, identifies the non-local variable names present in both the script and [lambda]'s calling environment, and augments the formal parameter list with "(n&= &n)"-like constructs for each.

proc &foo (n) {lambda (i) {incr &n $i}}

[lambda] returns this three-element list: lambda {{n&= &123} i} {incr &n $i}

where &123 is the reference that would be obtained by typing &n.

Restrictions

AMG: However, this magical [lambda] command won't know to bind variables whose names are computed, either by $"..."/&"..." notation or by generated code. Also this means extension commands taking script arguments or otherwise looking up variables will need to advertise to the bytecode compilation engine what variables they reasonably expect to use when executed. I hope this limitation isn't too onerous.

Namespaces

AMG: It's been tacitly understood (by me, anyway) that [lambda] and the string/list representation thereof would have an optional third argument containing the namespace name. I should make this explicit. The reason I've avoided this so far is so I could avoid thinking about whether I really want to keep Tcl namespace syntax which some have complained about.

Reference composition

AMG: In the paper I call it additive references, but I don't think that's as good a name as compositional. Also, the notation shouldn't be "&$name@" because the dollar sign isn't helping. Get rid of that. A more general explanation for the notation exists.

Also, allow late binding (^) not only inside index computations but also anywhere following the reference variable name, with the restriction that early binding (@) can't happen after (to the right of) late binding.

Here's a convoluted example.

set &var  (a (0 1 2) b (3 4 5))
set &ref  (&var(a) &var(b))
set &ref2 (&ref &var)
set &ref3 (x &ref2{0} y &ref2{1})

Using the above code as given, here are a whole bunch of possible combinations of early and late binding. For each I show the notation entered by the programmer, the equivalent string representation (assuming &100 is &var, etc.), and the value obtained by dereferencing the reference.

notation          | string rep       | referent value
&var              | &100             | a (0 1 2) b (3 4 5)
&ref              | &101             | &100(a) &100(b)
&ref2             | &102             | &101 &100
&ref3             | &103             | x &102{0} y &102{1}

&ref3             | &103             | x &102{0} y &102{1}
&ref3(x)          | &103(x)          | &102{0}
&ref3(x)@         | &102{0}          | &101
&ref3(x)@@        | &101             | &100(a) &100(b)
&ref3(x)@@{1}     | &101{1}          | &100(b)
&ref3(x)@@{1}@    | &100(b)          | 3 4 5
&ref3(x)@@{1}@{1} | &100(b){1}       | 4

&ref3             | &103             | x &102{0} y &102{1}
&ref3(x)          | &103(x)          | &102{0}
&ref3(x)@         | &102{0}          | &101
&ref3(x)@@        | &101             | &100(a) &100(b)
&ref3(x)@@{1}     | &101{1}          | &100(b)
&ref3(x)@@{1}^    | &101{1}@         | 3 4 5
&ref3(x)@@{1}^{1} | &101{1}@{1}      | 4

&ref3             | &103             | x &102{0} y &102{1}
&ref3(x)          | &103(x)          | &102{0}
&ref3(x)@         | &102{0}          | &101
&ref3(x)@^        | &102{0}@         | &100(a) &100(b)
&ref3(x)@^{1}     | &102{0}@{1}      | &100(b)
&ref3(x)@^{1}^    | &102{0}@{1}@     | 3 4 5
&ref3(x)@^{1}^{1} | &102{0}@{1}@{1}  | 4

&ref3             | &103             | x &102{0} y &102{1}
&ref3(x)          | &103(x)          | &102{0}
&ref3(x)^         | &103(x)@         | &101
&ref3(x)^^        | &103(x)@@        | &100(a) &100(b)
&ref3(x)^^{1}     | &103(x)@@{1}     | &100(b)
&ref3(x)^^{1}^    | &103(x)@@{1}@    | 3 4 5
&ref3(x)^^{1}^{1} | &103(x)@@{1}@{1} | 4

Also, the mnemonic: @ is an archery target that's been struck, whereas ^ is an arrow still in motion.

AMG: It's quite likely most, if not all, of the practical use cases supporting late binding are obsoleted by the improved looping capabilities, particularly the ability to loop "over" a variable's value, even if the variable's value is a complex data structure. So there's no need to run a loop counter and access a funky reference variable whose value changes according to the value of said loop counter, just so the funky reference variable can be used to modify the structure that's effectively being iterated over.

Compare:

= &data (a b c d e f g)
= &ref &data{&i^}
loop for &i from 0 until [list length $data] {
    = $ref [string toupper $ref@]
}

With:

= &data (a b c d e f g)
loop for &i over &data {
    = $i [string toupper $i@]
}

It's no contest. The latter is vastly superior. Even though I designed it, I can't even be sure I got the syntax right on the former.

Unless I can find a compelling reason for late-bound references, I'm going to drop them entirely.

By the way, the following would also work for this example:

= &data [collect {[string toupper $x]} for &x in $data]

Or exploit EIAS and the fact that [string toupper] doesn't modify the list grouping and delimiting characters:

= &data [string toupper $data]

A more complex example:

= &data ((Douglas Adams) (Terry Pratchett))
loop for (((&first &last))) over &data {
    = ($first $last) ([string toupper $last@] $first@)
}

With late-bound references:

= &data ((Douglas Adams) (Terry Pratchett))
= &first &data{&i^ 0}
= &last &data{&i^ 1}
loop for &i from 0 until [list length $data] {
    = ($first $last) ([string toupper $last@] $first@)
}

And in current Tcl:

set data {{Douglas Adams} {Terry Pratchett}}
for {set i 0} {$i < [llength $data]} {incr i} {
    lset data $i [list [string toupper [lindex $data $i 1]]\
                       [lindex $data $i 0]]
}

Traces

AMG: Do I want Brush to have variable traces? Traces can be supremely powerful and helpful, but they also complicate variable access and interfere with optimizations, for instance common subexpression elimination in math expressions. Tcl traces on arrays and elements are quite nice and are the reason to use arrays instead of dicts, but Brush doesn't have arrays, only dicts. How would I get that functionality back?
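
For concreteness, the array-element trace being referred to, in present-day Tcl:

array set config {timeout 30}
trace add variable config(timeout) write {apply {{name1 name2 op} {
    puts "config($name2) is now $::config($name2)"
}}}
set config(timeout) 60   ;# the trace fires: config(timeout) is now 60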

Similarly, what would it mean for a trace to be applied to an indexed reference? That seems way too troublesome for me. Perhaps only support traces on the entire variable, but give the trace callback the reference path that was used. Let it decide what special handling, if any, the referenced component needs.

Tcl_LinkVar() has been very, very useful to me at work, and it would be a shame not to have something like it in Brush.

Other kinds of traces might not be as big a deal, but I wouldn't know for sure. I haven't used them.

AMG: One way to go about indexed traces is to maintain a list of indexes for which a trace has been registered, then every time the value of the variable is changed, check if it changed in a way that affects any of the traced indexes. "Changing" a value such that its new value equals its old still counts.

This gets very tricky when a variable has both {vectored} and (keyed) index traces. Also efficiency may be an issue. Probably should also forbid late-bound ^dereferencing, or else things will be really unmanageable.

It's not advisable to put trace flags directly in the value structure. Values are shared, and traces are bound to variables, not values.

AMG, years later: I think it's probably best to not support variable traces. The theoretical existence of traces severely limits the possibilities for optimization, and incrementing an epoch counter for stuff that can be done to local variables seems like it would defeat the purpose of optimization. As for the practical usability of traces, it obscures the actual program flow, certainly not a good thing. What of GUIs and linked variables? GUIs can work like they do in compiled languages: provide accessor methods. Linked variables have been very useful to me, but it's quite likely they're almost never used in the wild since they had a bug which disallowed zero or negative float values, and no one noticed until I came along.

But what of Tequila? That seems like a great use for traces. I guess the GUI solution is still needed: manually run a command to sync a variable or collection of variables.

Complete rewrite of garbage collection

AMG: The garbage collection scheme described in the Brush paper is very expensive. I tried my best to make sure it is only needed very rarely, then gave advice for how the programmer can further reduce its need. Whenever variables have nonzero refcount after removing their stack frame, the interpreter needs to confirm whether or not they're surviving because they're externally reachable (e.g. a reference was returned or placed in a global variable) or because they're merely self-referential. To tell the difference, it searches everywhere it can think of for external references.

There's a back door approach that's much faster. Instead of searching the whole world for external references, look inside the variables' values themselves to see if they account for the entirety of the remaining refcount.

I'll illustrate by example. Take this code:

proc &tricky () {
    set &a &b
    set &b (&a &a)
    set &c &d
    set &e &f
    set &f &c
    : &c
}

This returns a reference to a variable whose value is a reference to an unset variable. Most of the code just burns CPU. Silly, I know, but I'm trying to hammer the gc. Could have just said: proc &trickyfast () {set &c &d; : &c}.

At the end of [tricky], its local variable table will have entries for six variables: a through f. Let's say their reference values come out to be &100 through &105, correspondingly. Now, once the local variable table is blown away, the reference counts for all six variables will be decremented, resulting in the total refcounts in the following table:

name | reference | value     | total refcount | backrefs    | possible saviors | local refcount | result
a    | &100      | &101      | 2              | $b{0} $b{1} | b                | 2              | deleted
b    | &101      | &100 &100 | 1              | $a          | a                | 1              | deleted
c    | &102      | &103      | 2              | $f (result) | f                | 1              | saved because external
d    | &103      | (unset)   | 1              | $c          | c                | 1              | saved by c
e    | &104      | (ignore)  | 0              |             |                  | 0              | deleted
f    | &105      | &102      | 1              |             |                  | 0              | deleted

Note that the refcount for c is two because it's referenced not only by the value of f but also by the value in the interpreter's result field.

Since e now has zero refcount, it is deleted right away and its value is ignored in all further processing. Releasing that value in turn removes the only reference to f, so f's refcount drops to zero and it is deleted as well.

Next, analyze the values of all remaining variables to find references to other local variables. If a variable x's value references another variable y, x is a possible savior for y. If the number of references to a variable x found amongst the values of all remaining local variables is less than the total refcount of x, then x is external. If a variable is external or otherwise saved, it is saved along with all variables it is capable of saving. And if a variable is neither external nor saved by an external variable, it is deleted.

In this case, c and d are saved. Of course, they are no longer called that because their names existed only in the local variable table that was just deleted. They are now only accessible through the reference that was returned by [tricky]. Though some code somewhere might decide to bind the reference to a local variable name again.

What if [tricky]'s caller ignores its return value? The very next command will overwrite the interpreter result. When that result goes away, the refcount of the variable formerly known as c drops to zero, and it is deleted. Its value is likewise deleted because it's not referenced by its variable. That in turn decrements to zero the refcount of the variable once called d, and it is also deleted. It is unset and therefore has no value, so processing stops.

Name of create subcommand

AMG: Do I really want my list/dict/lot creation subcommands to be called create? For such a common operation, a shorter name would do, e.g. cons or make or new. But then again, Brush already has a far shorter name: just enclose the list/dict/lot in parentheses.

I'd say it'd still be useful to have [list new] and [dict new] and [lot new] for functional contexts, but shorthand exists for that too. All three are essentially [: ({*}$args)]. The only functional difference is in the non-canonical and error cases. In the non-canonical case for dict and lot, keys are duplicated, and the true [dict] and [lot] commands would omit all but the final duplicate of each key. The only error case is giving [dict new] an odd number of arguments, but [: ({*}$args)] would accept this, though eventually the returned value will be used as a dict, at which time the error would happen anyway. I'm perfectly okay with this. Also I like "new" better than "create".

Reclaiming [set]?

AMG: Probably not going to go with this, but another option for naming [lot] is to in fact call it [set], changing the traditional [set] command to [let] or [var] or [get]/[put] or [=]. I bet everyone would hate this, me included. But it's a thought, so I'm recording it here.

AMG: The Tcl "changes" file says the [set] command used to be named [var] [L4 ], so there is precedent for using the name "var" to get/set variable values. (To be more accurate, the precedent is exactly against this, but let's not split hairs haha.)

The quote is: 4. "Var" command has been changed to "set".

AMG: If Brush is going to have an [:] command, i.e. a command whose name is a symbol rather than a word, then that opens the door for it to have an [=] command to use for variable lookup and assignment. It's a possibility. Symbols are cool because their meanings are more open to interpretation than words. The pronunciation would be "equal" meaning "let the referenced variable be equal to a value, or query what value the referenced variable is equal to".

= &a 5
= &b 6
: $(a + b)

versus

set a 5
set b 6
expr {$a + $b}

AMG: This creates an interesting symmetry between single-argument [:] and single-argument [=]. Single-argument [:] returns its first argument directly, and single-argument [=] returns the first argument's referent. In general, : $var has the same result as = &var, and : $ref@ has the same result as = $ref.

AMG: Instead of calling it the "equal" command implying comparison, pronounce it "assign" or "let".

Cache the hash

AMG: The string representation's hash value can be cached in the Tcl_Obj (or whatever name) structure to avoid repeated computation and to speed up string equality tests. Reserve hash value 0 to mean the hash hasn't been computed since the last time the string representation changed. If the hash actually evaluates to 0, replace it with 1 or ~0 or such.

The actual hash algorithm need not change, and it has been shown to work enviably well for the kind of load Tcl experiences [L5 ].

JO's "Ousterhash" H(S(1..n),i=n) is simply defined:

  • H(S,i>0) = (H(S,i-1)*9 + S(i)) & ~0U
  • H(S,i=0) = 0
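For illustration, here's a rough Tcl transcription of that recurrence (a sketch only; the real implementation is C, operates on the bytes of the key, and relies on unsigned overflow instead of an explicit mask):

proc ousterhash {s} {
    # H(S,0) = 0; H(S,i) = (H(S,i-1)*9 + S(i)) & ~0U
    set h 0
    foreach ch [split $s {}] {
        set h [expr {($h * 9 + [scan $ch %c]) & 0xFFFFFFFF}]
    }
    return $h
}
puts [format %08x [ousterhash hello]]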

The minimum size for a hash table is four buckets. When the number of entries in a hash table exceeds three times the number of buckets, the hash table is rebuilt to have four times as many buckets as before. The hash function's least significant bits determine which bucket holds each entry.

The interface to the hash functions can stand to be improved. Tcl's internal hash structures have leaked into tcl.h and therefore are highly resistant to change due to the need to preserve the ABI [L6 ]. In particular, I miss the ability to pass opaque context pointers to custom hash implementation functions. The only safe way to do this is with global (but thread-local) variables.

Actually, Tcl already caches the hash for each entry in its hash table implementation. So it appears all I'm suggesting is to move this hash value into the object to speed up comparisons. I doubt this is worth the cost of increasing the size of Tcl_Obj.

Amortized constant-time dict element removal

AMG: Linear-time dict element removal sucks, and the only reason Brush has it is to maintain the list representation. Instead remove the elements from the hash table and mark the list as stale. An old email of mine to tcl-core describes some machinery for managing the relationship between list and dict, and it already has the ability to track which list elements aren't present in the hash table because they correspond to duplicate keys. This facility can probably be reused to quickly find elements that have been deleted while lazily updating the list representation. Also maybe periodically update the list when it gets too far out of sync with the dict so as to avoid keeping more than a small amount of garbage in memory.

The sticky wicket is managing object lifetimes. If a dict contains the last reference to something that needs to be cleaned up, that cleanup should happen as soon as the reference is removed from the dict. It would not do for the reference to be held in a tombstone garbage slot in the list, only to be cleaned up in response to a list operation being performed on the dict. Brush's internal data model and optimizations should have no effect on the order of operations visible at the script level. I believe the trick is to check if the removed keys or values contain references, then update the list eagerly if so.

Actually, a compromise presents itself. When removing references from a dict, don't collapse the list (the collapse is what incurs the linear-time cost); instead set the corresponding Tcl_Obj pointers in the list to NULL. Actually, might as well do that all the time, not just when removing keys and values containing references! Really, all that's being done is avoiding the slow memmove() needed to eagerly keep the list compacted.

So now I describe a hybrid eager/lazy method. Removed dict elements are eagerly finalized, whatever that entails, but the slots they occupy in the backing list are merely set to NULL. Compaction is deferred until the list needs to be treated as a list, or the index of a key is requested, e.g. using [lot search]. Maybe some heuristic will force compaction if the list is more than 75% NULL or something like that.
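To make the hybrid scheme concrete, here's a small script-level model of it in plain Tcl (purely illustrative; the command names are invented, and a real Brush implementation would do this in C on the internal representation):

# State is a dict with four fields:
#   d     - the hash part (key -> value)
#   l     - the backing list, with {} tombstones where elements were removed
#   pos   - key -> index of that key's slot in the backing list
#   holes - number of tombstoned slots
proc dl_new {pairs} {
    set pos [dict create]
    set i 0
    foreach {k v} $pairs {dict set pos $k $i; incr i 2}
    return [dict create d [dict create {*}$pairs] l $pairs pos $pos holes 0]
}
proc dl_remove {stateVar key} {
    upvar 1 $stateVar s
    dict unset s d $key                ;# eager: the element is finalized right away
    set i [dict get $s pos $key]
    set l [dict get $s l]
    lset l $i {}                       ;# lazy: tombstone the key slot...
    lset l [expr {$i + 1}] {}          ;# ...and the value slot
    dict set s l $l
    dict unset s pos $key
    dict incr s holes 2
    if {[dict get $s holes] * 4 >= [llength $l] * 3} {
        dl_compact s                   ;# heuristic: compact at 75% tombstones
    }
}
proc dl_compact {stateVar} {
    upvar 1 $stateVar s
    set l {}
    set pos [dict create]
    dict for {k v} [dict get $s d] {
        dict set pos $k [llength $l]
        lappend l $k $v
    }
    dict set s l $l; dict set s pos $pos; dict set s holes 0
}
proc dl_aslist {stateVar} {
    upvar 1 $stateVar s
    dl_compact s                       ;# a list view forces compaction
    return [dict get $s l]
}

set s [dl_new {a 1 b 2 c 3}]
dl_remove s b
puts [dict get $s l]    ;# a 1 {} {} c 3  (stale backing list, not yet compacted)
puts [dl_aslist s]      ;# a 1 c 3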

Using parentheses to make dicts

AMG: Here's a positive consequence of the dict/list merge in Brush.

First I'll explain using Tcl command names. In Brush, there's no performance advantage to creating a dict with [dict create] versus [list]. In Tcl, [dict create] is faster because it avoids creating a temporary list representation which will shimmer to dict the first time it is used as such. But in Brush, the two types are unified, and the only performance difference between the two approaches is whether the dict hash table is created eagerly or lazily. Either way, it's created (at most) once. In fact, the [list] approach means it might never have to be created, if the value is never used as a dict. (There's a non-performance difference: with [list], error detection [i.e. missing value to go with key] is also performed lazily.)
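Here's a rough Tcl micro-benchmark of that shimmer cost (results will vary by machine and Tcl version; this only illustrates the Tcl side of the comparison, not Brush):

set pairs {}
for {set i 0} {$i < 1000} {incr i} {lappend pairs k$i $i}
# builds the hash table directly
puts [time {dict size [dict create {*}$pairs]} 1000]
# builds a list first, which then shimmers to dict when used as one
puts [time {dict size [list {*}$pairs]} 1000]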

Now I'll use Brush notation. Brush's [list] command doesn't make lists; rather, it's an ensemble to collect list-related subcommands. I anticipate there being a [list create] command which works like Tcl [list]. Obviously, this is a lot to type, so Brush has a very convenient shorthand: (parentheses).

So, put these two thoughts together. If explicitly constructing as a dict versus a list offers no performance benefit, one might as well use whichever form is easier to type. And in Brush that would be list construction using (parentheses).

There's one gotcha, but Brush offers an answer to it too. That gotcha is that [dict create], etc. are commands, and many times you need a command, but (parentheses) are merely syntax for constructing a word, same as "quotes" and {braces}. So how do you use (parentheses) to make a list or dict in a context where you need a command? You could explicitly call [list create] or [dict create], but that's pretty clumsy to type. Brush's solution is the [:] command. When given a single argument, it simply returns it. That single argument could well have been constructed using (parentheses).

Let me summarize with an example. In Tcl, you'd write:

dict create $key1 $val1 $key2 $val2

But in Brush it's:

: ($key1 $val1 $key2 $val2)

The only difference in return values is the intrep and the fact that Brush got its result faster due to deferring construction of the hash table until the value is actually used as a dict. Again, the drawback is that error message generation is also deferred.

Collision between indexes-are-exprs and multiple list indexing

AMG: Multiple list indexing (as shown by one example on page 18) doesn't mix terribly well with the fact that all list (and string, really any numerical) indexes are interpreted as expressions, not just integers or end-integer. Space is the delimiter between indexes, yet whitespace is also permitted within an expression, and each index is an expression. This isn't an irreconcilable conflict, but it can get goofy and difficult to read. For example, $x{1 + 2 3} means $x{1 + 2}{3}.

In the interest of promoting readability, I want to require parentheses for expressions containing spaces when more than one index is provided.

Good  $x{(1 + 2) 3}           Parentheses protect spaces within expression
Good  $x{1+2 3}               No spaces used inside of expression
Good  $x{1 + 2}{3}            No more than one expression per pair of braces
Good  $x{$(1 + 2) 3}          All spaces contained within a nested substitution
Good  $x{[expr {1 + 2}] 3}    All spaces contained within a nested substitution
Bad   $x{1 + 2 3}             Some spaces separate expressions, others separate tokens within expressions

Alternately, don't allow arbitrary expressions as indexes.

Implicit expr    Explicit expr
$x{(1 + 2) 3}    $x{$(1 + 2) 3}
$x{1+2 3}        $x{$(1+2) 3}
$x{1 + 2}{3}     $x{$(1 + 2)}{3}

I definitely don't like this "solution" though. It forces another level of symbols which I find unwelcome, and there's no benefit. There's already sufficient context for the interpreter to know to expect a math expression, so requiring the programmer to say it in two ways is just a waste and does not contribute to readability.

Double substitution creeps back in

AMG: Again, since numerical indexes are expressions, not just constant numbers, substitutions need to be done by the expression engine and not the interpreter. Same problem as with [expr] in Tcl, but in a lot more places. My fix was to allow variables in expressions to not have dollar signs, though this doesn't work for all possible variable names. I guess in such cases, braces are required. But they would have been required anyway had [expr] been used.

Take [string index] as an example. Its first argument is the string from which to retrieve the character, and its second argument is the index of the character to retrieve. That second argument, being an index, would now be treated as some combination of expression and the magic prefix end.

In the simple, common cases of the index being a literal integer, there is no problem. These cases are the same as current [string index] usage where no substitution takes place.

Now, what about indexes that are the result of variable substitution? Well, try it and see:

set &index 4
string index foobar $index

Here, [string index] receives the expression 4 as its index. Parsing and evaluating 4 as an expression yields the number 4 which seems alright, but what if $index contained... something else?

set &index {[exit]}
string index foobar $index

This would terminate the program. Bad news. Correct usage would be...?

string index foobar {$index}    ;# mirrors current correct [expr] usage
string index foobar index       ;# alternative thanks to implied $ in expressions

This bothers me. One thing I had wanted to do with Brush is have the programmer explicitly identify variable names (& prefix) even when not taking their value ($ prefix). Yet here the name is being given as a literal string in order to avoid a security issue, and I had also wanted the easy thing to be correct, safe, and fast. It's perhaps not so easy to remember when it's required to leave off all prefixes in order to be safe. foobar isn't the name of a variable or any such; it's the actual string being indexed. But index is the name of the variable containing the numerical index. So more thought is required here.

Another case! Now let's have the index come from a script substitution.

proc &command () {: 4}
string index foobar [command]

(The proc name needs an & prefix because it's a variable name, and procs are simply lambdas stored in variables. () is the preferred constructor for an empty list, though "" and {} and [] would have also worked. The : 4 is basically shorthand for return 4.)

Here, [command] returns 4 which is taken as the expression giving the index. Same concerns as before. What if...

proc &command () {: {[exit]}}
string index foobar [command]

This also exits. Correct usage would be...

string index foobar {[command]}

Yup, the need to always brace your expr-essions has crept back in. Sigh...

What's the solution? Well, my earlier solution to the bracing problem was to provide a three-character $(...) shorthand for performing expression substitution, which is significantly easier than the nine-character [expr {...}] locution we know and hate.

But here we're getting into trouble again because every command that takes an index argument now actually takes an [expr] argument and is therefore subject to the bracing requirement. I had thought I'd be mitigating that by additionally making $ variable prefixes optional inside expressions in the overwhelmingly common case that there's no possibility for ambiguity. But that only made sense when the parser could easily tell that it's in an expression context (i.e. inside $(...)). Here, that cue is lost because indexes are expressions, and any command argument could be an index.

So fix it by not allowing arbitrary expressions as indexes? This means explicit usage of $(...) in more places, but that should be okay because it's still a tremendous savings over the current Tcl situation of needing [expr {...}] in all those same places.

But there remains a performance drawback when the end prefix is considered, since that implies slow string concatenation and reparsing. I had solved this by making the index notation innately understand both expressions and end prefixes so they'd all fit inside the same intrep or bytecode with no need for shimmering to and from string. So add end prefixes to the expression syntax but only allow its use where an index is required? Maybe that's what's needed. Hmm!

Still not happy with this. Consider another place where indexes are used: variable list indexing. The interpreter already knows (or could know) that it's in expression territory because it saw the braces following a variable name ($ or & prefix) and whatever else, so there's no danger of double substitution here. But there is another danger: inconsistency. That is, unless I can talk it away. :^) Having those indexes allow expressions with no need for $(...) would conflict with indexes passed as arguments requiring $(...).

What do I mean when I say, talk it away? I could instead say (1) yea verily, end prefixes are legal in expressions whose results are used as indexes, and (2) variable list indexing takes expressions, not just simple indexes. Formulated this way, there is no inconsistency, but I've added a level to the concept hierarchy.

Aside from the dirty feeling of having had to resort to smoke and mirrors, I've also made it questionable whether the string end or any other string with an end prefix is acceptable as an index. I don't like having the same thing in two places, so for me to put end prefixes in the expression syntax (albeit only legal when the expression ends up being used as an index), makes me want to remove it from the index syntax and instead require writing end as, say, $(end).

Now for the killer. What is the string representation of $(end)? If I were to [puts] it, what would arise? It's not being used as an index, so that's illegal. In other words, EIAS is violated. Fatal, right?

Well, yes, but not to the entire concept, only rather to my misguided notion that I need to remove end prefixes from index notation if I want to add them to expression notation. The string representation is easily in my grasp: keep the end prefix! So puts $(end) would print end, and puts $(end-5*2) would print end-10.

But surely something remains illegal. That would be attempting to use a string starting with end, no matter how that string was derived, where an ordinary number is expected. It would only fly where an index or general string are expected. This is actually current Tcl behavior, so there is no change.

Need more time before I actually decide on anything... I just wanted to write down my reasoning so I could revisit later and see how dumb I was so I can avoid repeating old mistakes as I go on to make bigger and better mistakes.

AMG: Revisiting after much time has passed. Okay, so what I seem to have decided is that security concerns require that commands taking index arguments not interpret those arguments as arbitrary expressions to be evaluated. This restriction means the caller must explicitly use $(...) notation to evaluate the index expression prior to passing them as arguments. Use of $(...) avoids the security problem. To make this possible even when the end prefix is used, $(...) must be capable of internally understanding end and leaving end in the result. All that is acceptable but requires explanation.

Like Tcl expressions, Brush expressions yield string values, and those strings are grouped into type categories according to how they are used. The new type being introduced is the index, whose value can be an integer, end, end-integer, or end+integer. The end prefix comes from the original index expression notation. Just because indexes can be integers doesn't mean that indexes are always valid where integers are expected.

Let me demonstrate. Here's the setup:

set &index 4           # or maybe: set &index {[exit]}
proc &command () {: 4} # or maybe: proc &command () {: {[exit]}}

And here are the command examples. Passing a quoted expression used to be safe but is now invalid because expressions are no longer indexes. Passing the result of a substitution was valid but unsafe and is now safe because indexes are not subjected to further substitution.

Command                           Old design     New proposal
string index foobar index         valid, safe    invalid
string index foobar {$index}      valid, safe    invalid
string index foobar {[command]}   valid, safe    invalid
string index foobar $index        valid, unsafe  valid, safe
string index foobar [command]     valid, unsafe  valid, safe
string index foobar $(index)      valid, unsafe  valid, safe
string index foobar $($index)     valid, unsafe  valid, safe
string index foobar $([command])  valid, unsafe  valid, safe

Some more complex examples:

set &a 1
set &b 3
string index foobar $(a+b)
string index foobar $(end-a-b)
set &i $(end-a-b)
string index foobar $i

Another thing to mention is that this new proposal closely matches current Tcl behavior.

Impact of brace counting changes

AMG: Brush significantly redesigns brace counting in order to avoid comments resulting in mismatched braces, for instance in this Tcl code:

proc moo {x} {
#  if {$x > 5} {
   if {$x > 20} {
      puts $x
   }
}

A human reading this would expect the first [if] line to be commented out, no questions asked. That would be true if [proc] were ever allowed to execute. But it won't get that chance because the Tcl interpreter never finds the closing brace for [proc]'s final argument. Comments have zero meaning to the brace counter, which only cares about braces and backslashes.

Brush's brace counter is also mindful of comments, and it declines to count braces that appear inside what would be a comment if the braced text were evaluated as a Brush script.

Brush also skips braces that appear inside double-quoted words, making code like the following possible:

proc moo {x} {
    if {$x} {
        puts "{"
    } else {
        puts "}"
    }
}

This code actually doesn't work. Try it and see for yourself. The Tcl fix is to quote the braces with backslashes instead of, or in addition to, double quotes. But the Brush fix is to recognize the double quotes and skip braces appearing inside.

Of course all this means Brush has to keep track of where pound signs and double quotes are actually capable of starting a comment or quoted word. Plus Brush has #{block comment notation}# for comments which can span multiple lines or appear within a single line, and its #regular line comments can begin at any word, not just the first of a command.

So, what's the impact? This makes canonical quoting rules more complicated.

To the best of my knowledge, Tcl's canonical quoting rules are:

  • Backslash-quote words ending with an odd number of backslashes or containing backslash-newline or mismatched braces.
  • Brace-quote empty string and words starting with pound sign or containing whitespace, brackets, dollar signs, matched braces, double quotes, or backslashes.
  • Do not quote words that aren't described by the above.

(Actually, this isn't quite right... I need to further research Tcl's behavior regarding double quotes. I also just noticed that Tcl sometimes backslash-quotes closing brackets, but I can't figure out why. Guess there's always one more thing, huh?)
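A few quick checks in a tclsh show the rules above in action (expected output in the comments; this is a spot check, not an exhaustive survey of Tcl's behavior):

puts [list {}]      ;# {}      empty string is brace-quoted
puts [list a b]     ;# a b     plain words need no quoting
puts [list {a b}]   ;# {a b}   whitespace forces brace quoting
puts [list #foo]    ;# {#foo}  a leading pound sign forces brace quoting
puts [list \{]      ;# \{      a mismatched brace must be backslash-quoted
puts [list a\\]     ;# a\\     an odd number of trailing backslashes is backslash-quoted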

What are the rules for Brush? It has to worry about the above (well, except backslash-newline, but that's another story), but (among other things) it also has to backslash-quote words ending in a comment. (Words ending in newline should be fine.)

Things go off the rails when the brace-quoted word is not a Brush script but rather code written in a language with different comment rules and a different purpose for pound sign. For example, C:

puts $chan {#include <stdio.h>}

The above doesn't work since the closing brace is thought to be part of a comment. Instead, one of the following must be used:

puts $chan "#include <stdio.h>"
puts $chan \#include\ <stdio.h>
puts $chan {#include <stdio.h>
}

The following will not work. Even though it does successfully inhibit the comment behavior for the brace counter, it will emit the backslash which is not valid C.

puts $chan {\#include <stdio.h>}

Understandably, I'm not happy about this, and I will need to think on it further.

AMG: Okay, I think I'm willing to accept this because a thoroughly decent workaround exists (use quotes). In Tcl it's already the case that not every string can be brace-quoted, so I'm not going to be able to fix that. Multiple quoting modes exist so any string can be encoded if necessary.

AMG: Need to come up with an answer to Lars H's 2006-08-04 post on Why can I not place unmatched braces in Tcl comments.

Extended [set] command

AMG: Brush defines [set] to take the place of [lassign], exploiting the fact that it's always possible to distinguish between a reference and a list with non-unit length. When given a list with length two or greater as its first argument, Brush treats it as a list of references to variables into which elements from its second argument will be assigned, with the leftover elements returned, similar in fashion to [lassign]. If the list has length two, its second element can be empty string, and it is treated as a list of only one reference, to be processed as described in the preceding sentence.

I don't always care about every element in a list. My current practice is to assign to a dummy variable called _, then ignore its value. Brush can do the same with &_, but maybe we can do better than that. If an element of [set]'s first argument list is - (let's mix it up a bit, shall we?), that means to throw away the corresponding element of its second argument list. This should provide a marginal performance improvement, avoid cluttering the namespace with unused variables, and (perhaps most importantly) would be a documented, officially-supported idiom rather than an ad-hoc practice subject to variation and misinterpretation.

AMG: This wouldn't prevent having variables actually named - since - is distinct from &{-}.

The first argument to two-argument [set] is either a reference or a list. If a reference, the second argument is assigned into the variable or element thereof identified by said reference. If a list, each element is a reference (into which a list element of the second argument is assigned) or - (to inhibit assignment). If a two-element list, the final element may be empty string, in which case it is treated as a single-element list containing only the first element.

AMG: A further idea would be to allow the elements of [set]'s first argument to be two-element lists like the parameter list argument to Tcl's [proc] command. Variables (references, actually) that would otherwise go unassigned will instead get the default value supplied by the second element. If no second element is given (i.e. it's a reference and not a two-element list), the referent is unset. This would be a change from the current Brush paper which says it's an error to have too few values for assignment.

Going further, maybe adopt the full Brush [proc] parameter list syntax, except using references instead of names. For this to work well, the [proc] parameter list syntax should be changed to not append special characters to the parameter names, but I think that's a worthy improvement anyway.

AMG: Actually, supporting the parameter list format would largely remove the need for returning the excess elements because it would be possible to write stuff like:

= ((&elem ? ()) (&list *)) $list      # set list [lassign $list elem]
= (($elem ?) (&list *)) $list         # idem, but unset elem if list is empty
= (&elem (&list *)) $list             # idem, but error if list is empty

Even more fun:

= ((&list *) (&elem ? ())) $list      # set elem [lindex $list end]; set list [lrange $list 0 end-1]
= ((&list *) &elem) $list             # idem, but error if list is empty
= (&beg (&list *) &end) $list         # move first and last elements from list to beg and end, respectively

pass keyword to return excess elements

AMG: Just in case there really is a situation where it's necessary to return the excess elements, the parameter list definition could be extended to support a pass keyword, reminiscent of the - keyword described above. pass can be used at most once, and not in combination with a * parameter. When used, it collects all unassigned inputs in the manner of a * parameter, and [=] returns the collected inputs as a list.

= (&a &b &c pass) $list               # lassign $list a b c
= (pass &a &b &c) $list               # lassign [lrange $list end-2 end] a b c; lrange $list 0 end-3
list size [= (&a &b &c pass) $list]   # llength [lassign $list a b c]

This eliminates the need for the special empty string hack described in the paper which was used to force a single reference to be interpreted as a list. Instead of writing (&a ()), write (&a pass).

If pass is not used, what should extended [=] return? Empty string is probably the best choice. The other option is to return its second argument, but this will hurt performance in the case of [=] being used to take elements from a list, modifying the list variable in the process. Returning the original value of the list variable would require that a copy be maintained, though the copy would almost never be used. If the caller really wanted that copy, [:] is available for that purpose:

: $list [= (&a &b &c (&list *)) $list]

Never mind, forget about pass

AMG: I'm having a hard time getting comfortable with pass meaning "pass through". Perhaps result would work better since it directly means the [=] command's result.

AMG: Actually, I'm having a hard time getting comfortable with having the feature at all. What is it good for? Everything it does is more clearly expressed in terms of assigning to a variable through the other modes of operation. I only came up with it as a way to fully implement [lassign] in terms of [=], but is this a worthwhile goal? I mean, this is possible, explicit, and more flexible:

= (&a &b &c (&result *)) $list; : $result

As for its use to avoid the empty string hack, what is the empty string hack accomplishing anyway? That came from an earlier design for [=] (then named [set]) in which it was a direct replacement for [lassign], always returning the unassigned elements. That feature was only really desirable in combination with popping elements off a list, but that is now better achieved by catchall-assigning back into the list variable with the extended notation introduced here. So once again, it's not needed. Going forward, let's say that [=] always returns empty when doing list assignment.

The remaining use case for the empty string hack is assigning the first element of a list into a variable. That can be done more clearly with [= &a $list{0}]. And if the list value isn't contained in a variable, functional substitution can be used: [= &a $[command]{0}]. Or (highly unlikely) if the value is the result of non-command shenanigans like concatenation and list construction, use [:] as a shim to make it be a command: [= &a $[: (a b c)]{0}], but at that point you're really just trying to make things hard.

A new need for the empty string hack arises below, though for a different purpose than previously envisioned. I'm completely changing the syntax though so it looks less like a hack and more consistent with the rest of the language.

Change order of special list elements passed to [=]

AMG: Parsing is much easier when the * and ? modifiers are the first list elements, instead of the second. This lets humans and computers read left-to-right with no need to backtrack:

= ((? &elem ()) (* &list)) $list      # set list [lassign $list elem]
= ((? $elem) (* &list)) $list         # idem, but unset elem if list is empty
= (&elem (* &list)) $list             # idem, but error if list is empty
= ((* &list) (? &elem ())) $list      # set elem [lindex $list end]; set list [lrange $list 0 end-1]
= ((* &list) &elem) $list             # idem, but error if list is empty
= (&beg (* &list) &end) $list         # move first and last elements from list to beg and end, respectively

Allow nesting

AMG: Why not let the syntax be recursive? In combination with letting [loop]/[collect] use the same syntax as [=], this removes the need for unpack and unpackover.

Data structures nest, so it's useful to nest the list of references into which data structure elements are assigned.

Additionally, I'd like to extend the * catchall to support multiple references, to be assigned in the manner of Tcl [foreach].

One more change: Instead of - inhibiting assignment, let's use /. My reasoning is I wish to maintain commonality with [proc]'s second argument (the parameter list), yet I am considering extending [proc] to have inhibited assignment, if it's not there already. The challenge is that [proc] must be able to produce a syntax summary from its parameter list if called with wrong arguments, but what should [proc] say for inhibited assignment? If it's told -whatever instead of -, it would know to print whatever, so that solves the problem. But doing so introduces the problem of making it appear like -whatever is an option. Hence I want a different character, and / is available, so instead say /whatever and all is well.

As mentioned above, the need for the empty string hack returns. Even though it may not be useful at the top level, it can come in handy with nested lists. Without it, a list element would be assigned. With it, the list element would be required to itself be a single-element list, and that single element would be assigned. But the syntax sucks. Instead I want to leverage the notation already established, namely having the first element of the list indicate how it should be interpreted. The ' character is handy, and it is vaguely reminiscent of Lisp's quote operator, which Lisp also abbreviates as '. Plus ' looks like a tiny 1, reinforcing the fact that it is to be used in cases where (semantically speaking) the list has one element.

Here's a syntax summary:

Notation      Interpretation
ref           Required element, assigned via reference
/             Required element, assignment inhibited
(nest ...)    Recursive definition
(' nest)      Single-element recursive definition
(? nest)      Optional element
(? nest val)  Defaulted element
(* nest ...)  Catchall elements, alternating assignment

(* &var) assigns a list of elements to var, so (* (&v1 &v2)) should likewise assign lists to v1 and v2.

There is a difficult interaction between (* ...) and (? ...). Data elements are not directly assigned to references recursively nested within (* ...), but rather these references are set to lists of elements encountered. (? ...), when not given a default value, unsets the referenced variable or element when it does not have a corresponding data element to assign. These two rules result in ambiguity when some elements are assigned and others not. The reason is that lists in Brush are dense, not sparse; there's no way to put a "hole" in a list, other than at the end. Unsetting a list element means deleting it, and its existence can't later be tested.

The chosen solution is to add one level of list encapsulation to all elements of all references assigned to non-defaulted (? ...) when recursively nested within (* ...). This provides a sparse list implementation in which each element is a zero-length list if absent or a single-element list if present, and that element is the data. For example, {{hello world}} {} {{how are you}} is such a list, as is {} 1 2. Other solutions are possible, but this seems to be the simplest.

# concatenation and flattening
= &x (1 2 (3 4) ((5 6)) (((7 8))))
: $x    # 1 2 {3 4} {{5 6}} {{{7 8}}}
= (* &x) (1 2 (3 4) ((5 6)) (((7 8))))
: $x    # 1 2 3 4 {5 6} {{7 8}}
= (* (* &x)) (1 2 (3 4) ((5 6)) (((7 8))))
: $x    # 1 2 3 4 5 6 {7 8}
= (* (* (* &x))) (1 2 (3 4) ((5 6)) (((7 8))))
: $x    # 1 2 3 4 5 6 7 8

# input must be a single-element list
= (' &x) 1
: $x    # 1
= (' &x) ()
       # error: too few elements when assigning to {' &x}
= (' &x) (1 2)
        # error: excess elements when assigning to {' &x}
= (' &x) ((1 2))
: $x    # {1 2}

# input must be a list of single-element lists
= (* (' &x)) (1 2 3 4 5 6 7 8)
: $x    # 1 2 3 4 5 6 7 8
= (* (' &x)) ((1 2) (3 4) (5 6) (7 8))
        # error: excess elements when assigning to {&x}
= (* (' &x)) (((1 2)) ((3 4)) ((5 6)) ((7 8)))
: $x    # {1 2} {3 4} {5 6} {7 8}

# multiple variables in combination with *
= (* &x &y) (1 2 3 4 5 6 7 8)
: $x    # 1 3 5 7
: $y    # 2 4 6 8
= (* &x &y) (1 2 3 4 5 6 7)
        # error: too few elements when assigning to {* &x &y}
= (* &x &y) ((1 2) (3 4) (5 6) (7 8))
: $x    # {1 2} {5 6}
: $y    # {3 4} {7 8}
= (* (&x &y)) (1 2 3 4 5 6 7 8)
        # error: too few elements when assigning to {&x &y}
= (* (&x &y)) ((1 2) (3 4) (5 6) (7 8))
: $x    # 1 3 5 7
: $y    # 2 4 6 8
= (* (&x &y)) (((1 2) (3 4)) ((5 6) (7 8)))
: $x    # {1 2} {5 6}
: $y    # {3 4} {7 8}

# ? in combination with *
= (* &x (? &y)) (1 2 3 4 5 6 7)
: $x    # 1 3 5 7
: $y    # 2 4 6 {}
= (* &x (? &y)) ((1 2) 3 (4 5) (6 7) (8 9) () (10 11))
: $x    # {1 2} {4 5} {8 9} {10 11}
: $y    # 3 {{6 7}} {{}} {}
= (* &x (? &y _)) ((1 2) 3 (4 5) (6 7) (8 9) () (10 11))
: $x    # {1 2} {4 5} {8 9} {10 11}
: $y    # 3 {6 7} {} _
= (* ((? &x) (? &y))) ((1 2) 3 (4 5) (6 7) (8 9) () (10 11))
: $x    # 1 3 4 6 8 {} 10
: $y    # 2 {} 5 7 9 {} 11
= (* ((? &x) &y)) ((1 2) 3 (4 5) (6 7) (8 9) () (10 11))
        # error: too few arguments when assigning to {{? &x} &y}
= (* ((? &x) &y)) ((1 2) 3 (4 5) (6 7) (8 9) (10 11))
: $x    # 1 {} 4 6 8 10
: $y    # 2 3 5 7 9 11

Make [proc] parameter list special characters be separate list elements

AMG: Instead of writing:

proc &p (a b? (c? xxx) d (e= yyy) f* g? h) {...}

Write:

proc &p (a (b ?) (c ? xxx) d (e = yyy) (f *) (g ?) h) {...}

This avoids trouble if the parameter (variable) name actually ends in a special character, plus it better suits the proposed change to extended [set].

With this change, here are all the possible forms:

Notation     Interpretation
var          Required argument (shorthand)
(var)        Required argument (explicit)
(var ?)      Optional argument
(var ? val)  Defaulted argument
(var *)      Catchall argument
(var = val)  Bound argument
(var & ref)  Bound reference

The one I call out as "explicit" is useful if the variable name is very strange and may be interpreted as a list.

AMG: This change makes room for more features. I'm considering named arguments in the manner of options.

However, this might not be the best level at which to implement the feature. Tk, for instance, has an option database which can supply defaults for things not passed as command option arguments, and this is integrated with each widget instance's configure and cget subcommands.

But then again, maybe it is right, since there is nothing stopping the parameter list to [proc] from being the return value of a command which works out all the defaults in a consistent way. Though it's also possible to wrap [proc] itself with a command that does this and more. Or have a common parsing command which each proc can invoke on its catchall argument, just like current Tcl practice, so that no one option scheme is given preferential treatment by the language. Brush's relaxation of Tcl's restriction that the catchall argument be the final one makes this a lot easier to do.
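For comparison, here's the kind of common parsing command a proc can invoke on its catchall argument in current Tcl (the names are invented for illustration; real code would likely want abbreviation handling and better error reporting):

proc parseopts {argsList defaults} {
    # argsList: alternating -option value words, e.g. a proc's $args
    # defaults: dict of recognized options and their default values
    if {[llength $argsList] % 2} {
        error "option \"[lindex $argsList end]\" requires a value"
    }
    set opts $defaults
    foreach {opt val} $argsList {
        if {![dict exists $opts $opt]} {
            error "unknown option \"$opt\""
        }
        dict set opts $opt $val
    }
    return $opts
}
proc greet {name args} {
    set opts [parseopts $args {-greeting Hello -punctuation !}]
    puts "[dict get $opts -greeting], $name[dict get $opts -punctuation]"
}
greet world -greeting Howdy    ;# prints: Howdy, world!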

What I have already defined is (I think) a mostly minimal subset needed to implement anything else. I say "mostly" because required and optional arguments could be defined in terms of the catchall argument. But bound arguments and references are special in that they're supplied when the command is defined, not when it is invoked.

AMG: A further refinement is to swap the order of elements. Additionally support the nested modes of operation allowed by [=], [loop], and [collect]. Plus require / name its formal argument for the sake of usage/error message generation. Like so:

Notation      Interpretation
name          Required argument (shorthand)
(! name)      Required argument (explicit)
(/ name)      Ignored argument (named) [*]
(nest ...)    Recursive data structure argument
(' nest)      Single-element recursive data structure argument
(? nest)      Optional argument
(? nest val)  Defaulted argument
(* nest ...)  Catchall arguments, alternating assignment
(= nest val)  Bound argument [*]
(& nest ref)  Bound reference [*]

The "explicit" mode is needed if the name resembles a list which would otherwise be interpreted as a nested construct. The syntax change is required to avoid ambiguous cases.

The nested modes of (= ...) and (& ...) are intended to accommodate val and ref being programmatically generated structures. (= ...) and (& ...) themselves cannot be used in a nested construct.

Lines marked [*] denote features present in [proc] formal argument lists that significantly differ from [=]/[loop]/[collect] assignment lists. [*] or no, in all cases, the syntax differs in that the variables are given as names and not references, because the names apply inside the [proc] body (to be executed later) and not the context in which they are written and originally encountered.

The (/ name) mode is changed to require that a name be given. This is needed for when the generated proc is invoked with a wrong number of arguments and must produce an error message describing its correct usage.

Quite a lot of symbols are involved, so it's definitely appropriate to justify their choice and provide a mnemonic.

Symbol  Explanation
!       Exclaims (forcefully asserts) that the item is in fact a single argument even if it looks like something else
/       Slashes through the item so that it takes up space on the page but doesn't actually have any meaning
(...)   Groups multiple elements into a single list, and therefore a single argument, using standard Brush list construction notation
'       Quotes a single-element list to distinguish it from a top-level (non-list) element; also resembles a superscript "1" because the list has one element
?       Whether or not the argument is present is a question whose answer isn't known in advance
*       Similar to the Kleene star [L7 ], which indicates any cardinality is acceptable
=       Predetermined assignment, as in the [=] command
&       Predetermined reference, as in the & construct used to make references elsewhere throughout the language

Be strict

AMG: Tcl tries to be generous with stray $dollar signs and close [brackets]. If it's unable to use them in a substitution, it just passes them through literally. In my experience, this tends to hide bugs or obscure their origin. For example, set chan [open file]]] is legal, and no error happens until $chan is passed to any file I/O commands, which will fail to find the channel due to the trailing close bracket. To force programmer intent to be expressed more clearly, and to detect errors as early as possible, I want to require \backslashes or braces around all dollar signs and close brackets that aren't intended to be part of a substitution.

Tcl is also generous with {braces} and "double quotes" that aren't at the start of the word. They have no special meaning unless they're at that privileged position. This is a major departure from Unix shells which allow quoting to start, stop, and change styles anywhere within a word, so there is potential for user confusion. Requiring quoting for these special characters when not at word start would help to more quickly educate the user. Brush allows words to be quoted with parentheses as well as braces and double quotes; for consistency, (parentheses) would also need this strict treatment. Literal parentheses could be quoted with backslashes, braces, or double quotes.

One consequence of this last change would be to more quickly flush out code that omits the leading &ampersand when constructing a variable reference. Detecting this error early would be helpful because adding the requirement for & is a significant difference between Tcl and Brush.

For example, a(b) would be illegal; one must write &a(b), $a(b), {a(b)}, "a(b)", or a\(b\). In all cases, it is manifestly clear whether the programmer intends to make a reference, take a value, or simply have a literal string.

Lastly, Tcl is generous with #pound sign, which only has special meaning when it appears where the first character of a command name could appear. Brush allows comments in more places, but still they can't start in the middle of a word. (I'm considering changing that last aspect, but I'm not committed yet.) I want to require that non-comment pound signs be quoted the same as any of the other special characters I just discussed. A lot of Tcl code already (unnecessarily) backslash-quotes pound signs, for instance in combination with [uplevel], so this wouldn't be a surprising change, but rather one in line with (admittedly naïve) programmer expectation.

Basically I want the rule to be that special characters always require quoting to disable their special interpretation, even in contexts where their special interpretation isn't applicable. This should reduce surprise, encourage consistency, detect errors early, and maximize possibility for future expansion.

AMG: Also I want it to be an error to pass too many arguments to [format].

Be lenient

AMG: Wait, I thought we were going to be strict here? Well, there's one case that I think leniency is called for. That would be the case of extra whitespace in line continuations. Compare the following two Tcl code snippets.

puts\
hello

and

puts\ 
hello

The first works, but the second dies with invalid command name "puts ". Trouble is that the whitespace at the end of the first line (bet you didn't even know it was there) causes the backslash to not be interpreted as a line continuation.

I propose to change backslash-newline-whitespace to instead be backslash-whitespace-newline-whitespace. If a backslash is followed by any amount of linear whitespace, then one newline, then any amount of linear whitespace, then upon execution, the entire sequence is instead treated as a word separator.

AMG: On second thought, this is dangerous. What if the last argument to a command legitimately needs to be a whitespace character? Normally it would be quoted with braces, but perhaps other characters in that same argument can't be quoted with braces, e.g. the argument contains mismatched braces, in which case backslashes must be used. What about double quotes? Sure, they'd solve the problem, but they're not going to be emitted by [list] or the Brush list constructor (parentheses) since double quotes don't nest as well as braces. I want for Brush lists to always be usable as commands when generating code.

Sorry, this feature seems like it may be a liability.

The [take] command

Motivation

AMG: One use for the [K] command is to improve performance in cases where Tcl's sharing semantics incur cost with no benefit [L8 ].

set x [lreplace $x 2 3 foo bar quux]

[lreplace] is forced to create and operate on a copy of $x, even though the original $x will be thrown away very soon after [lreplace] completes. Of course it can't know this in advance, especially if there is any chance of error.

However, [lreplace] can run much faster if its first argument is unshared (e.g. it is not stored in a variable). The current, most efficient Tcl way to do this is:

set x [lreplace $x[set x {}] 2 3 foo bar quux]

This is efficient but confusing to read. It can also be written using [K], though there is little improvement to readability:

set x [lreplace [K $x [set x {}]] 2 3 foo bar quux]
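A rough timing comparison shows why anyone bothers with this idiom (numbers will vary with Tcl version and machine; the unshared form lets [lreplace] modify the list in place instead of copying it):

set x [lrepeat 100000 0]
puts [time {set x [lreplace $x 2 3 foo bar quux]} 100]            ;# shared: copies the list every pass
set x [lrepeat 100000 0]
puts [time {set x [lreplace $x[set x {}] 2 3 foo bar quux]} 100]  ;# unshared: modifies in place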

Design approach

I suggest a new [take] command which sets a variable to empty string then returns its old value, which has a higher chance of being unshared at that point. Its argument would be a reference to the variable, with normal Brush indexing permitted. I don't think [take] should always force its return value to be unshared, since whatever uses the value is certainly capable of making its own unshared copies if need be. All it needs to do is take the value away from the variable. It shouldn't unset it because that is expensive, and this is meant as a performance optimization.

Here's a theoretical Brush rendition of the above:

set &x [list replace [take &x] 2 3 foo bar quux]

By the way, in Brush this particular example could have been written without the [take]; I just copied it from the [K] page. Brush has more direct syntax for replacing parts of lists:

set &x{2:3} (foo bar quux)

Now combine the two for an interesting point. After running the above, [take &x{2:3}] would return "foo bar". It would also behave like [set &x{2:3} ()], which is the same as [unset &x{2:3}] and removes the two requested elements from the list. It does not replace each taken element with empty string. The reason is that the operation is being done on a list range, not individual elements.

Oh yes, another thing to mention: Interactive (shell) use of Tcl or Brush implies that the [history] mechanism also keeps references to all the old result values. This also affects performance, but maximal performance in an interactive environment is less important. In Tcl, this can be inhibited by forcing commands to result in empty string by ending them with ";list". In Brush, the construction is ";:", since the [list] command does something different than in Tcl.

Implementation of [take]

AMG: Speaking of [:], here's a Brush implementation of [take]:

proc &take (var) {
    : $var@ [set $var {}]
}

Let's go through it in excruciating detail to review the similarities and differences with Tcl.

[proc]'s three arguments are:

  1. The reference to the locally scoped variable in which the procedure's lambda will be stored.
  2. The list of argument names (not references). Because this is a list, (parentheses) notation is recommended. Special syntax for things like defaults is supported but not used by this example.
  3. The script to execute when the procedure is called.

In this case, the script consists of a single command, [:]. [:] takes any number of arguments, and it returns the first argument (or empty string if no arguments). Two-argument [:] is identical to [K].

The arguments being given to [:] are:

  1. $var@. The $ introduces it as a variable substitution, then var means to get the value of the local variable called "var". The @ says to interpret that value as a reference and get the value of the referenced variable. The caller is supposed to have passed a reference to a variable, so this works out perfectly. Brush references are valid across any number of scopes and have other interesting properties. This gives them many advantages over the Tcl equivalent of passing around variable names. Anyway, this value becomes the result value of [:] which in turn becomes the return value of the proc as a whole, no matter what side effects may take place in the generation of [:]'s subsequent arguments.
  2. [set $var {}]. This is a script substitution, also known as command substitution. [:] ignores all arguments but the first, so the fact that this evaluates to empty string is inconsequential. The important thing is that it has a side effect.

The arguments to [set] are:

  1. $var. [set]'s first argument is a reference to the variable it is to access and (optionally) modify. It doesn't matter where that reference came from. In this case it was put forth by [take]'s caller, but who knows who originally generated it? Doesn't matter. A variable will continue to exist for as long as references to it exist, so this reference could even be to a variable that was a local in a procedure that terminated long ago. It would still work. Anyway, there's no @ here because [set] wants the reference, not the value.
  2. {}. This is the value [set] will put in the variable. In this case, it's empty string. It could also have been written [], (), or "". All four are equivalent, but {} is the canonical representation of empty string.

When run, [take] remembers the current value of the variable which it received by reference, sets that variable to empty string, then returns the remembered value.

Automatic [take]

Okay, back to the concept of [take]. After thinking about it a bit more, it occurred to me that it may be worthwhile and not altogether impossible to automatically detect situations where [take] would be helpful. When a variable is about to be overwritten (barring an exception/error, ugh), it need not contribute to its value's reference count. If the interpreter can somehow look ahead and see that the variable(s) contributing to an argument will be overwritten, speculatively decrement its(/their) reference count(s).

set &x [list replace $x 2 3 foo bar quux]

Here, [list replace] doesn't know where its return value will go, nor does it know where its arguments came from. Clearly it's not capable of doing any of the optimizations I describe.

Does the interpreter have a broad enough perspective to recognize the opportunity for optimization? So, what does the interpreter know? It knows it's invoking some command and passing it the value of a variable as an argument. It also knows it's passing the return value of that command as an argument to another command which also takes a reference to that same variable as its other argument.

That's not enough information. The interpreter doesn't know what any of the commands are actually doing, so it doesn't know the second command will overwrite the referenced variable. It can't even be sure the second command will ever be called; the first command might have an exceptional return code such as [break] or [error].

What's needed is for the commands to advertise their behavior to the interpreter. Actually, Tcl already has this! It's called bytecode. Surely a complete Brush implementation would have similar.

If x doesn't exist (has no value), is shared, or has a write trace, this optimization does not apply. Just let normal execution run its course.

By the way, write traces imply sharing because the trace command needs to access both the old and new value of a variable. Read traces are okay.

Next step is for [list replace] (or whatever command) to look before it leaps. Before making any modifications whatsoever to the variable, it must check for every possible error condition. In this case, the only errors that can happen are for the value to not be a list or to be too short.

Once it's certain the command will complete successfully, and that x's value is unshared and has no write trace, the command is free to operate directly on x, achieving optimal performance. In this case, the only thing [set] does is inform the interpreter of the desired semantics, being that x is intended to be overwritten as soon as the command returns.

Actually, there's another way to look at it for this particular example. The bytecode compiler, [list replace], and [set] can cooperate to transform to:

set &x{2:3} (foo bar quux)

And now there's no question about sharing or copies.

Switching gears a bit, just going to jot this down so I don't forget. [take] (or an automatic version thereof) is actually useful in cases where the manipulation isn't done via a command. For example, appending a newline:

set &x $x\n

Ideally this would be just as fast as the [append] version. But in current Tcl, it involves loading the value of $x onto the stack, appending to the value on the top of the stack (thereby creating a new, unshared copy), then popping that value from the stack and into the variable (thereby freeing the original, if unshared). That's a wasted copy. Calling [append] skips the stack and therefore enables in-place modification.
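The wasted copy is easy to observe in current Tcl with a rough timing comparison (numbers vary; the point is only that the first form re-copies the whole string every pass while [append] extends it in place):

set x [string repeat a 1000000]
puts [time {set x $x\n} 100]      ;# copies the million-character string each pass
set y [string repeat a 1000000]
puts [time {append y \n} 100]     ;# extends the string in place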

A simpler perspective

AMG: Seeing the above two examples:

set &x{2:3} (foo bar quux)
set &x $x\n

makes me think the [take] functionality would be better obtained another way. Pass variable arguments in the traditional manner; instead focus on optimizing stuff that's bytecoded inline. The first example above shows that Brush's indexing capabilities make it rare to need a command for doing the kinds of operations that could have benefited from [take]. The second shows a case where, in current Tcl, no command other than [set] is needed at all, yet the operation could stand to be optimized.

In both cases, the bytecode compiler ought to implement the [set] command (a.k.a. [=] as proposed elsewhere on this page) with inline bytecodes, not as a command invocation. The bytecode compiler is also responsible for whatever manipulations may be necessary to index into the variable and modify its value. So if it takes the long view rather than looking through a microscope all the time, it may realize that the duplication can be skipped in certain circumstances.

The first case (set &x{2:3} (foo bar quux)) is easiest since $x is never substituted, so there is no temptation to duplicate. However, it's still extremely interesting because it's equivalent to set &x [list replace $x 2 3 foo bar quux]. Therefore, writing the latter should produce the same optimized result. This establishes the nature of the relationship between [list replace] (or any optimized command) and the bytecode compiler. But it's quite complex, much more than needs to be specified now.

The second case (set &x $x\n) is trickier because a straightforward stack-based implementation will briefly have x's original value both in the (not yet modified) variable and on the stack. Modifying the stack then triggers a wasteful duplication. Here is where the bytecode compiler needs to look ahead and see the potential for an in-place modification.

Creative writing

AMG: Adopt TIP 278 [L9 ]. See also: Dangers of creative writing.

Typographic issues

Wrong phone number

AMG: The "[email protected]" example on page 18 confuses phone numbers 555-1235 and 555-0216.

Wrong error message

AMG: The "wrong # args" error message on page 16 says "e" when it should say "d".

Simply simply

AMG: The word "simply" is used in two sentences in a row on page 8.

[accum_gen] example

AMG: On page 38, change "accums(a) 0" to just "accums(a)" so as to demonstrate the default value feature.

Substitution summary

AMG: On page 23, in the Tcl column, change $simple_name(index) to $name(index). Change Not Easily Available to Not Available In One Line or similar.

Reference before variable

AMG: On page 28, instead of saying, "the reference exists before the variable is made", say it exists before the variable is given a value.

Not enough arguments

AMG: On page 17, instead of saying "not enough arguments", say "not enough list elements for all variables", or something to that effect.

Examples and motivation

AMG: Incorporate examples and motivational discussion from What Languages Fix.

AMG: Here's a real-life case study from some of my CNC work. This shows a side-by-side comparison of Tcl and Brush, one line at a time, performing a few very basic data structure manipulations.

The cuts variable is a list of cuts being made into the material, where each cut is a two-key dict.

  • lines: List of first and last zero-based input file line numbers.
  • vertices: List of alternating X/Y coordinates.

Given the design of the data structure, the Brush version is dramatically simpler than the Tcl version. The data structure really isn't all that complicated; it just has the bad luck of putting a list inside a dict inside a list. Nesting lists in lists is easy, as is nesting dicts in dicts. Mixing gets unreasonably difficult. Thanks to [dict with] and [dict lappend], putting lists in dicts is easier than putting dicts in lists, but this structure has it both ways which really makes things hard.

Initialization

Tcl:    set cuts {}
Brush:  = &cuts ()

Adding a new cut

Tcl:    lappend cuts [dict create\
            lines [list $xyi $i] vertices [list $x $y]]
Brush:  = &cuts{end+1} (lines ($xyi $yi) vertices ($x $y))

Adding a vertex to the last cut

Tcl:    lset cuts end [dict replace [lindex $cuts end] vertices\
            [concat [dict get [lindex $cuts end] vertices] [list $x $y]]]
Brush:  = &cuts{end}(vertices){end:} ($x $y)

Adding a vertex, multi-line approach

Tcl:    set finalCut [lindex $cuts end]
        dict lappend finalCut vertices $x $y
        lset cuts end $finalCut
Brush:  = &finalCut $cuts{end}
        = &finalCut(vertices){end:} ($x $y)
        = &cuts{end} $finalCut

Dealing with data structures like this is really what got me started on Brush in the first place.

AMG: Here's another example, a case where [dict lappend] lets us down.

Tcl code:

dict set zips $zipFile extract [concat [dict get $zips $zipFile extract] [list $asciiFile]]

Brush code:

= &zips($zipFile extract){end+1} $asciiFile

Braces become parentheses

AMG: On page 27, list indexes (braces) are used to construct references, but the string representations of those references show dict indexes (parentheses).

Arguments and parameters

AMG: The paper says "formal arguments" throughout, but this should be "formal parameter" or simply "parameter". The nomenclature I should adopt is that an argument is the actual value passed to a proc, whereas a parameter is the variable holding the argument.