Version 41 of Cloverfield - Tridekalogue

Updated 2013-04-15 02:43:30 by LarrySmith

This is the set of 13 rules that define the Cloverfield project language. For comparison, see Tcl's Dodekalogue.

References to the language name use the placeholder <Cloverfield>.


Rules

The following rules define the syntax and semantics of the <Cloverfield> language:

[1] Commands.

A <Cloverfield> script is a string containing one or more commands. Semi-colons and newlines are command separators unless quoted as described below. Close brackets are command terminators during command substitution (see below) unless quoted.

FB: Identical to Tcl rule [1].


[2] Evaluation.

A command is evaluated in two steps. First, the <Cloverfield> interpreter breaks the command into words and performs substitutions as described below. These substitutions are performed in the same way for all commands. The first word is used to locate a command procedure to carry out the command; this step is known as command name resolution. Then all of the words of the command are passed to the command procedure. The command procedure is free to interpret each of its words in any way it likes, such as an integer, variable name, list, or <Cloverfield> script. Different commands interpret their words differently.

FB: Identical to Tcl rule [2] with a mention of "command name resolution", replacing auto-expansion of leading word in the previous version. The latter is a name resolution policy, and as such is outside the scope of the Tridekalogue.


[3] Words.

Words of a command are separated by white space (except for newlines, which are command separators).


[4] Double quotes.

If the first character of a word is double-quote (“"”) then the word is terminated by the next double-quote character. If semi-colons, close brackets, or white space characters (including newlines) appear between the quotes then they are treated as ordinary characters and included in the word. Command substitution, variable substitution, and backslash substitution are performed on the characters between the quotes as described below. However variable references are substituted with their current value at the time of substitution. The double-quotes are not retained as part of the word.

FB: Identical to Tcl rules [3] and [4].
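
As a rough illustration of rules [1], [3] and [4] taken together, here is a minimal Python sketch of the splitting step only. The function name split_commands is made up for this page, and the sketch assumes a drastically reduced language: no substitutions, braces, parentheses, brackets or backslashes, only separators and double quotes.

```python
def split_commands(script):
    """Split a script into commands, each a list of words.

    Simplifying assumptions (not part of the rules): only command
    separators, white space and double quotes are recognized.
    """
    commands, words, current = [], [], None  # current is None between words
    in_quotes = False
    for ch in script:
        if in_quotes:
            if ch == '"':
                in_quotes = False          # the closing quote is not retained
            else:
                current += ch              # separators are ordinary characters here
        elif ch == '"' and current is None:
            in_quotes, current = True, ""  # a quote only opens at the start of a word
        elif ch in ";\n":                  # command separators (rule [1])
            if current is not None:
                words.append(current)
            if words:
                commands.append(words)
            words, current = [], None
        elif ch in " \t":                  # word separators (rule [3])
            if current is not None:
                words.append(current)
            current = None
        else:
            current = (current or "") + ch
    if current is not None:
        words.append(current)
    if words:
        commands.append(words)
    return commands
```

Note how the semi-colon and the newline lose their separator role between double quotes, while the quotes themselves are dropped from the resulting word.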


[5] Braces.

If the first character of a word is an open brace (“{”) and rule [11] does not apply, then the following characters are processed recursively as a <Cloverfield> script up to the matching close brace (“}”). No substitutions are performed on the characters between the braces except for backslash-newline substitutions described below, nor do other characters receive any special interpretation. The word will consist of exactly the characters between the outer braces, not including the braces themselves.

FB 20090224: This is a modified version of Tcl rule [6]. Braced expressions must now be properly formatted according to the present rules. Arbitrary data can still be expressed using heredocs (rule [11.3]'s {data} modifier).


Larry Smith I have a hard time getting my mind around this. If we are prepared to break tcl8.* syntax with respect to parentheses, this seems like pretty small potatoes in return. It also produces a total of four subtly different means of encapsulating blocks of code. I really think the current expr-syntax makes certain common aspects of Tcl quite painful - so painful we have done things like adding a brain-damaged version of expr to the index interpretation code to make "end-1" or "$a+1" less painful. It helps, but it's a kludge. "(a+1)" turning into [expr {$a+1}] would be just as readable, eliminates code in the index interpreters, and is far more generally useful since any expression is now supported, not just the index subset.

I think I see the intent behind this interpretation of parens, but it seems confusing to me. What do I get beyond subst {script} for the special syntax? The distinction between "..." and (...) seems slight and arbitrary. Why not simply use this semantic for "..."?

 or

TRAC had a notion worth thinking about - it had two different execution modes, "active" and "neutral" - indicated as #() and ##(), as I recall. Active evaluation pushed the result back into the input stream and thus forced it to be re-evaluated, whereas neutral eval did one pass and pushed the result to the output stream. We might consider [] for active and '[] for neutral. Or we can consider that a later version of TRAC permitted the programmer of the command to decide when to re-evaluate, giving something like "return -active $x", and "return -neutral $x". Something along these lines might be a better, more "tclish" way to go than slapping Lisp-style lists in.

If we really need the extra way to specify a list, why don't we take advantage of our Unicode capability and use a different set of delimiters? « » for instance? Granted some keyboards can't type them easily, but modifying a keyboard map is not hard nowadays. Contrariwise, we could use a digraph. {} is an unevaluated script, {! } an evaluated one (i.e. [ ]) and {. } might be evaluate once. This is attractive because it leaves parens free for expr, and brackets are now free for things like arrays.

[6] Parentheses.

If the first character of a word is an open parenthesis (“(”), then the following characters are processed recursively as a <Cloverfield> script up to the matching close parenthesis (“)”). Substitutions are performed on the words formed by the characters between the parentheses in such a way that their boundaries are preserved. The word will consist of exactly the characters between the outer parentheses, not including the parentheses themselves.

FB: This is a new addition. This quoting rule is intended to make list obsolete and free it for other uses, for example as an ensemble command such as string or dict. Parentheses combine some of the properties of double quotes and braces, as they preserve word boundaries and white spaces while allowing deep substitution. Moreover they make line continuations useless. Think of parentheses as word-preserving, recursive double quotes:

% set a {1 2}
% set b {3 4}
% set s1 "$a
$b"
1 2
3 4
% set s2 [list $a \
$b]
{1 2} {3 4}
% set s3 ($a
$b)
{1 2}
{3 4}
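
The difference between the s1 and s3 cases above can be modelled in a few lines of Python. This is only a sketch under loud assumptions: the helper names double_quotes and parentheses are invented here, variable substitution is reduced to plain $name, and the model discards the white space between words, which rule [6] would actually preserve.

```python
import re

def double_quotes(text, variables):
    """Model of rule [4]: substitute into one flat string; word boundaries are lost."""
    return re.sub(r"\$(\w+)", lambda m: variables[m.group(1)], text)

def parentheses(text, variables):
    """Model of rule [6]: split into words first, then substitute inside each word."""
    return [double_quotes(word, variables) for word in text.split()]

variables = {"a": "1 2", "b": "3 4"}
flat = double_quotes("$a\n$b", variables)   # one string; re-splitting it yields four words
words = parentheses("$a\n$b", variables)    # two words, each holding a complete value
```

In the flat string the original values can no longer be told apart, whereas the parenthesized form keeps one element per source word, which is what list was needed for in s2.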

[7] Command substitution.

If a word contains an open bracket (“[”) then <Cloverfield> performs command substitution. To do this it invokes the <Cloverfield> interpreter recursively to process the characters following the open bracket as a <Cloverfield> script. The script may contain any number of commands and must be terminated by a close bracket (“]”). The result of the script (i.e. the result of its last command) is substituted into the word in place of the brackets and all of the characters between them. There may be any number of command substitutions in a single word. Command substitution is not performed on words enclosed in braces.

FB: Identical to Tcl rule [7].


[8] Variable substitution and reference.

If a word contains a dollar-sign (“$”), not followed by an ampersand (“&”) or an at-sign (“@”), followed by one of the forms described below, then <Cloverfield> performs variable substitution: the dollar-sign and the following characters are replaced in the word by the value of a variable.

If an ampersand immediately follows the dollar-sign, then <Cloverfield> performs variable referencing: the dollar-sign and the following characters are replaced in the word by a reference to the value of a variable. The reference remains valid even if the variable is deleted for some reason (like a local variable when a procedure returns).

If an at-sign immediately follows the dollar-sign, then <Cloverfield> performs weak, or late-bound, variable reference: the dollar-sign and the following characters are replaced in the word by a reference to a variable by its name, which will be dereferenced in its original context each time its value is needed. The reference may get unresolved if the original variable is deleted.

Variables are specified in two parts: the name part and the selector part. The name part of the variable that immediately follows the dollar-sign, or the dollar-sign plus ampersand or at-sign, may take any of the following forms:

name
Name is the name of a variable; the name is a sequence of one or more characters that are a letter, digit, underscore, or namespace separators (two or more colons).
"name"
Name is the name of a variable; the name is a sequence of arbitrary characters which are parsed according to rule [4].
{name}
Name is the name of a variable; the name is a sequence of arbitrary characters which are parsed according to rule [5].
(name)
Name is the name of a variable; the name is a sequence of arbitrary characters which are parsed according to rule [6].
$name
Name is the name of another variable that holds the name of the variable to use. For example “$$var” returns the value of the variable whose name is the value of variable “var”, while “$&$var” returns a reference to the same variable. This form is recursive.
[script]
Script is a <Cloverfield> script which is evaluated according to rule [7]. The result of the script is treated as an anonymous variable.

The selector part immediately follows the name part and may take any of the following forms:

name{selector ?selector...?}
Generic selector semantics use the name of a variable followed by zero or more selectors enclosed between braces, following rule [5]. The part between braces is split into words and substituted like any other script. If no selectors are presented, the sequence of characters is simply replaced by the value of the “name” variable. When presented with a single selector, it is replaced by the element designated by the selector in the variable value. If additional selector arguments are supplied, then each argument is used in turn to select an element from the previous operation, forming a path allowing the script to select elements from subvalues. So “$name{a b c}” is synonymous with “$name{a}{b}{c}”. Selectors can designate single or multiple elements. If the final selector designates multiple elements, it cannot be used as a reference, and a value substitution is performed instead.
name(key ?key...?)
Keyed access semantics use the name of a variable followed by zero or more keys enclosed between parentheses, following rule [6]. If no keys are presented, the sequence of characters is simply replaced by the value of the “name” variable. When presented with a single key, it is replaced by the element designated by “key” in the variable value. If additional “key” arguments are supplied, then each argument is used in turn to select an element from the previous indexing operation, forming a path allowing the script to select elements from subvalues. So “$name(a b c)” is synonymous with “$name(a)(b)(c)” . Keys are a special form of selectors.

In the absence of a recognized selector part, the name designates a scalar variable. There can be several selector parts that designate subparts of the variable recursively. There may be any number of variable substitutions or references in a single word.
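
The path semantics of selectors (each selector applies to the result of the previous one) can be sketched in Python as follows. The function select and the sample value are hypothetical; Python's own indexing stands in for both keyed access and positional access, and selectors that designate multiple elements are not modelled.

```python
def select(value, *selectors):
    """Apply selectors one after another, each to the result of the previous one."""
    for selector in selectors:
        value = value[selector]
    return value

# Hypothetical nested value: a keyed structure holding a vector.
config = {"a": {"b": ["x", "y", "z"]}}

all_at_once = select(config, "a", "b", 2)                    # like $name{a b c}
step_by_step = select(select(select(config, "a"), "b"), 2)   # like $name{a}{b}{c}
unchanged = select(config)                                    # no selector: the value itself
```

Both spellings walk the same path, which is the equivalence stated for “$name{a b c}” and “$name{a}{b}{c}”.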

FB: Here Tcl rule [8] is extended to allow variable name forms that are more consistent with the other rules. For example, in Tcl the following code fails:

% set foo{} bar
% puts $foo{}
can't read "foo": no such variable
% puts ${foo{}}
can't read "foo{": no such variable
% puts $"foo{}"
$"foo{}"

That's because brace matching rules are different in Tcl rules [6] and [8]. Rule [8] stops at the first close brace, while rule [6] keeps them balanced.

Moreover the double substitution $$name is clarified.

Variable name and selector parts are now distinguished. The new selector part allows a Tcl array-like syntax using parentheses to access keyed values such as dicts, as well as a vector access semantics using braces. Internally this will use an interface model. This also means that traces must also work in depth (e.g. tracing a given dict key).

Last, variable references are introduced. The goal is to allow cross-references between objects, as well as mixing mutable and immutable content, and blurring the line between mutable and immutable commands. For example, in Tcl:

% set d [dict create a 1 b 2]
a 1 b 2
% dict replace $d a 3 ; # Immutable, does not modify d.
a 3 b 2
% set d ; # d is unchanged.
a 1 b 2
% dict set d a 3 ; # Mutable, modifies d.
a 3 b 2

Using references, we no longer need two separate sets of commands for mutable and immutable operations (the typical concat/lappend dichotomy):

% set d [dict create a 1 b 2]
a 1 b 2
% dict replace $d a 3 ; # Using variable value, the operation is immutable.
a 3 b 2
% set d ; # d is unchanged.
a 1 b 2
% dict replace $&d a 3 ; # Using variable reference, the operation is mutable. Same as [dict set d a 3]
a 3 b 2
% set d
a 3 b 2

Tcl already performs mutable operations on objects that are not shared (i.e. whose refcount is <= 1), and performs copy-on-write otherwise. Passing a variable reference would suspend COW.
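
To make the value/reference dispatch concrete, here is a small Python sketch of a single dict_replace that behaves immutably on values and mutably on references. Everything in it is an assumption made for illustration: the Ref class, the scope dictionary standing in for a variable table, and the fact that Ref resolves by name (which is closer to the weak “$@” form than to “$&”).

```python
class Ref:
    """Hypothetical stand-in for a variable reference such as $&d."""

    def __init__(self, scope, name):
        self.scope, self.name = scope, name

def dict_replace(target, key, value):
    """Return the updated dict; write it back only when given a reference."""
    if isinstance(target, Ref):
        updated = {**target.scope[target.name], key: value}
        target.scope[target.name] = updated   # mutable: the variable changes
        return updated
    return {**target, key: value}             # immutable: the caller's value is untouched

scope = {"d": {"a": 1, "b": 2}}
by_value = dict_replace(scope["d"], "a", 3)       # like: dict replace $d a 3
after_value = dict(scope["d"])                    # snapshot: d is unchanged
by_ref = dict_replace(Ref(scope, "d"), "a", 3)    # like: dict replace $&d a 3
```

One command thus covers both halves of the dichotomy; which behaviour applies is decided by the caller at the call site.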

FB 20090224: Replaced indices by the more generic selectors.

FB 20090302: Changed the meaning of brackets: they now allow selectors on command results without the need of an intermediary variable holding the result. The previous semantics (getting var name as a command result) can still be achieved by e.g. “$([script])” or “$"[script]"”. The new semantics allows the following code:

% proc foo {} {return (a 1 b 2 c 3)}
% puts $[foo](b)
2

[9] Backslash substitution.

If a backslash (“\”) appears within a word then backslash substitution occurs. In all cases but those described below the backslash is dropped and the following character is treated as an ordinary character and included in the word. This allows characters such as double quotes, close brackets, and dollar signs to be included in words without triggering special processing. The following table lists the backslash sequences that are handled specially, along with the value that replaces each sequence.

\a
Audible alert (bell) (0x7).
\b
Backspace (0x8).
\f
Form feed (0xc).
\n
Newline (0xa).
\r
Carriage-return (0xd).
\t
Tab (0x9).
\v
Vertical tab (0xb).
\<newline>whiteSpace
A single space character replaces the backslash, newline, and all spaces and tabs after the newline.
\\
Backslash (“\”).
\ooo
The digits ooo (one, two, or three of them) give an octal representation of an eight-bit value for the Unicode character that will be inserted. The upper bits of the Unicode character will be 0.
\xhh
The hexadecimal digits hh give a hexadecimal representation of an eight-bit value for the Unicode character that will be inserted. Any number of hexadecimal digits may be present; however, all but the last two are ignored (the result is always a one-byte quantity). The upper bits of the Unicode character will be 0.
\uhhhh
The hexadecimal digits hhhh (one to four of them) give a hexadecimal representation of a sixteen-bit value for the Unicode character that will be inserted.
\Uhhhhhhhh
The hexadecimal digits hhhhhhhh (one to eight of them) give a 32-bit hexadecimal value for the Unicode character that will be inserted.

FB: Identical to Tcl rule [9], with added syntax for 32-bit Unicode.
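
The table above translates fairly directly into a regular-expression based substitution. The following Python sketch is one possible reading of it, not a reference implementation: the names SIMPLE, PATTERN and backslash_subst are invented here, and a \U value beyond the Unicode range would simply raise ValueError.

```python
import re

# Single-character escapes from the table.
SIMPLE = {"a": "\a", "b": "\b", "f": "\f", "n": "\n",
          "r": "\r", "t": "\t", "v": "\v", "\\": "\\"}

PATTERN = re.compile(
    r"\\(?:"
    r"(?P<oct>[0-7]{1,3})"             # \ooo
    r"|x(?P<hex>[0-9A-Fa-f]+)"         # \xhh, any number of digits
    r"|u(?P<u16>[0-9A-Fa-f]{1,4})"     # \uhhhh
    r"|U(?P<u32>[0-9A-Fa-f]{1,8})"     # \Uhhhhhhhh
    r"|(?P<cont>\n[ \t]*)"             # backslash-newline plus following spaces and tabs
    r"|(?P<other>.)"                   # anything else
    r")",
    re.DOTALL,
)

def backslash_subst(text):
    """Perform backslash substitution on text as described by rule [9]."""
    def replace(match):
        if match.group("oct"):
            return chr(int(match.group("oct"), 8) & 0xFF)   # keep eight bits
        if match.group("hex"):
            return chr(int(match.group("hex")[-2:], 16))    # only the last two digits count
        if match.group("u16"):
            return chr(int(match.group("u16"), 16))
        if match.group("u32"):
            return chr(int(match.group("u32"), 16))
        if match.group("cont"):
            return " "
        other = match.group("other")
        return SIMPLE.get(other, other)   # default: drop the backslash, keep the character
    return PATTERN.sub(replace, text)
```

The ordering of the alternatives matters: a backslash followed by x, u or U without any hexadecimal digit falls through to the last branch and yields the letter itself.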


[10] Comments.

If a hash character (“#”), not followed by an open brace (“{”), appears at a point where <Cloverfield> is expecting the first character of the first word of a command, or the first character of a line, then the hash character and the characters that follow it, up through the next newline, are treated as a line comment and ignored. The comment character has no significance when it appears elsewhere.

If a hash-open brace character sequence (“#{”) appears at a point where <Cloverfield> is expecting the first character of a word, then the hash character and the characters that follow it, up through the matching close brace-hash sequence (“}#”), are treated as an inline comment and ignored. Inline comments nest: for each additional open sequence there must be an additional close sequence.

FB: Differs from Tcl rule [10] that only expects comments on the first word.

FB 20090224: modified line comment syntax and added inline comments.
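
The nesting requirement for inline comments amounts to keeping a depth counter. Here is a minimal Python sketch; the function name is made up, and it deliberately ignores the condition that the opening sequence must appear where a word is expected, as well as any quoting.

```python
def strip_inline_comments(text):
    """Remove #{ ... }# comments, honouring nesting (simplified rule [10])."""
    out, depth, i = [], 0, 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair == "#{":
            depth += 1                 # each open sequence needs its own close sequence
            i += 2
        elif pair == "}#" and depth > 0:
            depth -= 1
            i += 2
        else:
            if depth == 0:             # keep characters outside of any comment
                out.append(text[i])
            i += 1
    return "".join(out)
```

A close sequence met at depth zero is left alone, since it does not terminate anything.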


[11] Word modifiers.

If a word starts with a string that obeys rule [5] immediately followed by a non-whitespace character, then the leading part is a word modifier. The interpretation of the rest of the word depends on the form taken by this word modifier. Word modifiers serve varying purposes, such as modifying the behavior of the parser, or the interpretation of the word data. Recognized word modifiers are:

[11.1] Null value.

If a word is preceded by the modifier “{}”, then the word is parsed as any other word, but not substituted. The word is then replaced by a special null value which is distinct from any other value, including the empty string.

FB 20090224: Replaced “{null}”/“{nil}” by “{}”.

FB 20090224: Removed word comments “{#}” replaced by inline comments “#{ ... }#”.

[11.2] Argument expansion.

If a word is preceded by the modifier “{*}”, then the word is parsed and substituted as any other word. After substitution, the word is parsed again without substitutions, and its words are added to the command being substituted. For instance, “cmd a {*}{b c} d {*}{e f}” is equivalent to “cmd a b c d e f”.

FB: Identical to Tcl rule [5].
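
Argument expansion can be pictured as a flattening step applied after substitution. The Python sketch below assumes a hypothetical representation in which each word is an (expand, value) pair, and it re-splits expanded values on white space only, whereas a real implementation would parse them as lists.

```python
def expand_words(words):
    """Flatten a command's words, splitting those marked with {*}.

    words is a list of (expand, value) pairs, where value is the word
    after substitution (hypothetical representation).
    """
    result = []
    for expand, value in words:
        if expand:
            result.extend(value.split())   # each sub-word becomes its own word
        else:
            result.append(value)           # ordinary words stay whole
    return result

# Models: cmd a {*}{b c} d {*}{e f}
command = expand_words([(False, "cmd"), (False, "a"), (True, "b c"),
                        (False, "d"), (True, "e f")])
```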

[11.3] Raw data.

If a word is preceded by the modifier “{data}”, then the following sequence of letter, digit or underscore characters forms a tag.

If the tag is immediately followed by a double-quote or open brace, then the word is terminated by the first occurrence of a double-quote or close brace immediately followed by the tag. The word is then replaced by the text data strictly enclosed between the delimiters.

In the opposite case, the rest of the line is ignored, and the word is terminated by the first occurrence of the tag in the following lines. The word is then replaced by the text data strictly enclosed between the two lines containing the start and end tags. All characters following the start tag up to and including the newline, and preceding the end tag from and including the previous newline, are ignored.

In both cases, no substitution is performed on the characters between the two outer delimiters. For instance:

    cmd {data}ABCDEF this is ignored 
    foo bar baz #{\"[$
    this is also ignored ABCDEF a b c d

    cmd {data}ABCDEF{foo bar baz #{\"[$}ABCDEF a b c d

    cmd {data}ABCDEF"foo bar baz #{\"[$"ABCDEF a b c d

The three forms are equivalent to “cmd "foo bar baz #\{\\\"\[\$" a b c d” (notice the lack of terminating newline). This rule allows for the inclusion of arbitrary text data (for example C code or XML data) without having to perform the escapes needed to accommodate <Cloverfield>'s parsing rules.

FB: {data} is an implementation of the here document concept.

FB 20090224: Added "inline" syntax with double-quote or brace delimiters.
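
For the inline form, extracting the raw text is a matter of reading the tag, noting the opening delimiter, and searching for the matching delimiter followed by the same tag. A minimal Python sketch, with an invented function name and covering only the inline syntax (not the multi-line one):

```python
import re

def parse_inline_data(word):
    """Return the raw text of '{data}TAG{...}TAG' or '{data}TAG"..."TAG'."""
    match = re.match(r'\{data\}(\w+)([{"])', word)
    if match is None:
        raise ValueError("not an inline {data} word")
    tag, opener = match.groups()
    closer = ("}" if opener == "{" else '"') + tag   # delimiter immediately followed by the tag
    start = match.end()
    end = word.find(closer, start)                   # first occurrence terminates the word
    if end < 0:
        raise ValueError("missing closing delimiter")
    return word[start:end]                           # no substitution of any kind
```

Because the terminator is the delimiter plus the tag, a bare double quote or close brace inside the data does not end the word.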

[11.4] Metadata.

If a word is preceded by the modifier “{meta}”, then the word is parsed and substituted as any other word. The metadata associated with the result is substituted into the word.

If the “meta” string in the modifier is itself followed by a word, then this word is substituted first and associated as metadata with the word, which can be later queried as described above. It replaces any existing metadata. For instance:

    {meta foo}bar

gives “bar” with an associated metadata “foo”.

    {meta}{meta foo}bar

gives “foo”, which is the metadata associated with “bar”.

    {meta}{meta baz}{meta foo}bar

gives “baz”, not “foo”.
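
One way to read the three examples is that stacked modifiers apply from the innermost outward: each “{meta x}” attaches (and replaces) metadata, and a bare “{meta}” then yields whatever is attached. The Python sketch below encodes that reading; the Word class and the two helpers are hypothetical names.

```python
class Word:
    """Hypothetical word value carrying optional metadata."""

    def __init__(self, value, meta=None):
        self.value, self.meta = value, meta

def set_meta(meta, word):
    """Model of {meta x}word: attach metadata, replacing any existing one."""
    return Word(word.value, meta)

def get_meta(word):
    """Model of {meta}word: the result is the metadata itself."""
    return Word(word.meta)

bar = Word("bar")
first = set_meta("foo", bar)                              # {meta foo}bar
second = get_meta(set_meta("foo", bar))                   # {meta}{meta foo}bar
third = get_meta(set_meta("baz", set_meta("foo", bar)))   # {meta}{meta baz}{meta foo}bar
```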

[11.5] Delayed substitution.

If a word is preceded by the modifier “{delay}”, then the word is parsed as any other word, but not substituted. Substitution occurs when querying the word value for the first time, and in the context where this substitution occurs. The word whose substitution is delayed can take any form, including command or variable substitution.
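
Delayed substitution resembles a memoized thunk. The following Python sketch assumes that the substitution runs once, on the first query, using the context of that query, and that later queries reuse the result; the Delayed class is a hypothetical model, not something the rules prescribe.

```python
class Delayed:
    """Hypothetical model of a {delay} word: substituted on first query only."""

    _UNSET = object()   # sentinel, so that any value (even None) can be cached

    def __init__(self, substitute):
        self._substitute = substitute   # callable taking the querying context
        self._value = Delayed._UNSET

    def value(self, context):
        if self._value is Delayed._UNSET:
            self._value = self._substitute(context)   # first query: substitute here
        return self._value

word = Delayed(lambda context: context["x"] + 1)
first_query = word.value({"x": 1})     # substitution happens in this context
second_query = word.value({"x": 10})   # already substituted: context is ignored
```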

[11.6] References.

If a word is preceded by the modifier “{&id}”, where “id” is an arbitrary identifier string, then the word is parsed and substituted as any other word. If no reference exists with the given identifier, a new one is created and is associated with the resulting word. If the reference already exists, then the word value is simply ignored. If the “id” string is empty then the reference designates the outermost word in the current context (for example the root of a nested word tree). Reference identifiers are local, i.e. they are resolved within the outermost word boundaries.

FB 20090224: Removed variable and global references. References now only serve as a circular structure syntax.

FB 20100130: Changed syntax from “{ref id}” to “{&id}” for brevity and (I believe) legibility. Ampersand was chosen for consistency with the variable reference syntax (rule [8]).


[12] Order of substitution.

Each character is processed exactly once by the <Cloverfield> interpreter as part of creating the words of a command. For example, if variable substitution occurs then no further substitutions are performed on the value of the variable; the value is inserted into the word verbatim. If command substitution occurs then the nested command is processed entirely by the recursive call to the <Cloverfield> interpreter; no substitutions are performed before making the recursive call and no additional substitutions are performed on the result of the nested script.

Substitutions take place from left to right, and each substitution is evaluated completely before attempting to evaluate the next. Thus, a sequence like

    set y [set x 0][incr x][incr x]

will always set the variable y to the value, 012.

FB: Identical to Tcl rule [11].
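
The “012” example can be replayed in Python by evaluating the three substitutions strictly from left to right and concatenating their results. The helper names are invented and the variable table is a plain dictionary.

```python
def evaluate_word():
    """Replay: set y [set x 0][incr x][incr x]"""
    variables = {}

    def set_var(name, value):
        variables[name] = str(value)
        return variables[name]

    def incr(name):
        variables[name] = str(int(variables[name]) + 1)
        return variables[name]

    # One entry per command substitution, in source order.
    substitutions = [
        lambda: set_var("x", 0),
        lambda: incr("x"),
        lambda: incr("x"),
    ]
    results = []
    for substitution in substitutions:
        results.append(substitution())   # each one completes before the next starts
    variables["y"] = "".join(results)
    return variables
```

Each substitution sees the side effects of the ones to its left, which is why the word ends up as 012 rather than, say, 000.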


[13] Substitution and word boundaries.

Substitutions do not affect the word boundaries of a command, except for argument expansion as specified in rule [11.2]. For example, during variable substitution the entire value of the variable becomes part of a single word, even if the variable's value contains spaces.

FB: Identical to Tcl rule [12] with rule [5] replaced with the matching rule [11.2].


Discussion

<jkock>: I don't know where the word "tridekalogue" comes from, but from the viewpoint of Greek it hurts a bit: 13 in Greek is "dekatria", not "trideka", as far as I know, so perhaps "dekatrialogue" would be better. (I don't know Greek, though, so a better authority should advise.) (An internet search reveals that "trideka" is Esperanto and means "30th", so I guess that was not the intended association...)

FB: It derives from the Tridecagon (AKA Triskaidecagon) which is "a 13-sided polygon" [L1 ], the Dodecagon being 12-sided. I swapped the 'c' for a 'k' to respect the same convention as Tcl's Dodekalogue.

<jkock>: Thanks a lot for the explanation! I am quite surprised. It turns out there is a difference between ancient Greek (triskaideka (and variations)) and modern (dekatria). (So the hurt was solely due to my own ignorance. Sorry for the noise.)


DKF: It seems to me that the parts of rule [11] would probably be better off being separate rules as they're very different in nature from each other.

FB: Good suggestion. However I wanted to limit the number of rules to a small number (here 13). Moreover the many parts of rule [11] are not very different syntactically speaking (word modifiers share the same syntactic rules); they differ in the way they change the parsing, substitution and evaluation rules. Maybe we could use sub-rule numbers, e.g. [11.1] Null value. [Later] Done!


escargo 10 Mar 2008 - There are a couple of issues with respect to the definition and treatment of white space. First, a uniform nomenclature for whitespace characters should be used on this page. So, instead of referring to "white space" or "white space characters" or other terms, "whitespace characters" could be used uniformly.

FB: This is a good point, however you should blame the original Dodekalogue for that ;-) (I kept the same wording in most places).

Second, with the advent of Unicode, the definition of whitespace characters needs to be reconsidered. (See, for example, http://en.wikipedia.org/wiki/Whitespace_%28computer_science%29 for a workable definition.) Personally, I think any character for which [string is space $char] is 1 should be considered to be a whitespace character.

Likewise, the behavior of [string trim string] (three arguments) should be to trim all characters for which [string is space $char] is 1.

These behaviors are not currently true for Tcl.

I beg to disagree. IMHO only the ASCII space chars should be considered white spaces (in the sense of word separators), because other whitespace chars may carry a special meaning. For example, character 160 is a non-breaking space and thus shouldn't be a word separator. However I agree that the behavior of [string is space] is inconsistent with the definition of white spaces in Tcl's Dodekalogue and the present Tridekalogue. So maybe dropping all references to white spaces in favor of a proper definition of word separators in rule [3] would address this issue. Consequently [string is] would need a separator-like class for [ \t\f\r\v].

escargo - You don't have to beg. I like your idea of a 'separator' class of characters. In my case I was exactly trying to strip leading and trailing nonbreaking spaces that pointed out the difference between [string trim] and my expectations of it.

There is also an interesting question with respect to regex handling. For example, what characters are part of the space character class, defined this way:

space A character producing white space in displayed text.

On the other hand, if you wanted to have [string trim] only trim the ASCII characters, it is a lot easier to list those few than it would be to supply the list of all Unicode space characters (26 of them by my count).

Maybe we need [string trim -strict ...].
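
The distinction discussed above, between "anything Unicode calls a space" and a narrow class of word separators, can be illustrated in Python, whose str.isspace follows the broad Unicode notion. The names SEPARATORS, is_separator and split_words are made up for this sketch; the class itself is the [ \t\f\r\v] set proposed above, newline being left out because it is a command separator.

```python
import re

SEPARATORS = frozenset(" \t\f\r\v")   # proposed word separator class

def is_separator(char):
    """True only for the ASCII word separators, not for every Unicode space."""
    return char in SEPARATORS

def split_words(command):
    """Split a command on runs of separators, leaving other spaces inside words."""
    return [word for word in re.split(r"[ \t\f\r\v]+", command) if word]

nbsp = "\u00a0"                            # character 160, non-breaking space
broad = nbsp.isspace()                     # Unicode-aware classification
narrow = is_separator(nbsp)                # separator classification
words = split_words("puts a" + nbsp + "b\tc")
```

Under the narrow class the non-breaking space stays inside its word, while a trim based on the broad class would still be able to remove it.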


Implementation

AMG: I have made a reference interpreter in Tcl. It's missing some features, and it hasn't been exhaustively tested, but it does implement much of what is discussed above. See Cloverfield - Parser.


Twylite: That's 18 rules; hiding 6 of them as 11.1 to 11.6 doesn't negate the fact that you need to know 18 rules to be able to understand code in the language.

FB: Very true, however this is a work in progress, so I chose to gather all the word modifiers into one single rule (likewise, rule [8] defines all the different variable substitution syntaxes). Moreover, modifiers are cumulative.

The use of parentheses to replace list sounds like a great idea! Wish it was here already.

The introduction of NULLs breaks Tcl. This has been discussed at length elsewhere. There is no use-case that can't be handled by an appropriate Tcl-ish as opposed to C-like data representation. Introducing NULLs complicates all code (Tcl and C extensions) and introduces a class of bugs to which Tcl is currently not vulnerable. Refer also to Tony Hoare's "Billion Dollar Mistake" [L2 ]. NULL dereferencing is the single most common development bug in C and Java. Why the heck do we want it?

FB: There has been a great misunderstanding about nulls from the beginning. Cloverfield does NOT introduce NULL pointers or references, in fact there is absolutely no way such a NULL reference can be created, so Cloverfield is totally immune to this whole class of problems. What I have in mind is closer to Smalltalk's nil objects. Moreover Cloverfield nulls are regular values that happen to be outside of the string domain, so in a sense null is to strings what NaN is to reals. I kept nulls in the rules because Colibri implements them correctly as a special value (not reference).

You will need to take care that the introduction of references does not complicate the development of C extensions. A reference should degrade gracefully to a value when treated as one, so that extensions that don't want to deal with references are not forced to.

AMG: I have been working on an alternative proposal that eliminates several of these rules, including null, while keeping parentheses (yay!), variable indexing, and inline comments. It has references which do degrade to values when treated as such. But it's all very much in flux, like Cloverfield itself.

FB: I'm currently working on the reference model and semantics; they should degrade gracefully to values: getting a value from a reference takes a snapshot of the referenced object, which serializes to a string using the {ref id} syntax. I've recently found that Lisp used a similar notation when serializing circular structures (see Cloverfield - References near the end) so I know it can be done. Extension programming shouldn't be much different with Cloverfield from what it is with Tcl.