Version 63 of BNF for Tcl

Updated 2021-05-22 15:53:42 by FortyBot

For years, newcomers to Tcl have asked for its "BNF grammar" (see "Acronym collection"). comp.lang.tcl then typically hosts a more-or-less unproductive confrontation between Tcl'ers rightly asserting that the questioner doesn't really want a BNF for Tcl, and the "outsider" rightly claiming that, yes, that's exactly what he has in mind. It's time to collect (someone else can organize) the facts on this.


A sensible first response to a request for the grammar is to ask what it is for. Is it to better understand Tcl? Then the rest of this page should point people to documents that describe more clearly how Tcl works.

If, however, the purpose is to write some yacc/lex code for parsing, then perhaps the response should be to point people to code that already accomplishes this. Of course, some people prefer (or are required by outside factors) not to use existing code.


It's traditional to describe many languages (ALGOL-derived ones, broadly), including C, Java, and Perl (doesn't [L1 ] look like the artifact of a serious language?), with their BNF. Languages such as Forth, Lisp, and Tcl, though, have degenerate syntaxes, designed to give just enough power to implement extensibility. All the functionality in these languages derives from the application of library elements, not syntactic expression. It's moreover typical of the latter that their apparent semantics are mutable at run-time; Lisp, Forth, and Tcl programmers can freely redefine if or while. CL thinks it important to note, though, that production code in the year 2001 relatively rarely changes such control structures.
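As a minimal sketch of that mutability, the following wraps the built-in if in a proc that counts its calls before delegating to the original. The names _builtin_if and ifCalls are made up for illustration, and {*} assumes Tcl 8.5 or later.

```tcl
# Keep the built-in reachable under a hypothetical new name.
rename if _builtin_if
set ::ifCalls 0

# Install a replacement that counts its calls, then delegates.
proc if args {
    incr ::ifCalls
    # Evaluate the original command in the caller's scope.
    uplevel 1 [list _builtin_if {*}$args]
}

if {1 < 2} {set result yes} else {set result no}
puts "$result after $::ifCalls call(s)"
```

This is a sketch only: a return, break, or continue inside a branch does not propagate through the wrapper the way it does through the built-in.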


Donal Fellows writes on this question, "The problem with a standard static grammar for Tcl stems from the fact that the language isn't static (technically, I believe it belongs to a different class of languages - the context-sensitive languages - that cannot be properly parsed by anything less than a Turing machine, and BNF-based grammars are just not capable of recognizing such things.) An example might be a Tcl script that calls [package require Tk] and suddenly becomes extended with a whole bunch of new commands with non-trivial syntaxes like:

button .b -text "foo bar" -command {puts boo}

Worse still, you can even imagine Tcl scripts that contain fragments from some other language (e.g. (most?) database extensions let you write SQL, and that's a completely different language.)

Still, you might be able to do something with an interesting subset of the language."

morgan mair fheal replies, "all nontrivial languages have a type 1 or 0 grammar but few language definitions give anything formally beyond a type 2 or 3 grammar

so you can say tcl contains a very simple type 2 and then explain rest of it with a type 0 grammar or verbosity or pointing to parser

the difference with C is how much you can load on type 2 grammar and how much you have to define elsewhere" ...


Consider the following one-liner to see how changeable the language is:

proc - args {uplevel 1 [lindex $args end] [lrange $args 0 end-1]}

after which Tcl commands can also be called in Reverse Polish Notation style, e.g.

 - x 5 set
 - {$x<0} {- negative puts} else {- positive puts} if

Doesn't work in procs, though... (RS)


Many times when Tcl outsiders comment on "Tcl's syntax", what they really have in mind is the syntax of particular Tcl commands. Expert Tcl users make a clear distinction between the syntax of the core, "content-free" language, and the library of implemented commands, which can and frankly do all kinds of idiosyncratic and non-uniform things. The usual implementation of clock, in fact, involves a small Lex-Yacc source.


If the real goal is to parse Tcl source, MUCH the best vehicle is Tcl itself. The tclparser extension is particularly apt for this, though there are several other implementations.
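Even without an extension, the built-in info complete gives a rough idea of how Tcl can be asked to parse Tcl. The sketch below (splitCommands is a hypothetical helper, not part of any package) accumulates lines until the interpreter reports that braces, quotes, and brackets are balanced. It assumes one command per line group and deliberately ignores semicolons.

```tcl
# Split a script into newline-terminated commands, letting the
# interpreter decide when braces, quotes and brackets are closed.
proc splitCommands {script} {
    set commands {}
    set pending ""
    foreach line [split $script \n] {
        append pending $line
        if {[info complete $pending]} {
            set trimmed [string trim $pending]
            if {$trimmed ne ""} {lappend commands $trimmed}
            set pending ""
        } else {
            # Still inside an open brace, quote or bracket: keep reading.
            append pending \n
        }
    }
    if {[string trim $pending] ne ""} {
        error "incomplete command at end of script"
    }
    return $commands
}

set script {
set x 1
proc f {} {
    return 2
}
puts hi
}
foreach cmd [splitCommands $script] {puts "<$cmd>"}
```

Note that no knowledge of proc's syntax is needed: the body is simply a braced word that spans several lines.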


In July, 1993, Terrence Monroe Brannon posted message <[email protected]>, which contained a flex lexer for a Tcl-to-C compiler he was working on at the time. He also posted some yacc code for the same work in <[email protected]>. [CL will googlify these refs when he makes time, so they're "clickable".]


[cultural conflict--Tcl'ers don't argue about grammar the way others do. Action is in library, extensibility definitions, ...]


So what is Tcl's grammar?

  • A script (program source, text) is a sequence of zero or more statements.
  • A statement is either a comment or a command list (why aren't comments just no-op commands? 'Sure seems as though that would simplify things ... -- one reason appears to be to suppress command substitution.)
  • A command list is a list of zero or more words.
  • Words are more-or-less arbitrary strings, possibly including white space, lexified with just a few special characters. The traditional reference for parsing Tcl at this level is the "Tcl.n man page" [L2 ].
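A few lines of Tcl may make the description above more concrete. This is only an illustration of the word and comment rules, with variable names chosen arbitrarily:

```tcl
set a 1; set b 2          ;# a semicolon separates two commands on one line
set c #notAComment        ;# '#' starts a comment only where a command could start
set d "two words"         ;# double quotes group white space into one word
set e {$a [not run]}      ;# braces group too, and suppress all substitution
# set f [error "never evaluated"]   (a comment: the brackets are not substituted)
puts [list $a $b $c $d $e [info exists f]]
```

The commented-out line shows why a comment cannot simply be a no-op command: the arguments of a command would undergo substitution before the command ran.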

[Is the description above accurate? Does it handle comments and newlines at least consistently?]


KBK 2002-04-04: The complete syntax of the core Tcl language fits on a single page. [name redacted] calls it the Endekalogue, and it's in the manual page labelled, 'Tcl' [L3 ].

If you understand it fully, you understand Tcl. That's the beauty of the language. Commands add their own interpretations of arguments, but the basic syntax of the language is always covered by the same eleven rules.


Semi-formal definitions for strings as lists

The way in which strings are parsed as lists is documented in the lindex manual page.

Well-formed list: any string s for which

string equal "{$s}" [list $s]

returns 1. This definition excludes strings without whitespace.

proc test-well-formed s {
        string equal "{$s}" [list $s]
}
test-well-formed "a b c"    ;# => 1
test-well-formed "a { c"    ;# => 0
test-well-formed "a b\nc"   ;# => 1

RS uses this well-formedness test:

proc isList x {expr {![catch {llength $x}]}}
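The two tests do not agree on every input. A small comparison, repeating both procs from above so the snippet stands alone, shows the difference on a string without whitespace:

```tcl
proc test-well-formed s {
    string equal "{$s}" [list $s]
}
proc isList x {expr {![catch {llength $x}]}}

# Compare the two tests on three sample strings.
foreach s [list abc "a b c" "a \{ c"] {
    puts [format "%-8s well-formed=%d isList=%d" \
        $s [test-well-formed $s] [isList $s]]
}
```

For abc the first test returns 0 because list leaves a single plain word unbraced, while isList returns 1 because llength parses it as a one-element list.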

Canonical list: any well-formed list s for which

proc test-canonical s {
   if {[catch {llength $s}]} {return 0}
   string equal $s [list {*}$s]
}

returns 1.

The latter test proc by DKF.
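A few sample calls may help to show what "canonical" means here. The proc is repeated so the snippet is self-contained; {*} assumes Tcl 8.5 or later.

```tcl
proc test-canonical s {
    if {[catch {llength $s}]} {return 0}
    string equal $s [list {*}$s]
}

puts [test-canonical "a b c"]      ;# 1: exactly what [list a b c] produces
puts [test-canonical "a  b c"]     ;# 0: extra space, still a valid list
puts [test-canonical "{a} b"]      ;# 0: valid list, but braces are not needed
puts [test-canonical "a {b c}"]    ;# 1: braces are needed for the second element
```

In other words, a canonical list is one that survives a round trip through list unchanged.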


From one perspective, "Funky Tcl extensibility" expresses one limit to the ability of conventional parsing to encompass Tcl.


jcw 2004-05-13: Tcl has no BNF, it uses more a macro-processing model for its language style. But I was wondering: is there a deep reason for that? Is there anything that prevents having a classical syntax parsing step, while still maintaining the "everything is a string" and "copy on write" mantra? A bit like "expr", but at the statement level, with control structures and all?

CMCc: Isn't it the case that every statement is a list, terminated by ; or \n. The simpler question is then: is BNF sufficiently powerful to represent any possible list?

DKF: The endekalogue can be written using BNF quite easily. It just doesn't tell you a vast amount about the deeper meaning of Tcl.

KJN: One way to avoid the "more-or-less unproductive confrontation" mentioned at the start of the page is to give the questioner the BNF for the endekalogue/dodekalogue.

jcw: Whoops, I think I asked the wrong question, sorry. I mean would it be possible to fit a traditional BNF-type language onto the Tcl core? Like, say: "if (a%10==0) { b = 12; print('b = ' + b, newline=0) }"? IOW, a completely different parsing step. Due to Tcl's dynamism, it could all happen at runtime.

Lars H: Isn't expr the example of doing this? It's small, but it has a very traditional BNF grammar. Some other command could be created which essentially evaluated code in language X rather than mathematical expressions. Some companion of proc could be created which took bodies in language X rather than in Tcl. And so on.
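A very small sketch of that idea: a hypothetical let command (not part of Tcl) that accepts an infix assignment and hands the right-hand side to expr. The "language X" here is trivial, but the pattern of a command that interprets its arguments with its own grammar is the same.

```tcl
# let name = expression...
# Hypothetical command: assigns the value of an infix expression
# to a variable in the caller's scope.
proc let {name eq args} {
    if {$eq ne "="} {
        error "usage: let name = expression"
    }
    upvar 1 $name var
    # Rejoin the words and let expr apply its own grammar to them.
    set var [uplevel 1 [list expr [join $args]]]
}

let y = 3 * 4
let z = $y + 1
puts "$y $z"
```

Variable references such as $y are still substituted by Tcl before let runs; only the operators are left for the embedded grammar.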

In 2006, the L language became an example of this "language X".


Customized version of BYACC which produces Tcl parsers: [L4 ]


A noob here. I've cobbled some yacc together, legal, but CHAR and ANYCHAR are very iffy. No lexer and no semantic actions. It gives me an insight into what's going on and what isn't, in the sense that tcl seems to be all about the semantics.

        %token ANYCHAR CHAR NEWLINE SEMICOLON TAB_OR_SPACE 

        %%

        script 
        : commands 
        ;

        commands 
        : command
        | commands SEMICOLON command
        | commands NEWLINE command
        ;

        command 
        : words
        ;

        words 
        : firstWord
        | firstWord TAB_OR_SPACE trailingWords
        | comment
        ;

        firstWord 
        : word
        ;

        trailingWords 
        : word
        | trailingWords TAB_OR_SPACE word
        ;

        word 
        : simpleWord
        | quotedWord
        | bracketedWord
        | wordForSubstitution
        | wordForExpansion
        ;

        simpleWord 
        : CHAR
        | simpleWord CHAR
        ;

        quotedWord 
        : '"' word '"'
        ;

        bracketedWord 
        : '{' word '}'
        ;

        wordForSubstitution 
        : '[' word ']'
        ;

        wordForExpansion 
        : '{' '*' '}' word
        ;

        comment 
        : '#' charList NEWLINE
        ;

        charList 
        : ANYCHAR
        | charList ANYCHAR
        ;

        %%

Here's my attempt. Syntax is EBNF, except that [], {}, ?, *, and + have been borrowed from regular expressions and , is optional. In addition, - may be used to remove characters from a non-terminal composed only of terminals (set difference).

        script = whitespace* ( command ( terminator script )? | comment ( '\n' script )? )? ;

        command = word ( whitespace+ word )* whitespace* ;
        comment = '#' ( '\\' character | ( character - '\n' ) )* ;

        word = quote | expansion | brace | ( bracket | variable | escape-sequence | ( character - space - '\\' ) )+ ;

        quote     = '"' ( bracket | variable | continuation | escape-sequence | ( character - '"' ) )* '"' ;
        expansion = '{*}' word ;
        brace     = '{' ( brace | continuation | '\\' ( character - '\n' ) | ( character - '}' ) )* '}' ;

        bracket         = '[' script-bracket ']' ;
        script-bracket  = whitespace* ( command-bracket ( terminator script-bracket )? | comment ( '\n' script-bracket )? )? ;
        command-bracket = word-bracket ( whitespace+ word-bracket )* whitespace* ;
        word-bracket    = quote | expansion | brace | ( bracket | variable | escape-sequence | ( character - space - '\\' - ']' ) )+ ;

        variable       = '$' ( name index? | variable-brace ) ;
        name           = ( [0-9A-Z_a-z] | ':' ':'+ )+ ;
        index          = '(' ( bracket | variable | continuation | escape-sequence | ( character - ')' ) )* ')' ;
        variable-brace = '{' ( name-brace index-brace | ( character - '}' )* ) '}' ;
        name-brace     = ( character - '(' - '}' )* ;
        index-brace    = '(' ( character - '}' )* ')' ;

        escape-sequence = '\\' ( octal{1,3} | 'x' hex+ | 'u' hex{1,4} | 'U' hex{1,8} | ( character - '\n' ) ) ;

        whitespace   = space | continuation ;
        continuation = '\\\n' space* ;

        terminator = '\n' | ';' ;
        space      = ' ' | '\f' | '\r' | '\t' | '\v' ;
        octal      = [0-7] ;
        hex        = [0-9A-Fa-f] ;
        character  = [\U0000 - \U10FFFF] ;
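Individual rules of this grammar can be tried out in Tcl itself. As an illustrative sketch, the "name" rule translates fairly directly into a regular expression. The helper varName is hypothetical and covers only the unbraced '$' name form, not indices or braced names.

```tcl
# Return the variable name that follows a leading '$', using
#   name = ( [0-9A-Z_a-z] | ':' ':'+ )+
# or the empty string if the text does not start with such a name.
proc varName {text} {
    if {[regexp {^\$((?:[0-9A-Za-z_]|::+)+)} $text -> name]} {
        return $name
    }
    return ""
}

puts [varName {$foo::bar(1)}]   ;# foo::bar
puts [varName {$a:b}]           ;# a (a single colon ends the name)
puts [varName {$::x rest}]      ;# ::x
```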

pyk 2021-05-19: Added missing parenthesis in "name-brace" and changed "character" to the full Unicode range. Here's an alternative description of braced variables:

    variable-brace       = '{' ( array-brace | name-brace ) '}' ;
    array-brace          = array-name-brace? index-brace? ;
    array-name-brace     = ( character - '(' - '}' )+ ;
    index-brace          = '(' (character - '}')* ')' ;
    name-brace           = ( character - '}' )* ;