Version 4 of vertcl

Updated 2014-08-04 03:34:08 by pooryorick

vrtcl, by PYK, is an extensible Tcl notation for hierarchical data. It is a work in progress, and its current state is embryonic.

Description

vrtcl is a record-centric data format that seamlessly integrates non-record structures, namely lists. It supports type indicators for values, but makes it easy to avoid them when they aren't wanted or needed. The goal of vrtcl is to be more readable and concise than other formats without sacrificing expressivity, and also to be extensible.

vrtcl was conceived in the course of pondering JSON, which bills itself as a "fat-free alternative to XML". Although JSON is lean, Tcl syntax is leaner and cleaner. Other Tcl formats such as huddle and TDL exist, but vrtcl is different in some key characteristics.

The path to vrtcl began with a comparison between a JSON example and one plausible Tcl equivalent:

JSON:

{
    "title": "Example Schema",
    "type": "object",
    "properties": {
        "firstName": {
            "type": "string"
        },
        "lastName": {
            "type": "string"
        },
        "age": {
            "description": "Age in years",
            "type": "integer",
            "minimum": 0
        }
    },
    "required": ["firstName", "lastName"]
}

Tcl:

title {Example Schema}
type object
properties {
    firstName {
        type string
    }
    lastName {
        type string
    }
    age {
        description {Age in years}
        type integer
        minimum 0
    }
}
required {firstName lastName}

Here is another example from json.org, in XML, JSON, and Tcl:

XML:

<!DOCTYPE glossary PUBLIC "-//OASIS//DTD DocBook V3.1//EN">
    <glossary><title>example glossary</title>
        <GlossDiv><title>S</title>
            <GlossList>
                <GlossEntry ID="SGML" SortAs="SGML">
                    <GlossTerm>Standard Generalized Markup Language</GlossTerm>
                    <Acronym>SGML</Acronym>
                    <Abbrev>ISO 8879:1986</Abbrev>
                    <GlossDef>
                        <para>A meta-markup language, used to create markup
languages such as DocBook.</para>
                        <GlossSeeAlso OtherTerm="GML">
                        <GlossSeeAlso OtherTerm="XML">
                    </GlossDef>
                    <GlossSee OtherTerm="markup">
                </GlossEntry>
            </GlossList>
        </GlossDiv>
    </glossary>

JSON:

{
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
}

Tcl:

glossary {
    title {example glossary}
    GlossDiv {
        title S
        GlossList {
            GlossEntry {
                ID SGML
                SortAs SGML
                GlossTerm {Standard Generalized Markup Language}
                Acronym SGML
                Abbrev {ISO 8879:1986}
                GlossDef {
                    para {
                        A meta-markup language, used to create markup
                        languages such as DocBook.
                    }
                    GlossSeeAlso {GML XML}
                }
                GlossSee markup
            }
        }
    }
}

In the Tcl examples above, the format is that of a script where each command takes exactly one argument and knows the type of that argument. Even though JSON expresses types in its syntax, JSON data is usually also consumed by a script which already knows what type to interpret the data as, since it conforms to a schema that is usually documented as part of a service API. In most cases, the additional type syntax of JSON is not needed.
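
As a rough illustration of that point, the plain Tcl form of the first example can be read with ordinary dict commands, because every field is a name followed by exactly one value. This is only a sketch: the variable name schema is made up here, and treating the data as a dict works only while it contains no comments, semicolons, or repeated field names.

```tcl
# Hypothetical consumer of the Tcl form of the schema example.
set schema {
    title {Example Schema}
    type object
    properties {
        firstName {type string}
        lastName {type string}
        age {
            description {Age in years}
            type integer
            minimum 0
        }
    }
    required {firstName lastName}
}

# The consumer, not the notation, decides how each value is interpreted.
set title    [dict get $schema title]                   ;# used as a string
set minimum  [dict get $schema properties age minimum]  ;# used as a number
set required [dict get $schema required]                ;# used as a list

puts "title: $title"
puts "minimum age plus one: [expr {$minimum + 1}]"
puts "required fields: [llength $required]"
```

Nothing in the data marks minimum as a number or required as a list; the script simply treats them that way.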

The first type information that comes to mind for the Tcl format is the ability to distinguish between a word and a list. One approach in the spirit of Tcl would be to interpret all values as lists, making sure that where a single value is intended, it is formatted as a list containing one item:

{
    title {{Example Schema}}
    type object
    properties {
        firstName {
            type string
        }
        lastName {
            type string
        }
        age {
            description {{Age in years}}
            type integer
            minimum 0
        }
    }
    required {firstName lastName}
}

glossary {
    title {{example glossary}}
    GlossDiv {
        title S
        GlossList {
            GlossEntry {
                ID SGML
                SortAs SGML
                GlossTerm {{Standard Generalized Markup Language}}
                Acronym SGML
                Abbrev {{ISO 8879:1986}}
                GlossDef {
                    para {{
                        A meta-markup language, used to create markup
                        languages such as DocBook.
                    }}
                    GlossSeeAlso {GML XML}
                }
                GlossSee markup
            }
        }
    }
}
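
The effect of the extra braces can be checked with Tcl's own list commands. A minimal sketch, with variable names that are illustrative only:

```tcl
# A single value, written as a list containing one item.
set title {{Example Schema}}
# A genuine list of two items.
set required {firstName lastName}

puts [llength $title]       ;# the title is a list of one item
puts [lindex $title 0]      ;# the item itself
puts [llength $required]    ;# two items

# Without the extra braces, the same text reads as a list of two items.
puts [llength {Example Schema}]
```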

This is already a reasonable format to work with. A more full-featured notation would support type information and other metadata in an extensible way. In JSON, it is possible to differentiate between the string "3.14159" and the number 3.14159, but all other numeric types are inferred by the format of the number. This is a hint that explicit typing is often not needed. In JSON, this numeric type inference stems from the origins of JSON in JavaScript, which only provides one numeric type. Given that JSON data is typically described in the documentation for an API, the explicit double-quote syntax to differentiate numbers and strings is in practice probably always redundant with the documentation, and therefore not really necessary. Where type information is used by the consuming program, it would probably always be trivial (excepting of course political issues such as backwards compatibility and organizational inertia) to add a field to the API to specify the type of the value, rather than relying on the syntax of the serialization format.

[TODO: Find real-world examples of programs that actually rely on this distinction]

The Tcl examples above are valid scripts in which each "command" takes exactly one argument. Where explicit typing is desired, a variation could be employed in which a command given only one argument treats it as its list value, and when it is given two arguments, treats the first argument as metadata, and the second as the value:

glossary {
    title {{example glossary}}
    GlossDiv record {
        title S
        GlossList record {
            GlossEntry record {
                ID SGML
                SortAs SGML
                GlossTerm {{Standard Generalized Markup Language}}
                Acronym SGML
                Abbrev {{ISO 8879:1986}}
                GlossDef record {
                    para {{
                        A meta-markup language, used to create markup
                        languages such as DocBook.
                    }}
                    GlossSeeAlso list {GML XML}
                }
                GlossSee markup
            }
        }
    }
}

Specification

The vrtcl data format, which is a syntactically-valid Tcl script, is specified as follows:

Unless otherwise noted, terminology defined in the dodekalogue has the syntax described there, but sometimes with redefined semantics.

A document is an independent record. A record is a script composed of fields. Each field is a command whose semantics depend on the number of words in the command:

1
The field is unnamed. Its value is the first word of the field.
2
The first word is the name of the field, and the second word is its value.
3
The first word is the name of the field, the second word is the metadata for the value, and the third word is the value, the type of which depends on the metadata.
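
These rules can be sketched as a small Tcl procedure. It is an illustration rather than part of vrtcl: the name fieldParts and the dict keys it returns are invented here, and it counts the name position as the first word, as the field descriptions above do.

```tcl
# Hypothetical helper: given the words of one field, return its parts.
proc fieldParts {words} {
    switch -- [llength $words] {
        1 {
            lassign $words value
            return [dict create name {} meta {} value $value]
        }
        2 {
            lassign $words name value
            return [dict create name $name meta {} value $value]
        }
        3 {
            lassign $words name meta value
            return [dict create name $name meta $meta value $value]
        }
        default {
            error "a field has one to three words, got [llength $words]"
        }
    }
}

puts [fieldParts {{eggs milk}}]             ;# unnamed field
puts [fieldParts {amount 8.32}]             ;# name and value
puts [fieldParts {items list {eggs milk}}]  ;# name, metadata, value
```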

The type of the metadata is record. If the first field of the metadata contains only one word, the name of that field is type and its value is that word. If the second field contains only one word, the name of that field is itype and its value is that word.
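
The shorthand for metadata can be sketched in Tcl as follows. Both procedure names are hypothetical, and the field splitter is a simplification: it splits on newlines and semicolons outside braces and quotes, drops comment lines, and does not handle backslash escapes or substitutions. The default type of list is the one given for the type metadata field.

```tcl
# Hypothetical helper: split a record into fields (simplified, see above).
proc splitFields {script} {
    set fields {}
    set current {}
    foreach char [split $script {}] {
        set isComment [string match "#*" [string trimleft $current]]
        set atSeparator [expr {$char eq "\n" || ($char eq ";" && !$isComment)}]
        if {$atSeparator && [info complete $current]} {
            set field [string trim $current]
            if {$field ne {} && !$isComment} {
                lappend fields $field
            }
            set current {}
        } else {
            append current $char
        }
    }
    set field [string trim $current]
    if {$field ne {} && ![string match "#*" $field]} {
        lappend fields $field
    }
    return $fields
}

# Hypothetical helper: expand the one-word shorthand into named fields.
proc normalizeMeta {meta} {
    set result [dict create type list]
    set position 0
    foreach field [splitFields $meta] {
        if {[llength $field] == 1} {
            switch -- $position {
                0 {dict set result type [lindex $field 0]}
                1 {dict set result itype [lindex $field 0]}
                default {error "only the first two fields may be a single word"}
            }
        } else {
            dict set result [lindex $field 0] [lindex $field end]
        }
        incr position
    }
    return $result
}

puts [normalizeMeta {list; field}]
puts [normalizeMeta record]
puts [normalizeMeta {type record; itype array}]
```

Under these assumptions, {list; field} and {type list; itype field} describe the same metadata.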

The built-in metadata fields are:

type
the type of the value. Its default value is list.
itype
the default type of items in containers like list.

The built-in types are:

record or rec or r
a record. Its default itype is field.
field or fld or f
a command composed of at most three elements, as described in this specification.
word or wrd or w
a word.
array or arr or a
A record in which the fields are not named and a field is composed of at most two words. In the two-word case, the first word is the metadata, and the second is the value.
list or lst or l
A Tcl list. The default itype is list.
null or nul or z
A null type has no value.
number or num or n
a number in any format understood by expr, except octal.
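
One way to approximate the number type in Tcl is sketched below. The procedure name isNumber is hypothetical, and string is double -strict is used here as a stand-in for "understood by expr", which is an assumption rather than something the specification states. The octal check rejects both the 0o prefix and the leading-zero form that Tcl 8 reads as octal.

```tcl
# Hypothetical check for the vrtcl number type.
proc isNumber {value} {
    # Accept only what Tcl itself parses as a number.
    if {![string is double -strict $value]} {
        return 0
    }
    # Reject octal, written either as 0o17 or with a leading zero (017).
    if {[regexp {^[-+]?0[oO]?[0-7]+$} $value]} {
        return 0
    }
    return 1
}

foreach candidate {3.14159 42 1e3 017 abc} {
    puts "$candidate -> [isNumber $candidate]"
}
```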

null is written like this:

field1 null {this value is discarded}

or, in an array:

null {this value is discarded}

When the name of the first field of a document is meta, the field is the metadata for the document.

The built-in record metadata fields are:

types
a record in which each field describes the types of values that may be encountered in the document. The name of each field is the name of the type.
elements
a record in which each field describes the fields that may appear in the record.

The built-in fields of a metadata type record are:

default
the default value for type.
contains
a record in which fields identify which elements can occur in the type, along with constraints on their occurrence.

The built-in fields of a metadata elements record are:

type
the type of the value

Examples

#! /bin/env tclsh

#this is a vrtcl document

transaction {
    location 83391
    customer 17611
    date {2014 05 13}
    time 07:23:00
    items list {eggs milk}
    #also valid:
    #items {{eggs milk}}
    amount 8.32
}

transaction {
    location 16912
    customer 17611
    date {2014 07 13}
    time 18:47:17
    items list {donuts {ice cream}}
    #also valid:
    #items {{donuts {ice cream}}}
    amount 14.71 
}

#! /bin/env tclsh

#This is a vrtcl document

META {
    types {
        #change the `itype` of record to `array`
        record {
            defaults {
                itype array
            }
        }

        #change the `itype` for list to field
        list {defaults {itype field}}

    }
}

#the default itype for a record is now "array", so there are no field names

{one two {{three four}}}
#if the itype of list had not been changed to field, it would have been:
#{one two {three four}}
record {
    name Capaneus
    son Sthenelus
}

One advantage of vrtcl over huddle is that there is no need for a keyword like HUDDLE in the data. Values that look like vrtcl data can be given the word or list types, and those values will not be misinterpreted as part of the structure of the vrtcl record. vrtcl records can be composed of other vrtcl records.

In huddle the type of each value is explicit in the notation. vrtcl relies on its rules to obviate the need for explicit type notation in the common cases.

Here are some comparisons with huddle:

#huddle
HUDDLE {D {
    a {s b} c {s d}}}

#vrtcl
{} record {a b; c d}


#huddle
HUDDLE {
    L {{s e} {s f} {s g} {s h}}}

#vrtcl
{e f g h}

#vrtcl, explicit list declaration
{} list {e f g h}

#vrtcl, when in an array
list {e f g h}

#vrtcl, each element having its own type
{} record {
    {} word e
    {} number 3.14159
    {one two {three four}}
    #the same as above
    {} list {one two {three four}}
    #the same as above
    {} {one two {three four}}
    #the same as above
    {} {list; field} {one two {{three four}}}
}

#vrtcl, each element having its own type, `array` syntax

{} array {
    word e
    number 3.14159
    {one two {three four}}
    #the same as above
    list {one two {three four}}
    #the same as above
    {list; field} {one two {{three four}}}
}



#huddle
HUDDLE {D {bb {D {a {s b} c {s d}}} cc {L {{s e} {s f} {s g} {s h}}}}}

#vrtcl
{} record {bb record {a b; c d}; cc list {e f g h}}

#huddle
HUDDLE {L {
    {D {
        bb {
            D {
                a {s b}
                c {s d}}}
        cc {L {
                {s e} {s f} {s g} {s h}}}}}
    {s p}
    {L {{s q} {s r}}}
    {s s}}}

#vrtcl
{} array {
    record {
        bb {
            a b
            c d
        }
        cc list {e f g h}
    }
    p
    {q r}
    s
}

#huddle
HUDDLE {D {a {L {{D {c {s 1}}} {D {d {L {{s 2} {s 2} {s 2}}} e {s 3}}}}} b {L {{D {f {s 4} g {s 5}}}}}}}

#vrtcl
{} record {
    a array {
        record {
            c 1
        }
        record {
            d {2 2 2}
            e 3
        }
    }
    b array {
        record {
            f 4
            g 5
        }
    }
}

#vrtcl, using lists instead of arrays
{} record {a {list; record} {{c 1} {d {2 2 2}; e 3}}; b {list; record} {{f 4; g 5}}}

#same as above, with shorter type names
#vrtcl, using lists instead of arrays
{} r {a {l; r} {{c 1} {d {2 2 2}; e 3}}; b {l; r} {{f 4; g 5}}}

Questions and Comments

AMG: Is this called vrtcl or vertcl? Both spellings appear on this page. Do these two terms refer to distinct yet related concepts?

PYK 2014-08-03: I was playing with both names, and I guess my mind hadn't settled on one. I'm going to go with "vrtcl", which will entail changing the name of this page. I just changed all spellings on this page to "vrtcl".


AMG: "Each field is a command whose semantics depend on the number of words in the command:" Customary Tcl terminology counts the name of the command as the first word of the command. Here, you probably mean to say arguments instead of words.

PYK 2014-08-03: the vrtcl specification avoids the use of "arguments", using "words" instead, and "first word" refers to the word in the position that normally names a command. This aligns with rule 2, which defines the name of the command as the first word in the command. I note though that rule 2 is ambiguous or even contradictory, saying that the first word is used to locate a command procedure, but then immediately stating that all words of the command are passed to the procedure, which is of course not true if the command name is considered one of the words of the command. Perhaps rule 2 should be amended to state that "all of the remaining words are passed to the command procedure".


AMG: How does this compare to TDL? See also the configuration file format I derived from TDL and implemented in Config file using slave interp.

I'll answer my own question. TDL and my derived format target XML (including attributes), whereas you target JSON (which lacks XML-like attributes). When attributes are not used, my format matches your basic specification, though not your explicit type tagging and extra list encoding variations.

PYK: Yes. TDL riffs on XML, supporting attributes as its metadata feature, but vrtcl provides a full isomorphic metadata mechanism.


AMG: I have no JSON experience, so I would very much like to know whether it's possible for a program to make decisions based on the type of a value. This is essentially the same question as whether or not explicit type tagging has any value. You argue that it does not (EIAS and all that), but then proceed to present a way to incorporate type tagging, with only a TODO note to research whether or not there's any reason to do so. Perhaps someone else could fill us in.

Like you, I argue that inline type tagging is not useful because the consumer already knows what types to expect. In other words, the schema is embedded in the program itself.

That's not to say explicit schemas aren't useful. They are good documentation, they make it possible for a generic program to validate a document, and they may assist compression. But merely embedding types in the document does not constitute a true schema. It cannot protect against misspelled field names, and it fails to document data relationships and unused features. So explicit schemas really ought to be external documents. In short, I see no benefit to embedding type tags in this application.

Well, actually, I do see one benefit, and maybe this is what you had in mind. Since other formats require type tags, and since it's generally impossible to unambiguously detect the "type" of a Tcl word, compatibility with other formats requires type tagging. That type tagging can go inline or in an explicit schema.

As you may guess from what I said about schemas, I prefer the latter approach, though I freely admit that the former is much easier to implement. There is a third approach, which is to write a program that implicitly embeds a schema and whose purpose is conversion. This surely works, but I think of it as a one-off sort of solution, whereas this page is dedicated to describing a general approach, so again I think an explicit schema would be more appropriate.

... I hate to say it, but there is a fourth way to go. PLEASE do not use this. [tcl::unsupported::representation] tells you what C-based type the script most recently wanted any given value to be. Again, don't do this, since it will result in a very brittle and unpredictable program, requiring the end user to do anti-Tcl contortions to get the desired results, and even then with surprises. The only reason I mention this at all (aside from taking the opportunity to warn against it) is because a real-world Tcl extension does it: tcom. And it is sometimes very shocking [L1 ].


AMG: What is the point of applying an extra level of list encoding to each command's argument? It's already a single word on account of being one argument. I don't see how this helps in disambiguating its type, if that's even a goal (see above).

PYK: See my comments below.


AMG: The format you present is script-oriented, meaning that you use newlines as field separators. Do you also allow semicolons? How about substitutions? Comments? Double quotes instead of braces? Backslashes? Blank lines? Is whitespace significant, e.g. indenting? What about non-data commands such as looping and conditionals, such as those I demonstrate in Config file using slave interp?

Except for those showing type tagging, all the examples you present could instead be viewed as dicts, since newlines work as list/dict element separators just as well as any other form of whitespace. What benefit does your approach have over nested dicts?


PYK 2014-08-02: Replying to all AMG's comments in one swoop:

Some of the rules and examples have changed overnight, making some of AMG's comments appear a bit of a non sequitur. Sorry about that.

The Amazon DynamoDB API Reference selects a comparison mode based on the type of AttributeValueList, so it's certainly possible for a program to react to the type of a value, but it would also be trivial to communicate the type as the value of an additional field, perhaps AttributeValueListType, which obviates the need to have type indicators built into the syntax. vrtcl is very much about clean syntax, and some of the complexity of the ruleset arises in order to minimize syntax and maximize conciseness and legibility.

vrtcl supports type indicators in order to support lossless transformations to/from other formats, such as XML and JSON, and also because even though type indicators are often not needed, maybe sometimes they are, and maybe if vrtcl supports them, they will be put to good, judicious, and even ingenious use.

vrtcl aspires to be extensible like XML is, and JSON is not. If the type tagging is removed from the lexical level but supported in the grammar, additional types can more easily be added. The metadata record is also wide open to other novel use.

I fully agree with your analysis of the usefulness of schemas, and the elements metadata field is a hint at the direction vrtcl will take in support of schemas. So why also support type tags in record fields? Because schemas can be a hassle to write, and making them mandatory would be a barrier to usability in many cases. Even if a programmer has an idea of what the data schema is, it can take a while during development for the schema to settle down. In the meantime, the programmer often just wants to throw data around and work with the data in an ad-hoc fashion for a while. vrtcl aims to be a superior alternative at all stages. Besides, if both schemas and inline metadata are available, interesting patterns might emerge. In short, the reasons are what you guessed I had in mind, and then some.

Supporting metadata at each field could end up giving schemas a CSS characteristic, where schemas for subdocuments are built up in a cascading manner. Metadata is a vrtcl document in its own right, giving it a flexibility that XML, with its special syntax for attributes, doesn't have. How far down the metadata rabbit hole will people go? vrtcl leaves that undefined to keep the possibilities open.

I have no intention of using [tcl::unsupported::representation] :)

Regarding the extra level of list encoding, I just corrected one spot where {{three four}} was used when it should have been {three four}, which may have thrown you off. It's not that there's an extra level of list encoding, but just that some of the rules make it necessary to quote things that way in some cases, depending on the desired outcome. Take for example, this field:

example {list; field} {one two {{three four}}}

{{three four}} occurs with the two pairs of braces because the list item is configured as a field. If a field contains only one word, that word is the value of the field, and is interpreted as a list. This rule minimizes the number of characters necessary to convey the structure in some common cases. The idea is that the programmer doesn't want to be bothered with naming the field or providing any metadata. In most cases, a list value will be exactly what's wanted, or at least it isn't that difficult to format a value as a list containing one item.

At points where it feels like there are too many braces, vrtcl probably supports some syntax that will eliminate those extra braces. In the previous example, the default itype of a list is also list, meaning that it can be written like this:

example {one two {three four}}

Regarding semicolons, yes, they have their normal meaning. Some of the examples already illustrate semicolons, and a few examples contain comments. Regarding substitutions, double quotes instead of braces, backslashes, blank lines, whitespace, and indenting: Yes! I haven't ruled any of these things out for vrtcl, and I hope it can continue to support all of them. All normal Tcl processing will be performed on the words of a field prior to its evaluation by vrtcl. Tcl syntax makes those things easy enough to avoid when they are unwanted. But this is where the embryonic status of vrtcl comes into play. Any of these features that turn out to be too distracting or unwieldy might still get cut, though I don't anticipate that. As is the case with standard Tcl, substituted values will not be rescanned, but macros such as those found in Config file using slave interp could be implemented in a more controlled manner through the metadata "hook".

I'm not currently considering adding explicit general scripting capability in the form of looping, conditionals, or other programming features. vrtcl is, after all, a data format. I think the XSLT approach, in which a separate document is created that describes transformations to effect on other documents, makes for a good separation of concerns. The metadata records of vrtcl turn it into the Swiss cheese of data formats, so there's plenty of room to fill in the blanks as occasion arises.

Thank you for the feedback!

Page Authors

PYK