Tcl data types

Purpose: to discuss the various types of variables Tcl supports


Tcl is often said to be a 'strings only' language. However, a string is not always 'just a string', and 'strings only' does not imply that every value is a scalar.

In this section, I hope that contributors will add information about the various kinds of data that Tcl can manage, as well as discuss the kinds of things Tcl variables can represent.

  • Tcl objects
  • scalar strings - ASCII vs Unicode
  • string arrays (hash tables)
  • list - regular and keyed
  • handles (values returned from open, etc)
  • widgets (values returned from Tk commands)
  • Byte Codes (compiled script code)

Tcl Objects

Tcl stores values internally as "objects." Every object has a string representation, but it may also have a second, internal representation. Tcl automatically translates the object into whatever format is needed, but it performs the translation only when it is needed.

For example, if you set a variable to a value like this:

     set a 2

Tcl will create a variable named a and assign it an object whose string value is the single character "2" (not counting a null terminator). If you use the variable a in a context that requires an integer, like:

     incr a

then Tcl will translate the string "2" into an integer. The incr command will change the integer value, so the string value will be invalidated. Tcl will continue to use the variable a as an integer until it is needed as a string. If you print out the value of a, then Tcl will create the appropriate string representation.
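The sequence described above can be tried in a plain tclsh. This is only a sketch: the final step assumes the unsupported introspection command ::tcl::unsupported::representation is present (it ships with Tcl 8.6, is not part of the documented language, and its output format may differ between builds), so it is guarded and no output is shown for it.

```tcl
# Start with a value that only has a string representation.
set a 2

# incr needs an integer, so the value gains an integer representation.
incr a

# Printing needs a string, so the string form is regenerated on demand.
puts $a   ;# prints 3

# Assumption: this debugging aid may be missing in older or custom builds.
if {[llength [info commands ::tcl::unsupported::representation]]} {
    puts [::tcl::unsupported::representation $a]
}
```

At the script level nothing changes between the steps: the variable simply holds the value 3. The internal representation only affects how much work Tcl does.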

This "dual representation" of Tcl Objects makes it possible to implement many efficient data structures. Tcl has object representations for booleans, integers, floating point numbers, lists and byte codes.

Note that the major model of the operation of Tcl objects is that of immutability. An object is never expected to change in value behind Tcl's back. However, Tcl violates strict immutability when it is handling an object with a single reference (a refcount of 1) - since the system knows that nothing else is expecting the object to stay the same, that object can then be updated in-place, leading to a substantial performance improvement. -- DKF
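Seen from a script, this immutability shows up as plain value semantics. The following is a minimal sketch of that visible behaviour (the variable names are just for illustration). Whether a particular update happens in place or on a copy is an internal detail that the script cannot observe here.

```tcl
set a [list 1 2 3]
set b $a          ;# a and b can share one underlying value

lappend b 4       ;# the shared value must not change, so b gets its own copy

puts $a           ;# prints 1 2 3
puts $b           ;# prints 1 2 3 4
```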


Scalar Strings

The standard internal string representation (in 8.1 and later) is that of UTF-8-encoded Unicode text (see Unicode and UTF-8). This has many good features, especially including that it allows existing extensions to work in an internationalised setting without being rewritten to use double-byte characters. But this does not come without a cost.

The main problem with the use of UTF-8 for string operations is that it is not guaranteed to have constant-size characters. The only way of working out the length of a string, even given the size of the memory that holds the string, is to iterate across all the characters in that string. As you can imagine, this incurs a significant penalty for some kinds of operation!
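To see why the byte size does not give the character count, here is a small sketch that compares the two for a string containing one non-ASCII character. It uses only string length and encoding convertto; the sample word is an arbitrary choice.

```tcl
# "naive" spelled with i-diaeresis (U+00EF), which takes two bytes in UTF-8.
set s "na\u00efve"

# Number of characters in the string.
set chars [string length $s]

# Number of bytes once the string is encoded as UTF-8.
set bytes [string length [encoding convertto utf-8 $s]]

puts "$chars characters, $bytes bytes"   ;# prints 5 characters, 6 bytes
```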

The way Tcl resolves this (from which version number?) is to use an additional object type for Unicode text where all the characters are in double-byte form. This has the constant-char-size property and it makes many string operations much cheaper.

The only down-side of this is potential shimmering between Unicode objects and other objects. You get this if you alternate between taking the length of a value as a string, and performing some other object operation (e.g. length as a list.) The real question is how often this is a problem, since most code treats values as being of only one kind at a time, and the rest tends to be doing something that is just plain awkward in any language. :^)
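The kind of code that can provoke this shimmering looks roughly like the sketch below, which alternates between a list operation and a string operation on the same value. Both commands always return the right answer; the only possible cost is the repeated conversion of the internal representation, which this example does not attempt to measure.

```tcl
set v "a b c"

for {set i 0} {$i < 3} {incr i} {
    set n [llength $v]          ;# wants a list representation
    set len [string length $v]  ;# wants a string representation
}

puts "$n elements, $len characters"   ;# prints 3 elements, 5 characters
```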

An alternative solution that was discussed involved storing the Unicode length directly in the Tcl_Obj structure, but that has the disadvantage of breaking every object-aware extension! Not everyone likes glibc-itis (major, large-scale, binary incompatibility without a major version number change.)

DKF


String Arrays

See Arrays / Hash maps


Lists


Handles


Widgets


Byte Codes

The most significant performance enhancement in Tcl 8.0 is the addition of a byte-code compiler. The compiler will automatically translate a script into byte codes the first time it is evaluated. (This is called "just in time" or "on the fly" compilation, if you are looking for the buzz words.) The byte codes are then executed by a virtual machine.

Subsequent executions of the script (or loop) directly execute the byte codes, instead of re-parsing the script. If you change the script (e.g., dynamic script generation), then it will have to be recompiled. Otherwise the byte codes are saved and re-used.
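A small way to watch this from a script is sketched below. The proc name sumTo is made up for the example. The last step assumes that the unsupported command ::tcl::unsupported::disassemble is available (it exists in recent Tcl releases but is not part of the documented language), so it is guarded, and its output is not reproduced here because it varies between versions.

```tcl
# A proc body is compiled to byte codes the first time the proc is called.
proc sumTo {n} {
    set total 0
    for {set i 1} {$i <= $n} {incr i} {
        incr total $i
    }
    return $total
}

puts [sumTo 10]   ;# prints 55; later calls reuse the compiled body

# Assumption: this debugging aid may be missing in some builds.
if {[llength [info commands ::tcl::unsupported::disassemble]]} {
    puts [::tcl::unsupported::disassemble proc sumTo]
}
```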

Unfortunately, you can't do anything to or with the byte codes other than execute them. Scriptics has a compiler in their commercial products, but the open source Tcl distribution doesn't have a way to save or execute stored byte codes.


SYStems: I like to think of data as one of three things:

  1. Anything that can be received as input
  2. Anything that can be produced as an output
  3. Anything that can be modified by an operation (process, function, collaboration, method, etc ...)

At the Tcl script level, data is anything that can be represented by a string, hence "everything is a string". A string can represent:

  • a number (float, integer)
  • text (i.e. string)
  • a list
  • xml
  • handles (names)
  • a Tcl script
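As a rough illustration of the list above, the sketch below treats ordinary strings as a list, as numbers, and as a script. The variable names and sample values are arbitrary.

```tcl
set v "6 7"

# The same string read as a list of two elements.
set count [llength $v]

# Its elements read as numbers.
set product [expr {[lindex $v 0] * [lindex $v 1]}]

# A string that happens to be a Tcl script.
set script {string toupper hello}
set result [eval $script]

puts "$count $product $result"   ;# prints 2 42 HELLO
```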

At the command level, a command can manipulate all sorts of data as long as it can live on string inputs and outputs. Handles or names are usually used to deal with data that cannot be represented as a string and passed around as -anonymous- values (streams, databases, tables, commands). I use the word anonymous because I think that a named value is a whole lot more than, and a whole lot different from, an anonymous value: you cannot trace an anonymous value, and neither do you want to.
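A standard channel gives a small illustration of a handle being nothing more than a name. In this sketch the name stdout is first used directly and then rebuilt from two pieces. Both strings refer to the same channel because the command looks the channel up by name. It assumes the script runs in an interpreter where the stdout channel is open, as in an ordinary tclsh.

```tcl
# The handle for standard output is just the string "stdout".
set ch stdout
puts $ch "written through a handle"

# Any string with the same characters names the same channel.
set name "std"
append name "out"
puts $name "written through a name built at run time"

# The channel itself is not in the string; only its name is.
set same [string equal $ch $name]
```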


Tcl Data Structures