Working with Tcl Strings in C

Tcl likes to think of all strings as the same sort of thing: sequences of Unicode characters. Any character can be used. Binary strings are simply the subset of the space of strings in which every character corresponds to a byte (that is, lies in the range \u0000 to \u00FF); there are no other restrictions. (In Tcl 7 and before, this was very much not the case; strings were plain NUL-terminated C strings, so binary data couldn't be handled at all.)

However, this happy model hides quite a bit of complexity. When working with Tcl's strings in C, you have to consider how you intend to use them: you, the programmer, know what you are doing with them and what you need of your inputs, and that determines which part of the API you should reach for.

Binary Strings

Binary strings are conceptually sequences of bytes.

Binary strings have their own Tcl_ObjType implementation, bytearray.

Consuming a binary string

To read a binary string, use Tcl_GetByteArrayFromObj() to get the byte sequence and its length from the value. The Tcl_Obj value that you pass in will probably come from a command argument or from reading a variable.

Producing a binary string

To produce a binary string value for Tcl to use (e.g., as contents of a variable or the result of a command), use Tcl_NewByteArrayObj(). You need to give the byte array buffer and length of the byte sequence; Tcl will copy this.

Unicode Strings

Unicode strings are conceptually sequences of Unicode characters, i.e., a sequence of values of the type Tcl_UniChar (whose size will vary with Tcl versions and build options).

Unicode strings have their own Tcl_ObjType implementation, unicode.

Consuming a Unicode string

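Tcl_GetUnicodeFromObj() is your primary tool: it hands back the length along with the characters, which matters because NUL (\u0000) characters are definitely possible and legal, so you cannot rely on a terminator to find the end. Also useful will be Tcl_GetUniChar(), Tcl_GetCharLength(), and Tcl_GetRange().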

Producing a Unicode string

The main tool is Tcl_NewUnicodeObj(), but Tcl_AppendUnicodeToObj() and Tcl_AppendObjToObj() may be helpful too.

General Strings

Here, I refer to strings used for things like option names in command arguments, error messages, and stack traces. You will care about very few non-ASCII characters in them, because the only ones you are actually looking for are ASCII. If you're working with strings only in order to interface with Tcl, this is the type of string you probably care about.

In this case, directly accessing the bytes field is OK, but using Tcl_GetString() or Tcl_GetStringFromObj() is preferred. If you are creating them, Tcl_NewStringObj() and Tcl_ObjPrintf() are useful tools. Writing to the bytes field of an object is usually not a good idea, unless you are in the process of creating that object or are specifically writing the updateStringProc or setFromAnyProc implementation of a Tcl_ObjType.

General strings do not have a specific Tcl_ObjType; they're the information that (in Tcl 8.*) is not held in type-specific data.