Version 25 of null

Updated 2007-01-08 04:28:29

AMG: Everything is a string is nice and all, but for many applications it's important to have a special value that's outside the allowable domain.

If the domain of values is numbers, any non-numeric string (e.g., "") will do, so "" can be used to signify that the user didn't specify a number. C strings can't contain NUL and therefore are free to reserve NUL as a terminator or field separator. Unix filenames reserve / and NUL, so / is available to separate path components and NUL can be used with find -print0, xargs -0, and cpio -0 to separate filenames in a list. (The more common practice of separating filenames with whitespace breaks whenever whitespace is used in filenames.)

But if the allowable domain of values is any string at all, no string can be reserved for a special purpose.

Since Tcl has nothing that is not a string, the only remaining solution is to have a separate, out-of-band way of tracking the special case. Returning to the C example, if a program needs to support having NUL in the middle of a string, it must either encode the string using a possibly fragile quoting scheme, or it can use a separate variable to track its length. As for the Unix filename example, if a filename needs to contain a /, it absolutely must be encoded, for instance as %2F, but then the quote character must also be encoded (%25). This is because Unix filenames have no room for an out-of-band channel. (By the way, KDE uses this encoding scheme to support / in filenames.) In Tcl, a separate variable can be used, such as a variable that's false when the user didn't specify a string.

This can be very cumbersome and isn't always viable (again, when the domain is all strings). Two examples are default arguments and SQL nulls. Foolproof tracking of the former requires the proc to accept args and do its own defaulting; [llength $args] serves as the out-of-band channel. (You can also use the trick from ML.) Tracking the latter may require asking the database to prepend a special character to all non-null string results; basically the first character is the out-of-band communication channel identifying the nullity of the result. A more straightforward option is to SELECT the NOTNULL of the string columns whose values could be null.
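
As a rough illustration of the [llength $args] approach, here is a minimal sketch of a proc that does its own defaulting, so that even an empty string counts as an explicitly supplied value. The proc name setOrGet is made up for this example, and nothing beyond stock Tcl 8.x is assumed:

```tcl
# Hypothetical helper: behaves like [set], but decides "was a value given?"
# from the argument count instead of from a reserved default string.
proc setOrGet {varname args} {
    upvar 1 $varname var
    switch -exact -- [llength $args] {
        0 { return $var }
        1 { return [set var [lindex $args 0]] }
        default {
            return -code error "wrong # args: should be \"setOrGet varname ?value?\""
        }
    }
}

set x 5
setOrGet x        ;# no value passed: just reads x
setOrGet x ""     ;# empty string passed explicitly: x becomes ""
```

The argument count never collides with any value the caller can pass, which is what makes it a usable out-of-band channel.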


jhh proposes a possible solution in TIP 185 [L1 ]. Basically, {null}! is recognized by the parser as a null, which is not a string; it is distinct from all possible strings. "{null}!" is, of course, a seven-character-long string, and it's also a one-element list whose sole element is a null.

I (AMG) have several strong comments regarding the TIP:

  • I prefer to say "null" instead of "null string" because I feel that a null is not a string at all. It's the one thing that isn't a string! I guess we'll need to change our motto. :^)
  • Likewise, I'd rather not tack the null management functionality onto the [string] command.
  • I think I'd prefer a [null] command for generating nulls and testing for nullity. It's best not to use the == and != expr operators for this purpose; null isn't equal to anything, not even null.
  • We can ditch the {null}! syntax in favor of using the [null] command to generate nulls, but then [null] cannot be implemented in pure script. This might be an important concern for safe interps.
  • Automatic compatibility with "null-dumb" commands is a mistake; it's the responsibility of the script to perform this interfacing.
  • When passed a null, the Tcl_GetType() and Tcl_GetTypeFromObj() functions should return TCL_ERROR or NULL (in the case of Tcl_GetString() and Tcl_GetStringFromObj()).
  • Most commands should be "null-dumb". Only make a command handle nulls when it is clear how they should be interpreted.
  • The non-object Tcl commands can probably represent nulls as null pointers ((void*)0 or NULL). If for some reason that can't work, reserve a special address for nulls by creating a global variable.

Feel free to argue. :^)


AMG: Here's a silly and inefficient proc to help me play around with the ideas presented above:

 proc foobar {varname {value {null}!}} {
    upvar 1 $varname var
    if {![null $value]} {
       set var $value
    }
    return $var
 }

This proc should behave the same as [set].

You will notice that I used {null}! even though in my above comments I suggested removing it in favor of always using [null] to obtain nulls. But it turns out that's not feasible in the above code; it would only result in $value defaulting to the string "[null]". To get the desired behavior, I'd have to write [list varname [list value [null]]], which is far from readable. (With Tcl 9.0 Wishlist #67, it becomes (varname (value [null])), which I can live with.)

That's one black mark against my idea...

A more worrying problem is that [foobar] can't be used to set a variable to null! Why? Because the domain of $value includes all strings and null, there is (once again) no possible value outside the domain that can be used to indicate that a special condition occurred and cannot be "forged" by the caller. So what are nulls good for again?

I'm up to two black marks now. It's not looking good.

It seems nulls aren't as useful as originally hoped. (Notice the use of the passive voice.) But are they still good for something? The reason [foobar] doesn't work in the above case is that it is being driven by the script, and the script is capable of producing nulls. If its input instead came from a file or socket, it would be just fine because reading from a channel will never result in a null. Of course, at this point I'm reminded of tainting, which might be a better solution.


wdb: When switching from Lisp to Tcl, the lack of some special value such as NULL was one of the drawbacks I decided I could live with. It is the price I am willing to pay for the simplicity. There is more than one case where something similar is resolved by some trade-off:

  • In the switch statement, the word default impacts the value "default".
  • In proc's arg list, the word args impacts the choice of argument names.
  • In Snit and Itcl, the argument #auto or %AUTO% impacts the choice of instance name.
  • And so on.

Extending the value range of the string type means abandoning the EIAS principle. It is possible, and sometimes even desirable, to extend it. If so, ask yourself whether Tcl is still the right choice for you.

If you ask me: I prefer the state as is. The drawbacks are known, and as mentioned above, I can live with them.

AMG: switch can select on the value "default" if "default" is not the last option given. proc can accept an argument named "args" if it's not the last one in the list (although see Tcl 9.0 Wishlist #77). I'm just pointing out that these "keywords" only have special meaning when in combination with some other out-of-band data, which in these cases is list position. One more example is the use of - to signify an option. To disambiguate, we have -- to partition the argument list into options and non-options (see '--' in Tcl).
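
A small sketch of the positional rule, using only stock Tcl (the proc names classify and pair are invented for the example): "default" and "args" are treated as ordinary strings unless they appear in the last position.

```tcl
# "default" is only special as the LAST pattern of a switch.
proc classify {word} {
    switch -exact -- $word {
        default { return "the literal string default" }
        --      { return "a literal double dash" }
        default { return "anything else" }
    }
}

# "args" is only special as the LAST formal parameter of a proc.
proc pair {args last} {
    return [list $args $last]
}

classify default   ;# matched by the first, literal pattern
classify foo       ;# falls through to the final default
pair 1 2           ;# args is an ordinary parameter here and receives 1
```

The -- after -exact plays the same role for switch's own options: it marks where options end, so a value beginning with - cannot be mistaken for one.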

Yes, it's totally true we can live without nulls. The real problem comes when interfacing with systems that do have nulls. Tcl has no easy and safe way to represent them. Reserving a string will work most of the time, but the Tcl script becomes confused when the reserved string collides with valid data. This may happen by accident or as part of a malicious attack, which means even nonsense strings like "ßÿÑâŖΊ" aren't safe.

All the other stuff I said about nulls is just cute, sugary things we can do with them if they were added.


wdb (again): But if really necessary, it is possible to introduce typed data to Tcl. Just put each value in a two-element list whose first element is the type and whose second is the data, as follows:

 set typed_value1 {allowed {hello world}}
 set typed_value2 {disallowed {bye bye}}

This example shows the use of two data types, allowed and disallowed. A null value is easily constructed by choosing the type disallowed.
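
Accessors for such pairs might look like the following sketch. The names typed_is_null and typed_get are hypothetical, and the choice to raise an error when a null is read is just one possible policy:

```tcl
# Hypothetical accessors for {type data} pairs.
proc typed_is_null {tv} {
    return [expr {[lindex $tv 0] eq "disallowed"}]
}

proc typed_get {tv} {
    if {[typed_is_null $tv]} {
        return -code error "value is null"
    }
    return [lindex $tv 1]
}

set typed_value1 {allowed {hello world}}
set typed_value2 {disallowed {bye bye}}

typed_get $typed_value1       ;# yields the data part, hello world
typed_is_null $typed_value2   ;# true: the data part must be ignored
```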

AMG: This is like jhh's method of prepending a special character to all non-null SQL results, except of course it's cleaner.

NEM: Tagged data is also how functional programming languages like ML and Haskell handle optional data/NULLs:

 # data Maybe a = Just a | Nothing
 proc Just data { return [list Just $data] }
 proc Nothing {} { return [list Nothing] }
 set val1 [Just "some data including Just and Nothing"]
 set val2 [Nothing]

Then you can test for missing data (NULL/Nothing) using a switch:

 switch -exact -- [lindex $val1 0] {
     Just    { puts "data: [lindex $val1 1]" }
     Nothing { puts "no data" }
 }

Alternatively, in many cases you can use the (non-)existence of a variable or dictionary/array element to test for nullity. e.g. in a database-like interface:

 $db query $query row {
     if {![info exists row(name)]} { # name is NULL }
 }

Lars H: In the original discussion of TIP#185, the following methods were proposed for interfacing Tcl with systems that have NULL values:

  1. If the external function returns a value or NULL, then have the corresponding Tcl command return a list of one element for non-NULL values or an empty list for a NULL value. In Tcl 8.5, {*} greatly simplifies using such commands.
  2. If the external function returns a "record" where some of the entries may be NULLs, then have the corresponding Tcl command return a dictionary which only has entries for the fields with non-NULL values.
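
Both methods can be sketched in a few lines, assuming Tcl 8.5 or later for dict and {*}. The command lookupAge and its data are invented stand-ins for a real wrapper around an external function:

```tcl
# Method 1: zero-element list for NULL, one-element list otherwise.
proc lookupAge {name} {
    set ages {alice 30 bob {}}       ;# bob's entry is an empty string, not NULL
    if {[dict exists $ages $name]} {
        return [list [dict get $ages $name]]
    }
    return [list]                    ;# NULL: nothing is known about this name
}

set known {}
foreach name {alice bob carol} {
    # {*} splices in zero or one elements, so NULL results simply vanish.
    lappend known {*}[lookupAge $name]
}

# Method 2: a record as a dict that omits the NULL fields.
set row {id 8 email {}}              ;# name is NULL, email is an empty string
set hasName  [dict exists $row name]
set hasEmail [dict exists $row email]
```

In both encodings an empty string and a missing value stay distinguishable, because the distinction is carried by list length or key presence rather than by a reserved string.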

Type-tagging values using lists as shown above may also be necessary when interacting with other systems, as some indeed take different actions for data of different types (even if the values are the same). tcom apparently has some troubles in this area, as it does not provide for specifying the type of data to pass on. In TclAE, the types are instead explicitly specified.

What NULL proponents should take note of is that Tcl values, as a consequence of the dodekalogue, constitute a monoid [L2 ] with the empty string as identity element and string concatenation (cconcat, for those who require a command name) as operation. The Everything is a string principle says that the monoid of Tcl values is in fact a free monoid (currently the free monoid of words in the alphabet of all BMP Unicode code-points), and I think it is an extremely good principle, but the dodekalogue does not explicitly proclaim it. Hence one could imagine a Tcl where, in addition to the strings, there exists a NULL value, but then it would have to be sorted out how this NULL should act under concatenation. What is passed on to A in the following commands?

  A [null][null]
  A [null]somestring
  A somestring[null]

Another problem with introducing special values like NULL is that there's no reason to believe that one special value is always going to be sufficient: once in widespread use, someone will come up with a situation where NULLs should be handled as an ordinary value, but at the same time needs a SUPERNULL that isn't! On the whole, it is much simpler to avoid introducing any special values.

AMG: Existence checking can work. Maybe sqlite eval's two- and three-argument forms can unset the variable or array element to signify that its value is null. The script already knows all the variable names, so it shouldn't need to be explicitly told what's null. But on the other hand, maybe the array, dictionary, or whatever can be accompanied by a list of all fields whose values turned up null.

Encoding the data as a list seems clever. If [llength] is zero, the data is null. If [llength] is one, the data is stored in [lindex 0]. Use {*} to get at it most easily. I imagine it's possible to recursively apply this encoding to dictionaries and nested lists.

Regarding the combination of nulls and non-nulls, jhh's TIP suggests that concatenating a null with a string results in a null. "Nulls propagate. A null combined with any nonnull is null. Appending a null to a string, or substituting a null into a string nulls the entire string." By this rule, A will receive null in all three cases.
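
The propagation rule can be imitated in today's Tcl with the list encoding (empty list for null, one-element list for a value). This is only a sketch of the rule as quoted, not anything taken from the TIP's implementation, and the name nconcat is made up:

```tcl
# Concatenate "maybe" values; any null operand makes the whole result null.
proc nconcat {args} {
    set out ""
    foreach item $args {
        if {[llength $item] == 0} {
            return [list]            ;# a null operand nulls the result
        }
        append out [lindex $item 0]
    }
    return [list $out]
}

set null [list]
nconcat $null $null                  ;# null
nconcat $null [list somestring]      ;# null
nconcat [list somestring] $null      ;# null
nconcat [list some] [list string]    ;# one-element list holding somestring
```

Under this rule null acts as an absorbing element of concatenation, while the wrapped empty string remains the identity.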

My [foobar] example shows a case where SUPERNULL would at first glance appear to help, but of course it's a ridiculous thing to ask for, especially since it would still not allow setting a variable to SUPERNULL. What's asked for is a value outside the input domain, but no such value can exist because a variable can be set to anything (string or otherwise). Therefore the only solution is the out-of-band channel, as in [llength $args] indicating how many arguments were passed. With some sugar it might be possible to add a command to check if an argument was explicitly passed or if it was left at its default; this seems like a halfway point because [llength $args] is being used internally but to the programmer it's no different than checking for null. I don't propose such a thing; I'm just giving examples.

Lars, as you say, null would only be useful for this purpose so long as the command is intended only to interface with stuff that cannot generate its own nulls. But this is a funky reason to advocate null; the original impetus was the desire to interface with stuff that does generate nulls.


jhh: Let me try to clarify what I was proposing in TIP 185 [L3 ]. (I see more comments have come in since I looked 2007-01-06 14:42GMT, so this is not up to date on the discussion.) Forgive me if I screw up this funky wiki markup language.

I agree, everything is, and should be, a string, but I regret that Tcl has no way of representing a null string, unlike most other languages, even Java and lowly Visual Basic. There are really two proposals here, and I probably confused people by lumping them together in one TIP. The reason I did was that the two together are synergistic and compelling (at least to me). They are:

  1. Extend the meaning of a string to include a null string. I will call this TIP 185a.
  2. Extend the meaning of lists and dicts to allow the representation of unknown elements. I will call this TIP 185b.

TIP 185a might be helpful on its own. For example, SQLite's loop construct (e.g., "db eval {select * from accts} {} { ...}") can return null information without using a contrived query statement and similarly contrived decoder code, thus opening the door to the creation of general purpose packages to integrate databases.

Either proposal could exist alone, but together they allow Tcl, the preeminent system glue language, to transparently manipulate system communications without kludging them as they enter and leave. A lot of my coding time is spent on this silly matter; if the system involved two or more database engines, as is common, the time is thus multiplied. And every programmer is doing the same thing, over and over. Aside from that, those of us who think Tcl is the best medium for exploring ideas and algorithms would be grateful at having this gap filled, and those unfamiliar with nulls would soon find nulls quite useful in their own right.

TIP 185 was not well received, perhaps partly due to my poor presentation of the idea -- I would do it differently now, having seen the response. My recent thinking is to wait for Tcl 9, or perhaps submit TIP 185a separately. I think the most difficult issues are implementation and performance, and the handling of legacy, null-dumb commands: should they see an empty string and proceed, or should they fail? Currently I am leaning toward the latter, more conservative direction.

Most voices against the idea have argued from misunderstanding, or have been vague, so I am not yet convinced the idea has no merit. Little else has been presented that would really help the matter, aside from endless accounts of the workarounds that we all have to invent in the absence of true null handling. These are invariably presented as reasons for why we don't need the feature. The sheer number and variety seems to argue just the opposite. I think if we can recognize the lack of null handling as a real shortcoming and get people working on the problem, we can come up with a fine solution. Imagine if, back at Tcl 6, people had said "speed isn't an issue, just use C for that, then wrap it in Tcl," and did not pursue the performance issue.

At the recent conference in Naperville, Illinois, there were two papers on relational algebra packages that might be helpful, though neither package currently supports nulls. I approached Andrew Mangogna, author of TclRAL: A Relational Algebra for Tcl [L4 ], about adding the feature, and he explained that he is a follower of C.J.Date, who opposed the use of nulls. I explained my pragmatic argument: perhaps the largest use for Tcl is interfacing with databases, nearly all of which, including SQLite, routinely represent null data. The relational algebra packages presented might provide an interface that could help compensate for Tcl's weakness in this area. He was not interested. Theoretical orthodoxy seemed to be more important than practical programming.

Later I talked to Jean-Claude Wippler (jcw), author of Vlerq + Ratcl = Easy Data Management [L5 ], and a very practical programmer. He found the pragmatic argument more compelling and promised to look into it. I hope he was in earnest. While such a package is not a full answer to the problem, it could provide:

  • A common interface for relational databases, so database APIs could be more uniform.
  • Null handling -- a sort of "standard workaround."
  • A more complete implementation of relational algebra than can be offered by most database engines, optimized for large databases and high query volumes.

A few miscellaneous points:

1. If I revise TIP 185, I will change the nomenclature to "unknown" ({unk}!) instead of "null," because the word is more accurate, and because the English meaning of "null" as "nothing," instead of "unknown," was such a stumbling block for so many people. I originally chose "null" because it is used in SQL. C.J.Date (1995, An Introduction to Database Systems, 6th ed) uses "UNK" for unknown values in relational algebra, but it wasn't adopted in SQL. I have used the "null" nomenclature here, for consistency with the present discussion.

2. A null concatenated with a string is a null. Again, the "null" nomenclature is confusing. One envisions something like "The quick \000 fox", but that thing in the middle might be "brown" or "" or the full contents of the 1957 Encyclopedia Britannica, we just don't know, thus making the whole string unknown, or null.

This example seems to suggest the programmer might have been better off modeling the phrase as a list of words:

   set phrase [list The quick {null}! fox]

or

   set color [null]

   set phrase [list The quick $color fox]

Note that if he then joins the list, it will collapse into a null, like a black hole.

3. A null command can generate a null (TIP 185a), but can not replace the {null}! syntax. Its purpose is to implement TIP 185b. Sticking null in the middle of a list serialization string turns it into a null (by point 2, above), instead of a serialization of a list with a null element.

4. Three-valued logic is widely used and well standardized, particularly in relational databases. Yes, a null is not equal or unequal to anything, including null: the result is never true or false, it is null. However, null && false is false, etc. There is a whole set of tautologies; I think I included most of them in the TIP write-up. These will propagate sensibly through expressions, just as IEEE floating-point NaNs do.
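
For readers unfamiliar with three-valued logic, the usual SQL-style truth tables can be written out in plain Tcl. This sketch is an illustration only: the proc names are invented, the strings true, false and unknown stand in for the three truth values, and eq3 assumes operands in the list encoding (empty list for null).

```tcl
proc not3 {a} {
    switch -exact -- $a {
        true    { return false }
        false   { return true }
        default { return unknown }
    }
}

proc and3 {a b} {
    if {$a eq "false" || $b eq "false"} { return false }
    if {$a eq "unknown" || $b eq "unknown"} { return unknown }
    return true
}

proc or3 {a b} {
    if {$a eq "true" || $b eq "true"} { return true }
    if {$a eq "unknown" || $b eq "unknown"} { return unknown }
    return false
}

# Equality of two "maybe" values: comparing with null is never true or false.
proc eq3 {x y} {
    if {[llength $x] == 0 || [llength $y] == 0} { return unknown }
    return [expr {[lindex $x 0] eq [lindex $y 0] ? "true" : "false"}]
}

and3 unknown false     ;# false: one false operand settles the result
and3 unknown true      ;# unknown
eq3 [list] [list]      ;# unknown: null is not equal to null
```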

I mentioned above that C.J.Date argues against the use of nulls. This is not because of problems with three valued logic, but because of unavoidable paradoxical result sets in formal relational algebra. I reiterate my pragmatic argument to Andrew Mangogna, above.

5. One of the most common misunderstandings in previous discussions is that nulls can not be serialized. This is an example of the synergism of TIP 185a and b: 185b enables this -- merely encapsulate a transmission in a list. Now you can send any data structure representable by a string or list, containing nulls in whole or embedded in any part. It can be handled on the other end by general purpose code, without any need for a special protocol or understanding of the transmitted data structure. This data format might well be appreciated, like Tk, by a broader audience than just the Tcl community. We would be leading, instead of lagging behind, in the area of data management.

Sorry for the long exposition. Hope some of you look it over and find it useful.

NEM: From the discussion above, it appears that your main point for including a special null non-value into Tcl is that NULLs are widespread in the world of SQL databases (your pragmatic argument). Can you describe how the alternatives outlined by others, which do not require such a radical alteration of Tcl's EIAS philosophy, fail to address this issue? In particular, I see at least three separate means of encoding the concept of missing data into a run-of-the-mill string:

  1. Using a list encoding: llength $data == 0 for NULL, otherwise lindex $data 0 is the real data.
  2. Using a tagged list encoding: Nothing vs {Just $data} (a variant of the above).
  3. Using a dict or array encoding where missing data is represented by a missing key, so that dict exists can be used to check for null/missing data.

To me, these are all much nicer options than introducing a special non-string non-value.


slebetman: I'd just like to point out that Tcl is not the only language that doesn't have NULL data. Neither C nor C++ has NULL data. NULL pointers yes, nul byte yes, but not NULL data. This is not just pedantic but fundamental in C. C is like Tcl (or should it be that Tcl is modeled like C?): "everything is a number". C, freestanding in itself, doesn't recognise the nul terminator in strings as anything other than ordinary data. It is the standard library that cares, not C itself (though it is part of the C standard). Even for a NULL pointer which C does fuss about, it is only special when used as a pointer. The NULL pointer if used as data will be treated by C as a regular integer. In C NULL data is just a convention, nothing more.

NULL data is useful for determining the difference between the user entering an empty string and the user not entering anything though. But in this case I don't see why we can't steal the C convention and signify NULL data with \0. I don't know much about SQL, am I missing something? This is all very confusing because it starts out with a misleading C example (a nul in the middle of a "standard" C string is by definition the end of the string, hence is not in the middle) then goes on to talk about a completely different concept of NULL data in SQL. Note that when sending data to an external program, Tcl already supports C nul. There is absolutely no problem in this department so I have no idea what the C compatibility complaint is about in this discussion:

  # Saves a file which C can read as containing 3 strings:
  set f [open test.dat w]
  puts -nonewline $f "this\0is a\0test"
  close $f

  # It's also easy to parse C strings:
  set f [open test.dat r]
  set strlist [split [read $f] \0]
  close $f

Indeed for me, Tcl has one mechanism not found in C/C++ (at least not usable in runtime code): the nonexistent variable! For true NULL values I normally simply don't set the variable at all unless needed, then use info exists to tell the difference between an empty (possibly binary) string and a nonexistent string. You do need to be careful to always unset the variable at the end of the block, lest you accidentally use the previously set value. There are already commands in Tcl that use this convention. regexp is one example: it simply doesn't set the match variables when the expression fails to match.
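
A short sketch of the nonexistent-variable convention, using only stock Tcl. The procs firstNumber and describe, and the option name -label, are hypothetical and exist just for the illustration:

```tcl
# regexp leaves num uncreated when the pattern does not match at all.
proc firstNumber {text} {
    regexp {\d+} $text num
    if {[info exists num]} {
        return "number $num"
    }
    return "no number"
}

# An option that may be absent, or present but empty.
proc describe {args} {
    foreach {opt val} $args {
        if {$opt eq "-label"} { set label $val }
    }
    if {![info exists label]} {
        return "no label given"
    }
    return "label is \"$label\""
}

firstNumber "abc"        ;# no number
describe                 ;# no label given
describe -label ""       ;# label is "": present, but empty
```

Because both procs use fresh local variables, no stale value from an earlier call can leak in; at global scope or inside a loop, an explicit unset -nocomplain before each use serves the same purpose.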


[ Category Language ]