Data structures: from the bit to the Web

Richard Suchenwirth 2002-12-09 - If we think of structures as "things built from other things", the lowly bit does not qualify as data structure - it is the minimal, atomar part of all others, capable only of expressing two values: 0 or 1, true or false.

Bit sequences of length n can hold 2**n values, which makes them more suitable for most purposes (e.g. 8 bits, a byte, value range 0..255 - see Playing with bits). When joined to sequences again, bit sequences need not be of constant length, if they can be told apart: see Huffman coding, Elias coding, or bit vectors. Bytes are also used to cover all sort of "binary" (e.g. image) data, although correctly expressed, everything composed of bits is binary...

Characters used to fit in 5/6/7/8 bits, but in the course of internationalization, one should use Unicode and UTF-8, where a character needs a bit sequence of 16 (or more) bits, resp. if coded as UTF-8, a byte sequence of 1..3. Note that a character is just a view on an unsigned integer (up to 16 bits wide in current Tcl, while Unicode has drifted towards 31 bits since 3.1) and can be losslessly converted to and fro with scan and format.

Character sequences, or less formally: strings, are extremely useful for all kinds of things. In Tcl, everything is a string - among many other things, every script is a string. Under the hood, a Tcl object may well have another representation too - as pure signed integer (32 or 64 bits wide), or as double-precision float, or others. But for simplicity, you may still consider Tcl strings that look like numbers as strings.

Lists, sequences of items, which can in turn be strings, numbers, or other lists, can in Tcl be seen as a view on (well-formed) strings, although in memory they're more efficiently implemented. By virtue of recursively embedding lists again, they can be used for trees (e.g. every XML document is a tree - and a string !), tables (including matrices), stacks and queues and just about any complex data structures you can think of. Also, every Tcl command is a list, the first word being the command name, the rest its arguments.

Hash tables can be imagined as tables of two columns and arbitrarily many rows - the first column contains the unique key, the second the associated value, which is returned when you access by the key. Hash tables can be implemented in various ways.

And the Web, you ask? Well, one can see it as an extremely complex structure, that takes hours to describe, but one can also consider it as a black box with hash table functionality: give a string (the URL), receive a sequence of bytes back (very often text, but it could be binary image data etc... or any variation on "404 - Page not found" ;-)


NEM - And here is the web as a Tcl array (hashtable):

package require http
array set web {}
trace add variable web read fetch
proc fetch {v1 url op} {
    if {[info exists ::web($url)]} {
        # Do nothing
    } else {
        set t [http::geturl $url]
        set ::web($url) [http::data $t]
        http::cleanup $t
    }
}

RS:-) A good example for arrays as cached functions...