lquery

Richard Suchenwirth 2004-08-07 - I often need to look up data from tables, for instance country codes - would you have known that MH stands for the Marshall Islands? One can implement such tables with arrays:

 array set country {DE Germany FR France MH {Marshall Islands} NL Netherlands ... ...}
 puts $country(MH)

but such an array cannot be passed by value, and searching the other way ("what is the country code for Marshall Islands?") is less convenient:

 proc arr'get {arrName query} {
    upvar 1 $arrName arr
    foreach {key value} [array get arr] {
        if {$value eq $query} {return $key}
    }
 }
 puts [arr'get country {Marshall Islands}]

Note that arr'get is still pretty compact. What happens if a value is not contained in the table? Then the foreach loop eventually ends, and the proc body ends too, returning the last result, which is the constant {} from foreach. Some people might want to add the line

    return {}

after the foreach, but the effect without it is absolutely the same, so I'm for simple.

From Tcl 8.5 one can use a dict for such purposes, but I'm not there yet (all my installations are 8.4, occasionally even an 8.3). What we have since time immemorial, however, is lists, just as the last argument to the array set command above (in fact, it's a string to begin with, which can be interpreted as a list). We can use this value directly, and do two-way lookup with lsearch. In the country code example, lsearch may have three kinds of result:

  • (a) -1, if nothing was found
  • (b) an even number (0,2,4,..) if a country code was found
  • (c) an odd number (1,3,5..) if a country name was found

In case (a), {} is a suitable response to an unsuccessful query. In case (b) the access function should return the element following the index of the query, while for (c) we'd like to see the preceding element. This specification can be implemented in a pretty compact way:

 proc lquery {list query} {
    set pos [lsearch $list $query]
    lindex $list [expr {$pos+1-2*($pos%2)}]
 }

Case (a) is silently covered because for pos=-1, ($pos%2) yields 1 (for odd), which makes the index arithmetic turn out -2, on which lindex raises no error, but returns the empty string, as specified. We can use lquery directly for lists having an {a b a b ...} pattern, where querying for an a gives its b, and vice versa:

 set country {DE Germany FR France MH {Marshall Islands} NL Netherlands ... ...}
 puts [lquery $country France]

Testing:

 %lquery $country France
 FR
 % lquery $country FR#
 % lquery $country FR
 France

As lsearch by default takes a glob pattern, we can do wildcard queries, and receive the first entry matching the query:

 % lquery $country F*
 France
 % lquery $country Ma*
 MH

Or we can raise an error for an ambiguous pattern, to avoid bugs coming from an unexpected result, most simply by giving the -all switch:

 proc lquery {list query} {
    set pos [lsearch -all $list $query]
    if [llength $pos] {lindex $list [expr {$pos+1-2*($pos%2)}]}
 }

pos will now be {} on "not found", so we can call the lindex part only if it is non-empty. If lsearch finds more than one match, it returns a list of the matching positions, which expr can't digest:

 % lquery $country *a*
 can't use non-numeric string as operand of "+"

The error message isn't too intuitive, however, so let's add one line to improve that:

 proc lquery {list query} {
    set pos [lsearch -all $list $query]
    if {[llength $pos]>1} {error "ambiguous query $query"}
    if [llength $pos]     {lindex $list [expr {$pos+1-2*($pos%2)}]}
 }
 % lquery $country *a*
 ambiguous query *a*

The data list can be kept in a global variable, so it is available all the time, but a niftier way is to shrink-wrap it into an interp alias:

 interp alias {} country {} lquery $country
 % country FR
 France
 % country Mar*
 MH

So five lines of code give us a cute little data retrieval tool, which can however deal with quite voluminous data sets. Just that linear lsearch gets slower with longer lists. Let's finally test the timing behavior (on my old 200MHz box at home):

 foreach n {10 100 1000 10000} {
    set data {}
    for {set i 0} {$i<$n} {incr i} {
        lappend data a$i b$i
    }
    set average [expr $n/2] ;# middle of the list
    puts "average $i: [time {lquery $data a$average} 100]"
    set last [lindex $data end]
    puts "last    $i: [time {lquery $data $last} 100]"
    puts "missing $i: [time {lquery $data not'there} 100]"
 }

Funny that searching for an element in the middle, or end, takes longer than for one that isn't there (probably due to string comparison optimization):

 average 10: 303 microseconds per iteration
 last    10: 279 microseconds per iteration
 missing 10: 177 microseconds per iteration
 average 100: 534 microseconds per iteration
 last    100: 510 microseconds per iteration
 missing 100: 354 microseconds per iteration
 average 1000: 2720 microseconds per iteration
 last    1000: 2825 microseconds per iteration
 missing 1000: 2050 microseconds per iteration
 average 10000: 27310 microseconds per iteration
 last    10000: 27242 microseconds per iteration
 missing 10000: 22187 microseconds per iteration

But it is evident that you need not fear tables of 10000 entries, if you just got 27 ms to spare (or less, on today's much faster boxes). If that's too slow, you can still go back to arrays - just be aware that they take more memory: 42 bytes overhead per entry, as opposed to 12 bytes per list element. So even for large look-up tables, lquery has its advantages - and in any case it gave me another fun project :)