critbit

AMG: A critbit tree (a.k.a. Patricia tree, Radix tree, trie) is a binary search tree in which traversal proceeds to the left or right subtree depending on the value of certain critical bits in the key. See [L1 ] and [L2 ] for more information.

Critbit trees can only store (or index) prefix-free strings. They don't work when one key is the prefix of another key, for example "Bob" and "Bobby". Critbit trees work great for IP addresses and other structures that are all the same length.

Critbit trees can be adapted for use with arbitrary strings, even strings with prefixes. The trick is to alter the key value to be prefix-free. One approach is to prefix each key with its length, though this requires declaring a maximum length beforehand, or establishing a way to encode variable lengths which is itself prefix-free. Another approach is to prefix each byte with a 1 bit and append a 0 bit to the key. This is known as kart, or key alteration radix tree [L3 ]; its advantage over length prefixing is it maintains the lexicographical sort order property of classic critbit trees.

C strings are inherently prefix-free because they are NUL-terminated and contain no embedded NULs, so critbit trees can store (or index) them directly. Tcl's internal string representation also has this property despite being able to contain embedded NULs since it represents them as the two-byte sequence 0xc0 0x80. For a C/Tcl string to be a prefix of another would imply that the other string has characters after a NUL byte, which can never happen.

Implementations may choose to eliminate redundant (common) prefixes by storing them inside interior nodes, though this comes at the cost of needing to reconstitute the full strings in the process of walking the tree. For large strings this may be a benefit, and this can form the basis of a compression algorithm. Needless to say, if a critbit tree is used to index another data structure without itself storing the strings (only tracking their critical bits, etc.), this implementation option is not available.


Stefan K: I've implemented a critbit C extension here: [L4 ].

AMG: Big thanks for bringing your code back online!

AMG: Your otherBits technique deserves closer examination. I'm pretty sure the idea came from djb but could be mistaken. You store a mask that identifies which bit is the critical bit. Every bit of your mask is set except the critical bit. For example, for "I" (0x49) and "a" (0x61), the critical bit mask is 0xf7. 0x49^0x61 is 0x28, meaning two bits differ. The least-significant of these bits (0x08) is the critical bit, then complement for the mask (0xf7).

Your implementation uses a loop to compute the mask. Have you considered something like the following?

unsigned char
critbit(unsigned char a, unsigned char b)
{
    unsigned char c = a ^ b;    /* Find all bits that differ. */
    c |= c << 1;                /* Propagate to higher-order bits. */
    c |= c << 2;                /* Keep going, have more bits to do. */
    c |= c << 4;                /* Set the critbit and all higher bits. */
    return ~c | c << 1;         /* Set all bits except the critbit. */
}

Transliterated into Tcl, for easier experimentation:

proc critbit {a b} {
    set c [expr {$a ^ $b}]
    set c [expr {$c | $c << 1}]
    set c [expr {$c | $c << 2}]
    set c [expr {$c | $c << 4}]
    expr {(~$c | $c << 1) & 255}
}

Why use a mask like this? Your code for checking for the value of the critical bit minimizes conditionals and branching (so a conditional-free approach like the above could be an asset). Instead, you use addition and check for overflow. Here's your code:

dir = (1 + (i->otherBits | c)) >> 8;

Let's examine from the inside out. The expression i->otherBits|c yields one of two things:

ResultCondition
0xffwhen the 0 bit in i->otherBits lines up with a 1 bit in c
i->otherBitswhen the 0 bit in i->otherBits lines up with a 0 bit in c

i->otherBits is less than 0xff. Adding one to the expression will therefore yield 0x100 or something less. Shifting right by 8 will yield 1 or 0. This final result is equal to the value of the critical bit in c.


AMG: Here's my Tcl implementation of length-prefixed critbit and kart. To be clear, it's in no way intended for production use, rather demonstration of concepts. I should also add NUL-terminated critbit, since I feel that is more practical than kart anyway.

package require Tcl 8.5

set kart 1

if {$kart} {
    # Key alteration <URL:http://code.dogmap.org/kart/>.
    proc Bits {key} {
        set bits {}
        foreach char [split $key ""] {
            lappend bits 1 {*}[split [binary scan $char B* b; set b] ""]
        }
        lappend bits 0
    }
} else {
    # Length prefix.
    proc Bits {key} {
        binary scan [binary format I [string length $key]]$key B* bits
        split $bits ""
    }
}

proc Walk {tree kb} {
    while {[llength $tree] == 3} {
        set tree [lindex $tree [expr {1 + [lindex $kb [lindex $tree 0]]}]]
    }
    lindex $tree 0
}

proc insert {cbvar key} {
    upvar 1 $cbvar critbit
    if {![info exists critbit] || ![llength $critbit]} {
        set critbit [list $key]
    } else {
        set kb [Bits $key]
        set other [Walk $critbit $kb]
        if {$key ne $other} {
            set ob [Bits $other]
            for {set m 0} {[lindex $kb $m] == [lindex $ob $m]} {incr m} {}

            set tree $critbit
            set path {}
            while {[llength $tree] == 3 && [lindex $tree 0] < $m} {
                set child [expr {1 + [lindex $kb [lindex $tree 0]]}]
                set tree [lindex $tree $child]
                lappend path $child
            }

            if {![lindex $kb $m]} {
                lset critbit $path [list $m [list $key] $tree]
            } else {
                lset critbit $path [list $m $tree [list $key]]
            }
        }
    }
}

proc remove {cbvar key} {
    upvar 1 $cbvar critbit
    if {[info exists critbit]} {
        if {[llength $critbit] == 1} {
            if {$key eq [lindex $critbit 0]} {
                set critbit {}
            }
        } elseif {[llength $critbit] == 3} {
            set kb [Bits $key]
            set tree $critbit
            set path {}
            while {[llength $tree] == 3} {
                set child [expr {1 + [lindex $kb [lindex $tree 0]]}]
                set tree [lindex $tree $child]
                lappend path $child
            }
            if {$key eq [lindex $tree 0]} {
                lset critbit [lrange $path 0 end-1]\
                    [lindex $critbit {*}[lrange $path 0 end-1]\
                    [expr {3 - [lindex $path end]}]]
            }
        }
    }
}

proc find {cbvar key} {
    upvar 1 $cbvar critbit
    expr {[info exists critbit] && [llength $critbit]
       && $key eq [Walk $critbit [Bits $key]]}
}

These procedures might be helpful for analyzing the resulting trees.

proc Descend {tree {path {}}} {
    if {[llength $tree] == 3} {
        Descend [lindex $tree 1] [concat $path [list [lindex $tree 0]]]
        Descend [lindex $tree 2] [concat $path [list [lindex $tree 0]]]
    } else {
        puts [format "%-60s %s" $path [join [Bits [lindex $tree 0]] ""]]
    }
}

proc Depth {tree {level 0}} {
    if {[llength $tree] == 3} {
        incr level
        expr {max([Depth [lindex $tree 1] $level],
                  [Depth [lindex $tree 2] $level])}
    } else {
        return $level
    }
}

Example usage, with $kart enabled:

% insert cb "Green Shell"
{Green Shell}
% insert cb Mario
5 {{Green Shell}} Mario
% insert cb Mushroom
5 {{Green Shell}} {13 Mario Mushroom}
% insert cb "Rainbow Road"
4 {5 {{Green Shell}} {13 Mario Mushroom}} {{Rainbow Road}}
% insert cb "Mario Circuit"
4 {5 {{Green Shell}} {13 {45 Mario {{Mario Circuit}}} Mushroom}} {{Rainbow Road}}
% find cb Mario
1
% find cb "Mario Circuit"
1
% find cb Luigi
0
% remove cb Mushroom
4 {5 {{Green Shell}} {45 Mario {{Mario Circuit}}}} {{Rainbow Road}}
% find cb Mushroom
0
% insert cb Mushroom
4 {5 {{Green Shell}} {13 {45 Mario {{Mario Circuit}}} Mushroom}} {{Rainbow Road}}
% Depth $cb
4
% Descend $cb
4 5       1010001111011100101011001011011001011011011101001000001010100111011010001011001011011011001011011000
4 5 13 45 1010011011011000011011100101011010011011011110
4 5 13 45 1010011011011000011011100101011010011011011111001000001010000111011010011011100101011000111011101011011010011011101000
4 5 13    1010011011011101011011100111011010001011100101011011111011011111011011010
4         1010100101011000011011010011011011101011000101011011111011101111001000001010100101011011111011000011011001000

Example usage, with $kart disabled:

% insert cb "Green Shell"
{Green Shell}
% insert cb Mario
28 Mario {{Green Shell}}
% insert cb Mushroom
28 Mario {30 Mushroom {{Green Shell}}}
% insert cb "Rainbow Road"
28 Mario {29 {30 Mushroom {{Green Shell}}} {{Rainbow Road}}}
% insert cb "Mario Circuit"
28 Mario {29 {30 Mushroom {{Green Shell}}} {31 {{Rainbow Road}} {{Mario Circuit}}}}
% find cb Mario
1
% find cb "Mario Circuit"
1
% find cb Luigi
0
% remove cb Mushroom
28 Mario {29 {{Green Shell}} {31 {{Rainbow Road}} {{Mario Circuit}}}}
% find cb Mushroom
0
% insert cb Mushroom
28 Mario {29 {30 Mushroom {{Green Shell}}} {31 {{Rainbow Road}} {{Mario Circuit}}}}
% Depth $cb
3
% Descend $cb
28       000000000000000000000000000001010100110101100001011100100110100101101111
28 29 30 000000000000000000000000000010000100110101110101011100110110100001110010011011110110111101101101
28 29 30 000000000000000000000000000010110100011101110010011001010110010101101110001000000101001101101000011001010110110001101100
28 29 31 00000000000000000000000000001100010100100110000101101001011011100110001001101111011101110010000001010010011011110110000101100100
28 29 31 0000000000000000000000000000110101001101011000010111001001101001011011110010000001000011011010010111001001100011011101010110100101110100