Version 6 of trie

Updated 2008-06-10 02:49:52 by nem

From Wikipedia, the free encyclopedia

In computer science, a trie [L1 ], or prefix tree, is an ordered tree data structure that is used to store an associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree shows what key it is associated with. All the descendants of any one node have a common prefix of the string associated with that node, and the root is associated with the empty string. Values are normally not associated with every node, only with leaves and some inner nodes that happen to correspond to keys of interest.


NaviServer uses a trie for url dispatching:


NEM 2008-06-09: Here's a very simplistic trie implementation based on straight-forward use of nested dicts (typically a trie in C or Java would instead using a fixed-size array and indexing directly based on character (e.g. restricting words to contain only characters a-z)):

#  trie.tcl --
#
#       Simple implementation of tries in Tcl.
#

package require Tcl     8.5
package provide trie    0.3

namespace eval ::trie {
    namespace export {[a-z]*}
    namespace ensemble create

    # create an empty trie
    proc create {} { dict create }
    # add a word to a trie contained in trieVar
    proc add {trieVar word} {
        upvar 1 $trieVar trie
        dict set trie {*}[split $word ""] END {}
    }
    # check if a given word is contained in a trie
    proc contains {trie word} {
        dict exists $trie {*}[split $word ""] END
    }
    # get the sub-trie of all words corresponding to a given prefix
    proc get {trie {prefix ""}} {
        if {$prefix eq ""} { return $trie }
        if {![dict exists $trie {*}[split $prefix ""]]} { return {} }
        dict get $trie {*}[split $prefix ""]
    }
    # iterate through all words in a trie calling a callback for each one. The
    # callback will be called with the string of each word.
    proc words {trie cmd {prefix ""}} {
        set tries [list [get $trie $prefix] $prefix]
        set i 0
        while {[llength $tries] > $i} {
            set trie [lindex $tries $i]
            set prefix [lindex $tries [incr i]]
            # set tries [lassign $tries trie prefix] ;# VERY slow!
            if {[dict exists $trie END]} { 
                uplevel 1 [linsert $cmd end $prefix]
            }
            dict for {k v} $trie {
                lappend tries $v $prefix$k
            }
            incr i
        }
    }
    # remove a word from a trie
    proc remove {trieVar word} {
        upvar 1 $trieVar trie
        if {![contains $trie $word]} { return }
        dict unset trie {*}[split $word ""] END
        # Could/should compact the trie at this point if no other words with
        # this word as a prefix.
    }
    # count the number of words in the trie
    proc size {trie {prefix ""}} {
        set count 0
        words $trie count $prefix
        return $count
    }
    
    # private helpers
    proc count {args} {
        upvar 1 count var
        incr var
    }
}

And a quick test/demo:

proc test {} {
    set t [trie create]
    foreach word {howdy hello who where what when why how} { trie add t $word }
    puts "t = $t"
    puts "words:"
    trie words $t puts
    puts "all wh- words:"
    trie words $t puts "wh"
    trie remove t how
    puts "now:"
    trie words $t {lappend words}
    puts [join $words ", "]
}
# A bigger test -- read all words in a text into the trie
proc read-trie file {
    set t [trie create]
    set in [open $file r]
    while {[gets $in line] >= 0} {
        foreach word [regexp -all -inline {[a-zA-Z]+} $line] {
            trie add t $word
        }
    }
    return $t
}
set t [read-trie ~/Desktop/ulyss12.txt] ;# James Joyce's Ulysses
puts "size = [trie size $t]"
dict for {k v} $trie { puts "$k = [trie size $v]" }
puts "Words beginning with 'the':"
trie words $t puts "the"

Interestingly while testing this I noticed that it was taking a huge amount of time to calculate the number of distinct words in the trie (over a minute for just ~37000 words). Profiling revealed that the following idiom was to blame:

 set xs [lassign $xs x y]

which is used to pop elements off the front of a queue. Lassign seems to be quite pathologically slow in this case...