KeyWord In Context Indexing
If you're looking for the keyword search page, formerly at "mini.net/tcl/kw", it has moved to: https://wiki.tcl-lang.org/2
CmCc: Google does this; now you can too ... here's a start in pure Tcl [L1 ].
My intention was to replace glimpse [L2 ], a really good early KWIC indexer with very low overhead. Glimpse is now proprietary and I don't like the alternatives, so I wrote kwic.
The code has a simple command wrapper, but it's really a Snit class with some interfaces and a filesystem-persistent representation; a rough sketch of its shape follows below. I intend to write a tclhttpd interface to it, so we can provide search functionality.
Dependencies: Snit, Tcl 8.4 or later (uses dicts; dict is core from Tcl 8.5, so on 8.4 the separate dict extension is needed)
I'll write some docco RSN, promise.
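Pending that docco, here is a minimal sketch of the shape such an indexer can take, in plain Tcl with Snit. This is not the kwic source: the type name, the add/search method names, and the \w+ word splitter are illustrative assumptions, and the filesystem-persistent representation is left out (see the packed-format sketch further down).

    package require snit

    # Illustrative sketch only -- not the actual kwic interface.
    snit::type kwicsketch {
        variable files   ;# list: ordinal file number -> filename
        variable index   ;# array: word -> list of file numbers

        constructor {} {
            set files {}
            array set index {}
        }

        # Index one file: record its ordinal number against every
        # unique word it contains.
        method add {filename} {
            set n [llength $files]
            lappend files $filename
            set fd [open $filename r]
            set text [string tolower [read $fd]]
            close $fd
            array set seen {}
            foreach word [regexp -all -inline {\w+} $text] {
                if {![info exists seen($word)]} {
                    set seen($word) 1
                    lappend index($word) $n
                }
            }
        }

        # Hashed (constant-time) lookup: which files mention $word?
        method search {word} {
            set word [string tolower $word]
            if {![info exists index($word)]} {return {}}
            set result {}
            foreach n $index($word) {
                lappend result [lindex $files $n]
            }
            return $result
        }
    }

Usage would look something like:

    kwicsketch idx
    idx add ./README.txt
    idx search glimpse   ;# -> names of files containing "glimpse"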
AK I remember glimpse. I stopped using it when it started to crash consistently when applied to my mailspool. It got too big for it, I guess.
schlenk Nice. I wrote a grossly slow indexing package [L3 ] for Tcl once, to use with tclhttpd. (Someone needed a full-text search system and I had an afternoon to code it...) A good working glimpse replacement (especially one that handles databases larger than 2 GB) would be cool.
AK Some semi-documentation about kwic's file format, etc., taken from the Tcler's chat:
[11:02] Colin I can't even remember writing it, and doubt I could again ... I wonder if I've had a stroke
[11:02] Colin Anyway ... since I copied glimpse's design, all good. Let me refresh on details
[11:04] Colin Ok. Packed indices are just like glimpse's. So, word + file#. Minimal overhead assuming the set of words is small relative to source file sizes. Like, typical English is 10K words.
[11:05] Colin that's on disk.
[11:05] Colin It unpacks the disk files into arrays, again keys are words, contents are lists of file numbers.
[11:06] Colin Each file is given a unique number, see, its ordinal number in the ordered set of all files being indexed.
[11:06] Colin I think I stipulated a maximum of 64K files, hence 16 bits per occurrence.
[11:07] Colin so file overhead is strlen(word)+1 + 2*occurrence_count
[11:07] Colin Search, once the index is read, is constant (hashed)
[11:08] Colin cost to unpack it is mostly linear I guess, traverse the index and turn it into a Tcl array.
[11:08] Colin cost to build the index is proportional to the size indexed, I guess. Uses hash/array to build index.
[11:09] Colin fundamentally, I think/hope: linear in textbase size.
[11:09] Colin the performance of the thing should be linear and proportional to the size of the stuff searched.
[11:10] Colin the size should, for expected text sets (natural language or nearly) be a fraction of the stuff indexed.
[11:10] Colin Just like glimpse.
[11:12] Colin The *very* clever thing about glimpse is that it realises you can load a file and search for a word very quickly, in RAM.
[11:12] Colin And so all they need to do is construct the set of all unique words in a set of files, and record in which file each word occurs.
[11:13] Colin They give each file a unique number, and so the recording of occurrence is just association of a set of file numbers with a word.
[11:13] Colin IIRC, they sort their word->files association, but we don't have to because I load it into an array.
[11:14] Colin So they have a second index, being word->word#, then word#->file#+ then file#->filename
[11:15] aku so the index is word->file, not word->exact location, and exact locations are found later by regular search. smaller index, possibly slower in search, but not too much, as you said.
[11:15] Colin Exactly!
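Colin's [11:07] overhead figure, strlen(word)+1 + 2*occurrence_count, implies a record per word consisting of the word itself, one separator byte, and two bytes per file number (hence the 64K-file cap at [11:06]). The chat doesn't give the exact byte layout, so this pack/unpack sketch is a guess at one consistent encoding: NUL-terminated word, then a 16-bit occurrence count (my addition, two bytes beyond the quoted formula, so the reader knows where each list ends), then the 16-bit big-endian file numbers. The unsigned (u) scan flag needs Tcl 8.5.

    # Hypothetical packing of an index array (word -> list of file
    # numbers) -- not kwic's actual byte layout.
    proc pack_index {indexName chan} {
        upvar 1 $indexName index
        fconfigure $chan -translation binary
        foreach word [array names index] {
            set nums $index($word)
            # word bytes, NUL, 16-bit count, 16-bit file numbers
            puts -nonewline $chan \
                [binary format a*cSS* $word 0 [llength $nums] $nums]
        }
    }

    # Reverse the packing: rebuild the word -> file numbers array.
    proc unpack_index {chan indexName} {
        upvar 1 $indexName index
        fconfigure $chan -translation binary
        set data [read $chan]
        set pos 0
        while {$pos < [string length $data]} {
            set nul [string first \x00 $data $pos]
            set word [string range $data $pos [expr {$nul - 1}]]
            binary scan $data "@[expr {$nul + 1}]Su" count
            set pos [expr {$nul + 3}]
            binary scan $data "@${pos}Su${count}" nums
            set index($word) $nums
            incr pos [expr {2 * $count}]
        }
    }

Once unpacked, a lookup is a single hashed array read, which is the constant-time search Colin mentions at [11:07].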
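And aku's closing observation, that the index maps words only to files and the exact occurrences are found afterwards by a regular scan, could be done with a grep-style pass over the candidate files (the word-boundary regexp here is an assumption):

    # Second stage of a search: the index gave us candidate files;
    # re-scan them to find the exact matching lines.
    proc locate {word filenames} {
        set hits {}
        foreach filename $filenames {
            set fd [open $filename r]
            set lineno 0
            while {[gets $fd line] >= 0} {
                incr lineno
                if {[regexp -nocase "\\m$word\\M" $line]} {
                    lappend hits [list $filename $lineno $line]
                }
            }
            close $fd
        }
        return $hits
    }

This keeps the index small, exactly the trade-off aku describes: a little more work per query, far less space on disk.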