'''KeyWord In Context''' Indexing

If you're looking for the keyword search page, which was "mini.net/tcl/kw", it has been moved to http://wiki.tcl.tk/2

----

[CmCc]: Google does this; now you can too ... here's a start in pure Tcl [http://sharedtech.dyndns.org/~colin/kwic]. My intention was to replace glimpse [http://webglimpse.net/], which is a really good early KWIC indexer with very low overhead. Glimpse is now proprietary and I don't like the alternatives, so I wrote kwic.

The code has a simple command wrapper, but it's really a [Snit] class with some interfaces and a filesystem-persistent representation. I intend to write a [tclhttpd] interface to it, so we can provide search functionality.

Dependencies: [Snit], Tcl 8.4 or later (uses [dict]s)

I'll write some docco RSN, promise.

[AK] I remember glimpse. I stopped using it when it started to crash consistently when applied to my mail spool. It got too big for it, I guess.

[schlenk] Nice. I once wrote a grossly slow indexing package [http://www.uni-oldenburg.de/~schlenk/index/index02.zip] for Tcl to use with tclhttpd. (Someone needed a full-text search system and I had an afternoon to code it...) A good working glimpse replacement (especially one that handles databases larger than 2 GB) would be cool.

----

[AK] Some semi-documentation about the kwic file format, etc., taken from the Tcler's chat:

 [11:02] Colin  I can't even remember writing it, and doubt I could again ... I wonder if I've had a stroke.
 [11:02] Colin  Anyway ... since I copied glimpse's design, all good. Let me refresh on the details.
 [11:04] Colin  OK. Packed indices are just like glimpse's. So, word + file#. Minimal overhead, assuming the set of words is small relative to the source file sizes. Like, typical English is 10K words.
 [11:05] Colin  That's on disk.
 [11:05] Colin  It unpacks the disk files into arrays; again the keys are words, the contents are lists of file numbers.
 [11:06] Colin  Each file is given a unique number, see: its ordinal number in the ordered set of all files being indexed.
 [11:06] Colin  I think I stipulated a maximum of 64K files, hence 16 bits per occurrence.
 [11:07] Colin  So the file overhead is strlen(word)+1 + 2*occurrence_count per word.
 [11:07] Colin  Search, once the index is read, is constant time (hashed).
 [11:08] Colin  The cost to unpack it is mostly linear, I guess: traverse the index and turn it into a Tcl array.
 [11:08] Colin  The cost to build the index is proportional to the size indexed, I guess. Uses a hash/array to build the index.
 [11:09] Colin  Fundamentally, I think/hope: linear in textbase size.
 [11:09] Colin  The performance of the thing should be linear, proportional to the size of the stuff searched.
 [11:10] Colin  The size should, for expected text sets (natural language or nearly), be a fraction of the stuff indexed.
 [11:10] Colin  Just like glimpse.
 [11:12] Colin  The *very* clever thing about glimpse is that it realises you can load a file and search it for a word very quickly, in RAM.
 [11:12] Colin  And so all they need to do is construct the set of all unique words in a set of files, and record in which files each word occurs.
 [11:13] Colin  They give each file a unique number, so recording an occurrence is just associating a set of file numbers with a word.
 [11:13] Colin  IIRC, they sort their word->files association, but we don't have to, because I load it into an array.
 [11:14] Colin  So they have a second index: word->word#, then word#->file#+, then file#->filename.
 [11:15] aku    So the index is word->file, not word->exact location, and exact locations are found later by a regular search. Smaller index, possibly slower in search, but not too much, as you said.
 [11:15] Colin  Exactly!
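Colin's description above pins down most of the packed layout: per word, the word's bytes plus one terminator byte (the strlen(word)+1), followed by 16-bit file numbers (the 2*occurrence_count). It doesn't say how a reader finds the end of each file-number list, so the sketch below, which is not Colin's code, assumes a 16-bit count prefix after the terminator (two extra bytes per word over his formula). A minimal Tcl 8.5+ sketch of packing an array-keyed index to disk and unpacking it back:

   # Hypothetical packed-index writer/reader in the spirit of the chat.
   # Assumed record layout: word bytes, a NUL byte, a 16-bit big-endian
   # count, then count 16-bit big-endian file numbers.
   namespace eval kwic {}

   proc kwic::save {indexVar path} {
       upvar 1 $indexVar index
       set fd [open $path wb]
       foreach word [lsort [array names index]] {
           set files $index($word)
           # a* = word without terminator, x = the NUL, S = count, S* = files
           puts -nonewline $fd [binary format a*xSS* $word [llength $files] $files]
       }
       close $fd
   }

   proc kwic::load {indexVar path} {
       upvar 1 $indexVar index
       set fd [open $path rb]
       set data [read $fd]
       close $fd
       for {set pos 0} {$pos < [string length $data]} {} {
           set nul [string first \x00 $data $pos]
           set word [string range $data $pos [expr {$nul - 1}]]
           binary scan $data "@[expr {$nul + 1}]Su" count
           binary scan $data "@[expr {$nul + 3}]Su$count" files
           set index($word) $files
           set pos [expr {$nul + 3 + 2*$count}]
       }
   }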
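Building is the linear pass Colin describes: number the files in input order, split each into words, and record every distinct word for that file under its file number. The tokenizer below (lowercased alphanumeric runs) is an assumption for illustration, not what kwic actually does:

   # Hypothetical index builder: fileno is each file's ordinal in the
   # input list, matching the word -> {file numbers} scheme above.
   namespace eval kwic {}

   proc kwic::build {indexVar paths} {
       upvar 1 $indexVar index
       set fileno 0
       foreach path $paths {
           set fd [open $path r]
           set text [string tolower [read $fd]]
           close $fd
           # -unique: each file number appears at most once per word
           foreach word [lsort -unique [regexp -all -inline {[a-z0-9]+} $text]] {
               lappend index($word) $fileno
           }
           incr fileno
       }
   }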
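And the two-phase search aku summarises: a constant-time (hashed) array lookup narrows the query to candidate files, then an ordinary per-line scan of just those files recovers the exact locations. Again a sketch under the same assumptions, not the actual kwic interface:

   # Hypothetical search: the array lookup is the fast, hashed phase;
   # the per-file scan is the slower exact-location phase.
   namespace eval kwic {}

   proc kwic::search {indexVar word paths} {
       upvar 1 $indexVar index
       set word [string tolower $word]
       if {![info exists index($word)]} { return {} }
       set hits {}
       foreach fileno $index($word) {
           set path [lindex $paths $fileno]
           set fd [open $path r]
           set lineno 0
           while {[gets $fd line] >= 0} {
               incr lineno
               # \m/\M = word boundaries; safe while words are alphanumeric
               if {[regexp -nocase "\\m$word\\M" $line]} {
                   lappend hits [list $path $lineno $line]
               }
           }
           close $fd
       }
       return $hits
   }

   # Usage sketch:
   #   set files [lsort [glob *.txt]]
   #   kwic::build idx $files
   #   kwic::save idx kwic.idx
   #   kwic::load fresh kwic.idx
   #   puts [kwic::search fresh glimpse $files]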
----
[Category Application]