distributed in core database

by Theo Verelst

Many, many versions of Tcl ago, I had the (of course not extremely original) interest in storing database-like data in Tcl lists, and quite some time ago I made some database routines for in-core use which are still in bwise. They are far from complete or generally applicable, though the simple idea is fine for many things. As time has progressed from the era of minicomputers and workstations to portables with more power and memory than most of those ever had, the question is whether we can still learn something from the techniques of the past, like databases.

One of the most important trends since workstations made their entrance in computer (in my case EE-) land is the distribution of data and workload. Early in Unix's history, it was possible to use things like remsh, rexec, and rlogin, which I used at the time to do various things in computer graphics, though not organized the way it later became under the term 'distributed processing'. In short, when workstations became very fast and more abundantly present, an important computer system design thought became distributed processing, mainly to make all the CPU horsepower and storage space, which sits mostly unused a lot of the time in offices, available for great peaks of computation, and even to simply put a lot of fast machines together to become faster than a single machine ever could.

Using Tcl, the approach can be prototyped fairly easily, and it is definitely valid for PC setups too, not just for scientific purposes.

This page will describe a two-machine setup to find files (from the file systems) in parallel on both machines, so that a faster distributed search results, and it's a fun enough and useful database example.


As a first experiment, I use some bwise routines and a few lines of script to make a (relatively large) database. To begin with (though I could do this in Tcl, I just happened to have done it using cygwin and bash):

 exec find /  > /tmp/filelist.txt

to get an extensive list of all files on disc, one per line in the file. On cygwin, possibly use /cygdrive/c/ or something similar.
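Since the file list could also be made in Tcl itself, here is a minimal sketch of how that might look; `listfiles` is a hypothetical helper, not a bwise routine. Globbing with `*` skips dot files, and symlinked directories are not followed, to avoid loops.

```tcl
# A pure-Tcl alternative to exec'ing find: recursively glob for files.
proc listfiles {dir resultvar} {
    upvar 1 $resultvar res
    foreach f [glob -nocomplain -directory $dir *] {
        lappend res $f
        # recurse into real directories only, not symlinks
        if {[file isdirectory $f] && [file type $f] ne "link"} {
            listfiles $f res
        }
    }
}

# collect everything under /tmp and write it out, one name per line
set allfiles {}
listfiles /tmp allfiles
set fp [open /tmp/filelist.txt w]
puts $fp [join $allfiles \n]
close $fp
```

Starting at / instead of /tmp would take considerably longer, like the find command does.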

Now start bwise with a console window, load the text file, and list-ify its contents, adding an empty 'comments' field per file:

 open_text /tmp/filelist.txt
 foreach i [split [.tt.t get 0.0 end] \n] {
    lappend dbvar [list \
       [list file $i] \
       [list comments {}]]
 }

In one of my example cases, I got a list this size:

    llength $dbvar

Now use


to get a simple GUI to inspect and update entries from the list.

On a 2.6 GHz machine, after I added a single 't' in the comments field of entry 200,000, a (full) search took 5 seconds:

    time {dbsearch t comments}
 5030446 microseconds per iteration
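Such a search is essentially a linear scan over the whole list. As an illustration, a minimal stand-in for a search routine over the record layout built above could look like the following; the proc name and exact behaviour are assumptions for this sketch, not the actual bwise dbsearch code.

```tcl
# Hypothetical stand-in for a dbsearch-like routine: scan all records
# in the global dbvar and return those whose given field contains the
# pattern as a substring.
proc dbsearch_sketch {pattern field} {
    global dbvar
    set hits {}
    foreach record $dbvar {
        foreach pair $record {
            lassign $pair name value
            if {$name eq $field && [string match *$pattern* $value]} {
                lappend hits $record
            }
        }
    }
    return $hits
}
```

With a few hundred thousand records this scan is what dominates the measured seconds, which is exactly what distributing over machines can cut down.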

I used another machine, which probably started swapping during the experiment, with approximately the same number of files but 20 megabytes of text file instead of 10, and as a result it took a few minutes to complete the search.

Now let's connect the searches up.

I got held up, so I will outline a possible experiment to make the title match: the two (or more) machines should be connected up, and it is assumed that for most operations, let's say searches, the operations on the database don't generate excessive amounts of data. We can then socket-connect various in-core databases (various tclsh or wish processes running them), get the remote outputs back over the socket connection, and merge them together (re-sorting the result when we want sorted overall output).
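A minimal sketch of that socket idea might look like this; the port number 9901 and the local [search] command are assumptions standing in for the actual database search. Each machine runs the server part, which reads one query line and returns the result list on one line.

```tcl
# Server side: accept a connection, read one query, answer with the
# local search result as a single-line Tcl list, and close.
proc serve {chan addr port} {
    fconfigure $chan -buffering line
    gets $chan query
    puts $chan [search $query]
    close $chan
}
socket -server serve 9901

# Querying side (normally run on another machine or process):
proc remotesearch {host port query} {
    set chan [socket $host $port]
    fconfigure $chan -buffering line
    puts $chan $query
    gets $chan result
    close $chan
    return $result
}
# Merge local and remote hits, re-sorting for overall sorted output:
#   set merged [lsort [concat [search t] [remotesearch otherhost 9901 t]]]
```

The server needs a running event loop (vwait, or a wish mainloop) to service connections; with more peers, remotesearch is simply called once per host and the results concatenated before sorting.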

Alternatively (I played with this), something like pcom can be used to interface the machines. Always take care when executing data from other machines; the connection may not be safe, and when decoding commands, unpredictable Tcl can get executed. The page remote execution using tcl and Pcom exemplifies what can be used to do the remote exec; for result lists on one line this works fine for a 2-way distributed setup.
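One way to limit the damage of executing received data, sketched here, is to evaluate it in a safe slave interpreter, which has no file, exec, or socket access, and to alias in only the one command you intend to serve. The dbsearch body below is a dummy stand-in for the real search routine.

```tcl
# Evaluate incoming commands in a safe interp; only dbsearch is exposed.
proc dbsearch {pattern field} {
    return [list "results for $pattern in $field"]
}
set slave [interp create -safe]
interp alias $slave dbsearch {} dbsearch

# as a command line might arrive over the socket connection:
set receivedLine {dbsearch t comments}
set result [interp eval $slave $receivedLine]
# dangerous commands like exec, open, or socket simply do not
# exist inside the safe slave, so a hostile line cannot reach them
```

This doesn't make the connection itself safe, but it does bound what a decoded command can do on the receiving machine.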

more to follow.