Building a file index

ulis, 2003-10-11.

If you have to access a big file repeatedly, you can cut down the I/O by building a file index. With a file index you can reach the n-th record (an "enreg" in the code below) of the big file with only a few I/O operations.

The idea is simple: you keep the offsets of the records in the big file at hand, and then you reach any record with a single seek.

To look up the offsets efficiently, you store them in an index file in a fixed-width format, so that the n-th offset sits at a known position.
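
As a minimal sketch of the fixed-width idea, the snippet below packs a few offsets into 8-character slots and picks one record's bounds back out by arithmetic alone. The offsets are made-up values and the index is kept in a string rather than a file, purely for illustration.

```tcl
# Hypothetical offsets of three records, followed by the end-of-file offset.
set offsets {0 12 40 57}

# Pack each offset into an 8-character, left-justified, space-padded slot.
set index ""
foreach offset $offsets {
    append index [format %-8.8s $offset]
}

# Slot n holds the start of record n; slot n+1 holds its end.
set n 1
set start [string trim [string range $index [expr {$n * 8}] [expr {$n * 8 + 7}]]]
set end   [string trim [string range $index [expr {($n + 1) * 8}] [expr {($n + 1) * 8 + 7}]]]
set length [expr {$end - $start}]
```

Because every slot has the same width, no scan of the index is needed. Note that `%-8.8s` cuts anything longer than 8 characters, so this layout only suits files smaller than 100000000 bytes.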


'''The proc to build the index file''' of a big file of newline-delimited records:

  # --------------------
  # build_index
  #
  #   builds a file index for a big file
  # --------------------
  # parm1: big file name
  # parm2: index file name
  # --------------------
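  # return: count of records
  # --------------------
  proc build_index {bigfile indexfile} \
  {
    # index the file
    set count 0
    set fpi [open $bigfile r]
    set fpo [open $indexfile w]
    while {1} \
    {
      # remember where the next record starts
      set offset [tell $fpi]
      ### records are newline delimited lines ###
      # gets returns -1 at end of file; testing it avoids indexing a
      # phantom empty record after the final newline
      if {[gets $fpi line] < 0} { break }
      # each offset fills an 8-character slot, so it must be < 100000000
      puts -nonewline $fpo [format %-8.8s $offset]
      incr count
    }
    # closing offset: the end of the last record
    puts -nonewline $fpo [format %-8.8s $offset]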
    close $fpo
    close $fpi
    return $count
  }

The proc to access the n-th record, given the file handles of the big file and of the index file:

  # --------------------
  # get_enreg
  #
  #    get the n-th record of a big file indexed by an index file
  # --------------------
  # parm1: the zero-based index of the record inside the big file
  # parm2: the handle of the opened big file
  # parm3: the handle of the opened index file
  # --------------------
  # return: the n-th record of the big file, with its trailing newline if any
  # --------------------
  proc get_enreg {n fileptr indexptr} \
  {
    # get the starting & ending offsets
    seek $indexptr [expr {$n * 8}] start
    set start [string trim [read $indexptr 8]]
    set end [string trim [read $indexptr 8]]
    # seek inside the big file
    seek $fileptr $start start
    # return the record; read counts characters, so the big file should be
    # in binary mode (fconfigure -translation binary) unless it is plain
    # ASCII with LF line ends
    return [read $fileptr [expr {$end - $start}]]
  }
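
To see the whole round trip on a tiny file, here is a self-contained sketch that inlines the same steps instead of calling the two procs, so it can be pasted into a fresh tclsh. The file names `sample.txt` and `sample.idx` and the sample text are assumptions made for the example, and the handles are put in binary mode so that offsets and read counts are both in bytes.

```tcl
# Hypothetical scratch files, created in the current directory.
set bigfile   sample.txt
set indexfile sample.idx

# Write a three-record sample file; binary mode keeps "\n" as a single byte.
set fp [open $bigfile w]
fconfigure $fp -translation binary
puts -nonewline $fp "alpha\nbravo charlie\ndelta\n"
close $fp

# Build the index: one 8-character slot per record start, plus the end offset.
set fpi [open $bigfile r]
fconfigure $fpi -translation binary
set fpo [open $indexfile w]
set count 0
while {1} {
    set offset [tell $fpi]
    if {[gets $fpi line] < 0} break
    puts -nonewline $fpo [format %-8.8s $offset]
    incr count
}
puts -nonewline $fpo [format %-8.8s $offset]
close $fpo
close $fpi

# Fetch record 1 (zero-based) with one seek in each file.
set fileptr [open $bigfile r]
fconfigure $fileptr -translation binary
set indexptr [open $indexfile r]
set n 1
seek $indexptr [expr {$n * 8}] start
set start [string trim [read $indexptr 8]]
set end   [string trim [read $indexptr 8]]
seek $fileptr $start start
# Drop the trailing newline that delimits the record.
set record [string trimright [read $fileptr [expr {$end - $start}]] \n]
close $indexptr
close $fileptr

puts "$count records, record $n = $record"
file delete $bigfile $indexfile
```

In real use the two handles would stay open across many lookups, which is where the index pays off: each lookup costs two seeks and two short reads, whatever the size of the big file.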

See also