globfind

globfind is designed as a fast and simple alternative to fileutil::find.

News

SEH 2022-09-07: globfind 2.0 released.

globfind is a directory hierarchy search utility. It takes advantage of glob's ability to use multiple patterns to scan deeply into a directory structure in a single command, hence the name.
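The multi-pattern idea can be illustrated with a minimal sketch. The helper name globLevels is hypothetical and this is not globfind's actual code; it only shows that a single glob call accepts one pattern per directory level (*, */*, */*/*, ...). It assumes Tcl 8.5 or later for the {*} expansion syntax.

```tcl
# Hypothetical helper (not part of globfind): list everything up to
# $depth levels below $basedir with a single glob call.
proc globLevels {basedir depth} {
    if {$depth < 1} {
        return {}
    }
    set patterns {}
    set pat *
    for {set i 0} {$i < $depth} {incr i} {
        # One pattern per level: *, */*, */*/*, ...
        lappend patterns $pat
        append pat /*
    }
    # glob accepts any number of patterns in one invocation
    return [glob -nocomplain -directory $basedir -- {*}$patterns]
}
```

Note that on Unix the * pattern does not match names beginning with a dot, so hidden files would need additional patterns or the "hidden" type.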

Version 2.0 is a rewrite from scratch to be faster, more compact, more featureful, and more error-resilient.

On searches of large directory spaces for files matching a glob pattern, globfind typically runs about three times as fast as fileutil::findByPattern, and at about 150% of the speed of GNU find.

Support for Tcl versions before 8.5 has been dropped.


fileutil::find is useful but it has several deficiencies:

  • On Windows, hidden files are mishandled.
  • On Windows, checks to avoid infinite loops due to nested symbolic links are not done.
  • On Unix, nested loop checking requires a "file stat" of each file/dir encountered, a significant performance hit.
  • The basedir from which the search starts is not included in the results, as it is with GNU find.
  • If the basedir is a file, it is returned in the result not as a list element (like glob) but as a string.
  • fileutil::find calls itself recursively, and thus risks running into interp recursion limits on very large directory trees.
  • fileutil.tcl contains three separate implementations of find for varying OSes/versions, which is a maintenance nightmare.

globfind eliminates all the above deficiencies. It checks for nested symbolic links in a platform-independent way and scans directory hierarchies without recursion.
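A minimal sketch of how a tree can be walked without recursion while guarding against symbolic-link loops follows. It is an illustration under stated assumptions, not globfind's actual implementation: the names walkTree and realDir are hypothetical, and the link-resolution trick (normalizing a dummy child path so that the last real component is resolved too) is borrowed from a common Tcl idiom. Tcl 8.5 or later is assumed.

```tcl
# Hypothetical helper: resolve symbolic links in every component of $path.
# file normalize leaves a link in the LAST component unresolved, so a dummy
# child is appended first and stripped off again afterwards.
proc realDir {path} {
    return [file dirname [file normalize [file join $path __dummy__]]]
}

# Hypothetical sketch: walk a directory tree with a work queue instead of
# recursive proc calls, so interp recursion limits never come into play.
proc walkTree {basedir} {
    set result [list $basedir]   ;# include the start point, as GNU find does
    set queue  [list $basedir]
    # Resolved names of directories already queued, to break link loops
    set seen [dict create [realDir $basedir] 1]
    while {[llength $queue] > 0} {
        # Pop the first directory off the queue
        set queue [lassign $queue dir]
        foreach path [glob -nocomplain -directory $dir -- *] {
            lappend result $path
            if {![file isdirectory $path]} {
                continue
            }
            set real [realDir $path]
            if {[dict exists $seen $real]} {
                continue         ;# already scanned via another name
            }
            dict set seen $real 1
            lappend queue $path
        }
    }
    return $result
}
```

This sketch still checks each entry with file isdirectory and skips hidden files on Unix; a faster variant would have to avoid the per-entry checks, which is the cost the list above attributes to fileutil::find.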

For speed and simplicity, globfind takes advantage of glob's ability to use multiple patterns to scan deeply into a directory structure in a single command, hence the name. Its calling syntax is the same as fileutil::find, so with a name change it could be used as a drop-in replacement:

Usage: globfind ?basedir? ?filtercmd? ?switches?

Options:

basedir - the directory from which to start the search.  Defaults to current
directory.

filtercmd - Tcl command; for each file found in the basedir, the filename will
be appended to filtercmd and the result will be evaluated.  The evaluation
should return a boolean value; only files for which it returns true will be
included in the final return result. ex: {file isdirectory}

switches - The switches will "prefilter" the results before the filtercmd is
applied.  The available switches are:

  -depth      - sets the number of levels down from the basedir into which the 
                filesystem hierarchy will be searched. A value of zero is
                interpreted as infinite depth.

  -pattern    - a glob-style filename-matching wildcard. ex: -pattern *.pdf

  -types      - any value acceptable to the "types" switch of the glob
                command. ex: -types {d hidden}

  -redundancy - eliminates redundant listing of real files that may occur due
                to symbolic links that link to directories within basedir (at
                the cost of slower execution). Stores names of such symbolic
                links in ::fileutil::globfind::redundant_files. Sets
                ::fileutil::globfind::REDUNDANCY to 1 if redundancies found,
                otherwise 0.
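The filtercmd convention described above (append the filename, evaluate, keep the file if the result is true) can be sketched as follows. The proc applyFilter is a hypothetical stand-in written for illustration only; it is not part of globfind and says nothing about how globfind implements the step internally.

```tcl
# Hypothetical helper illustrating the filtercmd convention:
# each filename is appended to the command as one extra argument,
# and the file is kept only if the command returns a true value.
proc applyFilter {files filtercmd} {
    set kept {}
    foreach f $files {
        # Build "filtercmd filename" as a proper list and evaluate it
        # at global level so that user-defined procs are found.
        if {[uplevel #0 [list {*}$filtercmd $f]]} {
            lappend kept $f
        }
    }
    return $kept
}

# Example: keep only the directories from a list of paths
# set dirs [applyFilter [glob -nocomplain *] {file isdirectory}]
```

Going by the usage line above, the corresponding globfind call would presumably take the form globfind $basedir {file isdirectory} -depth 2, with the switches narrowing the candidates before the filter command runs.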

See Also

Matthias Hoffmann - Tcl-Code-Snippets - misc - globx
getfiles cached
JOB - 2016-07-12 20:07:18: Relies on the same glob statement plus the ability to store search results in an additional cache file.

Misc

AQI 2016-07-16: rglob is another much faster alternative to fileutil::find.

proc rglob {basedir pattern} {
    # Fix the directory name: this ensures the directory name is
    # normalized and ends with a directory separator
    set basedir [string trimright [file join [file normalize $basedir] { }]]
    set fileList {}

    # Look in the current directory for matching files. -type {f r}
    # means only readable normal files are looked at; -nocomplain stops
    # an error being thrown if the returned list is empty
    foreach fileName [glob -nocomplain -type {f r} -path $basedir $pattern] {
        lappend fileList $fileName
    }

    # Now look for any subdirectories in the current directory
    foreach dirName [glob -nocomplain -type {d r} -path $basedir *] {
        # Recursively call the routine on the subdirectory and append any
        # new files to the results
        set subDirList [rglob $dirName $pattern]
        if { [llength $subDirList] > 0 } {
            foreach subDirFile $subDirList {
                lappend fileList $subDirFile
            }
        }
    }
    return $fileList
}

Example:

time {util::rglob [pwd] *.tcl} 100

1505.3 microseconds per iteration

time {fileutil::findByPattern [pwd] *.tcl} 100

8957.71 microseconds per iteration

Both provide the same result for basic glob matches.

It's even better on larger file structures:

time {fileutil::findByPattern [pwd] *.tcl} 10

2696438.9 microseconds per iteration

time {util::rglob [pwd] *.tcl} 10

277771.5 microseconds per iteration

Page Authors

Gerald Lester
PYK