[MJ] - While trying to find duplicate files in my images directory, I hacked together a small Tcl script to determine which files are duplicates. The script requires [tcllib] for [md5] and [fileutil]. It finds duplicates by MD5 hash (skipping files of 100 MB or more for performance) and stores the hashes array in an output file. Using that file to actually delete the duplicates is still to be done.

----

 package require Tcl 8.5
 package require md5
 package require fileutil

 if {$argc != 1} {
     puts stderr "usage: [file tail [info script]] output"
     exit 1
 } else {
     lassign $argv output
 }

 set max_size 10e7

 proc file_md5 {filename} {
     global hashes max_size
     if {![file isdirectory $filename] && [file size $filename] < $max_size} {
         set file [file join [pwd] $filename]
         set md5 [md5::md5 -hex -file $filename]
         lappend hashes($md5) $file
         puts "$md5: $file"
     }
     ;# Always return 0: file_md5 is used as a visitor only,
     ;# so the result list of fileutil::find stays empty.
     return 0
 }

 fileutil::find . file_md5

 puts "Storing hashes in $output"
 set f [open $output w]
 puts $f [array get hashes]
 close $f

----

!!!!!! %| [Category Application] |% !!!!!!
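For the still-to-be-done deletion step, here is a minimal sketch of how the stored hashes could be read back and the duplicates listed. It assumes the output file was written with [array get] as above; ''hashes.txt'' is a placeholder name, and the actual [file delete] is left commented out so the script only reports.

 # Sketch only: read the file written by the finder script and
 # print every file after the first one recorded per hash.
 # "hashes.txt" stands in for whatever output name was used.
 set f [open hashes.txt r]
 array set hashes [read $f]
 close $f

 foreach md5 [array names hashes] {
     set files $hashes($md5)
     if {[llength $files] > 1} {
         puts "duplicates of [lindex $files 0]:"
         foreach dup [lrange $files 1 end] {
             puts "    $dup"
             ;# to really remove it: file delete $dup
         }
     }
 }

Note this keeps whichever file happened to be recorded first for each hash; sorting $files first (e.g. by path) would make the choice of survivor predictable.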