Version 17 of diff in Tcl

Updated 2003-03-23 19:32:19

if 0 {

GPS asks in the comp.lang.tcl newsgroup:

 In my Tcl file server I use chroot for security reasons, and this has
 caused problems because my server also relies on diff.  I know that I
 could recompile GNU diff as a standalone executable and put it within
 my chroot, but I don't want to infect my program with the GPL nor 
 force my users to compile GNU binutils for this.  I've been looking
 all over the place for a diff that is portable and can be used
 commercially, but I haven't been able to find one.  So, I come to
 you for help writing a diff tool in Tcl.  I'm overwhelmed when I look
 at existing code written in C, so I truly do need the help of
 someone more experienced.

KBK answers:

The heart of any implementation of diff is a computation of the longest common subsequence (or some approximation to it) -- that is, the largest set of lines that the two files have in common. There's a copy of the original paper of diff on Doug McIlroy's homepage [L1 ].

A "dynamic programming" approach to identifying the longest common subsequence has been known in the computing folklore for many years. There is a discussion of the folklore at [L2 ] - follow the Download links at the upper right to get a copy of the paper. There's a Tcl implementation of the folklore algorithm over at Longest common subsequence: folklore algorithm.

Below is a longest-common-subsequence procedure that incorporates McIlroy's improvements; the internals are extremely similar to those of diff. Despite being a pure-Tcl implementation, it appears to be adequately fast for most purposes.

The display that it produces is not the same as that of diff, but can be easily adapted to do whatever you need.


}

 # Copyright (c) 2003 by Kevin B. Kenny.  All rights reserved.
 # See the file,
 # 'http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/tcllib/tcllib/license.terms'
 # for terms and conditions of redistribution.

 package require Tcl 8.0

 namespace eval list { namespace export longestCommonSubsequence }

 # Do a compatibility version of [lset] for pre-8.4 versions of Tcl.
 # This version does not do multi-arg [lset]!

 if { [package vcompare [package provide Tcl] 8.4] < 0 } {
     proc list::K { x y } { set x }
     proc list::lset { var index arg } {
         upvar 1 $var list
         set list [lreplace [K $list [set list {}]] $index $index $arg]
     }
 }

 # list::longestCommonSubsequence --
 #
 #       Computes the longest common subsequence of two lists.
 #
 # Parameters:
 #       sequence1, sequence2 -- Two lists to compare.
 #
 # Results:
 #       Returns a list of two lists of equal length. 
 #       The first sublist is of indices into sequence1, and the
 #       second sublist is of indices into sequence2.  Each corresponding
 #       pair of indices corresponds to equal elements in the sequences;
 #       the sequence returned is the longest possible.
 #
 # Side effects:
 #       None.

 proc list::longestCommonSubsequence { sequence1 sequence2 } {

     # Construct a set of equivalence classes of lines in file 2 

     set index 0
     foreach string $sequence2 {
         lappend eqv($string) $index
         incr index
     }

     # K holds descriptions of the common subsequences.
     # Initially, there is one common subsequence of length 0,
     # with a fence saying that it includes line -1 of both files.
     # The maximum subsequence length is 0; position 0 of
     # K holds a fence carrying the line following the end
     # of both files.

     lappend K [list -1 -1 {}]
     lappend K [list [llength $sequence1] [llength $sequence2] {}]
     set k 0

     # Walk through the first file, letting i be the index of the line and
     # string be the line itself.

     set i 0
     foreach string $sequence1 {

         # Consider each possible corresponding index j in the second file.

         if { [info exists eqv($string)] } {

             # c is the candidate match most recently found, and r is the
             # length of the corresponding subsequence.

             set c [lindex $K 0]
             set r 0

             foreach j $eqv($string) {

                 # Perform a binary search to find a candidate common
                 # subsequence to which may be appended this match.

                 set max $k
                 set min $r
                 set s [expr { $k + 1 }]
                 while { $max >= $min } {
                     set mid [expr { ( $max + $min ) / 2 }]
                     set bmid [lindex [lindex $K $mid] 1]
                     if { $j == $bmid } {
                         break
                     } elseif { $j < $bmid } {
                         set max [expr {$mid - 1}]
                     } else {
                         set s $mid
                         set min [expr { $mid + 1 }]
                     }
                 }

                 # Go to the next match point if there is no suitable
                 # candidate.

                 if { $j == [lindex [lindex $K $mid] 1] || $s > $k} {
                     continue
                 }

                 # s is the sequence length of the longest sequence
                 # to which this match point may be appended. Make
                 # a new candidate match and store the old one in K
                 # Set r to the length of the new candidate match.

                 set newc [list $i $j [lindex $K $s]]
                 lset K $r $c
                 set c $newc
                 set r [expr $s+1]

                 # If we've extended the length of the longest match,
                 # we're done; move the fence.

                 if { $s >= $k } {
                     lappend K [lindex $K end]
                     incr k
                     break
                 }

             }

             # Put the last candidate into the array

             lset K $r $c

         }

         incr i

     }

     set q [lindex $K $k]

     for { set i 0 } { $i < $k } {incr i } {
         lappend seta {}
         lappend setb {}
     }
     while { [lindex $q 0] >= 0 } {
         incr k -1
         lset seta $k [lindex $q 0]
         lset setb $k [lindex $q 1]
         set q [lindex $q 2]
     }

     return [list $seta $setb]

 }

 # MAIN PROGRAM
 #
 # Usage:
 #       diff.tcl file1 file2
 #
 # Output:
 #       Puts out a list of lines consisting of:
 #               n1<TAB>n2<TAB>line
 #
 #       where n1 is a line number in the first file, and n2 is a line number in the second file.
 #       The line is the text of the line.  If a line appears in the first file but not the second,
 #       n2 is omitted, and conversely, if it appears in the second file but not the first, n1
 #       is omitted.

 # Open the files and read the lines into memory

 set f1 [open [lindex $argv 0] r]
 set lines1 [split [read $f1] \n]
 close $f1
 set f2 [open [lindex $argv 1] r]
 set lines2 [split [read $f2] \n]
 close $f2

 set i 0
 set j 0

 foreach { x1 x2 } [list::longestCommonSubsequence $lines1 $lines2] {
     foreach p $x1 q $x2 {
         while { $i < $p } {
             set l [lindex $lines1 $i]
             puts "[incr i]\t\t$l"
         }
         while { $j < $q } {
             set m [lindex $lines2 $j]
             puts "\t[incr j]\t$m"
         }
         set l [lindex $lines1 $i]
         puts "[incr i]\t[incr j]\t$l"
     }
 }
 while { $i < [llength $lines1] } {
     set l [lindex $lines1 $i]
     puts "[incr i]\t\t$l"
 }
 while { $j < [llength $lines2] } {
     set m [lindex $lines2 $j]
     puts "\t[incr j]\t$m"
 }

if 0 {


Brilliant! I was going to try porting Tim Peters ndiff.py from the Python Tools/Scripts directory but have yet to find the time. He argues that he has an approach that is more human friendly, i.e. more in sync with what a person would consider the differences. (VPT)


This is a small diff program I wrote a few years ago. I needed a simple diff program without resorting to external programs (I wanted it to work unchanged on windows), where neither speed nor minimal changes were critical (it was part of a moo verb editor, so the diff-ed entities were at most around 60 or 70 lines). It gives reasonable results; from what I recall (I haven't used it much lately) the biggest problem is that moving a line will result in everything between the new line location and the old location showing up as deleted then added after the change.

The operation is really simple - roughly, loop over the lines in the new file, for each line look forward from the current point in the old file for the same line. If a match is not found, mark the current line as new, if its found further along mark the intervening stuff as deleted. Adjust the position in the old file and repeat. Not exactly accurate since I'm going by old fuzzy memory :)

 proc adj {args} {
   eval .diff.orig yview $args;
   eval .diff.new yview $args;
   }

 proc wdiff {l1 l2} {
   catch {destroy .diff};
   toplevel .diff;

   menu .diff.mbar;

   menu .diff.mbar.file
   .diff.mbar.file add command -label "Close" -command {destroy .diff};
   .diff.mbar add cascade -menu .diff.mbar.file -label "File";

   .diff configure -menu .diff.mbar;

   text .diff.orig -yscrollcommand ".diff.scr set";
   text .diff.new  -yscrollcommand ".diff.scr set";
   scrollbar .diff.scr -command {adj};

   pack .diff.orig .diff.scr .diff.new -side left;
   pack .diff.scr -expand t -fill y;

   .diff.new tag configure del -background green -foreground black; 
   .diff.orig tag configure del -background green -foreground black; 
   .diff.new tag configure ins -background red -foreground black; 
   .diff.orig tag configure ins -background red -foreground black; 
   .diff.new tag configure cha -background yellow -foreground black; 
   .diff.orig tag configure cha -background yellow -foreground black; 

   set len1 [llength $l1];
   set len2 [llength $l2];
   for {
        set s1 [lindex $l1 [set c 0]];
        set s2 [lindex $l2 [set d 0]];
        } {$c<$len1 || $d<$len2} {
        set s1 [lindex $l1 [incr c]];
        set s2 [lindex $l2 [incr d]]; 
        } {
        if {$c>=$len1} {
          # 2col "" "> $s2";
          2col "" del $s2 ins 
          # puts "i$d> $s2";
          continue;
        }
        if {$d>=$len2} {
          # puts "d$c> $s1";
          2col "$s1" ins "" del;
          continue;
        }
        if {![string compare $s1 $s2]} {
          # puts "$s1 $s2";
          2col $s1 "" $s2 "";
          continue;
        }
        for {set cc $c} {$cc < $len1} {incr cc} {
          if {![string compare [lindex $l1 $cc] $s2]} {
                for {set n $c} {$n<$cc} {incr n} {
                  2col "[lindex $l1 $n]" ins "" del;
                }
            # 2col "$s1" "<";
                # puts "d$cc< $s1";
                set c $cc;
                break;
          }
        }
        if {$cc == $c} {incr c -1; incr d -1; continue;}
        for {set cd $d} {$cd < $len2} {incr cd} {
          if {![string compare [lindex $l2 $cd] $s1]} {
                # puts "add $d-$cd";
                for {set n $d} {$n<$cd} {incr n} {
                  2col "" del "[lindex $l2 $n]" ins;
                }
                # puts "i$cd> $s2";
                set d $cd;
                break;
          }
        }
        if {$cd == $d} {incr c -1; incr d -1; continue;}
        2col "$s1" cha "$s2" cha;
        # puts "c$c< $s1";
        # puts "c$d> $s2";
   }
   .diff.orig configure -state disabled
   .diff.new configure -state disabled
 }

 proc 2col {a at b bt} {
   .diff.orig insert end "$a\n";
   if {$at != ""} {.diff.orig tag add $at {end-2 lines} {end-1 lines}}
   .diff.new insert end "$b\n";
   if {$bt != ""} {.diff.new tag add $bt {end-2 lines} {end-1 lines}}
 }
 #comment

 proc fdiff {a b} {
   set h [open $a];
   set l1 {};
   while {![eof $h]} {lappend l1 [gets $h]}
   set h [open $b];
   set l2 {};
   while {![eof $h]} {lappend l2 [gets $h]}
   wdiff $l1 $l2;
 }

[ Category Algorithm

Category Application ] }