[GPS] asks in [the comp.lang.tcl newsgroup]: In my Tcl file server I use chroot for security reasons, and this has caused problems because my server also relies on diff. I know that I could recompile GNU diff as a standalone executable and put it within my chroot, but I don't want to infect my program with the GPL nor force my users to compile GNU binutils for this. I've been looking all over the place for a diff that is portable and can be used commercially, but I haven't been able to find one. So, I come to you for help writing a diff tool in Tcl. I'm overwhelmed when I look at existing code written in C, so I truly do need the help of someone more experienced. [KBK] answers: The heart of any implementation of '''diff''' is a computation of the ''longest common subsequence'' (or some approximation to it) -- that is, the largest set of lines that the two files have in common. There's a copy of the original paper of '''diff''' on Doug McIlroy's homepage [http://www.cs.dartmouth.edu/~doug/]. A "dynamic programming" approach to identifying the longest common subsequence has been known in the computing folklore for many years. There is a discussion of the folklore at [http://citeseer.nj.nec.com/simon88sequence.html] - follow the ''Download'' links at the upper right to get a copy of the paper. Here is one Tcl implementation of the "folklore algorithm." I'd like to see one with McIlroy's improvements; this one is ''slow.' ---- namespace eval list { } namespace eval list::longestCommonSubsequence { namespace export compare } # Internal procedure that indexes into the 2-dimensional array t, # which corresponds to the sequence y, looking for the (i,j)th element. proc list::longestCommonSubsequence::Index { t y i j } { set indx [expr { ([llength $y] + 1) * ($i + 1) + ($j + 1) }] return [lindex $t $indx] } # Internal procedure that implements Levenshtein to derive the longest # common subsequence of two lists x and y. proc list::longestCommonSubsequence::ComputeLCS { x y } { set t [list] for { set i -1 } { $i < [llength $y] } { incr i } { lappend t 0 } for { set i 0 } { $i < [llength $x] } { incr i } { lappend t 0 for { set j 0 } { $j < [llength $y] } { incr j } { if { [string equal [lindex $x $i] [lindex $y $j]] } { set lastT [Index $t $y [expr { $i - 1 }] [expr {$j - 1}]] set nextT [expr {$lastT + 1}] } else { set lastT1 [Index $t $y $i [expr { $j - 1 }]] set lastT2 [Index $t $y [expr { $i - 1 }] $j] if { $lastT1 > $lastT2 } { set nextT $lastT1 } else { set nextT $lastT2 } } lappend t $nextT } } return $t } # Internal procedure that traces through the array built by ComputeLCS # and finds a longest common subsequence -- specifically, the one that # is lexicographically first. proc list::longestCommonSubsequence::TraceLCS { t x y } { set trace {} set i [expr { [llength $x] - 1 }] set j [expr { [llength $y] - 1 }] set k [expr { [Index $t $y $i $j] - 1 }] while { $i >= 0 && $j >= 0 } { set im1 [expr { $i - 1 }] set jm1 [expr { $j - 1 }] if { [Index $t $y $i $j] == [Index $t $y $im1 $jm1] + 1 && [string equal [lindex $x $i] [lindex $y $j]] } { lappend trace xy [list $i $j] set i $im1 set j $jm1 } elseif { [Index $t $y $im1 $j] > [Index $t $y $i $jm1] } { lappend trace x $i set i $im1 } else { lappend trace y $j set j $jm1 } } while { $i >= 0 } { lappend trace x $i incr i -1 } while { $j >= 0 } { lappend trace y $j incr j -1 } return $trace } # list::longestCommonSubsequence::compare -- # # Compare two lists for the longest common subsequence # # Arguments: # x, y - Two lists of strings to compare # matched - Callback to execute on matched elements, see below # unmatchedX - Callback to execute on unmatched elements from the # first list, see below. # unmatchedY - Callback to execute on unmatched elements from the # second list, see below. # # Results: # None. # # Side effects: # Whatever the callbacks do. # # The 'compare' procedure compares the two lists of strings, x and y. # It finds a longest common subsequence between the two. It then walks # the lists in order and makes the following callbacks: # # For an element that is common to both lists, it appends the index in # the first list, the index in the second list, and the string value of # the element as three parameters to the 'matched' callback, and executes # the result. # # For an element that is in the first list but not the second, it appends # the index in the first list and the string value of the element as two # parameters to the 'unmatchedX' callback and executes the result. # # For an element that is in the second list but not the first, it appends # the index in the second list and the string value of the element as two # parameters to the 'unmatchedY' callback and executes the result. proc list::longestCommonSubsequence::compare { x y matched unmatchedX unmatchedY } { set t [ComputeLCS $x $y] set trace [TraceLCS $t $x $y] set i [llength $trace] while { $i > 0 } { set indices [lindex $trace [incr i -1]] set type [lindex $trace [incr i -1]] switch -exact -- $type { xy { set c $matched eval lappend c $indices lappend c [lindex $x [lindex $indices 0]] uplevel 1 $c } x { set c $unmatchedX lappend c $indices lappend c [lindex $x $indices] uplevel 1 $c } y { set c $unmatchedY lappend c $indices lappend c [lindex $y $indices] uplevel 1 $c } } } return } # With this code in hand, we can now write the external parts of a '''diff''' command. The various options of '''diff''' alter how it ''displays'' the comparison, but not its fundamental operation. Here's an external wrapper that gives very simple-minded output. namespace import list::longestCommonSubsequence::compare proc umx { index value } { variable lastx variable xlines append xlines "< " $value \n set lastx $index } proc umy { index value } { variable lasty variable ylines append ylines "> " $value \n set lasty $index } proc matched { index1 index2 value } { variable lastx variable lasty variable xlines variable ylines if { [info exists lastx] && [info exists lasty] } { puts "[expr { $lastx + 1 }],${index1}c[expr {$lasty + 1 }],${index2}" puts -nonewline $xlines puts "----" puts -nonewline $ylines } elseif { [info exists lastx] } { puts "[expr { $lastx + 1 }],${index1}d${index2}" puts -nonewline $xlines } elseif { [info exists lasty] } { puts "${index1}a[expr {$lasty + 1 }],${index2}" puts -nonewline $ylines } catch { unset lastx } catch { unset xlines } catch { unset lasty } catch { unset ylines } } # Really, we should read the first file in like this: # set f0 [open [lindex $argv 0] r] # set x [split [read $f0] \n] # close $f0 # But I'll just provide some sample lines: set x {} for { set i 0 } { $i < 10 } { incr i } { lappend x a r a d e d a b r a x } # The second file, too, should be read in like this: # set f1 [open [lindex $argv 1] r] # set y [split [read $f1] \n] # close $f1 # Once again, I'll just do some sample lines. set y {} for { set i 0 } { $i < 10 } { incr i } { lappend y a b r a c a d a b r a } compare $x $y matched umx umy matched [llength $x] [llength $y] {} ---- It is unfortunate that the time taken to compute differences is quadratic. Alas, nobody seems to know how to do any better. ---- Brilliant! I was going to try porting Tim Peters ndiff.py from the Python Tools/Scripts directory but have yet to find the time. He argues that he has an approach that is more human friendly, i.e. more in sync with what a person would consider the differences. (VPT) ---- This is a small diff program I wrote a few years ago. I needed a simple diff program without resorting to external programs (I wanted it to work unchanged on windows), where neither speed nor minimal changes were critical (it was part of a moo verb editor, so the diff-ed entities were at most around 60 or 70 lines). It gives reasonable results; from what I recall (I haven't used it much lately) the biggest problem is that moving a line will result in everything between the new line location and the old location showing up as deleted then added after the change. The operation is really simple - roughly, loop over the lines in the new file, for each line look forward from the current point in the old file for the same line. If a match is not found, mark the current line as new, if its found further along mark the intervening stuff as deleted. Adjust the position in the old file and repeat. Not exactly accurate since I'm going by old fuzzy memory :) proc adj {args} { eval .diff.orig yview $args; eval .diff.new yview $args; } proc wdiff {l1 l2} { catch {destroy .diff}; toplevel .diff; menu .diff.mbar; menu .diff.mbar.file .diff.mbar.file add command -label "Close" -command {destroy .diff}; .diff.mbar add cascade -menu .diff.mbar.file -label "File"; .diff configure -menu .diff.mbar; text .diff.orig -yscrollcommand ".diff.scr set"; text .diff.new -yscrollcommand ".diff.scr set"; scrollbar .diff.scr -command {adj}; pack .diff.orig .diff.scr .diff.new -side left; pack .diff.scr -expand t -fill y; .diff.new tag configure del -background green -foreground black; .diff.orig tag configure del -background green -foreground black; .diff.new tag configure ins -background red -foreground black; .diff.orig tag configure ins -background red -foreground black; .diff.new tag configure cha -background yellow -foreground black; .diff.orig tag configure cha -background yellow -foreground black; set len1 [llength $l1]; set len2 [llength $l2]; for { set s1 [lindex $l1 [set c 0]]; set s2 [lindex $l2 [set d 0]]; } {$c<$len1 || $d<$len2} { set s1 [lindex $l1 [incr c]]; set s2 [lindex $l2 [incr d]]; } { if {$c>=$len1} { # 2col "" "> $s2"; 2col "" del $s2 ins # puts "i$d> $s2"; continue; } if {$d>=$len2} { # puts "d$c> $s1"; 2col "$s1" ins "" del; continue; } if {![string compare $s1 $s2]} { # puts "$s1 $s2"; 2col $s1 "" $s2 ""; continue; } for {set cc $c} {$cc < $len1} {incr cc} { if {![string compare [lindex $l1 $cc] $s2]} { for {set n $c} {$n<$cc} {incr n} { 2col "[lindex $l1 $n]" ins "" del; } # 2col "$s1" "<"; # puts "d$cc< $s1"; set c $cc; break; } } if {$cc == $c} {incr c -1; incr d -1; continue;} for {set cd $d} {$cd < $len2} {incr cd} { if {![string compare [lindex $l2 $cd] $s1]} { # puts "add $d-$cd"; for {set n $d} {$n<$cd} {incr n} { 2col "" del "[lindex $l2 $n]" ins; } # puts "i$cd> $s2"; set d $cd; break; } } if {$cd == $d} {incr c -1; incr d -1; continue;} 2col "$s1" cha "$s2" cha; # puts "c$c< $s1"; # puts "c$d> $s2"; } .diff.orig configure -state disabled .diff.new configure -state disabled } proc 2col {a at b bt} { .diff.orig insert end "$a\n"; if {$at != ""} {.diff.orig tag add $at {end-2 lines} {end-1 lines}} .diff.new insert end "$b\n"; if {$bt != ""} {.diff.new tag add $bt {end-2 lines} {end-1 lines}} } #comment proc fdiff {a b} { set h [open $a]; set l1 {}; while {![eof $h]} {lappend l1 [gets $h]} set h [open $b]; set l2 {}; while {![eof $h]} {lappend l2 [gets $h]} wdiff $l1 $l2; } ---- [Category Application]