diff in Tcl

: In my Tcl file server I use chroot for security reasons, and this has caused problems because my server also relies on diff. I know that I could recompile GNU diff as a standalone executable and put it within my chroot, but I don't want to infect my program with the GPL nor force my users to compile GNU binutils for this. I've been looking all over the place for a diff that is portable and can be used commercially, but I haven't been able to find one. So, I come to you for help writing a diff tool in Tcl. I'm overwhelmed when I look at existing code written in C, so I truly do need the help of someone more experienced.

Description

KBK answers:

The heart of any implementation of diff is a computation of the longest common subsequence (or some approximation to it) -- that is, the largest set of lines that the two files have in common. There's a copy of the original paper of diff on Doug McIlroy's homepage .

A "dynamic programming" approach to identifying the longest common subsequence has been known in the computing folklore for many years. There is a discussion of the folklore at Sequence Comparison: Some Theory and Some Practice (1988) Follow the Download links at the upper right to get a copy of the paper. There's a Tcl implementation of the folklore algorithm over at Longest common subsequence: folklore algorithm.

Below is a longest-common-subsequence procedure that incorporates McIlroy's improvements; the internals are extremely similar to those of diff. Despite being a pure-Tcl implementation, it appears to be adequately fast for most purposes.

The display that it produces is not the same as that of diff, but can be easily adapted to do whatever you need.

A more extensive version of this procedure has been submitted for inclusion in Tcllib. For details, see Tcllib feature request 708502 .

# Copyright (c) 2003 by Kevin B. Kenny.  All rights reserved.
# See the file,
# 'http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/tcllib/tcllib/license.terms'
# for terms and conditions of redistribution.

package require Tcl 8.0

namespace eval list { namespace export longestCommonSubsequence }

# Do a compatibility version of [lset] for pre-8.4 versions of Tcl.
# This version does not do multi-arg [lset]!

if { [package vcompare [package provide Tcl] 8.4] < 0 } {
    proc list::K { x y } { set x }
    proc list::lset { var index arg } {
        upvar 1 $var list
        set list [lreplace [K $list [set list {}]] $index $index $arg]
    }
}

# list::longestCommonSubsequence --
#
#       Computes the longest common subsequence of two lists.
#
# Parameters:
#       sequence1, sequence2 -- Two lists to compare.
#
# Results:
#       Returns a list of two lists of equal length. 
#       The first sublist is of indices into sequence1, and the
#       second sublist is of indices into sequence2.  Each corresponding
#       pair of indices corresponds to equal elements in the sequences;
#       the sequence returned is the longest possible.
#
# Side effects:
#       None.

proc list::longestCommonSubsequence { sequence1 sequence2 } {

    # Construct a set of equivalence classes of lines in file 2

    set index 0
    foreach string $sequence2 {
        lappend eqv($string) $index
        incr index
    }

    # K holds descriptions of the common subsequences.
    # Initially, there is one common subsequence of length 0,
    # with a fence saying that it includes line -1 of both files.
    # The maximum subsequence length is 0; position 0 of
    # K holds a fence carrying the line following the end
    # of both files.

    lappend K [list -1 -1 {}]
    lappend K [list [llength $sequence1] [llength $sequence2] {}]
    set k 0

    # Walk through the first file, letting i be the index of the line and
    # string be the line itself.

    set i 0
    foreach string $sequence1 {

        # Consider each possible corresponding index j in the second file.

        if { [info exists eqv($string)] } {

            # c is the candidate match most recently found, and r is the
            # length of the corresponding subsequence.
            
            set c [lindex $K 0]
            set r 0

            foreach j $eqv($string) {
                
                # Perform a binary search to find a candidate common
                # subsequence to which may be appended this match.
                
                set max $k
                set min $r
                set s [expr { $k + 1 }]
                while { $max >= $min } {
                    set mid [expr { ( $max + $min ) / 2 }]
                    set bmid [lindex [lindex $K $mid] 1]
                    if { $j == $bmid } {
                        break
                    } elseif { $j < $bmid } {
                        set max [expr {$mid - 1}]
                    } else {
                        set s $mid
                        set min [expr { $mid + 1 }]
                    }
                }

                # Go to the next match point if there is no suitable
                # candidate.

                if { $j == [lindex [lindex $K $mid] 1] || $s > $k} {
                    continue
                }
                
                # s is the sequence length of the longest sequence
                # to which this match point may be appended. Make
                # a new candidate match and store the old one in K
                # Set r to the length of the new candidate match.
                
                set newc [list $i $j [lindex $K $s]]
                lset K $r $c
                set c $newc
                set r [expr { $s + 1 }]
                
                # If we've extended the length of the longest match,
                # we're done; move the fence.

                if { $s >= $k } {
                    lappend K [lindex $K end]
                    incr k
                    break
                }
                
            }

            # Put the last candidate into the array

            lset K $r $c
            
        }
        
        incr i

    }

    set q [lindex $K $k]

    for { set i 0 } { $i < $k } {incr i } {
        lappend seta {}
        lappend setb {}
    }
    while { [lindex $q 0] >= 0 } {
        incr k -1
        lset seta $k [lindex $q 0]
        lset setb $k [lindex $q 1]
        set q [lindex $q 2]
    }

    return [list $seta $setb]
        
}

# MAIN PROGRAM
#
# Usage:
#       diff.tcl file1 file2
#
# Output:
#       Puts out a list of lines consisting of:
#               n1<TAB>n2<TAB>line
#
#       where n1 is a line number in the first file, and n2 is a line number in the second file.
#       The line is the text of the line.  If a line appears in the first file but not the second,
#       n2 is omitted, and conversely, if it appears in the second file but not the first, n1
#       is omitted.

# Open the files and read the lines into memory

set f1 [open [lindex $argv 0] r]
set lines1 [split [read $f1] \n]
close $f1
set f2 [open [lindex $argv 1] r]
set lines2 [split [read $f2] \n]
close $f2

set i 0
set j 0

foreach { x1 x2 } [list::longestCommonSubsequence $lines1 $lines2] {
    foreach p $x1 q $x2 {
        while { $i < $p } {
            set l [lindex $lines1 $i]
            puts "[incr i]\t\t$l"
        }
        while { $j < $q } {
            set m [lindex $lines2 $j]
            puts "\t[incr j]\t$m"
        }
        set l [lindex $lines1 $i]
        puts "[incr i]\t[incr j]\t$l"
    }
}
while { $i < [llength $lines1] } {
    set l [lindex $lines1 $i]
    puts "[incr i]\t\t$l"
}
while { $j < [llength $lines2] } {
    set m [lindex $lines2 $j]
    puts "\t[incr j]\t$m"
}

VPT: Brilliant! I was going to try porting Tim Peters ndiff.py from the Python Tools/Scripts directory but have yet to find the time. He argues that he has an approach that is more human friendly, i.e. more in sync with what a person would consider the differences.

davidw: diff in Tcl is cool, very portable, doesn't require compilation, and so on, but there is an incorrect assumption in the above quote. If your program is not linked with the GPL code, you do not have a problem. Only if you incorporate GPL code in your program (linking or cut'n'paste) is there a problem. Simply executing a program doesn't expose you to the GPL's requirements.

Another Diff Program

This is a small diff program I wrote a few years ago. I needed a simple diff program without resorting to external programs (I wanted it to work unchanged on windows), where neither speed nor minimal changes were critical (it was part of a moo verb editor, so the diff-ed entities were at most around 60 or 70 lines). It gives reasonable results; from what I recall (I haven't used it much lately) the biggest problem is that moving a line will result in everything between the new line location and the old location showing up as deleted then added after the change.

The operation is really simple - roughly, loop over the lines in the new file, for each line look forward from the current point in the old file for the same line. If a match is not found, mark the current line as new, if its found further along mark the intervening stuff as deleted. Adjust the position in the old file and repeat. Not exactly accurate since I'm going by old fuzzy memory :)

proc adj args {
    eval .diff.orig yview $args
    eval .diff.new yview $args
}

proc wdiff {l1 l2} {
    catch {destroy .diff}
    toplevel .diff;
  
    menu .diff.mbar;
  
    menu .diff.mbar.file
    .diff.mbar.file add command -label Close -command {destroy .diff}
    .diff.mbar add cascade -menu .diff.mbar.file -label File
  
    .diff configure -menu .diff.mbar;
  
    text .diff.orig -yscrollcommand {.diff.scr set}
    text .diff.new  -yscrollcommand {.diff.scr set}
    scrollbar .diff.scr -command adj
  
    pack .diff.orig .diff.scr .diff.new -side left
    pack .diff.scr -expand t -fill y
  
    .diff.new  tag configure del -background green  -foreground black
    .diff.orig tag configure del -background green  -foreground black
    .diff.new  tag configure ins -background red    -foreground black
    .diff.orig tag configure ins -background red    -foreground black
    .diff.new  tag configure cha -background yellow -foreground black
    .diff.orig tag configure cha -background yellow -foreground black
  
    set len1 [llength $l1]
    set len2 [llength $l2]
    for {
        set s1 [lindex $l1 [set c 0]]
        set s2 [lindex $l2 [set d 0]]
    } {$c<$len1 || $d<$len2} {
        set s1 [lindex $l1 [incr c]]
        set s2 [lindex $l2 [incr d]]
    } {
        if {$c>=$len1} {
            # 2col {} "> $s2"
            2col {} del $s2 ins 
            # puts "i$d> $s2"
            continue;
        }
        if {$d>=$len2} {
            # puts "d$c> $s1";
            2col $s1 ins {} del;
            continue;
        }
        if {![string compare $s1 $s2]} {
            # puts "$s1 $s2"
            2col $s1 {} $s2 {}
            continue;
        }
        for {set cc $c} {$cc < $len1} {incr cc} {
            if {![string compare [lindex $l1 $cc] $s2]} {
                for {set n $c} {$n<$cc} {incr n} {
                    2col [lindex $l1 $n] ins {} del
                }
                # 2col $s1 <
                # puts "d$cc< $s1"
                set c $cc
                break
            }
        }
        if {$cc == $c} {
            incr c -1
            incr d -1
            continue
        }
        for {set cd $d} {$cd < $len2} {incr cd} {
            if {![string compare [lindex $l2 $cd] $s1]} {
                # puts "add $d-$cd"
                for {set n $d} {$n<$cd} {incr n} {
                  2col {} del [lindex $l2 $n] ins
                }
                # puts "i$cd> $s2"
                set d $cd
                break
            }
        }
        if {$cd == $d} {
            incr c -1
            incr d -1
            continue
        }
        2col $s1 cha $s2 cha
        # puts "c$c< $s1"
        # puts "c$d> $s2"
    }
    .diff.orig configure -state disabled
    .diff.new configure -state disabled
}

proc 2col {a at b bt} {
    .diff.orig insert end "$a\n"
    if {$at ne {}} {
        .diff.orig tag add $at {end-2 lines} {end-1 lines}
    }
    .diff.new insert end "$b\n"
    if {$bt ne {}} {
        .diff.new tag add $bt {end-2 lines} {end-1 lines}
    }
}
#comment

proc fdiff {a b} {
    set h [open $a]
    set l1 {}
    while {![eof $h]} {lappend l1 [gets $h]}
    set h [open $b]
    set l2 {}
    while {![eof $h]} {lappend l2 [gets $h]}
    wdiff $l1 $l2
}

Current Recommendation

RLE 2012-06-30: Recent versions of tcllib provide all the necessary infrastructure for creating a "diff" output between two strings. See the example below:

package require Tk
package require struct::list

text .diffview
pack .diffview -side top -expand true -fill both

.diffview tag configure inserted -underline true
.diffview tag configure deleted  -overstrike true

# text strings to "difference":

set string1 {the quick brown fox jumped over the lazy dog.}
set string2 {The slow yellow dog jumped over the lazy fox.}

# if {1} = character level differences
# if {0} = word level differences
     
if 0 {
    set list1 [split $string1 {}]
    set list2 [split $string2 {}]
} else {
    set list1 [regexp -all -inline {\S+|\s+} $string1]
    set list2 [regexp -all -inline {\S+|\s+} $string2]
}
 
# these next two lines perform the "diff" operation

set lcsdata  [::struct::list longestCommonSubsequence $list1 $list2]

set diffdata [::struct::list lcsInvertMerge $lcsdata [
    llength $list1] [llength $list2]]

# format the result into the text widget:

foreach item $diffdata {
    lassign $item kind idx1 idx2
    switch -exact $kind {
        added {
            .diffview insert end [join [lrange $list2 {*}$idx2] {}] inserted
        }
        deleted {
            .diffview insert end [join [lrange $list1 {*}$idx1] {}] deleted
        }
        changed {
            .diffview insert end [join [lrange $list1 {*}$idx1] {}] deleted
            .diffview insert end [join [lrange $list2 {*}$idx2] {}] inserted
        }
        unchanged {
            .diffview insert end [join [lrange $list1 {*}$idx1 ] {}] {}
        }
    }
}

WJG 2018-05-28: I liked this solution so much that I wrapped it into a single proc that returns markup strings.

proc stringDiff {str1 str2 {opt -word}} {
    if {[catch {
        package present struct::list
    }]} {
        package require struct::list
    }
         
    if {$opt eq {-char}} {
        # character level
        set list1 [split $str1 {}]
        set list2 [split $str2 {}]
    } elseif { $opt eq "-word" } {
        # word level
        set list1 [regexp -all -inline {\S+|\s+} $str1]
        set list2 [regexp -all -inline {\S+|\s+} $str2]
    } else {
        error "Invalid option \"$opt\". Must be one of -char or -word (default)."
    }
         
    # these next two lines perform the "diff" operation
    set lcsdata  [::struct::list longestCommonSubsequence $list1 $list2]
    set diffdata [::struct::list lcsInvertMerge $lcsdata [
        llength $list1] [llength $list2]]
         
    # format the result into a markup string
    foreach item $diffdata {
        lassign $item kind idx1 idx2
        switch -exact $kind {
            added {
                append res "<u>[join [lrange $list2 {*}$idx2] {}]</u>"
            }
            deleted {
                append res "<s>[join [lrange $list1 {*}$idx1] {}]</s>"
            }
            changed {
                append res "<s>[join [lrange $list1 {*}$idx1] {}]</s>"
                append res "<u>[join [lrange $list2 {*}$idx2] {}]</u>"
            }
            unchanged {
                append res [join [lrange $list1 {*}$idx1] {}]
            }
        }
    }

    return $res
}

Category Application

Category Algorithm

Category File

diff in Tcl

See Also

Description

Another Diff Program

Current Recommendation