Version 1 of WordDiff

Updated 2002-11-25 14:43:36

To create the WikiDiff and make it show point-edits in wiki pages, I needed a way to generate a list of changed words.

My first thought was to use Tcl, and maybe diff in tcl. But alas, that is currently too slow. Next up was the ever-trustworthy Unix command diff(1). Sadly, that compares line by line and doesn't do word-by-word.

But that can be overcome by just chopping the entire wiki page into words with [split $page " "] and handing those to diff, one word per line:

 set linesep "*()%^&" ;# or something else unlikely to occur in a page

 # Spaces around the linesep so that it does not stay attached to words.
 # A lot of changes occur at EOL; it wouldn't look nice if diff thought
 # that the linesep is part of a normal 'word'.
 # string map is used rather than regsub because "&" is special in a
 # regsub replacement string (it stands for the matched text).
 set old [string map [list \n " $linesep "] $oldtext]
 set old [join [split $old " "] \n]

 set new [string map [list \n " $linesep "] $newtext]
 set new [join [split $new " "] \n]

 #log $new
 set diffs [split [doDiff $old $new] \n]
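
To make the effect of this step concrete, here is a small sketch that wraps the transformation in a helper and prints the result for a two-line page. The proc name wordsPerLine is made up for this example and does not appear in the original code.

```tcl
# Hypothetical helper, not part of the original code.
proc wordsPerLine { text linesep } {
    # replace each newline by the separator, padded with spaces
    set text [string map [list \n " $linesep "] $text]
    # one word per line, ready to be handed to diff(1)
    return [join [split $text " "] \n]
}

set linesep "*()%^&"
puts [wordsPerLine "hello world\nbye" $linesep]
# prints four lines:
#   hello
#   world
#   *()%^&
#   bye
```

The separator ends up on a line of its own, so diff treats a changed line break like any other changed word.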

And then to do the actual diff: [doDiff old new]

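    proc doDiff { oldtext newtext } {
        # Run diff(1) on the two strings and return its output.
        set base "/tmp/worddiff.[pid]"

        set fp [open $base.old w]
        puts -nonewline $fp $oldtext
        close $fp

        set fp [open $base.new w]
        puts -nonewline $fp $newtext
        close $fp

        # diff exits with 0 (no differences), 1 (differences) or 2 (trouble).
        # exec raises an error for any non-zero exit status, so catch it and
        # only look at ::errorCode when an error actually occurred; otherwise
        # it may still hold the value of some earlier, unrelated error.
        set diff ""
        set failed [catch {
            exec -- /usr/bin/diff $base.old $base.new > $base.diff
        } res]
        if { $failed } { set status [lindex $::errorCode 2] } else { set status 0 }

        if { $status == 1 } {
            # plain open/read/close instead of the undefined helper fileread
            set fp [open $base.diff r]
            set diff [read $fp]
            close $fp
        }
        # clean up before reporting any error, so no temp files are left behind
        file delete -force $base.old $base.new $base.diff
        if { $failed && $status != 1 } { error $res }

        return $diff
    }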

That was the easy bit, and it works rather nicely. Diff produces a list of changes with positions in the old text, so I turn oldtext back into a list by splitting it on \n. (Actually, I forgot that at first, which produced very interesting results that made my eyes cross.)

To make life easier later on, I take the diff output and oldtext (now a list) and create a new list of tagged words. Each word that is mentioned in the diff output gets a tag, new or old, simply by setting its list entry to {word new} or {word old}. The end result of [makeTagDiff] is:

{this is a sample list {of old} {with new} some text {which old} {isn't old} {very old} {good. old} {that new} {I new} {like. new}}
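
The page does not show makeTagDiff itself, so the following is only a guess at how such a proc could work, not the author's code. It assumes diff's default ("normal") output format, in which every hunk starts with a header such as 3c3, 5,7d4 or 2a3,4 and added lines are prefixed with "> ". The argument names are invented for this sketch.

```tcl
# Hypothetical sketch of makeTagDiff; the real one is in worddiff.tcl.txt.
# oldwords: the old text as a list of words
# diffs:    the output of diff(1), split into lines
proc makeTagDiff { oldwords diffs } {
    set result {}
    set pos 0   ;# index of the next old word that has not been copied yet
    foreach line $diffs {
        if { [regexp {^(\d+)(?:,(\d+))?([acd])} $line -> from to op] } {
            if { $to eq "" } { set to $from }
            # "NaM" means "insert after old line N", so word N is unchanged
            set stop [expr { $op eq "a" ? $from : $from - 1 }]
            # copy the unchanged words in front of this hunk
            while { $pos < $stop } {
                lappend result [list [lindex $oldwords $pos]]
                incr pos
            }
            # deleted or changed words get the tag "old"
            if { $op ne "a" } {
                while { $pos < $to } {
                    lappend result [list [lindex $oldwords $pos] old]
                    incr pos
                }
            }
        } elseif { [string match "> *" $line] } {
            # a word that only exists in the new text
            lappend result [list [string range $line 2 end] new]
        }
        # "< ..." lines, "---" and "\ No newline ..." add nothing we need
    }
    # copy whatever follows the last hunk
    while { $pos < [llength $oldwords] } {
        lappend result [list [lindex $oldwords $pos]]
        incr pos
    }
    return $result
}

puts [makeTagDiff {a b c d} [list "2c2" "< b" "---" "> B" "4a5" "> e"]]
# prints: a {b old} {B new} c d {e new}
```

Building every entry with [list] keeps words that contain braces or quotes from breaking the later [lindex] calls, while plain words still look exactly like the entries in the example above.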

Creating this list was the hardest bit, mostly because I was too tired to notice some obvious bugs. (Did I mention I forgot to turn oldtext back into a list?!?) Words got dropped, inserted in the wrong places, etc.

After that, turning the tagged list into HTML isn't very hard:

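 set html {}
 set oldtag ""
 foreach word $taglist {
  set newtag [lindex $word 1]
  if { $newtag ne $oldtag } {
      # the tag changes here: close the open span, then open a new one
      set prefix ""
      if { $oldtag ne "" } { append prefix "</span>" }
      if { $newtag ne "" } { append prefix "<span class=\"$newtag\">" }
      lappend html "$prefix[lindex $word 0]"
      set oldtag $newtag
  } else {
      lappend html [lindex $word 0]
  }
 }
 # close a span that is still open after the last word
 if { $oldtag ne "" } { lappend html "</span>" }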

All that remains is:

 set html [join $html " "]
 # string map, because $linesep contains regexp metacharacters
 set html [string map [list $linesep \n] $html]

and [puts $html]

That is the simplified version. The actual code, http://pascal.scheffers.net/wikidiff/worddiff.tcl.txt, is a bit more complex, because I also want to create a 'context' diff, where only a couple of lines around each change are shown. Context diff is an interesting problem in its own right; see the code for that.
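
As a rough idea of what a context filter might look like, here is a sketch that works on the tagged word list and keeps only the entries within a few positions of a change. This is an assumption-laden simplification: the real code works with lines around each change rather than words, and the proc name contextOnly and the "..." gap marker are invented here.

```tcl
# Hypothetical sketch, not the code from worddiff.tcl.txt.
# Keeps entries at most $context positions away from a tagged word and
# replaces every skipped run by a single "..." marker.
proc contextOnly { taglist context } {
    set n [llength $taglist]
    # first pass: mark every position that is close to a change
    for { set i 0 } { $i < $n } { incr i } {
        if { [lindex $taglist $i 1] ne "" } {
            for { set j [expr { $i - $context }] } { $j <= $i + $context } { incr j } {
                if { $j >= 0 && $j < $n } { set near($j) 1 }
            }
        }
    }
    # second pass: copy marked entries, collapse the rest
    set result {}
    set inGap 0
    for { set i 0 } { $i < $n } { incr i } {
        if { [info exists near($i)] } {
            lappend result [lindex $taglist $i]
            set inGap 0
        } elseif { !$inGap } {
            lappend result "..."
            set inGap 1
        }
    }
    return $result
}

puts [contextOnly {a b c {d old} e f g} 1]
# prints: ... c {d old} e ...
```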

It isn't perfect. There are some weird things that happen when you add whitespace at the end of a line. That must have something to do with the way I change \n to a $linesep and back to \n again. And then there are things outside my control because, on complex edits, diff doesn't always generate the nicest output imaginable. That is okay for a diff file that you use to patch software, but not nice for me. And then there must also be other bugs.
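
One plausible contributor to the trailing-whitespace oddities (a guess, not something the author confirms) is that splitting on a single space turns every extra space into an empty 'word'. The snippet below shows this for a line that ends in a space, using the same separator as above.

```tcl
set linesep "*()%^&"

# "end" is followed by a trailing space before the newline
set text [string map [list \n " $linesep "] "end \nnext"]
set words [split $text " "]

puts [llength $words]
# prints 4: "end", an empty string, the separator, and "next"
```

Without the trailing space the same text yields three words, so diff sees the added whitespace as one extra (empty) line in front of the separator.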

-- PS 25nov02