Longest common substring

aricb 2008-11-12 The longest common substring problem is to find the longest contiguous sequence of symbols shared by two strings. Wikipedia has a nice explanation of the problem [1 ].

The difference between a longest common substring and a longest common subsequence is the criterion of contiguity. The symbols that make up a longest common substring must appear in both strings as a contiguous string. The symbols that make up a longest common subsequence need only appear in the same order in each string; there may be intervening symbols in between. Given the strings "hippopotamus" and "rhinoceros", the longest common substring is "hi", while the longest common subsequence is "hioos".

The algorithm used below uses "dynamic programming" (also called memoing or tabulation) to find the longest common substring(s). The basic approach used here is described in the Wikipedia article referenced above. One difference from the algorithm described on Wikipedia is that the proc below never retains more than two rows of the dynamic programming matrix at any time (it is never necessary to look farther back than the previous row, so we save memory this way). This proc is written to find substrings within strings, but could easily be rewritten to find sublists within lists.

  proc longestCommonSubstring {s t} {
    set initialrow [lrepeat [expr {[string length $t] + 1}] 0]
    set prevrow $initialrow
    set result [list]
    set longestlength 0
    foreach schar [split $s ""] {
      set j 0
      set row $initialrow
      foreach tchar [split $t ""] {
        incr j
        if {$tchar eq $schar} {
          set substringlength [expr {1 + [lindex $prevrow [expr {$j - 1}]]}]
          lset row $j $substringlength
          if {$substringlength > $longestlength} {
            set longestlength $substringlength
            set result [list [string range $t [expr {$j - $longestlength}] [expr {$j - 1}]]]
          } elseif {$substringlength == $longestlength} {
            lappend result [string range $t [expr {$j - $longestlength}] [expr {$j - 1}]]
          }
        }
      }
      set prevrow $row
    }
    return $result
  }

If you aren't interested in substrings smaller than a particular size, you can save some time by breaking out of the loop early if it becomes apparent that no substring of length $minlength or larger is present. This is only useful if $minlength > 1; there is a very slight performance penalty imposed by [break] in the case of $minlength == 1 (which in practice is no different from $minlength == 0).

  proc longestCommonSubstring2 {s t {minlength 0}} {
    if {[string length $s] > [string length $t]} {
      set u $s
      set s $t
      set t $u
    }
    set initialrow [lrepeat [expr {[string length $t] + 1}] 0]
    set prevrow $initialrow
    set result [list]
    set longestlength 0
    set ithreshold [expr {$minlength - [string length $s]}]
    foreach schar [split $s ""] {
      incr ithreshold
      set j 0
      set row $initialrow
      foreach tchar [split $t ""] {
        incr j
        if {$tchar eq $schar} {
          set substringlength [expr {1 + [lindex $prevrow [expr {$j - 1}]]}]
          lset row $j $substringlength
          if {$substringlength > $longestlength} {
            set longestlength $substringlength
            set result [list $j]
          } elseif {$substringlength == $longestlength} {
            lappend result $j
          }
        }
      }
      set prevrow $row
      if {$ithreshold > $longestlength} {
        break
      }
    }
    set substrings [list]
    foreach index $result {
      lappend substrings [string range $t [expr {$index - $longestlength}] [expr {$index - 1}]]
    }
    return $substrings
  }