Longest common substring

aricb 2008-11-12: The longest common substring problem is to find the longest contiguous sequence of symbols shared by two strings. Wikipedia has a nice explanatation of the problem.

Description

The difference between a longest common substring and a longest common subsequence is the criterion of contiguity. The symbols that make up a longest common substring must appear in both strings as a contiguous string. The symbols that make up a longest common subsequence need only appear in the same order in each string; there may be intervening symbols in between. Given the strings "hippopotamus" and "rhinoceros", the longest common substring is "hi", while the longest common subsequence is "hioos".

The algorithm used below uses "dynamic programming" (also called memoing or tabulation) to find the longest common substring(s). The basic approach used here is described in the Wikipedia article referenced above. One difference from the algorithm described on Wikipedia is that the proc below never retains more than two rows of the dynamic programming matrix at any time (it is never necessary to look farther back than the previous row, so we save memory this way). This proc is written to find substrings within strings, but could easily be rewritten to find sublists within lists.

proc longestCommonSubstring {s t} {
        set initialrow [lrepeat [expr {[string length $t] + 1}] 0]
        set prevrow $initialrow
        set result [list]
        set longestlength 0
        set t [split $t[set t {}] {}]
        foreach schar [split $s ""] {
                set j 0
                set row $initialrow
                foreach tchar $t {
                        incr j
                        if {$tchar eq $schar} {
                                set substringlength [expr {1 + [lindex $prevrow [expr {$j - 1}]]}]
                                        set longestlength $substringlength
                                        set result [list [string range $t [
                                                expr {$j - $longestlength}] [expr {$j - 1}]]]
                                } elseif {$substringlength == $longestlength} {
                                        lappend result [string range $t [
                                                expr {$j - $longestlength}] [expr {$j - 1}]]
                                }
                        }
                }
                set prevrow $row
        }
        return $result
}

If you aren't interested in substrings smaller than a particular size, you can save some time by breaking out of the loop early if it becomes apparent that no substring of length $minlength or larger is present. This is only useful if $minlength > 1; there is a very slight performance penalty imposed by break in the case of $minlength == 1 (which in practice is no different from $minlength == 0).

proc longestCommonSubstring2 {s t {minlength 0}} {
        if {[string length $s] > [string length $t]} {
                set u $s
                set s $t
                set t $u
        }
        set initialrow [lrepeat [expr {[string length $t] + 1}] 0]
        set prevrow $initialrow
        set result [list]
        set longestlength 0
        set ithreshold [expr {$minlength - [string length $s]}]
        set t [split $t[set t {}] {}]
        foreach schar [split $s {}] {
                incr ithreshold
                set j 0
                set row $initialrow
                foreach tchar $t {
                        incr j
                        if {$tchar eq $schar} {
                                set substringlength [
                                        expr {1 + [lindex $prevrow [expr {$j - 1}]]}]
                                lset row $j $substringlength
                                if {$substringlength > $longestlength} {
                                        set longestlength $substringlength
                                        set result [list $j]
                                } elseif {$substringlength == $longestlength} {
                                        lappend result $j
                                }
                        }
                }
                set prevrow $row
                if {$ithreshold > $longestlength} {
                        break
                }
        }
        set substrings [list]
        foreach index $result {
                lappend substrings [string range $t [
                        expr {$index - $longestlength}] [expr {$index - 1}]]
        }
        return $substrings
}

Category Algorithm

Category String Processing

Category Word and Text Processing

Longest common substring

See Also

Description