Contest - finding repeating substrings

AM (23 december 2004) Here is a little puzzle for the days around Christmas and the Gregorian New Year:

Given a piece of text (where all whitespace has been replaced by single spaces, to make it less ambiguous), such as the description of the Endekalogue, are there any substrings of given length (longer than, say, 10 characters) that appear at least twice?

Requirements:

  • It should be a pure Tcl script that finds the answer for an arbitrary piece of text
  • It should be "best" in some way

The best solutions are either:

  • the shortest (smallest size of the code for any artistic but consistent definition of size)
  • the fastest (as measured by the [time] command)
  • the least number of command executions (as measured by the [info cmdcount] command)
  • the most fantastic use of regular expressions
  • the most obfuscated
  • the most elegant variation on the question (such as presenting a histogram of repeated substrings during the computation ...)
  • the best in any other challenging category

Rewards for each category: honorary mention on the Wiki

Add your solutions below!

Disclaimer: I have no other intention with this contest than the contest itself - it just struck me that I would not know how to solve it using [regexp], but I considered the use of regexps for about 10 seconds ....


adavis Does this fit the bill?...

 # Text to be checked for repeated substrings.
 set testtext "well then, has this got some repeated strings? this is a test string"

 # Length of required substrings minus one.
 set findlength 3

 for {set i 0} {$i <= [expr {[string length $testtext] - $findlength}]} {incr i} {
    set teststring [string range $testtext $i [expr {$i + $findlength}]]
    if {[catch {incr result($teststring)}]} {set result($teststring) 1}
 }

 parray result

...I've used short text and substring for clarity.

AM Yes, seems a winner sofar (23 december 10:10 GMT) :)

AM I should have thought a bit longer about regexps, here is a solution with a single pattern:

 set testtext "..." ;# As above
 regexp {(.{3,}).*\1} $text => str
 puts "$str"

gives: " th" after thinking very hard ... Not the longest, but at least one that is repeated.

(As for speed:

 [AM] solution: 10387172 microseconds per iteration
 [adavis] solution: 4946 microseconds per iteration

)


Peter Spjuth: I tried the variation of finding the longest.

Inspired by Arjen's regexp I did this very slow thingy:

 set minlength 4
 set start 0
 while {[regexp -indices -start $start "(.{$minlength,}).*\\1" $testtext => ixs]} {
    foreach {si ei} $ixs break
    set minlength [expr {$ei - $si + 2}]
    set start [expr {$si + 1}]
 }
 puts \"[string range $testtext $si $ei]\"

A faster one:

 # Find the longest repeated substring
 proc findLongestSubstring {str {minlen 5}} {
    set len [string length $str]
    set first 0
    set last -1

    for {set t 0 ; set u [expr {$minlen - 1}]} {$u < $len} {incr u} {
        if {[string first [string range $str $t $u] $str [expr {$u + 1}]] >= 0} {
            set first $t
            set last  $u
        } else {
            incr t
        }
    }

    string range $str $first $last
 }

DKF: Hmm, why is the regexp version so slow? This experiment is a little better, but still not great...

 for {set i 0;set len 0} {$i+$len < [string length $testtext]} {incr i} {
    regexp -start $i {(.+).*\1} $testtext -> matched
    if {[string length $matched] > $len} {
       puts >>$matched<<
       set len [string length $matched]
    }
 }