[AM] (23 december 2004) Here is a little puzzle for the days around Christmas and the Gregorian New Year: Given a piece of text (where all whitespace has been replaced by single spaces, to make it less ambiguous), such as the description of the [Endekalogue], are there any substrings of given length (longer than, say, 10 characters) that appear at least twice? Requirements: * It should be a pure Tcl script that finds the answer for an arbitrary piece of text * It should be "best" in some way The best solutions are either: * the shortest (smallest size of the code for any artistic but consistent definition of ''size'') * the fastest (as measured by the [[time]] command) * the least number of command executions (as measured by the [[info cmdcount]] command) * the most fantastic use of regular expressions * the most obfuscated * the most elegant variation on the question (such as presenting a histogram of repeated substrings during the computation ...) * the best in any other challenging category Rewards for each category: honorary mention on the Wiki '''Add your solutions below!''' ''Disclaimer:'' I have no other intention with this contest than the contest itself - it just struck me that I would not know how to solve it using [[regexp]], but I considered the use of regexps for about 10 seconds .... ---- [adavis] Does this fit the bill?... # Text to be checked for repeated substrings. set testtext "well then, has this got some repeated strings? this is a test string" # Length of required substrings minus one. set findlength 3 for {set i 0} {$i <= [expr {[string length $testtext] - $findlength}]} {incr i} { set teststring [string range $testtext $i [expr {$i + $findlength}]] if {[catch {incr result($teststring)}]} {set result($teststring) 1} } parray result ...I've used short text and substring for clarity. [AM] Yes, seems a winner sofar (23 december 10:10 GMT) :) [AM] I should have thought a bit longer about regexps, here is a solution with a single pattern: set testtext "..." ;# As above regexp {(.{3,}).*\1} $text => str puts "$str" gives: " th" after thinking very hard ... Not the longest, but at least one that is repeated. (As for speed: [AM] solution: 10387172 microseconds per iteration [adavis] solution: 4946 microseconds per iteration ) ---- [Peter Spjuth]: I tried the variation of finding the longest. Inspired by Arjen's regexp I did this very slow thingy: set minlength 4 set start 0 while {[regexp -indices -start $start "(.{$minlength,}).*\\1" $testtext => ixs]} { foreach {si ei} $ixs break set minlength [expr {$ei - $si + 2}] set start [expr {$si + 1}] } puts \"[string range $testtext $si $ei]\" A faster one: # Find the longest repeated substring proc findLongestSubstring {str {minlen 5}} { set len [string length $str] set first 0 set last -1 for {set t 0 ; set u [expr {$minlen - 1}]} {$u < $len} {incr u} { if {[string first [string range $str $t $u] $str [expr {$u + 1}]] >= 0} { set first $t set last $u } else { incr t } } string range $str $first $last } ---- [DKF]: Hmm, why is the [regexp] version so slow? This experiment is a little better, but still not great... for {set i 0;set len 0} {$i+$len < [string length $testtext]} {incr i} { regexp -start $i {(.+).*\1} $testtext -> matched if {[string length $matched] > $len} { puts >>$matched<< set len [string length $matched] } } <>Uncategorized