AMG: I guess RS wanted this page, so here goes... :^)
RS: Oops, sorry, what I meant was Shuffle a list ...
AMG: Ah, well, at least I got to write another un/sorting story page.
This page discusses shuffling algorithms and implementations thereof.
Here's a proc, adapted from unsort:
    proc shuffle {data} {
        set rand_data [list]
        foreach elem $data {
            lappend rand_data [list [expr {rand()}] $elem]
        }
        set result [list]
        foreach elem [lsort -real -index 0 $rand_data] {
            lappend result [lindex $elem 1]
        }
        return $result
    }

It works, but it seems suboptimal.
For starters, the real sort bothers me. I imagine sorting by integer would be more efficient, but then again producing random integers is (currently) slower than producing random reals. Change rand() for int(0x7fffffff * rand()) and -real for -integer and it winds up taking longer. See the benchmark below. (By the way, how am I supposed to portably get the maximum and minimum integer?)
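For reference, the integer-key variant described above might look like the following. This is a reconstruction from the description, not necessarily the exact code that was benchmarked, and the name shuffle_int is made up for this sketch.

```tcl
# Variant #2: same approach as the proc above, but with random integer
# sort keys. shuffle_int is a hypothetical name used only for this sketch.
proc shuffle_int {data} {
    set rand_data [list]
    foreach elem $data {
        # Pair each element with a random integer key.
        lappend rand_data [list [expr {int(0x7fffffff * rand())}] $elem]
    }
    set result [list]
    # Sort the pairs by key, then strip the keys off again.
    foreach elem [lsort -integer -index 0 $rand_data] {
        lappend result [lindex $elem 1]
    }
    return $result
}
```

Only the key expression and the lsort option differ from the first proc.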
Next, having up to three copies of the list (one of which is augmented) appears to be a waste of RAM.
So would it be better to do an in-place shuffle? For each element of $data it could swap with another, randomly-selected element of $data. I'll take a shot at it:
    proc shuffle {data} {
        set length [llength $data]
        for {set idx_1 0} {$idx_1 < $length} {incr idx_1} {
            set idx_2 [expr {int($length * rand())}]
            set temp [lindex $data $idx_1]
            lset data $idx_1 [lindex $data $idx_2]
            lset data $idx_2 $temp
        }
        return $data
    }
Big improvement.
Here's my magical test data:
set data [lrepeat 100000 x y z z y]
If you're trying this at home from your interactive tclsh, you might want to type "; puts -nonewline {}" after the end of the above line or else its printed result will scroll for a very long time.
Now I time the shuffles:
    ..................................................
    :                    :   400 MHz   :  1500 MHz   :
    :....................:.............:.............:
    : 1, random reals    : 21.847024 s :  9.541117 s :
    : 2, random integers : 24.649960 s :  9.857447 s :
    : 3, in-place swaps  :  7.650484 s :  2.328508 s :
    :....................:.............:.............:
Wow, I think I will go with #3! Hold on while I update unsort... :^)
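The timing commands themselves aren't shown on this page. One way to take such a measurement, assuming Tcl's built-in time command was used (the "microseconds per iteration" wording further down suggests so), is sketched here with shuffle #3:

```tcl
# Shuffle #3 from above, repeated so that this snippet runs on its own.
proc shuffle {data} {
    set length [llength $data]
    for {set idx_1 0} {$idx_1 < $length} {incr idx_1} {
        set idx_2 [expr {int($length * rand())}]
        set temp [lindex $data $idx_1]
        lset data $idx_1 [lindex $data $idx_2]
        lset data $idx_2 $temp
    }
    return $data
}

# The same test data as above: 500,000 elements.
set data [lrepeat 100000 x y z z y]

# time evaluates the script the given number of times and returns a
# string of the form "N microseconds per iteration".
set timing [time {shuffle $data} 1]
puts $timing
```

A higher iteration count averages out some of the noise, at the cost of a longer wait.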
AMG: Per Lars H's suggestion, I improved the code again. As Lars explains in unsort, the old code doesn't have a perfectly uniform distribution. For a list of n elements it makes n random choices among n positions, giving n^n equally likely sequences of swaps, while there are only n! possible permutations. For n > 2, n^n is not a multiple of n!, so some permutations must come up more often than others. This is similar to doing "random() % 3" in C and getting 0's more often than 2's.

    proc shuffle {data} {
        set length [llength $data]
        for {} {$length > 1} {incr length -1} {
            set idx_1 [expr {$length - 1}]
            set idx_2 [expr {int($length * rand())}]
            set temp [lindex $data $idx_1]
            lset data $idx_1 [lindex $data $idx_2]
            lset data $idx_2 $temp
        }
        return $data
    }
That ought to do it! However, it's (theoretically) a little bit slower due to modifying more variables in the inner loop. I also made a less-readable version that incr's idx_1 rather than recalculating it with expr, for comparison's sake. Here are the numbers, using the same benchmark.
    ..................................................
    :                    :   400 MHz   :  1500 MHz   :
    :....................:.............:.............:
    : 4 , more random    :  8.428660 s :  2.286190 s :
    : 4a, less readable  :  7.824165 s :  2.163990 s :
    :....................:.............:.............:
For reasonably-sized lists (under 500,000 elements, I guess), the speed difference between #3, #4, and #4a is negligible. So I choose #4, which has a more uniform distribution and doesn't have grody sources due to silly optimizations.
I wonder why the 400 MHz results for #4 and #4a are slower than for #3 yet the 1500 MHz results are the other way around...? I guess this proc is random in more than one way.
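The difference in uniformity between #3 and #4 can be checked by brute force on a three-element list. The sketch below replaces rand() with every possible sequence of choices and counts how often each permutation comes out. The helper names swap, count_naive, and count_uniform are invented for this illustration, and dict requires Tcl 8.5.

```tcl
# Swap elements i and j of the list stored in the caller's variable.
proc swap {listVar i j} {
    upvar 1 $listVar l
    set temp [lindex $l $i]
    lset l $i [lindex $l $j]
    lset l $j $temp
}

# Shuffle #3: each position swaps with any of the 3 positions,
# so there are 3*3*3 = 27 equally likely runs.
proc count_naive {} {
    set counts [dict create]
    foreach j0 {0 1 2} {
        foreach j1 {0 1 2} {
            foreach j2 {0 1 2} {
                set l {a b c}
                swap l 0 $j0
                swap l 1 $j1
                swap l 2 $j2
                dict incr counts $l
            }
        }
    }
    return $counts
}

# Shuffle #4: the last unshuffled position swaps only with itself or a
# lower position, so there are 3*2 = 6 equally likely runs.
proc count_uniform {} {
    set counts [dict create]
    foreach j2 {0 1 2} {
        foreach j1 {0 1} {
            set l {a b c}
            swap l 2 $j2
            swap l 1 $j1
            dict incr counts $l
        }
    }
    return $counts
}

puts "naive:   [count_naive]"
puts "uniform: [count_uniform]"
```

With #3, the 27 runs cannot be split evenly over 6 permutations, so some permutations arise from 4 runs and others from 5. With #4, each of the 6 runs yields a different permutation.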
SMP: Here is what 4a would look like and a re-benchmark. I think a 4.1% speedup is worth it. I deal with list sizes in the millions. I also use upvar to prevent passing gigantic strings by value (I don't know if Tcl tries to optimize or not).
    proc shuffle {in_list_ out_list_} {
        upvar 1 $in_list_ in_list
        upvar 1 $out_list_ out_list
        set out_list $in_list
        for {set idx_1 [expr {[llength $out_list] - 1}]} {$idx_1 > 0} {incr idx_1 -1} {
            set idx_2 [expr {int(($idx_1 + 1) * rand())}]
            set temp [lindex $out_list $idx_1]
            lset out_list $idx_1 [lindex $out_list $idx_2]
            lset out_list $idx_2 $temp
        }
    }
Tested using the benchmark method above.
    ..........................................................
    : 4                  : 365282 microseconds per iteration :
    : 4a                 : 350104 microseconds per iteration :
    :....................:...................................:
Here's another algorithm. Basically you randomly select an element from the input list, remove it from the input list, and place it at the end of the output list. The reason I thought of it is that unsort doesn't need to maintain an output list; it can simply puts each element as it is selected.
But this algorithm involves a lot of copying: lset is nice for in-place modification, but there's no in-place element removal. So it's liable to be slower than #4.
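No code for this algorithm appears here, so the following is only a minimal sketch of the description above. The name shuffle_by_removal is made up, and lreplace is assumed as the way to remove the chosen element.

```tcl
# Select-and-remove shuffle: repeatedly pick a random element of the
# input, append it to the output, and delete it from the input.
# shuffle_by_removal is a hypothetical name used only for this sketch.
proc shuffle_by_removal {data} {
    set result [list]
    while {[llength $data] > 0} {
        set idx [expr {int([llength $data] * rand())}]
        lappend result [lindex $data $idx]
        # lreplace with no replacement elements deletes the range idx..idx.
        # It returns a new list, which is the copying mentioned above.
        set data [lreplace $data $idx $idx]
    }
    return $result
}

puts [shuffle_by_removal {a b c d e}]
```

For the unsort case, the lappend line could be replaced by a puts of the selected element, and the result list dropped.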