unsort

AMG: unsort is a simple UNIXy stdin->stdout pipeline tool for randomizing line order. It's simple, works quite nicely, and is surprisingly useful.

Without further ado, I present to you the source:

 #!/bin/sh
 # The next line restarts with tclsh.\
 exec tclsh "$0" ${1+"$@"}

 proc main {} {
     set lines [lrange [split [read stdin] \n] 0 end-1]
     set count [llength $lines]

     for {set idx_1 0} {$idx_1 < $count} {incr idx_1} {
         set idx_2 [expr {int($count * rand())}]
         set temp [lindex $lines $idx_1]
         lset lines $idx_1 [lindex $lines $idx_2]
         lset lines $idx_2 $temp
     }

     puts [join $lines \n]
 }

 main

 # vim: set ts=4 sts=4 sw=4 tw=80 et ft=tcl:

Even though the above really is good enough, I'm still interested in seeing performance improvements. Any ideas?

Just in case you're into ancient history, here's the slow, old, real-sorting version of main I wrote a few hours before the shiny new swappy version:

 proc main {} {
     set lines [list]
     foreach line [lrange [split [read stdin] \n] 0 end-1] {
         lappend lines [list [expr {rand()}] $line]
     }

     foreach line [lsort -real -index 0 $lines] {
         puts [lindex $line 1]
     }
 }

Lars H: This is probably nit-picking, but I'm a bit concerned by the way you permute the lines. You $count times pick one-out-of $count, so you are effectively choosing one out of $count**$count, which is not a multiple of the number of permutations of $count objects. Hence the unsorting isn't perfectly uniform.

If you do instead

     for {} {$count>1} {incr count -1} {
         set idx_1 [expr {$count-1}]
         set idx_2 [expr {int($count * rand())}]
         set temp [lindex $lines $idx_1]
         lset lines $idx_1 [lindex $lines $idx_2]
         lset lines $idx_2 $temp
     }

then (for a perfectly random rand (ha!)) the routine will pick any permutation with equal probability.

AMG: Hmm, I think you're right. The first swappy version uses the shuffle algorithm I've been using for years (but didn't think to apply to Tcl until yesterday), so I didn't give it much thought. Then when I tested it a bit, the results did seem a bit off. Random, yes, but there seemed to be a pattern, a tendency for a certain ordering. I can't explain it. Probably just my imagination...

I updated my local copy, and I'm about to go update the version in Shuffling a list.


Let's give it a try, shall we?

 [andy@blender|~/devel/misc]$ unsort <<EOF
 > alpha
 > bravo
 > charlie
 > delta
 > EOF
 charlie
 bravo
 delta
 alpha

I usually use unsort for creating a randomized playlist:

 [andy@blender|~]$ mplayer $(find ~/media/music -type f | unsort)

Works great! I'm using it right now, as a matter of fact.

Actually this breaks down for very, very large playlists since they don't fit in argv. Just pipe the result into a playlist file and feed that to mplayer (or use "-playlist /dev/stdin").

AK. Heh. The A musicbox of mine also uses a shuffle proc. It is the version which picks one element at random, removes it, and places it at the end of the result list. Works fine for the about 3000 and something files from my CDs.


WJP: Another approach is to use lsort with a random comparison function, e.g.:

  set MyRandomizedList [lsort -command RandomCompare $MyList]

where RandomCompare is a function of two variables that returns 1, 0, or -1, e.g.:

 proc RandomCompare {a b} {
    set ::RandomSeed [expr ($::RandomSeed*9301 + 49297) % 233280]
    return [expr int(3 * $::RandomSeed/double(233280)) -1]
 }

(Note that RandomSeed must be initialized, perhaps to pid or clock seconds.)

aspect: This approach will not randomise uniformly, for a similar reason to AMG's shuffle at the top of the page.


A non-Tcl approach that for some purposes is faster and more flexible is my sort utility msort [L1 ] which provides, in addition to such things as lexicographic, numeric, and numeric string comparison, random comparison.


RS 2012-09-28 Funny, I just needed the same thing today. Here's what I coded:

 proc main argv {
    while {[gets stdin line] >= 0} {
        while 1 {
            set r [expr rand()]
            if ![info exists A($r)] break
        }
        set A($r) $line
    }
    foreach i [array names A] {
        puts $A($i)
    }
 }
 main $argv

AMG: The Brace your expr-essions rule applies not only to [expr] but also to [if], [while], and [for]. The argument to your [if] will be constructed by string concatenation of "!" against a zero or one. This presents no safety problem but does cause shimmering.

Not that this matters for randomization (heck, maybe it helps), but [array names]'s return value is in no particular order. I wouldn't call it random order though, since it derives from the semi-predictable behavior of the hash function.

RS The real randomization comes from the rand() array indices. The info exists check makes sure no duplicates are drawn, which would lead to loss of data.

AMG: rand() doesn't directly set the order in which the elements come out of the array. The output order derives from the Tcl hash function applied to the string representations of the random numbers. This will still be random, but my point is that this code doesn't appear to work quite the way you say it does.


See also Shuffling a list, Shuffle a list, and Randomizing a list.