Version 10 of unsort

Updated 2008-05-13 18:14:20 by WJP

AMG: unsort is a simple UNIXy stdin->stdout pipeline tool for randomizing line order. It's simple, works quite nicely, and is surprisingly useful.

Without further ado, I present to you the source:

 #!/bin/sh
 # \
 exec tclsh "$0" "$@"

 proc main {} {
     set lines [lrange [split [read stdin] \n] 0 end-1]
     set count [llength $lines]

     for {set idx_1 0} {$idx_1 < $count} {incr idx_1} {
         set idx_2 [expr {int($count * rand())}]
         set temp [lindex $lines $idx_1]
         lset lines $idx_1 [lindex $lines $idx_2]
         lset lines $idx_2 $temp
     }

     puts [join $lines \n]
 }

 main

 # vim: set ts=4 sts=4 sw=4 tw=80 et ft=tcl:

Even though the above really is good enough, I'm still interested in seeing performance improvements. Any ideas?

Just in case you're into ancient history, here's the slow, old, real-sorting version of main I wrote a few hours before the shiny new swappy version:

 proc main {} {
     set lines [list]
     foreach line [lrange [split [read stdin] \n] 0 end-1] {
          lappend lines [list [expr {rand()}] $line]
     }

     foreach line [lsort -real -index 0 $lines] {
         puts [lindex $line 1]
     }
 }

Lars H: This is probably nit-picking, but I'm a bit concerned by the way you permute the lines. You $count times pick one-out-of $count, so you are effectively choosing one out of $count**$count, which is not a multiple of the number of permutations of $count objects. Hence the unsorting isn't perfectly uniform.

If you do instead

     for {} {$count>1} {incr count -1} {
         set idx_1 [expr {$count-1}]
         set idx_2 [expr {int($count * rand())}]
         set temp [lindex $lines $idx_1]
         lset lines $idx_1 [lindex $lines $idx_2]
         lset lines $idx_2 $temp
     }

then (for a perfectly random rand (ha!)) the routine will pick any permutation with equal probability.

AMG: Hmm, I think you're right. The first swappy version uses the shuffle algorithm I've been using for years (but didn't think to apply to Tcl until yesterday), so I didn't give it much thought. Then when I tested it a bit, the results did seem a bit off. Random, yes, but there seemed to be a pattern, a tendency for a certain ordering. I can't explain it. Probably just my imagination...

I updated my local copy, and I'm about to go update the version in Shuffling a list.


Let's give it a try, shall we?

 [andy@blender|~/devel/misc]$ unsort <<EOF
 > alpha
 > bravo
 > charlie
 > delta
 > EOF
 charlie
 bravo
 delta
 alpha

I usually use unsort for creating a randomized playlist:

 [andy@blender|~]$ mplayer $(find ~/media/music -type f | unsort)

Works great! I'm using it right now, as a matter of fact.

Actually this breaks down for very, very large playlists since they don't fit in argv. Just pipe the result into a playlist file and feed that to mplayer (or use "-playlist /dev/stdin").

AK. Heh. The A musicbox of mine also uses a shuffle proc. It is the version which picks one element at random, removes it, and places it at the end of the result list. Works fine for the about 3000 and something files from my CDs. --- WJP: Another approach is to use lsort with a random comparison function, e.g.:

  set MyRandomizedList [lsort -command RandomCompare $MyList]

where RandomCompare is a function of two variables that returns 1, 0, or -1, e.g.:

proc RandomCompare {a b} {

    set ::RandomSeed [expr ($::RandomSeed*9301 + 49297) % 233280]
    return [expr int(3 * $::RandomSeed/double(233280)) -1]

}

(Note that RandomSeed must be initialized, perhaps to pid or clock seconds.)

A non-Tcl approach that for some purposes is faster and more flexible is my sort utility msort (http://billposer.org/Software/msort.html ) which provides, in addition to such things as lexicographic, numeric, and numeric string comparison, random comparison.


See also Shuffling a list


Under which category or categories should this page be filed?

Ideas: [ [Category Sorting] | [Category UNIX] | [Category Command Line Tools] | [Category Entropy Source] :^) ]

LV Hmm - a sorting category might be useful to have. Perhaps a filter category as well.