Sequential correlations and the output of rand

Purpose: Show how to improve the output of the math function rand to reduce sequential correlation.


The sequence of numbers returned by rand is prone to sequential correlations if large runs of numbers are taken. For example, if clusters of six sequential results from

    expr { rand() }

are analyzed, it will turn out that they lie on only 40 hyperplanes in six-dimensional space. This limitation means that rand() is not suitable by itself for applications like high-resolution Monte Carlo integration.

The following code implements a "better" rand that reduces this problem by shuffling the sequence that is returned. For further discussion, see Donald E. Knuth, The Art of Computer Programming, Volume 2: Seminumerical Algorithms, which has a wealth of information about random number generators and their foibles.


Implementation using array

 namespace eval BayesDurham {
 
     namespace export rand
 
     variable poolSize 97;               # Size of the pool of random numbers
 
     variable v;                         # Pool of random numbers
 
     variable y;                         # Most recently generated random number
 
 }
 
 #----------------------------------------------------------------------
 #
 # BayesDurham::rand --
 #
 #       Generate a random floating point number between 0 and 1
 #
 # Parameters:
 #       None.
 #
 # Results:
 #       Returns a randomly chosen floating point number in the 
 #       range [0 .. 1)
 #
 # Side effects:
 #       Stores state in several namespace variables.
 #
 # The BayesDurham::rand procedure attempts to improve upon the 
 # system-supplied rand() function by breaking up sequential
 # correlations.  It works by shuffling a vector of numbers supplied by
 # the system's random number generator, using its own output to choose
 # the next slot in the vector.
 #
 #----------------------------------------------------------------------
 
 proc BayesDurham::rand {} {
 
     variable poolSize
     variable v
     variable y
 
     # Test whether we're initializing the sequence
 
     if { ![info exists v] } {
 
         # Discard some random numbers - in case the RNG was just
         # seeded.
 
         for { set i 0 } { $i < $poolSize } { incr i } {
             expr { rand() }
         }
 
         # Save a pool of random numbers
 
         for { set i 0 } { $i < $poolSize } { incr i } {
             set v($i) [expr { rand() }]
         }
 
         # Generate one more random number
 
         set y [expr { rand() }]
         
     }
 
     # Use the most recently generated number to select from the pool.
 
     set i [expr { int( double( $poolSize ) * $y ) }]
     set y $v($i)
     set v($i) [expr { rand() }]
 
     return $y
 
 }
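
A minimal usage sketch (not part of the original code): since the namespace exports rand, the shuffled generator can be imported and called like any other command.

 # Draw five numbers from the shuffled generator defined above
 namespace import ::BayesDurham::rand
 for {set i 0} {$i < 5} {incr i} {
     puts [rand]
 }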

Implementation using list

EKB Here's an alternative implementation:

    namespace eval shufflerand {
        variable vals
        variable vallen 98
    }

    proc shufflerand::init {args} {
        variable vals
        variable vallen
        
        set haveseed false
        
        foreach {arg argval} $args {
            switch -- $arg {
                -len {
                    if {[string is integer $argval]} {
                        if {$argval > 0} {
                            set vallen $argval
                        } else {
                            error "[namespace current]::init: Length must be greater than zero"
                        }
                    } else {
                        error "[namespace current]::init: Length must be an integer"
                    }
                }
                -seed {
                    if {[string is integer $argval]} {
                        set seed $argval
                        set haveseed true
                    } else {
                        error "[namespace current]::init: Seed must be an integer"
                    }
                }
                default {
                    error "[namespace current]::init: Possible arguments are: -seed -len"
                }
            }
        }
        
        set vals {}
        
        if {$haveseed} {
            lappend vals [expr {srand($seed)}]
        } else {
            lappend vals [expr {rand()}]
        }
        
        for {set i 1} {$i < $vallen} {incr i} {
            lappend vals [expr {rand()}]
        }
    }

    proc shufflerand::rand {} {
        variable vals
        variable vallen
        
        set retval [lindex $vals end]
        
        set ndx [expr {int($vallen * $retval)}]
        
        lset vals end [lindex $vals $ndx]
        lset vals $ndx [expr {rand()}]
        
        return $retval
    }

    # Test
    console show
    shufflerand::init
    puts "A list of 10 random numbers from shuffled list:"
    for {set i 0} {$i < 10} {incr i} {
        puts [shufflerand::rand]
    }

    puts "\nTime to generate 1000 values using rand():"
    puts [time {for {set i 0} {$i < 1000} {incr i} {expr {rand()}}}]

    puts "\nTime to generate 1000 values using shufflerand:"
    puts [time {for {set i 0} {$i < 1000} {incr i} {shufflerand::rand}}]

Here's some sample output from the test code at the end:

 A list of 10 random numbers from shuffled list:
 0.00284572923688
 0.897247713477
 0.649551953492
 0.456922465682
 0.0448791366279
 0.0441727326457
 0.232282039818
 0.645288598093
 0.753969432672
 0.207090688034

 Time to generate 1000 values using rand():
 558 microseconds per iteration

 Time to generate 1000 values using shufflerand:
 4052 microseconds per iteration

So, it's about 7-8 times slower than the straight linear congruential generator that's bundled with Tcl.


EKB Hm... Now using Tcl 8.5, rand() takes a little longer and shufflerand a little less time. Here are some runs for rand(), shufflerand, and Bayes-Durham rand:

 Time to generate 1000 values using rand():
 738 microseconds per iteration

 Time to generate 1000 values using shufflerand:
 3202 microseconds per iteration

 Time to generate 1000 values using Bayes-Durham rand:
 3941 microseconds per iteration

So the Bayes-Durham version is slightly slower (about 20%), possibly because of its per-call test for whether the pool has been created. I decided not to include such a test in the "shufflerand" proc, because the call to rand() is typically in the heart of a loop. I also cached the pool size.


Implementation for 8.5 using list

EKB This is a version for Tcl 8.5 that replaces the existing rand() with a shuffled version, so code that uses rand() doesn't have to be rewritten:

 namespace eval shufflerand {
    variable vals
 }

 proc shufflerand::init {} {
    variable vals
    
    for {set i 0} {$i < 98} {incr i} {
        lappend vals [expr {rand()}]
    }
    
    # lcrand -- "Linear Congruential rand"
    rename ::tcl::mathfunc::rand ::tcl::mathfunc::lcrand
    proc ::tcl::mathfunc::rand {} "[namespace current]::rand"
 }

 proc shufflerand::rand {} {
    variable vals
    
    set retval [lindex $vals end]
    
    set ndx [expr {int(98 * $retval)}]
    
    lset vals end [lindex $vals $ndx]
    lset vals $ndx [expr {lcrand()}]
    
    return $retval
 }

 # Test
 console show

 puts "\nTime to generate 1000 values using default rand():"
 puts [time {for {set i 0} {$i < 1000} {incr i} {expr {rand()}}}]

 shufflerand::init

 puts "\nTime to generate 1000 values using shuffled rand():"
 puts [time {for {set i 0} {$i < 1000} {incr i} {expr {rand()}}}]
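
If the stock generator is wanted back later, the rename can be undone. Here is a minimal sketch of my own (the proc name shufflerand::uninstall is not part of the code above):

 proc shufflerand::uninstall {} {
    # Drop the shuffled rand() and put the original math function back
    rename ::tcl::mathfunc::rand {}
    rename ::tcl::mathfunc::lcrand ::tcl::mathfunc::rand
 }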

The effect of shuffling on the period

Lars H: I claim in Random Number that shuffling such as the above doesn't do much for the period of the PRNG. The following is the reasoning that led me to that conclusion.

Modified implementation for analysis

Not all of the data present in the numbers being shuffled are part of the practical state of the shuffler. Concretely, it is possible to view the numbers merely as indices into the array of numbers and still operate the shuffler exactly as above. To illustrate this, here is a modification of the first implementation that keeps the values and the indices in separate arrays.

  namespace eval BayesDurham {
     namespace export rand
     variable poolSize 97             ;# Size of the pool of random numbers
     variable value                   ;# Pool of random numbers
     variable index                   ;# Corresponding index values
     variable y                       ;# Index of last output
  }
 
  proc BayesDurham::num2index {num} {
      variable poolSize
      return [expr { int( double($poolSize) * $num ) }]
  }
  
  proc BayesDurham::initFromRand {} {
     variable poolSize
     variable value
     variable index
     variable y
     
     # Discard some random numbers - in case the RNG was just seeded.
     for {set i 0} {$i < $poolSize} {incr i} {expr {rand()}}
 
     # Save a pool of random numbers
     for {set i 0} {$i < $poolSize} {incr i} {
         set value($i) [expr {rand()}]
         set index($i) [num2index $value($i)]
     }
 
     # Generate "last" index
     set y [num2index [expr {rand()}]]
  }
  
  proc BayesDurham::initFromRandIfNecessary {} {
     initFromRand
     
     # Turn self into no-op.
     proc initFromRandIfNecessary {} {}
  }
  
  proc BayesDurham::rand {} {
     variable value
     variable index
     variable y
     
     initFromRandIfNecessary
     
     set i $y
     set y $index($i)
     set r $value($i)
     set value($i) [expr {rand()}]
     set index($i) [num2index $value($i)]
 
     return $r
 }

Here, the value array has no influence on how the numbers are shuffled; everything is controlled by the index array and the y variable. Hence the period exhibited by BayesDurham::rand is no greater than that exhibited by

  proc BayesDurham::step {} {
     variable index
     variable y
     
     initFromRandIfNecessary
     
     set i $y
     set y $index($i)
     set index($i) [num2index [expr {rand()}]]
 
     return $y
 }

The poolSize entries in index take values in the range 0 through $poolSize-1, and y also takes values in this range, so the shuffling extends the PRNG state space relevant for the period by a factor pow($poolSize, $poolSize+1). It doesn't extend the period by that much, however.
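
For the default pool size, a quick back-of-the-envelope computation of my own shows how many bits of state this factor corresponds to (it matches the figure quoted further down):

 # log2( poolSize^(poolSize+1) ) = (poolSize+1) * log2(poolSize)
 set poolSize 97
 puts [expr { ($poolSize + 1) * log($poolSize) / log(2) }];   # about 646.8 bits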

Probabilistic analysis

For the period to grow by the same factor as the state space when adding shuffling, it would be necessary that the effect on y and the index array of

  for {set n [periodOfRand]} {$n>0} {incr n -1} {BayesDurham::step}

(running through one period of rand, so that we return that part of the PRNG to its original state) preserves all the information stored in these variables, but that is not the case; this operation is not a bijection. Indeed, it is generally not possible to run BayesDurham::step backwards even one step, that is, to deduce the pre-state of y and index from their post-state, even knowing the entire sequence of numbers generated by rand. Suppose [num2index [expr {rand()}]] was j in this call. Then some index($i) was set to j, and the old value of index($i) was stored in y, but the old value of y (which, incidentally, is the i) is lost, so the process cannot be reversed. It is possible to determine this old value i of y if only one entry in index is equal to j, but as soon as there is more than one such entry, it could have been either of them; two distinct pre-states can lead to the same post-state, and since we don't know which of the two it was, we have lost 1 bit of information.

[Monowidth formulae below are in TeX notation.] Generally, if there are k entries equal to j in index then we lose \log_2 k bits of information, but how much is lost on average depends on how probable the various arrangements are. The total number of length n lists of integers from 0 to n–1 inclusive is n^n. The number of such lists where exactly k entries are equal to some given j in this range is \binom{n}{k} (n-1)^{n-k} — this corresponds to an expansion of the binomial ((n-1) + 1)^n — and we're not interested in the case k=0 since j is always going to be a number that is actually in the list. Hence there are n^n - (n-1)^n possible values for index, and assuming all of them to be of equal probability (they are after all generated by filling the array with random numbers), we find that the probability of having j appearing exactly k times in index is

  \frac{
     \binom{n}{k} (n-1)^{n-k}
  }{
     n^n - (n-1)^n
  } =
  \frac{ (n-1)^n }{ n^n - (n-1)^n } \binom{n}{k} (n-1)^{-k} =
  ( (\frac{n}{n-1})^n - 1 )^{-1} \prod_{l=1}^k \frac{ n-l+1 }{ l (n-1) }
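
As a sanity check of this counting (a brute-force sketch of my own, not part of the original analysis), one can enumerate all n^n lists for a small n and tally how often a given value j occurs:

 proc countOccurrences {n j} {
    # Over all n^n lists of integers 0..n-1, count how many lists
    # contain the value j exactly k times, for each k.
    for {set k 0} {$k <= $n} {incr k} { set count($k) 0 }
    set total [expr { int( pow($n, $n) ) }]
    for {set code 0} {$code < $total} {incr code} {
       # Decode $code as a length-n list of base-n digits
       set c $code
       set k 0
       for {set pos 0} {$pos < $n} {incr pos} {
          if { $c % $n == $j } { incr k }
          set c [expr { $c / $n }]
       }
       incr count($k)
    }
    return [array get count]
 }
 # For n=4, j=0 this gives k=0: 81, k=1: 108, k=2: 54, k=3: 12, k=4: 1,
 # i.e. \binom{n}{k} (n-1)^{n-k} as claimed.
 puts [countOccurrences 4 0]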

The expected number of bits lost at each iteration of BayesDurham::step is thus

  \frac{ (n-1)^n }{ n^n - (n-1)^n } \sum_{k=1}^n \binom{n}{k} (n-1)^{-k} \log_2 k

which turns out to be slightly above 0.5 for all n>=42. A Horner-style unrolling of this sum gives the procedure

 proc E_bitloss {n} {
    set sum 0.0
    for {set k $n} {$k >= 1} {incr k -1} {
       set sum [expr {$sum * ($n-$k)/($k+1)/($n-1) + log($k)}]
    }
    expr { $sum/log(2) * $n/($n-1.0) / ( pow( 1-1.0/$n, -$n) - 1) }
 }
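
To check the claim numerically, the procedure can be called with the pool size used above; per the statement above, the result should be slightly above 0.5:

 puts [E_bitloss 97];   # expected bits lost per step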

With a poolSize of 97, there are (slightly less than) 647 bits of information in the shuffler, so after 1294 steps one expects all of them to have been lost. Since rand has a period of 2147483646, the odds of the shuffler keeping some information around for an entire period of rand seem astronomically small…

Still, that's a purely probabilistic analysis. It should work fine at the beginning, but the more steps are run, the lower the chance that it accurately describes what is happening. My guess would be that there are small sets of initial values for which the trajectories make it around an entire rand period without colliding, but finding even a pair of such values is not easy, and then comes the matter of what happens on the next rand period. Things have to work out very nicely to even make the shuffler exhibit a period twice that of the underlying PRNG; the typical outcome would be a period exactly equal to that of the underlying PRNG.

In the long period limit

For a sufficiently good underlying PRNG, it can actually be proved that there is a point along the period where the shuffler is reset to a particular state regardless of how it was initialised; from this it follows that the period of the shuffled PRNG is exactly equal to that of the underlying PRNG. The idea is simply to construct a sequence of new indices that will set the shuffler to a given state: there is a positive (although very small) probability that this particular sequence of numbers will appear at random, and any sufficiently good PRNG will therefore generate it somewhere along its period.

One useful sequence of indices for resetting the shuffler is a long sequence of all zeroes. (Remember, since we're talking indices here, this doesn't necessarily mean that rand outputs 0.0, only that rand*$poolSize < 1.) If y and index(0) are both zero, then a step in which the new index is also a zero won't change the state of the shuffler. Once in such a state, it is possible to set any individual entry in index to zero (while keeping all entries already zero unchanged) by introducing its index j into a sequence of mostly zeroes, as follows:

 step   new index   y   index(0)   index($j)
  0         j       0      0           ?
  1         0       0      j           ?
  2         0       j      0           ?
  3         0       ?      0           0
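
The table is easy to check with a small driver of my own (pool size 5 and j=3 are arbitrary choices; the loop body is the same index-only step as in BayesDurham::step, fed a chosen list of new indices instead of rand output):

 set j 3
 array set index {0 0  1 2  2 4  3 1  4 3};   # index(0)=0 as assumed; index(3) is the "?" entry
 set y 0;                                     # y=0 as assumed
 foreach new [list $j 0 0] {
    set i $y
    set y $index($i)
    set index($i) $new
 }
 puts "y=$y index(0)=$index(0) index($j)=$index($j)"
 # Prints "y=1 index(0)=0 index(3)=0": entry j is now zero and y holds its old value.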

Finally, in a state with unknown value of y and at most k nonzero entries in index, a sequence of k+1 zeroes as incoming new indices will ensure that y=0 and index(0)=0, again while keeping all entries already zero unchanged; the only way for y to stay nonzero is that it picks up a nonzero value from some entry in index, but doing so increases the number of entries that are zero, so at the very latest after k+1 zeroes there won't be any nonzero values left to pick and y=0. (It is more probable, however, that a zero is picked up much earlier, in which case the number of nonzero entries isn't reduced by all that much.)

Combining the above, one finds that a suitably constructed sequence (mostly zeroes) of n^2/2 + O(n) new indices will reset a shuffler with poolSize=n to an all-zero state. For n=97, and ignoring the linear term (I just want to get the order of magnitude for the exponent right), this is a sequence of about 4700 random numbers, with an information content of about 31050 bits. In order for a PRNG to reproduce that particular sequence, it needs to reproduce all bit sequences of that length, so the period has to be somewhere in the vicinity of, say, 2^{31065}. The most common Mersenne Twister has a period of 2^{19937}-1, so it is not sufficient, but OTOH the 27th Mersenne prime is 2^{44497}-1, so a twister based on this would suffice with a wide margin.
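
The order-of-magnitude arithmetic can be reproduced directly (again a quick check of my own): n^2/2 indices at log2(n) bits each, for n = 97, gives

 puts [expr { (97.0 * 97 / 2) * log(97) / log(2) }];   # about 31050 bits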

Then again, even 2^{19937}-1 (roughly 4.3e6001) is so large that the chances of actually running any computation for that many steps are very low (if protons are unstable [L1 ], the entire computer would probably have decayed into radiation long before the time at which the first period is expected to be complete). At these scales, the period is probably not very relevant anyway.