[TR] - Testing some data for normality (whether they are normally distributed or not) is often needed in [statistics]. Here, I have implemented one of these normality tests, called the Lilliefors test [http://en.wikipedia.org/wiki/Lilliefors_test]. It is based on the famous Kolmogorov-Smirnov-Test [http://en.wikipedia.org/wiki/Kolmogorov-Smirnov_test]. I know there are better tests than this, so feel free to add Shapiro-Wilks, Anderson-Darling and the like ... The procedure looks like this (and modelled after the same function in R [http://www.r-project.org]): proc LillieforsFit {values} { # # calculate the goodness of fit according to Lilliefors # (goodness of fit to a normal distribution) # # values -> list of values to be tested for normality # (these values are sampled counts) # # calculate standard deviation and mean of the sample: set n [llength $values] if {$n < 5} {error "insufficient data (n<5)"} set sd [sd $values] set mean [mean $values] # sort the sample for further processing: set values [lsort -real $values] # standardize the sample data (Z-scores): foreach x $values {lappend stdData [expr {($x - $mean)/double($sd)}]} # compute the value of the distribution function at every sampled point: foreach x $stdData {lappend expData [pnorm $x]} # compute D+: set i 0 foreach x $expData { incr i lappend dplus [expr {$i/double($n)-$x}] } set dplus [lindex [lsort -real $dplus] end] # compute D-: set i 0 foreach x $expData { incr i lappend dminus [expr {$x-($i-1)/double($n)}] } set dminus [lindex [lsort -real $dminus] end] # calculate the test statistic D # by finding the maximal vertical difference # between the samle and the expectation: set D [expr {$dplus > $dminus ? $dplus : $dminus}] # we now use the modified statistic Z, # because D is only reliable # if the p-value is smaller than 0.1 ###if {$n < 30} {puts "warning: n=$n<30"} return [expr {$D * (sqrt($n) - 0.01 + 0.831/sqrt($n))}] } This function uses a procedure called pnorm which is the cdf (=cumulative distribution function) of the standard normal distribution (see also [gaussian Distribution]). It looks like this: proc pnorm {x} { # # cumulative distribution function (cdf) # for the standard normal distribution like in the statistical software 'R' # (mean=0 and sd=1) # # x -> value for which the cdf should be calculated # set sum [expr {double($x)}] set oldSum 0.0 set i 1 set denom 1.0 while {$sum != $oldSum} { set oldSum $sum incr i 2 set denom [expr {$denom*$i}] #puts "$i - $denom" set sum [expr {$oldSum + pow($x,$i)/$denom}] } return [expr {0.5 + $sum * exp(-0.5 * $x*$x - 0.91893853320467274178)}] } This procedure is accurate until the last digit for number up to 6 or so. There is a quicker alternative if you can afford to loose some accuracy: proc pnorm_quicker {x} { # # cumulative distribution function (cdf) # for the standard normal distribution like in 'R' # (mean=0 and sd=1) # # quicker than the former but not so accurate # # x -> value for which the cdf should be calculated # set n [expr {abs($x)}] set n [expr {1.0 + $n*(0.04986735 + $n*(0.02114101 + $n*(0.00327763 \ + $n*(0.0000380036 + $n*(0.0000488906 + $n*0.000005383)))))}] set n [expr {1.0/pow($n,16)}] # if {$x >= 0} { return [expr {1 - $n/2.0}] } else { return [expr {$n/2.0}] } } '''Now, an example:''' We have 32 birthweight data (taken from Henry C. Thode: "Testing for normality") here as weights in ounces. We want to know if these are normally distributed: puts [LillieforsFit {72 112 111 107 119 92 126 80 81 84 115 118 128 128 123 116 125 126 122 \ 126 127 86 142 132 87 123 133 106 103 118 114 94}] We get a value for D which is 0.82827415657 or approx. 0.828. Now we just need to compare this value with the value 0.895. This is the D value for the 0.05 critical level. Since our value is less the critical value, we might not assume normality (or strictly speaking: under the assumption that our data are normally distributed, there is a possibility of more than 5% (about 9.5% here), that we are wrong). Other critical levels are: array set levels { D % 0.741 20 0.775 15 0.819 10 0.895 5 1.035 1 } '''Further reference:''' http://www.jstatsoft.org/v11/i04/v11i04.pdf (Evaluating the Normal distribution) ---- [Category Mathematics] - [Category Statistics]