Version 4 of testing for normality

Updated 2006-02-28 14:55:47 by AM

TR - Testing some data for normality (whether they are normally distributed or not) is often needed in statistics. Here, I have implemented one of these normality tests, called the Lilliefors test [L1 ]. It is based on the famous Kolmogorov-Smirnov-Test [L2 ].

I know there are better tests than this, so feel free to add Shapiro-Wilks, Anderson-Darling and the like ...

The procedure looks like this (and modelled after the same function in R [L3 ]):

 proc LillieforsFit {values} {
        #
        # calculate the goodness of fit according to Lilliefors
        # (goodness of fit to a normal distribution)
        #
        # values -> list of values to be tested for normality
        # (these values are sampled counts)
        #

        # calculate standard deviation and mean of the sample:
        set n [llength $values]
        if {$n < 5} {error "insufficient data (n<5)"}
        set sd [sd $values]
        set mean [mean $values]

        # sort the sample for further processing:
        set values [lsort -real $values]

        # standardize the sample data (Z-scores):
        foreach x $values {lappend stdData [expr {($x - $mean)/double($sd)}]}

        # compute the value of the distribution function at every sampled point:
        foreach x $stdData {lappend expData [pnorm $x]}

        # compute D+:
        set i 0
        foreach x $expData {
                incr i
                lappend dplus [expr {$i/double($n)-$x}]
        }
        set dplus [lindex [lsort -real $dplus] end]

        # compute D-:
        set i 0
        foreach x $expData {
                incr i
                lappend dminus [expr {$x-($i-1)/double($n)}]
        }
        set dminus [lindex [lsort -real $dminus] end]

        # calculate the test statistic D
        # by finding the maximal vertical difference
        # between the samle and the expectation:
        set D [expr {$dplus > $dminus ? $dplus : $dminus}]
        # we now use the modified statistic Z,
        # because D is only reliable 
        # if the p-value is smaller than 0.1
        ###if {$n < 30} {puts "warning: n=$n<30"}
        return [expr {$D * (sqrt($n) - 0.01 + 0.831/sqrt($n))}]
 }

This function uses a procedure called pnorm which is the cdf (=cumulative distribution function) of the standard normal distribution (see also gaussian Distribution). It looks like this:

 proc pnorm {x} {
        #
        # cumulative distribution function (cdf)
        # for the standard normal distribution like in the statistical software 'R'
        # (mean=0 and sd=1)
        #
        # x -> value for which the cdf should be calculated
        #
        set sum [expr {double($x)}]
        set oldSum 0.0
        set i 1
        set denom 1.0
        while {$sum != $oldSum} {
                set oldSum $sum
                incr i 2
                set denom [expr {$denom*$i}]
                #puts "$i - $denom"
                set sum [expr {$oldSum + pow($x,$i)/$denom}]
        }
        return [expr {0.5 + $sum * exp(-0.5 * $x*$x - 0.91893853320467274178)}]
 }

This procedure is accurate until the last digit for number up to 6 or so. There is a quicker alternative if you can afford to loose some accuracy:

 proc pnorm_quicker {x} {
        #
        # cumulative distribution function (cdf)
        # for the standard normal distribution like in 'R'
        # (mean=0 and sd=1)
      #
      # quicker than the former but not so accurate
        #
        # x -> value for which the cdf should be calculated
        #
        set n [expr {abs($x)}]
        set n [expr {1.0 + $n*(0.04986735 + $n*(0.02114101 + $n*(0.00327763 \
                + $n*(0.0000380036 + $n*(0.0000488906 + $n*0.000005383)))))}]
        set n [expr {1.0/pow($n,16)}]
        #
        if {$x >= 0} {
                return [expr {1 - $n/2.0}]
        } else {
                return [expr {$n/2.0}]
        }
 }

Now, an example: We have 32 birthweight data (taken from Henry C. Thode: "Testing for normality") here as weights in ounces. We want to know if these are normally distributed:

 puts [LillieforsFit {72 112 111 107 119  92 126  80  81  84 115 118 128 128 123 116 125 126 122 \
 126 127 86 142 132  87 123 133 106 103 118 114 94}]

We get a value for D which is 0.82827415657 or approx. 0.828. Now we just need to compare this value with the value 0.895. This is the D value for the 0.05 critical level. Since our value is less the critical value, we might not assume normality (or strictly speaking: under the assumption that our data are normally distributed, there is a possibility of more than 5% (about 9.5% here), that we are wrong).

Other critical levels are:

 array set levels {
 D       % 
 0.741   20
 0.775   15
 0.819   10
 0.895    5
 1.035    1
 }

Further reference: http://www.jstatsoft.org/v11/i04/v11i04.pdf (Evaluating the Normal distribution)


AM (28 february 2006) It seems to me worthwhile to incorporate this in the statistics module of Tcllib.


Category Mathematics - Category Statistics