EKB I have just been introduced to substantively weighted analytical techniques (SWAT) and in particular substantively weighted least squares (SWLS) through the book What Works: A New Approach to Program and Policy Analysis [L1 ]. It's also discussed in an online paper "Best Practices Research: A Methodological Guide for the Perplexed" [L2 ]. It's a very interesting technique for policy analysis, and I created a tcl implementation for it. (There's also an R language implementation available from the author at [L3 ].)
The motivation for SWLS is that in many studies that include a linear regression, the residuals are more interesting than the fitted line. For example, when fitting a variable that gives the performance of a government agency against explanatory variables such as staffing levels, financial resources, and so on, the usual regression parameters tell you about the performance of the average agency for different values of those variables. But what you're usually interested in is the performance of agencies that are above the average, given the levels of resources, so that you can try to identify what they are doing that make them better. SWLS (and, more generally, SWAT) is a systematic way to look for patterns among above or below average data within a data set.
The code is below, and a sample data file, used in the What Works book, is at the end of this page.
Note: The results produced by this code agree with the results in What Works except for a few percent difference in the bounds on the 95% confidence interval. I can't track down the source of the discrepancy... EKB OK, I tracked down the discrepancy. They were using the asymptotic value of the inverse t-distribution as the number of deg of freedom gets very large (1.96), rather than the value for the 42 deg of freedom of the sample data file. The script below finds the appropriate t value for the degrees of freedom in the problem.
Another note: After the original post, I improved on the filltext proc code to be a little more nicely laid out and more informative.
package require Tcl 8.4 package require math::linearalgebra 1.0 package require math::statistics 0.1.1 namespace eval swat { variable epsilon 1.0e-7 variable swls_threshold 0.7 variable swls_beta 10 namespace import \ ::math::linearalgebra::mkMatrix \ ::math::linearalgebra::mkVector \ ::math::linearalgebra::mkIdentity \ ::math::linearalgebra::mkDiagonal \ ::math::linearalgebra::getrow \ ::math::linearalgebra::setrow \ ::math::linearalgebra::getcol \ ::math::linearalgebra::setcol \ ::math::linearalgebra::getelem \ ::math::linearalgebra::setelem \ ::math::linearalgebra::dotproduct \ ::math::linearalgebra::matmul \ ::math::linearalgebra::add \ ::math::linearalgebra::sub \ ::math::linearalgebra::solveGauss # NOTE: The authors of math::linearalgebra forgot to # export ::math::linearalgebra::transpose ######################################## # # t-stats # ######################################## # swat::tstat n # Returns inverse of the single-tailed t distribution # Defaults to alpha = 0.05 # Iterates until result is within epsilon of the target proc tstat {n {alpha 0.05}} { variable epsilon variable tvals if [info exists tvals($n:$alpha)] { return $tvals($n:$alpha) } set deltat [expr {100 * $epsilon}] # For one-tailed distribution, set ptarg [expr {1.000 - $alpha/2.0}] set maxiter 100 # Initial value for t set t 2.0 set niter 0 while {abs([::math::statistics::cdf-students-t $n $t] - $ptarg) > $epsilon} { set pstar [::math::statistics::cdf-students-t $n $t] set pl [::math::statistics::cdf-students-t $n [expr {$t - $deltat}]] set ph [::math::statistics::cdf-students-t $n [expr {$t + $deltat}]] set t [expr {$t + 2.0 * $deltat * ($ptarg - $pstar)/($ph - $pl)}] incr niter if {$niter == $maxiter} { error "swat::tstat: Did not converge after $niter iterations" return {} } } # Cache the result to shorten the call in future set tvals($n:$alpha) $t return $t } ######################################## # # Weighted Least Squares # ######################################## # swat::wls w [list y x's] w [list y x's] ... # Returns: # R-squared # coefficients of x's in fit # jacknifed residuals # standard errors of the coefficients # 95% confidence bounds for coefficients proc wls {args} { # Fill the matrices of x & y values, and weights # For n points, k coefficients # The number of points is equal to half the arguments (n weights, n points) set n [expr {[llength $args]/2}] set firstloop true # Sum up all y values to take an average set ysum 0 # Add up the weights set wtsum 0 # Count over rows (points) as you go set point 0 foreach {wt pt} $args { # Check inputs if {[string is double $wt] == 0} { error "swat::wls: Weight \"$wt\" is not a number" return {} } ## -- Check dimensions, initialize if $firstloop { # k = num of vals in pt = 1 + number of x's (because of constant) set k [llength $pt] if {$n <= [expr {$k + 1}]} { error "swat::wls: Too few degrees of freedom: $k variables but only $n points" return {} } set X [mkMatrix $n $k] set y [mkVector $n] set I_x [mkIdentity $k] set I_y [mkIdentity $n] set firstloop false } else { # Have to have same number of x's for all points if {$k != [llength $pt]} { error "swat::wls: Point \"$pt\" has wrong number of values (expected $k)" # Clean up return {} } } ## -- Extract values from set of points # Make a list of y values set yval [expr {double([lindex $pt 0])}] setelem y $point $yval set ysum [expr {$ysum + $wt * $yval}] set wtsum [expr {$wtsum + $wt}] # Add x-values to the x-matrix set xrow [lrange $pt 1 end] # Add the constant (value = 1.0) lappend xrow 1.0 setrow X $point $xrow # Create list of weights & square root of weights lappend w [expr {double($wt)}] lappend sqrtw [expr {sqrt(double($wt))}] incr point } set ymean [expr {double($ysum)/$wtsum}] set W [mkDiagonal $w] set sqrtW [mkDiagonal $sqrtw] # Calculate sum os square differences for x's for {set r 0} {$r < $k} {incr r} { set xsqrsum 0.0 set xmeansum 0.0 # Calculate sum of squared differences as: sum(x^2) - (sum x)^2/n for {set t 0} {$t < $n} {incr t} { set xval [getelem $X $t $r] set xmeansum [expr {$xmeansum + double($xval)}] set xsqrsum [expr {$xsqrsum + double($xval * $xval)}] } lappend xsqr [expr {$xsqrsum - $xmeansum * $xmeansum/$n}] } ## -- Set up the X'WX matrix set XtW [matmul [::math::linearalgebra::transpose $X] $W] set XtWX [matmul $XtW $X] # Invert set M [solveGauss $XtWX $I_x] set beta [matmul $M [matmul $XtW $y]] ### -- Residuals & jacknife statistics # 1) Generate list of diagonals of the hat matrix set H [matmul $X [matmul $M $XtW]] for {set i 0} {$i < $n} {incr i} { lappend h_ii [getelem $H $i $i] } set R [matmul $sqrtW [matmul [sub $I_y $H] $y]] set yhat [matmul $H $y] # 2) Generate list of residuals, sum of squared residuals, r-squared set sstot 0.0 set ssreg 0.0 # Note: Relying on representation of Vector as a list for y, yhat foreach yval $y wt $w yhatval $yhat { set sstot [expr {$sstot + $wt * ($yval - $ymean) * ($yval - $ymean)}] set ssreg [expr {$ssreg + $wt * ($yhatval - $ymean) * ($yhatval - $ymean)}] } set r2 [expr {double($ssreg)/$sstot}] set adjr2 [expr {1.0 - (1.0 - $r2) * ($n - 1)/($n - $k)}] set sumsqresid [dotproduct $R $R] # 3) Calculate the overall variance, the variance per point & jacknifed stderr set s2 [expr {double($sumsqresid) / double($n - $k)}] for {set r 0} {$r < $n} {incr r} { set wt [lindex $w $r] set hprime [expr {sqrt($wt) * (1.0 - [lindex $h_ii $r])}] set resid [getelem $R $r] set newjns2i [expr {(($n - $k) * $s2 - double($resid * $resid)/$hprime)/($n - $k - 1)}] lappend jns2i $newjns2i lappend jacknife [expr {double($resid)/sqrt($newjns2i * $hprime)}] } ### -- Confidence intervals for coefficients set tvalue [tstat [expr {$n - $k}]] for {set i 0} {$i < $k} {incr i} { set stderr [expr {sqrt($s2 * [getelem $M $i $i])}] set mid [lindex $beta $i] lappend stderrs $stderr lappend confinterval [list [expr {$mid - $tvalue * $stderr}] [expr {$mid + $tvalue * $stderr}]] } return [list $adjr2 $beta $jacknife $stderrs $confinterval] } ######################################## # # Substantively Weighted Least Squares # ######################################## # swat::swls [list y x's] [list y x's] ... # Returns: # - # points with wt 1 # - R-squared # - coefficients of x's for entire sequence of fits # - 95% confidence intervals for coefficients # - weighted average of x values proc swls {args} { variable swls_threshold variable swls_beta set numpoints [llength $args] # Initialize weights to 1.0 for all points for {set i 0} {$i < $numpoints} {incr i} { set wt($i) 1.0 } # Start of big loop over weightings for {set i 0} {$i < $swls_beta} {incr i} { # Prepare the arg to send to wls set wtpts {} set nhigh 0 set sumwts 0 for {set j 0} {$j < $numpoints} {incr j} { set pts [lindex $args $j] if {![info exists numxs]} { set numxs [expr {[llength $pts] - 1}] } lappend wtpts $wt($j) $pts set sumwts [expr {$sumwts + $wt($j)}] for {set n 1} {$n <= $numxs} {incr n} { if {$j > 0} { set sumx($n) [expr {$sumx($n) + $wt($j) * [lindex $pts $n]}] } else { set sumx($n) [lindex $pts $n] } } if {$wt($j) == 1.0} { incr nhigh } } set xave {} for {set n 1} {$n <= $numxs} {incr n} { lappend xave [expr {double($sumx($n))/$sumwts}] } set retval [eval wls $wtpts] # R-squared is first, then coeffs, then jacknifed residuals set r2 [lindex $retval 0] lappend coeffs [list $nhigh $r2 [lindex $retval 1] [lindex $retval 3] [lindex $retval 4] $xave] set jacknife [lindex $retval 2] # Figure out the next set of weights for {set j 0} {$j < $numpoints} {incr j} { if {$i == 0 && [lindex $jacknife $j] < $swls_threshold} { set wt($j) [expr {1.0 - 1.0/double($swls_beta)}] } if {$i > 0 && $wt($j) < 1.0} { set wt($j) [expr {$wt($j) - 1.0/double($swls_beta)}] } } } return $coeffs } } ########################## # # GUI # ########################## set fileinfo(dir) [file dirname $argv0] set fileinfo(name) "" proc fileopen {} { global fileinfo set fname [tk_getOpenFile -initialdir $fileinfo(dir) -initialfile $fileinfo(name) \ -defaultextension ".dat" -filetypes { {{Data files} {.dat}} {{Text Files} {.txt} } {{All Files} *} }] if {$fname == ""} {return} if [catch {open $fname r} fhndl] { tk_messageBox -icon error -message "Error opening file [file tail $fname]: $fhndl" return } set names {} set points {} set firstline true while {[gets $fhndl line] != -1} { # Ignore blank lines & comments if {[string trim $line] == ""} {continue} if {[string index [string trimleft $line] 0] == "#"} {continue} if $firstline { set names [split $line "\t"] set firstline false continue } lappend points [split $line "\t"] } close $fhndl set fileinfo(dir) [file dirname $fname] set fileinfo(name) [file tail $fname] wm title . "$fileinfo(name) - SWLS" filltext $names $points } proc filltext {names points} { .t delete 1.0 end # Go ahead and update, since there can be a delay in loading update set coeffs [eval swat::swls $points] set nstart [lindex [lindex $coeffs 0] 0] set nend [lindex [lindex $coeffs end] 0] .t insert end "Explaining: [lindex $names 0]\n\n" .t insert end "Number of data points: $nstart\n" .t insert end "Number in set of interest: $nend\n\n" .t insert end [format "%-15s\t% 10s\t% 11s\t\%25s\t% 10s\n\n" Variable Coeff "Std Error" "-- 95% Conf Interval --" "Ave Value"] set i 1 foreach val $coeffs { set r2 [lindex $val 1] set xcoeffs [lindex $val 2] set stderrs [lindex $val 3] set cis [lindex $val 4] set xaves [lindex $val 5] .t insert end "== Run $i ==\n" incr i set j 1 foreach xave $xaves xc $xcoeffs ci $cis se $stderrs { if {$j == [llength $xcoeffs]} { set lbl "Constant" set avestring "" } else { set lbl "[lindex $names $j]" set avestring [format "% 10.3f" $xave] } .t insert end [format "%-15s\t% 10.3f\t\u00B1% 10.3f\t\[% 10.3f : % 10.3f\]\t%s\n" \ $lbl $xc $se [lindex $ci 0] [lindex $ci 1] $avestring] incr j } .t insert end [format "Adj. R-squared = %.3f\n\n" $r2] } } wm title . "SWLS" . configure -menu .mainmenu menu .mainmenu .mainmenu add cascade -label "File" -menu .mainmenu.f menu .mainmenu.f -tearoff false .mainmenu.f add command -label "Open" -command fileopen -underline 0 scrollbar .sby -command {.t yview} -orient vertical scrollbar .sbx -command {.t xview} -orient horizontal text .t -width 70 -height 20 -wrap none -yscrollcommand {.sby set} \ -xscrollcommand {.sbx set} -font "Courier 9" grid .t .sby -sticky nsew grid .sbx -sticky nsew grid columnconfigure . 0 -weight 1 grid rowconfigure . 0 -weight 1 focus -force .
You can try it out using the following data file. This is a file of data on child welfare agencies used by the original SWAT creators for the book What Works, formatted for this tcl app.
Organizational Learning ACES Instability Ambiguity Staff Change Divorce Rate Slack Expenditure 1141.4 1.467351 70.43141 23.57 115.5502 26.2 3.470122 3440.159 2667 1.754386 6.407348 32.1 39.5922 27.8 4.105816 8554.186 307.3 0.4215852 63.41155 35.14 106.0642 29.4 5.610657 2544.263 840.7 0 16.39695 23.46 50.57774 30.8 3.848957 2366.732 482.5 1.18499 78.90607 31.95 -1.144 20.3 3.700461 4954.157 550 2.072846 28.10363 17.81 25.76482 22.8 2.74241 2976.427 514.1 0.607718 22.71782 22.23 18.3592 14.9 5.063129 4905.78 1352.5 1.470588 6.602017 65.03 -232.2549 19.6 9.178106 5970.52 1136.5 0.8285004 123.2406 21.44 60.60778 30.8 2.495392 2567.212 1575.4 1.660879 87.06177 31.26 86.82364 22.2 2.100618 2589.531 1170.6 0 13.04299 37.68 31.7781 19.1 4.491279 4141.247 1679.6 0.9624639 7.197373 21.72 79.94753 27 3.074239 3216.067 882.3 1.212856 181.2476 16.97 45.62724 17.4 1.902308 2469.215 1347.5 1.782531 78.82232 11.84 33.76736 26.6 1.762038 1742.745 1353.4 1.431127 11.54932 31.1 57.34841 16.6 3.434238 2659.697 1363.4 2.40481 13.08474 18.11 72.44474 22.5 2.563138 3296.465 1087.9 0.538648 26.32775 23.06 166.6439 21.9 2.649425 3330.793 712.8 0.4703669 27.16063 25.46 35.4368 15.6 4.030189 3416.769 1865.4 2.42915 9.96554 18.14 90.42884 20.9 3.209765 4109.007 1088.4 1.234568 38.82103 40.84 24.8976 14.2 3.488844 5275.824 1308.1 0.6671114 76.34886 26.11 82.58347 12.7 3.641929 4926.775 3485.4 1.17421 197.8029 25.89 80.5744 17.7 1.753225 5419.784 1946.4 1.128159 35.4882 26.85 70.3593 14.9 4.76418 5208.218 1116.8 0.3858025 76.16611 16.39 225.721 21.8 4.611794 2539.369 2047.5 0.5804953 54.3063 21.42 121.5814 22 2.799651 2754.369 893.3 2.475247 7.206166 12.36 103.7574 22.9 3.882953 2083.009 1779.2 0 26.99598 45.8 78.2787 17.6 3.511719 3982.897 703.4 1.557632 3.533641 49.19 -32.1768 54.9 6.609568 4082.064 695.1 2.714932 6.912465 43.87 32.34244 19.7 6.407141 2998.161 1287 0.257732 62.02957 36.51 17.6954 15 4.470681 6310.163 573 0 13.15338 28.29 48.06679 26 2.264973 2759.142 955.6 0.3322627 151.1162 32.31 50.3237 15 3.284161 5634.305 1296.1 1.187472 70.68908 23.75 31.70579 20.6 3.450963 2992.531 1194.5 1.574803 5.132252 15.61 54.0354 15.4 2.777402 2679.454 4299.2 3.565225 169.9304 22.62 136.621 20.3 2.841627 3072.191 896.4 1.259843 66.21898 23.09 35.20083 32.9 3.420866 2381.565 815.4 4.106776 53.44192 37.42 -176.8579 23.8 6.527769 4303.525 2262.9 0.7524454 121.2754 51.96 18.3963 14.8 2.932871 4130.657 1108.8 0.996016 10.88008 38.61 22.00537 16.1 2.421065 3339.35 1314.2 1.404494 27.21783 25.78 36.989 17.4 1.432419 2533.139 1268.9 4.207574 2.172794 17.05 23.44076 16.9 2.988671 2366.595 952.8 0.8075914 94.00198 41.9 33.35452 26.6 1.970768 2031.763 763.2 0.4034815 121.7529 51.39 53.86072 24.4 2.710884 1620.14 1228.6 0 11.48006 21.33 73.45351 21.8 4.641821 5069.072 1023.5 3.527337 3.049591 20.86 84.84604 18.9 4.317273 2983.782 1646.3 1.272669 58.24525 19.83 105.2593 17.7 2.980112 3097.782 2483.3 1.195695 80.93693 30.97 160.9784 24.7 6.063038 5996.734 1045.3 0.5552471 15.81082 15.11 78.10522 22.3 3.786695 2222.728 3866.9 1.210898 70.8133 24.03 60.605 15.2 3.335582 4972.425 1450.2 0 7.217247 10.33 163.1716 29 2.623226 1860.076
After running this data file through the script, it should give the following output:
Explaining: Organizational Learning Number of data points: 50 Number in set of interest: 11 Variable Coeff Std Error -- 95% Conf Interval -- Ave Value == Run 1 == ACES 284.795 ± 99.831 [ 83.327 : 486.263] 1.265 Instability 3.815 ± 2.057 [ -0.336 : 7.967] 52.237 Ambiguity 9.004 ± 11.259 [ -13.718 : 31.727] 28.111 Staff Change 4.617 ± 1.758 [ 1.069 : 8.165] 55.651 Divorce Rate -1.682 ± 14.412 [ -30.766 : 27.402] 21.712 Slack -125.335 ± 88.330 [ -303.593 : 52.923] 3.643 Expenditure 0.294 ± 0.078 [ 0.136 : 0.452] 3617.571 Constant -263.506 ± 591.866 [ -1457.939 : 930.927] Adj. R-squared = 0.363 ... Runs 2-9 omitted ... == Run 10 == ACES 333.752 ± 117.107 [ 97.421 : 570.083] 1.133 Instability 6.311 ± 2.421 [ 1.426 : 11.197] 63.724 Ambiguity -0.572 ± 11.657 [ -24.097 : 22.954] 30.631 Staff Change 5.445 ± 2.248 [ 0.908 : 9.981] 64.046 Divorce Rate -16.887 ± 21.001 [ -59.269 : 25.496] 23.142 Slack 5.579 ± 117.953 [ -232.460 : 243.618] 3.829 Expenditure 0.342 ± 0.107 [ 0.126 : 0.558] 3835.946 Constant 4.187 ± 877.228 [ -1766.131 : 1774.505] Adj. R-squared = 0.630