Principal Component Analysis

Arjen Markus (29 november 2017) Principal Component Analysis (PCA) is a well-known technique to analyse multivariate data. Suppose you have observations of various quantities, like income, age, education level etc. for a set of societal classes. PCA is used to reduce the dimensionality of such a data collection, making it easier to analyse the data (it is easier to analyse two-dimensional data than ten-dimensional ones). It does so by determining combinations of the quantities such that the newly constructed quantities explain as much as possible of the original quantities. In mathematical terms: you determine eigenvectors. The importance of these eigenvectors is measured via their eigenvalues. By leaving out unimportant components you reduce the dimensionality while retaining as much information as possible.

The analysis asks for some linear algebra as well as statistics. And there is some interpretation by human beings involved: the analyst has to decide what number of components to retain and whether the subsequent approximations are good enough.

The program below follows Ed Hume's example , but uses the math::linearalgebra package from Tcllib. It is my intention to capture the procedure in a new package, so that it becomes easier to do this type of work in Tcl. (I think I will use TclOO for this, as you need to use various steps and pass various intermediate results around.)

Note: This was inspired by a question from Luc Moulinier.

Update: (5 january 2018) I have almost finished the package. The code (particularly the test suite) needs some cleaning up and a new method. I hope to round it off with the next few days.

# pca_example.tcl --
#     Follow the PCA example from NIST (as paraphrased by Ed Hume, using the
#     LA package), employing the math::linearalgebra package
#
#     Note:
#     I may add a few procs to the linearalgebra package
#
package require math::linearalgebra

# Sample data - from the NIST page on the subject
set data {
    {7 4 3}
    {4 1 8}
    {6 3 5}
    {8 6 1}
    {8 5 7}
    {7 2 9}
    {5 3 3}
    {9 5 8}
    {7 4 5}
    {8 2 2}
}

# Normalise the matrix

set z [::math::linearalgebra::normalizeStat $data]

# Perform a singular value decomposition

set usv [::math::linearalgebra::determineSVD $z]

# Print the singular values

set eigenVectors [lindex $usv 2]
set singular     [lindex $usv 1]
puts "Singular values: $singular"

# Square the vector of singular values and divide by the number of observations - 1
# to get the eigenvalues

set factor [expr {1.0/([llength $data] - 1)}]

set eigen {}
foreach c $singular {
    lappend eigen [expr {$c**2 * $factor}]
}

puts "Eigenvalues: $eigen"
puts "Eigenvectors:\n[::math::linearalgebra::show $eigenVectors]"

puts "Verify that V*Vt = I"
puts "[::math::linearalgebra::show [::math::linearalgebra::matmul $eigenVectors [::math::linearalgebra::transpose $eigenVectors]]]"

# Determine the square roots of the eigenvalues
# We need them for the factor structure matrix
# At the same time, construct the diagonal with the reciprocals
set sqeigen {}
set rsqeigen {}
foreach c $eigen {
    lappend sqeigen  [expr {sqrt($c)}]
    lappend rsqeigen [expr {1.0/sqrt($c)}]
}

puts "Square root: $sqeigen"
set sqdiagonal [::math::linearalgebra::mkDiagonal $sqeigen]

# Now the factor structure matrix
set s [::math::linearalgebra::matmul $eigenVectors $sqdiagonal]

puts "Factor structure:\n[::math::linearalgebra::show $s]"

# Retain the first two factors only and print the communality
set s2 {}
foreach row $s {
    lappend s2 [lrange $row 0 1]
}

set s2s2 [::math::linearalgebra::matmul $s2 [::math::linearalgebra::transpose $s2]]
puts "SSt - first two factors\n[::math::linearalgebra::show $s2s2]"

set communality {}
for {set i 0} {$i < [llength $s2s2]} {incr i} {
    lappend communality [lindex $s2s2 $i $i]
}
puts "Communality: $communality"

# Now construct the factor score coefficients matrix (b)
set b [::math::linearalgebra::matmul $eigenVectors [::math::linearalgebra::mkDiagonal $rsqeigen]]

puts "Factor score coefficients:\n[::math::linearalgebra::show $b]"

# Examine the first observation - factor scores via the eigenvectors
set z1 [lindex $z 0]

set t [::math::linearalgebra::matmul $z1 $eigenVectors]
puts "Scores: $t"

set t2 0.0
foreach c $t e $eigen {
    set t2 [expr {$t2 + $c**2 / $e}]
}
puts "Hotelling: $t2"

# Fit the first observation, z1, with two principal components
set v2 {}
foreach row $eigenVectors {
    lappend v2 [lrange $row 0 1]
}
set t    [::math::linearalgebra::matmul $z1 $v2]
set zhat [::math::linearalgebra::matmul $t [::math::linearalgebra::transpose $v2]]

puts "z1:\n[::math::linearalgebra::show $z1]"
puts "zhat:\n[::math::linearalgebra::show $zhat]"
set difference  [::math::linearalgebra::show [::math::linearalgebra::sub $z1 $zhat]]
puts "difference:\n[::math::linearalgebra::show $difference]"
set qstat [::math::linearalgebra::dotproduct $difference $difference]
puts "Q statistic: $qstat"
                                                                   # Number of components
set qcorr [expr {$qstat * [llength $data] / double([llength $data] - 2 - 1)}]
puts "Q corrected: $qcorr"