środa, 15 grudnia 2010

Principal component analysis (PCA) with R and i-Class

The dataset ,,A Week in the Life of a Browser'' is in use again. This time to show an example of PCA. Three variables are used as input variables: one quantitative and two qualitative, namely answers for questions 8, 2 and 5. Both quantitative variables are coded by dummy variables. More information about this dataset can be found here https://testpilot.mozillalabs.com/testcases/a-week-life-2/aggregated-data.html.

In order to perform PCA the dot product will be used. The dot product is computed in parallel fashion in NPS. The aggregate is downloaded to R client. Then mean values are subtracted from the dot product to obtain covariance matrix. The covariance matrix is then rescaled to correlation matrix. This matrix is passed to princomp() function.

Two R packages are used to connect with database and compute in-database aggregates. These are: nzr and nza. The nzConnect() function is used to connect with database witl. Then the nzDotProduct() function is used to compute dot product in parallel in NPS. Below an example usage of PCA is presented.

> nzConnect("user","password","10.1.1.74","witl") 
> nzSurvey = nz.data.frame("survey")
> tmp = nzDotProduct(~Q8+factor(Q2)+factor(Q5), nzSurvey, TRUE)
> p = ncol(tmp$mat)
> means = tmp$mat[p,-p] / tmp$mat[p,p]
> # centering the dot product matrix to covariance matrix
> centered = tmp$mat[-p,-p] - outer(means, means) * tmp$mat[p,p]
> # scaling covariance matrix to correlation matrix
> scaled = cov2cor(centered)
> # PCA with use of princomp()
> (pca = princomp(covmat = scaled))
Call:
princomp(covmat = scaled)

Standard deviations:
      Comp.1       Comp.2       Comp.3       Comp.4       Comp.5       Comp.6       Comp.7 
1.481848e+00 1.403128e+00 1.155283e+00 9.222890e-01 8.062642e-01 2.980232e-08 0.000000e+00 

 7  variables and  NA observations.
> plot(pca)
> loadings(pca)

Loadings:
     Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
Q8    0.323        -0.243  0.860  0.312              
Q2.1 -0.304         0.559         0.755        -0.142
Q2.2 -0.222 -0.662 -0.127                      -0.697
Q2.3  0.282  0.644                             -0.702
Q5.1 -0.392  0.146  0.454  0.459 -0.506 -0.390       
Q5.2  0.583 -0.269  0.248 -0.175        -0.700       
Q5.3 -0.427  0.220 -0.587         0.242 -0.598       

               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000
Proportion Var  0.143  0.143  0.143  0.143  0.143  0.143  0.143
Cumulative Var  0.143  0.286  0.429  0.571  0.714  0.857  1.000
> par(mar=c(5,5,2,2))
> plot(-1:1, -1:1, col="white", xlab="PC1", ylab="PC2")
> arrows(0,0,pca$loadings[,1], pca$loadings[,2])
> text(1.2* pca$loadings[,1], 1.2* pca$loadings[,2], rownames(pca$loadings))

Below the output figures are presented.
To perform PCA also the nzPCA() function from nza package may be used.