sobota, 18 grudnia 2010

Correspondence analysis with R and i-Class

One again the dataset ,,A Week in the Life of a Browser'' is in use. This time the correspondence analysis will be applied to it. Correspondence analysis is a method for finding a relation between two categorical variables, see www.jstatsoft.org/v20/a03/paper for more details.

In this example we will search for relation between user operation system and language of his web browser. Both variables are available in table ‘user’, see the dataset description for more details https://testpilot.mozillalabs.com/testcases/a-week-life-2/aggregated-data.html.

The nztable() procedure from nza package is used to calculate contingency table and then ca() function from ca package is used to visualize calculated contingency table. The contingency table is calculated in parallel in NPS and only the resulting small matrix is downloaded to R client.

The R code and resulting figure are presented below. As we see there is a clear structure in this contingency table. For four countries the usage of Linux is much higher than any usage of any other operating system. Other counties are discriminate by proportion of Windows NT 5 versus Windows NT 6. The negative values on the second component point countries with high fraction of newest versions of Windows and higher fraction of Mac OS. These countries are rather well situated in contrast to countries with high values in second component which are rather in the group of ,,developing countries’’ (at least some of them are).

The R code. Note that there is no need for SQL queries, neither for downloading whole dataset.

> library(nza)
> nzConnect("user","password","10.1.1.74","witl") 
> nzUsers = nz.data.frame("users")
> # contingency table for operating system and web browser location
> tmp = nzTable(OS~LOCATION, nzUsers)$mat
> # dictionaries for operating systems and locations
> # needed while the contingency table has to be reduced to avoid sparsity
> # and improve readability
> list1 = c("Mac OS", "Linux", "Windows NT 5", "Windows NT 6")
> list2 = c("de", "es", "fi", "fr", "ru", "et", "it", "zh", "en", "nl", "pl", 
+        "sv", "pt", "ca", "da", "ro", "sk", "tr", "id", "ja", "lv", "nb", 
+        "bg", "el", "is", "lt", "sl", "uk", "eo", "hu", "cs", "th", "ar", 
+        "hr", "vi", "ko", "mr", "sr")
> mat2 = matrix(0, length(lista), ncol(tmp))
> for (i in seq_along(lista)) 
+   mat2[i,] = colSums(tmp[grep(lista[i],rownames(tmp)),,drop=F])
> mat3 = matrix(0, length(lista), length(lista2))
> for (i in seq_along(lista2)) 
+   mat3[,i] = rowSums(mat2[,grep(lista2[i],colnames(tmp)),drop=F])
> rownames(mat3) = list1
> colnames(mat3) = list2
> # the reduced contingency table
> t(mat3)
   Mac OS Linux Windows NT 5 Windows NT 6
de     69    17          496         1280
es     35    52          696          727
fi      7     0           26           81
fr     73    10          312          548
ru     10     0          385          351
et      1     0            5            6
it     23     4          161          198
zh      5     3          522          318
en   1026    99         6979         7795
nl     21     1           40          126
pl      4     1          197          220
sv     13     1           27           95
pt      9     2         1648         1184
ca      2    48            3            6
da      5     0            4           31
ro      0    48           21           22
sk      2     0           11           26
tr      1     2          117          134
id      0     0           42           25
ja      6     1           27           46
lv      0     0            3            4
nb     10     0           18           61
bg      0     0            1            1
el      1     0           22           41
is      0     0            0            2
lt      0     0           14            8
sl      0     0            0            2
uk      0     0            9           16
eo      0     0            1            0
hu      1     1          143          110
cs      1     0           96          120
th      0     0            1            3
ar      0     0            1            3
hr      0    48            1            1
vi      0     0            6            1
ko      6     0           76           36
mr      0     0            0            1
sr      0     0            1            0

> library(ca)
> # plot the correspondence analysis results
> plot(ca(mat4), mass = c(T,T),  arrows = c(T,F))