One again the dataset ,,A Week in the Life of a Browser'' is in use. This time the correspondence analysis will be applied to it. Correspondence analysis is a method for finding a relation between two categorical variables, see
www.jstatsoft.org/v20/a03/paper for more details.
In this example we will search for relation between user operation system and language of his web browser. Both variables are available in table ‘user’, see the dataset description for more details
https://testpilot.mozillalabs.com/testcases/a-week-life-2/aggregated-data.html.
The nztable() procedure from nza package is used to calculate contingency table and then ca() function from ca package is used to visualize calculated contingency table. The contingency table is calculated in parallel in NPS and only the resulting small matrix is downloaded to R client.
The R code and resulting figure are presented below. As we see there is a clear structure in this contingency table. For four countries the usage of Linux is much higher than any usage of any other operating system. Other counties are discriminate by proportion of Windows NT 5 versus Windows NT 6. The negative values on the second component point countries with high fraction of newest versions of Windows and higher fraction of Mac OS. These countries are rather well situated in contrast to countries with high values in second component which are rather in the group of ,,developing countries’’ (at least some of them are).
The R code. Note that there is no need for SQL queries, neither for downloading whole dataset.
> library(nza)
> nzConnect("user","password","10.1.1.74","witl")
> nzUsers = nz.data.frame("users")
> # contingency table for operating system and web browser location
> tmp = nzTable(OS~LOCATION, nzUsers)$mat
> # dictionaries for operating systems and locations
> # needed while the contingency table has to be reduced to avoid sparsity
> # and improve readability
> list1 = c("Mac OS", "Linux", "Windows NT 5", "Windows NT 6")
> list2 = c("de", "es", "fi", "fr", "ru", "et", "it", "zh", "en", "nl", "pl",
+ "sv", "pt", "ca", "da", "ro", "sk", "tr", "id", "ja", "lv", "nb",
+ "bg", "el", "is", "lt", "sl", "uk", "eo", "hu", "cs", "th", "ar",
+ "hr", "vi", "ko", "mr", "sr")
> mat2 = matrix(0, length(lista), ncol(tmp))
> for (i in seq_along(lista))
+ mat2[i,] = colSums(tmp[grep(lista[i],rownames(tmp)),,drop=F])
> mat3 = matrix(0, length(lista), length(lista2))
> for (i in seq_along(lista2))
+ mat3[,i] = rowSums(mat2[,grep(lista2[i],colnames(tmp)),drop=F])
> rownames(mat3) = list1
> colnames(mat3) = list2
> # the reduced contingency table
> t(mat3)
Mac OS Linux Windows NT 5 Windows NT 6
de 69 17 496 1280
es 35 52 696 727
fi 7 0 26 81
fr 73 10 312 548
ru 10 0 385 351
et 1 0 5 6
it 23 4 161 198
zh 5 3 522 318
en 1026 99 6979 7795
nl 21 1 40 126
pl 4 1 197 220
sv 13 1 27 95
pt 9 2 1648 1184
ca 2 48 3 6
da 5 0 4 31
ro 0 48 21 22
sk 2 0 11 26
tr 1 2 117 134
id 0 0 42 25
ja 6 1 27 46
lv 0 0 3 4
nb 10 0 18 61
bg 0 0 1 1
el 1 0 22 41
is 0 0 0 2
lt 0 0 14 8
sl 0 0 0 2
uk 0 0 9 16
eo 0 0 1 0
hu 1 1 143 110
cs 1 0 96 120
th 0 0 1 3
ar 0 0 1 3
hr 0 48 1 1
vi 0 0 6 1
ko 6 0 76 36
mr 0 0 0 1
sr 0 0 1 0
> library(ca)> # plot the correspondence analysis results
> plot(ca(mat4), mass = c(T,T), arrows = c(T,F))