Monday, December 20, 2010

Firefox 4 Beta Interface and i-Class

This time the dataset "Firefox 4 Beta Interface - Version 2" is used. The purpose of the study is to check which UI elements are used most often, and by which users. The variables of interest are all in the database 'interface'; see the dataset description for more details: https://testpilot.mozillalabs.com/testcases/beta/aggregated-data.html.

In the example below the nzJoin() function is used just to create a view with a join over two tables (surveys and events). Some examples for the nzTable() function are presented in both textual and graphical form.


> library(nza)
> library(ca)
> nzConnect("user ","password","10.1.1.74","interface")
> # pointers to three tables in databse
> nzUsers  = nz.data.frame("users")
> nzSurvey = nz.data.frame("surveys")
> nzEvents = nz.data.frame("events")
> # create a join of two tables
> ussu <- nzJoin(nzSurvey, nzEvents, nzSurvey[,1] == nzEvents[,1])
> # build a contingency table for joined tables
> # the relation between skills and used elements
> tmp2 = nzTable(q8~item, ussu,T)$mat
> catmp2 = ca(tmp2)
> catmp2$rownames[catmp2$rowdist<0.3] = ""
> # visualization of the relation between skills and used elements 
> plot(catmp2, labels=c(2,2), arrow=c(F,T), mass=c(T,T))

> # absolute number of used UI elements
> tab1 = nzTable(category~item, nzEvents)
> ttt = as.data.frame(tab1$tab)
> # which group of UI elements are touched most often
> dotchart(ttt[,3], col=rainbow(14)[as.numeric(ttt[,1])], cex=1.1, pch=19, 
+    main="number of interactions with given UI element")
> legend("topleft",levels(ttt[,1]), col=rainbow(14)[seq_along(unique(ttt[,1]))], 
+    bg="white", pch=19, title="category")

In the first figure we see that more advanced users often use the following UI elements: the new window/new tab button, as well as the start private browsing, inspect, page setup and web console features. These elements are hardly used by users with lower skills.





Saturday, December 18, 2010

Correspondence analysis with R and i-Class

Once again the dataset "A Week in the Life of a Browser" is used. This time correspondence analysis will be applied to it. Correspondence analysis is a method for finding a relation between two categorical variables; see www.jstatsoft.org/v20/a03/paper for more details.
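As a minimal illustration of what correspondence analysis computes, here is a base-R sketch on a made-up contingency table; the ca() function used below essentially wraps this SVD of the standardized residuals.

```r
# Correspondence analysis from first principles; toy counts for illustration.
N <- matrix(c(30, 10,  5,
              10, 25, 15,
               5, 15, 35), nrow = 3, byrow = TRUE)
P <- N / sum(N)                           # correspondence matrix
rmass <- rowSums(P); cmass <- colSums(P)  # row and column masses
# matrix of standardized residuals
S <- diag(1 / sqrt(rmass)) %*% (P - rmass %o% cmass) %*% diag(1 / sqrt(cmass))
sv <- svd(S)
# principal coordinates of the row profiles (what plot.ca() draws)
rowcoord <- diag(1 / sqrt(rmass)) %*% sv$u %*% diag(sv$d)
sum(sv$d^2)                               # total inertia = chi-square / n
```

The total inertia (sum of squared singular values) equals the Pearson chi-square statistic of the table divided by the grand total, which is why CA is a natural companion to contingency tables.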

In this example we will search for a relation between the user's operating system and the language of their web browser. Both variables are available in the table 'users'; see the dataset description for more details: https://testpilot.mozillalabs.com/testcases/a-week-life-2/aggregated-data.html.

The nzTable() procedure from the nza package is used to calculate the contingency table, and then the ca() function from the ca package is used to visualize it. The contingency table is calculated in parallel in NPS, and only the resulting small matrix is downloaded to the R client.

The R code and the resulting figure are presented below. As we see, there is a clear structure in this contingency table. For four locales the usage of Linux is much higher than that of any other operating system. Other locales are discriminated by the proportion of Windows NT 5 versus Windows NT 6. Negative values on the second component indicate locales with a high fraction of the newest versions of Windows and a higher fraction of Mac OS. These countries are rather well-off, in contrast to countries with high values on the second component, which are rather in the group of "developing countries" (at least some of them are).

The R code. Note that there is no need for SQL queries, nor for downloading the whole dataset.

> library(nza)
> nzConnect("user","password","10.1.1.74","witl") 
> nzUsers = nz.data.frame("users")
> # contingency table for operating system and web browser location
> tmp = nzTable(OS~LOCATION, nzUsers)$mat
> # dictionaries for operating systems and locations,
> # needed because the contingency table has to be reduced
> # to avoid sparsity and improve readability
> list1 = c("Mac OS", "Linux", "Windows NT 5", "Windows NT 6")
> list2 = c("de", "es", "fi", "fr", "ru", "et", "it", "zh", "en", "nl", "pl", 
+        "sv", "pt", "ca", "da", "ro", "sk", "tr", "id", "ja", "lv", "nb", 
+        "bg", "el", "is", "lt", "sl", "uk", "eo", "hu", "cs", "th", "ar", 
+        "hr", "vi", "ko", "mr", "sr")
> mat2 = matrix(0, length(list1), ncol(tmp))
> for (i in seq_along(list1)) 
+   mat2[i,] = colSums(tmp[grep(list1[i],rownames(tmp)),,drop=F])
> mat3 = matrix(0, length(list1), length(list2))
> for (i in seq_along(list2)) 
+   mat3[,i] = rowSums(mat2[,grep(list2[i],colnames(tmp)),drop=F])
> rownames(mat3) = list1
> colnames(mat3) = list2
> # the reduced contingency table
> t(mat3)
   Mac OS Linux Windows NT 5 Windows NT 6
de     69    17          496         1280
es     35    52          696          727
fi      7     0           26           81
fr     73    10          312          548
ru     10     0          385          351
et      1     0            5            6
it     23     4          161          198
zh      5     3          522          318
en   1026    99         6979         7795
nl     21     1           40          126
pl      4     1          197          220
sv     13     1           27           95
pt      9     2         1648         1184
ca      2    48            3            6
da      5     0            4           31
ro      0    48           21           22
sk      2     0           11           26
tr      1     2          117          134
id      0     0           42           25
ja      6     1           27           46
lv      0     0            3            4
nb     10     0           18           61
bg      0     0            1            1
el      1     0           22           41
is      0     0            0            2
lt      0     0           14            8
sl      0     0            0            2
uk      0     0            9           16
eo      0     0            1            0
hu      1     1          143          110
cs      1     0           96          120
th      0     0            1            3
ar      0     0            1            3
hr      0    48            1            1
vi      0     0            6            1
ko      6     0           76           36
mr      0     0            0            1
sr      0     0            1            0

> library(ca)
> # plot the correspondence analysis results
> plot(ca(mat3), mass = c(T,T),  arrows = c(T,F))


Wednesday, December 15, 2010

Decision trees with R and i-Class

Once again the dataset "A Week in the Life of a Browser" is used. This time the in-database decision tree procedure is presented.

Here, a decision tree is used to find variables which discriminate two classes of Firefox users: those who use only Firefox and those who use Firefox and some other browser (question Q2 in the survey, see the dataset description https://testpilot.mozillalabs.com/testcases/a-week-life-2/aggregated-data.html).

There is an in-database DECTREE procedure available in i-Class. Below its R wrapper is presented. It is called nzDecTree() and is available in the nza package. Note that the tree model is built in the database, and only the model description is downloaded to R.

There are two advantages of this: the model is built in parallel, and there is no need to send a large dataset from the database to R.
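As background, here is a minimal base-R sketch of the Gini split criterion that CART-style trees (presumably including DECTREE) optimize at each node; the data and names below are made up for illustration.

```r
# How a decision tree picks a split on one variable: a toy Gini-criterion sketch.
gini <- function(y) 1 - sum((table(y) / length(y))^2)

best_split <- function(x, y) {
  cuts <- sort(unique(x))[-1]          # candidate thresholds
  impurity <- sapply(cuts, function(cut) {
    left <- y[x < cut]; right <- y[x >= cut]
    # weighted impurity of the two child nodes
    (length(left) * gini(left) + length(right) * gini(right)) / length(y)
  })
  cuts[which.min(impurity)]            # threshold with lowest weighted impurity
}

x <- c(1, 2, 3, 10, 11, 12)
y <- c("A", "A", "A", "B", "B", "B")
best_split(x, y)                       # returns 10: a perfectly clean split
```

A real tree procedure repeats this search over all variables at every node and then recurses into the children, which is why the computation parallelizes well.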

> nzConnect("user","password","10.1.1.74","witl") 
> nzSurvey = nz.data.frame("survey")
> # build decision tree in NPS and download it to R
> treeT = nzDecTree(Q2~Q1+Q5+Q6+Q7, nzSurvey, id="IUSER_ID", minsplit=10)
> # modification of node and level labels; the original raw variable names
> #   are not very human-readable
> levels(treeT$frame$yval) = c("only Firefox","other browsers")
> levels(treeT$frame$var) = c("", "How long use Firefox", "How old are you", 
+            "How much time do you spend on the Web")
> # generic function for decision tree plotting and printing 
> plot(treeT)
> treeT
node), split, n, deviance, yval, (yprob)
      * denotes terminal node

 1) root 4081 0 other browsers ( 0.3416 0.6584 )  
   2) How long use Firefox=0 24 0 other browsers ( 0.1667 0.8333 )  
     4) How much time do you spend on the Web=0 20 0 other browsers ( 0.0000 1.0000 ) *
     5) How much time do you spend on the Web < >0 4 0 only Firefox ( 0.7500 0.2500 ) *
   3) How long use Firefox < >0 4057 0 other browsers ( 0.3379 0.6621 )  
     6) How old are you=0 450 0 only Firefox ( 0.5067 0.4933 )  
      12) How long use Firefox=0 33 0 other browsers ( 0.2424 0.7576 ) *
      13) How long use Firefox < >0 417 0 only Firefox ( 0.5276 0.4724 ) *
     7) How old are you < >0 3607 0 other browsers ( 0.3169 0.6831 ) *


Principal component analysis (PCA) with R and i-Class

The dataset "A Week in the Life of a Browser" is used again, this time to show an example of PCA. Three variables are used as input: one quantitative and two qualitative, namely the answers to questions 8, 2 and 5. Both qualitative variables are coded by dummy variables. More information about this dataset can be found here: https://testpilot.mozillalabs.com/testcases/a-week-life-2/aggregated-data.html.

In order to perform PCA the dot product is used. The dot product is computed in parallel in NPS and the resulting aggregate is downloaded to the R client. Then the mean values are subtracted from the dot product matrix to obtain the covariance matrix. The covariance matrix is then rescaled to a correlation matrix, which is passed to the princomp() function.
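The centering step can be checked locally in base R. The sketch below assumes, as the code in this post suggests, that the dot-product aggregate is crossprod() of the model matrix extended with a column of ones; the data are simulated for illustration.

```r
# The dot-product trick behind the in-database PCA, sketched locally:
# one (p+1)x(p+1) aggregate is enough to recover the covariance matrix.
set.seed(1)
X  <- matrix(rnorm(200 * 3), ncol = 3)
Xa <- cbind(X, 1)                      # extra column of ones
dp <- crossprod(Xa)                    # (p+1) x (p+1) dot-product matrix
p  <- ncol(dp)
n  <- dp[p, p]                         # ones' dot product = number of rows
means    <- dp[p, -p] / n              # column sums / n
centered <- dp[-p, -p] - outer(means, means) * n
# identical (up to the n-1 factor) to the two-pass covariance matrix
max(abs(centered / (n - 1) - cov(X)))  # ~ 0
princomp(covmat = cov2cor(centered))   # PCA on the implied correlation matrix
```

Since only the small aggregate leaves the database, the cost of the analysis is independent of the number of rows.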

Two R packages are used to connect with the database and compute in-database aggregates: nzr and nza. The nzConnect() function is used to connect with the database witl. Then the nzDotProduct() function is used to compute the dot product in parallel in NPS. An example usage is presented below.

> nzConnect("user","password","10.1.1.74","witl") 
> nzSurvey = nz.data.frame("survey")
> tmp = nzDotProduct(~Q8+factor(Q2)+factor(Q5), nzSurvey, TRUE)
> p = ncol(tmp$mat)
> means = tmp$mat[p,-p] / tmp$mat[p,p]
> # centering the dot product matrix to covariance matrix
> centered = tmp$mat[-p,-p] - outer(means, means) * tmp$mat[p,p]
> # scaling covariance matrix to correlation matrix
> scaled = cov2cor(centered)
> # PCA with use of princomp()
> (pca = princomp(covmat = scaled))
Call:
princomp(covmat = scaled)

Standard deviations:
      Comp.1       Comp.2       Comp.3       Comp.4       Comp.5       Comp.6       Comp.7 
1.481848e+00 1.403128e+00 1.155283e+00 9.222890e-01 8.062642e-01 2.980232e-08 0.000000e+00 

 7  variables and  NA observations.
> plot(pca)
> loadings(pca)

Loadings:
     Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
Q8    0.323        -0.243  0.860  0.312              
Q2.1 -0.304         0.559         0.755        -0.142
Q2.2 -0.222 -0.662 -0.127                      -0.697
Q2.3  0.282  0.644                             -0.702
Q5.1 -0.392  0.146  0.454  0.459 -0.506 -0.390       
Q5.2  0.583 -0.269  0.248 -0.175        -0.700       
Q5.3 -0.427  0.220 -0.587         0.242 -0.598       

               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000
Proportion Var  0.143  0.143  0.143  0.143  0.143  0.143  0.143
Cumulative Var  0.143  0.286  0.429  0.571  0.714  0.857  1.000
> par(mar=c(5,5,2,2))
> plot(-1:1, -1:1, col="white", xlab="PC1", ylab="PC2")
> arrows(0,0,pca$loadings[,1], pca$loadings[,2])
> text(1.2* pca$loadings[,1], 1.2* pca$loadings[,2], rownames(pca$loadings))

The output figures are presented below.
To perform PCA, the nzPCA() function from the nza package may also be used.



Sunday, December 12, 2010

Linear regression for narrow datasets with R and i-Class

The dataset "A Week in the Life of a Browser" is used again. This time the goal is to show how to perform linear regression, and bootstrap for linear regression, with i-Class.
As the response variable the quantitative variable "How would you describe your computer/web skill level" is used. This variable is measured on a 0-10 scale and may not be a perfect variable for Gaussian regression, but it is fine for illustration purposes. As explanatory variables the answers to survey questions 2, 5 and 6 are used. All three are qualitative and are coded in the regression model as dummy variables. More information about this dataset can be found here: https://testpilot.mozillalabs.com/testcases/a-week-life-2/aggregated-data.html.

We use the nzLm() function from the nza package. This function is designed to deal efficiently with narrow datasets, i.e. datasets in which the number of columns/variables is small while the number of rows can be very large. In such cases, in order to perform linear regression, the dot product of the model matrix is calculated. Estimates of the model coefficients can be derived from the dot product. Moreover, the dot product is much smaller than the dataset, so it can be downloaded from the database to the R client quite fast. The calculation of the dot product is easy to parallelize, and in i-Class it is computed in parallel in NPS. It is also easy to calculate dot products for bootstrap samples: there is no need to materialize the bootstrap samples, it is enough to generate an appropriate vector of weights.
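This idea can be checked locally in base R: the normal equations need only the small crossprod() aggregate, never the data itself. The data below are simulated for illustration only.

```r
# Linear regression from the dot product alone: beta = (X'X)^{-1} X'y,
# where both X'X and X'y sit inside crossprod(cbind(X, y)).
set.seed(2)
n <- 1000
X <- cbind(1, rnorm(n), rnorm(n))      # intercept + two predictors
y <- X %*% c(3, 1.5, -2) + rnorm(n)
dp <- crossprod(cbind(X, y))           # small aggregate; all that must leave the database
p  <- ncol(X)
beta <- solve(dp[1:p, 1:p], dp[1:p, p + 1])
# agrees with the usual least-squares fit
max(abs(as.vector(beta) - unname(coef(lm(y ~ X[, 2] + X[, 3])))))  # ~ 0
```

For a dataset with millions of rows and a handful of columns, the aggregate is a tiny matrix, which is exactly the regime nzLm() targets.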

Two R packages are used to connect with the database and compute in-database aggregates: nzr and nza. The nzConnect() function is used to connect with the database witl. Then the nzLm() function is used to compute the linear model in NPS. The qualitative variables are marked by wrapping them in the factor() function. The nzLm() function calculates estimates for the considered variables as well as some model selection criteria like AIC and BIC.

> nzConnect("user","password","10.1.1.74","witl") 
> nzSurvey  = nz.data.frame("survey") 
> nzLm(Q8~factor(Q2)+factor(Q5)+factor(Q6), nzSurvey)

Coefficients:
            Estimate  Std.Error    t.value      p.value
Q2.1      -0.7569315 0.13931738  -5.433144 5.860745e-08
Q2.2       1.7717458 0.06072839  29.174920 0.000000e+00
Q2.3       2.1566500 0.05892490  36.599980 0.000000e+00
Q5.1       0.8208645 0.08000862  10.259701 0.000000e+00
Q5.2       1.7161506 0.04135033  41.502703 0.000000e+00
Q5.3       0.6344492 0.05769266  10.997053 0.000000e+00
Q6.1      -2.0493290 0.16116496 -12.715723 0.000000e+00
Q6.2       0.8444226 0.05239913  16.115202 0.000000e+00
Q6.3       1.1582426 0.04021557  28.800850 0.000000e+00
Q6.4       1.3361322 0.04133580  32.323849 0.000000e+00
Q6.5       1.2271135 0.04891489  25.086708 0.000000e+00
Q6.6       0.5027933 0.06100118   8.242354 2.220446e-16
Q6.7       0.1520891 0.06696563   2.271151 2.318983e-02
Intercept  3.1714643 0.03389847  93.557743 0.000000e+00

Residual standard error:  14229.16  on  4070  degrees of freedom

Log-likelihood:  -8339.181 
AIC:  16700.36 
BIC:  16686.68

By specifying the argument weights=0 in the nzLm() function, the dot product is computed in bootstrap mode. This procedure can be repeated B times to obtain the bootstrap distribution of the model coefficients, or confidence intervals. In the example below the bootstrap distribution is presented graphically with the boxplot() function.
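The weight trick itself can be verified locally in base R: weighted dot products with multinomial count weights give exactly the same estimates as physically resampling the rows, so each bootstrap replicate costs one aggregate rather than one copy of the data. Toy data below, for illustration.

```r
# Bootstrap via weights: resampled rows and count weights are equivalent.
set.seed(3)
n <- 500
X <- cbind(1, rnorm(n))
y <- 2 + 3 * X[, 2] + rnorm(n)
idx <- sample(n, replace = TRUE)       # a classical bootstrap sample
w   <- tabulate(idx, nbins = n)        # the same sample expressed as row weights
# estimate from physically resampled rows
beta_rows    <- solve(crossprod(X[idx, ]), crossprod(X[idx, ], y[idx]))
# estimate from weighted dot products X' diag(w) X and X' diag(w) y
beta_weights <- solve(crossprod(X * sqrt(w)), crossprod(X, w * y))
max(abs(beta_rows - beta_weights))     # ~ 0
```

This is presumably what happens inside nzLm() with weights=0: only the weight vector changes between replicates, and the heavy aggregation stays in the database.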

B = 100
coef = matrix(0, 7, B)
for (i in 1:B)
     coef[,i] = nzLm(Q8~factor(Q2)+factor(Q5), nzSurvey, weights=0)$coefficients
rownames(coef) = c("Q2.1", "Q2.2", "Q2.3", "Q5.1", "Q5.2", "Q5.3", "Intercept")

par(mar=c(3,7,2,2))
boxplot(t(coef), horizontal=T, las=1)

In this example, for question Q2 the following levels are in the model: NA (1), I use only Firefox (2), I also use other browsers (3). Users with the highest skills use Firefox and some other browser. For question Q5 the following levels are in the model: NA (1), male (2), female (3). In this case males declare themselves as experts more often.


Sunday, December 5, 2010

The first example, visualizations for a single qualitative variable


In the following example the dataset "A Week in the Life of a Browser" is used to show how to prepare simple data visualizations with i-Class and NPS (Netezza Performance Server, see www.netezza.com). i-Class is used in version LA1 (Limited Availability v1). There is no guarantee that this code will work in future releases of i-Class.

The dataset "A Week in the Life of a Browser" was published by Mozilla Labs and is available under the terms of the Creative Commons License. The dataset and its description are available here: https://testpilot.mozillalabs.com/testcases/a-week-life-2/aggregated-data.html

We use only the R interface, so no SQL code is needed to prepare the following figures. Note that the data are stored on a remote server and are too large to download to the R client. Thus all data aggregates are computed in NPS, and only small summaries are downloaded to the R client in order to prepare the figures.

Two R packages are used to connect with the database and compute in-database aggregates: nzr and nza. The nzConnect() function is used to connect with the database witl. There are three tables in this database, namely: users, events and survey.
They are described here: https://testpilot.mozillalabs.com/testcases/a-week-life-2/aggregated-data.html. Then the nzTable() function is used to compute the contingency table in NPS; this table is downloaded and used to prepare the plot.



# simple function which summarizes single qualitative variable
#   form is an R formula with one variable
#   nzdf is a nz.data.frame object
#   names is a vector of names for corresponding values of considered variable
#   title is a main title for produced figure 
getSimpleSummary <- function(form, nzdf, names, title) {  
# get the contingency table from NPS 
  tmp = nzTable(form, nzdf, F)
  tmp = nzSparse2matrix(na.omit(as.data.frame(tmp$tab)))
  tmp = tmp[order(as.numeric(names(tmp)))] 
  # add names 
  names(tmp) = names 
  # prepare graphical layout 
  par(xpd=F, mar=c(5,2,2,2))
  layout(1:2, widths=c(1,1), heights=c(4,1)) 
  # plot the variable summaries 
  dotchart(tmp, xlim=c(0,2000), pch=19, main=title, xlab="number of surveys")
  par(xpd=F, mar=c(1,2,2,2))
  barplot(as.matrix(tmp), horiz=T, xaxt="n")
  par(xpd=NA)
  text(cumsum(tmp),1.5,names(tmp), adj=c(1,1), cex=0.8)
}

# connect to database and create pointer to table survey
nzConnect("user","password","10.1.1.74","witl") 
nzSurvey  = nz.data.frame("survey") 
# show summaries for question 6 from survey 
getSimpleSummary(~Q6, nzSurvey, 
      c("Under 18", "18-25", "26-35", "36-45", "46-55", "Older than 55"),   
      "How old are you?") 
# show summaries for question 7 from survey 
getSimpleSummary(~Q7, nzSurvey, 
      c("Less than 1 hour", "1-2 hours", "2-4 hours", "4-6 hours", "6-8 hours", 
        "8-10 hours", "More than 10 hours"), 
      "How much time do you spend on the Web each day?")
#
# for two variables
# relations between sex and computer skill level
tt = nzTable(Q5~Q8, nzSurvey, T)$mat
rownames(tt) = c("male","female")
mosaicplot(tt, col=rainbow(10),main="",ylab="Computer/web skill level",xlab="Gender")

plot(ecdf(as.numeric(rep(names(tt[1,]), tt[1,]))), main="", xlab="Computer/web skill level")
plot(ecdf(as.numeric(rep(names(tt[2,]), tt[2,]))), add=T, col="red")
legend("left", c("male","female"), col=c(1,2), lwd=2)




The resulting plots are presented below.


Figure for question 6

Figure for question 7