As the response variable we use the answer to the question "How would you describe your computer/web skill level?". It is measured on a 0-10 scale, so it is not a perfect candidate for Gaussian regression, but it is good enough for illustration purposes. As explanatory variables we use the answers to survey questions 2, 5 and 6. All three are qualitative and are coded in the regression model as dummy variables. More information about this dataset can be found at https://testpilot.mozillalabs.com/testcases/a-week-life-2/aggregated-data.html.
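As a minimal local illustration of the dummy coding (the values below are invented and are not part of the survey data), a variable wrapped in factor() is expanded by model.matrix() into 0/1 columns, one for each level except the reference level:

# toy example: how factor() leads to dummy (0/1) columns in the model matrix
q2 <- factor(c("NA", "only Firefox", "also other browser", "only Firefox"))
model.matrix(~ q2)
# every level except the reference one gets its own 0/1 column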
We use the nzLm() function from the nza package. This function is designed to deal efficiently with narrow datasets, i.e. datasets in which the number of columns/variables is small while the number of rows can be very large. In such cases the linear regression is performed by computing the dot (cross) product of the model matrix, and the estimates of the model coefficients are derived from it. Moreover, the dot product is much smaller than the dataset itself, so it can be downloaded from the database to the R client quite fast. The computation of the dot product is easy to parallelize, and here it is computed in parallel in NPS. It is also easy to compute dot products for bootstrapped samples: there is no need to materialize the bootstrap samples, it is enough to generate an appropriate vector of weights.
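The algebra behind this approach can be sketched locally in plain R. This is only an illustration of the idea, not the nza implementation, and it uses the built-in mtcars data instead of the survey table: for a model matrix X and response y, the cross products X'X and X'y have dimensions p x p and p x 1 regardless of the number of rows, and the least-squares estimates follow from the normal equations.

# sketch of the cross-product idea on an ordinary data frame; the in-database
# version computes XtX and Xty inside NPS and ships only these small matrices
X <- model.matrix(~ factor(cyl) + factor(gear), data = mtcars)
y <- mtcars$mpg

XtX <- crossprod(X)      # t(X) %*% X, p x p
Xty <- crossprod(X, y)   # t(X) %*% y, p x 1
beta <- solve(XtX, Xty)  # least-squares estimates from the normal equations

# the estimates coincide with what lm() computes directly
cbind(beta, coef(lm(mpg ~ factor(cyl) + factor(gear), data = mtcars)))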
Two R packages are used to connect to the database and compute in-database aggregates: nzr and nza. The nzConnect() function is used to connect to the database witl. Then the nzLm() function is used to fit the linear model in NPS. The qualitative variables are marked by wrapping them in the factor() function. The nzLm() function returns estimates for the considered variables together with some model selection criteria such as AIC and BIC.
> nzConnect("user","password","10.1.1.74","witl")
> nzSurvey = nz.data.frame("survey")
> nzLm(Q8~factor(Q2)+factor(Q5)+factor(Q6), nzSurvey)
Coefficients:
Estimate Std.Error t.value p.value
Q2.1 -0.7569315 0.13931738 -5.433144 5.860745e-08
Q2.2 1.7717458 0.06072839 29.174920 0.000000e+00
Q2.3 2.1566500 0.05892490 36.599980 0.000000e+00
Q5.1 0.8208645 0.08000862 10.259701 0.000000e+00
Q5.2 1.7161506 0.04135033 41.502703 0.000000e+00
Q5.3 0.6344492 0.05769266 10.997053 0.000000e+00
Q6.1 -2.0493290 0.16116496 -12.715723 0.000000e+00
Q6.2 0.8444226 0.05239913 16.115202 0.000000e+00
Q6.3 1.1582426 0.04021557 28.800850 0.000000e+00
Q6.4 1.3361322 0.04133580 32.323849 0.000000e+00
Q6.5 1.2271135 0.04891489 25.086708 0.000000e+00
Q6.6 0.5027933 0.06100118 8.242354 2.220446e-16
Q6.7 0.1520891 0.06696563 2.271151 2.318983e-02
Intercept 3.1714643 0.03389847 93.557743 0.000000e+00
Residual standard error: 14229.16 on 4070 degrees of freedom
Log-likelihood: -8339.181
AIC: 16700.36
BIC: 16686.68
By specifying the argument weights=0 in the nzLm() function, the dot product is computed in bootstrap mode. Repeating this B times yields a bootstrap distribution of the model coefficients, from which confidence intervals can be derived. In the example below the bootstrap distribution is presented graphically with the boxplot() function.
B = 100
coef = matrix(0, 7, B)
# each call with weights=0 computes the dot product for one bootstrap sample
for (i in 1:B) {
  coef[,i] = nzLm(Q8~factor(Q2)+factor(Q5), nzSurvey, weights=0)$coefficients
}
rownames(coef) = c("Q2.1", "Q2.2", "Q2.3", "Q5.1", "Q5.2", "Q5.3", "Intercept")
# boxplots of the bootstrap distributions of the coefficients
par(mar=c(3,7,2,2))
boxplot(t(coef), horizontal=TRUE, las=1)
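To see why a vector of weights is sufficient for the bootstrap, note that resampling n rows with replacement is equivalent to drawing multinomial counts and using them as case weights: the weighted least-squares point estimates coincide with an ordinary fit on the duplicated rows. A local sketch with lm() and the mtcars data (again only an illustration of the weight trick, not the nza code):

# one bootstrap replicate without materializing a resampled dataset:
# multinomial counts play the role of case weights
set.seed(1)
n <- nrow(mtcars)
w <- as.vector(rmultinom(1, size = n, prob = rep(1/n, n)))
fit_w <- lm(mpg ~ factor(cyl) + factor(gear), data = mtcars, weights = w)
coef(fit_w)  # bootstrap replicate of the coefficient estimates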
In this example the following levels of question Q2 appear in the model: NA (1), I use only Firefox (2), I also use another browser (3). Users with the highest declared skill level use Firefox together with some other browser. For question Q5 the levels in the model are: NA (1), male (2), female (3). Here males declare themselves as experts more often.
