As the response variable we use the answer to the question "How would you describe your computer/web skill level?". It is measured on a 0-10 scale, so it is not a perfect candidate for Gaussian regression, but it is good enough for illustration purposes. As explanatory variables we use the answers to survey questions 2, 5 and 6. All three are qualitative and are coded in the regression model as dummy variables. More information about this dataset can be found at https://testpilot.mozillalabs.com/testcases/a-week-life-2/aggregated-data.html.
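As a minimal local illustration of the dummy coding (the values below are invented and are not part of the survey data), a variable wrapped in factor() is expanded by model.matrix() into 0/1 columns, one for each level except the reference level:

# toy example: how factor() leads to dummy (0/1) columns in the model matrix
q2 <- factor(c("NA", "only Firefox", "also other browser", "only Firefox"))
model.matrix(~ q2)
# every level except the reference one gets its own 0/1 column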
We use the nzLm() function from the nza package. This function is designed to deal efficiently with narrow datasets, i.e. datasets in which the number of columns/variables is small while the number of rows can be very large. In such cases the linear regression is performed by computing the dot (cross) product of the model matrix, and the estimates of the model coefficients are derived from it. Moreover, the dot product is much smaller than the dataset itself, so it can be downloaded from the database to the R client quite fast. The computation of the dot product is easy to parallelize, and here it is computed in parallel in NPS. It is also easy to compute dot products for bootstrapped samples: there is no need to materialize the bootstrap samples, it is enough to generate an appropriate vector of weights.
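The algebra behind this approach can be sketched locally in plain R. This is only an illustration of the idea, not the nza implementation, and it uses the built-in mtcars data instead of the survey table: for a model matrix X and response y, the cross products X'X and X'y have dimensions p x p and p x 1 regardless of the number of rows, and the least-squares estimates follow from the normal equations.

# sketch of the cross-product idea on an ordinary data frame; the in-database
# version computes XtX and Xty inside NPS and ships only these small matrices
X <- model.matrix(~ factor(cyl) + factor(gear), data = mtcars)
y <- mtcars$mpg

XtX <- crossprod(X)      # t(X) %*% X, p x p
Xty <- crossprod(X, y)   # t(X) %*% y, p x 1
beta <- solve(XtX, Xty)  # least-squares estimates from the normal equations

# the estimates coincide with what lm() computes directly
cbind(beta, coef(lm(mpg ~ factor(cyl) + factor(gear), data = mtcars)))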
Two R packages are used to connect to the database and compute in-database aggregates: nzr and nza. The nzConnect() function is used to connect to the database witl. Then the nzLm() function is used to fit the linear model in NPS. The qualitative variables are marked by wrapping them in the factor() function. The nzLm() function returns estimates for the considered variables together with some model selection criteria such as AIC and BIC.
> nzConnect("user","password","10.1.1.74","witl")
> nzSurvey = nz.data.frame("survey")
> nzLm(Q8~factor(Q2)+factor(Q5)+factor(Q6), nzSurvey)
Coefficients:
Estimate Std.Error t.value p.value
Q2.1 -0.7569315 0.13931738 -5.433144 5.860745e-08
Q2.2 1.7717458 0.06072839 29.174920 0.000000e+00
Q2.3 2.1566500 0.05892490 36.599980 0.000000e+00
Q5.1 0.8208645 0.08000862 10.259701 0.000000e+00
Q5.2 1.7161506 0.04135033 41.502703 0.000000e+00
Q5.3 0.6344492 0.05769266 10.997053 0.000000e+00
Q6.1 -2.0493290 0.16116496 -12.715723 0.000000e+00
Q6.2 0.8444226 0.05239913 16.115202 0.000000e+00
Q6.3 1.1582426 0.04021557 28.800850 0.000000e+00
Q6.4 1.3361322 0.04133580 32.323849 0.000000e+00
Q6.5 1.2271135 0.04891489 25.086708 0.000000e+00
Q6.6 0.5027933 0.06100118 8.242354 2.220446e-16
Q6.7 0.1520891 0.06696563 2.271151 2.318983e-02
Intercept 3.1714643 0.03389847 93.557743 0.000000e+00
Residual standard error: 14229.16 on 4070 degrees of freedom
Log-likelihood: -8339.181
AIC: 16700.36
BIC: 16686.68
By specifying the argument weights=0 in the nzLm() function, the dot product is computed in bootstrap mode. Repeating this B times yields a bootstrap distribution of the model coefficients, from which confidence intervals can be derived. In the example below the bootstrap distribution is presented graphically with the boxplot() function.
B = 100
coef = matrix(0, 7, B)
# each call with weights=0 computes the dot product for one bootstrap sample
for (i in 1:B) {
  coef[,i] = nzLm(Q8~factor(Q2)+factor(Q5), nzSurvey, weights=0)$coefficients
}
rownames(coef) = c("Q2.1", "Q2.2", "Q2.3", "Q5.1", "Q5.2", "Q5.3", "Intercept")
# boxplots of the bootstrap distributions of the coefficients
par(mar=c(3,7,2,2))
boxplot(t(coef), horizontal=TRUE, las=1)
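To see why a vector of weights is sufficient for the bootstrap, note that resampling n rows with replacement is equivalent to drawing multinomial counts and using them as case weights: the weighted least-squares point estimates coincide with an ordinary fit on the duplicated rows. A local sketch with lm() and the mtcars data (again only an illustration of the weight trick, not the nza code):

# one bootstrap replicate without materializing a resampled dataset:
# multinomial counts play the role of case weights
set.seed(1)
n <- nrow(mtcars)
w <- as.vector(rmultinom(1, size = n, prob = rep(1/n, n)))
fit_w <- lm(mpg ~ factor(cyl) + factor(gear), data = mtcars, weights = w)
coef(fit_w)  # bootstrap replicate of the coefficient estimates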
In this example the following levels of question Q2 appear in the model: NA (1), I use only Firefox (2), I also use another browser (3). Users with the highest declared skill level use Firefox together with some other browser. For question Q5 the levels in the model are: NA (1), male (2), female (3). Here males declare themselves as experts more often.
