Sunday, 5 December 2010

The first example: visualizations for a single qualitative variable


In the following example, the dataset "A Week in the Life of a Browser" is used to show how to prepare a simple data visualization with i-Class and NPS (Netezza Performance Server, see www.netezza.com). i-Class is used in version LA1 (Limited Availability v1). There is no guarantee that this code will work in later releases of i-Class.

The dataset "A Week in the Life of a Browser" was published by Mozilla Labs and is available under the terms of a Creative Commons license. The dataset and its description are available here: https://testpilot.mozillalabs.com/testcases/a-week-life-2/aggregated-data.html

We use only the R interface, so no SQL code is needed to prepare the following figures. Note that the data are stored on a remote server and are too large to download to the R client. All aggregates are therefore computed in NPS, and only small summaries are downloaded to the R client to prepare the figures.

Two R packages are used to connect to the database and compute in-database aggregates: nzr and nza. The nzConnect() function is used to connect to the database witl. There are three tables in this database: users, events and survey.
They are described here: https://testpilot.mozillalabs.com/testcases/a-week-life-2/aggregated-data.html. The nzTable() function is then used to compute a contingency table in NPS; this table is downloaded and used to prepare the plot.
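
For readers without access to NPS, the kind of small summary that reaches the R client can be illustrated with base R alone. The sketch below is only a local stand-in: it uses a tiny made-up data frame (the values are hypothetical, not taken from the survey) and base R's table() in place of nzTable(), and it does not use nzr or nza.

```r
# Hypothetical answers to one survey question, coded 0..5 (not real data)
answers <- data.frame(Q6 = c(1, 2, 2, 3, 1, 2, 5, 0, 2, 3))

# Local stand-in for the in-database contingency table: counts per answer code
counts <- table(answers$Q6)

# Convert to a named numeric vector ordered by the numeric answer code
counts <- setNames(as.numeric(counts), names(counts))
counts <- counts[order(as.numeric(names(counts)))]

print(counts)
```

The point of the in-database approach is that only a vector like counts, with one entry per answer code, has to travel to the client, regardless of how many rows the table holds on the server.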



# simple function which summarizes a single qualitative variable
#   form is an R formula with one variable
#   nzdf is an nz.data.frame object
#   labels is a vector of labels for the corresponding values of the variable
#   title is the main title of the produced figure
getSimpleSummary <- function(form, nzdf, labels, title) {
  # get the contingency table from NPS
  tmp = nzTable(form, nzdf, FALSE)
  # drop rows with missing values and convert the sparse table to a matrix
  tmp = nzSparse2matrix(na.omit(as.data.frame(tmp$tab)))
  # order the counts by the numeric code of each answer
  tmp = tmp[order(as.numeric(names(tmp)))]
  # add labels
  names(tmp) = labels
  # prepare graphical layout: dot chart on top, stacked bar below
  par(xpd=FALSE, mar=c(5,2,2,2))
  layout(1:2, widths=c(1,1), heights=c(4,1))
  # plot the variable summaries
  dotchart(tmp, xlim=c(0,2000), pch=19, main=title, xlab="number of surveys")
  par(xpd=FALSE, mar=c(1,2,2,2))
  barplot(as.matrix(tmp), horiz=TRUE, xaxt="n")
  # allow the labels to be drawn outside the plot region
  par(xpd=NA)
  text(cumsum(tmp), 1.5, names(tmp), adj=c(1,1), cex=0.8)
}

# connect to the database and create a pointer to the survey table
nzConnect("user","password","10.1.1.74","witl")
nzSurvey = nz.data.frame("survey")
# show summaries for question 6 from survey 
getSimpleSummary(~Q6, nzSurvey, 
      c("Under 18", "18-25", "26-35", "36-45", "46-55", "Older than 55"),   
      "How old are you?") 
# show summaries for question 7 from survey 
getSimpleSummary(~Q7, nzSurvey, 
      c("Less than 1 hour", "1-2 hours", "2-4 hours", "4-6 hours", "6-8 hours", 
        "8-10 hours", "More than 10 hours"), 
      "How much time do you spend on the Web each day?")
#
# for two variables
# relation between gender and computer skill level
tt = nzTable(Q5~Q8, nzSurvey, TRUE)$mat
rownames(tt) = c("male","female")
mosaicplot(tt, col=rainbow(10), main="", ylab="Computer/web skill level", xlab="Gender")

# empirical distribution functions of skill level, one curve per gender
plot(ecdf(as.numeric(rep(names(tt[1,]), tt[1,]))), main="", xlab="Computer/web skill level")
plot(ecdf(as.numeric(rep(names(tt[2,]), tt[2,]))), add=TRUE, col="red")
legend("left", c("male","female"), col=c(1,2), lwd=2)
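
The two ecdf() calls above rely on expanding a row of counts back into individual observations with rep(). A minimal base-R sketch of that step follows; the counts are made up for illustration (hypothetical, not survey results), and the variable names are chosen here only for the example.

```r
# Hypothetical counts for skill levels 1..4 (illustrative only, not survey data)
counts <- c("1" = 2, "2" = 3, "3" = 4, "4" = 1)

# Repeat each level as many times as it was observed
levels_expanded <- as.numeric(rep(names(counts), counts))

# Empirical cumulative distribution function of the expanded values
skill_cdf <- ecdf(levels_expanded)

# Proportion of observations at or below level 2, i.e. (2 + 3) / 10
skill_cdf(2)
```

This works because the contingency table is small: expanding it produces one value per survey response, not one per row of the events table, so it stays cheap on the client.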




The resulting plots are presented below.


Figure for question 6

Figure for question 7