NMST539 | Lab Session 1

Multivariate data visualisation using R

Summer Term 2021/2022 | 15/02/22

sourse file (UTF8 encoding)

Program R (GNU GPL public licence) can be downloaded from

https://www.r-project.org

There are suitable software distributions for OS Windows, Linux, as well as OS X (Macintosh}.

Additional packages/libraries can be found at the repositary webpage https://cran.r-project.org/mirrors.html. However, packages are usually programmed by ordinary R users and, therefore, some caution is always needed when using such libraries and the commands implemented in them. The functions/commands that we use in R can not serve as some software black boxes but, instead, the idea will be always placed on the underlying theoretical background (statistical theory and the probabilistic model) which should be always fully understood.

User-friendly interfaces for the R software can be used in addition to simlify the work. One of the most common interface for R is the RStudio (supports, again, OS Windows, Linux, and OS X/Macintosh).

Usefull references for statistical analyses in R (in Czech/in English)

  • Bína, V., Komárek, A. a Komárková, L.: Jak na jazyk R. (PDF súbor)
  • Komárek, A.: Základy práce s R. (PDF súbor)
  • Kulich, M.: Velmi stručný úvod do R. (PDF súbor)
  • De Vries, A. a Meys, J.: R for Dummies. (ISBN-13: 978-1119055808)


1. Grafical tools in R

Three categories in particular:

  1. Functions opening a new graphical device
    Each function from this category opens a new graphical device (window) which can be later used to add some addtitional elements into it.

    For instance: plot(), dotchart(), hist(), barplot(), pie(), boxplot(), pairs(), atď.;

  2. Functions that add elements into an active graphical window
    Functions from this category only work if there is already some graphical window openned (by using some function from the previous category). The functions can be used to add some additional plots/elements into an existing plot.

    For instance: points(), lines(), abline(), text(), polygon(), rect(), box(), segments(), arrows(), legend(), atď.;

  3. Functions meant for plots/graphs manipulation
    Functions from this category are primarily used for some manipulation with the existing/complete plot outputs – such as saving or organizing them into some structure.

    For instance: dev.new(), X11(), dev.off(), postscript(), pdf(), png(), par(), layout(), atď.;

It is always good to use the R help which is always available for each R function. For each command, the corresponding help can be obtained by using the command itself given right after the question mark. For instnace, for the help for the command plot() one needs to type ?plot.

Scatterplots

Simple, easily understandable 2-dimensional plots. However, it does not necessarily mean that two different covariates (two sets of information) can be only used to draw a scatterplot. Multivariate (three and more) dimensional scatterplots are, however, not recommended as human eyes are not well equiped for reading and understanding information in higher dimensions.

For illustration puposes we will use the dataset mtcars which is standardly available in the R software (uppon installation).

head(mtcars)
dim(mtcars)
summary(mtcars)

Simple scatterplot (using two covariates only) and the default version of the command plot():

The graph above contain only very limited information. For instance, it is not even clear from the plot itself what type of information is given in the figure. The golden rule says that each figure, together with its caption (which should be necessary), must contain all relevant information needed to fully understand the plot. Thus, lets try another plot – containing the same information – and much more:

x <- mtcars[order(mtcars$mpg),] # sort by mpg
x$cyl <- factor(x$cyl) # it must be a factor
x$color[x$cyl==4] <- "red"
x$color[x$cyl==6] <- "blue"
x$color[x$cyl==8] <- "darkgreen"  
dotchart(x$mpg,labels=row.names(x),cex=.7,groups= x$cyl,
    main="Gas Milage for Car Models by cylinders",
   xlab="Consumption [Miles Per Gallon]", gcolor="black", color=x$color, pch = x$am)

c4 <- mean(mtcars$mpg[mtcars$cyl == 4])
c6 <- mean(mtcars$mpg[mtcars$cyl == 6])
c8 <- mean(mtcars$mpg[mtcars$cyl == 8])

lines(c(c8, c8), c(1, 14), col = "darkgreen", lwd = 2)
lines(c(c6, c6), c(17, 23), col = "blue", lwd = 2)
lines(c(c4, c4), c(26, 36), col = "red", lwd = 2)

Alternatíve with a matrix of scatterplots given by the command pairs():

pairs(mtcars[,c(1,3,4,6)], col=mtcars$cyl, pch = mtcars$am)

Or, instead, we can try some plots which already implicitly include some statistical methods (e.g., estimation of the densities or the corresponding regression lines between the given pairs) may be obtained by using the additional package library(psych) and the command pairs.panels():

library(psych)
pairs.panels(mtcars[,c("mpg", "disp", "hp", "wt")], 
             method = "pearson", # correlation method
             hist.col = "#00AFBB",
             density = TRUE,  # show density plots
             ellipses = TRUE # show correlation ellipses
             )

There are of course many options how to create a good plot/graph using the R software. The main goal should be to offer an easily understandable graph with all relevant information being clearly contained in it. This also includes a proper description of the axes, indlucind the unites (for instance, meters for distances, liters for volumes, etc.).

An example of a plot with two different \(y\) axes:

par(mar = c(5, 4, 4, 4) + 0.3)
x <- mtcars$wt
plot(mtcars$mpg ~ x, data = mtcars, pch = 21, bg = "lightblue", cex = 0.8, xlim = c(1, 5.5), xlab = "Total weight [1000 lbs]", ylab = "")
lines(lowess(mtcars$mpg ~ mtcars$wt), col = "black", lwd = 2)

mtext("Consumption [miles per gallon]" ,side=2,line=3, col = "black")
axis(2,col="black",col.axis="black")

par(new=TRUE)


plot(mtcars$hp ~ x, axes = FALSE, bty = "n", xlab = "", ylab = "", type = "p", col ="red", ylim = c(min(mtcars$hp), max(mtcars$hp)), lwd = 2, cex = 0.5, xlim = c(1, 5.5))
lines(lowess(mtcars$hp ~ x), col = "red", lwd = 2)
axis(4,col="red" ,col.axis= "red")

mtext("Horse power",side=4,line=3, col = "red")

for (i in 1:50){
  lines(c(1 + i * 4.5/50, 1 + i * 4.5/50), c(50, 340), col = "gray")
}

Sometimes, a very simple plot may just do the job which is needed… what can be seen in the following plot for instance?

library(MASS)
library("RColorBrewer")
colors <- brewer.pal(5, "Reds")

z <- kde2d(mtcars[,1], mtcars[,4], n=5)

plot(mtcars[,c(1,4)], xlab="Miles per Galon", ylab="Horse Power", pch=21, bg = "yellow")
contour(z, drawlabels=FALSE, nlevels=4, col=colors, add=TRUE)

The library MASS above is needed to calculate the kernel estimate for the two dimensional density (command kde2d()).
In addition, the library RColorBrewer provides the colour scale for differed colors of the countours.

In practical applications the plots should always correspond with the question of interest and the main scientific hypothesis. If we are interested in a car consumption given the number of cylinders, mostly it will be completely irrelevant to give a plot of the car weights given the number of transmission gears.

Boxplots

Scatterplots are used for the visualisation of the raw data points. Instead, boxplots are used to visualise some useful descriptive characteristics – sample versions of some theoretical properties – such as the sample median/mean, sample variance/tandard error, sample quartiles, etc.

The boxplot figure must be understood properly. It is fundamental for the interpretation.

boxplot(mpg ~ cyl, col = brewer.pal(3, "Blues"), xlab = "number of cylinders", ylab = "miles per galon", main= "")

Distinguishing for two different covariates (pay attention on the x axis):

boxplot(mpg ~ cyl + am, col = c(brewer.pal(3, "Greens"), brewer.pal(3, "Blues")), xlab = "No. of cylinders(.)Auto Transmission Indicator")

Or a sophisticated boxplot versions with the consumption given two covariates together with a 95% confidence interval for the true (unknown) median value (for notched = TRUE).

boxplot(mpg ~ cyl * am, col = c(brewer.pal(3, "Greens"), brewer.pal(3, "Blues")), xlab = "Number of Cylinders", ylab = "Miles per Galon | Automatic Transmission", main= "", notch = T)

Alternative histograms showing also the estimated density (using command vioplot() from the library vioplot).

library(vioplot)
x1 <- mtcars$mpg[mtcars$cyl==4]
x2 <- mtcars$mpg[mtcars$cyl==6]
x3 <- mtcars$mpg[mtcars$cyl==8]
vioplot(x1, x2, x3, names=c("4 cyl", "6 cyl", "8 cyl"), col="gold")
## [1] 10.4 33.9
title("Violin Plots of Miles Per Gallon")

Similar boxplot can by also obtained by using the command pirateplot() and the library yarrr. For a detailed interpretation of the boxplot description, see manuál.

#install.packages("yarrr")
library("yarrr")
pirateplot(mpg ~ cyl, data = mtcars, xlab = "number of cylinders", ylab = "miles per galon", main= "", inf.method = "iqr")

THe same command, however, with the confidence intervals instead of the sample quartiles: (see again the manuál).

pirateplot(mpg ~ cyl, data = mtcars, xlab = "number of cylinders", ylab = "miles per galon", main= "", inf.method = "ci")



Parallel Coordinate Plots

This type of plots may reveal some dependence form between various covariates which are given in the dataset. Again, the interpretation of the plot may be tricky in some sense but, sometimes, very useful information can be taken from the parallel coordinate plots.

library(MASS)
attach(mtcars)
colorVector <- rep("brown", dim(mtcars)[1])
colorVector[am == 0] <- "brown1"
colorVector[cyl == 6] <- "blue4"
colorVector[cyl == 6 & am == 0] <- "blue1"
colorVector[cyl == 4] <- "darkgreen"
colorVector[cyl == 4 & am == 0] <- "green"
parcoord(mtcars, col = colorVector, lty = gear)

How would you interpret the plot above? What would be your conclussion?



Library ggplot2

A very popular library for graphical outputs in R – the R package ggplot2:

install.packages("ggplot2")
library(ggplot2)

The main plotting function is qplot() which, unlike the standard R command plot(), offers many more options how to create the final graph (see the help ?qplot()).

Let us slightly prepare the data from the mtcars object:

mtcars$gear <- factor(mtcars$gear,levels=c(3,4,5), labels=c("3gears","4gears","5gears")) 
mtcars$am   <- factor(mtcars$am,levels=c(0,1), labels=c("Automatic","Manual")) 
mtcars$cyl  <- factor(mtcars$cyl,levels=c(4,6,8),  labels=c("4cyl","6cyl","8cyl"))

Density estimates for the car consumption given the number of geers (using qplot()):

qplot(mpg, data=mtcars, geom="density", fill=gear, alpha=I(.5), main="Distribution of Gas Milage", xlab="Miles Per Gallon",  ylab="Density")

The same function/command can be also used to get boxplots (using the additional parameter geom = "..."):

qplot(gear, mpg, data=mtcars, geom=c("boxplot", "jitter"), fill=gear, main="Mileage by Gear Number", xlab="", ylab="Miles per Gallon")

Or even regression lines:

qplot(wt, mpg, data=mtcars, geom=c("point", "smooth"), method="lm", formula=y~x, color=cyl, main="Regression of MPG on Weight", xlab="Weight", ylab="Miles per Gallon")



Useful hint


The library ggplot2 uses multiple layers to put together the final graphical output. The layers can be combined together using different commands, for instance, geom_points(), geom_abline(), geom_area() geom_bar(), geom_contour(), geom_polygon(). (see a detailed list at http://sape.inf.usi.ch/quick-reference/ggplot2/geom).

A simple illustration for multiple layers in one plot:

dataSum <- plyr::ddply(mtcars, "gear", plyr::summarise, mean = mean(mpg), sd = sd(mpg))

ggplot() +
  geom_point(data = mtcars, aes(x = gear, y = mpg)) +
  geom_point(data = dataSum, aes(x = gear, y = mean), colour = 'red', size = 3) +
  geom_errorbar(data = dataSum, aes(x = gear, y = mean,  ymin = mean - sd, ymax = mean + sd), colour = 'red', width = 0.4) +
  ggtitle("Car consumption summary given the number of gears")






In general, there are very many options in R to produce nice and (information) efficient plot/graph. All possibilities go way beyond the scope of this lab session. Therefore, we can only refer a very nice manual with many practical examples which can be all found here

https://www.r-graph-gallery.com/ and also here https://statisticsglobe.com/graphics-in-r.



2. Reporting results with Sweave and Knit

Important part of each statistical analysis is the result reporting which also requires some substantial time amount. An appropriate form should be always used. There are some straightforward options in the R software. For instance, one can easily obtain a (very simply reconstructable) PDF report with all important analysis, relevant results, and the corresponding interpretation using the LaTeX and the R library Sweave. Alternatively, one can report the results in terms of some www document (HTML file) using the R library Knit. The last one can be actually also used to create a PDF file.

An illustrative guide can be found here:

https://support.rstudio.com/

The library/command Sweave enables cooperation between the R software and LaTeX. Analogously, the library/command Knit integrates the R software and the HTML code.

The main principle relies on the tex source file (for Sweave) or the html source file (for Knit), which are saved as *.Rnw (Sweave) or *.Rmd (Knit) types.

In addition to a standard tex code (Sweave) or the html code (Knit) one can also use a (properly specified) R software code embeded in specific symbols. When applying the command Sweave("file_name.Rnw") or Knit(file_name.Rmd) the R software reads the whole document firstly, the corresponding parts with the R code are evaluated and the corresponding output is plugged in instead. Consequently the R software uses LaTeX (Sweave) or HTML compiler (Knit) to produce the final PDF/HTML document/file.

The whole process simplifies when one uses the interface RStudio

https://www.rstudio.com

where Sweave and Knit buttoms are available by default (they appear according to the type of the loaded source file – *.Rnw (Sweave) or *.Rmd (Knit). A detailed guide can be found where:

http://gosset.wharton.upenn.edu/



Sweave and Knit

  • Illustrative example for the Rnw file:
    Download Rnw file (Sweave | UTF 8 encoding)
    The final PDF document is obtained when using the Compile PDF buttom in the RStudio interface – see the figure below.
  • Illustrative example for the Rmd file:
    Download Rnw file (Knit | UTF 8 encoding)
    The final HTML document is obtained when using the Knit HTML buttom in the RStudio interface – viď the figure below.











Individual task assignment

(Deadline: Second lab session / 22.02.2022)

A proper ellaboration of all partial tasks given at the end of each HTML Markdown (each week) is an obligation for all students who intend to obtain the final credit at the end of the term.

  • Use the guide on the web page of the K10 lecture room and create your own (simple) web page where you will publish all your solutions of the individual tasks assigned during the term (technically, it is sufficient to create one file called index.html which will be located in the directory public_html in your home directory).

    The file index.html can be edited using an arbitrary text editor or, instead, you can also edit the file directly in the R software (using the Knit library). Your web page link must be sent by emial to maciak [AT] karlin.mff.cuni.cz and also to hlavka [AT] karlin.mff.cuni.cz.

  • If you are interested, the page access can be also restricted by a password using htaccess. In such case, you should also provide the corresponding password when sending the link according the the item above.

Use the information provided on the lecturer’s website to install the R library SMSdata with various datasets inclded in this packages. From all the datasets available in the library SMSdata

data(package = "SMSdata") 

choose one specific dataset (free choice, there are 21 of them in total). The choosen dataset can be loaded into the R environment by using the command data(dataset_name), for example,

data(plasma) 
  • Choose one dataset from the SMSdata library and load into R;
  • Use some proper graphical tools to visualize (primarily the multivariate character of the) data;
  • Upload at least two figures on your web site. Add some reasonable interpretation/comments and discuss (at least briefly) the multivariate nature of the data using the created plots;
  • The web site link with the figures and comments should be sent by email to maciak [AT] karlin.mff.cuni.cz and hlavka [AT] karlin.mff.cuni.cz on Tuesday, 21.02.2022, at 12:00 at latest.