2 Data wrangling and visualisation

2.1 Introduction

In the second part of this workshop we deal with the handling and visualisation of (complex) data. You are expected to feel comfortable with basic R and, specifically, the use of data.frame objects. If not, please ask one of the demonstrators.

At this point we are also saying goodbye to much of R’s basic (i.e. inbuilt) functionalities for plotting and manipulating data and introduce you to the tidyverse, which is a suite of packages for performing a (growing) number of data science tasks. Here, we will introduce key tidyverse packages, including dplyr and ggplot2, and show how they can be used to efficiently process and visualise complex data.

Although each package can be installed and loaded separately, they are designed to work together. An easier approach is to simply install and load the tidyverse directly.

To install tidyverse, use:

install.packages("tidyverse")

and once installed, it can be loaded using:

library(tidyverse)


2.2 Loading and saving data

An essential part of data handling is the ability to store and read (externally stored) data. There are myriad ways this can be done; here we briefly introduce two of them:

  • saving data as an R object with saveRDS to create an Rds file and loading it with readRDS
  • saving data as a comma-separated values (CSV) file and loading it with read.table or read_csv

Note: tidyverse also provides functions to handle Microsoft Excel files. For example, read_excel() lets you import xls or xlsx data files. However, because Excel often wreaks havoc with data (mostly by silently changing data types), we urge caution when handling Excel files.

The main difference between the two approaches is that an Rds file stores data in a (R specific) binary format, meaning that you need R to read it. CSV files, in comparison, are (human readable) text files that can be opened by a number of different programs. In fact, on most operating systems, these are opened by MS Excel or a text editor by default. Which approach should you use? This often depends on whether you are the only one working with the data or whether you like the added benefit of viewing (or manipulating) data outside the R environment.

2.2.1 Save / load data as an R object

As a first step we need to create some data that we wish to store for future use. As before, we deal almost exclusively with data.frames. Recall the simple data.frame we generated in the first session

df <- data.frame(ParticipantID = c('ID001', 'ID002', 'ID003', 'ID007', 'ID009'),
                 Age = c(12, 8, 9, 7, 11),
                 Episode = c('y', 'n', 'n', 'y', 'y'))

To save this data as an R object, simply call

saveRDS(df, file = 'myData.Rds')

Note: the file will be saved in your current working directory. If you want to save it in a different folder you need to provide the full file path, e.g. saveRDS(df, file = '.../StatsWithR/Data/myData.Rds').

Loading the data is equally simple:

# first we remove the df object to make sure everything works
rm(df)

# load the data
readRDS('myData.Rds')
##   ParticipantID Age Episode
## 1         ID001  12       y
## 2         ID002   8       n
## 3         ID003   9       n
## 4         ID007   7       y
## 5         ID009  11       y

At first sight this looks like it has worked. However, you might also have noticed that no new object has appeared in your global environment. That is, although R successfully loaded the data, the result was not assigned to an object and is essentially lost (this is one of the key differences between save()/load() and saveRDS()/readRDS(): readRDS() returns the data rather than restoring it under its original name). We therefore need to assign the read-in data to an object

# read data and assign to an object / data.frame called myData
myData <- readRDS('myData.Rds')

# see if it works
head(myData)
##   ParticipantID Age Episode
## 1         ID001  12       y
## 2         ID002   8       n
## 3         ID003   9       n
## 4         ID007   7       y
## 5         ID009  11       y

2.2.2 Save / load as a CSV file

One of the advantages of saving data as an Rds file is that it is more space efficient. By default, saveRDS compresses the data, which can significantly shrink the size of the stored file, although this can make subsequent loading a bit slower. A more portable format, however, is a CSV file.

There are two equivalent function pairs for saving / loading CSV files: base R's write.table()/read.table() and readr's write_csv()/read_csv() (the readr functions are the faster tidyverse equivalents). Here we show you how to use the first pair, but please familiarise yourself with the other one as well.
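For reference, the readr pair can be used as follows (a minimal sketch, assuming the tidyverse is loaded and a data.frame called myData exists; the file name myData_readr.csv is just a placeholder):

```r
# write_csv uses a comma by default and never writes row names
write_csv(myData, 'myData_readr.csv')

# read_csv returns a tibble and reports the guessed column types
myData2 <- read_csv('myData_readr.csv')
```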

First we save our data.frame myData as a comma separated file

# save data in CSV format
write.table(myData,  # data object
            file = 'myData.csv',  # file name
            sep = ',', # character used to separate columns
            row.names = F) # don't save row names as additional column

Check if this worked by importing the CSV file

# first remove the object from our environment
rm(myData)

# read in data
df <- read.table('myData.csv',  # file name to be read
                 header = TRUE, # treat first row as header / column names
                 sep = ',')

# print out first three rows
head(df, n = 3)
##   ParticipantID Age Episode
## 1         ID001  12       y
## 2         ID002   8       n
## 3         ID003   9       n

Note: here we used a comma (',') to separate individual data columns. Other commonly used characters are the semicolon (';') and the tab ('\t'). If no separator is explicitly provided via sep = ..., read.table defaults to splitting on any whitespace, which often leads to errors with CSV files. We therefore recommend always stating which character should be used - explicit is better than implicit!
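For example, the same data can be written and re-read using a semicolon separator (a quick sketch; myData2.csv is a hypothetical file name):

```r
# write using a semicolon as the column separator
write.table(df, file = 'myData2.csv', sep = ';', row.names = FALSE)

# read it back, stating the separator explicitly
df2 <- read.table('myData2.csv', header = TRUE, sep = ';')
```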

2.2.3 Handling Excel files

Although we strongly discourage the use of Excel for data handling, it is still one of the most common formats for storing data. We therefore briefly show you how to read Excel files using the readxl package, which needs to be loaded explicitly because it is not one of the core tidyverse packages loaded via library(tidyverse)

# load library
library(readxl)

# read in excel data file
irisDF <- read_excel('iris.xls')

# print first few rows
head(irisDF, n = 5)
## # A tibble: 5 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
## 1          5.1         3.5          1.4         0.2 setosa 
## 2          4.9         3            1.4         0.2 setosa 
## 3          4.7         3.2          1.3         0.2 setosa 
## 4          4.6         3.1          1.5         0.2 setosa 
## 5          5           3.6          1.4         0.2 setosa

2.2.4 Cheatsheet

For more information on how to import and export data, have a look at this cheat sheet, which provides an overview of all essential functions and their usage as a downloadable PDF.


2.3 Visualisation using ggplot

We start with a motivating example to show how using base R graphics can easily become cumbersome. For this we use the iris dataset, which we imported in the previous section. This dataset contains measurements of the sepal length, sepal width, petal length and petal width for 50 samples from each of three species of iris (setosa, versicolor and virginica).

Let’s say we are interested in the relationship between the length and the width of the petals. The first port of call is therefore to plot one against the other.

plot(x = irisDF$Petal.Width, y = irisDF$Petal.Length)

We can see that there is a clear relationship between those two variables. One other thing we might notice is that there are (at least) two clusters, and we might further suspect that this is because the data contain different species of iris. One way of checking this is to colour the points by species. Here we do this in a slightly convoluted way, but it serves the purpose of highlighting how quickly things can become complicated in base R.

The first thing we want to do is to subset the data. Say, for example, we are interested in the relationship between petal length and width in Iris setosa. In other words, we want to only plot the data for which Species=='setosa'.

plot(x = irisDF$Petal.Width[irisDF$Species=='setosa'], 
     y = irisDF$Petal.Length[irisDF$Species=='setosa'],
     col = 'blue', pch = 20)

We can now add points for samples from Iris versicolor in a similar fashion.

plot(x = irisDF$Petal.Width[irisDF$Species=='setosa'], 
     y = irisDF$Petal.Length[irisDF$Species=='setosa'],
     col = 'blue', pch = 20)
points(x = irisDF$Petal.Width[irisDF$Species=='versicolor'], 
       y = irisDF$Petal.Length[irisDF$Species=='versicolor'],
       col = 'red', pch = 20)

What has happened? The problem is that the axis limits are automatically set by the first plot call and not updated for any additional points added to the graph. We therefore need to set the limits manually to guarantee that all points are visible.

plot(x = irisDF$Petal.Width[irisDF$Species=='setosa'], 
     y = irisDF$Petal.Length[irisDF$Species=='setosa'],
     col = 'blue', pch = 20,
     xlim = c(0, 2.5),
     ylim = c(1,7),
     xlab = 'Petal width',
     ylab = 'Petal length')
points(x = irisDF$Petal.Width[irisDF$Species=='versicolor'], 
       y = irisDF$Petal.Length[irisDF$Species=='versicolor'],
       col = 'red', pch = 20)
points(x = irisDF$Petal.Width[irisDF$Species=='virginica'], 
       y = irisDF$Petal.Length[irisDF$Species=='virginica'],
       col = 'darkgreen', pch = 20)

The only thing missing now is a legend noting which colour belongs to which species.

plot(x = irisDF$Petal.Width[irisDF$Species=='setosa'], 
     y = irisDF$Petal.Length[irisDF$Species=='setosa'],
     col = 'blue', pch = 20,
     xlim = c(0, 2.5),
     ylim = c(1,7),
     xlab = 'Petal width',
     ylab = 'Petal length')
points(x = irisDF$Petal.Width[irisDF$Species=='versicolor'], 
       y = irisDF$Petal.Length[irisDF$Species=='versicolor'],
       col = 'red', pch = 20)
points(x = irisDF$Petal.Width[irisDF$Species=='virginica'], 
       y = irisDF$Petal.Length[irisDF$Species=='virginica'],
       col = 'darkgreen', pch = 20)
legend(x = 0, y = 7, 
       legend = c('setosa', 'versicolor', 'virginica'),
       pch = 20,
       col =  c('blue', 'red', 'darkgreen'))

Now we can see that there are indeed three clusters and also get a hint that the relationship between petal length and petal width is dependent on species. In the next part of this workshop we will show you how to test this statistically.

Although this shows that we can make highly customised graphs using base R graphics, the steps needed to create the above graph were far from straightforward. So let's try to do the same using ggplot2 instead.

ggplot(irisDF, aes(x = Petal.Width, y = Petal.Length, col = Species)) +
    geom_point()

So essentially with a one-liner we can create a highly complex graph that required no manual subsetting of the data or manipulation of the plot. We did not have to specify the legend or its position, either; this is all taken care of by the ggplot2 package.

Arguably, for most people ggplot2 is not very intuitive to start with and it can take some time to get used to. However, once you have understood the basics, it becomes a very logical and powerful way to create complex and highly customised graphs.

The ethos of ggplot2 is that plots can be broken down into different features, most notably:

  • data
  • aesthetic mapping
  • geometric object
  • faceting
  • scales
  • statistical transformations
  • coordinate system
  • position adjustments

The first three are the most essential ingredients of a plot created with ggplot2. The next one (faceting) is a commonly used method when dealing with categorical data and for creating stratified graphs. The scales feature is useful for translating data values to visual properties and includes, for example, log-transforms. The remaining features are only mentioned here and play no further part in this introductory workshop.

The easiest way to explain some of these features is to break the above code apart and add features one by one. Let's start with setting up the plot

ggplot(irisDF)

This creates an empty plot, but you will not get an error message - because you have done nothing wrong, you have just forgotten to add crucial ingredients!

Note: In comparison to base R, ggplot2 requires a data.frame (or tibble) object for plotting. As shown in the previous section, it is easy to manipulate most other data into an R data frame. We will revisit this at a later stage.

2.3.1 Aesthetics

The next key ingredient is how the data should be mapped onto the visual aesthetics of the plot. This was done through the aes(x = Petal.Width, y = Petal.Length, col = Species) argument to the ggplot() function. What we specify here is the x-coordinates to be Petal.Width, the y-coordinates to be Petal.Length, and the colour of the points to be related to Species. In general, aesthetics include:

  • position
  • colour / color / col (border or line color)
  • fill (inside color)
  • shape
  • linetype
  • size

Note that not all aesthetic arguments have to be provided. For example, leaving out the col = argument generates the following graph

ggplot(irisDF, aes(x = Petal.Width, y = Petal.Length)) +
    geom_point()

2.3.2 Geoms

The next ingredient is the geom, which defines the type of plot we want. In the above example we used geom_point(), which produces a scatterplot. Here are the most common geoms:

  • geom_point()
  • geom_jitter()
  • geom_line()
  • geom_bar()
  • geom_boxplot()
  • geom_violin()
  • geom_histogram()
  • geom_density()

Let’s try one of them: geom_boxplot(). This creates a box-and-whisker plot, which we have already come across. Say, for example, we are interested in the distribution of sepal lengths but want to stratify this by species.

ggplot(irisDF, aes(x = Species, y = Sepal.Length, fill = Species)) +
    geom_boxplot()

Nice! Notice that we did not have to specify the actual x-positions for the three species; instead, ggplot2 recognised that Species was not a numerical variable and treated it as categorical (even though we did not explicitly specify it as such).

As you will have noticed, ggplot2 builds up plots by adding together components. Here we first set up “global” options for the plot using the ggplot() function, where we also specified the plot aesthetics, e.g. through aes(x = Petal.Width, y = Petal.Length, col = Species). We then added (+) to this the type of plot we wanted, e.g. geom_point(). One of the advantages of ggplot2 is that it allows us to layer geoms and thus lets us build complex graphs in different ways. To demonstrate this, we are going to add the individual data points to the boxplot generated above using the geom_jitter() geom.

ggplot(irisDF, aes(x = Species, y = Sepal.Length, fill = Species)) +
    geom_boxplot() +
    geom_jitter(width = 0.1, height = 0, size = 0.5)

Task
Three extra arguments were provided to geom_jitter(). By changing their respective values, find out what they do.
These arguments control the horizontal spread of the data (width), the vertical 'noise' added to each data point (height) and the size of each point (size).


Task
Change the aesthetics setting such that the colour, and not the fill, are determined by Species. What do you observe?


2.3.3 Labels

ggplot2 generally does a good job of decorating the plot with the right axis labels. However, this still depends on the variable (i.e. column) names, and sometimes we want to add a bit more information, or change or even remove labels manually. This can easily be done by adding labels in a similar way to adding geoms and other features.

ggplot(irisDF, aes(x = Species, y = Sepal.Length, col = Species)) +
    geom_boxplot() +
    geom_jitter(width = 0.1, height = 0, size = 0.5) +
    labs(x = '',  # remove x label
         y = 'Sepal length',  # set y label
         col = 'Iris species', # set legend title
         title = 'Distribution of sepal length by species') # set plot title

Note that providing a legend in this case is not strictly necessary. Unfortunately, ggplot adds a legend by default. If you want to remove the legend, you have various options. The two most intuitive ones are

ggplot(irisDF, aes(x = Species, y = Sepal.Length, col = Species)) +
    geom_boxplot() +
    geom_jitter(width = 0.1, height = 0, size = 0.5) +
    labs(x = 'Iris species',
         y = 'Sepal length',
         title = 'Distribution of sepal length by species') +
    guides(col = "none")

or

ggplot(irisDF, aes(x = Species, y = Sepal.Length, col = Species)) +
    geom_boxplot() +
    geom_jitter(width = 0.1, height = 0, size = 0.5) +
    labs(x = 'Iris species',
         y = 'Sepal length',
         title = 'Distribution of sepal length by species') +
    theme(legend.position = "none")

As a quick aside, notice how we set the arguments in geom_jitter() not as aesthetics but as generic options applied to all the data. We can use this mixing of aesthetics and generic options to further customise our plot. For example, if we do not want the individual data points to be coloured by species but simply plotted in black, we can set this as a generic argument

ggplot(irisDF, aes(x = Species, y = Sepal.Length, col = Species)) +
    geom_boxplot() +
    geom_jitter(width = 0.1, height = 0, size = 0.5, col = 'black') +
    labs(x = 'Iris species',
         y = 'Sepal length',
         title = 'Distribution of sepal length by species') +
    theme(legend.position = "none")

Task
Produce a density plot for Sepal.Length using geom_density(), which only requires an x aesthetic, and fill by species. Now add an alpha channel (by adding an argument alpha = ... to the geom_density() function), which controls the opacity of the fill colour, ranging from 0 (completely transparent) to 1 (completely opaque).


2.3.4 Faceting

A very useful feature of ggplot is the ability to use faceting to display sub-plots according to some grouping variable. For example, let’s assume that we want to produce separate scatter plots of petal length vs. width for each of the three flower species. We can do this using faceting:

ggplot(irisDF, aes(x = Petal.Width, y = Petal.Length)) +
    geom_point() +
    facet_wrap(~ Species)

Note: there is also the facet_grid() option, which allows you to stratify your data by more than one (categorical) variable.
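As a brief sketch with the iris data (which only has one grouping variable), facet_grid() takes a rows ~ columns formula, where a dot means "no variable on this side":

```r
# one row of panels, with one column per species
ggplot(irisDF, aes(x = Petal.Width, y = Petal.Length)) +
    geom_point() +
    facet_grid(. ~ Species)
```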

We can use the same aesthetics settings as before, which are then applied to each of the facets.

ggplot(irisDF, aes(x = Petal.Width, y = Petal.Length, col = Species)) +
    geom_point() +
    facet_wrap(~ Species) +
    labs(x = 'Petal width',
         y = 'Petal length',
         col = 'Iris species')

Task
By default, ggplot will use the same x and y limits for each of the facets. This can be overwritten by explicitly setting the scales = argument, which must be set to one of “fixed” (default), “free_x”, “free_y”, or “free”. Explore what happens when you change this argument.


Task
Use the geom_histogram() function (geom) to produce a histogram of Sepal.Width stratified (faceted) by Species. Explore what happens when you set the extra arguments bins = ... (the number of bins) or binwidth = ... (the width of each bin).


2.3.5 Statistical transformation

A very useful feature of ggplot2 is that it allows us to easily add statistical transformations on top of our data. Here we briefly introduce two (related) methods for adding trend lines to a scatterplot via geom_smooth(), for which stat_smooth() is an alias.

ggplot(irisDF, aes(x = Sepal.Length, y = Sepal.Width)) +
    geom_point() +
    stat_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

By default, geom_smooth() adds a LOESS (locally estimated scatterplot smoothing) line together with a shaded area that highlights the confidence bounds (which can be turned off by setting se = FALSE).
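For example, to show the LOESS line without the confidence band:

```r
ggplot(irisDF, aes(x = Sepal.Length, y = Sepal.Width)) +
    geom_point() +
    stat_smooth(se = FALSE)  # suppress the shaded confidence bounds
```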

If we are more interested in a linear trend line, we have to explicitly set the method to lm.

ggplot(irisDF, aes(x = Sepal.Length, y = Sepal.Width)) +
    geom_point() +
    stat_smooth(method = 'lm') 
## `geom_smooth()` using formula = 'y ~ x'

Note: these trend lines can be added to our plot even if we do not need the scatterplot itself. That is, if, for some reason, you decide you do not want to show the actual data, you can run the above without geom_point() to only show the (linear) trend line.

Task
Produce a similar scatterplot as above but now stratified by Species (either faceted or simply coloured in a single graph). What do you observe?

2.3.6 Scales

Scales control the details of how data values are translated to visual properties and include features such as modifying axis limits and axis transformations. Here we only provide a brief example, but follow this link for a more in-depth tutorial.

To manually set axes limits we can either directly specify which axis we want to modify (e.g. through + xlim(min, max)) or go the more generic way such as

ggplot(irisDF, aes(x = Sepal.Length, y = Sepal.Width)) +
    geom_point() +
    stat_smooth(method = 'lm') +
    lims(x = c(3.8, 8.3),
         y = c(1.5, 5))
## `geom_smooth()` using formula = 'y ~ x'

Much more control and customisation of limits, axis labels, breaks (i.e. tick marks) etc. is provided by the scale_x_continuous() and scale_y_continuous() functions

scale_x_continuous(name, breaks, labels, limits, trans)
scale_y_continuous(name, breaks, labels, limits, trans)

For more information about these please have a look at the above mentioned tutorial or check R Help through the command ?scale_x_continuous.
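As a small sketch of these arguments in action (the axis label text and break positions are arbitrary choices for illustration):

```r
ggplot(irisDF, aes(x = Sepal.Length, y = Sepal.Width)) +
    geom_point() +
    scale_x_continuous(name = 'Sepal length (cm)',  # axis label
                       breaks = c(5, 6, 7, 8),      # tick mark positions
                       limits = c(4, 8))            # axis limits
```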

ggplot2 offers various inbuilt functions to perform axis transformations, including:

  • scale_x_log10(), scale_y_log10() : for log10 transformation
  • scale_x_sqrt(), scale_y_sqrt() : for sqrt transformation
  • scale_x_reverse(), scale_y_reverse() : to reverse coordinates

Here is an easy example showing how these can be mixed in a single plot

df <- data.frame(x = c(10, 100, 1000, 10000, 100000),
                 y = c(1, 4, 9, 16, 25))

ggplot(df, aes(x = x, y = y)) +
    geom_point() +
    scale_x_log10() +
    scale_y_sqrt() +
    stat_smooth(method = 'lm') 
## `geom_smooth()` using formula = 'y ~ x'

2.3.7 Assigning plots to objects

So far we have used a separate code block for each plot. One thing to remember is that R is an object-oriented language. This means that plots generated with ggplot can also be assigned to an object. This allows us to add extra layers / features to our plot without rewriting the entire code; here is an example.

p1 <- ggplot(df, aes(x = x, y = y)) +
    geom_point(size = 1.4, col = 'red')
p1

Next we add a line connecting the data points

p2 <- p1 + geom_line(linewidth = 0.3, col = 'blue')
p2

and finally let’s do some axis transformation

p2 + scale_x_log10() + scale_y_sqrt()

2.3.8 Putting it all together

As a final step we are going to show you how combining different aesthetics and features in ggplot2 allows you to make very complex - but both informative and beautiful - graphs. For this we will use a different dataset (LifeExpectancyData.csv), start with a simple plot of life expectancy against GDP and then add layers and modifications as we go along.

# load data
LifeExpectancy <- read.table('LifeExpectancyData.csv', sep = ',', header = T)

# have a quick glimpse at the data
head(LifeExpectancy)
##       Country Continent Year     Status Life.expectancy Adult.Mortality
## 1 Afghanistan      Asia 2015 Developing            65.0             263
## 2 Afghanistan      Asia 2014 Developing            59.9             271
## 3 Afghanistan      Asia 2013 Developing            59.9             268
## 4 Afghanistan      Asia 2012 Developing            59.5             272
## 5 Afghanistan      Asia 2011 Developing            59.2             275
## 6 Afghanistan      Asia 2010 Developing            58.8             279
##   infant.deaths Alcohol percentage.expenditure Hepatitis.B Measles  BMI
## 1            62    0.01              71.279624          65    1154 19.1
## 2            64    0.01              73.523582          62     492 18.6
## 3            66    0.01              73.219243          64     430 18.1
## 4            69    0.01              78.184215          67    2787 17.6
## 5            71    0.01               7.097109          68    3013 17.2
## 6            74    0.01              79.679367          66    1989 16.7
##   under.five.deaths Polio Total.expenditure Diphtheria HIV.AIDS       GDP
## 1                83     6              8.16         65      0.1 584.25921
## 2                86    58              8.18         62      0.1 612.69651
## 3                89    62              8.13         64      0.1 631.74498
## 4                93    67              8.52         67      0.1 669.95900
## 5                97    68              7.87         68      0.1  63.53723
## 6               102    66              9.20         66      0.1 553.32894
##   Population thinness..1.19.years thinness.5.9.years
## 1   33736494                 17.2               17.3
## 2     327582                 17.5               17.5
## 3   31731688                 17.7               17.7
## 4    3696958                 17.9               18.0
## 5    2978599                 18.2               18.2
## 6    2883167                 18.4               18.4
##   Income.composition.of.resources Schooling
## 1                           0.479      10.1
## 2                           0.476      10.0
## 3                           0.470       9.9
## 4                           0.463       9.8
## 5                           0.454       9.5
## 6                           0.448       9.2
# subset to only contain data from the year 2000
DF <- LifeExpectancy[LifeExpectancy$Year == 2000,]

# scatterplot of life expectancy against GDP
ggplot(DF, aes(x = GDP, y = Life.expectancy)) +
    geom_point() +
    ylab('Life expectancy (years)')

Note: you might get a warning about rows being removed due to missing values. We will get on to this in the next section. For now you can just ignore this.

We can see a clear relationship between life expectancy and GDP. Next we want to see if there is a difference by status of a country (developed or developing)

# scatterplot of life expectancy against GDP
ggplot(DF, aes(x = GDP, y = Life.expectancy, col = Status)) +
    geom_point() +
    ylab('Life expectancy (years)')

And again we see a clear pattern whereby developed countries appear to have, on average, a higher life expectancy. Next we can add an additional aesthetic, or stratification, changing the size of each data point according to the respective country’s population size.

# scatterplot of life expectancy against GDP
ggplot(DF, aes(x = GDP, y = Life.expectancy, col = Status, size = Population)) +
    geom_point() +
    ylab('Life expectancy (years)')

Although there seems to be a relationship, it is hard to see as most countries are squashed into the left-hand side of the graph. To the trained eye it also looks like the GDP data are not normally distributed (which in fact explains the clustering to the left), so we will log-transform the x-axis.

# scatterplot of life expectancy against GDP
ggplot(DF, aes(x = GDP, y = Life.expectancy, col = Status, size = Population)) +
    geom_point() +
    scale_x_log10() +
    ylab('Life expectancy (years)')

Finally, we might be interested in how life expectancy and its relationship with GDP and population size have evolved over time and across the five continents. Let’s say we want to compare the years 2000, 2005 and 2010.

# subset data to now contain information for 2000, 2005 and 2010
DF <- LifeExpectancy[LifeExpectancy$Year %in% c(2000, 2005, 2010),]

ggplot(DF, aes(x = GDP, y = Life.expectancy, col = Status, size = Population)) +
    geom_point() +
    facet_grid(Continent ~ Year) +
    scale_x_log10() +
    theme(legend.position = 'top')

NICE!

Task
Explore the LifeExpectancy dataset using the different features that ggplot has to offer. Can you find any interesting relationships in the data? If so, take note and revisit them when we enter the statistics part of this workshop.

2.4 Data wrangling

Data wrangling, or the process of cleaning and manipulating data for subsequent analysis, is a key component of modern statistical science and is becoming ever more important in the age of big (e.g. high-throughput) data. The tidyverse incorporates a suite of packages, such as tidyr and dplyr, that are designed to make common data wrangling tasks not only easier to achieve, but also easier to decipher. Readability of the code is at the core of the philosophy underpinning these packages.

As always, there are plenty of resources to find useful information about the various data handling functions in the different tidyverse packages, including the Data Import Cheat Sheet and Data Transformation Cheat Sheet.

If you haven’t done so already, we first need to load the tidyverse library as well as the life expectancy dataset we introduced earlier.

# load tidyverse
library(tidyverse)

# load dataset and assign to object LifeExp
LifeExp <- read.table('LifeExpectancyData.csv', sep = ',', header = T)

2.4.1 Tidy data

The architect behind the tidyverse, Hadley Wickham, distinguishes between two types of data set: tidy and messy. Rather than being pejorative towards different ways in which people store and visualise data, the point here is to make a distinction between a specific way of arranging data that is useful to most R analyses, and anything else. In fact Hadley has a neat analogy to a famous Tolstoy quote:

“Tidy datasets are all alike but every messy dataset is messy in its own way.” — Hadley Wickham

In this context, a tidy data set is one in which:

  • rows contain different observations
  • columns contain different variables
  • cells contain values

Let’s look at a simple example using the life expectancy data.

head(LifeExp)
##       Country Continent Year     Status Life.expectancy Adult.Mortality
## 1 Afghanistan      Asia 2015 Developing            65.0             263
## 2 Afghanistan      Asia 2014 Developing            59.9             271
## 3 Afghanistan      Asia 2013 Developing            59.9             268
## 4 Afghanistan      Asia 2012 Developing            59.5             272
## 5 Afghanistan      Asia 2011 Developing            59.2             275
## 6 Afghanistan      Asia 2010 Developing            58.8             279
##   infant.deaths Alcohol percentage.expenditure Hepatitis.B Measles  BMI
## 1            62    0.01              71.279624          65    1154 19.1
## 2            64    0.01              73.523582          62     492 18.6
## 3            66    0.01              73.219243          64     430 18.1
## 4            69    0.01              78.184215          67    2787 17.6
## 5            71    0.01               7.097109          68    3013 17.2
## 6            74    0.01              79.679367          66    1989 16.7
##   under.five.deaths Polio Total.expenditure Diphtheria HIV.AIDS       GDP
## 1                83     6              8.16         65      0.1 584.25921
## 2                86    58              8.18         62      0.1 612.69651
## 3                89    62              8.13         64      0.1 631.74498
## 4                93    67              8.52         67      0.1 669.95900
## 5                97    68              7.87         68      0.1  63.53723
## 6               102    66              9.20         66      0.1 553.32894
##   Population thinness..1.19.years thinness.5.9.years
## 1   33736494                 17.2               17.3
## 2     327582                 17.5               17.5
## 3   31731688                 17.7               17.7
## 4    3696958                 17.9               18.0
## 5    2978599                 18.2               18.2
## 6    2883167                 18.4               18.4
##   Income.composition.of.resources Schooling
## 1                           0.479      10.1
## 2                           0.476      10.0
## 3                           0.470       9.9
## 4                           0.463       9.8
## 5                           0.454       9.5
## 6                           0.448       9.2
Question
Is this data tidy?

\(~\)

In this workshop we will see various ways in which datasets can be manipulated to and from the tidy format.

2.4.2 Filter rows

The first data manipulation we could do is to filter the data, as we have done earlier when plotting the data only for the year 2000. In base R this worked as

LifeExp[LifeExp$Year == 2000,]

To do the equivalent in tidyverse, we use the filter() function

filter(LifeExp, Year == 2000)

Note, one of the most distinctive / useful features of tidyverse (or rather magrittr, which tidyverse/dplyr depends on) is the so-called piping operator, denoted as %>%. This operator allows you to chain one function after another without relying on nested parentheses or having to assign / store intermediate objects. What it does is evaluate the expression on the right-hand side (or next line) of the %>% (pipe) using the expression on the left (or same line) as the first argument. If this sounds confusing then fear not - all will become clearer once you see it in action! Using the same example as above we can get the same result with piping

LifeExp %>% filter(Year == 2000)

Initially this might look more complicated but you will soon see how this is a very useful operator indeed.
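Note that filter() also accepts several conditions at once, separated by commas, which are combined with a logical AND. Here is a minimal sketch using a small toy data frame (with made-up values) standing in for LifeExp:

```r
library(dplyr)

# toy stand-in for LifeExp (illustrative values only)
toy <- data.frame(Country   = c("A", "B", "C", "D"),
                  Year      = c(2000, 2000, 2005, 2000),
                  Continent = c("Africa", "Asia", "Africa", "Africa"))

# comma-separated conditions are combined with AND
toy %>% filter(Year == 2000, Continent == "Africa")
```

This keeps only the rows where both conditions hold (here, countries A and D).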

2.4.3 Sort rows

Another common operation is to sort rows according to some criterion. For example, we might wish to sort our data by Status and then Life.expectancy. In tidyverse this is done using the arrange() function.

arrange(LifeExp, Status, Life.expectancy)

or using the piping operator

LifeExp %>% arrange(Status, Life.expectancy)
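By default arrange() sorts in ascending order; to sort in descending order, wrap the column in desc(). A minimal sketch with toy data:

```r
library(dplyr)

toy <- data.frame(Country         = c("A", "B", "C"),
                  Life.expectancy = c(59.9, 72.3, 65.0))

# desc() reverses the sort order
toy %>% arrange(desc(Life.expectancy))
```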

2.4.4 Select columns

The next common operation is the selection of columns, which also includes deselection of particular columns. Let’s say we are interested only in the countries and their corresponding continent, life expectancy, GDP and population size. Before showing you how this can be done efficiently in tidyverse, here are some options for how to achieve this in base R

## option 1
LifeExp[, match(c("Country", "Life.expectancy", "GDP", "Population"), colnames(LifeExp))]

## option 2
cbind(Country = LifeExp$Country, Life.expectancy = LifeExp$Life.expectancy, 
      GDP = LifeExp$GDP, Population = LifeExp$Population)

## option 3 (requires knowing which columns are which)
LifeExp[, c(1, 4, 16, 17)]

And here is how easily, and hopefully by now intuitively, this can be done using the select() function

LifeExp %>% select(Country, Continent, Life.expectancy, GDP, Population) %>% head()
##       Country Continent Life.expectancy       GDP Population
## 1 Afghanistan      Asia            65.0 584.25921   33736494
## 2 Afghanistan      Asia            59.9 612.69651     327582
## 3 Afghanistan      Asia            59.9 631.74498   31731688
## 4 Afghanistan      Asia            59.5 669.95900    3696958
## 5 Afghanistan      Asia            59.2  63.53723    2978599
## 6 Afghanistan      Asia            58.8 553.32894    2883167

Note: notice that we added one extra pipe at the end, %>% head(). Remember that the head() function displays the first few rows of a data.frame. What we are doing here is first taking the data.frame LifeExp, then applying the select() function, and then using the result of this as the argument to head()

De-selection, or filtering out particular columns, follows a similar structure, but in this case the column to be removed is preceded by a ‘-’; for example

LifeExp %>% select(-Adult.Mortality) %>% head()
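select() also understands helper functions such as starts_with(), ends_with() and contains(), which are handy when many columns share a common prefix. A minimal sketch with a toy data frame (made-up column values):

```r
library(dplyr)

toy <- data.frame(Country              = "A",
                  thinness.5.9.years   = 18.2,
                  thinness..1.19.years = 18.4,
                  GDP                  = 553.3)

# keep every column whose name starts with "thinness"
toy %>% select(Country, starts_with("thinness"))
```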

As you will have noticed, we can easily combine various data manipulation functions using pipes without having to store any intermediate results or use excessive parentheses. Let’s try this out

LifeExp %>% 
    filter(Year == 2000) %>%
    select(Country, Continent, Life.expectancy, GDP, Population) %>%
    arrange(Life.expectancy) %>%
    head(n = 10)
##                     Country Continent Life.expectancy       GDP Population
## 1              Sierra Leone    Africa            39.0 139.31477    4564297
## 2                    Malawi    Africa            43.1 153.25949   11376172
## 3                    Zambia    Africa            43.8 341.95562    1531221
## 4                    Angola    Africa            45.3 555.29694    1644924
## 5                   Eritrea    Africa            45.3  28.19695     339281
## 6  Central African Republic    Africa            46.0 243.54293    3754986
## 7                  Zimbabwe    Africa            46.0 547.35888   12222251
## 8                    Uganda    Africa            46.6 257.63369    2439274
## 9                   Nigeria    Africa            47.1 379.11933    1223529
## 10                     Chad    Africa            47.6 166.23179    8342559

Powerful stuff!

2.4.5 Grouping and summarising

A common thing we might want to do is to produce summaries of some variable for different subsets of the data. For example, we might want to produce an estimate of the mean life expectancy for each year. The dplyr package provides the function group_by(), which allows us to group data by one or more variables, and summarise(), which allows us to summarise data. These can be used in combination to get stratified summaries, as shown here

LifeExp %>% 
    group_by(Year) %>%
    summarise(mn = mean(Life.expectancy))
## # A tibble: 16 × 2
##     Year    mn
##    <int> <dbl>
##  1  2000  65.9
##  2  2001  66.2
##  3  2002  66.5
##  4  2003  66.5
##  5  2004  66.9
##  6  2005  67.5
##  7  2006  68.0
##  8  2007  68.5
##  9  2008  68.9
## 10  2009  69.3
## 11  2010  69.4
## 12  2011  70.1
## 13  2012  70.5
## 14  2013  70.9
## 15  2014  71.1
## 16  2015  71.2

The most commonly used functions for summarising data are

  • Centre: mean(), median()
  • Spread: sd(), IQR()
  • Range: min(), max()
  • Count: n()
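Note that summarise() can compute several of these at once, with each summary given its own column name. A minimal sketch with toy data:

```r
library(dplyr)

toy <- data.frame(group = c("x", "x", "y", "y"),
                  value = c(1, 3, 2, 6))

# several named summaries in one summarise() call
toy %>%
    group_by(group) %>%
    summarise(mn = mean(value), s = sd(value), count = n())
```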
Task

Test your skills by

  1. calculating the mean life expectancy by continent
  2. calculating the mean life expectancy by continent and year
  3. calculating the mean and median of the GDP for each continent in the years 2000, 2005 and 2010
(note, for both the mean() and median() we have to add an extra argument na.rm = TRUE - what happens if we don’t?)

\(~\)

2.4.6 Reshaping datasets

Recall, a key philosophy of tidyverse is the notion of data being tidy. In fact, to get the most out of ggplot, datasets need to be in a specific format. More often than not, though, you will be dealing with untidy data. To give you an example, let’s have a look at the data file ExposureIndex.csv, which contains the recorded malaria exposure index for 103 individuals in Junju, Kenya, between the years 2000 and 2010.

EI <- read.csv('ExposureIndex.csv', check.names = F)  # the argument check.names=F prevents renaming of col names
head(EI)
##   Participant.ID      2000      2001         2002      2003         2004
## 1          N0004 0.7500000 0.5666667 4.705882e-01 0.4736842 1.578947e-01
## 2          N0007 0.9090909 0.6923077 3.333333e-01 0.1666667 9.770000e-15
## 3          N0016 0.7500000 0.8750000 3.750000e-01 0.6250000 1.000000e+00
## 4          N0024 0.7500000 0.5333333 5.294117e-01 0.4210526 1.052632e-01
## 5          N0041 0.5000000 0.3888889 8.690000e-13 0.3333333 6.400000e-13
## 6          N0055 0.7500000 0.5333333 5.294117e-01 0.4210526 1.578947e-01
##       2005     2006     2007 2008     2009     2010
## 1 0.00e+00 0.00e+00 8.70e-14    0 5.00e-02 1.00e-01
## 2 1.27e-14 9.77e-15 0.00e+00    0 0.00e+00 2.44e-14
## 3 5.00e-01 0.00e+00 2.00e-01    0 2.16e-14 3.60e-15
## 4 0.00e+00 0.00e+00 8.70e-14    0 5.00e-02 1.00e-01
## 5 1.56e-13 0.00e+00 4.25e-13    0 3.45e-13 5.74e-14
## 6 0.00e+00 0.00e+00 8.70e-14    0 5.00e-02 1.00e-01
Question
Is this data in tidy format? And if not, why not?

\(~\)

Luckily, within tidyverse it is easy to reshape data to get it into the desired format. In this case we want to reshape it from a wide format into a long format. The necessary function to do this is called gather(), which takes two or more columns and gathers them into key-value pairs (here: year - exposure index). The columns to gather are 2000, 2001, …, 2010, which we can shorten to 2000:2010 (i.e. all columns from 2000 to 2010)

EI_long <- EI %>%
    gather(Year, ExposureIndex, '2000':'2010') 

head(EI_long, n = 10)
##    Participant.ID Year ExposureIndex
## 1           N0004 2000     0.7500000
## 2           N0007 2000     0.9090909
## 3           N0016 2000     0.7500000
## 4           N0024 2000     0.7500000
## 5           N0041 2000     0.5000000
## 6           N0055 2000     0.7500000
## 7           N0059 2000     0.7500000
## 8           N0067 2000     0.7500000
## 9           N0089 2000     0.4375000
## 10          N0096 2000     0.4375000

Note: because the column names are numerical values and not strings, we have to enclose them in quotes (‘ ’). This is not necessary if column names are in character format.
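As an aside, in recent versions of tidyr (1.0.0 onwards) gather() has been superseded by pivot_longer(), which does the same reshape with more explicit arguments. A minimal sketch with a toy wide-format data frame (made-up values):

```r
library(tidyr)

toy <- data.frame(Participant.ID = c("N0004", "N0007"),
                  `2000` = c(0.75, 0.91),
                  `2001` = c(0.57, 0.69),
                  check.names = FALSE)

# pivot_longer() is the modern replacement for gather()
toy %>% pivot_longer(cols = `2000`:`2001`,
                     names_to = "Year", values_to = "ExposureIndex")
```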

The converse of gather() is spread(), which takes two columns (key and value) and spreads them into multiple columns (dictated by the number of unique values).

EI_long %>%
    spread(Year, ExposureIndex) %>%
    head()
##   Participant.ID      2000      2001         2002      2003         2004
## 1          N0004 0.7500000 0.5666667 4.705882e-01 0.4736842 1.578947e-01
## 2          N0007 0.9090909 0.6923077 3.333333e-01 0.1666667 9.770000e-15
## 3          N0016 0.7500000 0.8750000 3.750000e-01 0.6250000 1.000000e+00
## 4          N0024 0.7500000 0.5333333 5.294117e-01 0.4210526 1.052632e-01
## 5          N0041 0.5000000 0.3888889 8.690000e-13 0.3333333 6.400000e-13
## 6          N0055 0.7500000 0.5333333 5.294117e-01 0.4210526 1.578947e-01
##       2005     2006     2007 2008     2009     2010
## 1 0.00e+00 0.00e+00 8.70e-14    0 5.00e-02 1.00e-01
## 2 1.27e-14 9.77e-15 0.00e+00    0 0.00e+00 2.44e-14
## 3 5.00e-01 0.00e+00 2.00e-01    0 2.16e-14 3.60e-15
## 4 0.00e+00 0.00e+00 8.70e-14    0 5.00e-02 1.00e-01
## 5 1.56e-13 0.00e+00 4.25e-13    0 3.45e-13 5.74e-14
## 6 0.00e+00 0.00e+00 8.70e-14    0 5.00e-02 1.00e-01

2.4.7 Mutate

Last but not least we show you how to create, modify or even delete columns with mutate(). In fact, mutate() is probably one of the functions you will end up using the most. However, the functionality of mutate() is introduced here by means of two examples, which also introduce a few other useful features.

In the first example we are going to create a new column in our data.frame EI_long, where we categorise the exposure index into high and low depending on its value being above or below 0.5, respectively. One new function we are going to introduce is ifelse(); please use the help for more details, although its use should become apparent.

EI_long <- EI_long %>%
    mutate(EI_cat = ifelse(ExposureIndex >= 0.5, 'high', 'low'))

head(EI_long, n=10)
##    Participant.ID Year ExposureIndex EI_cat
## 1           N0004 2000     0.7500000   high
## 2           N0007 2000     0.9090909   high
## 3           N0016 2000     0.7500000   high
## 4           N0024 2000     0.7500000   high
## 5           N0041 2000     0.5000000   high
## 6           N0055 2000     0.7500000   high
## 7           N0059 2000     0.7500000   high
## 8           N0067 2000     0.7500000   high
## 9           N0089 2000     0.4375000    low
## 10          N0096 2000     0.4375000    low
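Note that ifelse() is limited to two outcomes. If you needed more than two categories (say, adding a hypothetical ‘medium’ band, purely for illustration), dplyr provides case_when(), which evaluates conditions in order:

```r
library(dplyr)

toy <- data.frame(ExposureIndex = c(0.9, 0.4, 0.1))

# conditions are checked top to bottom; TRUE acts as the catch-all
toy %>% mutate(EI_cat = case_when(ExposureIndex >= 0.5  ~ "high",
                                  ExposureIndex >= 0.25 ~ "medium",
                                  TRUE                  ~ "low"))
```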

In the second example we are going to use mutate() to modify three existing columns. First, we are going to round the ExposureIndex to two decimal places; then we are turning our new column EI_cat into a categorical variable using the function factor(); and finally we are going to change the Year column into a numeric format

EI_long <- EI_long %>%
    mutate(ExposureIndex = round(ExposureIndex, digits=2)) %>%
    mutate(EI_cat = factor(EI_cat)) %>%
    mutate(Year = as.numeric(Year))

# note, we could have done the same in a single mutate() call
# EI_long <- EI_long %>%
#     mutate(ExposureIndex = round(ExposureIndex, digits=2),
#            EI_cat = factor(EI_cat),
#            Year = as.numeric(Year))

head(EI_long, n=10)
##    Participant.ID Year ExposureIndex EI_cat
## 1           N0004 2000          0.75   high
## 2           N0007 2000          0.91   high
## 3           N0016 2000          0.75   high
## 4           N0024 2000          0.75   high
## 5           N0041 2000          0.50   high
## 6           N0055 2000          0.75   high
## 7           N0059 2000          0.75   high
## 8           N0067 2000          0.75   high
## 9           N0089 2000          0.44    low
## 10          N0096 2000          0.44    low
Task
  1. produce a ggplot line graph of the mean exposure index against time (2000 to 2010)
  2. show the distribution of the exposure index over time by means of box-and-whisker plots


\(~\)

2.4.8 Dealing with missing data

One recurring problem with data analysis is that data are often incomplete, leading to more or less serious downstream data analysis problems. That is, different functions deal with missing data in different ways, some simply ignore them whereas others fail completely. We therefore need to expand our data wrangling toolbox.

Missing data in R are usually referred to as NA’s, and here is an example of where NA’s can become problematic

x = c(1, 4, 2, 9, NA, 4, 5)
mean(x)
## [1] NA

There are various ways to deal with missing data, of which we only touch upon a few

Detecting missing data

The first thing we might want to know is whether there are any NA’s; a function for this is is.na(), which works on single variables, vectors, matrices and data.frames

df <- data.frame(x = c(1, 4, 2, 9, NA, 4, 5),
                 y = c(0.2, 0.5, NA, 0.1, NA, 0.8, 0.9))

is.na(df)
##          x     y
## [1,] FALSE FALSE
## [2,] FALSE FALSE
## [3,] FALSE  TRUE
## [4,] FALSE FALSE
## [5,]  TRUE  TRUE
## [6,] FALSE FALSE
## [7,] FALSE FALSE
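Because is.na() returns TRUE/FALSE values, which count as 1/0 in arithmetic, we can also count missing values directly with sum() or, per column, with colSums():

```r
df <- data.frame(x = c(1, 4, 2, 9, NA, 4, 5),
                 y = c(0.2, 0.5, NA, 0.1, NA, 0.8, 0.9))

sum(is.na(df))      # total number of missing values
## [1] 3
colSums(is.na(df))  # missing values per column
## x y 
## 1 2
```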

Removing NA’s

Going back to our previous example, if we want to calculate the mean of x we can either manually remove the offending entries or tell mean() to ignore them

# option 1: tell mean() to ignore NA
mean(x, na.rm = TRUE)
## [1] 4.166667
# option 2: remove NA manually
x <- x[!is.na(x)]  # this keeps only those entries of x that are not (!) NA
mean(x)
## [1] 4.166667

What about our data.frame df? As df has both rows and columns we cannot simply use the same method but have to be more specific. For example, to remove a row where df$x contains missing data

df[!is.na(df$x),]
##   x   y
## 1 1 0.2
## 2 4 0.5
## 3 2  NA
## 4 9 0.1
## 6 4 0.8
## 7 5 0.9

or to remove all rows where either df$x or df$y contains missing data (note: for this we need a logical operator, which is in fact an AND, &, because we want NA’s neither in x nor in y, so both conditions have to be satisfied together)

df[!is.na(df$x) & !is.na(df$y),]
##   x   y
## 1 1 0.2
## 2 4 0.5
## 4 9 0.1
## 6 4 0.8
## 7 5 0.9

An easier way to achieve the same is by using the complete.cases() function, which selects only those rows where we have complete information (i.e. no missing data)

df[complete.cases(df),]
##   x   y
## 1 1 0.2
## 2 4 0.5
## 4 9 0.1
## 6 4 0.8
## 7 5 0.9

And finally, here are the same examples done using tidy functionality

df %>%
    filter(!is.na(x) & !is.na(y))
##   x   y
## 1 1 0.2
## 2 4 0.5
## 3 9 0.1
## 4 4 0.8
## 5 5 0.9

which is equivalent to

df %>%
    drop_na()
##   x   y
## 1 1 0.2
## 2 4 0.5
## 3 9 0.1
## 4 4 0.8
## 5 5 0.9

which is equivalent to

df %>% 
  filter_at(vars(x:y), all_vars(!is.na(.)))
##   x   y
## 1 1 0.2
## 2 4 0.5
## 3 9 0.1
## 4 4 0.8
## 5 5 0.9

which is equivalent to

df %>% 
  filter_at(vars(x:y), all_vars(complete.cases(.)))
##   x   y
## 1 1 0.2
## 2 4 0.5
## 3 9 0.1
## 4 4 0.8
## 5 5 0.9
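Dropping incomplete rows is not the only option, of course. Where a sensible substitute value exists, tidyr’s replace_na() fills in NA’s column by column (the 0 used here is purely illustrative; the right replacement depends on your data):

```r
library(tidyr)

df <- data.frame(x = c(1, 4, 2, 9, NA, 4, 5),
                 y = c(0.2, 0.5, NA, 0.1, NA, 0.8, 0.9))

# replace NA's with a chosen value per column (0 is illustrative only)
df %>% replace_na(list(x = 0, y = 0))
```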

As they say in English: “there is more than one way to skin a cat”.