2 Data wrangling and visualisation
2.1 Introduction
In the second part of this workshop we deal with the handling and visualisation of (complex) data. By now you should feel comfortable with basic R and, specifically, the use of data.frame objects; if not, please ask one of the demonstrators.
At this point we are also saying goodbye to much of R’s basic (i.e. inbuilt) functionality for plotting and manipulating data and introduce you to the tidyverse, which is a suite of packages for performing a (growing) number of data science tasks. Here, we will introduce key tidyverse packages, including dplyr and ggplot2, and show how they can be used to efficiently process and visualise complex data.
Although each package can be installed and loaded separately, they are designed to work together. An easier approach is to simply install and load the tidyverse directly.
To install tidyverse, use:
install.packages("tidyverse")
and once installed, it can be loaded using:
library(tidyverse)
\(~\)
2.2 Loading and saving data
An essential part of data handling is the ability to store and read (externally stored) data. There are a myriad of different ways this can be done; here we briefly introduce two of them:
- saving data as an R object with saveRDS to create an Rds file and loading it with readRDS
- saving data as a comma-separated values (CSV) file and loading it with read.table or read_csv
Note: the tidyverse ecosystem also provides functions to handle Microsoft Excel files. For example, read_excel() (from the readxl package) lets you import xls or xlsx based data files. However, because Excel often causes havoc with data (mostly by silently changing data types), we urge caution when handling Excel files.
The main difference between the two approaches is that an Rds file stores data in an (R-specific) binary format, meaning that you need R to read it. CSV files, in comparison, are (human-readable) text files that can be opened by a number of different programs. In fact, on most operating systems, these are opened by MS Excel or a text editor by default. Which approach should you use? This often depends on whether you are the only one working with the data or whether you like the added benefit of viewing (or manipulating) data outside the R environment.
2.2.1 Save / load data as an R object
As a first step we need to create some data that we wish to store for future use. As before, we are dealing almost exclusively with data.frames. Recall the simple data.frame we generated in the first session:
df <- data.frame(ParticipantID = c('ID001', 'ID002', 'ID003', 'ID007', 'ID009'),
                 Age = c(12, 8, 9, 7, 11),
                 Episode = c('y', 'n', 'n', 'y', 'y'))
To save this data as an R object, simply call
saveRDS(df, file = 'myData.Rds')
Note: the file will be saved in your current working directory. If you want to save it in a different folder you need to provide the full file path, e.g. saveRDS(df, file = '.../StatsWithR/Data/myData.Rds').
Loading the data is equally simple:
# first we remove the df object to make sure everything works
rm(df)
# load the data
readRDS('myData.Rds')
## ParticipantID Age Episode
## 1 ID001 12 y
## 2 ID002 8 n
## 3 ID003 9 n
## 4 ID007 7 y
## 5 ID009 11 y
At first sight this looks like it has worked. However, you might also have noticed that there is no newly created object in your global environment. This means that although R successfully loaded the data, it was not assigned to an object and is essentially lost (this is one of the key differences between save/load and saveRDS/readRDS). We therefore need to assign the read-in data to an object:
# read data and assign to an object / data.frame called myData
myData <- readRDS('myData.Rds')
# see if it works
head(myData)
## ParticipantID Age Episode
## 1 ID001 12 y
## 2 ID002 8 n
## 3 ID003 9 n
## 4 ID007 7 y
## 5 ID009 11 y
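For comparison, here is a minimal sketch of the save() / load() pair mentioned above. Note how load() restores the object under its original name, without any assignment on our part:
# save() stores one or more objects together with their names
save(myData, file = 'myData.RData')

# remove the object, then load() silently recreates myData in the environment
rm(myData)
load('myData.RData')
head(myData)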
2.2.2 Save / load as a CSV file
One of the advantages of saving data as an Rds file is that it is more space efficient. By default, saveRDS compresses the data, which can significantly shrink the size of the stored file; on the downside, it can take a bit longer to load the data subsequently. A more widely used data format, however, is the CSV file.
There are two equivalent function pairs for saving / loading CSV files: write.table() / read.table() from base R, and write_csv() / read_csv() from the tidyverse (readr) package, which are faster re-implementations serving the same purpose. Here we show you how to use the first function pair, but please familiarise yourself with the other one as well.
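As a brief sketch of the second pair (using the myData object from above; the file name myData2.csv is just an example), write_csv() uses commas and drops row names by default, and read_csv() returns a tibble and reports the column types it guessed:
# tidyverse (readr) equivalents of write.table() / read.table()
write_csv(myData, 'myData2.csv')
myData2 <- read_csv('myData2.csv')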
First we save our data.frame myData as a comma-separated file
# save data in CSV format
write.table(myData, # data object
file = 'myData.csv', # file name
sep = ',', # character used to separate columns
row.names = F) # don't save row names as additional column
Check if this worked by importing the CSV file
# first remove the object from our environment
rm(myData)
# read in data
df <- read.table('myData.csv', # file name to be read
                 header = TRUE, # treat first row as header / column names
                 sep = ',') # character used to separate columns
# print out first three rows
head(df, n = 3)
## ParticipantID Age Episode
## 1 ID001 12 y
## 2 ID002 8 n
## 3 ID003 9 n
Note: here we used a comma (‘,’) to separate individual data columns. Other commonly used characters are the semi-colon (‘;’) and the tab (‘\t’). If the separator is not explicitly provided via sep = ..., read.table falls back to its default (any white space) - this is dangerous and often leads to errors. We therefore recommend always stating which character should be used - explicit is better than implicit!
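To illustrate, here is a sketch of the same explicit approach with a tab separator (the file name myData.tsv is just an example):
# save and re-load the same data, tab-separated this time
write.table(df, file = 'myData.tsv', sep = '\t', row.names = FALSE)
dfTab <- read.table('myData.tsv', header = TRUE, sep = '\t')
head(dfTab, n = 3)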
2.2.3 Handling Excel files
Although we strongly discourage the use of Excel for data handling, it is still one of the most common methods for storing data. We therefore briefly show you one method of reading Excel files using the readxl package, which needs to be loaded explicitly because it is not one of the core tidyverse packages loaded via library(tidyverse)
# load library
library(readxl)
# read in excel data file
irisDF <- read_excel('iris.xls')
# print first few rows
head(irisDF, n = 5)
## # A tibble: 5 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
2.2.4 Cheatsheet
For more information on how to import and export data, have a look at this cheat sheet, which provides an overview of all essential functions and their usage and is available as a PDF for download.
\(~\)
2.3 Visualisation using ggplot
We start with a motivating example to show how using base R graphics can easily become cumbersome. For this we use the iris dataset, which we imported in the previous section. This dataset contains the measurements of the sepal length, sepal width, petal length and petal width for 50 samples from each of three species of iris flowers (setosa, versicolor and virginica).
Let’s say we are interested in the relationship between the length and the width of the petals. The first port of call is therefore to plot one against the other.
plot(x = irisDF$Petal.Width, y = irisDF$Petal.Length)
We can see that there is a clear relationship between those two variables. One other thing we might notice is that there are (at least) two clusters, and we might further suspect that this is because we have different species of iris. One way of checking this is to colour the points by species. Here we do this in a slightly convoluted way, but it serves the purpose of highlighting how quickly things can become complicated in base R.
The first thing we want to do is to subset the data. Say, for example, we are interested in the relationship between petal length and width in Iris setosa. In other words, we want to plot only the data for which Species=='setosa'.
plot(x = irisDF$Petal.Width[irisDF$Species=='setosa'],
y = irisDF$Petal.Length[irisDF$Species=='setosa'],
col = 'blue', pch = 20)
We can now add points for samples from Iris versicolor in a similar fashion.
plot(x = irisDF$Petal.Width[irisDF$Species=='setosa'],
y = irisDF$Petal.Length[irisDF$Species=='setosa'],
col = 'blue', pch = 20)
points(x = irisDF$Petal.Width[irisDF$Species=='versicolor'],
y = irisDF$Petal.Length[irisDF$Species=='versicolor'],
col = 'red', pch = 20)
What has happened? The problem is that the axis limits are set automatically by the first plot call and are not updated for any points subsequently added to the graph. We therefore need to set the limits manually to guarantee that all points are visible.
plot(x = irisDF$Petal.Width[irisDF$Species=='setosa'],
y = irisDF$Petal.Length[irisDF$Species=='setosa'],
col = 'blue', pch = 20,
xlim = c(0, 2.5),
ylim = c(1,7),
xlab = 'Petal width',
ylab = 'Petal length')
points(x = irisDF$Petal.Width[irisDF$Species=='versicolor'],
y = irisDF$Petal.Length[irisDF$Species=='versicolor'],
col = 'red', pch = 20)
points(x = irisDF$Petal.Width[irisDF$Species=='virginica'],
y = irisDF$Petal.Length[irisDF$Species=='virginica'],
col = 'darkgreen', pch = 20)
The only thing missing now is a legend indicating which colour belongs to which species.
plot(x = irisDF$Petal.Width[irisDF$Species=='setosa'],
y = irisDF$Petal.Length[irisDF$Species=='setosa'],
col = 'blue', pch = 20,
xlim = c(0, 2.5),
ylim = c(1,7),
xlab = 'Petal width',
ylab = 'Petal length')
points(x = irisDF$Petal.Width[irisDF$Species=='versicolor'],
y = irisDF$Petal.Length[irisDF$Species=='versicolor'],
col = 'red', pch = 20)
points(x = irisDF$Petal.Width[irisDF$Species=='virginica'],
y = irisDF$Petal.Length[irisDF$Species=='virginica'],
col = 'darkgreen', pch = 20)
legend(x = 0, y = 7,
legend = c('setosa', 'versicolor', 'virginica'),
pch = 20,
col = c('blue', 'red', 'darkgreen'))
Now we can see that there are indeed three clusters and also get a hint that the relationship between petal length and petal width is dependent on species. In the next part of this workshop we will show you how to test this statistically.
Although you can see that we can make highly customised graphs using base R graphics, the steps needed to create the above graph were far from straightforward. So let’s try to do the same using ggplot2 instead
ggplot(irisDF, aes(x = Petal.Width, y = Petal.Length, col = Species)) +
geom_point()
So essentially, with a one-liner we can create a highly complex graph that required no manual subsetting of the data or manipulation of the plot. We did not have to specify the legend or its position either; this is all taken care of by the ggplot2 package.
Arguably, for most people ggplot2 is not very intuitive to start with, and it can take some time to get used to. However, once you have understood the basics, it becomes a very logical and powerful way to create complex and highly customised graphs.
The ethos of ggplot2 is that plots can be broken down into different features, most notably:
- data
- aesthetic mapping
- geometric object
- faceting
- scales
- statistical transformations
- coordinate system
- position adjustments
The first three are the most essential ingredients of a plot created with ggplot2. The next one (faceting) is a commonly used method when dealing with categorical data and for creating stratified graphs. The scales feature is useful for translating data values to visual properties and includes, for example, log-transforms. The remaining features are only mentioned here and play no further part in this introductory workshop.
The easiest way to explain some of these features is to break the above code apart and add them one by one. Let’s start with setting up the plot
ggplot(irisDF)
This creates an empty plot, but you will not get an error message - you have done nothing wrong, you have simply not yet added the crucial ingredients!
Note: in comparison to base R, ggplot2 requires a data.frame (or tibble) object for plotting. As shown in the previous section, it is easy to manipulate most other data into an R data frame. We will revisit this at a later stage.
2.3.1 Aesthetics
The next key ingredient is specifying how the data should be mapped onto the visual aesthetics of the plot. This was done through the aes(x = Petal.Width, y = Petal.Length, col = Species) argument to the ggplot() function (aesthetics can also be supplied to individual geoms - see next section). Here we set the x-coordinates to be Petal.Width, the y-coordinates to be Petal.Length, and the colour of the points to depend on Species. In general, aesthetics include:
- position
- colour / color / col (border or line color)
- fill (inside color)
- shape
- linetype
- size.
Note that not all aesthetic arguments have to be provided. For example, leaving out the col = argument generates the following graph
ggplot(irisDF, aes(x = Petal.Width, y = Petal.Length)) +
geom_point()
2.3.2 Geoms
The next ingredient is the geom, which defines the type of plot we want. In the above example we used geom_point()
, which produces a scatterplot. Here are the most common geoms:
geom_point()
geom_jitter()
geom_line()
geom_bar()
geom_boxplot()
geom_violin()
geom_histogram()
geom_density()
Let’s try one of them: geom_boxplot(). This creates a box-and-whisker plot, which we have already come across. Say, for example, we are interested in the distribution of sepal lengths but want to stratify this by species.
ggplot(irisDF, aes(x = Species, y = Sepal.Length, fill = Species)) +
geom_boxplot()
Nice! Notice that we did not have to specify the actual x-positioning for the three species; instead, ggplot2 recognised that Species was not a numerical variable and treated it internally as a categorical variable (even though we did not specify it as such).
As you will have noticed, ggplot2 builds up plots by adding together components. Here we first set up “global” options for the plot using the ggplot() function, where we also specified the plot aesthetics, e.g. through aes(x = Petal.Width, y = Petal.Length, col = Species). We then added (+) to this the type of plot we wanted, e.g. geom_point().
One of the advantages of ggplot2 is that it allows us to layer geoms and thus build complex graphs in different ways. To demonstrate this, we are going to add the individual data points to the boxplot generated above using the geom_jitter() geom.
ggplot(irisDF, aes(x = Species, y = Sepal.Length, fill = Species)) +
geom_boxplot() +
geom_jitter(width = 0.1, height = 0, size = 0.5)
Task: notice the extra arguments we provided to geom_jitter(). By changing their respective values, find out what they do. (Answer: these arguments control the horizontal spread of the data (width), the vertical ‘noise’ added to each data point (height), and the size of each point (size).)
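If you want a head start, here is a sketch with deliberately exaggerated values, which should make the effect of each argument obvious:
# exaggerated jitter: wide horizontal spread, vertical noise, larger points
ggplot(irisDF, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot() +
  geom_jitter(width = 0.4, height = 1, size = 2)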
\(~\)
Task: now try mapping a different aesthetic (e.g. col instead of fill) to Species. What do you observe?
\(~\)
2.3.3 Labels
ggplot2 generally does a good job of decorating the plot with the right axis labels etc. However, this is still dependent on the variable (i.e. column) names, and sometimes we want to add a bit more information, or change or even remove labels manually. This can easily be done by adding labels in a similar way to adding geoms and other features.
ggplot(irisDF, aes(x = Species, y = Sepal.Length, col = Species)) +
geom_boxplot() +
geom_jitter(width = 0.1, height = 0, size = 0.5) +
labs(x = '', # remove x label
y = 'Sepal length', # set y label
col = 'Iris species', # set legend title
title = 'Distribution of sepal length by species') # set plot title
Note that providing a legend is not strictly necessary in this case. Unfortunately, ggplot adds a legend by default. If you want to remove the legend, you have various options; the two most intuitive ones are
ggplot(irisDF, aes(x = Species, y = Sepal.Length, col = Species)) +
geom_boxplot() +
geom_jitter(width = 0.1, height = 0, size = 0.5) +
labs(x = 'Iris species',
y = 'Sepal length',
title = 'Distribution of sepal length by species') +
guides(col = "none")
or
ggplot(irisDF, aes(x = Species, y = Sepal.Length, col = Species)) +
geom_boxplot() +
geom_jitter(width = 0.1, height = 0, size = 0.5) +
labs(x = 'Iris species',
y = 'Sepal length',
title = 'Distribution of sepal length by species') +
theme(legend.position = "none")
As a quick aside, notice how we set the arguments in geom_jitter() not as aesthetics but as generic options applied to all the data. We can use this mixing of aesthetics and generic options to further customise our plot. For example, if we do not wish the individual data points to be coloured by species but simply plotted in black, we can set this as a generic argument
ggplot(irisDF, aes(x = Species, y = Sepal.Length, col = Species)) +
geom_boxplot() +
geom_jitter(width = 0.1, height = 0, size = 0.5, col = 'black') +
labs(x = 'Iris species',
y = 'Sepal length',
title = 'Distribution of sepal length by species') +
theme(legend.position = "none")
Task: produce a density plot of Sepal.Length using geom_density(), which only requires an x aesthetic argument, and fill by species. Now add an alpha channel (by adding an argument alpha = ... to the geom_density() function), which controls the opacity of the fill colour, ranging from 0 (completely transparent) to 1 (completely opaque).
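One possible solution sketch (the value alpha = 0.5 is an arbitrary choice):
ggplot(irisDF, aes(x = Sepal.Length, fill = Species)) +
  geom_density(alpha = 0.5)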
\(~\)
2.3.4 Faceting
A very useful feature of ggplot is the ability to use faceting to display sub-plots according to some grouping variable. For example, let’s assume that we want to produce separate scatter plots of petal length vs. width for each of the three flower species. We can do this using faceting:
ggplot(irisDF, aes(x = Petal.Width, y = Petal.Length)) +
geom_point() +
facet_wrap(~ Species)
Note: there is also the facet_grid() option, which allows you to stratify your data by more than one (categorical) variable, as sketched below.
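As a sketch of the facet_grid() syntax, which takes a rows ~ columns formula (here we only have one grouping variable, so we leave the rows side empty with a '.'):
ggplot(irisDF, aes(x = Petal.Width, y = Petal.Length)) +
  geom_point() +
  facet_grid(. ~ Species)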
We can use the same aesthetics settings as before, which are then applied to each of the facets.
ggplot(irisDF, aes(x = Petal.Width, y = Petal.Length, col = Species)) +
geom_point() +
facet_wrap(~ Species) +
labs(x = 'Petal width',
y = 'Petal length',
col = 'Iris species')
By default, ggplot will use the same x and y limits for each of the facets. This can be overwritten by explicitly setting the scales = argument, which must be set to one of “fixed” (the default), “free_x”, “free_y” or “free”. Explore what happens when you change this argument.
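For example, here is a sketch that lets each facet choose its own axis limits:
ggplot(irisDF, aes(x = Petal.Width, y = Petal.Length, col = Species)) +
  geom_point() +
  facet_wrap(~ Species, scales = 'free')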
\(~\)
Task: use the geom_histogram() function (geom) to produce a histogram plot of Sepal.Width, stratified (faceted) by Species. Explore what happens when you set the extra arguments bins = ... (for the number of bins) or binwidth = ... (the width of each bin).
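A possible starting point (bins = 15 is an arbitrary choice to experiment from):
ggplot(irisDF, aes(x = Sepal.Width)) +
  geom_histogram(bins = 15) +
  facet_wrap(~ Species)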
\(~\)
2.3.5 Statistical transformation
A very useful feature of ggplot2 is that it allows us to easily add statistical transformations on top of our data. Here we briefly introduce two (related) methods for adding trend lines to a scatterplot via geom_smooth() / stat_smooth()
ggplot(irisDF, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point() +
stat_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
By default, geom_smooth() adds a LOESS (locally estimated scatterplot smoothing) smoothed line together with a shaded area that highlights the confidence bounds (which can be turned off by setting se = FALSE).
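For example, a sketch of the same LOESS fit with the confidence band switched off:
ggplot(irisDF, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() +
  stat_smooth(se = FALSE)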
If we are more interested in a linear trend line, we have to explicitly set the method to lm
ggplot(irisDF, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point() +
stat_smooth(method = 'lm')
## `geom_smooth()` using formula = 'y ~ x'
Note: these trend lines can be added to our plot even if we do not want the scatterplot itself. That is, if for some reason you decide not to show the actual data, you can run the above without geom_point() to show only the (linear) trend line.
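A sketch of this trend-line-only variant:
ggplot(irisDF, aes(x = Sepal.Length, y = Sepal.Width)) +
  stat_smooth(method = 'lm')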
Task: now add trend lines stratified by Species (either faceted or simply coloured in a single graph). What do you observe?
2.3.6 Scales
Scales control the details of how data values are translated to visual properties and include features such as modifying axis limits and axis transformations. Here we only provide brief examples; follow this link for a more in-depth tutorial.
To manually set axis limits we can either directly specify which axis we want to modify (e.g. through + xlim(min, max)) or go the more generic way, such as
ggplot(irisDF, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point() +
stat_smooth(method = 'lm') +
lims(x = c(3.8, 8.3),
y = c(1.5, 5))
## `geom_smooth()` using formula = 'y ~ x'
Much more control and customisation over limits, axis labels, breaks (i.e. tick marks) etc. is provided by the scale_xx() functions
scale_x_continuous(name, breaks, labels, limits, trans)
scale_y_continuous(name, breaks, labels, limits, trans)
For more information about these, please have a look at the above-mentioned tutorial or check the R help through the command ?scale_x_continuous.
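As a brief sketch of the extra control this offers (the axis name, breaks and limits below are arbitrary choices):
ggplot(irisDF, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() +
  scale_x_continuous(name = 'Sepal length (cm)',
                     breaks = seq(4, 8, by = 0.5),
                     limits = c(4, 8))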
ggplot2 offers various inbuilt functions to perform axis transformations, including:
- scale_x_log10(), scale_y_log10(): for log10 transformations
- scale_x_sqrt(), scale_y_sqrt(): for square-root transformations
- scale_x_reverse(), scale_y_reverse(): to reverse coordinates
Here is an easy example showing how these can be mixed in a single plot
df <- data.frame(x = c(10, 100, 1000, 10000, 100000),
                 y = c(1, 4, 9, 16, 25))
ggplot(df, aes(x = x, y = y)) +
geom_point() +
scale_x_log10() +
scale_y_sqrt() +
stat_smooth(method = 'lm')
## `geom_smooth()` using formula = 'y ~ x'
2.3.7 Assigning plots to objects
So far we have used a separate code block for each plot. One thing to remember is that R is an object-oriented language, which means that plots generated with ggplot can also be assigned to an object. This allows us to add extra layers / features to our plot without rewriting the entire code; here is an example.
p1 <- ggplot(df, aes(x = x, y = y)) +
  geom_point(size = 1.4, col = 'red')

p1
Next we add a line connecting the data points
p2 <- p1 + geom_line(linewidth = 0.3, col = 'blue')

p2
and finally let’s do some axis transformation
p2 + scale_x_log10() + scale_y_sqrt()
2.3.8 Putting it all together
As a final step we are going to show you how combining different aesthetics and features in ggplot2 allows you to make very complex - but both informative and beautiful - graphs. For this we will use a different dataset (LifeExpectancyData.csv), start with a simple plot of life expectancy against GDP, and then add layers and modifications as we go along.
# load data
LifeExpectancy <- read.table('LifeExpectancyData.csv', sep = ',', header = T)
# have a quick glimpse at the data
head(LifeExpectancy)
## Country Continent Year Status Life.expectancy Adult.Mortality
## 1 Afghanistan Asia 2015 Developing 65.0 263
## 2 Afghanistan Asia 2014 Developing 59.9 271
## 3 Afghanistan Asia 2013 Developing 59.9 268
## 4 Afghanistan Asia 2012 Developing 59.5 272
## 5 Afghanistan Asia 2011 Developing 59.2 275
## 6 Afghanistan Asia 2010 Developing 58.8 279
## infant.deaths Alcohol percentage.expenditure Hepatitis.B Measles BMI
## 1 62 0.01 71.279624 65 1154 19.1
## 2 64 0.01 73.523582 62 492 18.6
## 3 66 0.01 73.219243 64 430 18.1
## 4 69 0.01 78.184215 67 2787 17.6
## 5 71 0.01 7.097109 68 3013 17.2
## 6 74 0.01 79.679367 66 1989 16.7
## under.five.deaths Polio Total.expenditure Diphtheria HIV.AIDS GDP
## 1 83 6 8.16 65 0.1 584.25921
## 2 86 58 8.18 62 0.1 612.69651
## 3 89 62 8.13 64 0.1 631.74498
## 4 93 67 8.52 67 0.1 669.95900
## 5 97 68 7.87 68 0.1 63.53723
## 6 102 66 9.20 66 0.1 553.32894
## Population thinness..1.19.years thinness.5.9.years
## 1 33736494 17.2 17.3
## 2 327582 17.5 17.5
## 3 31731688 17.7 17.7
## 4 3696958 17.9 18.0
## 5 2978599 18.2 18.2
## 6 2883167 18.4 18.4
## Income.composition.of.resources Schooling
## 1 0.479 10.1
## 2 0.476 10.0
## 3 0.470 9.9
## 4 0.463 9.8
## 5 0.454 9.5
## 6 0.448 9.2
# subset to only contain data from the year 2000
DF <- LifeExpectancy[LifeExpectancy$Year == 2000,]
# scatterplot of life expectancy against GDP
ggplot(DF, aes(x = GDP, y = Life.expectancy)) +
geom_point() +
ylab('Life expectancy (years)')
> Note: you might get a warning about rows being removed due to missing values. We will get on to this in the next section; for now you can just ignore it.
We can see a clear relationship between life expectancy and GDP. Next we want to see if there is a difference by status of a country (developed or developing)
# scatterplot of life expectancy against GDP
ggplot(DF, aes(x = GDP, y = Life.expectancy, col = Status)) +
geom_point() +
ylab('Life expectancy (years)')
And again we see a clear pattern, whereby developed countries appear to have, on average, a higher life expectancy. Next we can add an additional aesthetic, or stratification, changing the size of each data point according to the respective country’s population size.
# scatterplot of life expectancy against GDP
ggplot(DF, aes(x = GDP, y = Life.expectancy, col = Status, size = Population)) +
geom_point() +
ylab('Life expectancy (years)')
Although there seems to be a relationship, it is hard to see because most countries are squashed into the left-hand side of the graph. To the trained eye it also looks like the GDP data are not normally distributed (which in fact explains the clustering to the left), so we will log-transform the x-axis.
# scatterplot of life expectancy against GDP
ggplot(DF, aes(x = GDP, y = Life.expectancy, col = Status, size = Population)) +
geom_point() +
scale_x_log10() +
ylab('Life expectancy (years)')
Finally, we might be interested in how life expectancy and its relationship with GDP and population size have evolved over time and across the five continents. Let’s say we want to compare the years 2000, 2005 and 2010.
# subset data to now contain information for 2000, 2005 and 2010
DF <- LifeExpectancy[LifeExpectancy$Year %in% c(2000, 2005, 2010),]
ggplot(DF, aes(x = GDP, y = Life.expectancy, col = Status, size = Population)) +
geom_point() +
facet_grid(Continent ~ Year) +
scale_x_log10() +
theme(legend.position = 'top')
NICE!
Task: explore the LifeExpectancy dataset using the different features that ggplot has to offer. Can you find any interesting relationships in the data? If so, take note and revisit them when we enter the statistics part of this workshop.
2.4 Data wrangling
Data wrangling, or the process of cleaning and manipulating data for subsequent analysis, is a key component of modern statistical science and is becoming ever more important in the age of big (e.g. high-throughput) data. The tidyverse incorporates a suite of packages, such as tidyr and dplyr, that are designed to make common data wrangling tasks not only easier to achieve, but also easier to decipher. Readability of the code is at the core of the philosophy underpinning these packages.
As always, there are plenty of resources with useful information about the various data handling functions in the different tidyverse packages, including the Data Import Cheat Sheet and Data Transformation Cheat Sheet.
If you haven’t done so already, we first need to load the tidyverse library as well as the life expectancy dataset introduced earlier.
# load tidyverse
library(tidyverse)
# load dataset and assign to object LifeExp
LifeExp <- read.table('LifeExpectancyData.csv', sep = ',', header = T)
2.4.1 Tidy data
The architect behind the tidyverse, Hadley Wickham, distinguishes between two types of data set: tidy and messy. Rather than being pejorative towards different ways in which people store and visualise data, the point here is to make a distinction between a specific way of arranging data that is useful to most R analyses, and anything else. In fact Hadley has a neat analogy to a famous Tolstoy quote:
“Tidy datasets are all alike but every messy dataset is messy in its own way.” — Hadley Wickham
In this context, a tidy data set is one in which:
- rows contain different observations
- columns contain different variables
- cells contain values
Let’s look at a simple example using the life expectancy data.
head(LifeExp)
## Country Continent Year Status Life.expectancy Adult.Mortality
## 1 Afghanistan Asia 2015 Developing 65.0 263
## 2 Afghanistan Asia 2014 Developing 59.9 271
## 3 Afghanistan Asia 2013 Developing 59.9 268
## 4 Afghanistan Asia 2012 Developing 59.5 272
## 5 Afghanistan Asia 2011 Developing 59.2 275
## 6 Afghanistan Asia 2010 Developing 58.8 279
## infant.deaths Alcohol percentage.expenditure Hepatitis.B Measles BMI
## 1 62 0.01 71.279624 65 1154 19.1
## 2 64 0.01 73.523582 62 492 18.6
## 3 66 0.01 73.219243 64 430 18.1
## 4 69 0.01 78.184215 67 2787 17.6
## 5 71 0.01 7.097109 68 3013 17.2
## 6 74 0.01 79.679367 66 1989 16.7
## under.five.deaths Polio Total.expenditure Diphtheria HIV.AIDS GDP
## 1 83 6 8.16 65 0.1 584.25921
## 2 86 58 8.18 62 0.1 612.69651
## 3 89 62 8.13 64 0.1 631.74498
## 4 93 67 8.52 67 0.1 669.95900
## 5 97 68 7.87 68 0.1 63.53723
## 6 102 66 9.20 66 0.1 553.32894
## Population thinness..1.19.years thinness.5.9.years
## 1 33736494 17.2 17.3
## 2 327582 17.5 17.5
## 3 31731688 17.7 17.7
## 4 3696958 17.9 18.0
## 5 2978599 18.2 18.2
## 6 2883167 18.4 18.4
## Income.composition.of.resources Schooling
## 1 0.479 10.1
## 2 0.476 10.0
## 3 0.470 9.9
## 4 0.463 9.8
## 5 0.454 9.5
## 6 0.448 9.2
\(~\)
In this workshop we will see various ways in which datasets can be manipulated to and from the tidy format.
2.4.2 Filter rows
The first data manipulation we could do is to filter the data, as we have done earlier when plotting the data only for the year 2000. In base R this worked as
LifeExp[LifeExp$Year == 2000,]
To do the equivalent in tidyverse, we use the filter() function
filter(LifeExp, Year == 2000)
Note: one of the most distinctive / useful features of tidyverse (or rather magrittr, which tidyverse/dplyr depends on) is the so-called piping operator, denoted %>%. This operator allows you to chain one function after another without relying on nested parentheses or having to assign / store intermediate objects. What it does is evaluate the expression on the right-hand side (or next line) of the %>% (pipe) using the expression on the left (or same line) as its first argument. If this sounds confusing then fear not - all will be made clearer once you see it in action! Using the same example as above, we can get the same results with piping
LifeExp %>% filter(Year == 2000)
Initially this might look more complicated but you will soon see how this is a very useful operator indeed.
2.4.3 Sort rows
Another common operation is to sort rows according to some criterion. For example, we might wish to sort our data by Status and then Life.expectancy. In tidyverse this is done using the arrange() function.
arrange(LifeExp, Status, Life.expectancy)
or using the piping operator
LifeExp %>% arrange(Status, Life.expectancy)
2.4.4 Select columns
The next common operation is the selection of columns, which also includes de-selection of particular columns. Let’s say we are interested only in the countries and their corresponding life expectancy, GDP and population size. Before showing you how this can efficiently be done in tidyverse, here are some options for achieving it in base R
## option 1
LifeExp[, match(c("Country", "Life.expectancy", "GDP", "Population"), colnames(LifeExp))]
## option 2
cbind(Country = LifeExp$Country, Life.expectancy = LifeExp$Life.expectancy,
GDP = LifeExp$GDP, Population = LifeExp$Population)
## option 3 (requires knowing which columns are which)
LifeExp[, c(1, 5, 18, 19)]
And here is how easily, and hopefully by now intuitively, this can be done using the select() function
LifeExp %>% select(Country, Continent, Life.expectancy, GDP, Population) %>% head()
## Country Continent Life.expectancy GDP Population
## 1 Afghanistan Asia 65.0 584.25921 33736494
## 2 Afghanistan Asia 59.9 612.69651 327582
## 3 Afghanistan Asia 59.9 631.74498 31731688
## 4 Afghanistan Asia 59.5 669.95900 3696958
## 5 Afghanistan Asia 59.2 63.53723 2978599
## 6 Afghanistan Asia 58.8 553.32894 2883167
Note: notice that we added one extra pipe at the end: %>% head(). Remember that the head() function displays the first few rows of a data.frame. What we are doing here is first taking the data.frame LifeExp, then applying the select() function, and then using the result of this as the argument to head()
De-selection, or removal of particular columns, follows a similar structure, but in this case the column to be removed is preceded by a ‘-’; for example
LifeExp %>% select(-Adult.Mortality) %>% head()
As you will have noticed, we can easily combine various data manipulation functions using pipes without having to store any intermediate results or use excessive parentheses. Let’s try this out
LifeExp %>%
    filter(Year == 2000) %>%
select(Country, Continent, Life.expectancy, GDP, Population) %>%
arrange(Life.expectancy) %>%
head(n = 10)
## Country Continent Life.expectancy GDP Population
## 1 Sierra Leone Africa 39.0 139.31477 4564297
## 2 Malawi Africa 43.1 153.25949 11376172
## 3 Zambia Africa 43.8 341.95562 1531221
## 4 Angola Africa 45.3 555.29694 1644924
## 5 Eritrea Africa 45.3 28.19695 339281
## 6 Central African Republic Africa 46.0 243.54293 3754986
## 7 Zimbabwe Africa 46.0 547.35888 12222251
## 8 Uganda Africa 46.6 257.63369 2439274
## 9 Nigeria Africa 47.1 379.11933 1223529
## 10 Chad Africa 47.6 166.23179 8342559
Powerful stuff!
2.4.5 Grouping and summarising
A common thing we might want to do is to produce summaries of some variable for different subsets of the data. For example, we might want to produce an estimate of the mean life expectancy for each year. The dplyr package provides the function group_by(), which allows us to group data by one or more variables, and summarise(), which allows us to summarise data. These can be used in combination to get stratified summaries, as shown here
LifeExp %>%
    group_by(Year) %>%
summarise(mn = mean(Life.expectancy))
## # A tibble: 16 × 2
## Year mn
## <int> <dbl>
## 1 2000 65.9
## 2 2001 66.2
## 3 2002 66.5
## 4 2003 66.5
## 5 2004 66.9
## 6 2005 67.5
## 7 2006 68.0
## 8 2007 68.5
## 9 2008 68.9
## 10 2009 69.3
## 11 2010 69.4
## 12 2011 70.1
## 13 2012 70.5
## 14 2013 70.9
## 15 2014 71.1
## 16 2015 71.2
The most commonly used functions for summarising data are:
- Center: mean(), median()
- Spread: sd(), IQR()
- Range: min(), max()
- Count: n()
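Note that several of these can be combined in a single summarise() call; here is a sketch (the na.rm = TRUE argument is explained in the task below):
LifeExp %>%
    group_by(Continent) %>%
    summarise(mn = mean(Life.expectancy, na.rm = TRUE),
              md = median(Life.expectancy, na.rm = TRUE),
              n = n())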
Test your skills by:
- calculating the mean life expectancy by continent
- calculating the mean life expectancy by continent and year
- calculating the mean and median of the GDP for each continent in the years 2000, 2005 and 2010
(Hint: for mean() and median() we have to add an extra argument, na.rm = TRUE - what happens if we don’t?)
\(~\)
2.4.6 Reshaping datasets
Recall, a key philosophy of tidyverse is the notion of data being tidy. In fact, to get the most out of ggplot, datasets need to be in a specific format. More often than not, though, you will be dealing with untidy data. To give you an example, let’s have a look at the data file ExposureIndex.csv, which contains the recorded malaria exposure index for 103 individuals in Junju, Kenya, between the years 2000 and 2010.
EI <- read.csv('ExposureIndex.csv', check.names = F) # check.names = F prevents renaming of column names
head(EI)
## Participant.ID 2000 2001 2002 2003 2004
## 1 N0004 0.7500000 0.5666667 4.705882e-01 0.4736842 1.578947e-01
## 2 N0007 0.9090909 0.6923077 3.333333e-01 0.1666667 9.770000e-15
## 3 N0016 0.7500000 0.8750000 3.750000e-01 0.6250000 1.000000e+00
## 4 N0024 0.7500000 0.5333333 5.294117e-01 0.4210526 1.052632e-01
## 5 N0041 0.5000000 0.3888889 8.690000e-13 0.3333333 6.400000e-13
## 6 N0055 0.7500000 0.5333333 5.294117e-01 0.4210526 1.578947e-01
## 2005 2006 2007 2008 2009 2010
## 1 0.00e+00 0.00e+00 8.70e-14 0 5.00e-02 1.00e-01
## 2 1.27e-14 9.77e-15 0.00e+00 0 0.00e+00 2.44e-14
## 3 5.00e-01 0.00e+00 2.00e-01 0 2.16e-14 3.60e-15
## 4 0.00e+00 0.00e+00 8.70e-14 0 5.00e-02 1.00e-01
## 5 1.56e-13 0.00e+00 4.25e-13 0 3.45e-13 5.74e-14
## 6 0.00e+00 0.00e+00 8.70e-14 0 5.00e-02 1.00e-01
\(~\)
Luckily, within tidyverse it is easy to reshape data into the desired format. In this case we want to reshape it from a wide format into a long format. The necessary function is called gather(), which takes two or more columns and gathers them into key-value pairs (here: year - exposure index). The columns to gather are 2000, 2001, …, 2010, which we can shorten to 2000:2010 (i.e. all columns from 2000 to 2010)
EI_long <- EI %>%
    gather(Year, ExposureIndex, '2000':'2010')
head(EI_long, n = 10)
## Participant.ID Year ExposureIndex
## 1 N0004 2000 0.7500000
## 2 N0007 2000 0.9090909
## 3 N0016 2000 0.7500000
## 4 N0024 2000 0.7500000
## 5 N0041 2000 0.5000000
## 6 N0055 2000 0.7500000
## 7 N0059 2000 0.7500000
## 8 N0067 2000 0.7500000
## 9 N0089 2000 0.4375000
## 10 N0096 2000 0.4375000
Note: because the column names were numerical values and not strings, we had to encase them in quotes (’ ’). This is not necessary if column names are in character format.
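As an aside, in more recent versions of tidyr, gather() has been superseded by pivot_longer(); if your installation provides it, an equivalent call would be the following sketch:
# pivot_longer() equivalent of the gather() call above
EI_longer <- EI %>%
    pivot_longer(cols = `2000`:`2010`, names_to = 'Year', values_to = 'ExposureIndex')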
The converse of gather() is spread(), which takes two columns (key and value) and spreads them into multiple columns (dictated by the number of unique values).
EI_long %>%
    spread(Year, ExposureIndex) %>%
head()
## Participant.ID 2000 2001 2002 2003 2004
## 1 N0004 0.7500000 0.5666667 4.705882e-01 0.4736842 1.578947e-01
## 2 N0007 0.9090909 0.6923077 3.333333e-01 0.1666667 9.770000e-15
## 3 N0016 0.7500000 0.8750000 3.750000e-01 0.6250000 1.000000e+00
## 4 N0024 0.7500000 0.5333333 5.294117e-01 0.4210526 1.052632e-01
## 5 N0041 0.5000000 0.3888889 8.690000e-13 0.3333333 6.400000e-13
## 6 N0055 0.7500000 0.5333333 5.294117e-01 0.4210526 1.578947e-01
## 2005 2006 2007 2008 2009 2010
## 1 0.00e+00 0.00e+00 8.70e-14 0 5.00e-02 1.00e-01
## 2 1.27e-14 9.77e-15 0.00e+00 0 0.00e+00 2.44e-14
## 3 5.00e-01 0.00e+00 2.00e-01 0 2.16e-14 3.60e-15
## 4 0.00e+00 0.00e+00 8.70e-14 0 5.00e-02 1.00e-01
## 5 1.56e-13 0.00e+00 4.25e-13 0 3.45e-13 5.74e-14
## 6 0.00e+00 0.00e+00 8.70e-14 0 5.00e-02 1.00e-01
2.4.7 Mutate
Last but not least, we show you how to create, modify or even delete columns with mutate(). In fact, mutate() is probably one of the functions you will end up using the most. Its functionality is introduced here by means of two examples, which also introduce a few other useful features.
In the first example we are going to create a new column in our data.frame EI_long, in which we categorise the exposure index as high or low depending on whether its value is above or below 0.5, respectively. One new function we are going to introduce is ifelse(); please use the help for more details, although its use should become apparent.
EI_long <- EI_long %>%
    mutate(EI_cat = ifelse(ExposureIndex >= 0.5, 'high', 'low'))
head(EI_long, n=10)
## Participant.ID Year ExposureIndex EI_cat
## 1 N0004 2000 0.7500000 high
## 2 N0007 2000 0.9090909 high
## 3 N0016 2000 0.7500000 high
## 4 N0024 2000 0.7500000 high
## 5 N0041 2000 0.5000000 high
## 6 N0055 2000 0.7500000 high
## 7 N0059 2000 0.7500000 high
## 8 N0067 2000 0.7500000 high
## 9 N0089 2000 0.4375000 low
## 10 N0096 2000 0.4375000 low
In the second example we are going to use mutate() to modify three existing columns. First, we round the ExposureIndex to two decimal places; then we turn our new column EI_cat into a categorical variable using the function factor(); and finally we change the Year column into a numeric format
EI_long <- EI_long %>%
    mutate(ExposureIndex = round(ExposureIndex, digits=2)) %>%
mutate(EI_cat = factor(EI_cat)) %>%
mutate(Year = as.numeric(Year))
# note, we could have done the same in a single mutate() call
# EI_long <- EI_long %>%
# mutate(ExposureIndex = round(ExposureIndex, digits=2),
# EI_cat = factor(EI_cat),
# Year = as.numeric(Year))
head(EI_long, n=10)
## Participant.ID Year ExposureIndex EI_cat
## 1 N0004 2000 0.75 high
## 2 N0007 2000 0.91 high
## 3 N0016 2000 0.75 high
## 4 N0024 2000 0.75 high
## 5 N0041 2000 0.50 high
## 6 N0055 2000 0.75 high
## 7 N0059 2000 0.75 high
## 8 N0067 2000 0.75 high
## 9 N0089 2000 0.44 low
## 10 N0096 2000 0.44 low
Task:
- produce a ggplot line graph of the mean exposure index against time (2000 to 2010)
- show the distribution of the exposure index over time by means of box-and-whisker plots
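To get you started, here is one possible solution sketch for the first task, reusing group_by() and summarise() from earlier:
EI_long %>%
    group_by(Year) %>%
    summarise(mn = mean(ExposureIndex)) %>%
    ggplot(aes(x = Year, y = mn)) +
    geom_line()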
\(~\)
2.4.8 Dealing with missing data
One recurring problem with data analysis is that data are often incomplete, leading to more or less serious downstream problems. Different functions deal with missing data in different ways: some simply ignore them, whereas others fail completely. We therefore need to expand our data wrangling toolbox.
Missing data in R are usually referred to as NA’s, and here is an example of where NA’s can become problematic
x <- c(1, 4, 2, 9, NA, 4, 5)
mean(x)
## [1] NA
There are various ways to deal with missing data, of which we only touch upon a few.
Detecting missing data
The first thing we might want to know is whether there are any NA’s at all. A function for this is is.na(), which works on single variables, vectors, matrices and data.frames
df <- data.frame(x = c(1, 4, 2, 9, NA, 4, 5),
                 y = c(0.2, 0.5, NA, 0.1, NA, 0.8, 0.9))
is.na(df)
## x y
## [1,] FALSE FALSE
## [2,] FALSE FALSE
## [3,] FALSE TRUE
## [4,] FALSE FALSE
## [5,] TRUE TRUE
## [6,] FALSE FALSE
## [7,] FALSE FALSE
Removing NA’s
Going back to our previous example, if we want to calculate the mean of x we can either manually remove the offending entries or tell mean() to ignore them
# option 1: tell mean() to ignore NA
mean(x, na.rm = TRUE)
## [1] 4.166667
# option 2: remove NA manually
x <- x[!is.na(x)] # i.e. assign to x all entries of x that are not (!) NA
mean(x)
## [1] 4.166667
What about our data.frame df? As df has both rows and columns, we cannot simply use the same method but have to be more specific. For example, to remove rows where df$x contains missing data
df[!is.na(df$x),]
## x y
## 1 1 0.2
## 2 4 0.5
## 3 2 NA
## 4 9 0.1
## 6 4 0.8
## 7 5 0.9
or to remove all rows where either df$x or df$y contains missing data (note: for this we need a logical operator, in fact an and (&), because we want NA’s neither in x nor in y, so both conditions have to be satisfied together)
df[!is.na(df$x) & !is.na(df$y),]
## x y
## 1 1 0.2
## 2 4 0.5
## 4 9 0.1
## 6 4 0.8
## 7 5 0.9
An easier way to achieve the same is by using the complete.cases() function, which selects only those rows where we have complete information (i.e. no missing data)
df[complete.cases(df),]
## x y
## 1 1 0.2
## 2 4 0.5
## 4 9 0.1
## 6 4 0.8
## 7 5 0.9
And finally, here are the same examples done using tidy functionality
df %>%
    filter(!is.na(x) & !is.na(y))
## x y
## 1 1 0.2
## 2 4 0.5
## 3 9 0.1
## 4 4 0.8
## 5 5 0.9
which is equivalent to
df %>%
    drop_na()
## x y
## 1 1 0.2
## 2 4 0.5
## 3 9 0.1
## 4 4 0.8
## 5 5 0.9
which is equivalent to
df %>%
    filter_at(vars(x:y), all_vars(!is.na(.)))
## x y
## 1 1 0.2
## 2 4 0.5
## 3 9 0.1
## 4 4 0.8
## 5 5 0.9
which is equivalent to
df %>%
    filter_at(vars(x:y), all_vars(complete.cases(.)))
## x y
## 1 1 0.2
## 2 4 0.5
## 3 9 0.1
## 4 4 0.8
## 5 5 0.9
As they say in English: “there is more than one way to skin a cat”.