You can download this post as a PDF or RMarkdown file. These formats support code highlighting.
We recently wrote about how IDinsight strives to use the right analytical and statistical tools to advise decision-makers and improve social impact. In that post, we highlighted the benefits of the statistical software R, which is especially useful to visually communicate complex ideas. This post aims to provide beginner practitioners with the tools to make a graphic using ggplot2, a package within R.
At the end of this post, we hope you will have a better understanding of the graph design process from beginning (deciding the elements of your graph) to end (making the final graph look polished). Additionally, you will have code for a plot that you can easily modify for your future graphing needs.
There is a wealth of information on the philosophy of ggplot2, how to get started with ggplot2, and how to customize the smallest elements of a graphic using ggplot2 — but it’s all in different corners of the Internet. It can be difficult for a beginner to tie all this information together.
This post assumes basic familiarity with the following R concepts:
• vectors
• data frames
• factors
I also use the dplyr package to clean data. All code is commented so this should be straightforward to follow even if you have not used dplyr before.
We will be using the gapminder dataset, which comes with the gapminder R package. This dataset is an excerpt from the Gapminder data, and it shows the life expectancy, population and GDP per capita of various countries at five-year intervals from 1952 to 2007 (12 years in total).
We would like to show the change in life expectancy from 1952 to 2007 for 11 (arbitrarily-selected) countries: Bolivia, China, Ethiopia, Guatemala, Haiti, India, Kenya, Pakistan, Sri Lanka, Tanzania, Uganda.
Specifically, we want to see the life expectancy in each of these countries in 1952 and 2007. We also want to group the countries by continent.
We will use a bar plot to communicate this information graphically because we can easily see the levels of the life expectancy variable, and compare values over time and across countries. Here is a rough sketch to get us started on what we can do:
Note that we want two bars per country — one of these should be the life expectancy in 1952 and the other in 2007. We also want to colour the bars differently based on the continent.
ggplot2 is based on the “grammar of graphics”, which provides a standard way to describe the components of a graph (the “gg” in ggplot2 refers to the grammar of graphics). It has specialized terminology to refer to the elements of a graph, and I’ll introduce and explain new terms as we encounter them. For now, what we need to understand is that we will build a graphic by adding components one after the other, like layers.
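The first step to building the graphic is to identify the components. Using our rough sketch as a guide, we know that our components are:
• the x-axis, which shows the countries
• the y-axis, which shows life expectancy
• two bars per country, one for 1952 and one for 2007
• bar colours that vary by continent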
Now that we know what we need to include in the graph, let’s move on to writing code.
We need to install the following packages: ggplot2, dplyr and gapminder.
We can use the following code to install and load packages.
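A minimal sketch of the install-and-load step, assuming the three packages used in this post (ggplot2, dplyr and gapminder). The install line is commented out so that the block can be re-run without reinstalling anything.

```r
# Install the packages once; uncomment this line on first use
# install.packages(c("ggplot2", "dplyr", "gapminder"))

# Load the packages in every new R session
library(ggplot2)    # plotting
library(dplyr)      # data cleaning
library(gapminder)  # the gapminder dataset
```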
Let’s have a look at the data again. It’s saved under gapminder:
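One way to inspect the data is sketched below, using head() and str(); other inspection functions would work just as well.

```r
library(gapminder)

# First few rows: country, continent, year, lifeExp, pop, gdpPercap
head(gapminder)

# Column types and dimensions
str(gapminder)
```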
Let’s restrict the data to the countries and years we are interested in, and save this new dataset as data_graph.
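The filtering step might look like the sketch below. The helper vector countries is a name introduced here for illustration; the country names and years come from the description above.

```r
library(dplyr)
library(gapminder)

# The 11 countries we want to plot (helper vector introduced for this sketch)
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")

# Keep only the rows for those countries in 1952 and 2007
data_graph <- gapminder %>%
  filter(country %in% countries,
         year %in% c(1952, 2007))

data_graph
```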
Let’s also make “year” a factor, since it is a discrete variable:
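A sketch of the conversion using dplyr's mutate(). The first few lines rebuild data_graph so that the block runs on its own; countries is a helper name introduced for illustration.

```r
library(dplyr)
library(gapminder)

# Rebuild data_graph so that this block runs on its own
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007))

# Convert year from a number to a factor (a discrete variable)
data_graph <- data_graph %>%
  mutate(year = factor(year))

# The order of the levels matters later, when we assign one value per year
levels(data_graph$year)
```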
To build a ggplot, we first use the ggplot() function to specify the default data source and aesthetic mappings:
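A sketch of the base plot, using the same call that is quoted later in this post. The object name plot_base is an assumption made here for illustration.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Rebuild data_graph so that this block runs on its own
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))

# Default data source and aesthetic mappings; no geoms yet
plot_base <- ggplot(data = data_graph, aes(x = country, y = lifeExp))

# Printing shows axes only, because no geom has been added
plot_base
```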
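Let’s break this down a little:
• data = data_graph tells ggplot() which dataset to use by default.
• aes() specifies the aesthetic mappings, i.e. the relationship between a visual element and a variable. Here, “country” is mapped to the x-axis and “lifeExp” to the y-axis.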
Note that there is no bar graph because we haven’t specified one yet. We have just specified which dataset and axes to use, not the type of graphic to display.
Let’s make the graph look a bit nicer. My preference is to make the following adjustments:
We will use the theme() function to make these changes. theme() allows us to modify the display of non-data elements of the graph.
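The sketch below shows how theme() can be added to a base plot. The specific adjustments (a blank background, faint major grid lines, no tick marks) are assumptions chosen for illustration and may differ from the preferences listed above; plot_base is likewise an assumed name, while plot_base_clean is the name this post uses for the cleaned-up plot.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Rebuild the data and base plot so that this block runs on its own
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))
plot_base <- ggplot(data = data_graph, aes(x = country, y = lifeExp))

# Illustrative theme adjustments (assumed, not necessarily the post's exact list)
plot_base_clean <- plot_base +
  theme(
    panel.background = element_blank(),                  # remove the grey background
    panel.grid.minor = element_blank(),                  # drop minor grid lines
    panel.grid.major = element_line(colour = "grey90"),  # keep faint major grid lines
    axis.ticks = element_blank(),                        # remove tick marks
    axis.line = element_line(colour = "grey50")          # draw the axis lines
  )

plot_base_clean
```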
Note that we did not have to re-write the code to make the base plot or modify it in any way. Instead, we kept the base plot object as-is and “added” themes to it using the + operator. This is how we build a ggplot — we add components together to build a graphic.
In order to add bars to our ggplot, we need to understand geometric objects (“geoms”). A “geom” is a mark we add to the plot to represent data. For example, we can use the geom “point” to display our data using points, in which case the resulting graphic would be a scatterplot. The ggplot2 cheatsheet has a list of all the geoms we can add to a plot.
We will be adding bars to our graph using geom_bar():
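A sketch of this step is below. The setup lines rebuild data_graph and plot_base_clean so that the block runs on its own; the one-line theme() call is a stand-in for the clean-up step, and p_bars is a name introduced for illustration.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Setup repeated so that this block runs on its own
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))
plot_base_clean <- ggplot(data = data_graph, aes(x = country, y = lifeExp)) +
  theme(panel.background = element_blank())  # stand-in for the clean-up step

# Add bars; stat = "identity" plots the lifeExp values as they are
p_bars <- plot_base_clean +
  geom_bar(stat = "identity")

p_bars
```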
We now have a bar graph. The numbers don’t seem to be right since the life expectancy is close to 100 for all countries — we will fix this later.
It may seem strange that we didn’t specify the x and y values for the bars, but the bars displayed life expectancy by country anyway. This is because of ggplot’s “hierarchy of defaults”. Since we add the call to geom_bar() to an existing call to ggplot(data = data_graph, aes(x = country, y = lifeExp)), ggplot2 assumes that the x and y variables for geom_bar() are the same as those for ggplot() i.e. the x and y variables are “country” and “lifeExp”, respectively.
We also specified stat in the call to geom_bar(). stat is used when we want to apply a statistical transformation to the data and show the results graphically. By default, geom_bar() uses stat = "count", so each bar shows the number of rows for each value of the x-variable. Since we want ggplot2 to plot the values of “lifeExp” as-is, we specify stat = "identity".
Now, let’s change the colour of the bars. We ultimately want the colour of the bars to vary by continent, but let’s start with something simpler — let’s change the colour of the bars to light blue. To do this, we will specify fill = "lightblue" inside the call to geom_bar().
Now, let’s make the colour of the bars vary by continent. We are saying that we want a mapping from an aesthetic element (the colour inside the bars) to a variable in our data (“continent”). Recall that we use the aes() function to specify a relationship between a visual element and a variable. Within aes(), we will use the fill argument to specify that we are interested in changing the colour of the bars.
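The two uses of fill can be sketched side by side: a constant colour set outside aes(), and a mapping to “continent” set inside aes(). The setup lines and the object names p_lightblue and p_continent are assumptions for illustration.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Setup repeated so that this block runs on its own
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))
plot_base_clean <- ggplot(data = data_graph, aes(x = country, y = lifeExp)) +
  theme(panel.background = element_blank())  # stand-in for the clean-up step

# Constant colour: fill is set outside aes(), so every bar is light blue
p_lightblue <- plot_base_clean +
  geom_bar(stat = "identity", fill = "lightblue")

# Mapped colour: fill is set inside aes(), so it varies with continent
p_continent <- plot_base_clean +
  geom_bar(aes(fill = continent), stat = "identity")

p_lightblue
p_continent
```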
Note that we used fill in both cases, because fill is what controls the colour inside the bars. However, we did not use aes() when we coloured the bars light blue because the colour inside the bars wasn’t related to any variables.
Now, we will address why we aren’t seeing the correct values of life expectancy in the graph. Since each country has two observations for life expectancy (one for 1952 and one for 2007), and we haven’t specified which observation to use, the life expectancy shown by the bars is actually the sum of life expectancy for both years.
Let’s see what happens when we restrict the graph to include only data for 2007.
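A sketch of the 2007-only plot, passing a filtered dataset to the data argument of geom_bar(). The names data_2007 and p_2007 are introduced for illustration, and the setup lines are repeated so that the block runs on its own.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Setup repeated so that this block runs on its own
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))
plot_base_clean <- ggplot(data = data_graph, aes(x = country, y = lifeExp)) +
  theme(panel.background = element_blank())  # stand-in for the clean-up step

# year is a factor, so we compare it with the level "2007"
data_2007 <- data_graph %>%
  filter(year == "2007")

# The data argument here overrides the default dataset for this layer only
p_2007 <- plot_base_clean +
  geom_bar(data = data_2007, aes(fill = continent), stat = "identity")

p_2007
```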
We now see the correct values of life expectancy. Note that though the plot_base_clean object already had a default value of data (data_graph), we were able to override it in the call to geom_bar(). This again ties back to the hierarchy of defaults – if we don’t specify a new dataset or xy-variables for our geoms, we simply use the dataset and xy-variables provided in the call to ggplot(), but since we specified a new value of data within geom_bar(), the bars reflect a new data source.
Next, we are interested in showing two data points per country, one for 1952 and one for 2007. Here is where the alpha aesthetic is useful. It specifies the transparency of the colours we are using. Let’s try using alpha with the same subsetted dataset:
We see that similar to specifying fill = "lightblue", specifying alpha to be a number changes the transparency levels of each bar. alpha values range from 0 to 1, with higher values being more opaque.
Like fill, alpha can also be used as an aesthetic. Let’s establish a relationship between the transparency of a bar and the year. Since we are interested in both years, we won’t restrict data_graph in geom_bar().
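Mapping alpha to “year” might look like the sketch below. The setup lines and the name p_alpha are assumptions for illustration.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Setup repeated so that this block runs on its own
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))
plot_base_clean <- ggplot(data = data_graph, aes(x = country, y = lifeExp)) +
  theme(panel.background = element_blank())  # stand-in for the clean-up step

# Both fill and alpha are mapped to variables, so both go inside aes().
# ggplot2 may warn that alpha is not advised for a discrete variable.
p_alpha <- plot_base_clean +
  geom_bar(aes(fill = continent, alpha = year), stat = "identity")

p_alpha
```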
We don’t want a stacked bar chart, but alpha does seem to be working – we see that the lighter portions of the bars correspond to the values in 1952, while the darker portions correspond to values in 2007.
Now, let’s use the position argument to make the bars appear side-by-side, instead of being stacked. According to the ggplot2 documentation, bars are stacked by default and we need to specify position = "dodge" to make the bars appear side-by-side.
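A sketch of the dodged version, which differs from the stacked plot only in the position argument. The setup lines and the name p_dodge are assumptions for illustration.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Setup repeated so that this block runs on its own
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))
plot_base_clean <- ggplot(data = data_graph, aes(x = country, y = lifeExp)) +
  theme(panel.background = element_blank())  # stand-in for the clean-up step

# position = "dodge" places the 1952 and 2007 bars side by side
p_dodge <- plot_base_clean +
  geom_bar(aes(fill = continent, alpha = year),
           stat = "identity", position = "dodge")

p_dodge
```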
Note that position = "dodge" is another way of writing position = position_dodge(). position_dodge() can take a width argument, which is discussed in detail in this Stack Overflow post. We are using the default width, which is why we can use the shorter version position = "dodge".
The 1952 colours for alpha are very light. Let’s modify the transparency provided by alpha using scale_alpha_manual().
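A sketch of this step, using the transparency values 0.6 and 1 that this post assigns to 1952 and 2007. The setup lines and the name p_alpha_manual are assumptions for illustration.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Setup repeated so that this block runs on its own
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))
plot_base_clean <- ggplot(data = data_graph, aes(x = country, y = lifeExp)) +
  theme(panel.background = element_blank())  # stand-in for the clean-up step

# One alpha value per level of year, in the order given by levels(data_graph$year)
p_alpha_manual <- plot_base_clean +
  geom_bar(aes(fill = continent, alpha = year),
           stat = "identity", position = "dodge") +
  scale_alpha_manual(values = c(0.6, 1))

p_alpha_manual
```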
Here, we specified a vector for scale_alpha_manual, where each element provides the transparency of the corresponding year. We assigned a transparency of 0.6 to 1952 and 1 to 2007 (we know the first element corresponds to 1952 and the second element to 2007 because that is the order of levels for the “year” factor. You can check this using levels(data_graph$year)).
Let’s also change the colour scheme for the continent colours using scale_fill_manual(). We provide a vector of colours, where each element provides the colour for the corresponding continent. I have provided the colours in hexadecimal format (e.g. as “#FF0011”), but you can provide colours in any other format you prefer.
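A sketch of a manual fill scale. The three hexadecimal colours are placeholders chosen for illustration, not necessarily the ones used for the original graph. The vector is named by continent here, which is one way to make the colour assignment explicit instead of relying on the order of the factor levels.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Setup repeated so that this block runs on its own
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))
plot_base_clean <- ggplot(data = data_graph, aes(x = country, y = lifeExp)) +
  theme(panel.background = element_blank())  # stand-in for the clean-up step

# Placeholder colours, one per continent present in data_graph
continent_colours <- c(Africa = "#1B9E77", Americas = "#D95F02", Asia = "#7570B3")

p_colours <- plot_base_clean +
  geom_bar(aes(fill = continent, alpha = year),
           stat = "identity", position = "dodge") +
  scale_alpha_manual(values = c(0.6, 1)) +
  scale_fill_manual(values = continent_colours)

p_colours
```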
Let’s turn our plot into a horizontal bar chart using coord_flip():
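A sketch of the flipped plot. The setup lines and the name p_flip are assumptions for illustration, and the fill colours are left at their defaults to keep the block short.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Setup repeated so that this block runs on its own
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))
plot_base_clean <- ggplot(data = data_graph, aes(x = country, y = lifeExp)) +
  theme(panel.background = element_blank())  # stand-in for the clean-up step

# coord_flip() swaps the axes: countries run vertically, life expectancy horizontally
p_flip <- plot_base_clean +
  geom_bar(aes(fill = continent, alpha = year),
           stat = "identity", position = "dodge") +
  scale_alpha_manual(values = c(0.6, 1)) +
  coord_flip()

p_flip
```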
Note the order of the bars still reflects the levels of the factor i.e. countries coming first alphabetically are closer to the origin, and the bar for 1952 is below the bar for 2007. We are going to go ahead with this order, but if you’d like the countries or years to appear in a different order, all you have to do is modify the factor levels of the corresponding variables.
Our graph is already quite informative — we can identify the continent a country belongs to by the colour of the bar. If we want the country bars to appear by continent, we can change the levels of the “country” factor so that the country names are sorted by continent.
However, it would be much more effective if we could group the countries into continents on the x-axis. The reader of the graph wouldn’t need to keep referring to the legend; all the information would be in one place. We can create these groups using facets.
Facets are used to split the ggplot into a matrix of panels. Let’s add a facet for the “continent” variable to understand what “matrix of panels” means:
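A sketch of the faceted plot, with one row of panels per continent. The setup lines and the name p_facet are assumptions for illustration.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Setup repeated so that this block runs on its own
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))
plot_base_clean <- ggplot(data = data_graph, aes(x = country, y = lifeExp)) +
  theme(panel.background = element_blank())  # stand-in for the clean-up step

# rows = vars(continent) creates one horizontal panel per continent
p_facet <- plot_base_clean +
  geom_bar(aes(fill = continent, alpha = year),
           stat = "identity", position = "dodge") +
  scale_alpha_manual(values = c(0.6, 1)) +
  coord_flip() +
  facet_grid(rows = vars(continent))

p_facet
```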
We see that our graph is now in 3 horizontal panels, with each panel representing a different continent.
Let’s break the facet_grid() command down a little: we wanted horizontal panels, so we specified the rows argument. Each row/panel is based on continent, so we specified rows = vars(continent). vars() just indicates that “continent” should be looked up in the dataset we are using in our ggplot() command. If we don’t wrap it in vars(), we will get an error saying that the object “continent” was not found.
Now, we will explore some arguments of facet_grid() that can improve the appearance of the graph. All of these are covered in detail in the ggplot2 documentation; in this post, we will use only a few options.
First, we see that the graph is assuming that every x-variable (“country”, in our case) exists for every faceting variable (“continent”) e.g. Haiti is in the Africa and Asia panel as well as the Americas panel. This is because ggplot2 assumes every panel will have the same scale, where “scale” refers to the values the x and y axis take on. Our scale of interest is country names, and currently each continent has exactly the same scale – all of the country names are included for each continent. To remedy this, we specify scales = "free_y" – we say that every faceting variable (“continent”) can have its own scale (where a “scale” would be only those country names that are part of the continent).
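A sketch of the same plot with scales = "free_y" added to facet_grid(), so that each panel lists only its own countries. The setup lines and the name p_free_scales are assumptions for illustration.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Setup repeated so that this block runs on its own
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))
plot_base_clean <- ggplot(data = data_graph, aes(x = country, y = lifeExp)) +
  theme(panel.background = element_blank())  # stand-in for the clean-up step

# scales = "free_y" lets each continent panel have its own set of country names
p_free_scales <- plot_base_clean +
  geom_bar(aes(fill = continent, alpha = year),
           stat = "identity", position = "dodge") +
  scale_alpha_manual(values = c(0.6, 1)) +
  coord_flip() +
  facet_grid(rows = vars(continent), scales = "free_y")

p_free_scales
```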
Now, notice that the bars for the Americas are thicker than the bars for Africa or Asia. This is because by default, ggplot makes all panels (i.e. all continents) occupy the same amount of space. We’d prefer that all our bars be equally thick, rather than our panels be equally tall. Let’s add space = "free_y".
It seems a little confusing to have the continent names to the right and the country names to left. We can use the switch option to change where the facet labels (i.e. continent names) are displayed.
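The sketch below combines the facet_grid() arguments discussed so far: scales = "free_y", space = "free_y" and switch. The value switch = "y" is an assumption about how the option is used here; it moves the row labels from the right-hand side to the left. The setup lines and the name p_switch are also assumptions for illustration.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Setup repeated so that this block runs on its own
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))
plot_base_clean <- ggplot(data = data_graph, aes(x = country, y = lifeExp)) +
  theme(panel.background = element_blank())  # stand-in for the clean-up step

p_switch <- plot_base_clean +
  geom_bar(aes(fill = continent, alpha = year),
           stat = "identity", position = "dodge") +
  scale_alpha_manual(values = c(0.6, 1)) +
  coord_flip() +
  facet_grid(rows = vars(continent),
             scales = "free_y",  # each panel shows only its own countries
             space = "free_y",   # panel height follows the number of countries
             switch = "y")       # continent labels move to the left

p_switch
```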
This looks quite good! Let’s do the following to modify the appearance of the facet labels i.e. the continent names:
Our graph is almost ready! Let’s clean up the legend and the axes, and give a title to our graph.
Legend
To reduce chartjunk, let’s suppress the legend for continent because we already have that information in the facets. We will use the guides() function to suppress the legend for the fill aesthetic (recall that we set aes(fill = continent) in geom_bar()).
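A sketch of suppressing the fill legend. It uses guides(fill = "none"), which is the form accepted by recent versions of ggplot2; older code often wrote guides(fill = FALSE) instead. The setup lines and the name p_legend are assumptions for illustration.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Setup repeated so that this block runs on its own
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))
plot_base_clean <- ggplot(data = data_graph, aes(x = country, y = lifeExp)) +
  theme(panel.background = element_blank())  # stand-in for the clean-up step

# Hide the continent (fill) legend; the year (alpha) legend stays
p_legend <- plot_base_clean +
  geom_bar(aes(fill = continent, alpha = year),
           stat = "identity", position = "dodge") +
  scale_alpha_manual(values = c(0.6, 1)) +
  coord_flip() +
  facet_grid(rows = vars(continent), scales = "free_y",
             space = "free_y", switch = "y") +
  guides(fill = "none")

p_legend
```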
DataNovia has an excellent guide for formatting ggplot legends, if you’d like to modify the legend further e.g. change its position, manually change legend colours, etc.
Graph labels
Finally, let’s use the labs function to change the labels for this graph. We want to:
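A sketch of how labs() can set the title, axis labels and legend title. The label text below is invented for illustration and may differ from the labels on the original graph. Because the plot uses coord_flip(), the x label describes the country axis, which is drawn vertically.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Setup repeated so that this block runs on its own
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))
plot_base_clean <- ggplot(data = data_graph, aes(x = country, y = lifeExp)) +
  theme(panel.background = element_blank())  # stand-in for the clean-up step

p_labels <- plot_base_clean +
  geom_bar(aes(fill = continent, alpha = year),
           stat = "identity", position = "dodge") +
  scale_alpha_manual(values = c(0.6, 1)) +
  coord_flip() +
  labs(title = "Life expectancy in 1952 and 2007",  # illustrative label text
       x = "Country",
       y = "Life expectancy (years)",
       alpha = "Year")                              # title of the alpha legend

p_labels
```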
And that is our graph!
Here is all the graph code in one place:
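The sketch below assembles the steps from this post into one block. The colours, label text and theme settings (including the facet label styling) are assumptions chosen for illustration, so the result may not match the original graph exactly; final_plot is likewise an assumed name.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Data: 11 countries, 1952 and 2007, with year as a factor
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))

# Placeholder colours, one per continent present in data_graph
continent_colours <- c(Africa = "#1B9E77", Americas = "#D95F02", Asia = "#7570B3")

final_plot <- ggplot(data = data_graph, aes(x = country, y = lifeExp)) +
  # Bars: colour by continent, transparency by year, side by side
  geom_bar(aes(fill = continent, alpha = year),
           stat = "identity", position = "dodge") +
  # Scales: one alpha value per year, one colour per continent
  scale_alpha_manual(values = c(0.6, 1)) +
  scale_fill_manual(values = continent_colours) +
  # Horizontal bars, grouped into one panel per continent
  coord_flip() +
  facet_grid(rows = vars(continent), scales = "free_y",
             space = "free_y", switch = "y") +
  # Legend and labels (label text is illustrative)
  guides(fill = "none") +
  labs(title = "Life expectancy in 1952 and 2007",
       x = "Country", y = "Life expectancy (years)", alpha = "Year") +
  # Non-data elements (assumed styling)
  theme(panel.background = element_blank(),
        panel.grid.minor = element_blank(),
        axis.ticks = element_blank(),
        strip.placement = "outside",
        strip.background = element_blank())

final_plot
```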
You can save a copy of the graph using the ggsave() command, which allows you to specify the save location, dimensions of the file, image format (.png, .jpg etc.), and more.
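A sketch of a ggsave() call. The file name, dimensions and resolution are hypothetical, and the small plot is a stand-in so that the block runs on its own; in practice you would pass the finished graph to the plot argument.

```r
library(ggplot2)
library(gapminder)

# A small stand-in plot; pass your finished graph here instead
plot_to_save <- ggplot(subset(gapminder, year == 2007 & continent == "Oceania"),
                       aes(x = country, y = lifeExp)) +
  geom_bar(stat = "identity")

# Hypothetical save location; the file extension sets the image format
out_file <- file.path(tempdir(), "life_expectancy.png")

ggsave(filename = out_file, plot = plot_to_save,
       width = 8, height = 6, units = "in", dpi = 300)
```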
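Now that we understand how to build a ggplot, let’s map the elements of our graph to the components of a plot:
• data: data_graph
• aesthetic mappings: x = country, y = lifeExp, fill = continent, alpha = year
• geom: bars, added with geom_bar()
• stat: "identity"
• position: "dodge"
• scales: scale_alpha_manual() and scale_fill_manual()
• coordinates: flipped with coord_flip()
• facets: one row per continent, added with facet_grid()
• non-data elements: theme(), guides() and labs()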
The graph components are succinctly expressed in this code template:
You can make graphs with the following geoms to learn more about ggplot2:
• geom_text()
• geom_errorbar()
• geom_line()
We would love to know if this worked for you. Write to us with questions or share your graphs with us in the comments below.