How to make bar graphs using ggplot2 in R

Ishita Batra 12 September 2019

You can download this post as a PDF or RMarkdown file. These formats support code highlighting.

We recently wrote about how IDinsight strives to use the right analytical and statistical tools to advise decision-makers and improve social impact. In that post, we highlighted the benefits of the statistical software R, which is especially useful to visually communicate complex ideas. This post aims to provide beginner practitioners with the tools to make a graphic using ggplot2, a package within R.

At the end of this post, we hope you will have a better understanding of the graph design process from beginning (deciding the elements of your graph) to end (making the final graph look polished). Additionally, you will have code for a plot that you can easily modify for your future graphing needs.

There is a wealth of information on the philosophy of ggplot2, how to get started with ggplot2, and how to customize the smallest elements of a graphic using ggplot2 — but it’s all in different corners of the Internet. It can be difficult for a beginner to tie all this information together.

This post assumes basic familiarity with the following R concepts:

• vectors

• data frames

• factors

I also use the dplyr package to clean data. All code is commented so this should be straightforward to follow even if you have not used dplyr before.

We will be using the gapminder dataset, which is available in R through the gapminder package. This dataset is an excerpt from the Gapminder data, and it shows the life expectancy, population and GDP per capita of various countries for 12 years between 1952 and 2007 (one observation every five years).

Visualization task

We would like to show the change in life expectancy from 1952 to 2007 for 11 (arbitrarily-selected) countries: Bolivia, China, Ethiopia, Guatemala, Haiti, India, Kenya, Pakistan, Sri Lanka, Tanzania, Uganda.

Specifically, we want to see the life expectancy in each of these countries in 1952 and 2007. We also want to group the countries by continent.

We will use a bar plot to communicate this information graphically because we can easily see the levels of the life expectancy variable, and compare values over time and across countries. Here is a rough sketch to get us started on what we can do:

Note that we want two bars per country — one of these should be the life expectancy in 1952 and the other in 2007. We also want to colour the bars differently based on the continent.

Components of the graph

ggplot2 is based on the “grammar of graphics”, which provides a standard way to describe the components of a graph (the “gg” in ggplot2 refers to the grammar of graphics). It has specialized terminology to refer to the elements of a graph, and I’ll introduce and explain new terms as we encounter them. For now, what we need to understand is that we will build a graphic by adding components one after the other, like layers.

The first step to building the graphic is to identify the components. Using our rough sketch as a guide, we know that our components are:

  1. Dataset — for us, this is a subset of the gapminder data that includes only the countries and years in question
  2. Coordinate system — Cartesian
  3. Axes — we want country name on the x-axis and life expectancy on the y-axis
  4. Type of visualization — we want one bar per country per year e.g. for India, we want one bar for the life expectancy in 1952 and another bar for 2007
  5. Groups on the x-axis — we want to group countries by continent

Now that we know what we need to include in the graph, let’s move on to writing code.

Making the chart

We need to install the following packages:

  • ggplot2
  • dplyr — to manipulate data
  • gapminder — data source

We can use the following code to install and load packages.
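A minimal sketch of this step, assuming the packages are installed from CRAN and that a CRAN mirror is already configured (the helper variable names `pkgs` and `to_install` are only illustrative):

```r
# Install any of the three packages that are not yet available, then load them.
pkgs <- c("ggplot2", "dplyr", "gapminder")
to_install <- setdiff(pkgs, rownames(installed.packages()))
if (length(to_install) > 0) install.packages(to_install)

library(ggplot2)    # plotting
library(dplyr)      # data manipulation
library(gapminder)  # data source
```

Installation only needs to happen once per machine; the library() calls are needed in every new R session.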

Let’s have a look at the data again. It’s saved under gapminder:
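One way to inspect the data is with base R's head() and str(); this is a sketch, and other functions such as dplyr's glimpse() would work just as well:

```r
library(gapminder)

head(gapminder)  # first six rows of the data
str(gapminder)   # column names and types
```

The columns are country, continent, year, lifeExp, pop and gdpPercap.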

Let’s restrict the data to the countries and years we are interested in, and save this new dataset as data_graph.

Let’s also make “year” a factor, since it is a discrete variable:
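The two data-preparation steps above (subsetting, then converting year to a factor) could be sketched as follows. The vector name `countries` is an illustrative assumption; `data_graph` is the name used throughout this post.

```r
library(dplyr)
library(gapminder)

# The 11 countries named in the visualization task
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")

# Keep only the countries and years of interest
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007))

# Treat year as a discrete variable by converting it to a factor
data_graph <- data_graph %>%
  mutate(year = factor(year))

levels(data_graph$year)  # "1952" "2007"
```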

To build a ggplot, we first use the ggplot() function to specify the default data source and aesthetic mappings:
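A sketch of that call, with the data preparation repeated so the snippet runs on its own (the object name `plot_base` is an assumption):

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Same subset as before: 11 countries, two years, year as a factor
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))

# Default data source and aesthetic mappings; no geoms yet
plot_base <- ggplot(data = data_graph, aes(x = country, y = lifeExp))
plot_base
```

Printing the object draws the axes only, because no layer has been added yet.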

Let’s break this down a little:

  • data source: “data_graph” in our case
  • aesthetic mappings: The aes() function maps variables in our data frame to aesthetic attributes. An “aesthetic attribute” is a visual element of the graph, such as the shape of a point or the colour of a line. In our case, we are specifying that the axes (which are aesthetic attributes) should correspond to the variables “country” and “lifeExp”.

Note that there is no bar graph because we haven’t specified one yet. We have just specified which dataset and axes to use, not the type of graphic to display.

Let’s make the graph look a bit nicer. My preference is to make the following adjustments:

  • Simple, black-and-white layout
  • No background colour
  • No gridlines
  • The chart area shouldn’t be in a box; we should have only the x and y axis

We will use the theme() function to make these changes. theme() allows us to modify the display of non-data elements of the graph.
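One possible set of theme() settings for the adjustments listed above is sketched here. The exact elements chosen are an assumption; there are several ways to get a similar look.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Same subset as before: 11 countries, two years, year as a factor
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))

plot_base <- ggplot(data = data_graph, aes(x = country, y = lifeExp))

plot_base_clean <- plot_base +
  theme(panel.grid.major = element_blank(),           # no major gridlines
        panel.grid.minor = element_blank(),           # no minor gridlines
        panel.background = element_blank(),           # no background colour
        panel.border = element_blank(),               # no box around the chart area
        axis.line = element_line(colour = "black"))   # keep only the x and y axis lines
plot_base_clean
```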

Note that we did not have to re-write the code to make the base plot or modify it in any way. Instead, we kept the base plot object as-is and “added” themes to it using the + operator. This is how we build a ggplot — we add components together to build a graphic.

In order to add bars to our ggplot, we need to understand geometric objects (“geoms”). A “geom” is a mark we add to the plot to represent data. For example, we can use the geom “point” to display our data using points, in which case the resulting graphic would be a scatterplot. The ggplot2 cheatsheet has a list of all the geoms we can add to a plot.

We will be adding bars to our graph using geom_bar():
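A self-contained sketch of this step (the object name `plot_bars` is an assumption):

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Same subset as before: 11 countries, two years, year as a factor
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))

plot_base_clean <- ggplot(data_graph, aes(x = country, y = lifeExp)) +
  theme(panel.grid = element_blank(), panel.background = element_blank(),
        axis.line = element_line(colour = "black"))

# stat = "identity" plots the lifeExp values as they are in the data
plot_bars <- plot_base_clean + geom_bar(stat = "identity")
plot_bars
```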

We now have a bar graph. The numbers don’t seem to be right since the life expectancy is close to 100 for all countries — we will fix this later.

It may seem strange that we didn’t specify the x and y values for the bars, but the bars displayed life expectancy by country anyway. This is because of ggplot’s “hierarchy of defaults”. Since we add the call to geom_bar() to an existing call to ggplot(data = data_graph, aes(x = country, y = lifeExp)), ggplot2 assumes that the x and y variables for geom_bar() are the same as those for ggplot(), i.e. the x and y variables are “country” and “lifeExp”, respectively.

We also specified stat in the call to geom_bar(). stat is used when we want to apply a statistical function to the data and show the results graphically. By default, geom_bar() uses stat = "count", which makes each bar show the number of observations for each value of the x-variable. Since we want ggplot to plot the values of lifeExp as-is, we specify stat = "identity".

Now, let’s change the colour of the bars. We ultimately want the colour of the bars to vary by continent, but let’s start with something simpler — let’s change the colour of the bars to light blue. To do this, we will specify fill = "lightblue" inside the call to geom_bar().

Now, let’s make the colour of the bars vary by continent. We are saying that we want a mapping from an aesthetic element (the colour inside the bars) to a variable in our data (“continent”). Recall that we use the aes() function to specify a relationship between a visual element and a variable. Within aes(), we will use the fill argument to specify that we are interested in changing the colour of the bars.

Note that we used fill in both cases, because fill is what controls the colour inside the bars. However, we did not use aes() when we coloured the bars light blue because the colour inside the bars wasn’t related to any variables.
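The two uses of fill can be compared side by side in a short sketch (the object names `plot_lightblue` and `plot_continent` are assumptions):

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Same subset as before: 11 countries, two years, year as a factor
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))

plot_base_clean <- ggplot(data_graph, aes(x = country, y = lifeExp)) +
  theme(panel.grid = element_blank(), panel.background = element_blank(),
        axis.line = element_line(colour = "black"))

# Constant colour: fill is set outside aes()
plot_lightblue <- plot_base_clean +
  geom_bar(stat = "identity", fill = "lightblue")

# Colour mapped to a variable: fill is set inside aes()
plot_continent <- plot_base_clean +
  geom_bar(stat = "identity", aes(fill = continent))

plot_lightblue
plot_continent
```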

Now, we will address why we aren’t seeing the correct values of life expectancy in the graph. Since each country has two observations for life expectancy (one for 1952 and one for 2007), and we haven’t specified which observation to use, the life expectancy shown by the bars is actually the sum of life expectancy for both years.

Let’s see what happens when we restrict the graph to include only data for 2007.
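A sketch of restricting the layer to 2007, passing a filtered copy of data_graph to the data argument of geom_bar() (the object name `plot_2007` is an assumption):

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Same subset as before: 11 countries, two years, year as a factor
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))

plot_base_clean <- ggplot(data_graph, aes(x = country, y = lifeExp)) +
  theme(panel.grid = element_blank(), panel.background = element_blank(),
        axis.line = element_line(colour = "black"))

# The data argument of the layer overrides the default dataset of the plot
plot_2007 <- plot_base_clean +
  geom_bar(data = filter(data_graph, year == "2007"),
           stat = "identity", aes(fill = continent))
plot_2007
```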

We now see the correct values of life expectancy. Note that though the plot_base_clean object already had a default value of data (data_graph), we were able to override it in the call to geom_bar(). This again ties back to the hierarchy of defaults – if we don’t specify a new dataset or xy-variables for our geoms, we simply use the dataset and xy-variables provided in the call to ggplot(), but since we specified a new value of data within geom_bar(), the bars reflect a new data source.

Next, we are interested in showing two data points per country, one for 1952 and one for 2007. Here is where the alpha aesthetic is useful. It specifies the transparency of the colours we are using. Let’s try using alpha with the same subsetted dataset:
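A sketch of setting alpha to a constant on the 2007 subset; the value 0.5 is an arbitrary choice for illustration:

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Same subset as before: 11 countries, two years, year as a factor
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))

plot_base_clean <- ggplot(data_graph, aes(x = country, y = lifeExp)) +
  theme(panel.grid = element_blank(), panel.background = element_blank(),
        axis.line = element_line(colour = "black"))

# alpha outside aes(): every bar gets the same transparency
plot_alpha_constant <- plot_base_clean +
  geom_bar(data = filter(data_graph, year == "2007"),
           stat = "identity", aes(fill = continent), alpha = 0.5)
plot_alpha_constant
```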

We see that similar to specifying fill = "lightblue", specifying alpha to be a number changes the transparency levels of each bar. alpha values range from 0 to 1, with higher values being more opaque.

Like fill, alpha can also be used as an aesthetic. Let’s establish a relationship between the transparency of a bar and the year. Since we are interested in both years, we won’t restrict data_graph in geom_bar().
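A sketch of mapping alpha to year inside aes(), using the full data_graph. ggplot2 may warn that using alpha for a discrete variable is not advised; that warning is harmless here.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Same subset as before: 11 countries, two years, year as a factor
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))

plot_base_clean <- ggplot(data_graph, aes(x = country, y = lifeExp)) +
  theme(panel.grid = element_blank(), panel.background = element_blank(),
        axis.line = element_line(colour = "black"))

# alpha inside aes(): transparency now depends on the year
plot_alpha_year <- plot_base_clean +
  geom_bar(stat = "identity", aes(fill = continent, alpha = year))
plot_alpha_year
```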

We don’t want a stacked bar chart, but alpha does seem to be working – we see that the lighter portions of the bars correspond to the values in 1952, while the darker portions correspond to values in 2007.

Now, let’s use the position argument to make the bars appear side-by-side, instead of being stacked. According to the ggplot2 documentation, bars are stacked by default and we need to specify position = "dodge" to make the bars appear side-by-side.
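A sketch of the same layer with position = "dodge" (the object name `plot_dodged` is an assumption):

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Same subset as before: 11 countries, two years, year as a factor
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))

plot_base_clean <- ggplot(data_graph, aes(x = country, y = lifeExp)) +
  theme(panel.grid = element_blank(), panel.background = element_blank(),
        axis.line = element_line(colour = "black"))

# position = "dodge" places the two bars of each country next to each other
plot_dodged <- plot_base_clean +
  geom_bar(stat = "identity", position = "dodge",
           aes(fill = continent, alpha = year))
plot_dodged
```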

Note that position = "dodge" is another way of writing position = position_dodge(). position_dodge() can take a width argument, which is discussed in detail in this Stack Overflow post. We are using the default width, which is why we can use the shorter version position = "dodge".

The 1952 colours for alpha are very light. Let’s modify the transparency provided by alpha using scale_alpha_manual().

Here, we specified a vector for scale_alpha_manual(), where each element provides the transparency of the corresponding year. We assigned a transparency of 0.6 to 1952 and 1 to 2007. We know the first element corresponds to 1952 and the second element to 2007 because that is the order of levels for the “year” factor; you can check this using levels(data_graph$year).

Let’s also change the colour scheme for the continent colours using scale_fill_manual(). We provide a vector of colours, where each element provides the colour for the corresponding continent. I have provided the colours in hexadecimal format (e.g. as “#FF0011”), but you can provide colours in any other format you prefer.
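The two manual scales described above could be sketched like this. The hexadecimal colours are arbitrary choices for illustration, and the vector is named by continent here (an assumption, for readability) rather than relying on the order of the elements.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Same subset as before: 11 countries, two years, year as a factor
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))

plot_base_clean <- ggplot(data_graph, aes(x = country, y = lifeExp)) +
  theme(panel.grid = element_blank(), panel.background = element_blank(),
        axis.line = element_line(colour = "black"))

# Illustrative colours, one per continent present in data_graph
continent_colours <- c(Africa = "#1B9E77", Americas = "#D95F02", Asia = "#7570B3")

plot_scaled <- plot_base_clean +
  geom_bar(stat = "identity", position = "dodge",
           aes(fill = continent, alpha = year)) +
  scale_alpha_manual(values = c(0.6, 1)) +          # 1952 first, then 2007
  scale_fill_manual(values = continent_colours)
plot_scaled
```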

Let’s turn our plot into a horizontal bar chart using coord_flip():
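A sketch of adding coord_flip() to the dodged bar chart; the theme and manual fill colours are left out here only to keep the snippet short.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Same subset as before: 11 countries, two years, year as a factor
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))

plot_flipped <- ggplot(data_graph, aes(x = country, y = lifeExp)) +
  geom_bar(stat = "identity", position = "dodge",
           aes(fill = continent, alpha = year)) +
  scale_alpha_manual(values = c(0.6, 1)) +
  coord_flip()   # countries now run along the vertical axis
plot_flipped
```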

Note the order of the bars still reflects the levels of the factor i.e. countries coming first alphabetically are closer to the origin, and the bar for 1952 is below the bar for 2007. We are going to go ahead with this order, but if you’d like the countries or years to appear in a different order, all you have to do is modify the factor levels of the corresponding variables.

Our graph is already quite informative — we can identify the continent a country belongs to by the colour of the bar. If we want the country bars to appear by continent, we can change the levels of the “country” factor so that the country names are sorted by continent.

However, it would be much more effective if we could group the countries into continents on the x-axis. The reader of the graph wouldn’t need to keep referring to the legend; all the information would be in one place. We can create these groups using facets.

Facets are used to split the ggplot into a matrix of panels. Let’s add a facet for the “continent” variable to understand what “matrix of panels” means:
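A sketch of adding the facet, again with the theme and manual fill colours omitted for brevity:

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Same subset as before: 11 countries, two years, year as a factor
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))

plot_faceted <- ggplot(data_graph, aes(x = country, y = lifeExp)) +
  geom_bar(stat = "identity", position = "dodge",
           aes(fill = continent, alpha = year)) +
  scale_alpha_manual(values = c(0.6, 1)) +
  coord_flip() +
  facet_grid(rows = vars(continent))   # one row of the panel matrix per continent
plot_faceted
```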

We see that our graph is now in 3 horizontal panels, with each panel representing a different continent.

Let’s break the facet_grid() command down a little: we wanted horizontal panels, so we specified the rows argument. Each row/panel was on the basis of continent, so we specified rows = vars(continent). vars() just indicates that “continent” should be looked up in the dataset we are using in our ggplot() command. If we don’t wrap it in vars(), we will get an error saying that the object “continent” was not found.

Now, we will explore some arguments of facet_grid() that can improve the appearance of the graph. All of these are covered in detail in the ggplot2 documentation; in this post, we will use only a few options.

First, we see that the graph is assuming that every x-variable (“country”, in our case) exists for every faceting variable (“continent”) e.g. Haiti is in the Africa and Asia panel as well as the Americas panel. This is because ggplot2 assumes every panel will have the same scale, where “scale” refers to the values the x and y axis take on. Our scale of interest is country names, and currently each continent has exactly the same scale – all of the country names are included for each continent. To remedy this, we specify scales = "free_y" – we say that every faceting variable (“continent”) can have its own scale (where a “scale” would be only those country names that are part of the continent).

Now, notice that the bars for the Americas are thicker than the bars for Africa or Asia. This is because by default, ggplot makes all panels (i.e. all continents) occupy the same amount of space. We’d prefer that all our bars be equally thick, rather than our panels be equally tall. Let’s add space = "free_y".

It seems a little confusing to have the continent names to the right and the country names to left. We can use the switch option to change where the facet labels (i.e. continent names) are displayed.
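The three facet_grid() arguments discussed above (scales, space and switch) could be combined as in this sketch. The value switch = "y" is an assumption about which strips to move; the theme and manual fill colours are omitted for brevity.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Same subset as before: 11 countries, two years, year as a factor
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))

plot_faceted_free <- ggplot(data_graph, aes(x = country, y = lifeExp)) +
  geom_bar(stat = "identity", position = "dodge",
           aes(fill = continent, alpha = year)) +
  scale_alpha_manual(values = c(0.6, 1)) +
  coord_flip() +
  facet_grid(rows = vars(continent),
             scales = "free_y",   # each panel lists only its own countries
             space = "free_y",    # panel height follows the number of countries
             switch = "y")        # continent labels move from the right to the left
plot_faceted_free
```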

This looks quite good! Let’s do the following to modify the appearance of the facet labels i.e. the continent names:

  • Move the continent names to the left of country names
  • Remove the gray background and box from the continent labels
  • Make the continent names horizontal and not vertical
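One way to make the three adjustments listed above is through the strip.* elements of theme(). This is a sketch under assumptions: strip.text.y.left is available in recent versions of ggplot2 (3.3.0 and later), and older versions may need a different element or angle.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Same subset as before: 11 countries, two years, year as a factor
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))

plot_strips <- ggplot(data_graph, aes(x = country, y = lifeExp)) +
  geom_bar(stat = "identity", position = "dodge",
           aes(fill = continent, alpha = year)) +
  scale_alpha_manual(values = c(0.6, 1)) +
  coord_flip() +
  facet_grid(rows = vars(continent), scales = "free_y",
             space = "free_y", switch = "y") +
  theme(strip.placement = "outside",                   # continent names left of country names
        strip.background = element_blank(),            # no gray background or box
        strip.text.y.left = element_text(angle = 0))   # horizontal continent names
plot_strips
```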

Our graph is almost ready! Let’s clean up the legend and the axes, and give a title to our graph.

Legend

To reduce chartjunk, let’s suppress the legend for continent because we already have that information in the facets. We will use the guides() function to suppress the legend for the fill aesthetic (recall that we set aes(fill = continent) in geom_bar()).
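A sketch of suppressing only the fill legend. Recent versions of ggplot2 expect guides(fill = "none"); older code often used guides(fill = FALSE), which is now deprecated.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Same subset as before: 11 countries, two years, year as a factor
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))

plot_legend <- ggplot(data_graph, aes(x = country, y = lifeExp)) +
  geom_bar(stat = "identity", position = "dodge",
           aes(fill = continent, alpha = year)) +
  scale_alpha_manual(values = c(0.6, 1)) +
  coord_flip() +
  facet_grid(rows = vars(continent), scales = "free_y",
             space = "free_y", switch = "y") +
  guides(fill = "none")   # drop the continent legend; the year (alpha) legend stays
plot_legend
```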

DataNovia has an excellent guide for formatting ggplot legends, if you’d like to modify the legend further e.g. change its position, manually change legend colours, etc.

Graph labels

Finally, let’s use the labs function to change the labels for this graph. We want to:

  • Remove the x-axis label — we don’t need to say “country” since it is apparent
  • Change the y-axis label to “Life expectancy (years)”
  • Add a title above the graph explaining what the graph shows
  • Add the data source below the graph
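The four label changes listed above could be sketched with a single labs() call. The title and caption wording here are placeholders, not the wording of the original graph.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Same subset as before: 11 countries, two years, year as a factor
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))

plot_labelled <- ggplot(data_graph, aes(x = country, y = lifeExp)) +
  geom_bar(stat = "identity", position = "dodge",
           aes(fill = continent, alpha = year)) +
  scale_alpha_manual(values = c(0.6, 1)) +
  coord_flip() +
  facet_grid(rows = vars(continent), scales = "free_y",
             space = "free_y", switch = "y") +
  labs(x = NULL,                                        # no label for the country axis
       y = "Life expectancy (years)",
       title = "Life expectancy in 1952 and 2007",      # placeholder title
       caption = "Data source: gapminder R package")    # placeholder caption
plot_labelled
```

Because of coord_flip(), x still refers to the country variable even though it is drawn on the vertical axis.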

And that is our graph!

Final graph code

Here is all the graph code in one place:
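The sketch below collects every step from this post into one script. The colours, title, caption and the strip.text.y.left setting are illustrative assumptions, so treat this as a starting point to adapt rather than the exact code behind the original graph.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Data: 11 countries, two years, year as a factor
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))

# Illustrative colours, one per continent present in data_graph
continent_colours <- c(Africa = "#1B9E77", Americas = "#D95F02", Asia = "#7570B3")

plot_final <- ggplot(data_graph, aes(x = country, y = lifeExp)) +
  # Bars: one per country per year, side by side
  geom_bar(stat = "identity", position = "dodge",
           aes(fill = continent, alpha = year)) +
  # Scales for transparency (year) and colour (continent)
  scale_alpha_manual(values = c(0.6, 1)) +
  scale_fill_manual(values = continent_colours) +
  # Horizontal bars, grouped into one panel per continent
  coord_flip() +
  facet_grid(rows = vars(continent), scales = "free_y",
             space = "free_y", switch = "y") +
  # Legend and labels
  guides(fill = "none") +
  labs(x = NULL, y = "Life expectancy (years)",
       title = "Life expectancy in 1952 and 2007",
       caption = "Data source: gapminder R package") +
  # Clean look for the panel and the facet labels
  theme(panel.grid = element_blank(),
        panel.background = element_blank(),
        axis.line = element_line(colour = "black"),
        strip.placement = "outside",
        strip.background = element_blank(),
        strip.text.y.left = element_text(angle = 0))
plot_final
```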

You can save a copy of the graph using the ggsave() command, which allows you to specify the save location, dimensions of the file, image format (.png, .jpg etc.), and more.
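A sketch of a ggsave() call, using a simplified version of the plot so the snippet runs on its own. The file name, dimensions and resolution are arbitrary, and the file is written to a temporary directory here; replace `out_file` with your own path.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Same subset as before: 11 countries, two years, year as a factor
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))

plot_to_save <- ggplot(data_graph, aes(x = country, y = lifeExp)) +
  geom_bar(stat = "identity", position = "dodge",
           aes(fill = continent, alpha = year)) +
  scale_alpha_manual(values = c(0.6, 1)) +
  coord_flip()

# The image format is inferred from the file extension
out_file <- file.path(tempdir(), "life_expectancy.png")
ggsave(filename = out_file, plot = plot_to_save,
       width = 8, height = 5, units = "in", dpi = 300)
```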

Revisiting ggplot() graph components

Now that we understand how to build a ggplot, let’s map the elements of our graph to the components of a plot:

  • “A default dataset and set of mappings from variables to aesthetics” — we did this in ggplot(data = data_graph, aes(x = country, y = lifeExp)).
  • “One or more layers, with each layer having one geometric object, one statistical transformation, one position adjustment, and optionally, one dataset and set of aesthetic mappings”: we created a layer for bars using geom_bar(), with stat = "identity" and position = "dodge".
  • “One scale for each aesthetic mapping used” — the x and y axes had default scales based on the values of “country” and “lifeExp”. We also created scales for fill and alpha.
  • A coordinate system — Cartesian, in our case, as we specified aesthetics for x and y. We also flipped the axes.
  • The facet specification — we did this using facet_grid().

The graph components are succinctly expressed in this code template:

Next steps

You can make the following graphs to learn more about ggplot():

  • Change the font and font size for the chart title, facet labels, and axis labels (you’ll need to use the theme() function)
  • Modify the existing graph to show the value of life expectancy for each bar (you’ll need to add a geom_text())
  • Create some dummy data with confidence intervals for estimates of life expectancy, and show these confidence intervals on our existing graph (you’ll need to use geom_errorbar())
  • Create a line graph showing the value of life expectancy over several years for different countries (you’ll need to use geom_line() and take a new subset of the data)
  • You can have a look at the ggplot2 cheatsheet to get more ideas for what you can do!

We would love to know if this worked for you. Write to us with questions or share your graphs with us in the comments below.