This guide is a part of a series of short tutorials. Find all the tutorials here.

1 Introduction

R is a powerful tool for visualising your data. You can make almost any kind of plot, revealing connections that are hard to see from summary statistics. This tutorial is not meant to be a comprehensive guide to plotting in R, but rather a starting point, introducing the most important methods.

Section 3 is optional, and introduces ggplot2, a different system for plotting in R.

1.1 Why visualise your data?

You may think that communication is the most important reason for visualising your data. This is partly true, as a good graph can convey a lot of information more easily than text or tables. However, you will find that you make a lot of graphs that you never show to others. These are graphs that you make for yourself to investigate connections in the data. Visualisations are not only important for communication, but also for the process of data exploration.

To learn about the power of data visualisation, as well as the dos and don’ts of making graphs, I recommend the book Data Visualization by Kieran Healy.

1.2 Prerequisites

For this exercise we will only use base R and its built-in iris data set. For the optional section 3 you will need the ggplot2 package, which you can install by running install.packages("ggplot2").

2 Plotting in R

The basic function for plotting in R is simply plot(). This function will guess what kind of plot to make based on the data you provide. You can supply many arguments to the plot() function to get the visualisation you want, which we will gradually go through here.

2.1 Plotting vectors

The simplest way of plotting in R is by plotting two vectors of equal length. One vector gives the x-values, and the other gives the y-values.

# make two vectors of equal length
x <- 1:50
y <- 51:100

plot(x, y)

As you can see, it makes a simple plot of our data, using points as the default. If we want to make a line graph we have to specify type = "l":

plot(x, y, type = "l")

For all the different values of the type argument, see the plot function’s help page by running ?plot.
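To get a feeling for the type argument, here is a small sketch that draws the same vectors with four of the values listed in ?plot. The 2 x 2 layout and the variable names are just my choices for illustration.

```r
# the same vectors as above
x <- 1:50
y <- 51:100

# four of the values that type accepts: points, lines, both, and steps
types <- c("p", "l", "b", "s")

# draw one panel per type in a 2 x 2 grid, remembering the old settings
old_par <- par(mfrow = c(2, 2))
for (tp in types) {
  plot(x, y, type = tp, main = paste0('type = "', tp, '"'))
}

# restore the previous graphical parameters
par(old_par)
```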

Tip: Remember that when you use R’s $ operator to pick out a single column from a data frame, you create a vector. Using this, you can easily plot the columns of your data frame. Another way of doing this is introduced in section 2.3.
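As a minimal sketch of the tip above, using the built-in iris data set (the names petal and sepal are only for illustration):

```r
# picking out columns with $ gives ordinary numeric vectors
petal <- iris$Petal.Length
sepal <- iris$Sepal.Length

# the vectors have equal length, so they can be plotted against each other
length(petal) == length(sepal)

plot(petal, sepal)
```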

2.2 Plotting functions

You can also plot simple functions easily with the plot() function. You don’t have to give an x-value when plotting this way, and you can control what x-values to show with the xlim argument.

# define a function
myexp <- function(x){
  exp(x)/x
} 

# fit two plots in one window using par()
par(mfrow = c(1, 2))

# plot with default xlim
plot(myexp)

# plot with x in a range between -5 and 5
plot(myexp, xlim = c(-5, 5))

Figure 2.1: Plotting the function “myexp”. Right: controlling the x-value to range between -5 and 5.


# reset parameters to one plot per window
par(mfrow = c(1, 1))

Notice how the plot-function again guesses that we want lines for our graph. Remember that you can change this with type.
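For example, the following sketch draws the same function with points instead of a line. As far as I can tell, plot() hands functions over to curve() behind the scenes, so the n argument (how many x-values to evaluate) is passed along too; the values 1 to 5 and n = 20 are arbitrary choices.

```r
# the same function as above
myexp <- function(x) {
  exp(x) / x
}

# evaluate the function at 20 x-values between 1 and 5, and draw points
plot(myexp, xlim = c(1, 5), type = "p", n = 20)
```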

2.3 Plotting data frames

What you will probably plot most often are data frames, containing your own or someone else’s data. For this part we will use the iris data set, which is a built-in data set in R (which means you don’t need to load it in any way). We start by inspecting the object:

summary(iris)
#>   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
#>  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
#>  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
#>  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
#>  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
#>  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
#>  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
#>        Species  
#>  setosa    :50  
#>  versicolor:50  
#>  virginica :50  
#>                 
#>                 
#>

As you can see, it contains four continuous measurements for 50 individuals each from three different species of Iris. We could simply write plot(iris), which plots all the values against each other:

plot(iris)

Figure 2.2: A plot of the entire iris data set.

Even for this relatively small data set this visualisation can be quite overwhelming. While this kind of plotting can be useful to get an overview, we often want to focus on specific relationships between our data. To plot two values against each other you use the following syntax: plot(response_variable ~ explanatory_variable, data = my_data). If we want to investigate the relationship between petal length and sepal length in the iris data, we would do the following:

plot(Sepal.Length ~ Petal.Length, data = iris)

Figure 2.3: Sepal length vs. petal length

We get a nice scatterplot of those data. There is some pattern here, which we will explore later, in section 2.5.2. If we want to plot numerical data (e.g. petal width) as a function of categorical data (here species), the approach is the same:

plot(Petal.Width ~ Species, data = iris)

Figure 2.4: R automatically assumed you wanted a boxplot!

The plotting function automatically makes a boxplot! When presented with one categorical and one numeric variable, R assumes that a boxplot is the best way to present that data. Be careful with this behavior, as R isn’t always right about this. You should always think about how you want to present your data, rather than going with what R suggests¹.

Often, you don’t want R to guess for you, but rather specify yourself what you want to plot, leading us into the next section.

2.4 Special plot commands (histograms, boxplots, barcharts etc.)

While plot() is good for a lot of things, you’ll want other things than scatterplots and boxplots. Then you have to use different functions for making your plot:

barplot() for barplots
hist() for histograms
boxplot() for boxplots

# examples
par(mfrow = c(3, 1))

barplot(table(iris$Species), main = "Barplot: no. of iris individuals per species")
hist(iris$Sepal.Length)
boxplot(Petal.Width ~ Species, data = iris, main = "Boxplot of petal width by species")

Figure 2.5: Examples of special plot commands


par(mfrow = c(1,1))
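Note that barplot() wants the bar heights, not the raw data, which is why the example above wraps the species column in table(). A small sketch of that intermediate step (the name species_counts and the axis label are just for illustration):

```r
# table() counts how many rows there are of each species
species_counts <- table(iris$Species)
species_counts

# the counts are what barplot() uses as bar heights
barplot(species_counts, ylab = "Number of individuals")
```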

Additionally, some functions add elements to an existing plot instead of creating a new one:

lines() adds lines on top
points() adds points on top


plot(x, y, type = "l")
lines(x, y - 5, col = "red")
points(c(0, 10, 15), c(60, 80, 70), col = "blue")

Figure 2.6: Plot with added line (red) and added points (blue)
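One thing to be aware of is that lines() and points() never change the axes of the existing plot, so anything outside the plotted range is silently left out. The sketch below shows one way of handling this, by setting xlim and ylim from the full data before adding a second group. The data frame names are just for illustration.

```r
# split out two of the species
setosa <- iris[iris$Species == "setosa", ]
virginica <- iris[iris$Species == "virginica", ]

# make room for all the data in the first call to plot()
plot(Sepal.Length ~ Petal.Length, data = setosa,
     xlim = range(iris$Petal.Length),
     ylim = range(iris$Sepal.Length))

# add the second species on top of the existing plot
points(virginica$Petal.Length, virginica$Sepal.Length, col = "blue")
```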

2.5 Customizing your plots

The plots we have made so far mostly show the data we want, but they are certainly not publication-ready. In this section I will go through some parameters you can adjust to make the graphs prettier and more informative. This is not meant to be a comprehensive guide, as there are so many options, but rather an introduction to what you can do with your graphs.

2.5.1 Changing point shape and line type

You can change the shape of points with the argument pch. You can see the options you have in figure 2.7 (notice that 21-25 support fill colors, while the others don’t).

Figure 2.7: pch options
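If you want to draw an overview like this yourself, something along these lines should work. This is only a rough sketch, not the code behind the figure; the fill color (bg) only shows up for shapes 21-25.

```r
# the point shapes 1 to 25
pch_values <- 1:25

# draw each shape once, side by side, without the default axes
plot(pch_values, rep(1, 25),
     pch = pch_values,
     bg = "orange",      # fill color, only used by pch 21-25
     cex = 2,            # make the symbols a bit bigger
     xaxt = "n", yaxt = "n",
     xlab = "pch value", ylab = "")

# label every shape with its pch number
axis(1, at = pch_values, cex.axis = 0.7)
```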

Returning to one of our earlier plots, we can try this out:

plot(Sepal.Length ~ Petal.Length, data = iris, pch = 15)

Another adjustment you can make is to change line type with lty:

plot(x, y, type = "l", lty = 2)

I recommend trying out the different options yourself, or looking them up online.
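As a starting point, this sketch draws the six standard line types (lty = 1 to 6) in one plot. The shifted copies of y and the ylim value are only there to keep the lines apart.

```r
# the same vectors as before
x <- 1:50
y <- 51:100

# set up an empty plot (type = "n") with room for six shifted lines
plot(x, y, type = "n", ylim = c(20, 100))

# add one line per line type, each shifted a bit further down
for (i in 1:6) {
  lines(x, y - 5 * i, lty = i)
}

legend("topleft", legend = paste("lty =", 1:6), lty = 1:6)
```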

2.5.2 Adding colors and legends

Still, our plot is really black and white, and could use some color. The color argument is simply col.

plot(Sepal.Length ~ Petal.Length, data = iris, pch = 15, col = "red")

There are a ton of available colors, and I recommend googling “R colors” to see what options you have. You can also provide hexadecimal (e.g. col = "#eda611") or rgb colors (e.g. col = rgb(132, 77, 73, maxColorValue = 255)), if you want to be really precise. Note that rgb() expects values between 0 and 1 unless you set maxColorValue.
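You can also explore the colors from within R. The short sketch below lists the built-in color names with colors() and builds a color from 0-255 values with rgb(); by default rgb() expects values between 0 and 1, so maxColorValue = 255 is needed here. The name my_col is just for illustration.

```r
# the first few of the built-in color names
head(colors())

# build a color from red, green and blue values on a 0-255 scale
my_col <- rgb(132, 77, 73, maxColorValue = 255)
my_col  # "#844D49"

# the result is a hexadecimal string that can be passed to col
plot(Sepal.Length ~ Petal.Length, data = iris, pch = 15, col = my_col)
```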

The red color makes our plot prettier, but not more informative. A good approach here would be to color the points by species. You can do this by providing a vector instead of a color name. The syntax here is rather weird, and just something you have to remember².

Edit August 27th 2020: With R version 4.0 onwards, strings aren’t automatically converted to factors when importing data. In practice this means that you need to turn iris$Species into a factor variable before plotting (or use stringsAsFactors = TRUE on import). The built-in iris data set already stores Species as a factor, so for it the conversion is only a safeguard, but data you import yourself will need it. This is reflected in the updated code below:

iris$Species <- factor(iris$Species)

plot(Sepal.Length ~ Petal.Length, data = iris, pch = 15, col = c("red", "blue", "black")[iris$Species])

Figure 2.8: A plot with color-coded species, but which is which?
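If the indexing looks like magic, it may help to take it apart. When a factor is used inside square brackets, R uses its underlying integer codes (1 for the first level, 2 for the second, and so on), so each row picks out the color at the position of its species. A small sketch, with variable names chosen just for illustration:

```r
species_colors <- c("red", "blue", "black")

# the levels decide which color each species gets, in this order
levels(iris$Species)

# rows 1, 51 and 101 are one individual from each species
as.integer(iris$Species)[c(1, 51, 101)]  # 1 2 3

# indexing with the factor gives one color per row of the data
point_colors <- species_colors[iris$Species]
point_colors[c(1, 51, 101)]  # "red" "blue" "black"
length(point_colors)         # 150
```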

Now you have colors, but no way of telling which is which. To see this you have to add a legend. Legends in base R plotting are manual, and have to be added to the plot after creating it, with the legend() function.

plot(Sepal.Length ~ Petal.Length, data = iris, pch = 15, col = c("red", "blue", "black")[iris$Species])
legend("bottomright", legend = levels(iris$Species), col = c("red", "blue", "black"), pch = 15)

Figure 2.9: With a legend you can see which color is which species!

As a side note, you can apply this same procedure for pch and lty as well if you want to.
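For instance, here is a sketch of the same idea with point shapes instead of colors. The particular pch values (15, 16 and 17) are an arbitrary choice.

```r
# one point shape per species, in the order of levels(iris$Species)
species_pch <- c(15, 16, 17)

plot(Sepal.Length ~ Petal.Length, data = iris,
     pch = species_pch[iris$Species])
legend("bottomright", legend = levels(iris$Species), pch = species_pch)
```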

2.5.2.1 A note about color choice

When you choose colors for your plot, remember that not all people can distinguish all colors! Red-green colorblindness is quite common, so you should never distinguish your points using these two colors. See e.g. this page for some tips on being color blind friendly.
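One possible way of getting a color blind friendly palette without leaving base R is the palette.colors() function, which I believe requires R version 4.0 or newer. The sketch below uses its "Okabe-Ito" palette; treat it as one option among many rather than a recommendation from this tutorial.

```r
# three colors from the Okabe-Ito palette (assumes R >= 4.0)
cb_colors <- palette.colors(3, palette = "Okabe-Ito")

plot(Sepal.Length ~ Petal.Length, data = iris, pch = 15,
     col = cb_colors[iris$Species])
legend("bottomright", legend = levels(iris$Species), col = cb_colors, pch = 15)
```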

2.5.3 Modifying axis labels, title, etc.

Our plot is nearly finished, but we should modify the labels and add a title to make it more informative! This is done with the main, xlab and ylab arguments of the plot() function.

plot(Sepal.Length ~ Petal.Length, 
     data = iris, 
     pch = 15, 
     col = c("red", "blue", "black")[iris$Species],
     main = "Relationship between Iris' petal and sepal length",
     xlab = "Petal length",
     ylab = "Sepal length")
legend("bottomright", legend = levels(iris$Species), col = c("red", "blue", "black"), pch = 15)

Figure 2.10: Our finished plot!

Tip: Notice how I have placed the arguments on separate lines for better readability in the example above. As long as you have an unclosed parenthesis R will not care about the line breaks, so I recommend this approach whenever you use many arguments of a function.

The arguments xlim and ylim modify our view of the x- and y-axis respectively. If we wanted to focus on just I. versicolor and I. virginica we could use these arguments³:

plot(Sepal.Length ~ Petal.Length, 
     data = iris, 
     pch = 15, 
     col = c("red", "blue", "black")[iris$Species],
     main = "Relationship between Iris' petal and sepal length",
     xlab = "Petal length",
     ylab = "Sepal length",
     xlim = c(2.5, 7))
legend("bottomright", legend = levels(iris$Species), col = c("red", "blue", "black"), pch = 15)

Figure 2.11: Our plot, zoomed in on versicolor and virginica
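An alternative to zooming with xlim is to remove I. setosa from the data before plotting. The sketch below shows one way this could be done with subset() and droplevels(); the name iris_sub is just for illustration. Dropping the unused factor level matters here, because both the color indexing and the legend depend on the levels.

```r
# keep only versicolor and virginica, and drop the now empty setosa level
iris_sub <- droplevels(subset(iris, Species != "setosa"))
levels(iris_sub$Species)

# two species left, so only two colors are needed
sub_colors <- c("blue", "black")

plot(Sepal.Length ~ Petal.Length,
     data = iris_sub,
     pch = 15,
     col = sub_colors[iris_sub$Species],
     xlab = "Petal length",
     ylab = "Sepal length")
legend("bottomright", legend = levels(iris_sub$Species), col = sub_colors, pch = 15)
```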

3 ggplot2

ggplot2 is a different system altogether for plotting in R. Some of the benefits are:

Looks pretty straight away
It never guesses for you what visualisation you want to make, you have to tell it what to do
Automatic generation of legend, making it harder to mess up
A lot of default colorblind friendly palettes
Easy to map variables to e.g. color, point type etc.

Some drawbacks are:

Pretty visuals may blind you to problems with your visualisation
It never guesses for you what visualisation you want to make, you have to tell it what to do
Your data always has to be contained in a data frame (although it mostly will be anyway)
It’s troublesome to use in functions
Modifying graphical parameters can sometimes be a hassle, so be prepared to spend a lot of time on Stack Overflow!

ggplot2 will not be taught in this course, but I recommend chapter 3 of R for Data Science if you want to know more. That’s how I learned it!

To see the potential of ggplot, here is how easy it is to make the same graph as we made in base plot:

library(ggplot2)
ggplot(iris, aes(Petal.Length, Sepal.Length, col = Species)) +
  geom_point(pch = 15) +
  labs(title = "Relationship between Iris' petal and sepal length",
       x = "Petal length",
       y = "Sepal length")

¹ I wasn’t able to find a way to make plot() show just the points without the boxplot. Base R’s stripchart() function can draw the individual points per group, and ggplot2 is another good option.
² I’d argue that this kind of plotting is a lot easier in ggplot2 (see section 3).
³ A cleaner approach to this would probably be to exclude I. setosa from our data before plotting.

Plotting in R

Even Garvang

Spring 2019