This guide is a part of a series of short tutorials. Find all the tutorials here.
You have already learned how to make basic plots with base R in the short tutorial about plotting. However, when making more advanced plots, for example plotting model predictions with confidence intervals, base R may become a bit clunky to use. In this tutorial you will learn about the ggplot2
package, where you gain quite a lot of useful functionality, at the expense of having to learn some new syntax.
This guide is divided into 3 parts: First I give a general overview of the ggplot2 syntax, then I go through plotting continuous predictions with confidence intervals and finally plotting predictions for categorical predictors. The guide may be updated if there is need in the future.
To complete the tutorial you will need the following:
ggplot2. Make sure to install it with install.packages() before starting.
The syntax of ggplot2 is quite different from base R’s plot() (or anything in base R for that matter). The philosophy is to start with a data set and choose variables, and then add layers on top to make the visualization you want. To make a ggplot, you need (at the very least) 3 things:
The three things you need for a ggplot: data, aesthetics (the variables you want to plot) and geometry (geoms).
Here, we will build up a simple scatterplot using the three elements I listed above. We will start by loading the ggplot2
package (remember to install it first if you haven’t already done so!).
library(ggplot2)
We will use the student_means.csv
data for plotting, so we need to import that, and look at what it contains. NB! We will use the argument stringsAsFactors = TRUE
when we import, to ensure that all text columns are encoded as factors (this is convenient when the goal is to model).
student_means <- read.csv("student_means.csv", stringsAsFactors = TRUE)
summary(student_means)
#> X ID Sex Height
#> Min. : 1.0 g1p1 : 1 female:27 Min. :156.9
#> 1st Qu.:11.5 g1p2 : 1 male :16 1st Qu.:164.2
#> Median :22.0 g1p3 : 1 Median :171.0
#> Mean :22.0 g1p4 : 1 Mean :172.3
#> 3rd Qu.:32.5 g1p5 : 1 3rd Qu.:181.0
#> Max. :43.0 g2p1 : 1 Max. :189.0
#> (Other):37
#> Length_of_right_underarm Length_of_right_underarm_and_hand
#> Min. :21.07 Min. :38.60
#> 1st Qu.:24.13 1st Qu.:41.97
#> Median :25.37 Median :43.87
#> Mean :25.48 Mean :43.54
#> 3rd Qu.:26.70 3rd Qu.:45.33
#> Max. :30.07 Max. :50.17
#>
#> Length_of_right_foot Length_of_left_foot Neck_circumference Height_of_mother
#> Min. :20.90 Min. :20.97 Min. :29.00 Min. :156.0
#> 1st Qu.:23.05 1st Qu.:22.97 1st Qu.:31.25 1st Qu.:164.1
#> Median :24.20 Median :24.13 Median :33.63 Median :168.0
#> Mean :24.42 Mean :24.47 Mean :34.27 Mean :167.2
#> 3rd Qu.:25.65 3rd Qu.:25.98 3rd Qu.:37.33 3rd Qu.:170.0
#> Max. :27.93 Max. :28.03 Max. :42.30 Max. :176.0
#> NA's :9
#> Height_of_father Foot
#> Min. :168 Min. :20.93
#> 1st Qu.:177 1st Qu.:23.07
#> Median :180 Median :24.17
#> Mean :181 Mean :24.44
#> 3rd Qu.:184 3rd Qu.:25.82
#> Max. :193 Max. :27.98
#> NA's :10
We will now make a scatter plot of Foot against Height (like we have done in the tutorials in BIOS3000/4000).
First, we need to tell ggplot what data we are going to use. This is the first argument of the ggplot()
function.
ggplot(student_means)
As you can see, this outputs a completely empty plot. That kind of makes sense: we lack two things according to the list I made above. Let’s add some variables/aesthetics.
The variables are the second argument of the ggplot()
function. Additionally, all variables need to be wrapped inside the aes()
function! Remembering this will save you a lot of headaches in the future.
So, let’s add Foot
on the x-axis, and Height
on the y-axis:
ggplot(student_means, aes(x = Foot, y = Height))
Note that you don’t need to use the $
to access the Foot
and Height
columns, neat! This only works inside of aes()
(which is where your variables should go anyway).
Our plot is still empty, but at least we have axes corresponding to the range of our data. Next, we will add some points.
Geometry is added with the +
operator, using their own separate functions. The functions all start with geom_
, and the function for adding points is geom_point()
.
ggplot(student_means, aes(x = Foot, y = Height)) + geom_point()
Now we’ve made a nice little scatterplot of our variables. We could also add more geoms if we want to. For instance, geom_smooth()
adds a regression line (we use the argument method = "lm"
to specify linear regression):
ggplot(student_means, aes(x = Foot, y = Height)) +
geom_point() +
geom_smooth(method = "lm")
Nice and simple, and shows the trend in our data with added uncertainty.
In the previous plot, we only mapped our variables to the aesthetics x
and y
. The real power of ggplot, however, is the concept that all aesthetics can be mapped to variables. This means that if we want to color our points by sex, all we need to do is to map the Sex
column to the col
aesthetic.
ggplot(student_means, aes(x = Foot, y = Height, col = Sex)) +
geom_point() +
geom_smooth(method = "lm")
Notice how both the points and regression now follow Sex
using different colors for the different sexes (geom_smooth()
is plotting an interaction model here by the way, more on this later). Also, we automagically get a legend, telling us what the colors mean!
If you want to map a variable to an aesthetic for only one geom, you can supply the mapping directly to the geom function instead:
ggplot(student_means, aes(x = Foot, y = Height)) +
geom_point(aes(col = Sex)) + #color now only affects the points
geom_smooth(method = "lm")
More geoms and aesthetics, and ways to use these, will be introduced in the parts about plotting model predictions.
You can save your ggplot to an object. This is useful if you want to use the same foundation for multiple plots.
fh_plot <- ggplot(student_means, aes(x = Foot, y = Height)) +
  geom_point(aes(col = Sex)) + # color now only affects the points
  geom_smooth(method = "lm")
fh_plot
You can add labels and title by adding the labs()
function (using +
, like before).
fh_plot +
  labs(title = "Relationship between footlength and height",
       subtitle = "Data from measurements by BIOS3000/4000 students in 2020",
       x = "Foot length (cm)",
       y = "Height (cm)")
ggplot2
also has some pre-packaged themes that you can add if you don’t like the default grey, for example theme_bw()
.
fh_plot +
  theme_bw()
You can set static aesthetics (i.e. not mapped to variables) by putting the arguments outside of aes()
. For these aesthetics, you can use the same colors, line types and point shapes as in base R.
ggplot(student_means, aes(x = Foot, y = Height)) +
geom_point(col = "steelblue",
pch = 14) + #static color and shape for all points
geom_smooth(method = "lm",
col = "firebrick",
lty = 2,
fill = "yellow") #static color, linetype and fill
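A common slip is to put a static value inside aes(). The sketch below contrasts the two cases. It uses the built-in mtcars data rather than student_means.csv (an assumption, so that the example runs without the course files). Inside aes(), "steelblue" is treated as a data value, so ggplot2 picks a color from its default palette and adds a legend. Outside aes(), it is used as the literal color.

```r
library(ggplot2)

# mtcars ships with R; it stands in for student_means here (assumption)

# Static color: the argument goes outside aes(), points are drawn in steelblue
static_col <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(col = "steelblue")

# Mapped constant: "steelblue" is treated as a one-level variable,
# so the points get a default palette color and a legend appears
mapped_col <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(col = "steelblue"))

static_col
mapped_col
```

If a plot shows an unexpected color and a legend with a single entry, a static value inside aes() is a likely cause.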
We will now turn to some examples of plotting model predictions with ggplot2. We will base it on the following linear models (from week 3 in BIOS3000/4000):
fit_Sex_Foot <- lm(Height ~ Sex + Foot, data = student_means)
fit_Sex_Foot_interaction <- lm(Height ~ Sex*Foot, data = student_means)
The interaction model is actually plotted already in Figure 1.5, with a confidence interval and everything. Plotting simple models like this with interaction is as easy as mapping both your points and your regression lines by color.
For basically all other kinds of models, like the additive model fit_Sex_Foot
, we will have to do manual predictions and plot those. The following code for doing this is copy-pasted from the tutorials in BIOS3000/4000 (courtesy of Torbjørn Ergon):
model_fit = fit_Sex_Foot
x_range_F = range(student_means$Foot[student_means$Sex=="female"])
x_range_M = range(student_means$Foot[student_means$Sex=="male"])
pred_data_F = data.frame(Sex = "female", Foot = seq(x_range_F[1], x_range_F[2], length.out = 50))
pred_data_M = data.frame(Sex = "male", Foot = seq(x_range_M[1], x_range_M[2], length.out = 50))
pred_data = rbind(pred_data_F, pred_data_M)
pred = predict(model_fit, pred_data, interval = "confidence")
pred = cbind(pred_data, pred)
We now have the object pred
, which is a data frame, and thus readily plotable (if that’s a word) with ggplot2
. We will use a new geom, geom_line()
to add lines for our predictions.
additive_plot <- ggplot(pred, aes(x = Foot, y = fit, col = Sex)) +
  geom_line()
additive_plot
To add confidence intervals, we add a geom_ribbon
and map lwr
and upr
to the aesthetics ymin
and ymax
, respectively. alpha
and lty
are for making the plot more visually pleasing (try without and see for yourself!). They are not mapped to variables, and thus go outside of aes()
.
additive_plot_conf <- additive_plot +
  geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.2, lty = 2)
additive_plot_conf
Finally, we can add the original points from our data set. Plotting this becomes a bit awkward, however, as the original data is in one data frame, and the prediction in another. This is solved by supplying a data
argument whenever we’re using a different data set than is provided in the ggplot()
function. In this case we swap the prediction data pred
with the original data student_means
.
additive_plot_conf +
  geom_point(data = student_means, aes(x = Foot, y = Height))
Feel free to add some customization to this plot to make it prettier!
For this part, we will use the wheatlings_bio2150_F18.csv data. The data set contains growth data for different wheat varieties using different fertilizers at different concentrations.
First, we import and look at the data, like we did for the student data.
wheat <- read.csv("wheatlings_bio2150_F18.csv", stringsAsFactors = TRUE)
summary(wheat)
#> student_gr variety fertype conc length
#> gr_1 : 18 Bjarne :36 control:72 C0 :72 Min. :10.90
#> gr_10 : 18 Diamant :36 flow :48 C1 :72 1st Qu.:16.60
#> gr_11 : 18 Magifik :36 kris :48 C10:72 Median :19.20
#> gr_12 : 18 Oberkulmer:36 plus :48 Mean :20.29
#> gr_2 : 18 Olivin :36 3rd Qu.:24.90
#> gr_3 : 18 Zebra :36 Max. :32.10
#> (Other):108 NA's :3
#> wetmass startlength startwetmass
#> Min. :0.08195 Min. : 9.60 Min. :0.06500
#> 1st Qu.:0.16470 1st Qu.:12.55 1st Qu.:0.08100
#> Median :0.19160 Median :13.90 Median :0.09300
#> Mean :0.19717 Mean :14.35 Mean :0.09413
#> 3rd Qu.:0.21887 3rd Qu.:15.90 3rd Qu.:0.10450
#> Max. :0.40438 Max. :21.40 Max. :0.13300
#> NA's :6
A boxplot is often a good representation of one continuous and one categorical variable, like for example if we want to plot length
against conc
. As before, we choose our data, aesthetics and add a geom, in this case geom_boxplot()
.
conc_box <- ggplot(wheat, aes(x = conc, y = length)) +
  geom_boxplot()
conc_box
However, if you’ve seen the data before, you know that the wheat variety drastically affects length, so we might want to include that information in our plot. One way to do this is to split our plot into different panels based on a variable.¹ In ggplot, this is called faceting. You can create facets by adding facet_wrap()
to your plot like this:
conc_box +
  facet_wrap(~variety)
Note that here we have an exception to our “all variables go inside aes()
”-rule, which simply is something you have to remember. The same goes for the tilde ~
, which may be easier to understand if you read it like “function of”, and thus the entire thing as “facet as a function of variety”.
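If the tilde feels awkward, ggplot2 also accepts the faceting variable wrapped in vars(). The sketch below shows both forms side by side on the built-in mtcars data (an assumption, so that it runs without the wheat file; cars_box and cars_facet are made-up names). Both calls should give one panel per value of cyl.

```r
library(ggplot2)

# mtcars ships with R; it stands in for the wheat data here (assumption)
cars_box <- ggplot(mtcars, aes(x = factor(gear), y = mpg)) +
  geom_boxplot()

# Formula notation, as used in this tutorial
cars_box + facet_wrap(~cyl)

# vars() notation, an alternative way of naming the faceting variable
cars_facet <- cars_box + facet_wrap(vars(cyl))
cars_facet
```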
Bonus: If you want, you could add points on top of your boxplot, to show more of your data. geom_jitter()
randomly moves points to prevent plotting them on top of each other.
conc_box +
  facet_wrap(~variety) +
  geom_jitter(col = "firebrick", alpha = 0.5)
Boxplots are nice for showing data, but for predictions you only have a single data point per variable combination. For plotting predictions with error, we can use the geoms geom_point()
and geom_errorbar()
.
First, we need to make the model and prediction data, using an additive model with conc
and variety
as predictors (code copy-pasted from the tutorial in BIOS3000/4000, courtesy of Torbjørn Ergon):
fit_ad <- lm(length ~ conc + variety, data = wheat)
Newdata = expand.grid(variety = unique(wheat$variety), conc = unique(wheat$conc))
Pred = cbind(Newdata, predict(fit_ad, newdata = Newdata, interval = "confidence"))
Now we can first plot the predictions (with facets like before):
wheat_pred <- ggplot(Pred, aes(x = conc, y = fit)) +
  geom_point(col = "firebrick") + # points for predictions
  facet_wrap(~variety) # facets by variety
wheat_pred
Then add confidence intervals with geom_errorbar()
. Like geom_ribbon()
from before, it takes the ymin
and ymax
aesthetics, and we map lwr
and upr
to these.
wheat_pred +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.4, col = "steelblue")
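If you do not have the wheat file at hand, the whole workflow can be tried on a data set that ships with R. The following is a rough sketch on warpbreaks (an assumption; the model and the names fit_wb, new_wb, pred_wb and wb_plot are made up for illustration and are not part of the course material). It follows the same steps: fit an additive model, predict for every combination of the predictors, then plot points with error bars in facets.

```r
library(ggplot2)

# warpbreaks ships with R; it stands in for the wheat data here (assumption)
fit_wb <- lm(breaks ~ tension + wool, data = warpbreaks)

# One row per combination of the two categorical predictors
new_wb <- expand.grid(wool = unique(warpbreaks$wool),
                      tension = unique(warpbreaks$tension))
pred_wb <- cbind(new_wb, predict(fit_wb, newdata = new_wb, interval = "confidence"))

# Predictions as points, confidence intervals as error bars, one panel per wool type
wb_plot <- ggplot(pred_wb, aes(x = tension, y = fit)) +
  geom_point(col = "firebrick") +
  geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.4, col = "steelblue") +
  facet_wrap(~wool)
wb_plot
```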
Hope this little tutorial was useful! It will be updated if the need arises. In the meantime, you should now know quite a bit of ggplot and be able to figure things out yourself. If you can formulate your problem in general terms, Google in general, and Stack Overflow results in particular, will be very useful.
Good luck with your plotting!
¹ Another way would be to use the fill aesthetic and map that to variety:
ggplot(wheat, aes(x = conc, y = length, fill = variety)) +
  geom_boxplot()
But this is perhaps not as clean.