This guide is part of a series of short tutorials.

Introduction

A data.frame is arguably the most important kind of object you work with in R as a biologist. This very short tutorial covers the basics of handling data.frames: extracting data, changing data and adding columns, and subsetting.

We will use this example data.frame throughout:

my_df <- data.frame(
  numbers_1 = 1:5,
  numbers_2 = 6:10,
  letters_1 = letters[1:5],
  letters_2 = letters[6:10],
  stringsAsFactors = FALSE
)
my_df
##   numbers_1 numbers_2 letters_1 letters_2
## 1         1         6         a         f
## 2         2         7         b         g
## 3         3         8         c         h
## 4         4         9         d         i
## 5         5        10         e         j

1 Accessing data within your data.frame

You can extract data in two main ways: with the $ operator or with square brackets [].

The $ operator extracts a column from the data frame into a vector:

my_df$numbers_1
## [1] 1 2 3 4 5
my_df$letters_1
## [1] "a" "b" "c" "d" "e"
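As a small illustration of why this is handy, the sketch below shows that the extracted column is an ordinary vector, so any function that works on vectors can be applied to it. The name demo_df is only chosen for this sketch so that it does not overwrite my_df.

```r
# A fresh copy of part of the example data (demo_df is a name made up for this sketch)
demo_df <- data.frame(
  numbers_1 = 1:5,
  letters_1 = letters[1:5],
  stringsAsFactors = FALSE
)

# $ returns a plain vector, not a data.frame
col <- demo_df$numbers_1
is.data.frame(col)
## [1] FALSE
class(col)
## [1] "integer"

# so the usual vector functions work directly on it
mean(col)
## [1] 3
sum(col > 2)
## [1] 3
```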

Square brackets can select rows, columns, or single data entries. The syntax is data[row, column]. If you leave the row field or the column field empty, all rows or all columns are returned, respectively.

# extract data from row 2, column 4:
my_df[2,4]
## [1] "g"
# extract row 2, columns 1 and 3:
my_df[2, c(1,3)]
##   numbers_1 letters_1
## 2         2         b
# extract row 2, all columns:
my_df[2,]
##   numbers_1 numbers_2 letters_1 letters_2
## 2         2         7         b         g
# extract column 4, all rows:
my_df[,4]
## [1] "f" "g" "h" "i" "j"

Note that the last one is the same as writing my_df$letters_2.
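The equivalence can be checked with identical(). The sketch below also shows the drop = FALSE argument of the bracket operator, which keeps a single selected column as a one-column data.frame instead of simplifying it to a vector. As before, demo_df is just a name chosen for this sketch.

```r
# A fresh copy of the example data (demo_df is a name made up for this sketch)
demo_df <- data.frame(
  numbers_1 = 1:5,
  numbers_2 = 6:10,
  letters_1 = letters[1:5],
  letters_2 = letters[6:10],
  stringsAsFactors = FALSE
)

# selecting column 4 by position gives the same vector as using $
identical(demo_df[, 4], demo_df$letters_2)
## [1] TRUE

# drop = FALSE keeps the result as a data.frame with one column
one_col <- demo_df[, 4, drop = FALSE]
class(one_col)
## [1] "data.frame"
dim(one_col)
## [1] 5 1
```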

You can also supply column names in the brackets to extract one or more columns (remember the quotation marks). Note that this returns a data.frame, even when you select only one column:

my_df["letters_1"]
##   letters_1
## 1         a
## 2         b
## 3         c
## 4         d
## 5         e
my_df[c("numbers_2", "letters_2")]
##   numbers_2 letters_2
## 1         6         f
## 2         7         g
## 3         8         h
## 4         9         i
## 5        10         j
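If you want a column by name as a vector rather than as a data.frame, double square brackets are one option. The following sketch compares the two forms and also combines a row number with a column name; demo_df is a name chosen only for this example.

```r
# A fresh copy of the example data (demo_df is a name made up for this sketch)
demo_df <- data.frame(
  numbers_1 = 1:5,
  numbers_2 = 6:10,
  letters_1 = letters[1:5],
  letters_2 = letters[6:10],
  stringsAsFactors = FALSE
)

# single brackets with a name return a data.frame
class(demo_df["letters_1"])
## [1] "data.frame"

# double brackets return the column itself, just like $
class(demo_df[["letters_1"]])
## [1] "character"
identical(demo_df[["letters_1"]], demo_df$letters_1)
## [1] TRUE

# names also work in the [row, column] form
demo_df[2, "letters_2"]
## [1] "g"
```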

2 Changing data and adding columns

2.1 Changing data

Luckily, the process for changing a data point is very similar to extracting one. All you need to do is to put the indexing on the left side of <- (or =).

# change a single value
my_df[2,4] <- "changed"
my_df
##   numbers_1 numbers_2 letters_1 letters_2
## 1         1         6         a         f
## 2         2         7         b   changed
## 3         3         8         c         h
## 4         4         9         d         i
## 5         5        10         e         j
# change a column
my_df$numbers_2 <- c(1,1,2,3,5)
my_df
##   numbers_1 numbers_2 letters_1 letters_2
## 1         1         1         a         f
## 2         2         1         b   changed
## 3         3         2         c         h
## 4         4         3         d         i
## 5         5         5         e         j
# change a row
my_df[1,] <- c("this", "row", "is", "changed")
my_df
##   numbers_1 numbers_2 letters_1 letters_2
## 1      this       row        is   changed
## 2         2         1         b   changed
## 3         3         2         c         h
## 4         4         3         d         i
## 5         5         5         e         j

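Notice how this changes the original data, so use this with caution! Assigning a character vector to a whole row also silently converts the numeric columns to character, so they can no longer be used for calculations.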

# make a new df since we broke the last one ...
my_df <- data.frame(
  numbers_1 = 1:5,
  numbers_2 = 6:10,
  letters_1 = letters[1:5],
  letters_2 = letters[6:10],
  stringsAsFactors = FALSE
)
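One way to avoid breaking your data while experimenting is to make the change on a copy. The sketch below repeats the row assignment from section 2.1 on a copy and checks the column types before and after; the names demo_df and broken are made up for this example.

```r
# A fresh copy of the example data (demo_df is a name made up for this sketch)
demo_df <- data.frame(
  numbers_1 = 1:5,
  numbers_2 = 6:10,
  letters_1 = letters[1:5],
  letters_2 = letters[6:10],
  stringsAsFactors = FALSE
)

# work on a copy so the original stays intact
broken <- demo_df
broken[1, ] <- c("this", "row", "is", "changed")

# the original still has numeric columns
is.numeric(demo_df$numbers_1)
## [1] TRUE

# in the copy, the numeric columns were converted to character
is.numeric(broken$numbers_1)
## [1] FALSE
class(broken$numbers_2)
## [1] "character"
```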

2.2 Adding columns

In the same way you can write e.g. my_df$numbers_2 <- to overwrite the column numbers_2, you can create a new column by writing a name that doesn’t already exist:

# new column
my_df$numbers_3 <- 11:15
my_df
##   numbers_1 numbers_2 letters_1 letters_2 numbers_3
## 1         1         6         a         f        11
## 2         2         7         b         g        12
## 3         3         8         c         h        13
## 4         4         9         d         i        14
## 5         5        10         e         j        15

A very useful application of this is that you can compute values in a new column using data from other columns:

my_df$numbers_4 <- my_df$numbers_1 + my_df$numbers_3
my_df
##   numbers_1 numbers_2 letters_1 letters_2 numbers_3 numbers_4
## 1         1         6         a         f        11        12
## 2         2         7         b         g        12        14
## 3         3         8         c         h        13        16
## 4         4         9         d         i        14        18
## 5         5        10         e         j        15        20

This kind of vectorized operation, where a calculation is applied to every row at once, is among the most useful features of R, and it is essential for handling data.
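Vectorization is not limited to adding two columns. As a rough sketch, the example below multiplies a column by a single number, builds a text label with paste(), and stores the result of a comparison. The column names doubled, label and is_large, and the name demo_df, are made up for this example.

```r
# A fresh copy of part of the example data (demo_df is a name made up for this sketch)
demo_df <- data.frame(
  numbers_1 = 1:5,
  numbers_2 = 6:10,
  letters_1 = letters[1:5],
  stringsAsFactors = FALSE
)

# arithmetic with a single value is applied to every row
demo_df$doubled <- demo_df$numbers_1 * 2

# functions such as paste() also work element by element
demo_df$label <- paste(demo_df$letters_1, demo_df$numbers_2, sep = "_")

# comparisons return one TRUE/FALSE per row
demo_df$is_large <- demo_df$numbers_2 > 8

demo_df
##   numbers_1 numbers_2 letters_1 doubled label is_large
## 1         1         6         a       2   a_6    FALSE
## 2         2         7         b       4   b_7    FALSE
## 3         3         8         c       6   c_8    FALSE
## 4         4         9         d       8   d_9     TRUE
## 5         5        10         e      10  e_10     TRUE
```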

3 Subsetting

Subsetting is similar to extracting data in that you select certain rows (and columns). You can use either square brackets or R’s subset function for this. The difference is that when subsetting, you usually want to select rows based on some defined criteria, typically written with operators that return TRUE or FALSE, like ==, > or %in%.

This relies on a way of extracting data that we have not covered yet: providing a vector of TRUE and FALSE values, one per row, which selects the rows where the value is TRUE. For example, to select rows 1 and 2, you could write:

my_df[1:2,]
##   numbers_1 numbers_2 letters_1 letters_2 numbers_3 numbers_4
## 1         1         6         a         f        11        12
## 2         2         7         b         g        12        14

But you could also write:

my_df[c(TRUE,TRUE,FALSE,FALSE,FALSE),]
##   numbers_1 numbers_2 letters_1 letters_2 numbers_3 numbers_4
## 1         1         6         a         f        11        12
## 2         2         7         b         g        12        14

Then, consider that writing:

my_df$numbers_1 < 3
## [1]  TRUE  TRUE FALSE FALSE FALSE

provides the exact same vector that we used earlier for subsetting. This means that you can write:

my_df[my_df$numbers_1 < 3,]
##   numbers_1 numbers_2 letters_1 letters_2 numbers_3 numbers_4
## 1         1         6         a         f        11        12
## 2         2         7         b         g        12        14

to select all rows where the numbers_1 column is smaller than 3.

You can match against several values at once using the %in% operator, which checks whether each value in your data is equal to any of the values in a vector you provide, e.g.:

my_df[my_df$letters_1 %in% c("a", "b"),]
##   numbers_1 numbers_2 letters_1 letters_2 numbers_3 numbers_4
## 1         1         6         a         f        11        12
## 2         2         7         b         g        12        14

I won’t cover the logical operators comprehensively here, but a lot of material is available online. It is as simple as: if you are able to create your subsetting criteria as a logical expression in R, you will have no trouble subsetting.
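As a brief sketch of what such expressions can look like, the example below combines conditions with & (and), | (or) and ! (not). It uses a fresh four-column copy of the example data, called demo_df here only for this sketch.

```r
# A fresh copy of the example data (demo_df is a name made up for this sketch)
demo_df <- data.frame(
  numbers_1 = 1:5,
  numbers_2 = 6:10,
  letters_1 = letters[1:5],
  letters_2 = letters[6:10],
  stringsAsFactors = FALSE
)

# AND: both conditions must be TRUE
demo_df[demo_df$numbers_1 > 1 & demo_df$numbers_2 < 9, ]
##   numbers_1 numbers_2 letters_1 letters_2
## 2         2         7         b         g
## 3         3         8         c         h

# OR: at least one condition must be TRUE
demo_df[demo_df$numbers_1 == 1 | demo_df$letters_2 == "j", ]
##   numbers_1 numbers_2 letters_1 letters_2
## 1         1         6         a         f
## 5         5        10         e         j

# NOT: keep the rows where the condition is FALSE
demo_df[!(demo_df$letters_1 %in% c("a", "b")), ]
##   numbers_1 numbers_2 letters_1 letters_2
## 3         3         8         c         h
## 4         4         9         d         i
## 5         5        10         e         j
```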

An alternative to using square brackets is the subset function. It works in the same way, with logical expressions, but with a different syntax: inside subset you can refer to columns by name without writing my_df$ in front of them.

subset(my_df, numbers_1 < 3)
##   numbers_1 numbers_2 letters_1 letters_2 numbers_3 numbers_4
## 1         1         6         a         f        11        12
## 2         2         7         b         g        12        14
subset(my_df, letters_1 %in% c("a", "b"))
##   numbers_1 numbers_2 letters_1 letters_2 numbers_3 numbers_4
## 1         1         6         a         f        11        12
## 2         2         7         b         g        12        14
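Two further details of subset may be worth knowing. Its select argument picks columns, and it treats missing values in the condition as FALSE, whereas square brackets return a row filled with NA for them. The sketch below illustrates both on a fresh copy of the example data; the name demo_df and the NA inserted into it are made up for this example.

```r
# A fresh copy of the example data (demo_df is a name made up for this sketch)
demo_df <- data.frame(
  numbers_1 = 1:5,
  numbers_2 = 6:10,
  letters_1 = letters[1:5],
  letters_2 = letters[6:10],
  stringsAsFactors = FALSE
)

# select rows and columns in one call
subset(demo_df, numbers_1 < 3, select = c(numbers_1, letters_1))
##   numbers_1 letters_1
## 1         1         a
## 2         2         b

# introduce a missing value to compare the two approaches
demo_df$numbers_1[5] <- NA

# subset() drops the row where the condition is NA
nrow(subset(demo_df, numbers_1 < 3))
## [1] 2

# square brackets keep it as an extra row filled with NA
nrow(demo_df[demo_df$numbers_1 < 3, ])
## [1] 3
```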

If you want to know more about the subset function, try accessing its help page by writing ?subset.