This guide is a part of a series of short tutorials. Find all the tutorials here.
A data.frame is arguably the most important kind of object you work with in R as a biologist. This very short tutorial covers some important aspects of handling data.frames, namely adding and extracting information, and subsetting.
We will use this example data.frame throughout:
my_df <- data.frame(
  numbers_1 = 1:5,
  numbers_2 = 6:10,
  letters_1 = letters[1:5],
  letters_2 = letters[6:10],
  stringsAsFactors = FALSE
)
my_df
## numbers_1 numbers_2 letters_1 letters_2
## 1 1 6 a f
## 2 2 7 b g
## 3 3 8 c h
## 4 4 9 d i
## 5 5 10 e j
You can extract data in two main ways: with the $ operator or with square brackets [].
The $ operator extracts a column from the data frame into a vector:
my_df$numbers_1
## [1] 1 2 3 4 5
my_df$letters_1
## [1] "a" "b" "c" "d" "e"
Square brackets can take either rows or columns, as well as single data entries. The syntax is data[row, column]. If you leave either of the fields within the brackets empty, it will return all the rows or columns, respectively.
# extract data from row 2, column 4:
my_df[2, 4]
## [1] "g"
# extract row 2, columns 1 and 3:
my_df[2, c(1, 3)]
## numbers_1 letters_1
## 2 2 b
# extract row 2, all columns:
my_df[2, ]
## numbers_1 numbers_2 letters_1 letters_2
## 2 2 7 b g
# extract column 4, all rows:
my_df[, 4]
## [1] "f" "g" "h" "i" "j"
Note that the last one is the same as writing my_df$letters_2.
You can also supply column names in the brackets to extract one or more columns (remember the quotation marks):
my_df["letters_1"]
## letters_1
## 1 a
## 2 b
## 3 c
## 4 d
## 5 e
my_df[c("numbers_2", "letters_2")]
## numbers_2 letters_2
## 1 6 f
## 2 7 g
## 3 8 h
## 4 9 i
## 5 10 j
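One detail is visible in the outputs above: indexing with a name and no comma, as in my_df["letters_1"], returns a one-column data.frame, whereas the comma form and the $ operator return a plain vector. The sketch below illustrates the difference. It uses a small stand-in data frame called demo_df, a name made up for this illustration, so it can be run on its own.

```r
# demo_df is a hypothetical stand-in for my_df, used only in this illustration
demo_df <- data.frame(
  numbers_1 = 1:5,
  letters_1 = letters[1:5],
  stringsAsFactors = FALSE
)

as_df  <- demo_df["letters_1"]                  # no comma: stays a data.frame
as_vec <- demo_df[, "letters_1"]                # with comma: simplified to a vector
kept   <- demo_df[, "letters_1", drop = FALSE]  # drop = FALSE keeps the data.frame

class(as_df)
## [1] "data.frame"
class(as_vec)
## [1] "character"
class(kept)
## [1] "data.frame"
```

Which form you want depends on what comes next: functions that expect a vector, such as mean, need the vector form, while the data.frame form keeps the column name attached.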
Luckily, the process for changing a data point is very similar to extracting one. All you need to do is to put the indexing on the left side of <- (or =).
# change a single value
my_df[2, 4] <- "changed"
my_df
## numbers_1 numbers_2 letters_1 letters_2
## 1 1 6 a f
## 2 2 7 b changed
## 3 3 8 c h
## 4 4 9 d i
## 5 5 10 e j
# change a column
my_df$numbers_2 <- c(1, 1, 2, 3, 5)
my_df
## numbers_1 numbers_2 letters_1 letters_2
## 1 1 1 a f
## 2 2 1 b changed
## 3 3 2 c h
## 4 4 3 d i
## 5 5 5 e j
# change a row
my_df[1, ] <- c("this", "row", "is", "changed")
my_df
## numbers_1 numbers_2 letters_1 letters_2
## 1 this row is changed
## 2 2 1 b changed
## 3 3 2 c h
## 4 4 3 d i
## 5 5 5 e j
Notice how this changes the original data, so use this with caution!
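One reason for caution is easy to miss in the printed output: assigning a character vector to a row also converts the numeric columns it touches to character. The sketch below checks the column classes before and after such an assignment, and shows that supplying a list with one correctly typed value per column keeps the types. The names demo_df, broken_df and safe_df are made up for this illustration.

```r
# demo_df is a hypothetical stand-in for my_df, used only in this illustration
demo_df <- data.frame(
  numbers_1 = 1:3,
  letters_1 = letters[1:3],
  stringsAsFactors = FALSE
)
before <- sapply(demo_df, class)

# Assigning a character vector to a row coerces every column to character
broken_df <- demo_df
broken_df[1, ] <- c("this", "row")
after_vector <- sapply(broken_df, class)

# Assigning a list with one value per column keeps each column's type
safe_df <- demo_df
safe_df[1, ] <- list(10L, "z")
after_list <- sapply(safe_df, class)

before
##   numbers_1   letters_1
##   "integer" "character"
after_vector
##   numbers_1   letters_1
## "character" "character"
after_list
##   numbers_1   letters_1
##   "integer" "character"
```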
# make a new df since we broke the last one ...
my_df <- data.frame(
  numbers_1 = 1:5,
  numbers_2 = 6:10,
  letters_1 = letters[1:5],
  letters_2 = letters[6:10],
  stringsAsFactors = FALSE
)
In the same way that you can write e.g. my_df$numbers_2 <- to overwrite the column numbers_2, you can create a new column by writing a name that doesn’t already exist:
# new column
my_df$numbers_3 <- 11:15
my_df
## numbers_1 numbers_2 letters_1 letters_2 numbers_3
## 1 1 6 a f 11
## 2 2 7 b g 12
## 3 3 8 c h 13
## 4 4 9 d i 14
## 5 5 10 e j 15
A very useful application of this is that you can compute values in a new column using data from other columns:
my_df$numbers_4 <- my_df$numbers_1 + my_df$numbers_3
my_df
## numbers_1 numbers_2 letters_1 letters_2 numbers_3 numbers_4
## 1 1 6 a f 11 12
## 2 2 7 b g 12 14
## 3 3 8 c h 13 16
## 4 4 9 d i 14 18
## 5 5 10 e j 15 20
This kind of vectorized operation is among the most useful features of R, and it is essential for handling data.
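Vectorization is not limited to adding two columns. As a minimal sketch, the example below multiplies a column by a single number, divides one column by another, and pastes a character column together with a numeric one. The data frame demo_df and the new column names (doubled, ratio, label) are made up for this illustration.

```r
# demo_df is a hypothetical stand-in for my_df, used only in this illustration
demo_df <- data.frame(
  numbers_1 = 1:5,
  numbers_2 = 6:10,
  letters_1 = letters[1:5],
  stringsAsFactors = FALSE
)

# A single number is recycled across every row
demo_df$doubled <- demo_df$numbers_1 * 2

# Two columns are combined element by element
demo_df$ratio <- demo_df$numbers_2 / demo_df$numbers_1

# Text functions such as paste0 are vectorized too
demo_df$label <- paste0(demo_df$letters_1, demo_df$numbers_1)

demo_df$doubled
## [1]  2  4  6  8 10
demo_df$label
## [1] "a1" "b2" "c3" "d4" "e5"
```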
Subsetting is similar to extracting data in that you select certain rows (and columns). You can use either the square brackets or R’s subset function for this. A difference is that when subsetting, you often want to select rows based on some defined criteria, and you often use R’s logical operators, like ==, > or %in%, for this. This relies on a different way of extracting data than we covered before, namely providing a vector of TRUE and FALSE values, which selects every row whose position in the vector is TRUE. For example, for selecting rows 1 and 2, you could write:
my_df[1:2, ]
## numbers_1 numbers_2 letters_1 letters_2 numbers_3 numbers_4
## 1 1 6 a f 11 12
## 2 2 7 b g 12 14
But you could also write:
my_df[c(TRUE, TRUE, FALSE, FALSE, FALSE), ]
## numbers_1 numbers_2 letters_1 letters_2 numbers_3 numbers_4
## 1 1 6 a f 11 12
## 2 2 7 b g 12 14
Then, consider that writing:
my_df$numbers_1 < 3
## [1] TRUE TRUE FALSE FALSE FALSE
provides the exact same vector that we used earlier for subsetting. This means that you can write:
my_df[my_df$numbers_1 < 3, ]
## numbers_1 numbers_2 letters_1 letters_2 numbers_3 numbers_4
## 1 1 6 a f 11 12
## 2 2 7 b g 12 14
to select all rows where the numbers_1 column is smaller than 3.
You can match against several values at once using the %in% operator, which checks whether each value in your data is equal to any of the values in a vector you provide, e.g.:
my_df[my_df$letters_1 %in% c("a", "b"), ]
## numbers_1 numbers_2 letters_1 letters_2 numbers_3 numbers_4
## 1 1 6 a f 11 12
## 2 2 7 b g 12 14
I won’t cover the logical operators comprehensively here, but a lot of material is available online. It is as simple as: if you are able to create your subsetting criteria as a logical expression in R, you will have no trouble subsetting.
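As a brief illustration of building such expressions, the sketch below combines conditions with & (and), | (or) and ! (not). It is not a full tour of the logical operators, and demo_df and the result names are made up for this illustration.

```r
# demo_df is a hypothetical stand-in for my_df, used only in this illustration
demo_df <- data.frame(
  numbers_1 = 1:5,
  numbers_2 = 6:10,
  letters_1 = letters[1:5],
  stringsAsFactors = FALSE
)

# & keeps rows where both conditions are TRUE (rows 2, 3 and 4)
both <- demo_df[demo_df$numbers_1 > 1 & demo_df$numbers_2 < 10, ]

# | keeps rows where at least one condition is TRUE (rows 1 and 5)
either <- demo_df[demo_df$numbers_1 == 1 | demo_df$letters_1 == "e", ]

# ! inverts a condition: everything except "a" and "b" (rows 3, 4 and 5)
not_in <- demo_df[!(demo_df$letters_1 %in% c("a", "b")), ]

both$numbers_1
## [1] 2 3 4
either$numbers_1
## [1] 1 5
not_in$numbers_1
## [1] 3 4 5
```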
An alternative to using square brackets is using the subset function. This works in the same way, with logical operators, but with a different syntax: you refer to columns by name alone, without the my_df$ prefix.
subset(my_df, numbers_1 < 3)
## numbers_1 numbers_2 letters_1 letters_2 numbers_3 numbers_4
## 1 1 6 a f 11 12
## 2 2 7 b g 12 14
subset(my_df, letters_1 %in% c("a", "b"))
## numbers_1 numbers_2 letters_1 letters_2 numbers_3 numbers_4
## 1 1 6 a f 11 12
## 2 2 7 b g 12 14
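The subset function can also pick columns in the same call through its select argument. A minimal sketch, again using a stand-in data frame (demo_df and small_df are names made up for this illustration):

```r
# demo_df is a hypothetical stand-in for my_df, used only in this illustration
demo_df <- data.frame(
  numbers_1 = 1:5,
  numbers_2 = 6:10,
  letters_1 = letters[1:5],
  stringsAsFactors = FALSE
)

# Keep rows where numbers_1 < 3, and only the two named columns
small_df <- subset(demo_df, numbers_1 < 3, select = c(numbers_1, letters_1))

# The equivalent with square brackets
same_df <- demo_df[demo_df$numbers_1 < 3, c("numbers_1", "letters_1")]

small_df
##   numbers_1 letters_1
## 1         1         a
## 2         2         b
```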
If you want to know more about the subset function, try accessing its help page by writing ?subset.