Robert Schlegel
group_by() provides powerful options for data analysis
We will use two packages and one example dataset for these slides.
In the previous session we covered the five main transformation functions one would use in a typical tidy workflow. But to really unlock their power we need to learn how to use them with group_by(). This is how we may calculate statistics based on the different grouping variables within our data, such as sites, species, or months.
group_by()
group_by()
summarise() the data…
ungroup()
summarise() has an ungrouping argument
sst_NOAA_site_mean <- sst_NOAA %>%
  # Group by the site column
  group_by(site) %>%
  # Calculate means
  summarise(mean_temp = mean(temp, na.rm = TRUE),
            # Count observations
            count = n(),
            # Ungroup results
            .groups = "drop")
sst_NOAA_site_mean
# A tibble: 3 × 3
site mean_temp count
<chr> <dbl> <int>
1 Med 17.9 14245
2 NW_Atl 8.90 14245
3 WA 21.5 14245
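To see what the ungrouping argument changes, here is a minimal sketch on a small made-up tibble. The object toy_sst and its values are invented for illustration and are not part of sst_NOAA. It assumes dplyr 1.0.0 or later, where summarise() accepts .groups.

```r
library(dplyr)

# Hypothetical toy data: two sites, two months each
toy_sst <- tibble(
  site  = c("A", "A", "A", "B", "B", "B"),
  month = c("Jan", "Jan", "Feb", "Jan", "Feb", "Feb"),
  temp  = c(10, 12, 14, 20, 22, 24)
)

# "drop_last" removes only the last grouping level (month),
# so the result is still grouped by site
kept <- toy_sst %>%
  group_by(site, month) %>%
  summarise(mean_temp = mean(temp), .groups = "drop_last")

# "drop" removes all grouping, returning a plain tibble
dropped <- toy_sst %>%
  group_by(site, month) %>%
  summarise(mean_temp = mean(temp), .groups = "drop")

group_vars(kept)     # the remaining grouping column(s)
group_vars(dropped)  # no grouping columns left
```

When grouping by a single column, as in the site means above, either option leaves no groups behind. The difference only matters once there are two or more grouping columns.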
We’ve played around quite a bit with grouping and summarising, but that’s not all we can do. group_by() also works very nicely with filter() and mutate(). It is less useful with arrange() and select(), as these are designed to work on the entire dataframe at once, without any subsetting. We can do some rather imaginative things when we combine all of these tools together. In fact, we should be able to accomplish almost any task we can think of.
sst_NOAA_anom <- sst_NOAA %>%
  group_by(site) %>%
  # Subtract each site's own mean to get the anomaly
  mutate(anom = temp - mean(temp, na.rm = TRUE)) %>%
  ungroup()
head(sst_NOAA_anom)
# A tibble: 6 × 4
site t temp anom
<chr> <date> <dbl> <dbl>
1 Med 1982-01-01 13.9 -4.00
2 Med 1982-01-02 13.9 -3.99
3 Med 1982-01-03 13.4 -4.46
4 Med 1982-01-04 13.1 -4.74
5 Med 1982-01-05 13.1 -4.73
6 Med 1982-01-06 13.9 -4.00
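The same grouping logic applies to filter(). As a minimal sketch, again using an invented toy tibble (toy_sst and toy_warm are hypothetical names, not objects from these slides), a grouped filter() compares each row against the mean of its own site and not the overall mean:

```r
library(dplyr)

# Hypothetical toy data: two sites with three temperatures each
toy_sst <- tibble(
  site = c("A", "A", "A", "B", "B", "B"),
  temp = c(10, 12, 14, 20, 22, 24)
)

# Keep only the rows that are warmer than their own site's mean
toy_warm <- toy_sst %>%
  group_by(site) %>%
  filter(temp > mean(temp, na.rm = TRUE)) %>%
  ungroup()

toy_warm
```

Without the group_by() line, the comparison would use the mean of all six values, and only rows from the warmer site would be kept.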
# First create a character vector containing the desired sites
selected_sites <- c("Med", "WA")
# Then calculate the statistics
sst_NOAA %>%
  filter(site %in% selected_sites) %>%
  group_by(site) %>%
  summarise(mean_temp = mean(temp, na.rm = TRUE),
            sd_temp = sd(temp, na.rm = TRUE))
# A tibble: 2 × 3
site mean_temp sd_temp
<chr> <dbl> <dbl>
1 Med 17.9 4.15
2 WA 21.5 1.64
# Load the NOAA SST data
read_csv("course_material/data/sst_NOAA.csv") %>%
  # Then create a month abbreviation column
  mutate(month = month(t, label = TRUE)) %>%
  # Then group by sites and months
  group_by(site, month) %>%
  # Lastly calculate the mean
  summarise(mean_temp = mean(temp, na.rm = TRUE),
            # and the SD
            sd_temp = sd(temp, na.rm = TRUE)) %>%
  # Begin ggplot
  ggplot(aes(x = month, y = mean_temp, group = site)) +
  # Create a ribbon
  geom_ribbon(aes(ymin = mean_temp - sd_temp, ymax = mean_temp + sd_temp),
              fill = "black", alpha = 0.4) +
  # Create dots
  geom_point(aes(colour = site)) +
  # Create lines
  geom_line(aes(colour = site, group = site)) +
  # Change labels
  labs(x = "Month", y = "Temperature (°C)", colour = "Site")
There is a near endless sea of possibilities when one starts to become comfortable with writing R code. We have seen several summary functions used thus far, mostly in straightforward ways. But that is one of the fun things about R: the only limits to what we may create are within our mind, not the program. Here is just one example of a creative way to answer a straightforward question: ‘What is the proportion of recordings above 20°C per site?’. Note how we may refer to columns we have created within the same chunk. There is no need to save the intermediate dataframes if we choose not to.
sst_NOAA %>%
  group_by(site) %>%
  summarise(count = n(),
            count_20 = sum(temp > 20)) %>%
  mutate(prop_20 = count_20 / count) %>%
  arrange(prop_20)
# A tibble: 3 × 4
site count count_20 prop_20
<chr> <int> <int> <dbl>
1 NW_Atl 14245 0 0
2 Med 14245 4740 0.333
3 WA 14245 11463 0.805
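The same proportion can also be reached in one step, because R treats TRUE as 1 and FALSE as 0, so the mean of a logical vector is the share of TRUE values. The following is a minimal sketch of that alternative on an invented toy tibble (toy_sst and toy_prop are hypothetical names), not a replacement for the approach above:

```r
library(dplyr)

# Hypothetical toy data: four recordings per site
toy_sst <- tibble(
  site = c("A", "A", "A", "A", "B", "B", "B", "B"),
  temp = c(18, 19, 21, 22, 21, 22, 23, 19)
)

# The mean of a logical vector is the proportion of TRUE values
toy_prop <- toy_sst %>%
  group_by(site) %>%
  summarise(prop_20 = mean(temp > 20, na.rm = TRUE), .groups = "drop") %>%
  arrange(prop_20)

toy_prop
```

One design difference to keep in mind: with na.rm = TRUE the missing values are left out of the denominator, whereas n() counts every row, so the two approaches only agree when the temperature column has no missing values.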