Domesticating data

Robert Schlegel

Problem

  • How do we summarise our data if they have multiple categories?
  • What are the limitations of this approach?
  • Does this really help with our workflow?

Solution

  • group_by() provides powerful options for data analysis
  • The only limits are our imagination, but be cautious
  • We end with the example exercise from Day 1

Setup

We will use two packages and one example dataset for these slides.

library(tidyverse) # All-in-one

library(lubridate) # For working with dates

sst_NOAA <- read_csv("course_material/data/sst_NOAA.csv") # SST data

The next level

In the previous session we covered the five main transformation functions one would use in a typical tidy workflow. But to really unlock their power we need to learn how to use them with group_by(). This is how we may calculate statistics based on the different grouping variables within our data, such as sites or species or months.

group_by()

  • Note that this function will not appear to do anything by itself
  • This can cause issues if we aren’t paying attention
sst_NOAA_site <- sst_NOAA %>% group_by(site)
sst_NOAA %>% head()
# A tibble: 6 × 3
  site  t           temp
  <chr> <date>     <dbl>
1 Med   1982-01-01  13.9
2 Med   1982-01-02  13.9
3 Med   1982-01-03  13.4
4 Med   1982-01-04  13.1
5 Med   1982-01-05  13.1
6 Med   1982-01-06  13.9
sst_NOAA_site %>% head()
# A tibble: 6 × 3
# Groups:   site [1]
  site  t           temp
  <chr> <date>     <dbl>
1 Med   1982-01-01  13.9
2 Med   1982-01-02  13.9
3 Med   1982-01-03  13.4
4 Med   1982-01-04  13.1
5 Med   1982-01-05  13.1
6 Med   1982-01-06  13.9
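
Because a grouped tibble prints almost identically to an ungrouped one, it can help to check the grouping explicitly rather than rely on the `# Groups:` line. The sketch below uses a small made-up tibble (`toy_sst` is hypothetical and stands in for sst_NOAA) with dplyr's is_grouped_df() and group_vars().

```r
library(tidyverse)

# Hypothetical toy data standing in for sst_NOAA
toy_sst <- tibble(site = c("Med", "Med", "WA", "WA"),
                  temp = c(13.9, 13.4, 21.0, 22.0))

toy_sst_site <- toy_sst %>% group_by(site)

# Check whether a dataframe is grouped, and by which columns
is_grouped_df(toy_sst)       # FALSE
is_grouped_df(toy_sst_site)  # TRUE
group_vars(toy_sst_site)     # "site"
```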

group_by()

  • But when we summarise() the data…
sst_NOAA %>% 
  summarise(mean_temp = mean(temp, na.rm = TRUE)) %>% 
  head()
# A tibble: 1 × 1
  mean_temp
      <dbl>
1      16.1
sst_NOAA_site %>% 
  summarise(mean_temp = mean(temp, na.rm = TRUE)) %>% 
  head()
# A tibble: 3 × 2
  site   mean_temp
  <chr>      <dbl>
1 Med        17.9 
2 NW_Atl      8.90
3 WA         21.5 

ungroup()

  • One must explicitly tell R to remove a group
sst_NOAA_ungroup <- sst_NOAA_site %>% ungroup()
sst_NOAA_site %>% 
  summarise(mean_temp = mean(temp, na.rm = TRUE)) %>% 
  head()
# A tibble: 3 × 2
  site   mean_temp
  <chr>      <dbl>
1 Med        17.9 
2 NW_Atl      8.90
3 WA         21.5 
sst_NOAA_ungroup %>% 
  summarise(mean_temp = mean(temp, na.rm = TRUE)) %>% 
  head()
# A tibble: 1 × 1
  mean_temp
      <dbl>
1      16.1

Multiple groups

  • As one may have guessed by now, grouping is not confined to a single column
  • One may use any number of columns to perform elaborate grouping measures
# Create groupings based on temperatures
sst_NOAA_temp_group <- sst_NOAA %>% 
  group_by(round(temp))

# Create groupings based on site and month
sst_NOAA_site_month_group <- sst_NOAA %>% 
  mutate(month = month(t)) %>% 
  group_by(site, month)
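
One thing to watch with multiple groups: by default, summarise() removes only the last grouping level, so the result is still grouped. The sketch below illustrates this with a small made-up tibble (`toy_sst` and its values are hypothetical, not taken from sst_NOAA).

```r
library(tidyverse)
library(lubridate)

# Hypothetical toy data: two sites, two days in each of two months
toy_sst <- tibble(site = rep(c("Med", "WA"), each = 4),
                  t = rep(as.Date(c("1982-01-01", "1982-01-02",
                                    "1982-02-01", "1982-02-02")), times = 2),
                  temp = c(13.9, 13.5, 13.1, 12.9, 21.0, 21.4, 22.0, 22.2))

toy_monthly <- toy_sst %>%
  mutate(month = month(t)) %>%
  group_by(site, month) %>%
  summarise(mean_temp = mean(temp, na.rm = TRUE))

# Only the last grouping level (month) has been dropped
group_vars(toy_monthly)  # "site"
nrow(toy_monthly)        # 4
```

Any later mutate() or summarise() on `toy_monthly` would therefore still run per site unless ungroup() is called first.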

Chain functions

  • Generally we do not group objects separately
  • Grouping is performed within code chunks
  • summarise() can ungroup its result via the .groups argument
sst_NOAA_site_mean <- sst_NOAA %>% 
  # Group by the site column
  group_by(site) %>% 
  # Calculate means
  summarise(mean_temp = mean(temp, na.rm = TRUE), 
            # Count observations 
            count = n(),
            # Ungroup results
            .groups = "drop") 
sst_NOAA_site_mean
# A tibble: 3 × 3
  site   mean_temp count
  <chr>      <dbl> <int>
1 Med        17.9  14245
2 NW_Atl      8.90 14245
3 WA         21.5  14245
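
As an alternative to pairing group_by() with `.groups = "drop"`, recent versions of dplyr (1.1.0 and later, which is an assumption about the installed version) accept a `.by` argument that groups for a single operation and always returns an ungrouped result. A minimal sketch on hypothetical toy data:

```r
library(tidyverse)

# Hypothetical toy data standing in for sst_NOAA
toy_sst <- tibble(site = c("Med", "Med", "WA", "WA"),
                  temp = c(13.9, 13.4, 21.0, 22.0))

# Per-operation grouping; assumes dplyr >= 1.1.0
toy_site_mean <- toy_sst %>%
  summarise(mean_temp = mean(temp, na.rm = TRUE),
            count = n(),
            .by = site)

# The result is never grouped, so no ungroup() is needed
is_grouped_df(toy_site_mean)  # FALSE
```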

Grouped transformations

We’ve played around quite a bit with grouping and summarising, but that’s not all we can do. group_by() also works very nicely with filter() and mutate(). It matters less for arrange() and select(): by default arrange() ignores groups, and select() simply keeps the grouping columns. We can do some rather imaginative things when we combine all of these tools. In fact, we should be able to accomplish almost any data-wrangling task we can think of.
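
To make the behaviour of arrange() and select() on grouped data concrete, here is a small sketch on made-up values (`toy_grouped` is hypothetical). It shows that arrange() only respects groups when asked to, and that select() retains the grouping column.

```r
library(tidyverse)

# Hypothetical toy data standing in for sst_NOAA, grouped by site
toy_grouped <- tibble(site = c("WA", "Med", "WA", "Med"),
                      temp = c(22.0, 13.9, 21.0, 13.4)) %>%
  group_by(site)

# By default arrange() ignores the groups and sorts all rows together
sorted_default <- toy_grouped %>% arrange(desc(temp))
sorted_default$temp   # 22.0 21.0 13.9 13.4

# .by_group = TRUE sorts by the grouping column first, then by temp
sorted_by_group <- toy_grouped %>% arrange(desc(temp), .by_group = TRUE)
sorted_by_group$temp  # 13.9 13.4 22.0 21.0

# select() keeps the grouping column even when it is not requested
selected_names <- toy_grouped %>% select(temp) %>% names()
selected_names        # "site" "temp"
```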

Examples

  • Keep only the sites whose maximum temperature exceeds 20°C
sst_NOAA_20 <- sst_NOAA %>%
  group_by(site) %>%
  filter(max(temp, na.rm = TRUE) > 20) %>% 
  ungroup()
unique(sst_NOAA_20$site)
[1] "Med" "WA" 

Examples

  • Calculate anomalies for each site
sst_NOAA_anom <- sst_NOAA %>%
  group_by(site) %>% 
  mutate(anom = temp - mean(temp, na.rm = TRUE)) %>%
  ungroup()
head(sst_NOAA_anom)
# A tibble: 6 × 4
  site  t           temp  anom
  <chr> <date>     <dbl> <dbl>
1 Med   1982-01-01  13.9 -4.00
2 Med   1982-01-02  13.9 -3.99
3 Med   1982-01-03  13.4 -4.46
4 Med   1982-01-04  13.1 -4.74
5 Med   1982-01-05  13.1 -4.73
6 Med   1982-01-06  13.9 -4.00

Examples

  • Calculate the mean and standard deviation for two sites
sst_NOAA %>% 
  filter(site == "Med" | site == "WA") %>%
  group_by(site) %>% 
  summarise(mean_temp = mean(temp, na.rm = TRUE), 
            sd_temp = sd(temp, na.rm = TRUE))
# A tibble: 2 × 3
  site  mean_temp sd_temp
  <chr>     <dbl>   <dbl>
1 Med        17.9    4.15
2 WA         21.5    1.64

Examples

  • The same calculation, selecting the sites with %in%
# First create a character vector containing the desired sites
selected_sites <- c("Med", "WA")

# Then calculate the statistics
sst_NOAA %>% 
  filter(site %in% selected_sites) %>%
  group_by(site) %>% 
  summarise(mean_temp = mean(temp, na.rm = TRUE), 
            sd_temp = sd(temp, na.rm = TRUE))
# A tibble: 2 × 3
  site  mean_temp sd_temp
  <chr>     <dbl>   <dbl>
1 Med        17.9    4.15
2 WA         21.5    1.64

Examples

  • Count only the days at the Med site with temperatures above 10°C and below 15°C
sst_NOAA %>% 
  filter(site == "Med", 
         temp > 10, temp < 15) %>% 
  nrow()
[1] 5244
sst_NOAA %>% 
  filter(site == "Med", 
         !(temp <= 10 | temp  >= 15)) %>% 
  nrow()
[1] 5244
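
The two filters return the same rows because negating "at or below 10 OR at or above 15" is logically the same as "above 10 AND below 15" (De Morgan's law). A quick check on a hypothetical vector that includes the boundary values:

```r
library(tidyverse)

# Hypothetical temperatures, including the boundaries 10 and 15
toy_sst <- tibble(temp = c(9.5, 10, 12.3, 14.9, 15, 16.2))

kept_and    <- toy_sst %>% filter(temp > 10, temp < 15)
kept_not_or <- toy_sst %>% filter(!(temp <= 10 | temp >= 15))

# Both approaches keep exactly the same values
identical(kept_and$temp, kept_not_or$temp)  # TRUE
kept_and$temp                               # 12.3 14.9
```

Note that dplyr's between() includes both endpoints, so it is not a drop-in replacement for these strict inequalities.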

The new age redux

# Load the NOAA SST data
read_csv("course_material/data/sst_NOAA.csv") %>%
  # Then create a month abbreviation column
  mutate(month = month(t, label = TRUE)) %>% 
  # Then group by sites and months
  group_by(site, month) %>% 
  # Lastly calculate the mean
  summarise(mean_temp = mean(temp, na.rm = TRUE), 
            # and the SD
            sd_temp = sd(temp, na.rm = TRUE),
            # Ungroup results
            .groups = "drop") %>% 
  # Begin ggplot
  ggplot(aes(x = month, y = mean_temp, group = site)) + 
  # Create a ribbon
  geom_ribbon(aes(ymin = mean_temp - sd_temp, ymax = mean_temp + sd_temp), 
              fill = "black", alpha = 0.4) + 
  # Create dots
  geom_point(aes(colour = site)) + 
  # Create lines
  geom_line(aes(colour = site, group = site)) + 
  # Change labels
  labs(x = "Month", y = "Temperature (°C)", colour = "Site") 

Summary functions

There is a near endless sea of possibilities once one becomes comfortable with writing R code. We have seen several summary functions used thus far, mostly in straightforward ways. But one of the fun things about R is that the only limits to what we may create are within our minds, not the program. Here is just one example of a creative way to answer a straightforward question: ‘What is the proportion of recordings above 20°C per site?’. Note how we may refer to columns we have created within the same chunk. There is no need to save the intermediate dataframes if we choose not to.

Summary functions

  • The proportion of recordings above 20°C per site
sst_NOAA %>%  
  group_by(site) %>%
  summarise(count = n(), 
            count_20 = sum(temp > 20)) %>% 
  mutate(prop_20 = count_20/count) %>% 
  arrange(prop_20)
# A tibble: 3 × 4
  site   count count_20 prop_20
  <chr>  <int>    <int>   <dbl>
1 NW_Atl 14245        0   0    
2 Med    14245     4740   0.333
3 WA     14245    11463   0.805
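
A more compact route to the same proportion relies on the fact that the mean of a logical vector is the share of TRUE values. The sketch below uses made-up values (`toy_sst` is hypothetical); `na.rm = TRUE` is added on the assumption that missing temperatures should be excluded from the proportion.

```r
library(tidyverse)

# Hypothetical toy data: three recordings per site
toy_sst <- tibble(site = c("Med", "Med", "Med", "WA", "WA", "WA"),
                  temp = c(13.9, 20.5, 19.0, 21.0, 22.0, 19.5))

toy_prop <- toy_sst %>%
  group_by(site) %>%
  # mean() of a logical vector gives the proportion of TRUE values
  summarise(prop_20 = mean(temp > 20, na.rm = TRUE),
            .groups = "drop") %>%
  arrange(prop_20)

toy_prop$site  # "Med" "WA"
```

Here Med has one of three recordings above 20°C and WA has two of three, so the proportions are 1/3 and 2/3.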