Taming data

Robert Schlegel

Problem

Once the data are tidy, how do we manipulate them?
Are there any consistent approaches to this?
Can this be used to perform analyses?

Solution

There are five primary functions we will focus on here
We will see how the use of pre-existing logic tools can work for us
We end with some examples of summarise()

Setup

We will use two packages and one example dataset for these slides.

library(tidyverse) # All-in-one

library(lubridate) # For working with dates

load("course_material/data/OISST_mangled.RData") # For examples

tidy workflow

The transformation of ones data should be the third step in any workflow, after the tidying of the data.

Data transforming in the data processing pipeline. Reproduced from R for Data Science

tidy transformation

There are five primary data transformation functions that we will focus on here:

Arrange observations (rows) with arrange()
Select variables (columns) withselect()
Filter observations (rows) with filter()
Create new variables (columns) with mutate()
Summarise data (rows+columns) with summarise()

Comparison operators

We use the assignment operator (<-) to assign code to an object in our environment
Likewise, comparison operators are symbols used to compare different objects
We will see these symbols often in the ‘real world’

Comparison operators

Greater than: >
Greater than or equal to: >=
Less than: <
Less than or equal to: <=
Equal to: ==
Not equal to: !=

Comparison operators

It is important here to note that == is for comparisons and = is for maths. They are not interchangeable, as we may see in the following code chunk. This is one of the more common mistakes one makes when writing code.

OISST1 %>% 
  filter(site = "Med")

Error in `filter()`:
! We detected a named input.
ℹ This usually means that you've used `=` instead of `==`.
ℹ Did you mean `site == "Med"`?

Logical operators

Comparison operators are used to make direct comparisons between specific things
Logical operators are used more broadly when making logical arguments
R makes use of the same Boolean logic as many other platforms, including Google

Logical operators

and: &
or: |
not: !

Logical operators

Note that we need a separate logical argument for each comparison operator

OISST1_sub <- OISST1 %>% 
  filter(site == "Med", month(t) == 12 | month(t) == 1)

head(OISST1_sub)

# A tibble: 6 × 3
  site  t           temp
  <chr> <date>     <dbl>
1 Med   2008-01-01  13.7
2 Med   2008-01-02  13.7
3 Med   2008-01-03  13.2
4 Med   2008-01-04  13.3
5 Med   2008-01-05  13.3
6 Med   2008-01-06  13.4

tail(OISST1_sub)

# A tibble: 6 × 3
  site  t           temp
  <chr> <date>     <dbl>
1 Med   2009-12-26  13.5
2 Med   2009-12-27  13.5
3 Med   2009-12-28  13.6
4 Med   2009-12-29  13.6
5 Med   2009-12-30  13.5
6 Med   2009-12-31  13.5

`arrange()` observation (row)

arrange() observations (rows) based on the variables (columns) it is given
If we are concerned with ties we provide additional columns
The importance of arranging the rows is given in order from left to right
Note that NA values will get sent to the bottom

`arrange()` observation (row)

The effect of using the desc() function

OISST1 %>% 
  arrange(site, temp) %>% 
  head()

# A tibble: 6 × 3
  site  t           temp
  <chr> <date>     <dbl>
1 Med   2008-02-03  12.5
2 Med   2008-02-02  12.5
3 Med   2009-03-02  12.6
4 Med   2009-02-02  12.6
5 Med   2009-02-07  12.6
6 Med   2008-02-12  12.6

OISST1 %>% 
  arrange(site, desc(temp)) %>%
  head()

# A tibble: 6 × 3
  site  t           temp
  <chr> <date>     <dbl>
1 Med   2008-06-30  26.8
2 Med   2009-08-20  26.7
3 Med   2009-08-19  26.7
4 Med   2009-08-21  26.6
5 Med   2008-07-01  26.6
6 Med   2009-08-18  26.6

`select()` variable (column)

If we load data with columns we don’t need, we may select() those we do
There are many ways to do this and none are technically better or faster
It is up to the user to find a favourite technique

`select()` variable (column)

Different ways of using select()

# Select columns individually by name
OISST1 %>% 
  select(site, t, temp)

# Select all columns between site and temp like a sequence
OISST1 %>% 
  select(site:temp)

# Select all columns except those stated individually
OISST1 %>% 
  select(-t, -temp)

# Select all columns except those within a given sequence
  # Note that the '-' goes outside of a new set of brackets
  # that are wrapped around the sequence of columns to remove
OISST1 %>% 
  select(-(site:temp))

`select()` variable (column)

Even more ways of using select()

# Change up order by specifying individual columns
OISST1 %>% 
  select(temp, t, site)

# Use the everything function to grab all columns 
# not already specified
OISST1 %>% 
  select(t, everything())

# Or go bananas and use all of the rules at once
  # Remember, when dealing with tidy data,
  # everything may be interchanged
OISST1 %>% 
  select(temp:t, everything(), -site)

`filter()` observation (row)

filter() is generally used the most of all transformations
It allows us to dramatically reshape a dataframe
Note that this automatically removes rows with NA

`filter()` observation (row)

Use the %in% comparison operator to make multiple logic statements

OISST1 %>% 
  filter(site == "Med", 
         year(t) == 2008) %>% 
  head()

# A tibble: 6 × 3
  site  t           temp
  <chr> <date>     <dbl>
1 Med   2008-01-01  13.7
2 Med   2008-01-02  13.7
3 Med   2008-01-03  13.2
4 Med   2008-01-04  13.3
5 Med   2008-01-05  13.3
6 Med   2008-01-06  13.4

OISST1 %>% 
  filter(site %in% c("Med", "WA"), 
         year(t) == 2009) %>% 
  tail()

# A tibble: 6 × 3
  site  t           temp
  <chr> <date>     <dbl>
1 WA    2009-12-26  21.7
2 WA    2009-12-27  21.5
3 WA    2009-12-28  21.7
4 WA    2009-12-29  21.8
5 WA    2009-12-30  20.9
6 WA    2009-12-31  20.5

`mutate()` new variable (column)

When performing data analysis/statistics it is likely necessary to create new values not in the raw data. arrange(), filter(), and select() prepare us to create new data, but do not do so themselves. For this we use mutate(). We must however be very mindful that mutate() is only useful if we want to create new variables (columns) that are a function of one or more existing columns. This means that any column we create with mutate() will always have the same number of rows as the dataframe we are working with.

`mutate()` new variable (column)

Tell R the name of the column and what to put in it
Here we create kelvin by converting the temp column into Kelvin (°K) (i.e. °C+273.15)

OISST1 %>% mutate(kelvin = temp + 273.15) %>% head()

# A tibble: 6 × 4
  site  t           temp kelvin
  <chr> <date>     <dbl>  <dbl>
1 Med   2008-01-01  13.7   287.
2 Med   2008-01-02  13.7   287.
3 Med   2008-01-03  13.2   286.
4 Med   2008-01-04  13.3   286.
5 Med   2008-01-05  13.3   286.
6 Med   2008-01-06  13.4   287.

`summarise()` data (row+column)

To create new columns we use mutate(), but to calculate any sort of summary/statistic from a column that will return fewer rows than the dataframe has we will use summarise(). This makes summarise() much more powerful than the other functions in this section, but because it is able to do more, it can also be more unpredictable, making it potentially more challenging to use. We will almost always use this function in our work flows however so it behoves us to become well acquainted with it.

`summarise()` data (row+column)

Note how the use of summarise() changes the dimensions of the dataframe

OISST1 %>% 
  mutate(mean_temp = mean(temp, na.rm = TRUE)) %>% 
  head()

# A tibble: 6 × 4
  site  t           temp mean_temp
  <chr> <date>     <dbl>     <dbl>
1 Med   2008-01-01  13.7      16.2
2 Med   2008-01-02  13.7      16.2
3 Med   2008-01-03  13.2      16.2
4 Med   2008-01-04  13.3      16.2
5 Med   2008-01-05  13.3      16.2
6 Med   2008-01-06  13.4      16.2

OISST1 %>% 
  summarise(mean_temp = mean(temp, na.rm = TRUE)) %>% 
  head()

# A tibble: 1 × 1
  mean_temp
      <dbl>
1      16.2

`summarise()` data (row+column)

It is possible to perform many analyses at once

OISST1 %>% 
  summarise(mean_temp = mean(temp, na.rm = TRUE),
            sd_temp = sd(temp, na.rm = TRUE),
            min_temp = min(temp, na.rm = TRUE),
            max_temp = max(temp, na.rm = TRUE))

# A tibble: 1 × 4
  mean_temp sd_temp min_temp max_temp
      <dbl>   <dbl>    <dbl>    <dbl>
1      16.2    6.55     2.27     27.2

Domesticating data

Creating summaries of the entire OISST dataset in this way is not appropriate as we should not be combining time series from such different parts of the world. In order to calculate summaries within variables we will need to learn how to use group_by(), which in turn will first require us to learn how to chain multiple functions together within a pipe (%>%). That is how we will begin the next session for today. Finishing with several tips on how to make our data the tidiest that it may be.

Taming data

Problem

Solution

Setup

tidy workflow

tidy transformation

Comparison operators

Comparison operators

Comparison operators

Logical operators

Logical operators

Logical operators

arrange() observation (row)

arrange() observation (row)

select() variable (column)

select() variable (column)

select() variable (column)

filter() observation (row)

filter() observation (row)

mutate() new variable (column)

mutate() new variable (column)

summarise() data (row+column)

summarise() data (row+column)

summarise() data (row+column)

Domesticating data

`arrange()` observation (row)

`arrange()` observation (row)

`select()` variable (column)

`select()` variable (column)

`select()` variable (column)

`filter()` observation (row)

`filter()` observation (row)

`mutate()` new variable (column)

`mutate()` new variable (column)

`summarise()` data (row+column)

`summarise()` data (row+column)

`summarise()` data (row+column)