Once the data are tidy, how do we manipulate them?
Are there any consistent approaches to this?
Can this be used to perform analyses?
Solution
There are five primary functions we will focus on here
We will see how the use of pre-existing logic tools can work for us
We end with some examples of summarise()
Setup
We will use two packages and one example dataset for these slides.
library(tidyverse) # All-in-onelibrary(lubridate) # For working with datesload("course_material/data/OISST_mangled.RData") # For examples
tidy workflow
The transformation of ones data should be the third step in any workflow, after the tidying of the data.
Data transforming in the data processing pipeline. Reproduced from R for Data Science
tidy transformation
There are five primary data transformation functions that we will focus on here:
Arrange observations (rows) with arrange()
Select variables (columns) withselect()
Filter observations (rows) with filter()
Create new variables (columns) with mutate()
Summarise data (rows+columns) with summarise()
Comparison operators
We use the assignment operator (<-) to assign code to an object in our environment
Likewise, comparison operators are symbols used to compare different objects
We will see these symbols often in the ‘real world’
Comparison operators
Greater than: >
Greater than or equal to: >=
Less than: <
Less than or equal to: <=
Equal to: ==
Not equal to: !=
Comparison operators
It is important here to note that == is for comparisons and = is for maths. They are not interchangeable, as we may see in the following code chunk. This is one of the more common mistakes one makes when writing code.
OISST1 %>%filter(site ="Med")
Error in `filter()`:
! We detected a named input.
ℹ This usually means that you've used `=` instead of `==`.
ℹ Did you mean `site == "Med"`?
Logical operators
Comparison operators are used to make direct comparisons between specific things
Logical operators are used more broadly when making logical arguments
R makes use of the same Boolean logic as many other platforms, including Google
Logical operators
and: &
or: |
not: !
Logical operators
Note that we need a separate logical argument for each comparison operator
# A tibble: 6 × 3
site t temp
<chr> <date> <dbl>
1 Med 2008-01-01 13.7
2 Med 2008-01-02 13.7
3 Med 2008-01-03 13.2
4 Med 2008-01-04 13.3
5 Med 2008-01-05 13.3
6 Med 2008-01-06 13.4
tail(OISST1_sub)
# A tibble: 6 × 3
site t temp
<chr> <date> <dbl>
1 Med 2009-12-26 13.5
2 Med 2009-12-27 13.5
3 Med 2009-12-28 13.6
4 Med 2009-12-29 13.6
5 Med 2009-12-30 13.5
6 Med 2009-12-31 13.5
arrange() observation (row)
arrange() observations (rows) based on the variables (columns) it is given
If we are concerned with ties we provide additional columns
The importance of arranging the rows is given in order from left to right
Note that NA values will get sent to the bottom
arrange() observation (row)
The effect of using the desc() function
OISST1 %>%arrange(site, temp) %>%head()
# A tibble: 6 × 3
site t temp
<chr> <date> <dbl>
1 Med 2008-02-03 12.5
2 Med 2008-02-02 12.5
3 Med 2009-03-02 12.6
4 Med 2009-02-02 12.6
5 Med 2009-02-07 12.6
6 Med 2008-02-12 12.6
OISST1 %>%arrange(site, desc(temp)) %>%head()
# A tibble: 6 × 3
site t temp
<chr> <date> <dbl>
1 Med 2008-06-30 26.8
2 Med 2009-08-20 26.7
3 Med 2009-08-19 26.7
4 Med 2009-08-21 26.6
5 Med 2008-07-01 26.6
6 Med 2009-08-18 26.6
select() variable (column)
If we load data with columns we don’t need, we may select() those we do
There are many ways to do this and none are technically better or faster
It is up to the user to find a favourite technique
select() variable (column)
Different ways of using select()
# Select columns individually by nameOISST1 %>%select(site, t, temp)# Select all columns between site and temp like a sequenceOISST1 %>%select(site:temp)# Select all columns except those stated individuallyOISST1 %>%select(-t, -temp)# Select all columns except those within a given sequence# Note that the '-' goes outside of a new set of brackets# that are wrapped around the sequence of columns to removeOISST1 %>%select(-(site:temp))
select() variable (column)
Even more ways of using select()
# Change up order by specifying individual columnsOISST1 %>%select(temp, t, site)# Use the everything function to grab all columns # not already specifiedOISST1 %>%select(t, everything())# Or go bananas and use all of the rules at once# Remember, when dealing with tidy data,# everything may be interchangedOISST1 %>%select(temp:t, everything(), -site)
filter() observation (row)
filter() is generally used the most of all transformations
It allows us to dramatically reshape a dataframe
Note that this automatically removes rows with NA
filter() observation (row)
Use the %in% comparison operator to make multiple logic statements
# A tibble: 6 × 3
site t temp
<chr> <date> <dbl>
1 Med 2008-01-01 13.7
2 Med 2008-01-02 13.7
3 Med 2008-01-03 13.2
4 Med 2008-01-04 13.3
5 Med 2008-01-05 13.3
6 Med 2008-01-06 13.4
# A tibble: 6 × 3
site t temp
<chr> <date> <dbl>
1 WA 2009-12-26 21.7
2 WA 2009-12-27 21.5
3 WA 2009-12-28 21.7
4 WA 2009-12-29 21.8
5 WA 2009-12-30 20.9
6 WA 2009-12-31 20.5
mutate() new variable (column)
When performing data analysis/statistics it is likely necessary to create new values not in the raw data. arrange(), filter(), and select() prepare us to create new data, but do not do so themselves. For this we use mutate(). We must however be very mindful that mutate() is only useful if we want to create new variables (columns) that are a function of one or more existing columns. This means that any column we create with mutate() will always have the same number of rows as the dataframe we are working with.
mutate() new variable (column)
Tell R the name of the column and what to put in it
Here we create kelvin by converting the temp column into Kelvin (°K) (i.e. °C+273.15)
OISST1 %>%mutate(kelvin = temp +273.15) %>%head()
# A tibble: 6 × 4
site t temp kelvin
<chr> <date> <dbl> <dbl>
1 Med 2008-01-01 13.7 287.
2 Med 2008-01-02 13.7 287.
3 Med 2008-01-03 13.2 286.
4 Med 2008-01-04 13.3 286.
5 Med 2008-01-05 13.3 286.
6 Med 2008-01-06 13.4 287.
summarise() data (row+column)
To create new columns we use mutate(), but to calculate any sort of summary/statistic from a column that will return fewer rows than the dataframe has we will use summarise(). This makes summarise() much more powerful than the other functions in this section, but because it is able to do more, it can also be more unpredictable, making it potentially more challenging to use. We will almost always use this function in our work flows however so it behoves us to become well acquainted with it.
summarise() data (row+column)
Note how the use of summarise() changes the dimensions of the dataframe
# A tibble: 6 × 4
site t temp mean_temp
<chr> <date> <dbl> <dbl>
1 Med 2008-01-01 13.7 16.2
2 Med 2008-01-02 13.7 16.2
3 Med 2008-01-03 13.2 16.2
4 Med 2008-01-04 13.3 16.2
5 Med 2008-01-05 13.3 16.2
6 Med 2008-01-06 13.4 16.2
Creating summaries of the entire OISST dataset in this way is not appropriate as we should not be combining time series from such different parts of the world. In order to calculate summaries within variables we will need to learn how to use group_by(), which in turn will first require us to learn how to chain multiple functions together within a pipe (%>%). That is how we will begin the next session for today. Finishing with several tips on how to make our data the tidiest that it may be.