We are introduced to the main functions we’ll need for tidying data
We go over some tidy workflows to clarify
Setup
For these slides we will only need the tidyverse package.
library(tidyverse) # All-in-one
tidy data
So what exactly are tidy data? It is not just a a buzz word, there is a real definition. In three parts, to be exact. Taken from Hadley Wickham’s R for Data Science:
Each variable must have its own column
Each observation must have its own row
Each value must have its own cell
tidy data
Following three rules make a dataset tidy — variables are in columns, observations are in rows, and values are in cells. Reproduced from R for Data Science
tidy workflow
The tidying of ones data should be the second step in any workflow, after the loading of the data.
Data tidying in the data processing pipeline. Reproduced from R for Data Science
The basics
To illustrate the implementation of this three part definition in a workflow we are going to learn how to manipulate a non-tidy dataset into a tidy one.
We will use NOAA OISST data for the three time series from 2008 - 2009, with some notable (untidy) changes
The basics
The secondary purpose of these exercises is to illustrate that these steps may be done more quickly in R than excel. Thereby allowing for ones raw data to remain exactly how they were collected and having all of the manipulations performed on them documented in an R script. This is a centrally important part of reproducible research.
Load data
Note that this loads five different objects
Each object contains similar data in different states of disrepair
Some are easier to tidy than others
load("course_material/data/OISST_mangled.RData")
SACTN1SACTN2SACTN3# Spread across two dataframesSACTN4aSACTN4b
OISST1
These are how tidy data should look
If this looks familiar, it is because they are the same data from the first exercise
head(OISST1)
# A tibble: 6 × 3
site t temp
<chr> <date> <dbl>
1 Med 2008-01-01 13.7
2 Med 2008-01-02 13.7
3 Med 2008-01-03 13.2
4 Med 2008-01-04 13.3
5 Med 2008-01-05 13.3
6 Med 2008-01-06 13.4
OISST1
Simple line plot of the tidyOISST1 dataset
ggplot(data = OISST1, aes(x = t, y = temp)) +geom_line(aes(colour = site)) +labs(x =NULL, y ="Temperature (°C)", colour ="Site") +theme_bw()
Long vs wide
Generally, tidy data are considered long
This is because tidy dataframes tend to be longer than they are wide (with some exceptions)
The opposite of long data therefore are wide data
When one finds a dataset that is wider than long it is probably not tidy
When data are too wide
In OISST2 we can see that temperatures are placed in columns by site
This may seem like a reasonable organisation, but it is not tidy because temperature is one variable and should not take up more than one column
# A tibble: 6 × 3
t site temp
<date> <chr> <dbl>
1 2008-01-01 Med 13.7
2 2008-01-01 NW_Atl 6.23
3 2008-01-01 WA 23.0
4 2008-01-02 Med 13.7
5 2008-01-02 NW_Atl 6.22
6 2008-01-02 WA 23.2
When data are too long
If data are too long they have individual observations in multiple rows
This happens when we have two or more variables stored within the same column
head(OISST3)
# A tibble: 6 × 4
idx type name temp
<int> <chr> <chr> <dbl>
1 1 site Med 13.7
2 1 t 2008-01-01 13.7
3 2 site Med 13.7
4 2 t 2008-01-02 13.7
5 3 site Med 13.2
6 3 t 2008-01-03 13.2
When data are too long
We use pivot_wider() to fix this issue
Note the necessity of an idx column to identify individual samples
# A tibble: 6 × 4
idx type name temp
<int> <chr> <chr> <dbl>
1 1 site Med 13.7
2 1 t 2008-01-01 13.7
3 2 site Med 13.7
4 2 t 2008-01-02 13.7
5 3 site Med 13.2
6 3 t 2008-01-03 13.2
Separating and uniting
We’ve now covered how to make our dataframes longer or wider depending on their tidiness. Now we will look at how to manage our columns when they contain more (or less) than one variable, but the overall dataframe does not need to be made wider or longer. This is generally the case when a column has two variables, or two or more variables are spread across multiple columns, with only one observation per row.
Separate
OISST4a has site and t column replaced by index
This is an efficient way to store these data, but it is not tidy because the site and source of each observation are separate variables
head(OISST4a)
# A tibble: 6 × 2
index temp
<chr> <dbl>
1 Med 2008-01-01 13.7
2 Med 2008-01-02 13.7
3 Med 2008-01-03 13.2
4 Med 2008-01-04 13.3
5 Med 2008-01-05 13.3
6 Med 2008-01-06 13.4
Separate
Often times the separate() function is able to guess how to split text, but it is better to be explicit
# A tibble: 6 × 3
site t temp
<chr> <chr> <dbl>
1 Med 2008-01-01 13.7
2 Med 2008-01-02 13.7
3 Med 2008-01-03 13.2
4 Med 2008-01-04 13.3
5 Med 2008-01-05 13.3
6 Med 2008-01-06 13.4
Unite
We see this issue most often with date values when the year, month, and day values are given in different columns
If we look at OISST4b we will see that there is a year, month, and day column
head(OISST4b)
# A tibble: 6 × 5
site year month day idx
<chr> <chr> <chr> <chr> <int>
1 Med 2008 01 01 1
2 Med 2008 01 02 2
3 Med 2008 01 03 3
4 Med 2008 01 04 4
5 Med 2008 01 05 5
6 Med 2008 01 06 6
Unite
There are uses for the data in this way, though it is not terribly tidy
We usually want to unite() the date of any observation to be shown in one column
OISST4b_tidy <- OISST4b %>%unite(year, month, day, col ="t", sep ="-")head(OISST4b_tidy)
# A tibble: 6 × 3
site t idx
<chr> <chr> <int>
1 Med 2008-01-01 1
2 Med 2008-01-02 2
3 Med 2008-01-03 3
4 Med 2008-01-04 4
5 Med 2008-01-05 5
6 Med 2008-01-06 6
Joining
Remember that one of the rules of tidy data is that only one complete dataset is saved per dataframe
Looking at OISST4a_tidy and OISST4b_tidy we see that they are each missing different columns
head(OISST4a_tidy)
# A tibble: 6 × 3
site t temp
<chr> <chr> <dbl>
1 Med 2008-01-01 13.7
2 Med 2008-01-02 13.7
3 Med 2008-01-03 13.2
4 Med 2008-01-04 13.3
5 Med 2008-01-05 13.3
6 Med 2008-01-06 13.4
head(OISST4b_tidy)
# A tibble: 6 × 3
site t idx
<chr> <chr> <int>
1 Med 2008-01-01 1
2 Med 2008-01-02 2
3 Med 2008-01-03 3
4 Med 2008-01-04 4
5 Med 2008-01-05 5
6 Med 2008-01-06 6
Joining
Were we to join these dataframes together they would complete each other
The tidyverse provides us with several methods of doing this, but we will demonstrate here only left_join()
This function joins two dataframes together based on the matching of columns from the left to the right
It combines values together where it sees that they match up, and adds new rows and columns where they do not
Joining
left_join() will often guess what we want to do and provide a message in the Console
OISST4_tidy <-left_join(OISST4a_tidy, OISST4b_tidy, by =c("site", "t"))head(OISST4_tidy)
# A tibble: 6 × 4
site t temp idx
<chr> <chr> <dbl> <int>
1 Med 2008-01-01 13.7 1
2 Med 2008-01-02 13.7 2
3 Med 2008-01-03 13.2 3
4 Med 2008-01-04 13.3 4
5 Med 2008-01-05 13.3 5
6 Med 2008-01-06 13.4 6
But why though?
At this point one may be wondering what the point of all of this is. Sure it’s all well and good to see how to tidy one’s data in R, but couldn’t this be done more quickly and easily in Excel? Perhaps, yes, with a small dataset. But remember, (for many) the main reason we are learning R is to ensure that we are performing reproducible research. This means that every step in our workflow must be documented. And we accomplish this by writing R scripts.