Wrangling data

Robert Schlegel

Problem

What actually are tidy data?
How does one tidy their data?
Is there a basic approach to doing this?

Solution

We cover the definition for tidy data here
We are introduced to the main functions we’ll need for tidying data
We go over some tidy workflows to clarify

Setup

For these slides we will only need the tidyverse package.

library(tidyverse) # All-in-one

tidy data

So what exactly are tidy data? It is not just a a buzz word, there is a real definition. In three parts, to be exact. Taken from Hadley Wickham’s R for Data Science:

Each variable must have its own column
Each observation must have its own row
Each value must have its own cell

tidy data

Following three rules make a dataset tidy — variables are in columns, observations are in rows, and values are in cells. Reproduced from R for Data Science

tidy workflow

The tidying of ones data should be the second step in any workflow, after the loading of the data.

Data tidying in the data processing pipeline. Reproduced from R for Data Science

The basics

To illustrate the implementation of this three part definition in a workflow we are going to learn how to manipulate a non-tidy dataset into a tidy one.

We will use NOAA OISST data for the three time series from 2008 - 2009, with some notable (untidy) changes

The basics

The secondary purpose of these exercises is to illustrate that these steps may be done more quickly in R than excel. Thereby allowing for ones raw data to remain exactly how they were collected and having all of the manipulations performed on them documented in an R script. This is a centrally important part of reproducible research.

Load data

Note that this loads five different objects
Each object contains similar data in different states of disrepair
Some are easier to tidy than others

load("course_material/data/OISST_mangled.RData")

SACTN1
SACTN2
SACTN3

# Spread across two dataframes
SACTN4a
SACTN4b

`OISST1`

These are how tidy data should look
If this looks familiar, it is because they are the same data from the first exercise

head(OISST1)

# A tibble: 6 × 3
  site  t           temp
  <chr> <date>     <dbl>
1 Med   2008-01-01  13.7
2 Med   2008-01-02  13.7
3 Med   2008-01-03  13.2
4 Med   2008-01-04  13.3
5 Med   2008-01-05  13.3
6 Med   2008-01-06  13.4

`OISST1`

Simple line plot of the tidy OISST1 dataset

ggplot(data = OISST1, aes(x = t, y = temp)) +
  geom_line(aes(colour = site)) +
  labs(x = NULL, y = "Temperature (°C)", colour = "Site") +
  theme_bw()

Long vs wide

Generally, tidy data are considered long
This is because tidy dataframes tend to be longer than they are wide (with some exceptions)
The opposite of long data therefore are wide data
When one finds a dataset that is wider than long it is probably not tidy

When data are too wide

In OISST2 we can see that temperatures are placed in columns by site
This may seem like a reasonable organisation, but it is not tidy because temperature is one variable and should not take up more than one column

head(OISST2)

# A tibble: 6 × 4
  t            Med NW_Atl    WA
  <date>     <dbl>  <dbl> <dbl>
1 2008-01-01  13.7   6.23  23.0
2 2008-01-02  13.7   6.22  23.2
3 2008-01-03  13.2   5.81  23.1
4 2008-01-04  13.3   4.88  22.3
5 2008-01-05  13.3   4.42  22.2
6 2008-01-06  13.4   4.83  22.3

When data are too wide

We need to pivot_longer() these site columns back together
We do this by telling pivot_longer() what the names of the columns are we want to squish together

OISST2_tidy <- OISST2 %>%
  pivot_longer(cols = c(Med, NW_Atl, WA), 
               names_to = "site", values_to = "temp")
head(OISST2_tidy)

# A tibble: 6 × 3
  t          site    temp
  <date>     <chr>  <dbl>
1 2008-01-01 Med    13.7 
2 2008-01-01 NW_Atl  6.23
3 2008-01-01 WA     23.0 
4 2008-01-02 Med    13.7 
5 2008-01-02 NW_Atl  6.22
6 2008-01-02 WA     23.2

When data are too long

If data are too long they have individual observations in multiple rows
This happens when we have two or more variables stored within the same column

head(OISST3)

# A tibble: 6 × 4
    idx type  name        temp
  <int> <chr> <chr>      <dbl>
1     1 site  Med         13.7
2     1 t     2008-01-01  13.7
3     2 site  Med         13.7
4     2 t     2008-01-02  13.7
5     3 site  Med         13.2
6     3 t     2008-01-03  13.2

When data are too long

We use pivot_wider() to fix this issue
Note the necessity of an idx column to identify individual samples

OISST3_tidy <- OISST3 %>% 
  pivot_wider(id_cols = idx, names_from = type, values_from = name)
head(OISST3)

# A tibble: 6 × 4
    idx type  name        temp
  <int> <chr> <chr>      <dbl>
1     1 site  Med         13.7
2     1 t     2008-01-01  13.7
3     2 site  Med         13.7
4     2 t     2008-01-02  13.7
5     3 site  Med         13.2
6     3 t     2008-01-03  13.2

Separating and uniting

We’ve now covered how to make our dataframes longer or wider depending on their tidiness. Now we will look at how to manage our columns when they contain more (or less) than one variable, but the overall dataframe does not need to be made wider or longer. This is generally the case when a column has two variables, or two or more variables are spread across multiple columns, with only one observation per row.

Separate

OISST4a has site and t column replaced by index
This is an efficient way to store these data, but it is not tidy because the site and source of each observation are separate variables

head(OISST4a)

# A tibble: 6 × 2
  index           temp
  <chr>          <dbl>
1 Med 2008-01-01  13.7
2 Med 2008-01-02  13.7
3 Med 2008-01-03  13.2
4 Med 2008-01-04  13.3
5 Med 2008-01-05  13.3
6 Med 2008-01-06  13.4

Separate

Often times the separate() function is able to guess how to split text, but it is better to be explicit

OISST4a_tidy <- OISST4a %>% 
  separate(col = index, into = c("site", "t"), sep = " ")
head(OISST4a_tidy)

# A tibble: 6 × 3
  site  t           temp
  <chr> <chr>      <dbl>
1 Med   2008-01-01  13.7
2 Med   2008-01-02  13.7
3 Med   2008-01-03  13.2
4 Med   2008-01-04  13.3
5 Med   2008-01-05  13.3
6 Med   2008-01-06  13.4

Unite

We see this issue most often with date values when the year, month, and day values are given in different columns
If we look at OISST4b we will see that there is a year, month, and day column

head(OISST4b)

# A tibble: 6 × 5
  site  year  month day     idx
  <chr> <chr> <chr> <chr> <int>
1 Med   2008  01    01        1
2 Med   2008  01    02        2
3 Med   2008  01    03        3
4 Med   2008  01    04        4
5 Med   2008  01    05        5
6 Med   2008  01    06        6

Unite

There are uses for the data in this way, though it is not terribly tidy
We usually want to unite() the date of any observation to be shown in one column

OISST4b_tidy <- OISST4b %>% 
  unite(year, month, day, col = "t", sep = "-")
head(OISST4b_tidy)

# A tibble: 6 × 3
  site  t            idx
  <chr> <chr>      <int>
1 Med   2008-01-01     1
2 Med   2008-01-02     2
3 Med   2008-01-03     3
4 Med   2008-01-04     4
5 Med   2008-01-05     5
6 Med   2008-01-06     6

Joining

Remember that one of the rules of tidy data is that only one complete dataset is saved per dataframe
Looking at OISST4a_tidy and OISST4b_tidy we see that they are each missing different columns

head(OISST4a_tidy)

# A tibble: 6 × 3
  site  t           temp
  <chr> <chr>      <dbl>
1 Med   2008-01-01  13.7
2 Med   2008-01-02  13.7
3 Med   2008-01-03  13.2
4 Med   2008-01-04  13.3
5 Med   2008-01-05  13.3
6 Med   2008-01-06  13.4

head(OISST4b_tidy)

# A tibble: 6 × 3
  site  t            idx
  <chr> <chr>      <int>
1 Med   2008-01-01     1
2 Med   2008-01-02     2
3 Med   2008-01-03     3
4 Med   2008-01-04     4
5 Med   2008-01-05     5
6 Med   2008-01-06     6

Joining

Were we to join these dataframes together they would complete each other
The tidyverse provides us with several methods of doing this, but we will demonstrate here only left_join()
This function joins two dataframes together based on the matching of columns from the left to the right
It combines values together where it sees that they match up, and adds new rows and columns where they do not

Joining

left_join() will often guess what we want to do and provide a message in the Console

OISST4_tidy <- left_join(OISST4a_tidy, OISST4b_tidy)

But it is better to be explicit

OISST4_tidy <- left_join(OISST4a_tidy, OISST4b_tidy, by = c("site", "t"))
head(OISST4_tidy)

# A tibble: 6 × 4
  site  t           temp   idx
  <chr> <chr>      <dbl> <int>
1 Med   2008-01-01  13.7     1
2 Med   2008-01-02  13.7     2
3 Med   2008-01-03  13.2     3
4 Med   2008-01-04  13.3     4
5 Med   2008-01-05  13.3     5
6 Med   2008-01-06  13.4     6

But why though?

At this point one may be wondering what the point of all of this is. Sure it’s all well and good to see how to tidy one’s data in R, but couldn’t this be done more quickly and easily in Excel? Perhaps, yes, with a small dataset. But remember, (for many) the main reason we are learning R is to ensure that we are performing reproducible research. This means that every step in our workflow must be documented. And we accomplish this by writing R scripts.