Wrangling data

Robert Schlegel

Problem

  • What actually are tidy data?
  • How does one tidy their data?
  • Is there a basic approach to doing this?

Solution

  • We cover the definition for tidy data here
  • We are introduced to the main functions we’ll need for tidying data
  • We go over some tidy workflows to clarify

Setup

For these slides we will only need the tidyverse package.

library(tidyverse) # All-in-one

tidy data

So what exactly are tidy data? It is not just a a buzz word, there is a real definition. In three parts, to be exact. Taken from Hadley Wickham’s R for Data Science:

  1. Each variable must have its own column
  2. Each observation must have its own row
  3. Each value must have its own cell

tidy data

Following three rules make a dataset tidy — variables are in columns, observations are in rows, and values are in cells. Reproduced from R for Data Science

tidy workflow

The tidying of ones data should be the second step in any workflow, after the loading of the data.

Data tidying in the data processing pipeline. Reproduced from R for Data Science

The basics

To illustrate the implementation of this three part definition in a workflow we are going to learn how to manipulate a non-tidy dataset into a tidy one.

We will use NOAA OISST data for the three time series from 2008 - 2009, with some notable (untidy) changes

The basics

The secondary purpose of these exercises is to illustrate that these steps may be done more quickly in R than excel. Thereby allowing for ones raw data to remain exactly how they were collected and having all of the manipulations performed on them documented in an R script. This is a centrally important part of reproducible research.

Load data

  • Note that this loads five different objects
  • Each object contains similar data in different states of disrepair
  • Some are easier to tidy than others
load("course_material/data/OISST_mangled.RData")
SACTN1
SACTN2
SACTN3

# Spread across two dataframes
SACTN4a
SACTN4b

OISST1

  • These are how tidy data should look
  • If this looks familiar, it is because they are the same data from the first exercise
head(OISST1)
# A tibble: 6 × 3
  site  t           temp
  <chr> <date>     <dbl>
1 Med   2008-01-01  13.7
2 Med   2008-01-02  13.7
3 Med   2008-01-03  13.2
4 Med   2008-01-04  13.3
5 Med   2008-01-05  13.3
6 Med   2008-01-06  13.4

OISST1

  • Simple line plot of the tidy OISST1 dataset
ggplot(data = OISST1, aes(x = t, y = temp)) +
  geom_line(aes(colour = site)) +
  labs(x = NULL, y = "Temperature (°C)", colour = "Site") +
  theme_bw()

Long vs wide

  • Generally, tidy data are considered long
  • This is because tidy dataframes tend to be longer than they are wide (with some exceptions)
  • The opposite of long data therefore are wide data
  • When one finds a dataset that is wider than long it is probably not tidy

When data are too wide

  • In OISST2 we can see that temperatures are placed in columns by site
  • This may seem like a reasonable organisation, but it is not tidy because temperature is one variable and should not take up more than one column
head(OISST2)
# A tibble: 6 × 4
  t            Med NW_Atl    WA
  <date>     <dbl>  <dbl> <dbl>
1 2008-01-01  13.7   6.23  23.0
2 2008-01-02  13.7   6.22  23.2
3 2008-01-03  13.2   5.81  23.1
4 2008-01-04  13.3   4.88  22.3
5 2008-01-05  13.3   4.42  22.2
6 2008-01-06  13.4   4.83  22.3

When data are too wide

  • We need to pivot_longer() these site columns back together
  • We do this by telling pivot_longer() what the names of the columns are we want to squish together
OISST2_tidy <- OISST2 %>%
  pivot_longer(cols = c(Med, NW_Atl, WA), 
               names_to = "site", values_to = "temp")
head(OISST2_tidy)
# A tibble: 6 × 3
  t          site    temp
  <date>     <chr>  <dbl>
1 2008-01-01 Med    13.7 
2 2008-01-01 NW_Atl  6.23
3 2008-01-01 WA     23.0 
4 2008-01-02 Med    13.7 
5 2008-01-02 NW_Atl  6.22
6 2008-01-02 WA     23.2 

When data are too long

  • If data are too long they have individual observations in multiple rows
  • This happens when we have two or more variables stored within the same column
head(OISST3)
# A tibble: 6 × 4
    idx type  name        temp
  <int> <chr> <chr>      <dbl>
1     1 site  Med         13.7
2     1 t     2008-01-01  13.7
3     2 site  Med         13.7
4     2 t     2008-01-02  13.7
5     3 site  Med         13.2
6     3 t     2008-01-03  13.2

When data are too long

  • We use pivot_wider() to fix this issue
  • Note the necessity of an idx column to identify individual samples
OISST3_tidy <- OISST3 %>% 
  pivot_wider(id_cols = idx, names_from = type, values_from = name)
head(OISST3)
# A tibble: 6 × 4
    idx type  name        temp
  <int> <chr> <chr>      <dbl>
1     1 site  Med         13.7
2     1 t     2008-01-01  13.7
3     2 site  Med         13.7
4     2 t     2008-01-02  13.7
5     3 site  Med         13.2
6     3 t     2008-01-03  13.2

Separating and uniting

We’ve now covered how to make our dataframes longer or wider depending on their tidiness. Now we will look at how to manage our columns when they contain more (or less) than one variable, but the overall dataframe does not need to be made wider or longer. This is generally the case when a column has two variables, or two or more variables are spread across multiple columns, with only one observation per row.

Separate

  • OISST4a has site and t column replaced by index
  • This is an efficient way to store these data, but it is not tidy because the site and source of each observation are separate variables
head(OISST4a)
# A tibble: 6 × 2
  index           temp
  <chr>          <dbl>
1 Med 2008-01-01  13.7
2 Med 2008-01-02  13.7
3 Med 2008-01-03  13.2
4 Med 2008-01-04  13.3
5 Med 2008-01-05  13.3
6 Med 2008-01-06  13.4

Separate

  • Often times the separate() function is able to guess how to split text, but it is better to be explicit
OISST4a_tidy <- OISST4a %>% 
  separate(col = index, into = c("site", "t"), sep = " ")
head(OISST4a_tidy)
# A tibble: 6 × 3
  site  t           temp
  <chr> <chr>      <dbl>
1 Med   2008-01-01  13.7
2 Med   2008-01-02  13.7
3 Med   2008-01-03  13.2
4 Med   2008-01-04  13.3
5 Med   2008-01-05  13.3
6 Med   2008-01-06  13.4

Unite

  • We see this issue most often with date values when the year, month, and day values are given in different columns
  • If we look at OISST4b we will see that there is a year, month, and day column
head(OISST4b)
# A tibble: 6 × 5
  site  year  month day     idx
  <chr> <chr> <chr> <chr> <int>
1 Med   2008  01    01        1
2 Med   2008  01    02        2
3 Med   2008  01    03        3
4 Med   2008  01    04        4
5 Med   2008  01    05        5
6 Med   2008  01    06        6

Unite

  • There are uses for the data in this way, though it is not terribly tidy
  • We usually want to unite() the date of any observation to be shown in one column
OISST4b_tidy <- OISST4b %>% 
  unite(year, month, day, col = "t", sep = "-")
head(OISST4b_tidy)
# A tibble: 6 × 3
  site  t            idx
  <chr> <chr>      <int>
1 Med   2008-01-01     1
2 Med   2008-01-02     2
3 Med   2008-01-03     3
4 Med   2008-01-04     4
5 Med   2008-01-05     5
6 Med   2008-01-06     6

Joining

  • Remember that one of the rules of tidy data is that only one complete dataset is saved per dataframe
  • Looking at OISST4a_tidy and OISST4b_tidy we see that they are each missing different columns
head(OISST4a_tidy)
# A tibble: 6 × 3
  site  t           temp
  <chr> <chr>      <dbl>
1 Med   2008-01-01  13.7
2 Med   2008-01-02  13.7
3 Med   2008-01-03  13.2
4 Med   2008-01-04  13.3
5 Med   2008-01-05  13.3
6 Med   2008-01-06  13.4
head(OISST4b_tidy)
# A tibble: 6 × 3
  site  t            idx
  <chr> <chr>      <int>
1 Med   2008-01-01     1
2 Med   2008-01-02     2
3 Med   2008-01-03     3
4 Med   2008-01-04     4
5 Med   2008-01-05     5
6 Med   2008-01-06     6

Joining

  • Were we to join these dataframes together they would complete each other
  • The tidyverse provides us with several methods of doing this, but we will demonstrate here only left_join()
  • This function joins two dataframes together based on the matching of columns from the left to the right
  • It combines values together where it sees that they match up, and adds new rows and columns where they do not

Joining

  • left_join() will often guess what we want to do and provide a message in the Console
OISST4_tidy <- left_join(OISST4a_tidy, OISST4b_tidy)
  • But it is better to be explicit
OISST4_tidy <- left_join(OISST4a_tidy, OISST4b_tidy, by = c("site", "t"))
head(OISST4_tidy)
# A tibble: 6 × 4
  site  t           temp   idx
  <chr> <chr>      <dbl> <int>
1 Med   2008-01-01  13.7     1
2 Med   2008-01-02  13.7     2
3 Med   2008-01-03  13.2     3
4 Med   2008-01-04  13.3     4
5 Med   2008-01-05  13.3     5
6 Med   2008-01-06  13.4     6

But why though?

At this point one may be wondering what the point of all of this is. Sure it’s all well and good to see how to tidy one’s data in R, but couldn’t this be done more quickly and easily in Excel? Perhaps, yes, with a small dataset. But remember, (for many) the main reason we are learning R is to ensure that we are performing reproducible research. This means that every step in our workflow must be documented. And we accomplish this by writing R scripts.