Colours and stats in ggplot2

Robert Schlegel

Overview

Now that we have seen the basics of ggplot2, let’s take a moment to delve further into the beauty of our figures. It may sound vain at first, but the colour palette of a figure is actually very important, for two main reasons. First, a consistent colour palette looks more professional. Second, and more importantly, a good colour palette makes the information in our figures easier to understand. The communication of information to others is central to good science.

Problem

  • Standard colours don’t look great
  • Established philosophy on colour palettes
  • Plots alone do not communicate statistics

Solution

  • Using RColorBrewer gives us more control
  • Developed colour palettes are available in ggplot2
  • Statistics may be plotted directly with ggpubr

Setup

We will need the following three packages for the examples in these slides.

library(tidyverse) # Contains ggplot2

library(ggpubr) # Helps us to plot statistics on figures

library(palmerpenguins) # Contains dataset

RColorBrewer

Central to the purpose of ggplot2 is the creation of beautiful figures. For this reason there are many built-in functions that give us precise control over the colours we use, as well as additional packages that extend our options even further. The RColorBrewer package should have been installed on your computer when we installed the tidyverse, and its palettes are available through the ggplot2 functions scale_colour_brewer() and scale_colour_distiller() without loading the package separately. We will use this package for its lovely colour palettes, which may be found in ‘colorPaletteCheatsheet.pdf’.
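
If we want to look at the palettes from within R rather than in the PDF, something like the following should work. This is a minimal sketch that assumes RColorBrewer is installed; the object name spectral_5 is only for illustration.

```r
library(RColorBrewer)

# A data.frame listing every palette, its maximum number of colours,
# its category (div, qual, seq), and whether it is colourblind friendly
head(brewer.pal.info)

# Extract five hex codes from the "Spectral" palette
spectral_5 <- brewer.pal(n = 5, name = "Spectral")
spectral_5

# display.brewer.all() draws every palette in the plot window
```

The palette names listed in brewer.pal.info are the values accepted by the palette argument of scale_colour_brewer() and scale_colour_distiller().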

Continuous colour scales

ggplot(data = penguins,
       aes(x = body_mass_g, y = bill_length_mm)) +
  geom_point(aes(colour = bill_depth_mm))

Continuous colour scales

ggplot(data = penguins,
       aes(x = body_mass_g, y = bill_length_mm)) +
  geom_point(aes(colour = bill_depth_mm)) +
  # Change the continuous variable colour palette
  scale_colour_distiller() 

Continuous colour scales

ggplot(data = penguins,
       aes(x = body_mass_g, y = bill_length_mm)) +
  geom_point(aes(colour = bill_depth_mm)) +
  # Choose a pre-set palette
  scale_colour_distiller(palette = "Spectral")

Continuous colour scales

ggplot(data = penguins,
       aes(x = body_mass_g, y = bill_length_mm)) +
  geom_point(aes(colour = bill_depth_mm)) +
  # Viridis colour palette
  scale_colour_viridis_c(option = "D")

Discrete colour scales

ggplot(data = penguins,
       aes(x = body_mass_g, y = bill_length_mm)) +
  geom_point(aes(colour = as.factor(year))) +
  # The discrete colour palette function
  scale_colour_brewer()

Discrete colour scales

ggplot(data = penguins,
       aes(x = body_mass_g, y = bill_length_mm)) +
  geom_point(aes(colour = as.factor(year))) +
  # Choose a colour palette
  scale_colour_brewer(palette = "Set1")

Discrete colour scales

ggplot(data = penguins,
       aes(x = body_mass_g, y = bill_length_mm)) +
  geom_point(aes(colour = as.factor(year))) +
  # Discrete viridis colour palette
  scale_colour_viridis_d(option = "A")

Make your own palettes

This is all well and good. But didn’t we claim that this should give us complete control over our colours? So far it looks like it has just given us a few more palettes to use. And that’s nice, but it’s not ‘infinite choices’. That is where the Internet comes to our rescue. There are many places we may go to for support in this regard. The following links, in descending order, are very useful. And fun!

Make your own palettes

I find the first link the easiest to use. But the second is better at generating discrete colour palettes. During the following exercise spend several minutes playing with the different websites and decide for yourself which one(s) you like.
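
Once we have copied a set of hex codes from one of these websites, it can be handy to store them in a vector and preview them before using them in a figure. The following is a small sketch using only base R; the names my_colours and my_palette are made up for this example.

```r
# Six hex codes, for example copied from a palette website
my_colours <- c("#A5A94D", "#6FB16F", "#45B19B",
                "#59A9BE", "#9699C4", "#CA86AD")

# colorRampPalette() returns a function that interpolates
# between the colours we give it
my_palette <- colorRampPalette(my_colours)

# Ask for as many colours as we need
my_palette(10)

# Quick preview of the interpolated palette
barplot(rep(1, 10), col = my_palette(10), border = NA, axes = FALSE)
```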

Continuous palettes

ggplot(data = penguins,
       aes(x = body_mass_g, y = bill_length_mm)) +
  geom_point(aes(colour = bill_depth_mm)) +
  scale_colour_gradientn(colours = c("#A5A94D", "#6FB16F", "#45B19B",
                                    "#59A9BE", "#9699C4", "#CA86AD"))

Discrete palettes

ggplot(data = penguins,
       aes(x = body_mass_g, y = bill_length_mm)) +
  geom_point(aes(colour = as.factor(sex))) +
  # How to use custom palette
  scale_colour_manual(values = c("#A5A94D", "#9699C4"),
                      # How to change the legend text; the third label
                      # is for penguins with missing (NA) sex
                      labels = c("female", "male", "other")) + 
  # How to change the legend title
  labs(colour = "Sex") 

Plotting stats

  • ggpubr contains functions to seamlessly plot statistics on our figures
  • compare_means() compares means of two (t-test) or more (ANOVA) groups of values
  • stat_compare_means() is designed to be integrated directly into ggplot2 code

compare_means()

# t-test
compare_means(bill_length_mm ~ sex, data = penguins, method = "t.test")
# A tibble: 1 × 8
  .y.            group1 group2        p         p.adj p.format p.signif method
  <chr>          <chr>  <chr>     <dbl>         <dbl> <chr>    <chr>    <chr> 
1 bill_length_mm female male   1.07e-10 0.00000000011 1.1e-10  ****     T-test
# ANOVA
compare_means(bill_length_mm ~ species, data = penguins, method = "anova")
# A tibble: 1 × 6
  .y.                   p   p.adj p.format p.signif method
  <chr>             <dbl>   <dbl> <chr>    <chr>    <chr> 
1 bill_length_mm 2.69e-91 2.7e-91 <2e-16   ****     Anova 
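
For comparison, the same two tests can be run with base R functions. This is a sketch only; the object names t_res and aov_res are made up here, and the results should correspond to the compare_means() output rather than being identical in format.

```r
library(palmerpenguins)

# Welch two-sample t-test of bill length between the sexes
t_res <- t.test(bill_length_mm ~ sex, data = penguins)
t_res$p.value

# One-way ANOVA of bill length between species
aov_res <- aov(bill_length_mm ~ species, data = penguins)
summary(aov_res)
```

The advantage of compare_means() is that it returns the results as a tidy tibble, which is easier to reuse in later steps.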

stat_*() vs geom_*()

In the previous slides we have seen how to add dots, lines, etc. to graphs using functions that look like geom_*(). But we can also add statistical calculations and other non-graphical elements to plots using functions that look like stat_*().
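
To make the difference concrete, here is a small sketch that combines one of each. It uses stat_summary(), a ggplot2 function not otherwise covered in these slides, and the object name p is only for illustration.

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(data = penguins, aes(x = species, y = bill_length_mm)) +
  # geom_*(): draws the raw data as they are
  geom_jitter(width = 0.2, colour = "grey60") +
  # stat_*(): calculates the mean per species, then draws it as a point
  stat_summary(fun = mean, geom = "point", colour = "red", size = 4)
p
```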

stat_compare_means()

ggplot(data = penguins, aes(x = species, y = bill_length_mm)) +
  geom_boxplot(aes(fill = species), show.legend = FALSE) +
  stat_compare_means(method = "anova")

stat_compare_means()

ggplot(data = penguins, aes(x = species, y = bill_length_mm)) +
  geom_boxplot(aes(fill = species), show.legend = FALSE) +
  stat_compare_means(method = "anova", 
                     # after_stat() replaces the deprecated ..p.format.. notation
                     aes(label = paste0("p ", after_stat(p.format))), 
                     label.x = 2) +
  theme_bw()

Further applications

The ggpubr functions may also be used with paired tests, non-parametric tests, and multiple mean tests. Their results have matching stat_*() layers in ggplot2 code and so may be visualised with relative ease. The following is an example of how to do this with a multiple mean (ANOVA) test.

Multiple means

# First create a list of comparisons to feed into our figure
penguins_levels <- levels(penguins$species)
my_comparisons <- list(c(penguins_levels[1], penguins_levels[2]), 
                       c(penguins_levels[2], penguins_levels[3]),
                       c(penguins_levels[1], penguins_levels[3]))

# Then we stack it all together
ggplot(data = penguins, aes(x = species, y = bill_length_mm)) +
  geom_boxplot(aes(fill = species), colour = "grey40", show.legend = FALSE) +
  stat_compare_means(method = "anova", colour = "grey50",
                     label.x = 1.8, label.y = 32) +
  # Add pairwise comparison p-values (Wilcoxon test by default)
  stat_compare_means(comparisons = my_comparisons,
                     label.y = c(62, 64, 66)) +
  # Perform t-tests between each group and the overall mean
  stat_compare_means(label = "p.signif", 
                     method = "t.test",
                     ref.group = ".all.") + 
  # Add horizontal line at base mean
  geom_hline(yintercept = mean(penguins$bill_length_mm, na.rm = TRUE), 
             linetype = 2) + 
  labs(y = "Bill length (mm)", x = NULL) +
  theme_bw()
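
Non-parametric tests

A non-parametric version follows the same pattern. The sketch below swaps the ANOVA for a Kruskal-Wallis test via method = "kruskal.test"; the object names kw_res and kw_plot are made up for this example.

```r
library(ggplot2)
library(ggpubr)
library(palmerpenguins)

# Kruskal-Wallis test as a tidy tibble
kw_res <- compare_means(bill_length_mm ~ species, data = penguins,
                        method = "kruskal.test")
kw_res

# The same test plotted directly on the figure
kw_plot <- ggplot(data = penguins, aes(x = species, y = bill_length_mm)) +
  geom_boxplot(aes(fill = species), show.legend = FALSE) +
  stat_compare_means(method = "kruskal.test", label.x = 2) +
  theme_bw()
kw_plot
```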