Setting up R and RStudio

Before we can begin an R workshop, we need to make sure everyone has access to a computer with R and RStudio installed. The process for how to do this is detailed below, followed by a brief introduction to why we use these two pieces of software and what the difference is between them.

NB: If you are not using your own computer please make the instructor aware of this as it is assumed that all participants will be using their personal (or university/institution etc.) laptops.

Installing R

Installing R on your machine is a straightforward process. Follow these steps:

  1. Go to the CRAN (Comprehensive R Archive Network) website. If you type ‘r’ into Google, it is usually the first entry

  2. Choose to download R for Windows, Mac, or Linux

  3. For Windows users, selecting ‘base’ will link you to the download file; follow the prompts to install

  4. For Mac users, choose the version relevant to your operating system and follow the prompts after downloading

  5. If you are a Linux user, you know what to do!

Installing RStudio

Although R can run in its own console or in a Terminal (Mac and Linux; the Windows command line is a bit limiting), we will use RStudio in this Workshop. RStudio is a free front-end to R for Windows, Mac, or Linux (i.e., R is working in the background). It makes working with R easier, more organised and productive, especially for new users. There are other front-ends, but RStudio is the most popular. To install:

  1. Go to the Posit website

  2. Click the ‘Download RStudio’ button in the top right of the page

  3. Scroll down to click the ‘Download’ button under RStudio Desktop Free

  4. For Windows users click the ‘Download RStudio Desktop for Windows’ button under Step 2 on the page

  5. For all other Operating Systems, scroll down further and select the corresponding file

  6. After downloading, follow the prompts to install RStudio

Why R?

As scientists, we are increasingly driven to analyse and manipulate larger and larger datasets. As these datasets grow in size our analyses are becoming more sophisticated. There are many statistical packages on the market that one can use, but R is one of the global standards. There are several reasons for this:

The positives

  1. It is free, which is nice if you despise commercial software (e.g. Microsoft Office)

  2. It is powerful, flexible and robust; it is developed and used by leading academic statisticians

  3. It contains advanced statistical routines not yet available in other software

  4. The cutting-edge statistical routines open up scientific possibilities in creative new ways

  5. It does not depend on a ‘point and click’ interface, such as SPSS, and requires one to write scripts

  6. It has state-of-the-art graphics

  7. Users continually extend the functionality by updating existing packages and adding new ones; in fact, this entire website was written in Quarto (RStudio) and the files supporting this Workshop material can be edited on any computer using a variety of operating systems such as Mac OS X, Linux, and Microsoft Windows

It is truly amazing that such powerful and comprehensive software is freely available, and we are indebted to the developers of R for going down this path.

The negatives

Although there are many positives of using R, there are some negatives:

  1. It can have a steep learning curve for those who are not into statistics or data manipulation, and it does require frequent use to remain familiar with it and to develop advanced skills

  2. Error trapping can be confusing and frustrating

  3. Debugging is rudimentary, although there are some packages available to enhance the process

  4. It handles large datasets easily (e.g. 100 MB), but can have some trouble with massive datasets (e.g. 100 GB)

  5. Some simple tasks can be tricky to do in R

  6. There are multiple ways of doing the same thing

The challenges

The big difference between R and many other statistical packages that you might have used is that it is not, and never will be, a menu-driven ‘point and click’ software. R requires you to write your own code to tell it exactly what you want to do. This means that there is a learning curve, but this is outweighed by numerous advantages:

  1. To write new programs, you can modify your existing ones or those of others, saving you considerable time

  2. You have a record of your statistical analyses and thus can re-run your previous analyses exactly at any time in the future, even if you can’t remember what you did — this is central to reproducible research

  3. The recorded code can include the liberal use of internal documentation, which is often overlooked by practicing scientists

  4. It is more flexible at manipulating data and graphics than menu-driven software

  5. You will develop and improve your programming, which is a valuable general skill

  6. You will improve your statistical knowledge

  7. You can automate large problems

  8. You can provide and share code that underpins published analyses; more and more journals are requesting the code for analyses in papers, to increase transparency and reproducibility

  9. Integration with version control tools like Git (and hosting services such as GitHub and Bitbucket) enables online collaboration in large statistical research programmes

  10. Programming is simply heaps more fun than point-and-click!

Why RStudio?

One could, and some still do, use ‘Base R’. This is the name used to describe the R software as it is shipped from the core R team. It is run generally via a Terminal and provides very rudimentary access to help files and visuals. The Base R software hasn’t developed past this point because it is not in its mandate to do so. It is first and foremost a computer programming language. And while it is still constantly being updated and improved, it will likely never have much in the way of a user interface (UI). That is where RStudio enters the picture. It is an integrated development environment (IDE), which is more advanced than a simple UI in that it adds dozens of additional features to R that the base software itself is not capable of performing. We will not be covering most of these extended features in this workshop, but if you are interested in learning more about them feel free to ask the instructor. Though be sure that you are close to an exit door so you can escape if they don’t stop talking about all of the benefits of RStudio.

General settings

Before we start using RStudio let’s set it up properly. In the menu at the top of the RStudio software, go to: Tools (Preferences on Mac) > Global Options…. From here we have a very wide range of options for the functionality of RStudio. At the moment we will leave the general settings to their default.

Customising appearance

RStudio is highly customisable. Under the Appearance tab in the Global Options you can see all of the different themes that come with RStudio. We recommend choosing a theme with a black background (e.g. ‘Chaos’) as this will be easier on your eyes and your computer. It is also good to choose a theme with a sufficient amount of contrast between the different colours used to denote different types of objects/values in your code. Take a moment now to choose a different theme before accepting and closing the window.

The RStudio Project

A very nifty way of managing workflow in RStudio is through the built-in functionality of the RStudio Project. We do not need to install any packages or change any settings to use these. Creating a new project is a very simple task as well. For this course we will be using the R_Workshop.Rproj file you will download tomorrow with the course material so that we are all running identical projects. This will prevent a lot of issues by ensuring we are doing things by the same standard. Better yet, an RStudio Project integrates seamlessly with version control software (e.g. Git, with hosting on GitHub) and allows for instant, world class collaboration on any research project. We will cover the concepts and benefits of an RStudio Project more as we move through the course.

Installing packages

The most common functions used in R are contained within the base package; this makes R useful ‘out of the box.’ However, there is extensive additional functionality that is being expanded all the time through the use of packages. Packages are simply collections of code called functions that automate complex mathematical or statistical tasks. One of the most useful features of R is that users are continuously developing new packages and making them available for free. You can find a comprehensive list of available packages on the CRAN website. There are currently (2022-11-08) 18824 packages available for R!

If the thought of searching for and finding R packages is daunting, a good place to start is the CRAN Task Views page. This page curates collections of packages for general tasks you might encounter, such as Experimental Design, Meta-Analysis, or Multivariate Statistics. Go and have a look for yourself; you might be surprised to find a good explanation of what you need.

In the menu bar click Tools > Install Packages, type the package name tidyverse in the ‘Packages’ text box (note that it is case sensitive), and select the Install button. The Console will run the code needed to install the package, and then provide some commentary on the installation of the package and any of its dependencies (i.e., other R packages needed to run the required package).
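The same installation can be done by typing code into the Console. The sketch below wraps the base function install.packages() in a small helper so that a package is only downloaded when it is missing. The name install_if_missing is hypothetical and used purely for illustration; it is not part of R or RStudio.

```r
# Hypothetical helper (not part of R or RStudio): install a package from CRAN
# only when it is not already installed, then report whether it is available.
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # downloads the package and its dependencies
  }
  requireNamespace(pkg, quietly = TRUE)  # TRUE if the package can now be loaded
}
```

Calling install_if_missing("tidyverse") in the Console should then download the package on the first run and do nothing on later runs.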

The installation process makes sure that the functions within the packages contained within the tidyverse are now available on your computer, but to avoid potential conflicts in the names of functions, it will not load these automatically. To make R ‘know’ about these functions in a particular session, you need either to load the package via ticking the checkbox for that package in the Packages tab, or execute:

library(tidyverse)

Since we will develop the habit of doing all of our analyses from R scripts, it is best practice to simply list all of the libraries to be loaded right at the start of your script. Comments may be used to remind your future-self (to quote Hadley Wickham) what those packages are for.

Question Why is it best practice to explicitly include packages you use in your R program at the start of your script?

The panes of RStudio

RStudio has four main panes, each in a quadrant of your screen: Source Editor, Console, Environment (and History, Connections), and Plots (and Files, Packages, Help, Viewer, Presentations). The layout of these four panes can be adjusted under the Tools > Global Options…> Pane Layout menu. For now we will keep the factory default layout, but note that there might be subtle differences between RStudio installations on different operating systems. We will discuss each of the panes in turn below.

Source Editor

Generally we will want to write chunks of code longer than a few lines. The Source Editor can help you open, edit and execute the scripts that allow us to do this. To ensure that RStudio is working as expected, follow these three steps:

  1. Click the ‘New File’ icon in the top left, just under the menu bar, then select ‘R Script’. This will open a new blank tab in the Source Editor. Save this file to your desktop and close RStudio.

  2. Now make RStudio the default application to open .R files (right click on the file and set RStudio to open it as the default if it isn’t already).

  3. Double click on the file – this will open it in RStudio in the Source Editor in the top left pane.

Note that .R files are simply standard text files and can be created in any text editor and saved with a .R (or .r) extension, but the Source Editor in RStudio has the advantage of providing syntax highlighting, code completion, and smart indentation. You can see the different colours for numbers and there is also highlighting to help you count brackets.

Console

This is where you can type code that executes immediately. This is also known as the command line. Entering code in the command line is intuitive and easy. For example, we can use R as a calculator by typing into the Console (and pressing Enter after each line):

6 * 3
[1] 18
5 + 4
[1] 9
2 ^ 3
[1] 8

Note that spaces are optional around simple calculations.

We can also use the assignment operator <- to assign any calculation to a variable so we can access it later (the = sign would work, too, but it’s bad practice to use it… and we’ll talk about this as we go):

a <- 2
b <- 7
a + b
[1] 9

To type the assignment operator (<-), press the following two keys together: Alt and - (Option and - on a Mac). There are many keyboard shortcuts in RStudio and we will introduce them as we go along.

Spaces are also optional around assignment operators. It is good practice to use single spaces in your R scripts, and the alt - shortcut will do this for you automagically. Spaces are not only there to make the code more readable to the human eye, but also to the machine. Try this:

d<-2
d < -2
[1] FALSE

Note that the first line of code assigns d a value of 2, whereas the second statement asks R whether this variable has a value less than -2. When asked, it responds with FALSE. If we hadn’t used spaces, how would R have known what we meant?
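To make the difference visible, the following sketch starts d at a different value (5, chosen only for illustration) and then runs both forms. Adding parentheses around the negative number is one way to make a comparison unambiguous to both R and the reader.

```r
d <- 5        # start with a value other than 2 so the change is visible
d<-2          # no spaces: R reads "<-" as assignment, so d becomes 2
d             # prints 2
d < -2        # with a space: a comparison, "is d less than -2?" -> FALSE
d < (-2)      # parentheses make the comparison explicit -> FALSE
d             # still 2: the comparisons did not change d
```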

Another important question here is, is R case sensitive? Is A the same as a? Figure out a way to check for yourself.

We can create a vector in R by using the combine c() function:

apples <- c(5.3, 3.8, 4.5)

A vector is a one-dimensional array (e.g., a sequence of numbers), and this is the simplest form of data used in R (you can think of a single value in R as just a very short vector). We’ll talk about more complex (and therefore more powerful) types of data structures as we go along.
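A short sketch of what this means in practice, using the base functions length() and is.vector(). The object name single is made up for this example.

```r
apples <- c(5.3, 3.8, 4.5)   # the vector created above
length(apples)               # number of elements: 3
apples[2]                    # the second element: 3.8

single <- 5.3                # a single value...
is.vector(single)            # ...is itself a vector: TRUE
length(single)               # ...of length 1
```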

If you want to display the value of apples type:

apples
[1] 5.3 3.8 4.5

Finally, there are default functions in R for nearly all basic statistical analyses, including mean() and sd() (standard deviation):

mean(apples)
[1] 4.533333
sd(apples)
[1] 0.7505553
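To see what these two functions calculate, the sketch below reproduces them by hand with basic arithmetic. The names n, manual_mean, and manual_sd are chosen for illustration only. Note that sd() computes the sample standard deviation, which divides by n - 1 rather than n.

```r
apples <- c(5.3, 3.8, 4.5)
n <- length(apples)

# Mean: the sum of the values divided by how many there are
manual_mean <- sum(apples) / n

# Sample standard deviation: square root of the summed squared
# deviations from the mean, divided by n - 1
manual_sd <- sqrt(sum((apples - manual_mean)^2) / (n - 1))

all.equal(manual_mean, mean(apples))  # TRUE
all.equal(manual_sd, sd(apples))      # TRUE
```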

Variable names
It is best not to use c as the name of a value or array. Why? What other words might not be good to use?

Or try this:

round(sd(apples), 2)
[1] 0.75

Question
What did we do above? What can you conclude from those functions?

RStudio supports the automatic completion of code using the Tab key. For example, type the three letters app and then the Tab key. What happens?

The code completion feature also provides brief inline help for functions whenever possible. For example, type mean() and press the Tab key.

The RStudio Console automagically maintains a ‘history’ so that you can retrieve previous commands, a bit like your Internet browser or Google. On a blank line in the Console, press the up arrow, and see what happens.

If you wish to review a list of your recent commands and then select a command from this list you can use Ctrl+Up to review the list (Cmd+Up on the Mac). If you prefer a ‘bird’s eye’ overview of the R command history, you may also use the RStudio History pane (see below).

The Console title bar has a few useful features:

  1. It displays the current R working directory (more on this later)

  2. It provides the ability to interrupt R during a long computation (a stop sign will appear whilst code is running)

  3. It allows you to minimise and maximise the Console in relation to the Source pane (using the buttons at the top-right or by double-clicking the title bar)

Environment and History panes

The Environment pane is very useful as it shows you what objects (i.e., dataframes, arrays, values, and functions) you have in your environment (i.e. workspace). You can see the values for objects with a single value and for those that are longer R will tell you their class. When you have data in your environment that have two dimensions (rows and columns) you may click on them and they will appear in the Source Editor pane like a spreadsheet.

You can then go back to your program in the Source Editor by clicking its tab or closing the tab for the object you opened. Also in the Environment pane is the History tab, where you can see all of the code executed for the session. If you double-click a line or highlight a block of lines and then double-click those, you can send them to the Console (i.e., run them).

Typing the following into the Console will list everything you’ve loaded into the Environment:

ls()
[1] "a"        "apples"   "b"        "d"        "pkgs_lst" "url"     
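Much of what the Environment pane shows can also be queried from the Console. The sketch below uses the base functions class(), ls(), and rm(). The object tmp is a throwaway created only for this example so that nothing used elsewhere in the workshop is removed.

```r
apples <- c(5.3, 3.8, 4.5)
tmp <- 2                 # throwaway object for this example

class(apples)            # the class shown in the Environment pane: "numeric"
"tmp" %in% ls()          # TRUE: tmp is in the environment

rm(tmp)                  # remove a single object by name
"tmp" %in% ls()          # FALSE: tmp is gone, apples is untouched
```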

What do we have loaded into our environment? Did all of these objects come from one script, or more than one? How can we tell where an object was generated?

Files, Plots, Packages, Help, Viewer, and Presentation panes

The last pane has a number of different tabs. The Files tab has a navigable file manager, just like the file system on your operating system. The Plots tab is where graphics you create will appear. The Packages tab shows you the packages that are installed and those that can be installed. The Help tab allows you to search the R documentation for help and is where the help appears when you ask for it from the Console. The Viewer tab is where more advanced/interactive visuals are generated, and if one is authoring presentations in RStudio, they will be visualised in the Presentation tab.

Methods of getting help from the Console include…

?mean

…or:

help(mean)
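A few other base functions can be useful alongside the two above. This is only a small selection, offered as a sketch rather than a complete list.

```r
args(sd)            # shows the arguments sd() accepts: x and na.rm
apropos("^mean")    # lists objects whose names start with "mean"
example(mean)       # runs the examples from the mean() help page
```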

We will go into this in more detail in R Workflow - I.

Resources

Below you can find the source code to some books and other links to websites about R:

Exercise

It which shall not be named