Local data

Robert Schlegel

Problem

  • What are the most common complex file types?
  • Are there tools to help us with this?
  • How do we use them?

Solution

  • We focus on the most common complex file types here
  • We look at the R packages designed for each of them
  • A few example files are loaded along the way

Setup

  • We will look at a few packages that help us load common file types.
  • NB: Some of these packages overwrite functions from others
  • We deal with this by calling functions explicitly from their packages
  • e.g. dplyr::select()
library(tidyverse) # All-in-one

library(raster) # For working with raster files

library(sf) # For most of our spatial data needs

library(sfheaders) # A bit more help

library(tidync) # For working with NetCDF files

Rasters

  • Raster files store evenly gridded data
  • These are generally used to show a value over a surface
  • Common file types are: .asc or .tif, but there are others
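What "evenly gridded" means in practice: one value per cell of a regular lon/lat grid. A rough base-R sketch (made-up values, no raster package needed) of the long lon/lat/value form that a single-layer raster reduces to:

```r
# Toy example: a 3 x 3 grid at 0.25 degree resolution, in the
# long (lon, lat, value) form that as.data.frame(raster, xy = TRUE)
# produces for real files
grid_df <- expand.grid(lon = seq(0.125, 0.625, by = 0.25),
                       lat = seq(-0.625, -0.125, by = 0.25))
grid_df$temp <- runif(nrow(grid_df), min = 20, max = 30)
nrow(grid_df)  # 9: one row per pixel
```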

Rasters

  • Download and unzip .asc file from Bio-oracle to course_material/data/
# Load with one function
temp_mean_2100 <- raster("course_material/data/2100AOGCM.RCP85.Surface.Temperature.Mean.asc.BOv2_1.asc")

Rasters

  • Single layer raster files can be converted rather easily
# Convert to data.frame
temp_mean_2100_df <- as.data.frame(temp_mean_2100, xy = TRUE)

# Manually change column names - assignment is by position, so order matters
colnames(temp_mean_2100_df) <- c("lon", "lat", "temp")
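The caution above is because `colnames()<-` assigns names purely by position: a reordered or unexpected column silently gets the wrong name. A toy base-R sketch of the safe pattern:

```r
# colnames()<- assigns by position, so guard the column count first
df <- data.frame(a = 1:2, b = 3:4, c = 5:6)
stopifnot(ncol(df) == 3)            # guard: expected number of columns
colnames(df) <- c("lon", "lat", "temp")
names(df)  # "lon" "lat" "temp"
```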

Rasters

  • Then we can use them in ggplot2 as normal
# Pre-filter raster pixels for faster plotting
temp_mean_2100_df %>% 
  filter(between(lon, 90, 160),
         between(lat, -30, 30)) %>% 
  ggplot(aes(x = lon, y = lat)) +
  geom_raster(aes(fill = temp)) +
  borders() +
  scale_fill_viridis_c() +
  coord_quickmap(expand = FALSE,
                 xlim = c(90, 160), ylim = c(-30, 30))

Shape files

  • These files are designed to store polygon shapes
  • They are often distributed as a bundle of several dependent files
  • The main file we usually want has the .shp extension
  • Note that there are many sites that provide these sorts of files
  • While global products exist, locally created files are usually better

Shape files

  • Download the GSHHG products, create a new folder GSHHG in course_material/data/ and unzip the files there
  • NB: This is a very beefy file
# Load shapefile
coastline_full <- read_sf("course_material/data/GSHHG/GSHHS_shp/f/GSHHS_f_L1.shp")

# Convert to data.frame
coastline_full_df <- sf_to_df(coastline_full, fill = TRUE)

Shape files

  • Once converted they work in ggplot2 as normal
# Filter to Kongsfjorden and plot
# NB: filter much wider than necessary to ensure
# that you get enough of the polygon to avoid issues
coastline_full_df %>% 
  filter(between(x, 6, 13),
         between(y, 78, 80)) %>% 
  ggplot(aes(x = x, y = y)) +
  geom_polygon(aes(group = id), 
               fill = "grey70", colour = "black") +
  coord_quickmap(expand = FALSE,
                 xlim = c(11, 12.6), ylim = c(78.88, 79.05))

NetCDF

  • These files can hold any number of things
  • Usually it will be large model or satellite datasets
  • While they can seem intimidating, NetCDF is an excellent data storage format and it is not going away
  • Download the most recent day of NOAA OISST data and place in course_material/data/

NetCDF

  • Data are now relatively easy to load directly as tidy data thanks to tidync
  • But the resulting data.frames can still be complex or unclear
sst_NOAA_recent <- tidync("course_material/data/oisst-avhrr-v02r01.20221121_preliminary.nc") %>% 
  # Use this to convert to a tibble (i.e. fancy data.frame)
  hyper_tibble()
head(sst_NOAA_recent)
# A tibble: 6 × 8
    sst    anom   err   ice   lon   lat  zlev  time
  <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 -1.80 -0.0600 0.300    NA  166. -78.4     0 16395
2 -1.79 -0.0500 0.300    NA  166. -78.4     0 16395
3 -1.80 -0.0500 0.310    NA  166. -78.4     0 16395
4 -1.80 -0.0500 0.310    NA  167. -78.4     0 16395
5 -1.80 -0.0400 0.310    NA  167. -78.4     0 16395
6 -1.80 -0.0400 0.310    NA  167. -78.4     0 16395

NetCDF

  • Often we need to investigate a NetCDF file to understand what is inside it
tidync("course_material/data/oisst-avhrr-v02r01.20221121_preliminary.nc")

Data Source (1): oisst-avhrr-v02r01.20221121_preliminary.nc ...

Grids (5) <dimension family> : <associated variables> 

[1]   D3,D2,D1,D0 : sst, anom, err, ice    **ACTIVE GRID** ( 1036800  values per variable)
[2]   D0          : time
[3]   D1          : zlev
[4]   D2          : lat
[5]   D3          : lon

Dimensions 4 (all active): 
  
  dim   name  length       min     max start count     dmin   dmax unlim coord…¹ 
  <chr> <chr>  <dbl>     <dbl>   <dbl> <int> <int>    <dbl>  <dbl> <lgl> <lgl>   
1 D0    time       1 16395     16395       1     1  1.64e+4 1.64e4 TRUE  TRUE    
2 D1    zlev       1     0         0       1     1  0       0      FALSE TRUE    
3 D2    lat      720   -89.9      89.9     1   720 -8.99e+1 8.99e1 FALSE TRUE    
4 D3    lon     1440     0.125   360.      1  1440  1.25e-1 3.60e2 FALSE TRUE    
# … with abbreviated variable name ¹​coord_dim 

NetCDF

  • Let’s look at the full report
print(ncdf4::nc_open("course_material/data/oisst-avhrr-v02r01.20221121_preliminary.nc"))
File ../data/oisst-avhrr-v02r01.20221121_preliminary.nc (NC_FORMAT_NETCDF4):

     4 variables (excluding dimension variables):
        short sst[lon,lat,zlev,time]   (Chunking: [1440,720,1,1])  (Compression: shuffle,level 4)
            long_name: Daily sea surface temperature
            units: Celsius
            _FillValue: -999
            add_offset: 0
            scale_factor: 0.00999999977648258
            valid_min: -300
            valid_max: 4500
        short anom[lon,lat,zlev,time]   (Chunking: [1440,720,1,1])  (Compression: shuffle,level 4)
            long_name: Daily sea surface temperature anomalies
            units: Celsius
            _FillValue: -999
            add_offset: 0
            scale_factor: 0.00999999977648258
            valid_min: -1200
            valid_max: 1200
        short err[lon,lat,zlev,time]   (Chunking: [1440,720,1,1])  (Compression: shuffle,level 4)
            long_name: Estimated error standard deviation of analysed_sst
            units: Celsius
            _FillValue: -999
            add_offset: 0
            scale_factor: 0.00999999977648258
            valid_min: 0
            valid_max: 1000
        short ice[lon,lat,zlev,time]   (Chunking: [1440,720,1,1])  (Compression: shuffle,level 4)
            long_name: Sea ice concentration
            units: %
            _FillValue: -999
            add_offset: 0
            scale_factor: 0.00999999977648258
            valid_min: 0
            valid_max: 100

     4 dimensions:
        time  Size:1   *** is unlimited *** 
            long_name: Center time of the day
            units: days since 1978-01-01 12:00:00
        zlev  Size:1 
            long_name: Sea surface height
            units: meters
            positive: down
            actual_range: 0, 0
        lat  Size:720 
            long_name: Latitude
            units: degrees_north
            grids: Uniform grid from -89.875 to 89.875 by 0.25
        lon  Size:1440 
            long_name: Longitude
            units: degrees_east
            grids: Uniform grid from 0.125 to 359.875 by 0.25

    37 global attributes:
        Conventions: CF-1.6, ACDD-1.3
        title: NOAA/NCEI 1/4 Degree Daily Optimum Interpolation Sea Surface Temperature (OISST) Analysis, Version 2.1 - Inter
        references: Reynolds, et al.(2007) Daily High-Resolution-Blended Analyses for Sea Surface Temperature (available at https://doi.org/10.1175/2007JCLI1824.1). Banzon, et al.(2016) A long-term record of blended satellite and in situ sea-surface temperature for climate monitoring, modeling and environmental studies (available at https://doi.org/10.5194/essd-8-165-2016). Huang et al. (2020) Improvements of the Daily Optimum Interpolation Sea Surface Temperature (DOISST) Version v02r01, submitted.Climatology is based on 1971-2000 OI.v2 SST. Satellite data: Pathfinder AVHRR SST, Navy AVHRR SST, and NOAA ACSPO SST. Ice data: NCEP Ice and GSFC Ice.
        source: ICOADS, NCEP_GTS, GSFC_ICE, NCEP_ICE, Pathfinder_AVHRR, Navy_AVHRR, NOAA_ACSP
        id: oisst-avhrr-v02r01.20221121_preliminary.nc
        naming_authority: gov.noaa.ncei
        summary: NOAAs 1/4-degree Daily Optimum Interpolation Sea Surface Temperature (OISST) (sometimes referred to as Reynolds SST, which however also refers to earlier products at different resolution), currently available as version v02r01, is created by interpolating and extrapolating SST observations from different sources, resulting in a smoothed complete field. The sources of data are satellite (AVHRR) and in situ platforms (i.e., ships and buoys), and the specific datasets employed may change over time. At the marginal ice zone, sea ice concentrations are used to generate proxy SSTs.  A preliminary version of this file is produced in near-real time (1-day latency), and then replaced with a final version after 2 weeks. Note that this is the AVHRR-ONLY DOISST, available from Oct 1981, but there is a companion DOISST product that includes microwave satellite data, available from June 2002
        cdm_data_type: Grid
        history: Final file created using preliminary as first guess, and 3 days of AVHRR data. Preliminary uses only 1 day of AVHRR data.
        date_modified: 2022-11-22T09:02:00Z
        date_created: 2022-11-22T09:02:00Z
        product_version: Version v02r01
        processing_level: NOAA Level 4
        institution: NOAA/National Centers for Environmental Information
        creator_url: https://www.ncei.noaa.gov/
        creator_email: oisst-help@noaa.gov
        keywords: Earth Science > Oceans > Ocean Temperature > Sea Surface Temperature
        keywords_vocabulary: Global Change Master Directory (GCMD) Earth Science Keywords
        platform: Ships, buoys, Argo floats, MetOp-A, MetOp-B
        platform_vocabulary: Global Change Master Directory (GCMD) Platform Keywords
        instrument: Earth Remote Sensing Instruments > Passive Remote Sensing > Spectrometers/Radiometers > Imaging Spectrometers/Radiometers > AVHRR > Advanced Very High Resolution Radiometer
        instrument_vocabulary: Global Change Master Directory (GCMD) Instrument Keywords
        standard_name_vocabulary: CF Standard Name Table (v40, 25 January 2017)
        geospatial_lat_min: -90
        geospatial_lat_max: 90
        geospatial_lon_min: 0
        geospatial_lon_max: 360
        geospatial_lat_units: degrees_north
        geospatial_lat_resolution: 0.25
        geospatial_lon_units: degrees_east
        geospatial_lon_resolution: 0.25
        time_coverage_start: 2022-11-21T00:00:00Z
        time_coverage_end: 2022-11-21T23:59:59Z
        metadata_link: https://doi.org/10.25921/RE9P-PT57
        ncei_template_version: NCEI_NetCDF_Grid_Template_v2.0
        comment: Data was converted from NetCDF-3 to NetCDF-4 format with metadata updates in November 2017.
        sensor: Thermometer, AVHRR
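The short variables in the report above are packed integers. tidync and ncdf4 unpack them for us, but the arithmetic is worth knowing: value = packed * scale_factor + add_offset, with _FillValue becoming NA. A base-R sketch with made-up packed values (the function name unpack is ours, not part of any package):

```r
# Unpack short-integer NetCDF values, assuming the linear packing
# scheme reported by nc_open() above
unpack <- function(packed, scale_factor, add_offset, fill_value) {
  out <- packed * scale_factor + add_offset
  out[packed == fill_value] <- NA
  out
}
unpack(c(-180, 1250, -999), scale_factor = 0.01,
       add_offset = 0, fill_value = -999)
# -1.8 12.5 NA
```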

NetCDF

  • We can see that the time column is in units: days since 1978-01-01 12:00:00
  • With that we can finish tidying our data
sst_NOAA_tidy <- sst_NOAA_recent %>% 
  mutate(t = as.Date(time, origin = "1978-01-01 12:00:00"),
         lon = ifelse(lon > 180, lon-360, lon))
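The two conversions above can be checked in isolation with base R, using the time value 16395 seen in the tibble earlier:

```r
# Days since the 1978-01-01 origin become a calendar date
as.Date(16395, origin = "1978-01-01")
# [1] "2022-11-21"

# And 0-360 longitudes become -180 to 180
lon_0360 <- c(0.125, 179.875, 180.125, 359.875)
ifelse(lon_0360 > 180, lon_0360 - 360, lon_0360)
# [1]    0.125  179.875 -179.875   -0.125
```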

NetCDF

  • And then let’s plot it!
ggplot(sst_NOAA_tidy, aes(x = lon, y = lat)) +
  geom_tile(aes(fill = anom)) +
  borders(fill = "grey70", colour = NA) +
  # na.omit() keeps only rows where ice is not NA,
  # i.e. the ice-covered pixels, which we paint a flat colour
  geom_tile(data = na.omit(sst_NOAA_tidy), fill = "lightskyblue") +
  scale_fill_gradient2(low = "blue", high = "red") +
  coord_quickmap(expand = FALSE)