dynamicSDM offers novel tools for developing species geographical distribution and abundance modelling (SDM) at high spatiotemporal resolution. Across the dynamicSDM vignettes, we demonstrate package functions on a case study species and guide users through the four key species distribution modelling stages:
Before starting this tutorial, the dynamicSDM package needs to be installed. You can install the latest version from GitHub with:
or from CRAN with
Due to the utilisation of Google Earth Engine and Google Drive across
dynamicSDM functions, you must also set up a free Google
account at Google.
Once you have your Google log-in details, these can be used to authorise
Google Earth Engine in the rgee
package and Google Drive in
the googledrive
package.
As we continue through the species distribution modelling framework, functions will generate a range of outputs. To keep organised, we recommend a structured directory system. The code below will create a new project directory in your temporary folders, but you can edit the script to specify your home directory if you want to retain function outputs between R sessions.
In this four stage tutorial, we will be studying the case study species, the red-billed quelea (Quelea quelea), a nomadic bird inhabiting sub-Saharan Africa. With highly variable distributions driven by short-term weather and resource availability, dynamicSDM provides key functions to accurately model quelea’s distribution through space and time.
In this tutorial, we will be processing a sample of species occurrence records for the red-billed quelea (Quelea quelea) to generate an SDM response variable data frame. This will involve filtering records by spatio-temporal quality, extent and resolution; testing and accounting for spatio-temporal biases in records, and generating pseudo-absences through space and time.
A sample of red-billed quelea occurrence data can be loaded into your local R environment using the code below.
These records are formatted as exported from the Global Biodiversity Information Facility, with columns labelled “decimalLongitude” and “decimalLatitude”. For dynamicSDM functions, occurrence data must be in a specific format; containing numeric data in columns labelled “x” and “y” for the record co-ordinates longitude and latitude, and “day”, “month” and “year” for the record’s date.
convert_gbif()
can be used to convert between these
formats, or this can be achieved manually for data not extracted from
GBIF.
Species occurrence records may contain anomalous or missing values in
the co-ordinates or dates given. spatiotemp_check()
can be
used to exclude records with missing (NA) values (argument
na.handle
), invalid co-ordinates
(coord.handle
), invalid dates (date.handle
)
and duplicate records (duplicate.handle
).
Here, we specify date.res = "day"
to check all date
columns from year to day. If you only want to check the year of record
dates, then you could specify date.res = "year"
.
sample_occ_filtered <- spatiotemp_check(occ.data = sample_occ,
na.handle = "exclude",
date.handle = "exclude",
date.res = "day",
coord.handle = "exclude",
duplicate.handle = "exclude")
#> omitting any species records with coordinates containing NA
#> omitting any species records with dates containing NA
#> omitting any duplicate records
#> any records with invalid co-ordinates excluded
#> any records with invalid dates excluded
nrow(sample_occ_filtered)
#> [1] 311
Additionally, users can opt to take advantage of the
CoordinateCleaner
package functions to exclude records
flagged as containing potentially anomalous co-ordinates, based on a
wider variety of spatial tests than dynamicSDM alone
offers.
When generating dynamic species distribution models, it is important
that occurrence records match the spatio-temporal extent of the study
and any remote-sensing datasets to be utilised as explanatory variables.
spatiotemporal_extent()
removes any records outside of the
set extents.
In this case study, we want to model the subspecies Q. q. lathamii which is typically limited to southern Africa and we will incorporate remote sensing datasets available for the years 2001 to 2019. To filter records to these extents, we use a polygon of southern African countries, which can be read into your R environment and applied using the following code.
data("sample_extent_data")
sample_occ_cropped <- spatiotemp_extent(occ.data = sample_occ_filtered,
temporal.ext = c("2001-01-01", "2018-12-01"),
spatial.ext = terra::ext(sample_extent_data),
prj = "+proj=longlat +datum=WGS84 +no_defs")
## Lets plot the change
#terra::plot(sample_extent_data$geometry)
terra::plot(terra::vect(sample_occ_filtered[, c("x", "y")],
geom = c("x", "y"),
crs = "+proj=longlat +datum=WGS84 +no_defs"))
#terra::plot(sample_extent_data$geometry)
terra::plot(terra::vect(sample_occ_cropped[, c("x", "y")],
geom = c("x", "y"),
crs = "+proj=longlat +datum=WGS84 +no_defs"))
The spatial resolution of occurrence record co-ordinates and the
temporal resolution of occurrence record dates are also important when
generating dynamic SDMs. spatiotemp_resolution()
can filter
records by set thresholds.
As quelea migrations are driven by highly local and short-term environmental factors, we will filter to exclude records not given to the specific day and co-ordinates to four decimal places.
For many reasons, the collection of species occurrence records are
biased through space and time. These biases can impact SDM performance
by over- or under- representing conditions at a given site or time.
spatiotemp_bias()
returns the results of statistical tests
and plots for visual assessment of such biases. The code below tests for
bias across the entire dataset and species range.
bias_results <- spatiotemp_bias(occ.data = sample_occ_cropped,
temporal.level = c("year"),
plot = FALSE,
spatial.method = "simple",
prj = "+proj=longlat +datum=WGS84")
bias_results
#> $Temporal_bias_year
#>
#> Chi-squared test for given probabilities
#>
#> data: occ.data.frequency
#> X-squared = 50.451, df = 15, p-value = 1.016e-05
#>
#>
#> $Spatial_bias
#>
#> Welch Two Sample t-test
#>
#> data: min.nndist.actual and min.nndist.random
#> t = -6.7637, df = 255.6, p-value = 9.111e-11
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> -71605.91 -39311.61
#> sample estimates:
#> mean of x mean of y
#> 45486.12 100944.88
Whereas, this code will test for bias across only the core of the species range.
bias_results <- spatiotemp_bias(occ.data = sample_occ_cropped,
temporal.level = c("year"),
plot = FALSE,
spatial.method = "core",
prj = "+proj=longlat +datum=WGS84")
#> Warning: [mask] CRS do not match
bias_results
#> $Temporal_bias_year
#>
#> Chi-squared test for given probabilities
#>
#> data: occ.data.frequency
#> X-squared = 50.451, df = 15, p-value = 1.016e-05
#>
#>
#> $Spatial_bias
#>
#> Welch Two Sample t-test
#>
#> data: min.nndist.actual and min.nndist.random
#> t = 0.30771, df = 171.05, p-value = 0.7587
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> -11721.49 16050.79
#> sample estimates:
#> mean of x mean of y
#> 45486.12 43321.47
Both approaches can generate important results. The latter may be a more reliable indicator if the species range is expanding or shifting in response to global change. This is because records at the range periphery that have been collected randomly may still not be evenly distributed through space and time due to underlying ecological processes. Therefore, testing for biases on the species range core may be more representative of true spatial and temporal sampling biases.
One approach to correct spatio-temporal sampling biases are to “thin”
records, which involves excluding records that have been collected too
closely in space or time. spatiotemp_thin()
allows users to
specify the spatial and temporal distance to thin records by. An
important argument of spatiotemp_thin()
is
spatial.split.degrees
which specifies the size of grid
cells to split occurrence records into before temporal thinning. This
prevents spatially distant but temporally close records from being
filtered out.
occ_thin <- spatiotemp_thin(occ.data = sample_occ_cropped,
temporal.method = "day",
temporal.dist = 30,
spatial.split.degrees = 3,
spatial.dist = 100000,
iterations = 5)
#> [1] "19 of 142 removed by temporal"
#> **********************************************
#> Beginning Spatial Thinning.
#> Script Started at: Tue Feb 18 10:34:01 2025
#> lat.long.thin.count
#> 47
#> 1
#> [1] "Maximum number of records after thinning: 47"
#> [1] "Number of data.frames with max records: 1"
#> [1] "No files written for this run."
#> [1] "18 of 142 removed by temporal"
#> **********************************************
#> Beginning Spatial Thinning.
#> Script Started at: Tue Feb 18 10:34:02 2025
#> lat.long.thin.count
#> 48
#> 1
#> [1] "Maximum number of records after thinning: 48"
#> [1] "Number of data.frames with max records: 1"
#> [1] "No files written for this run."
#> [1] "20 of 142 removed by temporal"
#> **********************************************
#> Beginning Spatial Thinning.
#> Script Started at: Tue Feb 18 10:34:02 2025
#> lat.long.thin.count
#> 46
#> 1
#> [1] "Maximum number of records after thinning: 46"
#> [1] "Number of data.frames with max records: 1"
#> [1] "No files written for this run."
#> [1] "18 of 142 removed by temporal"
#> **********************************************
#> Beginning Spatial Thinning.
#> Script Started at: Tue Feb 18 10:34:02 2025
#> lat.long.thin.count
#> 49
#> 1
#> [1] "Maximum number of records after thinning: 49"
#> [1] "Number of data.frames with max records: 1"
#> [1] "No files written for this run."
#> [1] "19 of 142 removed by temporal"
#> **********************************************
#> Beginning Spatial Thinning.
#> Script Started at: Tue Feb 18 10:34:02 2025
#> lat.long.thin.count
#> 48
#> 1
#> [1] "Maximum number of records after thinning: 48"
#> [1] "Number of data.frames with max records: 1"
#> [1] "No files written for this run."
# Plot the difference in spatial distribution after thinning
# Non-thinned
terra::plot(sample_extent_data$geometry)
terra::plot(terra::vect(sample_occ_cropped[, c("x", "y")],
geom = c("x", "y"),
crs = "+proj=longlat +datum=WGS84 +no_defs"),add=T)
# Thinned
terra::plot(sample_extent_data$geometry)
terra::plot(terra::vect(occ_thin[, c("x", "y")],
geom = c("x", "y"),
crs = "+proj=longlat +datum=WGS84 +no_defs"),add=T)
Another approach to account for spatio-temporal sampling biases is to weight records by sampling effort when fitting SDMs.This results in the downweighting of the importance of records from over-sampled regions and vice versa.
Here, we will read in a sample of e-Bird avian sampling records in
southern Africa to use as a proxy for sampling effort, and sum these
across a spatial and temporal buffer from each occurrence record using
spatiotemp_weights()
.
Often the paucity of species absence records necessitates the
generation of pseudo-absences, which are inferred absences based upon
presence records. When modelling a species distribution at high
spatiotemporal resolution, it may be best to generate pseudo-absences
within close spatial and temporal buffers of species occurrence. This is
because model presence-absence comparisons are made at fine scales,
instead of coarsely between biomes and seasons. Using
spatiotemp_pseudoabs()
you can specify the spatial and
temporal buffer size or extents to randomly generate pseudo-absences
within.
# Pseudo-absences generated within spatial and temporal buffer
pseudo_abs_buff <- spatiotemp_pseudoabs(occ.data = sample_occ_cropped,
spatial.method = "buffer",
temporal.method = "buffer",
spatial.ext = sample_extent_data,
spatial.buffer = c(250000, 500000),
temporal.buffer = c(42, 84),
n.pseudoabs = nrow(sample_occ_cropped))
# Pseudo-absences generated randomly across given spatial and temporal extent
pseudo_abs_rand <- spatiotemp_pseudoabs(occ.data = sample_occ_cropped,
spatial.method = "random",
temporal.method = "random",
spatial.ext = sample_extent_data,
temporal.ext = c("2002-01-01", "2019-12-01"),
n.pseudoabs = nrow(sample_occ_cropped))
# Plot the spatial distribution of pseudo-absences (red) compared to occurrence records for:
# Buffered
terra::plot(sample_extent_data$geometry)
terra::plot(terra::vect(sample_occ_cropped[, c("x", "y")],
geom = c("x", "y"),
crs = "+proj=longlat +datum=WGS84 +no_defs"),col = "black",add=T)
terra::plot(terra::vect(pseudo_abs_buff[, c("x", "y")],
geom = c("x", "y"),
crs = "+proj=longlat +datum=WGS84 +no_defs"),col = "red", add=T)
# Random
terra::plot(sample_extent_data$geometry)
terra::plot(terra::vect(sample_occ_cropped[, c("x", "y")],
geom = c("x", "y"),
crs = "+proj=longlat +datum=WGS84 +no_defs"),col = "black",add=T)
terra::plot(terra::vect(pseudo_abs_rand[, c("x", "y")],
geom = c("x", "y"),
crs = "+proj=longlat +datum=WGS84 +no_defs"),col = "red", add=T)
Then, the generated pseudo-absences need to be added to the occurrence record data frame by adding relevant columns and binding the rows together.
pseudo_abs_buff$decimalLatitude <- pseudo_abs_buff$y
pseudo_abs_buff$decimalLongitude <- pseudo_abs_buff$x
pseudo_abs_buff$occurrenceStatus <- rep("absent", nrow(pseudo_abs_buff))
pseudo_abs_buff$source <- rep("dynamicSDM", nrow(pseudo_abs_buff))
pseudo_abs_buff<-spatiotemp_weights(occ.data = pseudo_abs_buff,
samp.events = sample_events_data,
spatial.dist = 100000,
temporal.dist = 30,
prj = "+proj=longlat +datum=WGS84")
complete.dataset <- as.data.frame(rbind(sample_occ_cropped, pseudo_abs_buff))
At the end of this vignette, we now have a complete data frame of filtered species occurrence and pseudo-absence records of appropriate spatio-temporal quality, extent and resolution. Let’s save this to our project directory for use in the next tutorial!