Abstract
A scalable and fast method for estimating joint Species Distribution Models (jSDMs) for big community data, including eDNA data. The package estimates a full (i.e. non-latent) jSDM with different response functions (including the traditional multivariate probit model). As described in Pichler & Hartig (2021) doi:10.1111/2041-210X.13687, scalability is achieved by using a Monte Carlo approximation of the joint likelihood implemented via ‘PyTorch’ and ‘reticulate’, which can be run on CPUs or GPUs. The package includes support for different response families, the ability to account for spatial autocorrelation, and the option to fit responses via deep neural networks instead of a standard linear predictor. In addition, you can perform variation partitioning (VP) / ANOVA on the fitted models for environmental, spatial, and co-association model components, and further subdivide these three components per species and sites (internal metacommunity structure, see Leibold et al., doi:10.1111/oik.08618).
The sjSDM package provides R functions for estimating so-called joint species distribution models.
A jSDM is a GLMM that models a multivariate (i.e. a many-species) response to the environment, space and a covariance term that models conditional (on the other terms) correlations between the outputs (i.e. species). In other research domains, the covariance component is also known as the multivariate probit model.
A big challenge in jSDM implementation is computational speed. The goal of the sjSDM (which stands for “scalable joint species distribution models”) is to make jSDM computations fast and scalable. Unlike many other packages, which use a latent-variable approximation to make estimating jSDMs faster, sjSDM fits a full covariance matrix in the likelihood, which is, however, numerically approximated via simulations. The method is described in Pichler & Hartig (2021) A new joint species distribution model for faster and more accurate inference of species associations from big community data, https://www.doi.org/10.1111/2041-210X.13687.
The core code of sjSDM is implemented in Python / PyTorch, which is then wrapped into an R package. In principle, you can also use it stand-alone under Python (see instructions below). Note: for both the R and the python package, python >= 3.9 and pytorch must be installed (more details below). However, for most users, it will be more convenient to use sjSDM via the sjSDM R package, which also provides a large number of downstream functionalities.
To get citation info for sjSDM when you use it for your reseach, type
citation("sjSDM")
#> To cite sjSDM in publications use:
#>
#> Pichler, M. and Hartig, F. (2021), A new joint species distribution model for faster and more accurate inference of species associations from big
#> community data. Methods in Ecology and Evolution. Accepted Author Manuscript. https://doi.org/10.1111/2041-210X.13687
#>
#> A BibTeX entry for LaTeX users is
#>
#> @Article{,
#> title = {A new joint species distribution model for faster and more accurate inference of species associations from big community data},
#> author = {Maximilian Pichler and Florian Hartig},
#> journal = {Methods in Ecology and Evolution},
#> year = {2021},
#> doi = {10.1111/2041-210X.13687},
#> }
sjSDM is distributed via CRAN. For most users, it will be best to install the package from CRAN
Depencies for the package can be installed before or after installing the package. Detailed explanations of the can be found below. Very briefly, the dependencies can be automatically installed from within R:
For advanced users: if you want to install the current (development) version from this repository, run
devtools::install_github("https://github.com/TheoreticalEcology/s-jSDM", subdir = "sjSDM", ref = "master")
dependencies should be installed as above. If the installation fails,
check out the help of ?install_sjSDM
, ?installation_help,
and vignette("Dependencies", package = "sjSDM")
.
install_sjSDM()
?installation_help
We start with a dataset about eucalypt communities (the dataset is from Pollock et a., 2014)
library(sjSDM)
set.seed(42)
Env = eucalypts$env # environment
PA = eucalypts$PA # presence absence
Coords = eucalypts$lat_lon # coordinates
Prepare data:
Env$Rockiness = scale(Env$Rockiness)
Env$PPTann = scale(Env$PPTann)
Env$cvTemp = scale(Env$cvTemp)
Env$T0 = scale(Env$T0)
Coords = scale(Coords)
The model is fit by the function sjSDM()
. You have to
provide all predictors and response as matrices or data frames. Here, we
specify all three predictor components of an SDM:
biotic = bioticStruct(diag = TRUE)
, which
would result in a standard MSDM.model <- sjSDM(Y = PA,
env = linear(data = Env, formula = ~.),
spatial = linear(data = Coords, formula = ~0+latitude*longitude),
family=binomial("probit"),
se = TRUE,
verbose = FALSE)
#> Error in eval(expr, envir, enclos): Required version of NumPy not available: incompatible NumPy binary version 33554432 (expecting version 16777225)
The “linear” expression used in the environmental and spatial component means that you essentially fit a linear regression structure for the respective component. Alternatively, you could also fit a neural network (see later).
The option se = TRUE
tells the model to calculate CIs
and p-values. Note that these are approximated based on Wald CIs, which
may have a certain amount of error for small data and if regularization
is applied.
To get an overall summary of the model, type
summary(model)
#> Family: binomial
#>
#> LogLik: -242.1242
#> Regularization loss: 0
#> Error in eval(expr, envir, enclos): Required version of NumPy not available: incompatible NumPy binary version 33554432 (expecting version 16777225)
This output will show you the fitted environmental, spatial and covariance parameters (provided that those components were specified). Implemented S3 functions for a model object include
The environmental effects are displayed in the summary() table. To get a visual plot, you can use
The spatial effects are displayed in the summary()
table. Currently, there are not additional options implemented to
visualize the spatial effects
The species-species associations are displayed in the
summary()
table. You can extract and visualize them as
follows:
association = getCor(model)
#> Error in eval(expr, envir, enclos): Required version of NumPy not available: incompatible NumPy binary version 33554432 (expecting version 16777225)
Note that these associations are to be interpreted as correlations in species presence or abundance after accounting for environmental and spatial effects.
ANOVA or variation partitioning means that we want to calculate how much variation (=signal) is explained by the different model components.
The sjSDM package includes several functions for this purpose.
The first option is a global ANOVA, which will calculate the amount of variation explained by the three components (environment, associations, and space):
an = anova(model, verbose = FALSE)
#> Error in eval(expr, envir, enclos): Required version of NumPy not available: incompatible NumPy binary version 33554432 (expecting version 16777225)
The resulting object is of class sjSDManova, and you can use the standard summary and plot function on this object:
summary(an)
#> Analysis of Deviance Table
#>
#> Deviance Residual deviance R2 Nagelkerke R2 McFadden
#> Abiotic 673.26304 3002.96075 0.77008 0.1654
#> Associations 169.13782 3507.08597 0.30878 0.0416
#> Spatial 150.96228 3525.26151 0.28080 0.0371
#> Shared Abiotic+Associations 142.48582 3533.73797 0.26736 0.0350
#> Shared Abiotic+Spatial 1531.78234 2144.44145 0.96472 0.3763
#> Shared Spatial+Associations 11.24359 3664.98020 0.02425 0.0028
#> Shared Abiotic+Associations+Spatial -1361.34035 5037.56415 -18.53795 -0.3345
#> Full 1317.53455 2358.68924 0.94368 0.3237
We see that the environmental component explains the most variance in the data, followed by the biotic association component. As in any ANOVA, there are unique and shared fractions for each component. The shared fractions depict variation that can be explained from either of the components (e.g. due to collinearity).
Different ANOVA strategies deal differently with these shared
fractions. For example, in type II/III ANOVA, the shared fractions are
usually discarded. The default here is to display all fractions.
Alternatively (more on this see ?summary.sjSDManova
):
summary(an, fractions = "discard")
discards shared
fractions, as in a type II/III ANOVAsummary(an, fractions = "proportional")
distributes
shared fractions to the unique fractions proportional to the unique
fractionssummary(an, fractions = "equal")
distributes shared
fractions evenly to the unique fractionsNote that in VP, some fractions can get negative. For more discussion on this and how to interpret it see here and here.
An interesting feature of jSDMs is that we can further partition the global VP discussed just before to sites and species. This idea was first presented in
Leibold, M. A., Rudolph, F. J., Blanchet, F. G., De Meester, L., Gravel, D., Hartig, F., ... & Chase, J. M. (2022). The internal structure of metacommunities. Oikos, 2022(1).
We have implemented this feature call the calculation for this “internal metacommunity structure” via
The resulting object contains R2 values for each of the three
components (Env, Space, Associations) for each species and site. Note
the comments in ?internalStructure
, especially about how to
deal with negative values.
The results can be plotted via
The ternary plots report the relative importance of the components for sites and species. Leaning to one of the corners means that this corner is mot important (e.g. here Species and Sites lean more to the environmental corner).
A attractive feature of the internal metacommunity structure calculations is that we can now regress the partial R2 of sites and species against predictors, to see how the strength of the three assembly components (Environment, Space, Associations) changes with these predictors.
The default plot will plot the assembly processes per plot against environmental and spatial distinctiveness, and richness:
However, you can use your own predictors, and also switch between
looking at sites or species as a response. For more on this, see
?plotAssemblyEffects
.
To predict with an sjSDM object, use
pred = predict(model, newdata = Env, SP = Coords)
#> Error in eval(expr, envir, enclos): Required version of NumPy not available: incompatible NumPy binary version 33554432 (expecting version 16777225)
The standard predictions are marginal, i.e. assume that we have no information about what species exist on a site we predict to.
As jSDMs fit correlations between species, predictions of jSDM improve if we can provide information for some of the species on a new site. This is known as a conditional prediction. As an example, let’s assume that the first 6 species in this data are unknown, whereas the P/A of the rest is known to us.
New_PA = PA
New_PA[, 1:6] = NA
head(New_PA)
#> ALA ARE BAX CAM GON MEL OBL OVA WIL ALP VIM ARO.SAB
#> [1,] NA NA NA NA NA NA 0 0 1 1 0 0
#> [2,] NA NA NA NA NA NA 1 0 1 1 0 0
#> [3,] NA NA NA NA NA NA 0 0 1 1 0 0
#> [4,] NA NA NA NA NA NA 0 0 1 0 0 0
#> [5,] NA NA NA NA NA NA 1 0 0 0 0 0
#> [6,] NA NA NA NA NA NA 0 0 1 1 0 0
We can then predict occurrences for the first 6 species, conditioning on the rest of the species (7:12):
pred = predict(model, newdata = Env, SP = Coords, Y = New_PA)
#> Error in eval(expr, envir, enclos): Required version of NumPy not available: incompatible NumPy binary version 33554432 (expecting version 16777225)
head(pred)
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] 0.06919945 0.0009476454 0.1910016 0.0018830881 0.0128013078 0.0024387540
#> [2,] 0.11315208 0.0009214621 0.1472399 0.0002079794 0.0072220865 0.0003615931
#> [3,] 0.04488569 0.0006793896 0.2183437 0.0011189727 0.0177170230 0.0009391013
#> [4,] 0.03075938 0.0009504823 0.6358477 0.0013088832 0.0062728245 0.0042722428
#> [5,] 0.03657349 0.0002785174 0.7242049 0.0016383200 0.0003361727 0.0132508982
#> [6,] 0.06529714 0.0008642071 0.2183885 0.0005441471 0.0231352945 0.0005368480
Note that the predict function returns only predictions for species with NA in Y
sjSDM supports other responses than presence-absence data: Simulate non-presence-absence data:
com = simulate_SDM(env = 3L, species = 5L, sites = 100L,
link = "identical", response = "count", verbose = FALSE)
#> Error in simulate_SDM(env = 3L, species = 5L, sites = 100L, link = "identical", : unused argument (verbose = FALSE)
jSDMs account for correlation between species within communities (sites), in real datasets, however, communities (sites) are often also correlated (== spatial autocorrelation). Usually conditional autoregressive (CAR) models are used to account for the spatial autocorrelation in the residuals, which we, however, do not support yet. A similar approach is to condition the model on space, which we can do by using space as predictors.
Let’s first simulate test data:
com = simulate_SDM(env = 3L, species = 5L, sites = 100L,
link = "identical", response = "identical")
X = com$env_weights
Y = com$response
XYcoords = matrix(rnorm(200), 100, 2)+2
WW = as.matrix(dist(XYcoords))
spatialResiduals = mvtnorm::rmvnorm( 5L, sigma = exp(-WW))
There are three options to condition our model on space:
The idea of the trend surface model is to use the spatial coordinates within a polynom:
Sometimes a linear model and a polynom is not flexible enough to account for space. We can use a “simple” DNN for space to condition our linear environmental model on the space:
sjSDM supports l1 (lasso) and l2 (ridge) regularization: * alpha is the weighting between lasso and ridge * alpha = 0.0 corresponds to pure lasso * alpha = 1.0 corresponds to pure ridge
We can do the same for the species associations:
model = sjSDM(Y = com$response,
env = linear(data = com$env_weights,
formula = ~0+ I(X1^2),
lambda = 0.5),
biotic = bioticStruct(lambda =0.1),
iter = 50L, verbose = FALSE)
#> Error in eval(expr, envir, enclos): Required version of NumPy not available: incompatible NumPy binary version 33554432 (expecting version 16777225)
com = simulate_SDM(env = 3L, species = 5L, sites = 100L)
X = com$env_weights
Y = com$response
# three fully connected layers with relu as activation function
model = sjSDM(Y = Y,
env = DNN(data = X,
formula = ~.,
hidden = c(10L, 10L, 10L),
activation = "relu"),
iter = 50L, se = TRUE, verbose = FALSE)
#> Error in eval(expr, envir, enclos): Required version of NumPy not available: incompatible NumPy binary version 33554432 (expecting version 16777225)
summary(model)
#> Family: binomial
#>
#> LogLik: -242.1242
#> Regularization loss: 0
#> Error in eval(expr, envir, enclos): Required version of NumPy not available: incompatible NumPy binary version 33554432 (expecting version 16777225)
The methods for sjSDM() work also for the non-linear model:
association = getCor(model) # species association matrix
pred = predict(model) # predict on fitted data
pred = predict(model, newdata = X) # predict on new data
Extract and set weights of model:
Plot the training history:
The sjSDM::install_sjSDM()
function can install
automatically all necessary ‘python’ dependencies but it can fail
sometimes because of individual system settings or if other
‘python’/‘conda’ installations get into the way.
A few notes before you start with the installation (skip this point if you do not know conda):
Sometimes the automatic ‘miniconda’ installation (via
sjSDM::install_sjSDM()
).doesn’t work because of white
spaces in the user’s name. But you can easily download and install
‘conda’ on your own:
Download and install the latest ‘conda’ version
Afterwards run:
Reload the package and run the example, if this doesn’t work:
Download and install the latest ‘conda’ version
Open the command window (cmd.exe - hit windows key + r and write cmd)
Run in cmd.exe:
conda create --name r-sjsdm python=3.10
conda activate r-sjsdm
conda install pytorch torchvision torchaudio cpuonly -c pytorch # cpu
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia #gpu
python -m pip install pyro-ppl torch_optimizer madgrad
Restart R, try the example, and if it does not work:
Run in R:
Restart R try to run the example, if this doesn’t work:
We strongly advise to use a ‘conda’ environment but a virtual environment should also work. The only requirement is that it is named ‘r-sjsdm’
Download and install the latest ‘conda’ version
Open your terminal and run:
conda create --name r-sjsdm python=3.10
conda activate r-sjsdm
conda install pytorch torchvision torchaudio cpuonly -c pytorch # cpu
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia #gpu
python -m pip install pyro-ppl torch_optimizer madgrad
Restart R try to run the example, if this doesn’t work:
Run in R:
Restart R try to run the example, if this doesn’t work:
We strongly advise to use a ‘conda’ environment but a virtual environment should also work. The only requirement is that it is named ‘r-sjsdm’
Download and install the latest conda conda version
Open your terminal and run:
conda create --name r-sjsdm python=3.10
conda activate r-sjsdm
conda install pytorch::pytorch torchvision torchaudio -c pytorch
python -m pip install pyro-ppl torch_optimizer madgrad
Restart R, try the example, if it does not work:
sjSDM::install_diagnostic()
as a quote.