bigDM: fitting multivariate spatial models
Introduction
In previous vignettes, we showed how to fit spatial Poisson mixed models for high-dimensional areal count data, how to use parallel or distributed computation strategies, and how to use the bigDM package to analyse high-dimensional spatio-temporal count data. Here, we describe how to use this package to fit order-free multivariate scalable Bayesian models to smooth mortality (or incidence) risks of several diseases simultaneously (Vicente et al., 2023).
M-models for multivariate disease mapping
Let us assume that the area of interest is divided into
Following the work by Botella-Rocamora et al. (2015), we rearrange the spatial effects into the matrix
The potential association between the spatial patterns of the different diseases is included in the model considering the decomposition of
The matrix
On the other hand,
Once the between-diseases dependencies are incorporated into the model, the resulting prior distributions for
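As a compact reminder of the structure discussed in this section, the M-model can be sketched as follows. This is a hedged summary that assumes the notation of Botella-Rocamora et al. (2015): the symbols I (number of areas), J (number of diseases), K, \Theta, \Phi, M and \Sigma_b are assumptions about that notation, while O and E are the observed and expected counts used throughout this vignette.

```latex
\begin{align*}
O_{ij} \mid R_{ij} &\sim \text{Poisson}(\mu_{ij} = E_{ij} R_{ij}), \qquad i = 1,\ldots,I, \quad j = 1,\ldots,J, \\
\log R_{ij} &= \alpha_j + \theta_{ij}, \\
\Theta &= \{\theta_{ij}\} = \Phi M, \qquad \Sigma_b = M^{\prime} M .
\end{align*}
```

Under these assumptions, \alpha_j is a disease-specific intercept, the columns of the I x K matrix \Phi are independent spatially structured random effects (within-disease dependence), and the K x J matrix M induces the between-disease covariance matrix \Sigma_b.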
Prior distributions for the disease-specific random effects
Several prior distributions are implemented in the MCAR_INLA() function to deal with spatial dependence within diseases:
- prior="intrinsic" for the M-model implementation of the intrinsic multivariate CAR latent effect,
- prior="Leroux" for the M-model implementation of the Leroux et al. (1999) multivariate CAR latent effect,
- prior="proper" for the M-model implementation of the proper multivariate CAR latent effect,
- prior="iid" for the M-model implementation of the spatially non-structured multivariate latent effect.
As for the spatial prior distributions for univariate (single disease) models, appropriate sum-to-zero constraints must be imposed to solve identifiability problems with the disease-specific intercepts. See Vicente et al. (2023) for details about prior distributions for model hyperparameters.
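As a rough illustration of how these four priors differ, the sketch below builds the spatial precision matrix usually associated with each of them for a toy map of four areas on a line. The helper spatial_precision() and the parameter names lambda and rho are hypothetical and not part of bigDM, whose internal parametrisation may differ.

```r
# Toy map: 4 areas on a line (1-2-3-4); W is the binary adjacency matrix.
W <- matrix(0, 4, 4)
W[cbind(1:3, 2:4)] <- 1
W <- W + t(W)

# Hypothetical helper (not part of bigDM): textbook precision matrix for each prior.
# 'lambda' (Leroux) and 'rho' (proper CAR) are assumed spatial parameters in [0, 1).
spatial_precision <- function(prior, W, lambda = 0.5, rho = 0.5) {
  D <- diag(rowSums(W))   # diagonal matrix with the number of neighbours
  I <- diag(nrow(W))
  switch(prior,
         intrinsic = D - W,
         Leroux    = lambda * (D - W) + (1 - lambda) * I,
         proper    = D - rho * W,
         iid       = I,
         stop("unknown prior"))
}

Q_icar   <- spatial_precision("intrinsic", W)
Q_leroux <- spatial_precision("Leroux", W)

# The intrinsic precision has zero row sums, so it is rank-deficient;
# this is one reason why sum-to-zero constraints are needed.
rowSums(Q_icar)
qr(Q_icar)$rank
```

The Leroux and proper CAR matrices are full rank for the parameter values used here, whereas the intrinsic one loses one degree of freedom per connected component of the graph.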
Note: The M-model implementation of these models using R-INLA requires the use of at least J(J+1)/2 hyperparameters, where J is the number of diseases.
Between-disease correlations and variance parameters
In addition to enlarging the effective sample size and improving smoothing by borrowing information from the different responses, one of the main advantages of multivariate disease mapping models is that they take into account correlations between the spatial patterns of the different diseases.
We compute the marginal posterior estimates of these parameters by sampling from the approximated joint posterior for the model hyperparameters using the inla.hyperpar.sample() function and computing kernel density estimates of the derived samples for the elements of the correlation matrix of the random effects. The results (summary statistics and posterior marginal densities) are contained in the summary.cor/summary.var and marginals.cor/marginals.var elements of the inla model.
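The sample-then-summarise idea can be sketched in base R. In the sketch below the covariance draws are simulated from a Wishart distribution as a stand-in for draws derived from inla.hyperpar.sample(); the matrix Sigma_true, the number of draws and the helper summarise() are all hypothetical.

```r
set.seed(1)

# Hypothetical "true" between-disease covariance matrix for 3 diseases.
Sigma_true <- matrix(c(0.08, 0.03, 0.02,
                       0.03, 0.03, 0.02,
                       0.02, 0.02, 0.09), 3, 3)

# Simulated stand-ins for posterior draws of the covariance matrix
# (a 3 x 3 x 2000 array); a real analysis would derive them from the fitted model.
draws <- rWishart(2000, df = 50, Sigma = Sigma_true / 50)

# Correlations and variances implied by each draw.
cors <- apply(draws, 3, function(S) cov2cor(S)[lower.tri(S)])
rownames(cors) <- c("rho12", "rho13", "rho23")
vars <- apply(draws, 3, diag)
rownames(vars) <- c("var1", "var2", "var3")

# Summary statistics, in the spirit of summary.cor / summary.var.
summarise <- function(x) c(mean = mean(x), sd = sd(x),
                           quantile(x, c(0.025, 0.5, 0.975)))
summary_cor <- t(apply(cors, 1, summarise))
summary_var <- t(apply(vars, 1, summarise))

# Kernel density estimate of one correlation, in the spirit of marginals.cor.
dens_rho12 <- density(cors["rho12", ], from = -1, to = 1)
```

Because the summaries are computed on the derived samples, no analytical transformation from the internal hyperparameters to correlations is needed.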
The MCAR_INLA function
As in the CAR_INLA() and STCAR_INLA() functions, three main modelling approaches can be considered:
- the usual model with a global spatial random effect whose dependence structure is based on the whole neighbourhood graph of the areal units (model="global" argument),
- a disjoint model based on a partition of the whole spatial domain where independent spatial CAR models are simultaneously fitted in each partition (model="partition" and k=0 arguments),
- a modelling approach where k-order neighbours are added to each partition to avoid border effects in the disjoint model (model="partition" and k>0 arguments).
For both the disjoint and
The data and its associated cartography file must be passed to the MCAR_INLA() function. These are some of the most relevant arguments of this function:
- carto: an object of class sf or SpatialPolygonsDataFrame that must contain at least the variable specified in the argument ID.area.
- data: an object of class data.frame that must contain the variables specified in the arguments ID.area, ID.disease, O and E.
- ID.area: name of the variable that contains the IDs of the spatial areal units. The values of this variable must match in the carto and data objects.
- ID.disease: name of the variable that contains the IDs of the diseases.
- ID.group: name of the variable that contains the IDs of the spatial partition (grouping variable). Only required if model="partition".
- O: name of the variable that contains the observed number of cases for each areal unit and disease.
- E: name of the variable that contains either the expected number of cases or the population at risk for each areal unit and disease.
- W: optional argument with the binary adjacency matrix of the spatial areal units. If NULL (default), this object is computed from the carto argument (two areas are considered neighbours if they share a common border).
- merge.strategy: one of either "mixture" or "original" (default), which specifies the merging strategy used to compute posterior marginal estimates of the linear predictor (log-risks or log-rates). See the mergeINLA() function for further details.
- compute.fitted.values: logical value (default FALSE); if TRUE, transforms the posterior marginal distribution of the linear predictor to the exponential scale (risks or rates).
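Before calling the function it can be useful to check that the two inputs are consistent with these requirements. The following sketch uses toy stand-ins for the carto and data objects; the helper check_inputs() is hypothetical and not part of bigDM, and the balanced layout it checks (one row per area and disease) is an assumption based on the structure of the example data used in this vignette.

```r
# Toy stand-ins for the 'carto' and 'data' objects (hypothetical values).
carto_toy <- data.frame(ID = c("a1", "a2", "a3"),
                        region = c("north", "north", "south"))
data_toy <- expand.grid(ID = c("a1", "a2", "a3"), disease = 1:2,
                        stringsAsFactors = FALSE)
data_toy$obs <- c(3, 0, 5, 1, 2, 4)
data_toy$exp <- c(2.5, 1.0, 4.2, 1.1, 1.9, 3.8)

# Hypothetical helper (not part of bigDM): basic consistency checks.
check_inputs <- function(carto, data, ID.area, ID.disease, O, E) {
  missing_cols <- setdiff(c(ID.area, ID.disease, O, E), names(data))
  if (length(missing_cols) > 0)
    stop("missing columns in 'data': ", paste(missing_cols, collapse = ", "))
  if (!ID.area %in% names(carto))
    stop("'carto' has no column named ", ID.area)
  n_per_disease <- table(data[[ID.disease]])
  list(
    # the same set of area IDs must appear in both objects
    ids_match  = setequal(unique(data[[ID.area]]), carto[[ID.area]]),
    # assumed layout: one row per area and disease
    balanced   = all(n_per_disease == nrow(carto)),
    n_areas    = nrow(carto),
    n_diseases = length(n_per_disease)
  )
}

chk <- check_inputs(carto_toy, data_toy, "ID", "disease", "obs", "exp")
str(chk)
```

The same checks should work on an sf object, since its attribute columns can be accessed by name like those of a data.frame.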
The Carto_SpainMUN object included in the bigDM package contains the spatial polygons of the municipalities of continental Spain and simulated colorectal cancer mortality data (see the examples of the CAR_INLA() function).
library(INLA)
library(bigDM)
data(Carto_SpainMUN)
head(Carto_SpainMUN)
#> Simple feature collection with 6 features and 8 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: 485318 ymin: 4727428 xmax: 543317 ymax: 4779153
#> Projected CRS: ETRS89 / UTM zone 30N
#> ID name area perimeter obs
#> 1 01001 Alegria-Dulantzi 19913794 [m^2] 34372.11 [m] 2
#> 2 01002 Amurrio 96145595 [m^2] 63352.32 [m] 28
#> 3 01003 Aramaio 73338806 [m^2] 41430.46 [m] 6
#> 4 01004 Artziniega 27506468 [m^2] 22605.22 [m] 3
#> 5 01006 Arminon 10559721 [m^2] 17847.35 [m] 0
#> 6 01008 Arrazua-Ubarrundia (San Martin de Ania) 57502811 [m^2] 64968.81 [m] 2
#> exp SMR region geometry
#> 1 3.0237149 0.6614380 Pais Vasco MULTIPOLYGON (((538259 4737...
#> 2 20.8456682 1.3432047 Pais Vasco MULTIPOLYGON (((503520 4760...
#> 3 3.7527301 1.5988360 Pais Vasco MULTIPOLYGON (((533286 4759...
#> 4 3.2093191 0.9347777 Pais Vasco MULTIPOLYGON (((491260 4776...
#> 5 0.4817391 0.0000000 Pais Vasco MULTIPOLYGON (((509851 4727...
#> 6 1.9643891 1.0181282 Pais Vasco MULTIPOLYGON (((534678 4746...
In this vignette, simulated cancer mortality data for three diseases in the 7907 municipalities of mainland Spain (excluding the Balearic and Canary Islands, and the autonomous cities of Ceuta and Melilla) are used for illustration. The data, included in the object Data_MultiCancer, have been modified in order to preserve the confidentiality of the original data.
data(Data_MultiCancer)
str(Data_MultiCancer)
#> 'data.frame': 23721 obs. of 5 variables:
#> $ ID : chr "01001" "01002" "01003" "01004" ...
#> $ disease: int 1 1 1 1 1 1 1 1 1 1 ...
#> $ obs : int 3 41 4 6 0 2 4 10 2 7 ...
#> $ exp : num 6.615 42.634 7.431 6.355 0.934 ...
#> $ SMR : num 0.454 0.962 0.538 0.944 0 ...
Note that both objects contain a common identification variable of the areal units named ID.
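The SMR column is the standardised mortality ratio, that is, the observed number of cases divided by the expected number. As a quick check, the sketch below recomputes it from the first rows printed by str() above (the expected values are the rounded ones shown in that output, so only three decimals are compared).

```r
# First rows of Data_MultiCancer as printed by str() above.
obs  <- c(3, 41, 4, 6, 0)
expd <- c(6.615, 42.634, 7.431, 6.355, 0.934)

# Standardised mortality ratio: observed / expected cases.
smr <- obs / expd
round(smr, 3)
#> [1] 0.454 0.962 0.538 0.944 0.000
```

Values below 1 indicate fewer observed cases than expected, and areas with small expected counts (such as the fifth one) produce very unstable ratios, which is the motivation for model-based smoothing.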
Global model
We refer to the spatial multivariate M-model described above as the Global model, where the whole neighbourhood graph of the areal units is used to define the adjacency matrix W.
The Global model with an iCAR prior for the spatial random effects is fitted using the MCAR_INLA() function as
Global <- MCAR_INLA(carto=Carto_SpainMUN, data=Data_MultiCancer, ID.area="ID", ID.disease="disease",
                    O="obs", E="exp", prior="intrinsic", model="global", strategy="gaussian")
#> STEP 1: Pre-processing data
#> STEP 2: Fitting global model with INLA (this may take a while...)
summary(Global)
#> Time used:
#> Pre = 1.24, Running = 242, Post = 6.26, Total = 249
#> Fixed effects:
#> mean sd 0.025quant 0.5quant 0.975quant mode kld
#> I1 -0.139 0.006 -0.150 -0.139 -0.127 -0.139 0
#> I2 -0.049 0.006 -0.062 -0.049 -0.036 -0.049 0
#> I3 0.026 0.010 0.007 0.026 0.046 0.026 0
#>
#> Random effects:
#> Name Model
#> idx RGeneric2
#>
#> Model hyperparameters:
#> mean sd 0.025quant 0.5quant 0.975quant mode
#> Theta1 for idx -1.292 0.031 -1.353 -1.292 -1.230 -1.292
#> Theta2 for idx -1.982 0.076 -2.136 -1.980 -1.837 -1.972
#> Theta3 for idx -1.361 0.067 -1.495 -1.360 -1.232 -1.356
#> Theta4 for idx 0.118 0.011 0.097 0.118 0.139 0.118
#> Theta5 for idx 0.102 0.018 0.067 0.102 0.137 0.102
#> Theta6 for idx 0.092 0.026 0.041 0.092 0.143 0.092
#>
#> Deviance Information Criterion (DIC) ...............: 79497.16
#> Deviance Information Criterion (DIC, saturated) ....: 25182.16
#> Effective number of parameters .....................: 1721.78
#>
#> Watanabe-Akaike information criterion (WAIC) ...: 79398.30
#> Effective number of parameters .................: 1397.43
#>
#> Marginal log-Likelihood: -40196.54
#> CPO, PIT is computed
#> Posterior summaries for the linear predictor and the fitted values are computed
#> (Posterior marginals needs also 'control.compute=list(return.marginals.predictor=TRUE)')
## Posterior estimates of between-disease correlations ##
Global$summary.cor
#>            mean         sd 0.025quant  0.5quant 0.975quant
#> rho12 0.6483155 0.04464886  0.5599726 0.6494295  0.7336759
#> rho13 0.3473821 0.05730459  0.2286583 0.3492246  0.4535381
#> rho23 0.4629081 0.07071069  0.3146017 0.4656483  0.5934391
## Posterior estimates of variance parameters ##
Global$summary.var
#>            mean          sd 0.025quant   0.5quant 0.975quant
#> var1 0.07566692 0.004874530 0.06647279 0.07552309 0.08560595
#> var2 0.03326421 0.004004212 0.02580772 0.03316071 0.04152341
#> var3 0.08611126 0.010905311 0.06611620 0.08577073 0.10850247
When the number of areas is very large, fitting the M-model can be so computationally demanding that it becomes infeasible for users with limited computing capacity. In addition, fitting a single model to the whole region may not be the best strategy, since the appropriate degree of smoothing need not be the same across the whole region. As an alternative, the scalable Bayesian multivariate modelling approach described in Vicente et al. (2023) can be used to jointly smooth incidence or mortality risks of several diseases for high-dimensional areal count data. This proposal divides the spatial domain into subregions where independent models are fitted simultaneously, and it is implemented in the MCAR_INLA() function.
Disjoint model
A natural way to think of partitions is to consider subregions based on administrative subdivisions of the area of interest. For our example data in Data_MultiCancer we propose to divide the data into the 15 subregions defined by the region variable of the Carto_SpainMUN object.
library(tmap)
# tmap >= 3.99 (version 4) uses the 'fill' aesthetic; older versions use 'col'
tmap4 <- packageVersion("tmap") >= "3.99"
if(tmap4){
  tm_shape(Carto_SpainMUN) +
    tm_polygons(fill="region", fill.scale=tm_scale(values="brewer.set3")) +
    tm_layout(legend.frame=FALSE)
}else{
  tm_shape(Carto_SpainMUN) +
    tm_polygons(col="region") +
    tm_layout(legend.outside=TRUE)
}
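To see what a disjoint partition implies for the neighbourhood structure, the following sketch splits the adjacency matrix of a toy map into one block per group. The toy map (six areas on a line) and the two groups "A" and "B" are hypothetical; the package performs the equivalent step internally from the carto object and the ID.group variable.

```r
# Toy map: 6 areas on a line (1-2-3-4-5-6), split into two hypothetical groups.
W <- matrix(0, 6, 6)
W[cbind(1:5, 2:6)] <- 1
W <- W + t(W)
group <- c("A", "A", "A", "B", "B", "B")

# Disjoint (k = 0) partition: each submodel only sees the areas of its own group,
# so the link between areas 3 and 4 is dropped.
W_sub <- lapply(split(seq_along(group), group), function(idx) W[idx, idx])
W_sub$A

links_full <- sum(W) / 2                               # links in the full graph
links_kept <- sum(vapply(W_sub, sum, numeric(1))) / 2  # links kept by the partition
c(full = links_full, kept = links_kept)
```

The dropped links are exactly those that cross the border between groups, which is the source of the border effect that the k-order neighbourhood models try to reduce.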
In the code below, we show how to fit the Disjoint model with an iCAR prior for the spatial random effects and Gaussian approximation strategy using 4 local clusters (in parallel):
Disjoint <- MCAR_INLA(carto=Carto_SpainMUN, data=Data_MultiCancer,
                      ID.area="ID", ID.disease="disease", O="obs", E="exp", ID.group="region",
                      prior="intrinsic", model="partition", k=0, strategy="gaussian",
                      plan="cluster", workers=rep("localhost",4))
#> STEP 1: Pre-processing data
#> STEP 2: Fitting partition (k=0) model with INLA
#> + Model 1 of 15
#>
#> *** inla.core.safe: The inla program failed, but will rerun in case better initial values may help. try=1/1
#>
#> *** inla.core.safe: rerun with improved initial values
#> + Model 2 of 15
#> + Model 3 of 15
#> + Model 4 of 15
#> + Model 5 of 15
#> + Model 6 of 15
#> + Model 7 of 15
#> + Model 8 of 15
#> + Model 9 of 15
#> + Model 10 of 15
#> + Model 11 of 15
#> + Model 12 of 15
#> + Model 13 of 15
#> + Model 14 of 15
#> + Model 15 of 15
#> STEP 3: Merging the results
summary(Disjoint)
#> Time used:
#> Running = 231, Merging = 51.2, Total = 282, NA = NA
#> Random effects:
#> Name Model
#> idx RGeneric2
#>
#> Deviance Information Criterion (DIC) ...............: 79523.58
#> Deviance Information Criterion (DIC, saturated) ....: 25218.21
#> Effective number of parameters .....................: 1979.72
#>
#> Watanabe-Akaike information criterion (WAIC) ...: 79392.50
#> Effective number of parameters .................: 1592.49
#>
#> is computed
#> Posterior summaries for the linear predictor and the fitted values are computed
#> (Posterior marginals needs also 'control.compute=list(return.marginals.predictor=TRUE)')
* Computations were run on a personal computer with a 3.41 GHz Intel Core i5-7500 processor and 32GB RAM using R-INLA stable version INLA_24.12.11.
The result is an object of class inla where the full-domain log-risk is just the union of the posterior marginal estimates of each subregion. If the save.models=TRUE argument is included, a list with all the inla submodels is saved in a temporary folder, which can be used as input argument for the mergeINLA() function.
k-order neighbourhood model
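As in the case of the scalable spatial and spatio-temporal models fitted with the CAR_INLA() and STCAR_INLA() functions, respectively, k-order neighbourhood models can be defined to avoid the border effect of considering disjoint partitions. In this case, the entire spatial region is divided into overlapping subregions, since k-order neighbours are added to each partition, so that some areal units belong to more than one submodel.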
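The way k-order neighbours enlarge a partition can be sketched on a toy map. The helper add_neighbours() below is hypothetical and not part of bigDM; it simply adds, k times, every area adjacent to the current set.

```r
# Toy map: 6 areas on a line (1-2-3-4-5-6); W is the binary adjacency matrix.
W <- matrix(0, 6, 6)
W[cbind(1:5, 2:6)] <- 1
W <- W + t(W)

# Hypothetical helper (not part of bigDM): add k-order neighbours to a set of areas.
add_neighbours <- function(idx, W, k) {
  for (step in seq_len(k)) {
    nb  <- which(colSums(W[idx, , drop = FALSE]) > 0)  # areas adjacent to the set
    idx <- sort(union(idx, nb))
  }
  idx
}

sub_A_k1 <- add_neighbours(1:3, W, k = 1)  # areas 1 2 3 4
sub_A_k2 <- add_neighbours(1:3, W, k = 2)  # areas 1 2 3 4 5
sub_B_k1 <- add_neighbours(4:6, W, k = 1)  # areas 3 4 5 6

# Areas that belong to more than one submodel when k = 1
intersect(sub_A_k1, sub_B_k1)
```

With k=1, areas 3 and 4 are estimated twice (once in each submodel), so a merging rule is needed to obtain a single estimate for them.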
Two different merging strategies can be considered to obtain a unique posterior estimate of the linear predictor for those areas in more than one submodel:
- If the merge.strategy="original" argument is specified (default option), the posterior marginal distribution of the log-risk estimated from the area's original partition is used. See Orozco-Acosta et al. (2023).
- If the merge.strategy="mixture" argument is specified, mixture distributions of the estimated posterior probability density functions with weights proportional to the conditional predictive ordinates (CPOs) are computed. See Orozco-Acosta et al. (2021) for further details.
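The difference between the two strategies can be illustrated for a single border area that is estimated in two submodels. All numbers below (the two Gaussian densities and the CPO values) are hypothetical, and the sketch only mimics the idea of a CPO-weighted mixture; it is not the code used by mergeINLA().

```r
# Grid on which the posterior densities of the log-risk are evaluated.
x <- seq(-1.5, 1.5, length.out = 601)
h <- x[2] - x[1]

# Hypothetical posterior densities of the same area from two submodels.
dens_own   <- dnorm(x, mean = 0.10, sd = 0.15)  # submodel of its original partition
dens_other <- dnorm(x, mean = 0.25, sd = 0.20)  # neighbouring submodel

# Hypothetical CPO values of the area in each submodel, normalised to weights.
cpo <- c(own = 0.30, other = 0.10)
w   <- cpo / sum(cpo)

# "original": keep the estimate from the original partition.
merged_original <- dens_own
# "mixture": CPO-weighted mixture of the two densities.
merged_mixture  <- w[["own"]] * dens_own + w[["other"]] * dens_other

# Posterior means under each strategy (numerical integration on the grid).
mean_original <- sum(x * merged_original) * h
mean_mixture  <- sum(x * merged_mixture) * h
c(original = mean_original, mixture = mean_mixture)
```

The mixture pulls the estimate towards the submodel with the larger CPO while still integrating to one, whereas the "original" strategy ignores the information from the neighbouring submodel.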
In the code below, we show how to fit the 1st and 2nd-order neighbourhood models with an iCAR prior for the spatial random effects and Gaussian approximation strategy using 4 local clusters (in parallel):
order1 <- MCAR_INLA(carto=Carto_SpainMUN, data=Data_MultiCancer,
                    ID.area="ID", ID.disease="disease", O="obs", E="exp", ID.group="region",
                    prior="intrinsic", model="partition", k=1, strategy="gaussian",
                    plan="cluster", workers=rep("localhost",4))
#> STEP 1: Pre-processing data
#> STEP 2: Fitting partition (k=1) model with INLA
#> + Model 1 of 15
#> + Model 2 of 15
#> + Model 3 of 15
#> + Model 4 of 15
#> + Model 5 of 15
#> + Model 6 of 15
#> + Model 7 of 15
#> + Model 8 of 15
#> + Model 9 of 15
#> + Model 10 of 15
#> + Model 11 of 15
#> + Model 12 of 15
#> + Model 13 of 15
#> + Model 14 of 15
#> + Model 15 of 15
#> STEP 3: Merging the results
summary(order1)
#> Time used:
#> Running = 216, Merging = 64.7, Total = 281, NA = NA
#> Deviance Information Criterion (DIC) ...............: 79464.34
#> Deviance Information Criterion (DIC, saturated) ....: 25158.97
#> Effective number of parameters .....................: 1862.13
#>
#> Watanabe-Akaike information criterion (WAIC) ...: 79357.21
#> Effective number of parameters .................: 1513.52
#>
#> is computed
#> Posterior summaries for the linear predictor and the fitted values are computed
#> (Posterior marginals needs also 'control.compute=list(return.marginals.predictor=TRUE)')
## Posterior estimates of between-disease correlations ##
order1$summary.cor
#>            mean         sd 0.025quant  0.5quant 0.975quant
#> rho12 0.6633581 0.03928655  0.5812636 0.6650186  0.7365318
#> rho13 0.4419385 0.04887337  0.3431967 0.4424212  0.5350900
#> rho23 0.4095120 0.06295177  0.2814378 0.4113667  0.5268354
## Posterior estimates of variance parameters ##
order1$summary.var
#>            mean          sd 0.025quant   0.5quant 0.975quant
#> var1 0.06600475 0.004567020 0.05745493 0.06586423 0.07533450
#> var2 0.03058330 0.003652627 0.02386736 0.03042558 0.03808069
#> var3 0.09135834 0.010155532 0.07266927 0.09099157 0.11251572
order2 <- MCAR_INLA(carto=Carto_SpainMUN, data=Data_MultiCancer,
                    ID.area="ID", ID.disease="disease", O="obs", E="exp", ID.group="region",
                    prior="intrinsic", model="partition", k=2, strategy="gaussian",
                    plan="cluster", workers=rep("localhost",4))
#> STEP 1: Pre-processing data
#> STEP 2: Fitting partition (k=2) model with INLA
#> + Model 1 of 15
#> + Model 2 of 15
#> + Model 3 of 15
#> + Model 4 of 15
#> + Model 5 of 15
#> + Model 6 of 15
#> + Model 7 of 15
#> + Model 8 of 15
#> + Model 9 of 15
#> + Model 10 of 15
#> + Model 11 of 15
#> + Model 12 of 15
#> + Model 13 of 15
#> + Model 14 of 15
#> + Model 15 of 15
#> STEP 3: Merging the results
summary(order2)
#> Time used:
#> Running = 208, Merging = 67.1, Total = 275, NA = NA
#> Deviance Information Criterion (DIC) ...............: 79472.96
#> Deviance Information Criterion (DIC, saturated) ....: 25167.59
#> Effective number of parameters .....................: 1865.87
#>
#> Watanabe-Akaike information criterion (WAIC) ...: 79364.11
#> Effective number of parameters .................: 1515.45
#>
#> is computed
#> Posterior summaries for the linear predictor and the fitted values are computed
#> (Posterior marginals needs also 'control.compute=list(return.marginals.predictor=TRUE)')
## Posterior estimates of between-disease correlations ##
order2$summary.cor
#>            mean         sd 0.025quant  0.5quant 0.975quant
#> rho12 0.6814623 0.03713892  0.6059669 0.6823613  0.7512084
#> rho13 0.5121131 0.04377169  0.4239085 0.5127631  0.5958917
#> rho23 0.5007879 0.05267512  0.3942343 0.5019426  0.6010450
## Posterior estimates of variance parameters ##
order2$summary.var
#>            mean          sd 0.025quant   0.5quant 0.975quant
#> var1 0.06853351 0.004444304 0.06019242 0.06836791 0.07751518
#> var2 0.03176106 0.003610629 0.02520510 0.03157432 0.03933410
#> var3 0.09515434 0.009682087 0.07699128 0.09478624 0.11488388
* Computations were run on a personal computer with a 3.41 GHz Intel Core i5-7500 processor and 32GB RAM using R-INLA stable version INLA_24.12.11.
mergeINLA function
This function takes local models fitted for each subregion of the whole spatial domain and unifies them into a single inla object. It is called by the main function MCAR_INLA(), and is valid for both Disjoint and k-order neighbourhood models. In addition, approximations to model selection criteria such as the deviance information criterion (DIC) (Spiegelhalter et al., 2002) and Watanabe-Akaike information criterion (WAIC) (Watanabe, 2010) are also computed. See Vicente et al. (2023) for details on how to compute these measures for the multivariate scalable models described in this vignette.
Computation of between-disease correlation and variance parameters
Partition models provide extra information about local relationships between the diseases in the subdivisions, as they provide local estimates of the model parameters of interest: between-disease correlations and variance parameters (diagonal elements of the between-disease covariance matrix).
In the mergeINLA() function, we implement an adaptation of the consensus Monte Carlo algorithm (Scott et al., 2016) to obtain global estimates of these parameters in the overall study domain from the marginal estimates of the partition models. Further details are given in Vicente et al. (2023).
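The basic idea of consensus Monte Carlo can be sketched for a single scalar parameter: draws from each local posterior are combined through a weighted average, with weights proportional to the inverse of the local sample variance. This is the generic scalar form of the algorithm of Scott et al. (2016), not the adaptation implemented in mergeINLA(), and the local draws below are simulated, hypothetical values.

```r
set.seed(42)
n_draws <- 5000

# Hypothetical local posterior draws of one parameter in three subregions.
local_draws <- list(
  rnorm(n_draws, mean = 0.60, sd = 0.05),
  rnorm(n_draws, mean = 0.70, sd = 0.10),
  rnorm(n_draws, mean = 0.65, sd = 0.08)
)

# Weights proportional to the inverse sample variance of each local posterior.
w <- vapply(local_draws, function(d) 1 / var(d), numeric(1))
w <- w / sum(w)

# Consensus draws: draw-by-draw weighted average across subregions.
consensus <- Reduce(`+`, Map(`*`, w, local_draws))
c(mean = mean(consensus), sd = sd(consensus))
```

More precise local estimates receive larger weights, and the consensus draws are less dispersed than any of the local ones because they pool information from all subregions.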
Acknowledgments
This work has been supported by Project MTM2017-82553-R (AEI/FEDER, UE) and Project PID2020-113125RB-I00/MCIN/AEI/10.13039/501100011033. It has also been partially funded by the Public University of Navarra (project PJUPNA2001).