In previous vignettes, we show how to fit spatial Poisson mixed models for high-dimensional areal count data and how to use parallel or distributed computation strategies using the bigDM package. Here, we describe how to use this package to fit spatio-temporal models with conditional autoregressive (CAR) priors for space and random walk (RW) priors for time including space-time interactions (Knorr-Held, 2000; Ugarte et al., 2014) by extending the scalable model’s proposals described in Orozco-Acosta et al. (2021) to deal with massive spatio-temporal data (Orozco-Acosta et al., 2023).
Let us assume that the region under study is divided into \(n\) contiguous small areas labeled as \(i=1,\ldots,n\) and data are available for consecutive time periods labeled as \(t=1,\ldots,T\). For a given area \(i\) and time period \(t\),
To compute the number of expected cases \(E_{it}\), both direct and indirect standardization procedures can be performed (Ugarte, 2009), usually considering age and/or sex as standardization variables. Using the indirect standardization method, the number of expected cases for area \(i\) and time \(t\) is defined as
\[E_{it}=\sum_{j=1}^{J}N_{it}\frac{O_j}{N_j} \ \ \mbox{for} \ \ i=1,\ldots,n; \ \ t=1,\ldots,T,\] where \(O_j=\sum_{i=1}^{n}\sum_{t=1}^{T}O_{itj}\) and \(N_j=\sum_{i=1}^{n}\sum_{t=1}^{T}N_{itj}\) are the number of observed cases and the population at risk in the \(j^{th}\) age-group, respectively. Then, the standardized mortality/incidence ratio (SMR or SIR) is defined as the number of observed cases divided by the number of expected cases. Although its interpretation is very simple, these measures are extremely variable when analyzing rare diseases or very low-populated areas, as it is the case of high-dimensional data. This makes it necessary the use of statistical models to smooth risks borrowing information from neighbouring regions and time periods.
Poisson mixed models are typically used for the analysis of count data within a hierarchical Bayesian framework. Conditional to the relative risk \(r_{it}\), the number of observed cases in the \(i\)th area and time period \(t\) is assumed to be Poisson distributed with mean \(\mu_{it} = E_{it}r_{it}\). That is,
\[\begin{eqnarray*} \label{eq:Model_Poisson} \begin{array}{rcl} O_{it}|r_{it} & \sim & Poisson(\mu_{it}=E_{it}r_{it}),\\ \log \mu_{it} & = & \log E_{it}+\log r_{it}, \end{array} \end{eqnarray*}\]
where \(\log E_{it}\) is an offset. Depending on the specification of the log-risks different models can be defined. The non-parametric models based on CAR priors for spatial random effects, random walk priors for temporal random effects, and different types of spatio-temporal interactions described in Knorr-Held (2000) are currently implemented in the bigDM package.
So, the log-risks are modelled as
\[\begin{equation} \log r_{it}=\alpha+\xi_{i}+\gamma_t+\delta_{it}, \tag{1} \end{equation}\]
where:
In what follows, we will refer to model (1) as the Global model. These models are flexible enough to describe many real situations, and their interpretation is simple and attractive. However, the models are typically not identifiable and appropriate sum-to-zero constraints must be imposed over the random effects (Goicoa et al., 2018). See Table 2 for a full description of the identifiability constraints that needs to be imposed in each type of space-time interaction.
Several priors distributions for spatial and temporal random effect are implemented in the STCAR_INLA()
function, which are specified through the spatial=...
and temporal=...
arguments.
For the spatial random effect, the same values as the prior=...
argument in the CAR_INLA()
function are defined (see the vignette bigDM: fitting spatial models for more details). For the temporally structured random effect, random walks of first (RW1) or second order (RW2) prior distributions can be assumed as follow:
\[\begin{equation} \label{eq:temporal} \gamma \sim N(0,[\tau_{\gamma}R_{\gamma}]^{-}), \end{equation}\]
where \(\tau_{\gamma}\) is a precision parameter and \(R_{\gamma}\) is the \(T \times T\) structure matrix of a RW1/RW2 (see Rue & Held (2005), pp. 95 and 110) and \(^{-}\) denotes the Moore-Penrose generalized inverse. Finally, the following prior distribution is assumed for the space-time interaction random effect \(\delta=(\delta_{11},\ldots,\delta_{n1},\ldots,\delta_{1T},\ldots,\delta_{nT})^{'}\)
\[\begin{equation} \label{eq:space-time} \delta \sim N(0,[\tau_{\delta}R_{\delta}]^{-}). \end{equation}\]
Here, \(\tau_{\delta}\) is a precision parameter and \(R_{\delta}\) is the \(nT \times nT\) matrix obtained as the Kronecker product of the corresponding spatial and temporal structure matrices (recall that \(R_{\xi}=\textbf{D}_{W}-\textbf{W}\)), where four types of interactions can be considered (see Table 1).
Interaction | \(R_{\delta}\) | Spatial correlation | Temporal correlation |
---|---|---|---|
Type I | \(I_{T} \otimes I_{n}\) | - | - |
Type II | \(R_{\gamma} \otimes I_{n}\) | - | \(\checkmark\) |
Type III | \(I_{T} \otimes R_{\xi}\) | \(\checkmark\) | - |
Type IV | \(R_{\gamma} \otimes R_{\xi}\) | \(\checkmark\) | \(\checkmark\) |
Interaction | \(R_{\delta}\) | Constraints |
---|---|---|
Type I | \(I_{T} \otimes I_{n}\) | \(\sum\limits_{i=1}^n \xi_i=0, \, \sum\limits_{t=1}^T \gamma_t=0, \, \mbox{ and } \, \sum\limits_{i=1}^n \sum\limits_{t=1}^T \delta_{it}=0.\) |
Type II | \(R_{\gamma} \otimes I_{n}\) | \(\sum\limits_{i=1}^n \xi_i=0, \, \sum\limits_{t=1}^T \gamma_t=0, \, \mbox{ and } \, \sum\limits_{t=1}^T \delta_{it}=0, \, \mbox{for } \, i=1,\ldots,n.\) |
Type III | \(I_{T} \otimes R_{\xi}\) | \(\sum\limits_{i=1}^n \xi_i=0, \, \sum\limits_{t=1}^T \gamma_t=0, \, \mbox{ and } \, \sum\limits_{i=1}^n \delta_{it}=0, \, \mbox{for } \, t=1,\ldots,T.\) |
Type IV | \(R_{\gamma} \otimes R_{\xi}\) | \(\sum\limits_{i=1}^n \xi_i=0, \, \sum\limits_{t=1}^T \gamma_t=0, \, \mbox{ and } \, \begin{array}{l} \sum\limits_{t=1}^T \delta_{it}=0, \, \mbox{for } \, i=1,\ldots,n, \\ \sum\limits_{i=1}^n \delta_{it}=0, \, \mbox{for } \, t=1,\ldots,T. \\ \end{array}\) |
In all the models for latent Gaussian fields described in this vignette, prior distributions for the precision parameters have to be specified. By default, log-Gamma distributions are given to
the log-precision parameters in R-INLA
. However, these priors may lead to wrong results and
have been criticized in the literature (Carroll et al., 2015; Simpson et al., 2017).
Here, improper uniform prior distribution on the positive real line are considered for the standard deviations, i.e., \(\sigma=1/\sqrt{\tau} \sim U(0,\infty)\) and standard uniform distribution for the spatial smoothing parameter, i.e., \(\beta \sim U(0,1)\equiv \mbox{Beta}(1,1)\) when fitting the LCAR or BYM2 model for the spatial random effect. Penalised complexity (PC) priors (Simpson et al., 2017) are also available in STCAR_INLA()
function by setting the argument PCpriors=TRUE
. If PC priors are used for the precision of a Gaussian random effect, the parameters \(U\) and \(\alpha\) must be specified so that \(P(\sigma>U)=\alpha\). The default values in R-INLA
\((U,\alpha)=(1,0.01)\) are considered when fitting the iCAR and RW1/RW2 prior distributions, as well as for the corresponding space–time interaction effect. If the BYM2 model is selected for the spatial random effect \(\xi\), the values \((U,\alpha)=(0.5,0.5)\) are given to the probability statement \(P(\lambda_{\xi}>U)=\alpha\).
STCAR_INLA
functionAs in the CAR_INLA()
function, three main modelling approaches can be considered:
model="global"
argument),model="partition"
and k=0
arguments),model="partition"
and k>0
arguments).For both the disjoint and \(k\)-order neighbour models, parallel or distributed computation strategies can be performed to speed up computations by using the future package (Bengtsson, 2021). See the vignette bigDM: parallel and distributed modelling for some examples and further details.
The data and its associated cartography file need to be specified into the STCAR_INLA()
function. These are some of the most relevant arguments of this function:
carto
: an object of class sf
or SpatialPolygonsDataFrame
that must contain at least the target variable of interest specified in the argument ID.area
.data
: an object of class data.frame
that must contain the target variables of interest specified in the arguments ID.area
, ID.year
, O
and E
.ID.area
: name of the variable that contains the IDs of spatial areal units. The values of this variable must match those given in the carto and data variable.ID.year
: name of the variable that contains the IDs of time points.ID.group
: name of the variable that contains the IDs of the spatial partition (grouping variable). Only required if model="partition"
.O
: name of the variable that contains the observed number of disease cases for each areal and time point.E
: name of the variable that contains either the expected number of disease cases or the population at risk for each areal unit and time point.W
: optional argument with the binary adjacency matrix of the spatial areal units. If NULL
(default), this object is computed from the carto
argument (two areas are considered as neighbours if they share a common border).merge.strategy
: one of either "mixture"
or "original"
(default), that specifies the merging strategy to compute posterior marginal estimates of the linear predictor (log-risks or log-rates). See mergeINLA()
function for further details.compute.fitted.values
: logical value (default FALSE
); if TRUE
transforms the posterior marginal distribution of the linear predictor to the exponential scale (risks or rates). CAUTION: This method might be time consuming.The Carto_SpainMUN
object included in the bigDM package, contains the spatial polygons of the municipalities of continental Spain and simulated colorectal cancer mortality data (see the examples of the CAR_INLA function).
library(INLA)
library(bigDM)
data(Carto_SpainMUN)
head(Carto_SpainMUN)
#> Simple feature collection with 6 features and 10 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: 485318 ymin: 4727428 xmax: 543317 ymax: 4779153
#> Projected CRS: ETRS89 / UTM zone 30N
#> ID name lat long area
#> 1 01001 Alegria-Dulantzi 4742974 4742974 19913794
#> 2 01002 Amurrio 4765886 4765886 96145595
#> 3 01003 Aramaio 4766225 4766225 73338806
#> 4 01004 Artziniega 4775453 4775453 27506468
#> 5 01006 Arminon 4729884 4729884 10559721
#> 6 01008 Arrazua-Ubarrundia (San Martin de Ania) 4752714 4752714 57502811
#> perimeter obs exp SMR region geometry
#> 1 34372.11 2 3.0237149 0.6614380 Pais Vasco MULTIPOLYGON (((538259 4737...
#> 2 63352.32 28 20.8456682 1.3432047 Pais Vasco MULTIPOLYGON (((503520 4760...
#> 3 41430.46 6 3.7527301 1.5988360 Pais Vasco MULTIPOLYGON (((533286 4759...
#> 4 22605.22 3 3.2093191 0.9347777 Pais Vasco MULTIPOLYGON (((491260 4776...
#> 5 17847.35 0 0.4817391 0.0000000 Pais Vasco MULTIPOLYGON (((509851 4727...
#> 6 64968.81 2 1.9643891 1.0181282 Pais Vasco MULTIPOLYGON (((534678 4746...
In this vignette, simulated data of lung cancer mortality data during the period 1991-2015 included in the object Data_LungCancer
will be used as illustration (modified in order to preserve the confidentiality of the original data).
data(Data_LungCancer)
str(Data_LungCancer)
#> 'data.frame': 197675 obs. of 6 variables:
#> $ ID : chr "01001" "01002" "01003" "01004" ...
#> $ year: int 1991 1991 1991 1991 1991 1991 1991 1991 1991 1991 ...
#> $ obs : int 0 3 0 0 0 0 1 2 0 0 ...
#> $ exp : num 0.276 2.721 0.496 0.37 0.07 ...
#> $ SMR : num 0 1.1 0 0 0 ...
#> $ pop : num 483.8 4949 667.9 591.1 62.8 ...
Note that both objects contains a common identification variable of the areal units named as ID
.
We refer as Global model to the spatio-temporal model described in Equation (1), where the whole neighbourhood graph of the areal units is considered to define the adjacency matrix \(\textbf{W}\).
The Global model with a BYM2 spatial random effect, RW1 temporal random effect and Type IV interaction random effect (default values) is fitted using the STCAR_INLA()
function as
# Not run:
Global <- STCAR_INLA(carto=Carto_SpainMUN, data=Data_LungCancer,
ID.area="ID", ID.year="year", O="obs", E="exp",
spatial="BYM2", temporal="rw1", interaction="TypeIV",
model="global", strategy="gaussian")
When the number of small areas and time periods increases considerably (as is the case of analysing count data at municipality level), fitting the Global model becomes computationally very demanding or even unfeasible. Instead of considering global random effects whose correlation structures are based on the whole spatial/temporal neighbourhood graphs of the areal-time units, we propose to divide the data into \(D\) spatial subdomains. Doing so, local spatio-temporal models can be fitted simultaneously (in parallel or distributed) substantially reducing the computational time. In what follows, we show how to fit the Disjoint and k-order neighbourhood models (extending the methodology described in Orozco-Acosta et al. (2021)) for estimating spatio-temporal disease risks using the STCAR_INLA()
function.
Note that in the Disjoint model each area-time unit belongs to a single subdomain. Consequently, the log-risk surface \(\log{\boldsymbol{r}} = (\log{\boldsymbol{r}_1},\ldots,\log{\boldsymbol{r}_{D}})^{'}\) is just the union of the posterior marginal estimates of each spatio-temporal sub-model.
For our example data in Data_LungCancer
we propose to divide the data into the \(D_s=47\) provinces of continental Spain. To classify the areas into provinces, the first two digits of the ID.area
variable is used.
Carto_SpainMUN$ID.prov <- substr(Carto_SpainMUN$ID,1,2)
In the code below, we show how to fit the Disjoint model with Type I space-time interaction random effect and Gaussian approximation strategy using 4 local clusters (in parallel)
Disjoint <- STCAR_INLA(carto=Carto_SpainMUN, data=Data_LungCancer,
ID.area="ID", ID.year="year", O="obs", E="exp", ID.group="ID.prov",
spatial="BYM2", temporal="rw1", interaction="TypeI",
model="partition", k=0, strategy="gaussian",
plan="cluster", workers=rep("localhost",4))
#> STEP 1: Pre-processing data
#> STEP 2: Fitting partition (k=0) model with INLA
#> + Model 1 of 47
#> + Model 2 of 47
#> ...
#> + Model 47 of 47
#> STEP 3: Merging the results
summary(Disjoint)
#>
#> Call:
#> c("inla.core(formula = formula, family = family, contrasts = contrasts,
#> ", " data = data, quantiles = quantiles, E = E, offset = offset, ", "
#> scale = scale, weights = weights, Ntrials = Ntrials, strata = strata,
#> ", " lp.scale = lp.scale, link.covariates = link.covariates, verbose =
#> verbose, ", " lincomb = lincomb, selection = selection, control.compute
#> = control.compute, ", " control.predictor = control.predictor,
#> control.family = control.family, ", " control.inla = control.inla,
#> control.fixed = control.fixed, ", " control.mode = control.mode,
#> control.expert = control.expert, ", " control.hazard = control.hazard,
#> control.lincomb = control.lincomb, ", " control.update =
#> control.update, control.lp.scale = control.lp.scale, ", "
#> control.pardiso = control.pardiso, only.hyperparam = only.hyperparam,
#> ", " inla.call = inla.call, inla.arg = inla.arg, num.threads =
#> num.threads, ", " blas.num.threads = blas.num.threads, keep = keep,
#> working.directory = working.directory, ", " silent = silent, inla.mode
#> = inla.mode, safe = FALSE, debug = debug, ", " .parent.frame =
#> .parent.frame)")
#> Time used:
#> Running = 222, Merging = 355, Total = 577, NA = NA
#> Fixed effects:
#> mean sd 0.025quant 0.5quant 0.975quant
#> (Intercept) -0.083 0.003 -0.09 -0.083 -0.076
#>
#> Random effects:
#> Name Model
#> ID.area BYM2 model
#> ID.year RW1 model
#> ID.area.year IID model
#>
#> Deviance Information Criterion (DIC) ...............: 339074.02
#> Deviance Information Criterion (DIC, saturated) ....: 148890.00
#> Effective number of parameters .....................: 3448.13
#>
#> Watanabe-Akaike information criterion (WAIC) ...: 339025.57
#> Effective number of parameters .................: 3247.14
#>
#> is computed
#> Posterior summaries for the linear predictor and the fitted values are computed
#> (Posterior marginals needs also 'control.compute=list(return.marginals.predictor=TRUE)')
* Computations are made in personal computer with a 3.41 GHz Intel Core i5-7500 processor and 32GB RAM using R-INLA stable version INLA_23.04.24.
Assuming independence between areas belonging to different sub-domains could be very restrictive and it may lead to border effects. The k-order neighbourhood model avoids this undesirable issue by adding neighbouring areas to each partition of the spatial domain.