Introduction

In previous vignettes, we show how to fit spatial Poisson mixed models for high-dimensional areal count data and how to use parallel or distributed computation strategies using the bigDM package. Here, we describe how to use this package to fit spatio-temporal models with conditional autoregressive (CAR) priors for space and random walk (RW) priors for time including space-time interactions (Knorr-Held, 2000; Ugarte et al., 2014) by extending the scalable model’s proposals described in Orozco-Acosta et al. (2021) to deal with massive spatio-temporal data (Orozco-Acosta et al., 2022).

Spatio-temporal models in disease mapping

Let us assume that the region under study is divided into \(n\) contiguous small areas labeled as \(i=1,\ldots,n\) and data are available for consecutive time periods labeled as \(t=1,\ldots,T\). For a given area \(i\) and time period \(t\),

  • \(O_{it}\) denotes the number of observed cases for area \(i\) and time \(t\)
  • \(E_{it}\) denotes the number of expected cases for area \(i\) and time \(t\)
  • \(r_{it}\) is the relative risk of mortality (incidence)

To compute the number of expected cases \(E_{it}\), both direct and indirect standardization procedures can be performed (Ugarte, 2009), usually considering age and/or sex as standardization variables. Using the indirect standardization method, the number of expected cases for area \(i\) and time \(t\) is defined as

\[E_{it}=\sum_{j=1}^{J}N_{it}\frac{O_j}{N_j} \ \ \mbox{for} \ \ i=1,\ldots,n; \ \ t=1,\ldots,T,\] where \(O_j=\sum_{i=1}^{n}\sum_{t=1}^{T}O_{itj}\) and \(N_j=\sum_{i=1}^{n}\sum_{t=1}^{T}N_{itj}\) are the number of observed cases and the population at risk in the \(j^{th}\) age-group, respectively. Then, the standardized mortality/incidence ratio (SMR or SIR) is defined as the number of observed cases divided by the number of expected cases. Although its interpretation is very simple, these measures are extremely variable when analyzing rare diseases or very low-populated areas, as it is the case of high-dimensional data. This makes it necessary the use of statistical models to smooth risks borrowing information from neighbouring regions and time periods.

Poisson mixed models are typically used for the analysis of count data within a hierarchical Bayesian framework. Conditional to the relative risk \(r_{it}\), the number of observed cases in the \(i\)th area and time period \(t\) is assumed to be Poisson distributed with mean \(\mu_{it} = E_{it}r_{it}\). That is,

\[\begin{eqnarray*} \label{eq:Model_Poisson} \begin{array}{rcl} O_{it}|r_{it} & \sim & Poisson(\mu_{it}=E_{it}r_{it}),\\ \log \mu_{it} & = & \log E_{it}+\log r_{it}, \end{array} \end{eqnarray*}\]

where \(\log E_{it}\) is an offset. Depending on the specification of the log-risks different models can be defined. The non-parametric models based on CAR priors for spatial random effects, random walk priors for temporal random effects, and different types of spatio-temporal interactions described in Knorr-Held (2000) are currently implemented in the bigDM package.

So, the log-risks are modelled as

\[\begin{equation} \label{eq:logrisk} \log r_{it}=\alpha+\xi_{i}+\gamma_t+\delta_{it}, \tag{0.1} \end{equation}\]

where:

  • \(\alpha\) is a global intercept
  • \(\xi_i\) is a spatial random effect with CAR prior distribution.
  • \(\gamma_t\) is a temporally structured random effect that follows random walk prior distribution
  • \(\delta_{it}\) is a spatio-temporal random effect (four types of interactions)

In what follows, we will refer to model (0.1) as the Global model. These models are flexible enough to describe many real situations, and their interpretation is simple and attractive. However, the models are typically not identifiable and appropriate sum-to-zero constraints must be imposed over the random effects (Goicoa et al., 2018). See Table 0.2 for a full description of the identifiability constraints that needs to be imposed in each type of space-time interaction.

Prior distributions for the random effects

Several priors distributions for spatial and temporal random effect are implemented in the STCAR_INLA() function, which are specified through the spatial=... and temporal=... arguments.

For the spatial random effect, the same values as the prior=... argument in the CAR_INLA()function are defined (see the vignette bigDM: fitting spatial models for more details). For the temporally structured random effect, random walks of first (RW1) or second order (RW2) prior distributions can be assumed as follow:

\[\begin{equation} \label{eq:temporal} \gamma \sim N(0,[\tau_{\gamma}R_{\gamma}]^{-}), \end{equation}\]

where \(\tau_{\gamma}\) is a precision parameter and \(R_{\gamma}\) is the \(T \times T\) structure matrix of a RW1/RW2 (see Rue & Held (2005), pp. 95 and 110) and \(^{-}\) denotes the Moore-Penrose generalized inverse. Finally, the following prior distribution is assumed for the space-time interaction random effect \(\delta=(\delta_{11},\ldots,\delta_{n1},\ldots,\delta_{1T},\ldots,\delta_{nT})^{'}\)

\[\begin{equation} \label{eq:space-time} \delta \sim N(0,[\tau_{\delta}R_{\delta}]^{-}). \end{equation}\]

Here, \(\tau_{\delta}\) is a precision parameter and \(R_{\delta}\) is the \(nT \times nT\) matrix obtained as the Kronecker product of the corresponding spatial and temporal structure matrices (recall that \(R_{\xi}=\textbf{D}_{W}-\textbf{W}\)), where four types of interactions can be considered (see Table 0.1).

Table 0.1: Specification for the different types of space-time interactions
Interaction \(R_{\delta}\) Spatial correlation Temporal correlation
Type I \(I_{T} \otimes I_{n}\) - -
Type II \(R_{\gamma} \otimes I_{n}\) - \(\checkmark\)
Type III \(I_{T} \otimes R_{\xi}\) \(\checkmark\) -
Type IV \(R_{\gamma} \otimes R_{\xi}\) \(\checkmark\) \(\checkmark\)
Table 0.2: Identifiability constraints for the different types of space-time interaction effects in CAR models
Interaction \(R_{\delta}\) Constraints
Type I \(I_{T} \otimes I_{n}\) \(\sum\limits_{i=1}^n \xi_i=0, \, \sum\limits_{t=1}^T \gamma_t=0, \, \mbox{ and } \, \sum\limits_{i=1}^n \sum\limits_{t=1}^T \delta_{it}=0.\)
Type II \(R_{\gamma} \otimes I_{n}\) \(\sum\limits_{i=1}^n \xi_i=0, \, \sum\limits_{t=1}^T \gamma_t=0, \, \mbox{ and } \, \sum\limits_{t=1}^T \delta_{it}=0, \, \mbox{for } \, i=1,\ldots,n.\)
Type III \(I_{T} \otimes R_{\xi}\) \(\sum\limits_{i=1}^n \xi_i=0, \, \sum\limits_{t=1}^T \gamma_t=0, \, \mbox{ and } \, \sum\limits_{i=1}^n \delta_{it}=0, \, \mbox{for } \, t=1,\ldots,T.\)
Type IV \(R_{\gamma} \otimes R_{\xi}\) \(\sum\limits_{i=1}^n \xi_i=0, \, \sum\limits_{t=1}^T \gamma_t=0, \, \mbox{ and } \, \begin{array}{l} \sum\limits_{t=1}^T \delta_{it}=0, \, \mbox{for } \, i=1,\ldots,n, \\ \sum\limits_{i=1}^n \delta_{it}=0, \, \mbox{for } \, t=1,\ldots,T. \\ \end{array}\)

Prior distributions for the hyperparameters

In all the models for latent Gaussian fields described in this vignette, prior distributions for the precision parameters have to be specified. By default, log-Gamma distributions are given to the log-precision parameters in R-INLA. However, these priors may lead to wrong results and have been criticized in the literature (Carroll et al., 2015; Simpson et al., 2017).

Here, improper uniform prior distribution on the positive real line are considered for the standard deviations, i.e., \(\sigma=1/\sqrt{\tau} \sim U(0,\infty)\) and standard uniform distribution for the spatial smoothing parameter, i.e., \(\beta \sim U(0,1)\equiv \mbox{Beta}(1,1)\) when fitting the LCAR or BYM2 model for the spatial random effect. Penalised complexity (PC) priors (Simpson et al., 2017) are also available in STCAR_INLA()function by setting the argument PCpriors=TRUE. If PC priors are used for the precision of a Gaussian random effect, the parameters \(U\) and \(\alpha\) must be specified so that \(P(\sigma>U)=\alpha\). The default values in R-INLA \((U,\alpha)=(1,0.01)\) are considered when fitting the iCAR and RW1/RW2 prior distributions, as well as for the corresponding space–time interaction effect. If the BYM2 model is selected for the spatial random effect \(\xi\), the values \((U,\alpha)=(0.5,0.5)\) are given to the probability statement \(P(\lambda_{\xi}>U)=\alpha\).

The STCAR_INLA function

As in the CAR_INLA() function, three main modelling approaches can be considered:

  • the usual model with a global spatial random effect whose dependence structure is based on the whole neighbourhood graph of the areal units (model="global" argument),
  • a disjoint model based on a partition of the whole spatial domain where independent spatial CAR models are simultaneously fitted in each partition (model="partition" and k=0 arguments),
  • a modelling approach where \(k\)-order neighbours are added to each partition to avoid border effects in the disjoint model (model="partition" and k>0 arguments).

For both the disjoint and \(k\)-order neighbour models, parallel or distributed computation strategies can be performed to speed up computations by using the future package (Bengtsson, 2021). See the vignette bigDM: parallel and distributed modelling for some examples and further details.

Both the data of analysis and its associated cartography file needs specified into the STCAR_INLA() function. These are some of the most relevant arguments of this function:

  • carto: an object of class sf or SpatialPolygonsDataFrame that must contain at least the target variable of interest specified in the argument ID.area.
  • data: an object of class data.frame which must contain the target variables of interest specified in the arguments ID.area, ID.year, O and E.
  • ID.area: name of the variable which contains the IDs of spatial areal units. The values of this variable must match those given in the carto and data variable.
  • ID.year: name of the variable which contains the IDs of time points.
  • ID.group: name of the variable which contains the IDs of the spatial partition (grouping variable). Only required if model="partition".
  • O: name of the variable which contains the observed number of disease cases for each areal and time point.
  • E: name of the variable which contains either the expected number of disease cases or the population at risk for each areal unit and time point.
  • W: optional argument with the binary adjacency matrix of the spatial areal units. If NULL (default), this object is computed from the carto argument (two areas are considered as neighbours if they share a common border).
  • merge.strategy: one of either "mixture" or "original" (default), which specifies the merging strategy to compute posterior marginal estimates of relative risks. See mergeINLA() function for further details.

The Carto_SpainMUN object included in the bigDM package, contains the spatial polygons of the municipalities of continental Spain and simulated colorectal cancer mortality data (see the examples of the CAR_INLA function).

library(INLA)
library(bigDM)

data(Carto_SpainMUN)
head(Carto_SpainMUN)
#> Simple feature collection with 6 features and 10 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 485318 ymin: 4727428 xmax: 543317 ymax: 4779153
#> Projected CRS: ETRS89 / UTM zone 30N
#>      ID                                    name     lat    long     area
#> 1 01001                        Alegria-Dulantzi 4742974 4742974 19913794
#> 2 01002                                 Amurrio 4765886 4765886 96145595
#> 3 01003                                 Aramaio 4766225 4766225 73338806
#> 4 01004                              Artziniega 4775453 4775453 27506468
#> 5 01006                                 Arminon 4729884 4729884 10559721
#> 6 01008 Arrazua-Ubarrundia (San Martin de Ania) 4752714 4752714 57502811
#>   perimeter obs        exp       SMR     region                       geometry
#> 1  34372.11   2  3.0237149 0.6614380 Pais Vasco MULTIPOLYGON (((538259 4737...
#> 2  63352.32  28 20.8456682 1.3432047 Pais Vasco MULTIPOLYGON (((503520 4760...
#> 3  41430.46   6  3.7527301 1.5988360 Pais Vasco MULTIPOLYGON (((533286 4759...
#> 4  22605.22   3  3.2093191 0.9347777 Pais Vasco MULTIPOLYGON (((491260 4776...
#> 5  17847.35   0  0.4817391 0.0000000 Pais Vasco MULTIPOLYGON (((509851 4727...
#> 6  64968.81   2  1.9643891 1.0181282 Pais Vasco MULTIPOLYGON (((534678 4746...

In this vignette, simulated data of lung cancer mortality data during the period 1991-2015 included in the object Data_LungCancer will be used as illustration (modified in order to preserve the confidentiality of the original data).

data(Data_LungCancer)
head(Data_LungCancer)
#>        ID year obs        exp      SMR
#> 1   01001 1991   0 0.27554975 0.000000
#> 33  01002 1991   3 2.72124997 1.102435
#> 55  01003 1991   0 0.49612295 0.000000
#> 78  01004 1991   0 0.37040942 0.000000
#> 103 01006 1991   0 0.06999827 0.000000
#> 133 01008 1991   0 0.29240328 0.000000

Note that both objects contains a common identification variable of the areal units named as ID.

Global model

We refer as Global model to spatio-temporal model described in Equation (0.1), where the whole neighbourhood graph of the areal units is considered to define the adjacency matrix \(\textbf{W}\).

The Global model with a BYM2 spatial random effect, RW1 temporal random effect and Type IV interaction random effect (default values) is fitted using the STCAR_INLA() function as

# Not run:
Global <- STCAR_INLA(carto=Carto_SpainMUN, data=Data_LungCancer,
                     ID.area="ID", ID.year="year", O="obs", E="exp",
                     spatial="BYM2", temporal="rw1", interaction="TypeIV",
                     model="global", strategy="gaussian")

When the number of small areas and time periods increases considerably (as is the case of analysing count data at municipality level), fitting the Global model becomes computationally very demanding or even unfeasible. Instead of considering global random effects whose correlation structures are based on the whole spatial/temporal neighbourhood graphs of the areal-time units, we propose to divide the data into \(D=D_{s}\times D_{t}\) sub-domains, where \(D_{s}\) and \(D_{t}\) are the number spatial and temporal partitions, respectively. Doing so, local spatio-temporal models can be fitted simultaneously (in parallel or distributed) substantially reducing the computational time. In what follows, we show how to fit the Disjoint and k-order neighbourhood models (extending the methodology described in Orozco-Acosta et al. (2021)) for estimating spatio-temporal disease risks using the STCAR_INLA() function.

Disjoint model

Note that in the Disjoint model each area-time unit area-time unit belongs to a single sub-domain. Consequently, the log-risk surface \(\log{\boldsymbol{r}} = (\log{\boldsymbol{r}_1},\ldots,\log{\boldsymbol{r}_{D}})^{'}\) is just the union of the posterior marginal estimates of each spatio-temporal sub-model.

For our example data in Data_LungCancer we propose to divide the data into the \(D_s=47\) provinces of continental Spain (no temporary partition is defined as only 25 time periods are being considered). To classify the areas into provinces, the first two digits of the ID.area variable is used.

Carto_SpainMUN$ID.prov <- substr(Carto_SpainMUN$ID,1,2)

In the code below, we show how to fit the Disjoint model with Type I space-time interaction random effect and Gaussian approximation strategy using 4 local clusters (in parallel)

Disjoint <- STCAR_INLA(carto=Carto_SpainMUN, data=Data_LungCancer,
                       ID.area="ID", ID.year="year", O="obs", E="exp", ID.group="ID.prov",
                       spatial="BYM2", temporal="rw1", interaction="TypeI",
                       model="partition", k=0, strategy="gaussian",
                       plan="cluster", workers=rep("localhost",4))
#> STEP 1: Pre-processing data
#> STEP 2: Fitting partition (k=0) model with INLA
#> + Model 1 of 47
#> + Model 2 of 47
#> ...
#> + Model 47 of 47
#> STEP 3: Merging the results

summary(Disjoint)
#>
#> Call:
#>    c("inla.core(formula = formula, family = family, contrasts = contrasts,
#>    ", " data = data, quantiles = quantiles, E = E, offset = offset, ", "
#>    scale = scale, weights = weights, Ntrials = Ntrials, strata = strata,
#>    ", " lp.scale = lp.scale, link.covariates = link.covariates, verbose =
#>    verbose, ", " lincomb = lincomb, selection = selection, control.compute
#>    = control.compute, ", " control.predictor = control.predictor,
#>    control.family = control.family, ", " control.inla = control.inla,
#>    control.fixed = control.fixed, ", " control.mode = control.mode,
#>    control.expert = control.expert, ", " control.hazard = control.hazard,
#>    control.lincomb = control.lincomb, ", " control.update =
#>    control.update, control.lp.scale = control.lp.scale, ", "
#>    control.pardiso = control.pardiso, only.hyperparam = only.hyperparam,
#>    ", " inla.call = inla.call, inla.arg = inla.arg, num.threads =
#>    num.threads, ", " blas.num.threads = blas.num.threads, keep = keep,
#>    working.directory = working.directory, ", " silent = silent, inla.mode
#>    = inla.mode, safe = FALSE, debug = debug, ", " .parent.frame =
#>    .parent.frame)")
#> Time used:
#>     Running = 226, Merging = 320, Total = 546, NA = NA
#> Random effects:
#>   Name     Model
#>     ID.area BYM2 model
#>    ID.year RW1 model
#>    ID.area.year IID model
#>
#> Deviance Information Criterion (DIC) ...............: 339076.57
#> Deviance Information Criterion (DIC, saturated) ....: 148892.56
#> Effective number of parameters .....................: 3452.94
#>
#> Watanabe-Akaike information criterion (WAIC) ...: 339026.23
#> Effective number of parameters .................: 3250.64
#>
#>  is computed
#> Posterior summaries for the linear predictor and the fitted values are computed
#> (Posterior marginals needs also 'control.compute=list(return.marginals.predictor=TRUE)')

* Computations are made in personal computer with a 3.41 GHz Intel Core i5-7500 processor and 32GB RAM using R-INLA stable version INLA_22.05.03.

k-order neighbourhood model

Assuming independence between areas belonging to different sub-domains could be very restrictive and it may lead to border effects. The k-order neighbourhood model avoids this undesirable issue by adding neighbouring areas to each partition of the spatial domain.