Introduction

When fitting both the disjoint and k-order neighbour models with the bigDM package (Orozco-Acosta et al., 2021, 2022) parallel or distributed computation strategies can be performed to speed up computations by using the future package (Bengtsson, 2021).

Different computation strategies can be specified through the plan argument of the CAR_INLA() and STCAR_INLA() functions (available for bigDM version 0.4 and higher). For further details about these strategies see the documentation of the future package and its corresponding vignettes.

  • If plan="sequential" (default), the models are fitted sequentially and in the current R session (local machine).

  • If plan="cluster", multiple models can be fitted in parallel on external R sessions (local machine) or distributed in remote compute nodes. When using this option, the identifications of the local or remote workers where the models are going to be processed must be specified through the workers argument. The following options are available:

    1) Using local machines
    ## single local machine
    workers <- "localhost"
    # or
    workers <- Sys.info()[["nodename"]]
    
    ## multiple local machines
    workers <- rep("localhost", 2)  # two local machines
    workers <- future::availableWorkers()  # total number of available workers
    2) Using remote compute nodes
    ## single remote machine
    workers <- "172.0.0.1"  # using IP address
    workers <- "machine1.remote.es"  # using hostname
    
    ## multiple remote machines
    workers <- c("172.0.0.1","172.0.0.2")
    # or
    workers <- c("machine1.remote.es","machine2.remote.es")
    3) Using both local and remote machines
    workers <- c("localhost","172.0.0.1","172.0.0.2")
    4) It is also possible to fit more than one model simultaneously in each remote machine by setting
    ## two models per remote machine
    workers <- rep(c("172.0.0.1","172.0.0.2"), each=2)
    
    ## two models per local and remote machines
    workers <- rep(c("localhost","172.0.0.1","172.0.0.2"), each=2)

    Important note: To work with remote workers, public and private keys between the different machines must be previously configured. In addition, the bigDM package needs to be installed in all the local/remote machines. More details can be found at this vignette and in the documentation of the makeClusterPSOCK function.

Example: Spanish colorectal cancer mortality data

In what follows, we show how to fit the Disjoint model for the Spanish colorectal cancer mortality data using different computation strategies. A full description of the data and usage of the CAR_INLA() function is given in the vignette bigDM: fitting spatial models.

library(INLA)
library(bigDM)

data(Carto_SpainMUN)
head(Carto_SpainMUN)
#> Simple feature collection with 6 features and 10 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 485318 ymin: 4727428 xmax: 543317 ymax: 4779153
#> Projected CRS: ETRS89 / UTM zone 30N
#>      ID                                    name     lat    long     area
#> 1 01001                        Alegria-Dulantzi 4742974 4742974 19913794
#> 2 01002                                 Amurrio 4765886 4765886 96145595
#> 3 01003                                 Aramaio 4766225 4766225 73338806
#> 4 01004                              Artziniega 4775453 4775453 27506468
#> 5 01006                                 Arminon 4729884 4729884 10559721
#> 6 01008 Arrazua-Ubarrundia (San Martin de Ania) 4752714 4752714 57502811
#>   perimeter obs        exp       SMR     region                       geometry
#> 1  34372.11   2  3.0237149 0.6614380 Pais Vasco MULTIPOLYGON (((538259 4737...
#> 2  63352.32  28 20.8456682 1.3432047 Pais Vasco MULTIPOLYGON (((503520 4760...
#> 3  41430.46   6  3.7527301 1.5988360 Pais Vasco MULTIPOLYGON (((533286 4759...
#> 4  22605.22   3  3.2093191 0.9347777 Pais Vasco MULTIPOLYGON (((491260 4776...
#> 5  17847.35   0  0.4817391 0.0000000 Pais Vasco MULTIPOLYGON (((509851 4727...
#> 6  64968.81   2  1.9643891 1.0181282 Pais Vasco MULTIPOLYGON (((534678 4746...

For our example data in Carto_SpainMUN the \(D=15\) Autonomous Regions of Spain are used as a partition of the \(n=7907\) municipalites (region variable of the sf object).


1. Run the models sequentially in the current R session (default option)

Model1 <- CAR_INLA(carto=Carto_SpainMUN, ID.area="ID", ID.group="region", O="obs", E="exp",
                   model="partition", k=0, seed=1234, strategy="simplified.laplace",
                   plan="sequential", workers=NULL)
#> STEP 1: Pre-processing data
#> STEP 2: Fitting partition (k=0) model with INLA
#> + Model 1 of 15
#> + Model 2 of 15
#> + Model 3 of 15
#> + Model 4 of 15
#> + Model 5 of 15
#> + Model 6 of 15
#> + Model 7 of 15
#> + Model 8 of 15
#> + Model 9 of 15
#> + Model 10 of 15
#> + Model 11 of 15
#> + Model 12 of 15
#> + Model 13 of 15
#> + Model 14 of 15
#> + Model 15 of 15
#> STEP 3: Merging the results

summary(Model1)
#>
#> Call:
#>    c("inla.core(formula = formula, family = family, contrasts = contrasts,
#>    ", " data = data, quantiles = quantiles, E = E, offset = offset, ", "
#>    scale = scale, weights = weights, Ntrials = Ntrials, strata = strata,
#>    ", " lp.scale = lp.scale, link.covariates = link.covariates, verbose =
#>    verbose, ", " lincomb = lincomb, selection = selection, control.compute
#>    = control.compute, ", " control.predictor = control.predictor,
#>    control.family = control.family, ", " control.inla = control.inla,
#>    control.fixed = control.fixed, ", " control.mode = control.mode,
#>    control.expert = control.expert, ", " control.hazard = control.hazard,
#>    control.lincomb = control.lincomb, ", " control.update =
#>    control.update, control.lp.scale = control.lp.scale, ", "
#>    control.pardiso = control.pardiso, only.hyperparam = only.hyperparam,
#>    ", " inla.call = inla.call, inla.arg = inla.arg, num.threads =
#>    num.threads, ", " blas.num.threads = blas.num.threads, keep = keep,
#>    working.directory = working.directory, ", " silent = silent, inla.mode
#>    = inla.mode, safe = FALSE, debug = debug, ", " .parent.frame =
#>    .parent.frame)")
#> Time used:
#>     Pre = 8.34, Running = 76.2, Post = 4.37, Merging = 18.7
#> Random effects:
#>   Name     Model
#>     ID.area Generic1 model
#>
#> Deviance Information Criterion (DIC) ...............: 27449.04
#> Deviance Information Criterion (DIC, saturated) ....: 8385.59
#> Effective number of parameters .....................: 400.88
#>
#> Watanabe-Akaike information criterion (WAIC) ...: 27433.89
#> Effective number of parameters .................: 341.94
#>
#>  is computed
#> Posterior summaries for the linear predictor and the fitted values are computed
#> (Posterior marginals needs also 'control.compute=list(return.marginals.predictor=TRUE)')

2. Run the models in parallel on external R sessions

workers <- rep("localhost", 4)

Model2 <- CAR_INLA(carto=Carto_SpainMUN, ID.area="ID", ID.group="region", O="obs", E="exp",
                   model="partition", k=0, seed=1234, strategy="simplified.laplace",
                   plan="cluster", workers=workers)
#> STEP 1: Pre-processing data
#> STEP 2: Fitting partition (k=0) model with INLA
#> + Model 1 of 15
#> + Model 2 of 15
#> + Model 3 of 15
#> + Model 4 of 15
#> + Model 5 of 15
#> + Model 6 of 15
#> + Model 7 of 15
#> + Model 8 of 15
#> + Model 9 of 15
#> + Model 10 of 15
#> + Model 11 of 15
#> + Model 12 of 15
#> + Model 13 of 15
#> + Model 14 of 15
#> + Model 15 of 15
#> STEP 3: Merging the results

summary(Model2)
#>
#> Call:
#>    c("inla.core(formula = formula, family = family, contrasts = contrasts,
#>    ", " data = data, quantiles = quantiles, E = E, offset = offset, ", "
#>    scale = scale, weights = weights, Ntrials = Ntrials, strata = strata,
#>    ", " lp.scale = lp.scale, link.covariates = link.covariates, verbose =
#>    verbose, ", " lincomb = lincomb, selection = selection, control.compute
#>    = control.compute, ", " control.predictor = control.predictor,
#>    control.family = control.family, ", " control.inla = control.inla,
#>    control.fixed = control.fixed, ", " control.mode = control.mode,
#>    control.expert = control.expert, ", " control.hazard = control.hazard,
#>    control.lincomb = control.lincomb, ", " control.update =
#>    control.update, control.lp.scale = control.lp.scale, ", "
#>    control.pardiso = control.pardiso, only.hyperparam = only.hyperparam,
#>    ", " inla.call = inla.call, inla.arg = inla.arg, num.threads =
#>    num.threads, ", " blas.num.threads = blas.num.threads, keep = keep,
#>    working.directory = working.directory, ", " silent = silent, inla.mode
#>    = inla.mode, safe = FALSE, debug = debug, ", " .parent.frame =
#>    .parent.frame)")
#> Time used:
#>     Running = 82.9, Merging = 18.2, Total = 101, NA = NA
#> Random effects:
#>   Name     Model
#>     ID.area Generic1 model
#>
#> Deviance Information Criterion (DIC) ...............: 27451.33
#> Deviance Information Criterion (DIC, saturated) ....: 8387.88
#> Effective number of parameters .....................: 402.53
#>
#> Watanabe-Akaike information criterion (WAIC) ...: 27436.31
#> Effective number of parameters .................: 343.59
#>
#>  is computed
#> Posterior summaries for the linear predictor and the fitted values are computed
#> (Posterior marginals needs also 'control.compute=list(return.marginals.predictor=TRUE)')

The decision of how many models to process in parallel depends on the available computational resources. However, the more the models to be fitted in parallel, the more the R sessions to be opened in the backend, which would increase the computational time for data transfer.

3. Run the models distributed in remote machines

Before evaluating the following code, the user must set appropriate identifications for the remote machines.

## Example of distributed computation with four models per remote machine
workers <- rep(c("172.0.0.1","172.0.0.2","172.0.0.3"), each=4)

Model3 <- CAR_INLA(carto=Carto_SpainMUN, ID.area="ID", ID.group="region", O="obs", E="exp",
                   model="partition", k=0, seed=1234, strategy="simplified.laplace",
                   plan="cluster", workers=workers)
#> STEP 1: Pre-processing data
#> STEP 2: Fitting partition (k=0) model with INLA
#> starting worker pid=1444 on localhost:11204 at 13:03:13.910
#> starting worker pid=1465 on localhost:11204 at 13:03:14.123
#> starting worker pid=1489 on localhost:11204 at 13:03:14.330
#> starting worker pid=1509 on localhost:11204 at 13:03:14.544
#> starting worker pid=14966 on localhost:11208 at 13:03:15.548
#> starting worker pid=14998 on localhost:11209 at 13:03:16.545
#> starting worker pid=15028 on localhost:11210 at 13:03:17.514
#> starting worker pid=15058 on localhost:11211 at 13:03:18.544
#> starting worker pid=12722 on localhost:11212 at 13:03:19.010
#> starting worker pid=12750 on localhost:11213 at 13:03:19.430
#> starting worker pid=12778 on localhost:11214 at 13:03:19.764
#> starting worker pid=12806 on localhost:11215 at 13:03:20.088
#> + Model 1 of 15
#> + Model 2 of 15
#> + Model 3 of 15
#> + Model 4 of 15
#> + Model 5 of 15
#> + Model 6 of 15
#> + Model 7 of 15
#> + Model 8 of 15
#> + Model 9 of 15
#> + Model 10 of 15
#> + Model 11 of 15
#> + Model 12 of 15
#> + Model 13 of 15
#> + Model 14 of 15
#> + Model 15 of 15
#> STEP 3: Merging the results

summary(Model3)

#> Call:
#>    c("inla(formula = formula, family = \"poisson\", data = data.INLA, ", " E = E,
#>    control.compute = list(dic = TRUE, cpo = TRUE, waic = TRUE, ", " config = TRUE),
#>    control.predictor = list(compute = TRUE, ", " cdf = c(log(1))), control.inla =
#>    list(strategy = strategy))" )
#> Time used:
#>     Running = 111, Merging = 16.7, Total = 128, NA = NA
#> Random effects:
#>   Name     Model
#>     ID.area Generic1 model
#>
#> Deviance Information Criterion (DIC) ...............: 27450.60
#> Deviance Information Criterion (DIC, saturated) ....: 8387.16
#> Effective number of parameters .....................: 403.90
#>
#> Watanabe-Akaike information criterion (WAIC) ...: 27433.63
#> Effective number of parameters .................: 343.31
#>
#> Posterior summaries for the linear predictor and the fitted values are computed
#> (Posterior marginals needs also 'control.compute=list(return.marginals.predictor=TRUE)')

Acknowledgments

This work has been supported by the Spanish Ministry of Economy, Industry, and Competitiveness (project MTM2017-82553-R, AEI/FEDER, UE), and partially funded by la Caixa Foundation (ID 1000010434), Caja Navarra Foundation and UNED Pamplona, under agreement LCF/PR/PR15/51100007 (project REF P/13/20).

References

Bengtsson, H. (2021). A unifying framework for parallel and distributed processing in R using futures. The R Journal (Accepted Article). https://journal.r-project.org/archive/2021/RJ-2021-048/index.html
Orozco-Acosta, E., Adin, A., & Ugarte, M. D. (2021). Scalable Bayesian modeling for smoothing disease mapping risks in large spatial data sets using INLA. Spatial Statistics, 41, 100496. https://doi.org/10.1016/j.spasta.2021.100496
Orozco-Acosta, E., Adin, A., & Ugarte, M. D. (2022). Parallel and distributed Bayesian modelling for analysing high-dimensional spatio-temporal count data. https://arxiv.org/abs/2201.08323