bigDM: parallel and distributed modelling

Introduction

When fitting both the disjoint and k-order neighbour models with the bigDM package (Orozco-Acosta et al., 2021, 2023) parallel or distributed computation strategies can be performed to speed up computations by using the future package (Bengtsson, 2021).

Different computation strategies can be specified through the plan argument of the CAR_INLA(), STCAR_INLA() and MCAR_INLA() functions (available for bigDM version 0.5.x and higher). For further details about these strategies see the documentation of the future package and its corresponding vignettes.

If plan="sequential" (default), the models are fitted sequentially and in the current R session (local machine).
If plan="cluster", multiple models can be fitted in parallel on external R sessions (local machine) or distributed in remote compute nodes. When using this option, the identifications of the local or remote workers where the models are going to be processed must be specified through the workers argument. The following options are available:

1) Using local machines
```
## single local machine
workers <- "localhost"
# or
workers <- Sys.info()[["nodename"]]

## multiple local machines
workers <- rep("localhost", 2)  # two local machines
workers <- future::availableWorkers()  # total number of available workers
```
2) Using remote compute nodes
```
## single remote machine
workers <- "172.0.0.1"  # using IP address
workers <- "machine1.remote.es"  # using hostname

## multiple remote machines
workers <- c("172.0.0.1","172.0.0.2")
# or
workers <- c("machine1.remote.es","machine2.remote.es")
```
3) Using both local and remote machines
```
workers <- c("localhost","172.0.0.1","172.0.0.2")
```
4) It is also possible to fit more than one model simultaneously in each remote machine by setting
```
## two models per remote machine
workers <- rep(c("172.0.0.1","172.0.0.2"), each=2)

## two models per local and remote machines
workers <- rep(c("localhost","172.0.0.1","172.0.0.2"), each=2)
```
Important note: To work with remote workers, public and private keys between the different machines must be previously configured. In addition, the bigDM package needs to be installed in all the local/remote machines. More details can be found at this vignette and in the documentation of the makeClusterPSOCK function.

Example: Spanish colorectal cancer mortality data

In what follows, we show how to fit the Disjoint model for the Spanish colorectal cancer mortality data using different computation strategies. A full description of the data and usage of the CAR_INLA() function is given in the vignette bigDM: fitting spatial models.

library(INLA)
library(bigDM)

data(Carto_SpainMUN)
head(Carto_SpainMUN)
#> Simple feature collection with 6 features and 10 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 485318 ymin: 4727428 xmax: 543317 ymax: 4779153
#> Projected CRS: ETRS89 / UTM zone 30N
#>      ID                                    name     lat    long     area
#> 1 01001                        Alegria-Dulantzi 4742974 4742974 19913794
#> 2 01002                                 Amurrio 4765886 4765886 96145595
#> 3 01003                                 Aramaio 4766225 4766225 73338806
#> 4 01004                              Artziniega 4775453 4775453 27506468
#> 5 01006                                 Arminon 4729884 4729884 10559721
#> 6 01008 Arrazua-Ubarrundia (San Martin de Ania) 4752714 4752714 57502811
#>   perimeter obs        exp       SMR     region                       geometry
#> 1  34372.11   2  3.0237149 0.6614380 Pais Vasco MULTIPOLYGON (((538259 4737...
#> 2  63352.32  28 20.8456682 1.3432047 Pais Vasco MULTIPOLYGON (((503520 4760...
#> 3  41430.46   6  3.7527301 1.5988360 Pais Vasco MULTIPOLYGON (((533286 4759...
#> 4  22605.22   3  3.2093191 0.9347777 Pais Vasco MULTIPOLYGON (((491260 4776...
#> 5  17847.35   0  0.4817391 0.0000000 Pais Vasco MULTIPOLYGON (((509851 4727...
#> 6  64968.81   2  1.9643891 1.0181282 Pais Vasco MULTIPOLYGON (((534678 4746...

For our example data in Carto_SpainMUN the \(D=15\) Autonomous Regions of Spain are used as a partition of the \(n=7907\) municipalities (region variable of the sf object).

1. Run the models sequentially in the current R session (default option)

Model1 <- CAR_INLA(carto=Carto_SpainMUN, ID.area="ID", ID.group="region", O="obs", E="exp",
                   model="partition", k=0, strategy="simplified.laplace",
                   plan="sequential", workers=NULL)
#> STEP 1: Pre-processing data
#> STEP 2: Fitting partition (k=0) model with INLA
#> + Model 1 of 15
#> + Model 2 of 15
#> + Model 3 of 15
#> + Model 4 of 15
#> + Model 5 of 15
#> + Model 6 of 15
#> + Model 7 of 15
#> + Model 8 of 15
#> + Model 9 of 15
#> + Model 10 of 15
#> + Model 11 of 15
#> + Model 12 of 15
#> + Model 13 of 15
#> + Model 14 of 15
#> + Model 15 of 15
#> STEP 3: Merging the results

summary(Model1)
#>
#> Call:
#>    c("inla.core(formula = formula, family = family, contrasts = contrasts,
#>    ", " data = data, quantiles = quantiles, E = E, offset = offset, ", "
#>    scale = scale, weights = weights, Ntrials = Ntrials, strata = strata,
#>    ", " lp.scale = lp.scale, link.covariates = link.covariates, verbose =
#>    verbose, ", " lincomb = lincomb, selection = selection, control.compute
#>    = control.compute, ", " control.predictor = control.predictor,
#>    control.family = control.family, ", " control.inla = control.inla,
#>    control.fixed = control.fixed, ", " control.mode = control.mode,
#>    control.expert = control.expert, ", " control.hazard = control.hazard,
#>    control.lincomb = control.lincomb, ", " control.update =
#>    control.update, control.lp.scale = control.lp.scale, ", "
#>    control.pardiso = control.pardiso, only.hyperparam = only.hyperparam,
#>    ", " inla.call = inla.call, inla.arg = inla.arg, num.threads =
#>    num.threads, ", " blas.num.threads = blas.num.threads, keep = keep,
#>    working.directory = working.directory, ", " silent = silent, inla.mode
#>    = inla.mode, safe = FALSE, debug = debug, ", " .parent.frame =
#>    .parent.frame)")
#> Time used:
#>     Pre = 13.9, Running = 81.4, Post = 6.58, Merging = 29
#> Fixed effects:
#>               mean    sd 0.025quant 0.5quant 0.975quant
#> (Intercept) -0.016 0.005     -0.026   -0.016     -0.007
#>
#> Random effects:
#>   Name     Model
#>     ID.area Generic1 model
#>
#> Deviance Information Criterion (DIC) ...............: 27449.25
#> Deviance Information Criterion (DIC, saturated) ....: 8385.81
#> Effective number of parameters .....................: 400.08
#>
#> Watanabe-Akaike information criterion (WAIC) ...: 27431.02
#> Effective number of parameters .................: 339.11
#>
#>  is computed
#> Posterior summaries for the linear predictor and the fitted values are computed
#> (Posterior marginals needs also 'control.compute=list(return.marginals.predictor=TRUE)')

2. Run the models in parallel on external R sessions

workers <- rep("localhost", 4)
num.threads <- "4:1"

Model2 <- CAR_INLA(carto=Carto_SpainMUN, ID.area="ID", ID.group="region", O="obs", E="exp",
                   model="partition", k=0, strategy="simplified.laplace",
                   plan="cluster", workers=workers, num.threads=num.threads)
#> STEP 1: Pre-processing data
#> STEP 2: Fitting partition (k=0) model with INLA
#> + Model 1 of 15
#> + Model 2 of 15
#> + Model 3 of 15
#> + Model 4 of 15
#> + Model 5 of 15
#> + Model 6 of 15
#> + Model 7 of 15
#> + Model 8 of 15
#> + Model 9 of 15
#> + Model 10 of 15
#> + Model 11 of 15
#> + Model 12 of 15
#> + Model 13 of 15
#> + Model 14 of 15
#> + Model 15 of 15
#> STEP 3: Merging the results

summary(Model2)
#>
#> Call:
#>    c("inla.core(formula = formula, family = family, contrasts = contrasts,
#>    ", " data = data, quantiles = quantiles, E = E, offset = offset, ", "
#>    scale = scale, weights = weights, Ntrials = Ntrials, strata = strata,
#>    ", " lp.scale = lp.scale, link.covariates = link.covariates, verbose =
#>    verbose, ", " lincomb = lincomb, selection = selection, control.compute
#>    = control.compute, ", " control.predictor = control.predictor,
#>    control.family = control.family, ", " control.inla = control.inla,
#>    control.fixed = control.fixed, ", " control.mode = control.mode,
#>    control.expert = control.expert, ", " control.hazard = control.hazard,
#>    control.lincomb = control.lincomb, ", " control.update =
#>    control.update, control.lp.scale = control.lp.scale, ", "
#>    control.pardiso = control.pardiso, only.hyperparam = only.hyperparam,
#>    ", " inla.call = inla.call, inla.arg = inla.arg, num.threads =
#>    num.threads, ", " blas.num.threads = blas.num.threads, keep = keep,
#>    working.directory = working.directory, ", " silent = silent, inla.mode
#>    = inla.mode, safe = FALSE, debug = debug, ", " .parent.frame =
#>    .parent.frame)")
#> Time used:
#>     Running = 89.8, Merging = 27.6, Total = 117, NA = NA
#> Fixed effects:
#>               mean    sd 0.025quant 0.5quant 0.975quant
#> (Intercept) -0.016 0.005     -0.026   -0.016     -0.007
#>
#> Random effects:
#>   Name     Model
#>     ID.area Generic1 model
#>
#> Deviance Information Criterion (DIC) ...............: 27448.35
#> Deviance Information Criterion (DIC, saturated) ....: 8384.91
#> Effective number of parameters .....................: 400.05
#>
#> Watanabe-Akaike information criterion (WAIC) ...: 27430.58
#> Effective number of parameters .................: 339.39
#>
#>  is computed
#> Posterior summaries for the linear predictor and the fitted values are computed
#> (Posterior marginals needs also 'control.compute=list(return.marginals.predictor=TRUE)')

The decision of how many models to process in parallel depends on the available computational resources. However, the more the models to be fitted in parallel, the more the R sessions to be opened in the backend, which would increase the computational time for data transfer.

3. Run the models distributed in remote machines

Before evaluating the following code, the user must set appropriate identifications for the remote machines.

## Example of distributed computation with four models per remote machine
workers <- rep(c("172.0.0.1","172.0.0.2","172.0.0.3"), each=4)
num.threads <- "8:1"

Model3 <- CAR_INLA(carto=Carto_SpainMUN, ID.area="ID", ID.group="region", O="obs", E="exp",
                   model="partition", k=0, strategy="simplified.laplace",
                   plan="cluster", workers=workers, num.threads=num.threads)

Acknowledgments

This work has been supported by Project MTM2017-82553-R (AEI/FEDER, UE) and Project PID2020-113125RB-I00/MCIN/AEI/10.13039/501100011033. It has also been partially funded by the Public University of Navarra (project PJUPNA2001).

References

Bengtsson, H. (2021). A unifying framework for parallel and distributed processing in R using futures. The R Journal, 13(2), 273–291. https://doi.org/10.32614/RJ-2021-048

Orozco-Acosta, E., Adin, A., & Ugarte, M. D. (2021). Scalable Bayesian modeling for smoothing disease mapping risks in large spatial data sets using INLA. Spatial Statistics, 41, 100496. https://doi.org/10.1016/j.spasta.2021.100496

Orozco-Acosta, E., Adin, A., & Ugarte, M. D. (2023). Big problems in spatio-temporal disease mapping: methods and software. Computer Methods and Programs in Biomedicine, 231, 107403. https://doi.org/10.1016/j.cmpb.2023.107403

bigDM: parallel and distributed modelling

A. Adin, E. Orozco-Acosta, M.D. Ugarte

2023-06-14

Introduction

1) Using local machines

2) Using remote compute nodes

3) Using both local and remote machines

4) It is also possible to fit more than one model simultaneously in each remote machine by setting

Example: Spanish colorectal cancer mortality data

Acknowledgments

References