bigDM: parallel and distributed modelling
Introduction
When fitting both the disjoint and k-order neighbour models with the bigDM package (Orozco-Acosta et al., 2021, 2023) parallel or distributed computation strategies can be performed to speed up computations by using the future package (Bengtsson, 2021).
Different computation strategies can be specified through the plan
argument of the CAR_INLA()
, STCAR_INLA()
and MCAR_INLA()
functions. For further details about these strategies see the documentation of the future package and its corresponding vignettes.
If
plan="sequential"
(default), the models are fitted sequentially and in the current R session (local machine).If
plan="cluster"
, multiple models can be fitted in parallel on external R sessions (local machine) or distributed in remote compute nodes. When using this option, the identifications of the local or remote workers where the models are going to be processed must be specified through theworkers
argument. The following options are available:1) Using local machines
## single local machine <- "localhost" workers # or <- Sys.info()[["nodename"]] workers ## multiple local machines <- rep("localhost", 2) # two local machines workers <- future::availableWorkers() # total number of available workers workers
2) Using remote compute nodes
## single remote machine <- "172.0.0.1" # using IP address workers <- "machine1.remote.es" # using hostname workers ## multiple remote machines <- c("172.0.0.1","172.0.0.2") workers # or <- c("machine1.remote.es","machine2.remote.es") workers
3) Using both local and remote machines
<- c("localhost","172.0.0.1","172.0.0.2") workers
4) It is also possible to fit more than one model simultaneously in each remote machine by setting
## two models per remote machine <- rep(c("172.0.0.1","172.0.0.2"), each=2) workers ## two models per local and remote machines <- rep(c("localhost","172.0.0.1","172.0.0.2"), each=2) workers
Important note: To work with remote workers, public and private keys between the different machines must be previously configured. In addition, the bigDM package needs to be installed in all the local/remote machines. More details can be found at this vignette and in the documentation of the
makeClusterPSOCK
function.
Example: Spanish colorectal cancer mortality data
In what follows, we show how to fit the Disjoint model for the Spanish colorectal cancer mortality data using different computation strategies. A full description of the data and usage of the CAR_INLA()
function is given in the vignette bigDM: fitting spatial models.
library(INLA)
library(bigDM)
data(Carto_SpainMUN)
head(Carto_SpainMUN)
#> Simple feature collection with 6 features and 8 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: 485318 ymin: 4727428 xmax: 543317 ymax: 4779153
#> Projected CRS: ETRS89 / UTM zone 30N
#> ID name area perimeter obs
#> 1 01001 Alegria-Dulantzi 19913794 [m^2] 34372.11 [m] 2
#> 2 01002 Amurrio 96145595 [m^2] 63352.32 [m] 28
#> 3 01003 Aramaio 73338806 [m^2] 41430.46 [m] 6
#> 4 01004 Artziniega 27506468 [m^2] 22605.22 [m] 3
#> 5 01006 Arminon 10559721 [m^2] 17847.35 [m] 0
#> 6 01008 Arrazua-Ubarrundia (San Martin de Ania) 57502811 [m^2] 64968.81 [m] 2
#> exp SMR region geometry
#> 1 3.0237149 0.6614380 Pais Vasco MULTIPOLYGON (((538259 4737...
#> 2 20.8456682 1.3432047 Pais Vasco MULTIPOLYGON (((503520 4760...
#> 3 3.7527301 1.5988360 Pais Vasco MULTIPOLYGON (((533286 4759...
#> 4 3.2093191 0.9347777 Pais Vasco MULTIPOLYGON (((491260 4776...
#> 5 0.4817391 0.0000000 Pais Vasco MULTIPOLYGON (((509851 4727...
#> 6 1.9643891 1.0181282 Pais Vasco MULTIPOLYGON (((534678 4746...
For our example data in Carto_SpainMUN
the \(D=15\) Autonomous Regions of Spain are used as a partition of the \(n=7907\) municipalities (region
variable of the sf
object).
1. Run the models sequentially in the current R session (default option)
<- CAR_INLA(carto=Carto_SpainMUN, ID.area="ID", ID.group="region", O="obs", E="exp",
Model1 model="partition", k=0, strategy="simplified.laplace",
plan="sequential", workers=NULL)
#> STEP 1: Pre-processing data
#> STEP 2: Fitting partition (k=0) model with INLA
#> + Model 1 of 15
#> + Model 2 of 15
#> + Model 3 of 15
#> + Model 4 of 15
#> + Model 5 of 15
#> + Model 6 of 15
#> + Model 7 of 15
#> + Model 8 of 15
#> + Model 9 of 15
#> + Model 10 of 15
#> + Model 11 of 15
#> + Model 12 of 15
#> + Model 13 of 15
#> + Model 14 of 15
#> + Model 15 of 15
#> STEP 3: Merging the results
summary(Model1)
#> Time used:
#> Running = 128, Merging = 19.9, Total = 148, NA = NA
#> Random effects:
#> Name Model
#> ID.area Generic1 model
#>
#> Deviance Information Criterion (DIC) ...............: 27455.94
#> Deviance Information Criterion (DIC, saturated) ....: 8392.49
#> Effective number of parameters .....................: 403.47
#>
#> Watanabe-Akaike information criterion (WAIC) ...: 27438.08
#> Effective number of parameters .................: 342.44
#>
#> is computed
#> Posterior summaries for the linear predictor and the fitted values are computed
#> (Posterior marginals needs also 'control.compute=list(return.marginals.predictor=TRUE)')
2. Run the models in parallel on external R sessions
<- rep("localhost", 4)
workers <- "4:1"
num.threads
<- CAR_INLA(carto=Carto_SpainMUN, ID.area="ID", ID.group="region", O="obs", E="exp",
Model2 model="partition", k=0, strategy="simplified.laplace",
plan="cluster", workers=workers, num.threads=num.threads)
#> STEP 1: Pre-processing data
#> STEP 2: Fitting partition (k=0) model with INLA
#> + Model 1 of 15
#> + Model 2 of 15
#> + Model 3 of 15
#> + Model 4 of 15
#> + Model 5 of 15
#> + Model 6 of 15
#> + Model 7 of 15
#> + Model 8 of 15
#> + Model 9 of 15
#> + Model 10 of 15
#> + Model 11 of 15
#> + Model 12 of 15
#> + Model 13 of 15
#> + Model 14 of 15
#> + Model 15 of 15
#> STEP 3: Merging the results
summary(Model2)
#> Time used:
#> Running = 122, Merging = 23.4, Total = 146, NA = NA
#> Random effects:
#> Name Model
#> ID.area Generic1 model
#>
#> Deviance Information Criterion (DIC) ...............: 27454.03
#> Deviance Information Criterion (DIC, saturated) ....: 8390.59
#> Effective number of parameters .....................: 403.15
#>
#> Watanabe-Akaike information criterion (WAIC) ...: 27437.10
#> Effective number of parameters .................: 342.78
#>
#> is computed
#> Posterior summaries for the linear predictor and the fitted values are computed
#> (Posterior marginals needs also 'control.compute=list(return.marginals.predictor=TRUE)')
The decision of how many models to process in parallel depends on the available computational resources. However, the more the models to be fitted in parallel, the more the R sessions to be opened in the backend, which would increase the computational time for data transfer.
3. Run the models distributed in remote machines
Before evaluating the following code, the user must set appropriate identifications for the remote machines.
## Example of distributed computation with four models per remote machine
<- rep(c("172.0.0.1","172.0.0.2","172.0.0.3"), each=4)
workers <- "8:1"
num.threads
<- CAR_INLA(carto=Carto_SpainMUN, ID.area="ID", ID.group="region", O="obs", E="exp",
Model3 model="partition", k=0, strategy="simplified.laplace",
plan="cluster", workers=workers, num.threads=num.threads)
Acknowledgments
This work has been supported by Project MTM2017-82553-R (AEI/FEDER, UE) and Project PID2020-113125RB-I00/MCIN/AEI/10.13039/501100011033. It has also been partially funded by the Public University of Navarra (project PJUPNA2001).