Cache with SpaDES
As part of a reproducible workflow, caching of various function calls is a critical component. Down the road, it is likely that an entire workflow from raw data to publication, decision support, report writing, presentation building, etc., could be built and be reproducible anywhere, on demand. The reproducible::Cache function is built to work with any R function. However, it becomes very powerful in a SpaDES context because we can build large, powerful applications that are transparent and tied to raw data that may be many conceptual steps upstream in the workflow. To do this, we have built several customizations within the SpaDES package. Important to this is dealing correctly with the simList, which is an object that has a slot that is an environment. But more important are the various tools that can be used at higher levels, i.e., not just for “standard” functions.
Caching as part of SpaDES
Some of the details of the simList-specific features of this Cache function include:
- The function converts all elements that have an environment as part of their attributes into a format that has no unique environment attribute, using format if a function, and as.list in the case of the simList environment.
- When used within SpaDES modules, Cache (capital C) does not require that the argument cacheRepo be specified. If called from inside a SpaDES module, Cache will use the cacheRepo argument from a call to cachePath(sim), taking the sim from the call stack. Similarly, if no cacheRepo argument is specified, then it will use getOption("spades.cachePath"), which will, by default, be a temporary location with no persistence between R sessions! To persist between sessions, use SpaDES::setPaths() every session (see the sketch below this list).
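For example, a minimal sketch of setting a persistent cache directory for the session might look like this (the "cache" directory name is an arbitrary example):
# set a persistent cache directory for this session ("cache" is an illustrative path)
SpaDES::setPaths(cachePath = "cache")
# confirm which cache path spades/Cache will now use by default
getOption("spades.cachePath")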
In a SpaDES context, there are several levels of caching that can be used as part of a reproducible workflow. Each level can be used to a modeler's advantage, and all can be (and often are) used concurrently.
spades level
An entire call to spades or experiment can be cached. This will have the effect of eliminating any stochasticity in the model, as the output will simply be the cached version of the simList. This is likely most useful in situations where reproducibility is more important than “new” stochasticity (e.g., building decision support systems, apps, or the final version of a manuscript).
library(igraph) # for %>%
library(raster)
library(SpaDES)
mySim <- simInit(
  times = list(start = 0.0, end = 5.0),
  params = list(
    .globals = list(stackName = "landscape", burnStats = "testStats"),
    randomLandscapes = list(.plotInitialTime = NA),
    fireSpread = list(.plotInitialTime = NA)
  ),
  modules = list("randomLandscapes", "fireSpread"),
  paths = list(modulePath = system.file("sampleModules", package = "SpaDES.core"))
)
This functionality can be achieved within a spades call.
# compare caching ... run once to create cache
system.time(outSim <- spades(Copy(mySim), cache = TRUE, notOlderThan = Sys.time()))
## user system elapsed
## 3.077 0.301 3.588
Note that if there were any visualizations (here we turned them off with .plotInitialTime = NA above), they will happen the first time through, but not on the cached runs.
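A second, cached call, plus a check that it matches the first, can take this form; a minimal sketch, with outSimCached as an illustrative name:
# second time: the cached simList is returned, so this is much faster
system.time(outSimCached <- spades(Copy(mySim), cache = TRUE))
# the cached result should match the first run
all.equal(outSim, outSimCached)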
## Using cached copy of burn event in fireSpread module
## user system elapsed
## 0.086 0.009 0.096
## [1] TRUE
experiment level
This functionality can be achieved within an experiment call. This can be done in two ways: “internally”, through the cache argument, which will cache each spades call; or “externally”, which will cache the entire experiment. If there are many spades calls, the former will be slow, as the simList will be digested once per spades call.
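The first (“internal”) run, whose timing appears below, can be done in roughly this form; a sketch, with sims1 as an illustrative name:
# internal -- first time, nothing is cached yet
# (notOlderThan forces a rerun; assumed to be passed through to spades)
system.time(sims1 <- experiment(mySim, replicates = 2, cache = TRUE,
                                notOlderThan = Sys.time()))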
## user system elapsed
## 2.589 0.230 2.816
# internal -- second time faster
system.time(sims2 <- experiment(mySim, replicates = 2, cache = TRUE))
## Using cached copy of burn event in fireSpread module
## Using cached copy of burn event in fireSpread module
## user system elapsed
## 0.136 0.014 0.151
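The two sets of replicates can then be compared; a sketch using the illustrative names above:
all.equal(sims1, sims2)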
## [1] TRUE
experiment with Cache
Here, the simList (and the other arguments to experiment) is hashed once and, if it is found to be the same as a previous call, the returned list of simList objects is recovered. This means that even a very large experiment, with many replicates and combinations of parameters and modules, can be recovered very quickly. Here we show that you can output objects to disk, so the list of simList objects doesn't get too big. Then, when we recover it in the cached version, all the files are still there and the list of simList objects is small, so it is very fast to recover.
# External
outputs(mySim) <- data.frame(objectName = "landscape")
system.time(sims3 <- Cache(experiment, mySim, replicates = 3, .plotInitialTime = NA,
clearSimEnv = TRUE))
## user system elapsed
## 3.723 0.301 4.032
The second time is much faster, and we see the output files in the same location.
system.time(sims4 <- Cache(experiment, mySim, replicates = 3, .plotInitialTime = NA,
clearSimEnv = TRUE))
## loading cached result from previous experiment call.
## user system elapsed
## 0.028 0.003 0.032
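The recovered list can be compared to the original; a sketch using the names above:
all.equal(sims3, sims4)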
## [1] TRUE
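The saved files can then be listed; a sketch (listing the simulation's output path recursively is an assumption about where experiment writes the replicate files):
# list the files written by the experiment, including one per replicate
dir(outputPath(mySim), recursive = TRUE)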
## [1] "experiment.RData" "rep1/landscape_year5.rds"
## [3] "rep2/landscape_year5.rds" "rep3/landscape_year5.rds"
Notice that the speed-up can be enormous; in this case, roughly 100 times.
Module-level caching
If the parameter .useCache in the module's metadata is set to TRUE, then every event in the module will be cached. That means that every time that module is called from within a spades or experiment call, Cache will be called. Only the objects inside the simList that correspond to the inputObjects or the outputObjects from the module metadata will be assessed for caching.
For general use, module-level caching would be mostly useful for modules that have no stochasticity, such as data-preparation modules, GIS modules etc.
In this example, we will use the cache on the randomLandscapes module. This means that each subsequent call to spades will result in identical outputs from the randomLandscapes module (only!). This would be useful when only one random landscape is needed, whether simply for trying something out or for putting into production code (e.g., publication, decision support, etc.).
# Module-level
params(mySim)$randomLandscapes$.useCache <- TRUE
system.time(randomSim <- spades(Copy(mySim), .plotInitialTime = NA,
notOlderThan = Sys.time(), debug = TRUE))
## This is the current event, printed as it is happening:
## eventTime moduleName eventType eventPriority
## 0 checkpoint init 5
## 0 save init 5
## 0 progress init 5
## 0 load init 5
## 0 randomLandscapes init 5
## 0 fireSpread init 5
## 1 fireSpread burn 5
## 1 fireSpread stats 5
## 2 fireSpread burn 5
## 2 fireSpread stats 5
## 3 fireSpread burn 5
## 3 fireSpread stats 5
## 4 fireSpread burn 5
## 4 fireSpread stats 5
## 5 fireSpread burn 5
## 5 fireSpread stats 5
## 5 save spades 10
## user system elapsed
## 1.330 0.114 1.448
# vastly faster the second time
system.time(randomSimCached <- spades(Copy(mySim), .plotInitialTime = NA,
debug = TRUE))
## This is the current event, printed as it is happening:
## eventTime moduleName eventType eventPriority
## 0 checkpoint init 5
## 0 save init 5
## 0 progress init 5
## 0 load init 5
## 0 randomLandscapes init 5
## Using cached copy of randomLandscapes module
## 0 fireSpread init 5
## 1 fireSpread burn 5
## 1 fireSpread stats 5
## 2 fireSpread burn 5
## 2 fireSpread stats 5
## 3 fireSpread burn 5
## 3 fireSpread stats 5
## 4 fireSpread burn 5
## 4 fireSpread stats 5
## 5 fireSpread burn 5
## 5 fireSpread stats 5
## 5 save spades 10
## user system elapsed
## 0.219 0.017 0.239
Test that only the layers produced in randomLandscapes are identical, not those from fireSpread.
layers <- list("DEM", "forestAge", "habitatQuality", "percentPine", "Fires")
same <- lapply(layers, function(l) identical(randomSim$landscape[[l]],
randomSimCached$landscape[[l]]))
names(same) <- layers
print(same) # Fires is not same because all non-init events in fireSpread are not cached
## $DEM
## [1] TRUE
##
## $forestAge
## [1] TRUE
##
## $habitatQuality
## [1] TRUE
##
## $percentPine
## [1] TRUE
##
## $Fires
## [1] FALSE
Event-level caching
If the parameter .useCache in the module's metadata is set to a character string or character vector, then the event or events named there will be cached. That means that every time one of those events is called from within a spades or experiment call, Cache will be called. Only the objects inside the simList that correspond to the inputObjects or the outputObjects defined in the module metadata will be assessed for caching inputs or outputs, respectively. Because all (and only) the named inputObjects and outputObjects are cached and returned, this may be inefficient for individual events, i.e., it may cache more objects than are necessary.
Similar to module-level caching, event-level caching would be mostly useful for events that have no stochasticity, such as data-preparation or GIS events. Here, we don't change the module-level caching for randomLandscapes, but we add to it a cache of only the “init” event in fireSpread.
params(mySim)$fireSpread$.useCache <- "init"
system.time(randomSim <- spades(Copy(mySim), .plotInitialTime = NA,
notOlderThan = Sys.time(), debug = TRUE))
## This is the current event, printed as it is happening:
## eventTime moduleName eventType eventPriority
## 0 checkpoint init 5
## 0 save init 5
## 0 progress init 5
## 0 load init 5
## 0 randomLandscapes init 5
## 0 fireSpread init 5
## 1 fireSpread burn 5
## 1 fireSpread stats 5
## 2 fireSpread burn 5
## 2 fireSpread stats 5
## 3 fireSpread burn 5
## 3 fireSpread stats 5
## 4 fireSpread burn 5
## 4 fireSpread stats 5
## 5 fireSpread burn 5
## 5 fireSpread stats 5
## 5 save spades 10
## user system elapsed
## 1.473 0.143 1.627
# vastly faster the second time
system.time(randomSimCached <- spades(Copy(mySim), .plotInitialTime = NA,
debug = TRUE))
## This is the current event, printed as it is happening:
## eventTime moduleName eventType eventPriority
## 0 checkpoint init 5
## 0 save init 5
## 0 progress init 5
## 0 load init 5
## 0 randomLandscapes init 5
## Using cached copy of randomLandscapes module
## 0 fireSpread init 5
## Using cached copy of init event in fireSpread module
## 1 fireSpread burn 5
## 1 fireSpread stats 5
## 2 fireSpread burn 5
## 2 fireSpread stats 5
## 3 fireSpread burn 5
## 3 fireSpread stats 5
## 4 fireSpread burn 5
## 4 fireSpread stats 5
## 5 fireSpread burn 5
## 5 fireSpread stats 5
## 5 save spades 10
## user system elapsed
## 0.258 0.020 0.279
Function-level caching
Any function can be cached using Cache(FUN = functionName, ...). This requires only a slight change to a function call; for example, projectRaster(raster, crs = crs(newRaster)) becomes Cache(projectRaster, raster, crs = crs(newRaster)).
ras <- raster(extent(0, 1e3, 0, 1e3), res = 1)
system.time(map <- Cache(gaussMap, ras, cacheRepo = cachePath(mySim),
notOlderThan = Sys.time()))
## user system elapsed
## 3.115 0.124 3.296
# vastly faster the second time
system.time(mapCached <- Cache(gaussMap, ras, cacheRepo = cachePath(mySim)))
## loading cached result from previous gaussMap call.
## user system elapsed
## 0.084 0.006 0.091
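The cached map can be compared to the original; a sketch of the check that produced the result below:
all.equal(map, mapCached)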
## [1] TRUE
Working with the cache manually
Since the cache is simply an archivist repository, all archivist functions will work as is. In addition, there are several helpers in the reproducible package, including showCache, keepCache and clearCache, that may be useful. Also, one can access cached items manually (rather than simply rerunning the same Cache call again).
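For example, the cache contents can be listed; a sketch of such a call:
# show the contents of the simulation's cache repository
showCache(mySim)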
## tagKey tagValue createdDate
## 1: function spades 2018-01-31 16:39:38
## 2: function experiment 2018-01-31 16:39:42
## 3: function doEvent 2018-01-31 16:39:46
## 4: function spades 2018-01-31 16:39:46
## 5: function gaussMap 2018-01-31 16:39:51
## 6: function doEvent 2018-01-31 16:39:46
## 7: function spades 2018-01-31 16:39:46
## 8: function spades 2018-01-31 16:39:34
## 9: function spades 2018-01-31 16:39:36
if (requireNamespace("archivist")) {
  # get the RasterLayer that was produced with the gaussMap function:
  map <- unique(showCache(mySim, userTags = "gaussMap")$artifact) %>%
    archivist::loadFromLocalRepo(repoDir = cachePath(mySim), value = TRUE)
  clearPlot()
  Plot(map)
}
In general, we feel that liberal use of Cache will make a reusable and reproducible workflow. shiny apps can be built that take advantage of Cache. Indeed, much of the difficulty in managing data sets and saving them for future use can be accommodated by caching.