This vignette addresses drake's known edge cases, pitfalls, and weaknesses that might not be fixed in future releases. For the most up-to-date information on unhandled edge cases, please visit the issue tracker, where you can submit your own bug reports as well. Be sure to search the closed issues too, especially if you are not using the most up-to-date development version of drake. For a guide to debugging and testing drake projects, please refer to the separate "debug" vignette.
Versions of drake after 4.4.0 have different caching internals. That means if you built your project with drake 4.4.0 or earlier, later versions of drake will not let you call make(), which could mangle your work. To migrate your project to a later version of drake, you have three options.
1. Downgrade to drake 4.4.0 with devtools::install_version("drake", "4.4.0"). Here, the devtools package must be installed (install.packages("devtools")).
2. Start over and rebuild your project from scratch with make(..., force = TRUE).
3. Use migrate_drake_project() to convert your project to the new format. The migrate_drake_project() function is available in drake 5.0.0 and later.

It is common practice to divide the work of a project into multiple R files, but if you do this, you will not get the most out of drake. Please see the best practices vignette for more details.
In your workflow plan, be sure that target names can be parsed as symbols and commands can be parsed as R code. To be safe, use check_plan(my_plan) to screen for illegal symbols and other problem areas.
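The following is a minimal, hedged sketch of that screening step. The targets and commands are hypothetical placeholders; only drake_plan() and check_plan() are the drake functions named above.
library(drake)
sketch_plan <- drake_plan(    # hypothetical plan, for illustration only
  data = simulate_data(48),   # simulate_data() is a placeholder function
  result = summarize(data)    # summarize() is a placeholder function
)
check_plan(sketch_plan)       # screens for illegal target names and other problems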
A common pitfall is using the evaluate_plan() function to expand wildcards after applying single quotes to file targets.
library(magrittr) # for the pipe operator %>%
drake_plan(
  data = readRDS("data_DATASIZE__.rds")
) %>%
  rbind(drake::drake_plan(
    file.csv = write.csv(
      data_DATASIZE__, # nolint
      "file_DATASIZE__.csv"
    ),
    strings_in_dots = "literals",
    file_targets = TRUE
  )) %>%
  evaluate_plan(
    rules = list(DATASIZE__ = c("small", "large"))
  )
## target command
## 1 data_small readRDS('data_small.rds')
## 2 data_large readRDS('data_large.rds')
## 3 'file.csv'_small write.csv(data_small, "file_small.csv")
## 4 'file.csv'_large write.csv(data_large, "file_large.csv")
The single quotes in the middle of 'file.csv'_small and 'file.csv'_large are illegal, and the target names do not even correspond to the files written. Instead, construct your workflow plan in multiple stages and apply the single quotes at the very end.
rules <- list(DATASIZE__ = c("small", "large"))
datasets <- drake_plan(data = readRDS("data_DATASIZE__.rds")) %>%
  evaluate_plan(rules = rules)
Plan the CSV files separately.
files <- drake_plan(
  file = write.csv(data_DATASIZE__, "file_DATASIZE__.csv"), # nolint
  strings_in_dots = "literals"
) %>%
  evaluate_plan(rules = rules)
Single-quote the file targets after evaluate_plan().
files$target <- paste0(
files$target, ".csv"
) %>%
as_drake_filename
Put the workflow plan together.
rbind(datasets, files)
## target command
## 1 data_small readRDS('data_small.rds')
## 2 data_large readRDS('data_large.rds')
## 3 'file_small.csv' write.csv(data_small, "file_small.csv")
## 4 'file_large.csv' write.csv(data_large, "file_large.csv")
For finer control over target names in cases like this, you may want to use the wildcard package.
In your workflow plan data frame (produced by drake_plan() and accepted by make()), your commands can usually be flexible R expressions.
drake_plan(
target1 = 1 + 1 - sqrt(sqrt(3)),
target2 = my_function(web_scraped_data) %>% my_tidy
)
## target command
## 1 target1 1 + 1 - sqrt(sqrt(3))
## 2 target2 my_function(web_scraped_data) %>% my_tidy
However, please try to avoid formulas and function definitions in your commands. You may be able to get away with drake_plan(f = function(x){x + 1}) or drake_plan(f = y ~ x) in some use cases, but be careful. It is generally better to define functions and formulas in your workspace and then let make() import them. (Alternatively, use the envir argument to make() to tightly control which imported functions are available.) Use the check_plan() function to help screen and quality-control your workflow plan data frame, use tracked() to see the items that are reproducibly tracked, and use vis_drake_graph() and build_drake_graph() to see the dependency structure of your project.
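Here is a hedged sketch of that quality-control routine, assuming a plan called my_plan already exists. Exact signatures may differ across drake versions; consult the help file of each function.
config <- drake_config(my_plan) # master configuration list for these settings
check_plan(my_plan)             # screen the plan for problems
tracked(my_plan)                # which items are reproducibly tracked?
vis_drake_graph(config)         # interactive dependency graph
build_drake_graph(my_plan)      # the underlying igraph object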
Install drake properly.
You must properly install drake using install.packages(), devtools::install_github(), or a similar approach. Functions like devtools::load_all() are insufficient, particularly for the parallel computing functionality, in which separate new R sessions try to require(drake).
Your workflow may depend on external packages such as ggplot2, dplyr, and MASS. Such packages must be formally installed with install.packages(), devtools::install_github(), devtools::install_local(), or a similar command. If you load uninstalled packages with devtools::load_all(), results may be unpredictable and incorrect.
When make() fails, use failed() and diagnose() to debug. Try the following out yourself.
diagnose()
## character(0)
f <- function(){
stop("unusual error")
}
bad_plan <- drake_plan(target = f())
withr::with_message_sink(
stdout(),
make(bad_plan)
)
## cache /tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/.drake
## connect 11 imports: tmp, files, datasets, f, rules, simulate, reg1, my_plan, ...
## connect 1 target: target
## check 1 item: stop
## check 1 item: f
## check 1 item: target
## target target
## Error building target target: unusual error
## fail target
## Error: Target 'target' failed to build. Use diagnose(target) to retrieve diagnostic information.
failed() # From the last make() only
## cache /tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/.drake
## [1] "target"
diagnose() # From all previous make()s
## cache /tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/.drake
## [1] "target"
error <- diagnose(target)
## cache /tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/.drake
str(error)
## List of 3
## $ message: chr "unusual error"
## $ call : language f()
## $ calls :List of 3
## ..$ : language (function() { { ...
## ..$ : language f()
## ..$ : language stop("unusual error")
## .. ..- attr(*, "srcref")=Class 'srcref' atomic [1:8] 4 3 4 23 3 23 4 4
## .. .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x4fdfbc0>
## - attr(*, "class")= chr [1:3] "simpleError" "error" "condition"
error$calls # View the traceback.
## [[1]]
## (function() {
## {
## f()
## }
## })()
##
## [[2]]
## f()
##
## [[3]]
## stop("unusual error")
As of version 3.0.0, drake's execution environment is the user's workspace by default. As a result, the workspace is vulnerable to side effects of make() (which are generally limited to loading and unloading targets). To protect your workspace, you may want to create a custom evaluation environment containing all your imported objects and then pass it to the envir argument of make(). Here is how.
library(drake)
clean(verbose = FALSE)
envir <- new.env(parent = globalenv())
eval(
expression({
f <- function(x){
g(x) + 1
}
g <- function(x){
x + 1
}
}
),
envir = envir
)
myplan <- drake_plan(out = f(1:3))
make(myplan, envir = envir)
## cache /tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/.drake
## connect 2 imports: f, g
## connect 1 target: out
## check 1 item: g
## check 1 item: f
## check 1 item: out
## target out
ls() # Check that your workspace did not change.
## [1] "bad_plan" "datasets" "envir" "error" "f"
## [6] "files" "good_plan" "my_plan" "myplan" "reg1"
## [11] "reg2" "rules" "simulate" "tmp"
ls(envir) # Check your evaluation environment.
## [1] "f" "g" "out"
envir$out
## [1] 3 4 5
readd(out)
## cache /tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/.drake
## [1] 3 4 5
Use the drake_config() list early and often.
The master configuration list returned by drake_config() is important to drake's internals, and you will need it for functions like outdated() and vis_drake_graph(). The config list corresponds to a single call to make(), and you should not modify it by hand afterwards. For example, modifying the targets element post hoc will have no effect because the graph element will remain the same. It is best to just call drake_config() again.
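The following hedged sketch shows the intended usage, assuming a plan called my_plan. Regenerate the config after any change to the plan instead of editing the list in place.
config <- drake_config(my_plan) # settings for one prospective call to make()
outdated(config)                # which targets are out of date?
vis_drake_graph(config)         # dependency network under the same settings
# After editing my_plan, rebuild the config rather than modifying it by hand.
config <- drake_config(my_plan)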
If your drake project is a package.
Some users like to structure their drake projects as formal R packages. The straightforward way to run such a project is to
1. write your functions in *.R files in the package's R/ folder,
2. load the package's environment with devtools::load_all(), and
3. run the project with drake::make().
.env <- devtools::load_all("yourProject")$env # Has all your imported functions
drake::make(my_plan, envir = env) # Run the project normally.
However, the simple strategy above only works for parLapply parallelism with jobs = 1 and for mclapply parallelism. For other kinds of parallelism, you must turn devtools::load_all("yourProject")$env into an ordinary environment that does not look like a package namespace. Thanks to Jasper Clarkberg, we have the following workaround.
1. Copy devtools::load_all("yourProject")$env into an ordinary environment in order to change the binding environment of all your functions.
env <- devtools::load_all("yourProject")$env
env <- list2env(as.list(env), parent = globalenv())
2. Reassign the environment of each function using `environment<-`.
for (name in ls(env)){
  assign(
    x = name,
    envir = env,
    value = `environment<-`(get(name, envir = env), env)
  )
}
3. Make sure drake does not attach yourProject as an external package.
package_name <- "yourProject" # devtools::as.package(".")$package # nolint
packages_to_load <- setdiff(.packages(), package_name)
4. Call make().
make(
  my_plan, # Prepared in advance
  envir = env, # Environment of package "yourProject"
  parallelism = "Makefile", # Or "parLapply"
  jobs = 2,
  packages = packages_to_load # Does not include "yourProject"
)
You may need to adapt this last workaround, depending on the structure of the package, yourProject.
The lazy_load flag does not work with "parLapply" parallelism.
Ordinarily, drake prunes the execution environment at every parallelizable stage. In other words, it loads all the dependencies and unloads anything superfluous for entire batches of targets. This approach may require too much memory for some use cases, so there is an option to delay the loading of dependencies using the lazy_load argument to make() (powered by delayedAssign()). There are two major risks.
1. make(..., lazy_load = TRUE, parallelism = "parLapply", jobs = 2) does not work. If you want to use local multisession parallelism with multiple jobs and lazy loading, try "future_lapply" parallelism instead.
library(future)
future::plan(multisession)
load_basic_example() # Get the code with drake_example("basic").
make(my_plan, lazy_load = TRUE, parallelism = "future_lapply")
2. Delayed evaluation may cause the same dependencies to be loaded multiple times, and these duplicated loads could be slow.
You can call make(..., timeout = 10) to time out each target after 10 seconds. However, timeouts rely on R.utils::withTimeout(), which in turn relies on setTimeLimit(). These functions are the best that R can offer right now, but they have known issues, and timeouts may fail to take effect in certain environments.
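A minimal hedged sketch of the timeout option, assuming an existing plan my_plan; remember that setTimeLimit()-based timeouts cannot interrupt every kind of code.
make(my_plan, timeout = 10) # abandon any target that runs longer than 10 seconds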
With alternate triggers and the option to skip imports, you can sacrifice reproducibility to gain speed. However, these options can throw the dependency network out of sync. You should only use them for testing and debugging.
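For illustration, here is a hedged sketch of those shortcuts, assuming a plan my_plan and a drake version that supports the trigger and skip_imports arguments to make(). Avoid them outside of testing and debugging.
make(my_plan, trigger = "missing")  # only build targets that are not in the cache yet
make(my_plan, skip_imports = TRUE)  # do not process imported functions and files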
You should explicitly learn the items in your workflow and the dependencies of your targets.
?deps
?tracked
?vis_drake_graph
Drake can be fooled into skipping objects that should be treated as dependencies. For example:
f <- function(){
b <- get("x", envir = globalenv()) # x is incorrectly ignored
file_dependency <- readRDS('input_file.rds') # 'input_file.rds' is incorrectly ignored # nolint
digest::digest(file_dependency)
}
deps(f)
## [1] "digest::digest" "get" "globalenv" "readRDS"
command <- "x <- digest::digest('input_file.rds'); assign(\"x\", 1); x"
deps(command)
## [1] "'input_file.rds'" "assign" "digest::digest"
Dynamic knitr reports
In dynamic knitr reports, drake automatically looks for active code chunks and learns dependencies from calls to loadd() and readd(). This behavior activates when
1. the target's command explicitly calls knit() or render() (from the rmarkdown package), and
2. the *.Rmd/*.Rnw source file already exists before make().
load_basic_example() # Get the code with drake_example("basic").
## cache /tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/.drake
## connect 15 imports: tmp, error, envir, files, datasets, myplan, f, rules, sim...
## connect 15 targets: 'report.md', small, large, regression1_small, regression1...
my_plan[1, ]
## target command
## 1 'report.md' knit('report.Rmd', quiet = TRUE)
Above, the R Markdown report loads targets 'small', 'large', and 'coef_regression2_small' using code chunks marked for evaluation.
deps("knit('report.Rmd')")
## [1] "'report.Rmd'" "coef_regression2_small"
## [3] "knit" "large"
## [5] "small"
deps("render('report.Rmd')")
## [1] "'report.Rmd'" "coef_regression2_small"
## [3] "large" "render"
## [5] "small"
deps("'report.Rmd'") # These are actually dependencies of 'report.md' (output)
## [1] "coef_regression2_small" "large"
## [3] "small"
However, you must explicitly mention each and every target loaded into a report. The following examples are discouraged in code chunks because they do not reference any particular target directly or literally in a way that static code analysis can detect.
var <- "good_target"
# Works in isolation, but drake sees "var" literally as a dependency,
# not "good_target".
readd(target = var, character_only = TRUE)
loadd(list = var)
# All cached items are loaded, but none are treated as dependencies.
loadd()
loadd(imports_only = TRUE)
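By contrast, here is a hedged sketch of the style that does work: mention each target literally in the report's code chunks so static code analysis can see it (good_target is a placeholder name).
readd(good_target) # the literal target name is detected as a dependency of the report
loadd(good_target) # same: drake links the report to good_target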
The knit() and render() functions
Functions knit() and render() are special. When you explicitly mention them in a command for a target, you are signaling to drake that you have a dynamic report (such as an R Markdown *.Rmd file). Drake assumes you are using knit() from the knitr package and render() from the rmarkdown package. Thus, unless you know exactly what you are doing, please do not define your own custom knit() or render() functions or load different versions of these functions from other packages.
Functions produced by Vectorize()
With functions produced by Vectorize(), detecting dependencies is especially hard because the body of every such function is
args <- lapply(as.list(match.call())[-1L], eval, parent.frame())
names <- if (is.null(names(args)))
character(length(args)) else names(args)
dovec <- names %in% vectorize.args
do.call("mapply", c(FUN = FUN, args[dovec], MoreArgs = list(args[!dovec]),
SIMPLIFY = SIMPLIFY, USE.NAMES = USE.NAMES))
Thus, if f is constructed with Vectorize(g, ...), drake searches g() for dependencies, not f(). In fact, if drake sees that environment(f)[["FUN"]] exists and is a function, then environment(f)[["FUN"]] will be analyzed instead of f(). Furthermore, if f() is the output of Vectorize(), then drake reproducibly tracks environment(f)[["FUN"]] rather than f() itself. Thus, if the configuration settings of vectorization change (such as which arguments are vectorized), but the core element-wise functionality remains the same, then make() will not react. Also, if you hover over the f node in vis_drake_graph(hover = TRUE), then you will see the body of environment(f)[["FUN"]], not the body of f().
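The following hedged sketch illustrates that behavior using deps(), which appeared earlier in this vignette. The functions f and g are placeholders.
g <- function(x) x + 1                # the element-wise core
f <- Vectorize(g, "x")                # f() merely wraps g()
is.function(environment(f)[["FUN"]])  # TRUE: this is what drake analyzes
deps(environment(f)[["FUN"]])         # drake inspects g(), not f()'s wrapper body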
Some R functions use .Call() to run compiled code in the backend. The R code in these functions is tracked, but not the compiled object called with .Call(), nor its C/C++/Fortran source.
In your workflow plan, you can use single quotes to assert that some targets/imports are external files. However, entire directories (i.e. folders) cannot be reproducibly tracked this way. Please see issue 12 for a discussion.
Drake may import functions from packages, but the packages themselves are not tracked as dependencies. For this, you will need other tools that support reproducibility beyond the scope of drake. Packrat creates a tightly-controlled local library of packages to extend the shelf life of your project. And with Docker, you can execute your project on a virtual machine to ensure platform independence. Together, packrat and Docker can help others reproduce your work even if they have different software and hardware.
Drake claims that it can build and cache your targets in parallel on a variety of backends. However, the practical efficiency of the parallel computing functionality remains to be verified rigorously. Serious performance studies will be part of future work and have not yet been conducted at the time of writing. In addition, each project has its own best parallel computing setup, and the user needs to optimize it on a case-by-case basis. Some general considerations include the following.
"mclapply"
parallelism, as opposed to distributed computing, which can spread the memory demands over the available nodes on a cluster.Be mindful of the maximum number of simultaneous parallel jobs you deploy. At best, too many jobs is poor etiquette on a system with many users and limited resources. At worst, too many jobs will crash a system. The jobs
argument to make()
sets the maximum number of simultaneous jobs in most cases, but not all.
For most of drake's parallel backends, jobs sets the maximum number of simultaneous parallel jobs. However, there are ways to break the pattern. For example, make(..., parallelism = "Makefile", jobs = 2, args = "--jobs=4") uses at most 2 jobs for the imports and at most 4 jobs for the targets. (In make(), args overrides jobs for the targets.) For make(..., parallelism = "future_lapply"), the jobs argument is ignored altogether. Instead, you should set the workers argument where it is available (for example, future::plan(multisession(workers = 2)) or future::plan(future.batchtools::batchtools_local(workers = 2))) in the preparations before make(). Alternatively, you might limit the maximum number of jobs by setting options(mc.cores = 2) before calling make(). Depending on the future backend you select with future::plan(), you might make use of one of the other environment variables listed in ?future::future.options.
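As a hedged sketch of the worker-based configuration described above (the canonical future idiom passes the strategy and workers to plan() separately):
library(future)
future::plan(multisession, workers = 2)      # cap the number of parallel workers
make(my_plan, parallelism = "future_lapply") # jobs is ignored by this backend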
On Windows, do not use make(..., parallelism = "mclapply", jobs = n) with n greater than 1. You could try, but jobs will just be demoted to 1. Instead, please replace "mclapply" with one of the other parallelism_choices() or let drake choose the parallelism backend for you. For make(..., parallelism = "Makefile"), Windows users need to download and install Rtools.
The "future_lapply"
backend unlocks a large array of distributed computing options on serious computing clusters. However, it is your responsibility to configure your workflow for your specific job scheduler. In particular, special batchtools *.tmpl
configuration files are required, and the technique is described in the documentation of batchtools. You can find some examples of these files in the inst/templates
folders of the batchtools and future.batchtools GitHub repositories. Drake
has some built-in prepackaged example workflows. See drake_examples()
to view your options, and then drake_example()
to write the files for an example.
drake_example("sge") # Sun/Univa Grid Engine workflow and supporting files
drake_example("slurm") # SLURM
drake_example("torque") # TORQUE
To write just the *.tmpl files from these examples, see the drake_batchtools_tmpl_file() function.
Unfortunately, there is no one-size-fits-all *.tmpl configuration file for any job scheduler, so we cannot guarantee that the above examples will work for you out of the box. To learn how to configure the files to suit your needs, you should make sure you understand your job scheduler and batchtools.
The Makefile generated by make(myplan, parallelism = "Makefile") is not standalone. Do not run it outside of drake::make(). Drake uses dummy timestamp files to tell the Makefile what to do, and running make in the terminal will most likely give incorrect results.
Makefile-level parallelism is only used for targets in your workflow plan data frame, not imports. To process imported objects and files, drake selects the best local parallel backend for your system and uses the jobs argument to make(). To use at most 2 jobs for imports and at most 4 jobs for targets, run
make(..., parallelism = "Makefile", jobs = 2, args = "--jobs=4")
Some parallel backends, particularly mclapply and future::multicore, may create zombie processes. Zombie children are not usually harmful, but you may wish to kill them yourself. The following function by Carl Boneri should work on Unix-like systems. For a discussion, see drake issue 116.
fork_kill_zombies <- function(){
require(inline)
includes <- "#include <sys/wait.h>"
code <- "int wstat; while (waitpid(-1, &wstat, WNOHANG) > 0) {};"
wait <- inline::cfunction(
body = code,
includes = includes,
convention = ".C"
)
invisible(wait())
}
The storage vignette describes how storage works in drake. As explained near the end of that vignette, you can plug custom storr caches into make(). However, non-RDS caches such as storr_dbi() may not work with parallel computing. In addition, please do not try to change the short hash algorithm of an existing cache.
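Here is a hedged sketch of the custom-cache idea, assuming an existing plan my_plan; the cache path is a placeholder, and an RDS-backed storr is used because of the parallel-computing caveat above.
library(storr)
cache <- storr::storr_rds(".custom_cache", mangle_key = TRUE) # hypothetical path
make(my_plan, cache = cache)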
In predict_runtime() and rate_limiting_times(), drake only accounts for the targets with logged build times. If some targets have not been timed, drake throws a warning and prints the untimed targets.
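A brief hedged sketch, assuming you have already run make(my_plan) so that build times are logged; argument details may vary by drake version.
config <- drake_config(my_plan)
predict_runtime(config)     # uses only the targets with logged build times
rate_limiting_times(config) # the targets that dominate the predicted runtime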