This tutorial is the abridged version of the quickstart vignette. For more information, see the documentation website, which includes the rendered online version of the quickstart vignette.
Is there an association between the weight and the fuel efficiency of cars? To find out, we use the mtcars dataset from the datasets package. The mtcars dataset originally came from the 1974 Motor Trend US magazine, and it contains design and performance data on 32 models of automobile.
# ?mtcars # more info
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Here, wt is weight in thousands of pounds, and mpg is fuel efficiency in miles per gallon. We want to figure out if there is an association between wt and mpg. The mtcars dataset itself only has 32 rows, so we generate two larger bootstrapped datasets and then analyze them with regression models. We summarize the regression models to see if there is an association.
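The bootstrapping and regression steps described above can be sketched in base R. This is a minimal illustration, not the code that ships with the example: the helper name bootstrap_cars(), the column names x and y, and the seed are assumptions made for this sketch.

```r
# Hypothetical helper (not part of drake): resample rows of mtcars
# with replacement to get a bootstrapped dataset of n rows.
bootstrap_cars <- function(n) {
  rows <- sample.int(nrow(mtcars), size = n, replace = TRUE)
  data.frame(x = mtcars$wt[rows], y = mtcars$mpg[rows])
}

set.seed(2018)  # arbitrary seed, only so the sketch is repeatable
small <- bootstrap_cars(48)
large <- bootstrap_cars(64)

# Fit a simple linear model of fuel efficiency (y) on weight (x).
fit_small <- lm(y ~ x, data = small)
fit_large <- lm(y ~ x, data = large)

# A negative slope suggests heavier cars get fewer miles per gallon.
coef(fit_small)
coef(fit_large)
```

Because rows are drawn with replacement, the bootstrapped datasets can have more rows than the 32 in mtcars.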
drake's basic example
Your workspace begins with a bunch of imports: functions, pre-loaded data objects, and saved files available before the real work begins.
load_basic_example(verbose = FALSE) # Get the code with drake_example("basic").
# Drake looks for data objects and functions in your R session environment
ls()
## [1] "b" "bad_plan" "command" "config" "datasets"
## [6] "debug_plan" "envir" "error" "f" "files"
## [11] "good_plan" "my_plan" "myplan" "reg1" "reg2"
## [16] "rules" "simulate" "tmp" "x"
# and saved files in your file system.
list.files()
## [1] "best-practices.R" "best-practices.Rmd" "best-practices.html"
## [4] "best-practices.md" "caution.R" "caution.Rmd"
## [7] "caution.html" "caution.md" "debug.R"
## [10] "debug.Rmd" "debug.html" "debug.md"
## [13] "drake.R" "drake.Rmd" "example-gsp.Rmd"
## [16] "example-packages.Rmd" "graph.Rmd" "logo-vignettes.png"
## [19] "parallelism.Rmd" "quickstart.Rmd" "report.R"
## [22] "report.Rmd" "storage.Rmd" "timing.Rmd"
Your real work is outlined in a data frame of data analysis steps called “targets”. The targets depend on the imports, and drake will figure out how they are all connected.
my_plan
## target
## 1 'report.md'
## 2 small
## 3 large
## 4 regression1_small
## 5 regression1_large
## 6 regression2_small
## 7 regression2_large
## 8 summ_regression1_small
## 9 summ_regression1_large
## 10 summ_regression2_small
## 11 summ_regression2_large
## 12 coef_regression1_small
## 13 coef_regression1_large
## 14 coef_regression2_small
## 15 coef_regression2_large
## command
## 1 knit('report.Rmd', quiet = TRUE)
## 2 simulate(48)
## 3 simulate(64)
## 4 reg1(small)
## 5 reg1(large)
## 6 reg2(small)
## 7 reg2(large)
## 8 suppressWarnings(summary(regression1_small$residuals))
## 9 suppressWarnings(summary(regression1_large$residuals))
## 10 suppressWarnings(summary(regression2_small$residuals))
## 11 suppressWarnings(summary(regression2_large$residuals))
## 12 suppressWarnings(summary(regression1_small))$coefficients
## 13 suppressWarnings(summary(regression1_large))$coefficients
## 14 suppressWarnings(summary(regression2_small))$coefficients
## 15 suppressWarnings(summary(regression2_large))$coefficients
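Since the plan printed above is an ordinary data frame with a target column and a command column, a small one can be written by hand. The sketch below uses only base R; the name toy_plan is made up for illustration, and the two rows are copied from the printed plan.

```r
# A hand-written plan with the same two columns as the plan printed above.
toy_plan <- data.frame(
  target = c("small", "regression1_small"),
  command = c("simulate(48)", "reg1(small)"),
  stringsAsFactors = FALSE
)

# Each command is a string of R code, so it can be parsed into a call
# without being evaluated.
parsed <- lapply(toy_plan$command, function(cmd) parse(text = cmd)[[1]])
vapply(parsed, is.call, logical(1))
```

Keeping commands as unevaluated code is what lets a tool inspect them before running anything.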
Wildcard templating generates these data frames at scale.
library(magrittr)
dataset_plan <- drake_plan(
  small = simulate(5),
  large = simulate(50)
)
dataset_plan
## target command
## 1 small simulate(5)
## 2 large simulate(50)
analysis_methods <- drake_plan(
  regression = regNUMBER(dataset__) # nolint
) %>%
  evaluate_plan(wildcard = "NUMBER", values = 1:2)
analysis_methods
## target command
## 1 regression_1 reg1(dataset__)
## 2 regression_2 reg2(dataset__)
analysis_plan <- plan_analyses(
  plan = analysis_methods,
  datasets = dataset_plan
)
analysis_plan
## target command
## 1 regression_1_small reg1(small)
## 2 regression_1_large reg1(large)
## 3 regression_2_small reg2(small)
## 4 regression_2_large reg2(large)
whole_plan <- rbind(dataset_plan, analysis_plan)
whole_plan
## target command
## 1 small simulate(5)
## 2 large simulate(50)
## 3 regression_1_small reg1(small)
## 4 regression_1_large reg1(large)
## 5 regression_2_small reg2(small)
## 6 regression_2_large reg2(large)
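To make the idea of wildcard templating concrete, here is a rough base R sketch of the text substitution involved. It is not how evaluate_plan() or plan_analyses() are implemented; expand_wildcard() is a hypothetical helper written only for this illustration.

```r
# Hypothetical helper (not drake's implementation): make one copy of the
# plan per value, substituting the value for the wildcard in each command
# and appending it to each target name.
expand_wildcard <- function(plan, wildcard, values) {
  copies <- lapply(values, function(value) {
    data.frame(
      target = paste(plan$target, value, sep = "_"),
      command = gsub(wildcard, as.character(value), plan$command, fixed = TRUE),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, copies)
}

methods <- data.frame(
  target = "regression",
  command = "regNUMBER(dataset__)",
  stringsAsFactors = FALSE
)

# Step 1: fill in the NUMBER wildcard.
by_number <- expand_wildcard(methods, "NUMBER", 1:2)

# Step 2: fill in the dataset__ placeholder with each dataset name.
by_dataset <- expand_wildcard(by_number, "dataset__", c("small", "large"))
by_dataset
```

The sketch yields the same four target and command pairs as the analysis plan above, though the row order may differ.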
Using static code analysis, drake detects the dependencies of all your targets. The result is an interactive network diagram.
vis_drake_graph(my_plan)
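As a rough idea of what static code analysis means here, base R can list the names a command mentions without running it. The sketch below is an assumption-laden illustration, not drake's actual dependency detection; find_deps() and the known vector are hypothetical.

```r
# Names the analysis should recognize (taken from the plan shown earlier).
known <- c("simulate", "reg1", "reg2", "small", "large", "regression1_small")

# Hypothetical helper: list the known names that a command mentions.
# The command is parsed but never evaluated.
find_deps <- function(command, known) {
  expr <- parse(text = command)[[1]]
  intersect(all.names(expr), known)
}

find_deps("reg1(small)", known)
find_deps("suppressWarnings(summary(regression1_small$residuals))", known)
```

Each detected name becomes an edge in the dependency network: the command reg1(small) depends on both the function reg1 and the target small.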
At this point, all your targets are out of date because the project is new.
config <- drake_config(my_plan, verbose = FALSE) # Master configuration list
outdated(config)
## [1] "'report.md'" "coef_regression1_large"
## [3] "coef_regression1_small" "coef_regression2_large"
## [5] "coef_regression2_small" "large"
## [7] "regression1_large" "regression1_small"
## [9] "regression2_large" "regression2_small"
## [11] "small" "summ_regression1_large"
## [13] "summ_regression1_small" "summ_regression2_large"
## [15] "summ_regression2_small"
The make() function traverses the network and builds the targets that require updates.
make(my_plan)
## cache /tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/.drake
## connect 23 imports: reg1, error, reg2, envir, b, rules, files, command, simul...
## connect 15 targets: 'report.md', small, large, regression1_small, regression1...
## check 9 items: 'report.Rmd', data.frame, knit, lm, mtcars, nrow, sample.int, ...
## check 3 items: reg1, reg2, simulate
## check 2 items: large, small
## target large
## target small
## check 4 items: regression1_large, regression1_small, regression2_large, regre...
## target regression1_large
## target regression1_small
## target regression2_large
## target regression2_small
## check 8 items: coef_regression1_large, coef_regression1_small, coef_regressio...
## target coef_regression1_large
## target coef_regression1_small
## target coef_regression2_large
## target coef_regression2_small
## target summ_regression1_large
## target summ_regression1_small
## target summ_regression2_large
## target summ_regression2_small
## check 1 item: 'report.md'
## unload 11 items: regression1_small, regression1_large, regression2_small, reg...
## target 'report.md'
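The build order in the log above (datasets first, then regressions, then summaries) follows from the dependencies. A toy version of that traversal can be sketched in base R. This is only a conceptual illustration under simplifying assumptions: build_all() and chain_plan are hypothetical, and the sketch has no caching across runs or file tracking.

```r
# A toy plan, deliberately listed out of order.
chain_plan <- data.frame(
  target = c("doubled", "total", "input"),
  command = c("input * 2", "sum(doubled)", "1:3"),
  stringsAsFactors = FALSE
)

# Hypothetical builder: repeatedly build every target whose
# dependencies are already stored in the cache environment.
build_all <- function(plan) {
  cache <- new.env()
  remaining <- plan
  while (nrow(remaining) > 0) {
    ready <- vapply(remaining$command, function(cmd) {
      deps <- intersect(all.vars(parse(text = cmd)), plan$target)
      all(vapply(deps, exists, logical(1), envir = cache, inherits = FALSE))
    }, logical(1), USE.NAMES = FALSE)
    if (!any(ready)) stop("circular or missing dependency")
    for (i in which(ready)) {
      value <- eval(parse(text = remaining$command[i]), envir = cache)
      assign(remaining$target[i], value, envir = cache)
    }
    remaining <- remaining[!ready, , drop = FALSE]
  }
  cache
}

results <- build_all(chain_plan)
get("total", envir = results)
```

Here input is built first, then doubled, then total, regardless of the row order in the plan.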
For the reg2() model on the small dataset, the p-value on x2 is so small that there may be an association between weight and fuel efficiency after all.
readd(coef_regression2_small)
## cache /tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/.drake
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.2685834 1.04298306 27.10359 6.285517e-30
## x2 -0.6517132 0.07053517 -9.23955 4.732794e-12
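The coefficient table above is the kind of matrix that summary() returns for a linear model, so the p-value can be pulled out by row and column name. The sketch below fits directly to mtcars rather than to a bootstrapped dataset, and it assumes that reg2() regresses y on the square of x, as the x2 row suggests.

```r
# Mirror the assumed shape of the reg2() model: regress y on x squared.
# Here x is weight and y is fuel efficiency, taken straight from mtcars.
d <- data.frame(x = mtcars$wt, y = mtcars$mpg)
d$x2 <- d$x ^ 2
fit <- lm(y ~ x2, data = d)

# The coefficient table is a numeric matrix with named rows and columns.
coefs <- summary(fit)$coefficients
p_value <- coefs["x2", "Pr(>|t|)"]

# Compare against a conventional 5% significance threshold.
p_value < 0.05
```

The numbers will differ from the table above because the data are not resampled here.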
The project is currently up to date, so the next make() does nothing.
make(my_plan)
## cache /tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/.drake
## Unloading targets from environment:
## coef_regression2_small
## small
## large
## connect 23 imports: reg1, error, reg2, envir, b, rules, files, command, simul...
## connect 15 targets: 'report.md', small, large, regression1_small, regression1...
## check 9 items: 'report.Rmd', data.frame, knit, lm, mtcars, nrow, sample.int, ...
## check 3 items: reg1, reg2, simulate
## check 2 items: large, small
## check 4 items: regression1_large, regression1_small, regression2_large, regre...
## check 8 items: coef_regression1_large, coef_regression1_small, coef_regressio...
## check 1 item: 'report.md'
## All targets are already up to date.
But a nontrivial change in reg2() triggers updates to all the affected downstream targets.
reg2 <- function(d){
  d$x3 <- d$x ^ 3
  lm(y ~ x3, data = d)
}
make(my_plan)
## cache /tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/.drake
## connect 23 imports: reg1, error, reg2, envir, b, rules, files, command, simul...
## connect 15 targets: 'report.md', small, large, regression1_small, regression1...
## check 9 items: 'report.Rmd', data.frame, knit, lm, mtcars, nrow, sample.int, ...
## check 3 items: reg1, reg2, simulate
## check 2 items: large, small
## check 4 items: regression1_large, regression1_small, regression2_large, regre...
## check 4 items: coef_regression1_large, coef_regression1_small, summ_regressio...
## load 2 items: large, small
## target regression2_large
## target regression2_small
## check 4 items: coef_regression2_large, coef_regression2_small, summ_regressio...
## target coef_regression2_large
## target coef_regression2_small
## target summ_regression2_large
## target summ_regression2_small
## check 1 item: 'report.md'
## unload 5 items: regression2_small, regression2_large, summ_regression2_small,...
## target 'report.md'
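One way to picture how a tool can tell a nontrivial edit from a cosmetic one is to compare functions by their parsed code rather than their source text. The sketch below is an assumption for illustration only and does not claim to be drake's actual mechanism; fingerprint() and the three function names are hypothetical, and reg2_old assumes the earlier model used the square of x.

```r
# Hypothetical fingerprint: the deparsed body of a function as one string.
# deparse() works from the parsed code, so spacing differences are dropped.
fingerprint <- function(f) paste(deparse(body(f)), collapse = "\n")

reg2_old <- function(d) {
  d$x2 <- d$x ^ 2
  lm(y ~ x2, data = d)
}

# Same code as reg2_old, with different spacing only.
reg2_respaced <- function(d) {
  d$x2   <-   d$x ^ 2
  lm(y ~ x2,   data = d)
}

# The edited version from the example above.
reg2_new <- function(d) {
  d$x3 <- d$x ^ 3
  lm(y ~ x3, data = d)
}

identical(fingerprint(reg2_old), fingerprint(reg2_respaced))
identical(fingerprint(reg2_old), fingerprint(reg2_new))
```

The first comparison should report a match and the second should not. Under this scheme only the second edit would mark the regression2 targets and everything downstream of them as out of date.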
Drake has built-in example projects. You can generate the code files for an example with drake_example(), and you can list the available examples with drake_examples(). For instance, drake_example("gsp") generates the R script and R Markdown report for the built-in econometrics data analysis project. See below for the currently supported examples.
basic: A tiny, minimal example with the mtcars dataset to demonstrate how to use drake. Use load_basic_example() to set up the project in your workspace. The quickstart vignette is a parallel walkthrough of the same example.
gsp: A more concrete, practical example using real econometrics data. It explores the relationships between gross state product and other quantities, and it shows off drake's ability to generate lots of reproducibly-tracked tasks with ease.
packages: A concrete, practical example using data on R package downloads. It demonstrates how drake can refresh a project based on new incoming data without restarting everything from scratch.
Docker-psock: demonstrates how to deploy targets to a Docker container using a specialized PSOCK cluster.
Makefile-cluster: uses Makefiles to deploy targets to a generic cluster (configurable).
sge: uses "future_lapply" parallelism to deploy targets to a Sun/Univa Grid Engine cluster. Other clusters are similar. See the batchtools/inst/templates and future.batchtools/inst/templates for more example *.tmpl template files.
slurm: similar to sge, but for SLURM.
torque: similar to sge, but for TORQUE.
Regarding the high-performance computing examples, there is no one-size-fits-all *.tmpl configuration file for any job scheduler, so we cannot guarantee that the above examples will work for you out of the box. To learn how to configure the files to suit your needs, you should make sure you understand how to use your job scheduler and batchtools.