This tutorial is the abridged version of the quickstart vignette. For more information, see the documentation website, which includes the rendered online version of the quickstart vignette.

The motivation of the basic example

Is there an association between the weight and the fuel efficiency of cars? To find out, we use the mtcars dataset from the datasets package. The mtcars dataset originally came from the 1974 Motor Trend US magazine, and it contains design and performance data on 32 models of automobile.

# ?mtcars # more info
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Here, wt is weight in thousands of pounds, and mpg is fuel efficiency in miles per gallon. We want to figure out if there is an association between wt and mpg. The mtcars dataset itself only has 32 rows, so we generate two larger bootstrapped datasets and then analyze them with regression models. We summarize the regression models to see if there is an association.
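
The workflow just described can be sketched in base R. The function names simulate(), reg1(), and reg2() match the imports that appear later in this tutorial, but the bodies below are assumptions written for illustration; the definitions that load_basic_example() actually provides may differ.

```r
# Assumed definitions, for illustration only.

# Draw n rows from mtcars with replacement (a bootstrap sample).
# Note: this masks stats::simulate() in the current session.
simulate <- function(n) {
  rows <- sample.int(n = nrow(mtcars), size = n, replace = TRUE)
  data.frame(x = mtcars$wt[rows], y = mtcars$mpg[rows])
}

# Linear model of fuel efficiency on weight.
reg1 <- function(d) {
  lm(y ~ x, data = d)
}

# Linear model of fuel efficiency on squared weight.
reg2 <- function(d) {
  d$x2 <- d$x ^ 2
  lm(y ~ x2, data = d)
}

set.seed(1)                          # make the bootstrap sample reproducible
small <- simulate(48)                # 48 rows resampled from the 32 in mtcars
fit <- reg2(small)
coefs <- summary(fit)$coefficients   # matrix of estimates and p-values
coefs["x2", "Pr(>|t|)"]              # p-value for the squared-weight term
```

A small p-value here would suggest an association between weight and fuel efficiency, which is the question the rest of the tutorial answers with drake.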

A taste of drake's basic example

Your workspace begins with a bunch of imports: functions, pre-loaded data objects, and saved files available before the real work begins.

library(drake)
load_basic_example(verbose = FALSE) # Get the code with drake_example("basic").

# Drake looks for data objects and functions in your R session environment
ls()
##  [1] "b"          "bad_plan"   "command"    "config"     "datasets"  
##  [6] "debug_plan" "envir"      "error"      "f"          "files"     
## [11] "good_plan"  "my_plan"    "myplan"     "reg1"       "reg2"      
## [16] "rules"      "simulate"   "tmp"        "x"

# and saved files in your file system.
list.files()
##  [1] "best-practices.R"     "best-practices.Rmd"   "best-practices.html" 
##  [4] "best-practices.md"    "caution.R"            "caution.Rmd"         
##  [7] "caution.html"         "caution.md"           "debug.R"             
## [10] "debug.Rmd"            "debug.html"           "debug.md"            
## [13] "drake.R"              "drake.Rmd"            "example-gsp.Rmd"     
## [16] "example-packages.Rmd" "graph.Rmd"            "logo-vignettes.png"  
## [19] "parallelism.Rmd"      "quickstart.Rmd"       "report.R"            
## [22] "report.Rmd"           "storage.Rmd"          "timing.Rmd"

Your real work is outlined in a data frame of data analysis steps called “targets”. The targets depend on the imports, and drake will figure out how they are all connected.

my_plan
##                    target
## 1             'report.md'
## 2                   small
## 3                   large
## 4       regression1_small
## 5       regression1_large
## 6       regression2_small
## 7       regression2_large
## 8  summ_regression1_small
## 9  summ_regression1_large
## 10 summ_regression2_small
## 11 summ_regression2_large
## 12 coef_regression1_small
## 13 coef_regression1_large
## 14 coef_regression2_small
## 15 coef_regression2_large
##                                                      command
## 1                           knit('report.Rmd', quiet = TRUE)
## 2                                               simulate(48)
## 3                                               simulate(64)
## 4                                                reg1(small)
## 5                                                reg1(large)
## 6                                                reg2(small)
## 7                                                reg2(large)
## 8     suppressWarnings(summary(regression1_small$residuals))
## 9     suppressWarnings(summary(regression1_large$residuals))
## 10    suppressWarnings(summary(regression2_small$residuals))
## 11    suppressWarnings(summary(regression2_large$residuals))
## 12 suppressWarnings(summary(regression1_small))$coefficients
## 13 suppressWarnings(summary(regression1_large))$coefficients
## 14 suppressWarnings(summary(regression2_small))$coefficients
## 15 suppressWarnings(summary(regression2_large))$coefficients
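
Because the plan is an ordinary data frame with a target column and a command column, a small one can be written by hand in base R. The sketch below rebuilds three rows of my_plan; the object name tiny_plan is hypothetical, and storing the commands as character strings is an assumption based on how the plan prints above.

```r
# A hand-written plan: one row per target, with the command that builds it.
tiny_plan <- data.frame(
  target = c("small", "regression1_small", "coef_regression1_small"),
  command = c(
    "simulate(48)",
    "reg1(small)",
    "suppressWarnings(summary(regression1_small))$coefficients"
  ),
  stringsAsFactors = FALSE # keep targets and commands as plain strings
)
tiny_plan
```

Each command mentions the targets and functions it needs by name, which is what lets the dependency structure be recovered later.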

Wildcard templating generates these data frames at scale.

library(magrittr)
dataset_plan <- drake_plan(
  small = simulate(5),
  large = simulate(50)
)
dataset_plan
##   target      command
## 1  small  simulate(5)
## 2  large simulate(50)

analysis_methods <- drake_plan(
  regression = regNUMBER(dataset__) # nolint
) %>%
  evaluate_plan(wildcard = "NUMBER", values = 1:2)
analysis_methods
##         target         command
## 1 regression_1 reg1(dataset__)
## 2 regression_2 reg2(dataset__)

analysis_plan <- plan_analyses(
  plan = analysis_methods,
  datasets = dataset_plan
)
analysis_plan
##               target     command
## 1 regression_1_small reg1(small)
## 2 regression_1_large reg1(large)
## 3 regression_2_small reg2(small)
## 4 regression_2_large reg2(large)

whole_plan <- rbind(dataset_plan, analysis_plan)
whole_plan
##               target      command
## 1              small  simulate(5)
## 2              large simulate(50)
## 3 regression_1_small  reg1(small)
## 4 regression_1_large  reg1(large)
## 5 regression_2_small  reg2(small)
## 6 regression_2_large  reg2(large)
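
To make the idea of wildcard templating concrete, here is a rough base R sketch of the same expansion using plain string substitution. It is not drake's implementation, and the helper names expand_wildcard() and cross_with_datasets() are hypothetical.

```r
# Replace a wildcard in each command with every value in turn,
# appending the value to the target name.
expand_wildcard <- function(plan, wildcard, values) {
  rows <- lapply(values, function(v) {
    data.frame(
      target = paste(plan$target, v, sep = "_"),
      command = gsub(wildcard, as.character(v), plan$command, fixed = TRUE),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Pair every method with every dataset, filling in the dataset__ placeholder.
cross_with_datasets <- function(methods, datasets) {
  grid <- expand.grid(
    dataset = datasets$target,
    method = seq_len(nrow(methods)),
    stringsAsFactors = FALSE
  )
  data.frame(
    target = paste(methods$target[grid$method], grid$dataset, sep = "_"),
    command = mapply(
      function(cmd, ds) gsub("dataset__", ds, cmd, fixed = TRUE),
      methods$command[grid$method],
      grid$dataset,
      USE.NAMES = FALSE
    ),
    stringsAsFactors = FALSE
  )
}

datasets <- data.frame(
  target = c("small", "large"),
  command = c("simulate(5)", "simulate(50)"),
  stringsAsFactors = FALSE
)
template <- data.frame(
  target = "regression",
  command = "regNUMBER(dataset__)",
  stringsAsFactors = FALSE
)

methods <- expand_wildcard(template, wildcard = "NUMBER", values = 1:2)
analyses <- cross_with_datasets(methods, datasets)
rbind(datasets, analyses) # six rows: two datasets plus four regressions
```

The point of the design is that a two-row template and a two-row dataset plan multiply into four analysis targets without any of them being typed out by hand.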

Using static code analysis, drake detects the dependencies of all your targets. The result is an interactive network diagram.

vis_drake_graph(my_plan)
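
As a rough illustration of what static code analysis means here, the base R sketch below parses each command without running it and keeps the symbols that match a known target or import. This is a simplified stand-in, not drake's actual analysis; the helper find_dependencies() and the small plan are hypothetical.

```r
plan <- data.frame(
  target = c("small", "regression1_small", "coef_regression1_small"),
  command = c(
    "simulate(48)",
    "reg1(small)",
    "suppressWarnings(summary(regression1_small))$coefficients"
  ),
  stringsAsFactors = FALSE
)

# Names that may count as dependencies: other targets plus imported functions.
imports <- c("simulate", "reg1")
known <- c(plan$target, imports)

# Parse the command text (no evaluation) and keep only the known symbols.
find_dependencies <- function(command, known) {
  symbols <- all.names(parse(text = command))
  intersect(symbols, known)
}

deps <- lapply(plan$command, find_dependencies, known = known)
names(deps) <- plan$target
deps
```

Each list element names the nodes that point to that target, which is the information a network diagram needs for its edges.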

At this point, all your targets are out of date because the project is new.

config <- drake_config(my_plan, verbose = FALSE) # Master configuration list
outdated(config)
##  [1] "'report.md'"            "coef_regression1_large"
##  [3] "coef_regression1_small" "coef_regression2_large"
##  [5] "coef_regression2_small" "large"                 
##  [7] "regression1_large"      "regression1_small"     
##  [9] "regression2_large"      "regression2_small"     
## [11] "small"                  "summ_regression1_large"
## [13] "summ_regression1_small" "summ_regression2_large"
## [15] "summ_regression2_small"

The make() function traverses the network and builds the targets that require updates.

make(my_plan)
## cache /tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/.drake
## connect 23 imports: reg1, error, reg2, envir, b, rules, files, command, simul...
## connect 15 targets: 'report.md', small, large, regression1_small, regression1...
## check 9 items: 'report.Rmd', data.frame, knit, lm, mtcars, nrow, sample.int, ...
## check 3 items: reg1, reg2, simulate
## check 2 items: large, small
## target large
## target small
## check 4 items: regression1_large, regression1_small, regression2_large, regre...
## target regression1_large
## target regression1_small
## target regression2_large
## target regression2_small
## check 8 items: coef_regression1_large, coef_regression1_small, coef_regressio...
## target coef_regression1_large
## target coef_regression1_small
## target coef_regression2_large
## target coef_regression2_small
## target summ_regression1_large
## target summ_regression1_small
## target summ_regression2_large
## target summ_regression2_small
## check 1 item: 'report.md'
## unload 11 items: regression1_small, regression1_large, regression2_small, reg...
## target 'report.md'
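
The staged "check" lines in the log above suggest that targets are visited in dependency order. A minimal base R sketch of such a traversal follows, assuming the dependencies are already known; it is an illustration of the general idea rather than drake's scheduler, and build_order() is a hypothetical helper.

```r
# Dependencies of a few targets, deliberately listed out of order.
deps <- list(
  coef_regression1_small = "regression1_small",
  regression1_small = "small",
  regression1_large = "large",
  small = character(0),
  large = character(0)
)

# Repeatedly pick every target whose dependencies are all finished.
build_order <- function(deps) {
  done <- character(0)
  while (length(done) < length(deps)) {
    is_ready <- vapply(
      names(deps),
      function(t) !(t %in% done) && all(deps[[t]] %in% done),
      logical(1)
    )
    ready <- names(deps)[is_ready]
    if (length(ready) == 0) {
      stop("circular dependency detected")
    }
    done <- c(done, ready) # one stage of targets that could be built together
  }
  done
}

build_order(deps)
```

Datasets come first, then the regressions, then the summaries, mirroring the order of the target lines in the log.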

For the reg2() model on the small dataset, the p-value on x2 is so small that there may be an association between weight and fuel efficiency after all.

readd(coef_regression2_small)
## cache /tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/.drake
##               Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 28.2685834 1.04298306 27.10359 6.285517e-30
## x2          -0.6517132 0.07053517 -9.23955 4.732794e-12

The project is currently up to date, so the next make() does nothing.

make(my_plan)
## cache /tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/.drake
## Unloading targets from environment:
##   coef_regression2_small
##   small
##   large
## connect 23 imports: reg1, error, reg2, envir, b, rules, files, command, simul...
## connect 15 targets: 'report.md', small, large, regression1_small, regression1...
## check 9 items: 'report.Rmd', data.frame, knit, lm, mtcars, nrow, sample.int, ...
## check 3 items: reg1, reg2, simulate
## check 2 items: large, small
## check 4 items: regression1_large, regression1_small, regression2_large, regre...
## check 8 items: coef_regression1_large, coef_regression1_small, coef_regressio...
## check 1 item: 'report.md'
## All targets are already up to date.

But a nontrivial change in reg2() triggers updates to all the affected downstream targets.

reg2 <- function(d){
  d$x3 <- d$x ^ 3
  lm(y ~ x3, data = d)
}

make(my_plan)
## cache /tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/.drake
## connect 23 imports: reg1, error, reg2, envir, b, rules, files, command, simul...
## connect 15 targets: 'report.md', small, large, regression1_small, regression1...
## check 9 items: 'report.Rmd', data.frame, knit, lm, mtcars, nrow, sample.int, ...
## check 3 items: reg1, reg2, simulate
## check 2 items: large, small
## check 4 items: regression1_large, regression1_small, regression2_large, regre...
## check 4 items: coef_regression1_large, coef_regression1_small, summ_regressio...
## load 2 items: large, small
## target regression2_large
## target regression2_small
## check 4 items: coef_regression2_large, coef_regression2_small, summ_regressio...
## target coef_regression2_large
## target coef_regression2_small
## target summ_regression2_large
## target summ_regression2_small
## check 1 item: 'report.md'
## unload 5 items: regression2_small, regression2_large, summ_regression2_small,...
## target 'report.md'
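
One way a tool could tell a nontrivial change from a cosmetic one is to compare the deparsed body of a function, which drops comments and normalizes whitespace. The sketch below shows the idea in base R; it is an assumption about the general approach, not a description of drake's internals, and fingerprint() is a hypothetical helper.

```r
# Text form of a function body, without comments or original spacing.
fingerprint <- function(f) {
  paste(deparse(body(f)), collapse = "\n")
}

reg2_original <- function(d) {
  d$x2 <- d$x ^ 2
  lm(y ~ x2, data = d)
}

# Same logic, but with a comment and different spacing.
reg2_cosmetic <- function(d) {
  # square the weight
  d$x2   <-   d$x^2
  lm(y ~ x2, data = d)
}

# Different logic: cubes the weight instead.
reg2_changed <- function(d) {
  d$x3 <- d$x ^ 3
  lm(y ~ x3, data = d)
}

identical(fingerprint(reg2_original), fingerprint(reg2_cosmetic)) # TRUE
identical(fingerprint(reg2_original), fingerprint(reg2_changed))  # FALSE
```

Under this scheme only the third definition would mark the reg2() targets, and everything downstream of them, as out of date.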

Built-in example projects

Drake has built-in example projects. You can generate the code files for an example with drake_example(), and you can list the available examples with drake_examples(). For instance, drake_example("gsp") generates the R script and R Markdown report for the built-in econometrics data analysis project. See below for the currently supported examples.

Learn how to use drake.

High-performance computing

There is no one-size-fits-all *.tmpl configuration file for any job scheduler, so we cannot guarantee that the high-performance computing examples will work for you out of the box. To adapt the files to your needs, make sure you understand how to use your job scheduler and batchtools.