Cross-validating model selection

John Fox and Georges Monette

2024-09-22

As Hastie, Tibshirani, & Friedman (2009, sec. 7.10.2: “The Wrong and Right Way to Do Cross-validation”) explain, if the whole data are used to select or fine-tune a statistical model, subsequent cross-validation of the model is intrinsically misleading, because the model is selected to fit the whole data, including the part of the data that remains when each fold is removed.

A preliminary example

The following example is similar in spirit to one employed by Hastie et al. (2009). Suppose that we randomly generate \(n = 1000\) independent observations for a response variable variable \(y \sim N(\mu = 10, \sigma^2 = 0)\), and independently sample \(1000\) observations for \(p = 100\) “predictors,” \(x_1, \ldots, x_{100}\), each from \(x_j \sim N(0, 1)\). The response has nothing to do with the predictors and so the population linear-regression model \(y_i = \alpha + \beta_1 x_{i1} + \cdots + \beta_{100} x_{i,100} + \varepsilon_i\) has \(\alpha = 10\) and all \(\beta_j = 0\).

set.seed(24361) # for reproducibility
D <- data.frame(y = rnorm(1000, mean = 10),
                X = matrix(rnorm(1000 * 100), 1000, 100))
head(D[, 1:6])
#>         y      X.1      X.2      X.3       X.4       X.5
#> 1 10.0316 -1.23886 -0.26487 -0.03539 -2.576973  0.811048
#> 2  9.6650  0.12287 -0.17744  0.37290 -0.935138  0.628673
#> 3 10.0232 -0.95052 -0.73487 -1.05978  0.882944  0.023918
#> 4  8.9910  1.13571  0.32411  0.11037  1.376303 -0.422114
#> 5  9.0712  1.49474  1.87538  0.10575  0.292140 -0.184568
#> 6 11.3493 -0.18453 -0.78037 -1.23804 -0.010949  0.691034

Least-squares provides accurate estimates of the regression constant \(\alpha = 10\) and the error variance \(\sigma^2 = 1\) for the “null model” including only the regression constant; moreover, the omnibus \(F\)-test of the correct null hypothesis that all of the \(\beta\)s are 0 for the “full model” with all 100 \(x\)s is associated with a large \(p\)-value:

m.full <- lm(y ~ ., data = D)
m.null <- lm(y ~ 1, data = D)
anova(m.null, m.full)
#> Analysis of Variance Table
#> 
#> Model 1: y ~ 1
#> Model 2: y ~ X.1 + X.2 + X.3 + X.4 + X.5 + X.6 + X.7 + X.8 + X.9 + X.10 + 
#>     X.11 + X.12 + X.13 + X.14 + X.15 + X.16 + X.17 + X.18 + X.19 + 
#>     X.20 + X.21 + X.22 + X.23 + X.24 + X.25 + X.26 + X.27 + X.28 + 
#>     X.29 + X.30 + X.31 + X.32 + X.33 + X.34 + X.35 + X.36 + X.37 + 
#>     X.38 + X.39 + X.40 + X.41 + X.42 + X.43 + X.44 + X.45 + X.46 + 
#>     X.47 + X.48 + X.49 + X.50 + X.51 + X.52 + X.53 + X.54 + X.55 + 
#>     X.56 + X.57 + X.58 + X.59 + X.60 + X.61 + X.62 + X.63 + X.64 + 
#>     X.65 + X.66 + X.67 + X.68 + X.69 + X.70 + X.71 + X.72 + X.73 + 
#>     X.74 + X.75 + X.76 + X.77 + X.78 + X.79 + X.80 + X.81 + X.82 + 
#>     X.83 + X.84 + X.85 + X.86 + X.87 + X.88 + X.89 + X.90 + X.91 + 
#>     X.92 + X.93 + X.94 + X.95 + X.96 + X.97 + X.98 + X.99 + X.100
#>   Res.Df RSS  Df Sum of Sq    F Pr(>F)
#> 1    999 974                          
#> 2    899 888 100      85.2 0.86   0.82

summary(m.null)
#> 
#> Call:
#> lm(formula = y ~ 1, data = D)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -3.458 -0.681  0.019  0.636  2.935 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   9.9370     0.0312     318   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.987 on 999 degrees of freedom

Next, using the stepAIC() function in the MASS package (Venables & Ripley, 2002), let us perform a forward stepwise regression to select a “best” model, starting with the null model, and using AIC as the model-selection criterion (see the help page for stepAIC() for details):1

library("MASS")  # for stepAIC()
m.select <- stepAIC(
  m.null,
  direction = "forward",
  trace = FALSE,
  scope = list(lower =  ~ 1, upper = formula(m.full))
)
summary(m.select)
#> 
#> Call:
#> lm(formula = y ~ X.99 + X.90 + X.87 + X.40 + X.65 + X.91 + X.53 + 
#>     X.45 + X.31 + X.56 + X.61 + X.60 + X.46 + X.35 + X.92, data = D)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -3.262 -0.645  0.024  0.641  3.118 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   9.9372     0.0310  320.80   <2e-16 ***
#> X.99         -0.0910     0.0308   -2.95   0.0032 ** 
#> X.90         -0.0820     0.0314   -2.62   0.0090 ** 
#> X.87         -0.0694     0.0311   -2.24   0.0256 *  
#> X.40         -0.0476     0.0308   -1.55   0.1221    
#> X.65         -0.0552     0.0315   -1.76   0.0795 .  
#> X.91          0.0524     0.0308    1.70   0.0894 .  
#> X.53         -0.0492     0.0305   -1.61   0.1067    
#> X.45          0.0554     0.0318    1.74   0.0818 .  
#> X.31          0.0452     0.0311    1.46   0.1457    
#> X.56          0.0543     0.0327    1.66   0.0972 .  
#> X.61         -0.0508     0.0317   -1.60   0.1091    
#> X.60         -0.0513     0.0319   -1.61   0.1083    
#> X.46          0.0516     0.0327    1.58   0.1153    
#> X.35          0.0470     0.0315    1.49   0.1358    
#> X.92          0.0443     0.0310    1.43   0.1533    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.973 on 984 degrees of freedom
#> Multiple R-squared:  0.0442, Adjusted R-squared:  0.0296 
#> F-statistic: 3.03 on 15 and 984 DF,  p-value: 8.34e-05

library("cv")
mse(D$y, fitted(m.select))
#> [1] 0.93063
#> attr(,"casewise loss")
#> [1] "(y - yhat)^2"

The resulting model has 15 predictors, a very modest \(R^2 = .044\), but a small \(p\)-value for its omnibus \(F\)-test (which, of course, is entirely spurious because the same data were used to select and test the model). The MSE for the selected model is smaller than the true error variance \(\sigma^2 = 1\), as is the estimated error variance for the selected model, \(\widehat{\sigma}^2 = 0.973^2 = 0.947\).

If we cross-validate the selected model, we also obtain an optimistic estimate of its predictive power (although the confidence interval for the bias-adjusted MSE includes 1):

library("cv")

summary(cv(m.select, seed = 2529))
#> R RNG seed set to 2529
#> 10-Fold Cross Validation
#> method: Woodbury
#> criterion: mse
#> cross-validation criterion = 0.95937
#> bias-adjusted cross-validation criterion = 0.95785
#> 95% CI for bias-adjusted CV criterion = (0.87661, 1.0391)
#> full-sample criterion = 0.93063

The "function" method of cv() allows us to cross-validate the whole model-selection procedure, where first argument to cv() is a model-selection function capable of refitting the model with a fold omitted and returning a CV criterion. The selectStepAIC() function, in the cv package and based on stepAIC(), is suitable for use with cv():

cv.select <- cv(
  selectStepAIC,
  data = D,
  seed = 3791,
  working.model = m.null,
  direction = "forward",
  scope = list(lower =  ~ 1, upper = formula(m.full))
)
#> R RNG seed set to 3791
summary(cv.select)
#> 10-Fold Cross Validation
#> criterion: mse
#> cross-validation criterion = 1.0687
#> bias-adjusted cross-validation criterion = 1.0612
#> 95% CI for bias-adjusted CV criterion = (0.97172, 1.1506)
#> full-sample criterion = 0.93063

The other arguments to cv() are:

By default, cv() performs 10-fold CV, and produces an estimate of MSE for the model-selection procedure even larger than the true error variance, \(\sigma^2 = 1\).

Also by default, when the number of folds is 10 or fewer, cv() saves details data about the folds. In this example, the compareFolds() function reveals that the variables retained by the model-selection process in the several folds are quite different:

compareFolds(cv.select)
#> CV criterion by folds:
#>  fold.1  fold.2  fold.3  fold.4  fold.5  fold.6  fold.7  fold.8  fold.9 fold.10 
#> 1.26782 1.12837 1.04682 1.31007 1.06899 0.87916 0.88380 0.95026 1.21070 0.94130 
#> 
#> Coefficients by folds:
#>         (Intercept)    X.87    X.90    X.99    X.91    X.54    X.53    X.56
#> Fold 1       9.9187 -0.0615 -0.0994 -0.0942  0.0512  0.0516                
#> Fold 2       9.9451 -0.0745 -0.0899 -0.0614          0.0587          0.0673
#> Fold 3       9.9423 -0.0783 -0.0718 -0.0987  0.0601                  0.0512
#> Fold 4       9.9410 -0.0860 -0.0831 -0.0867  0.0570         -0.0508        
#> Fold 5       9.9421 -0.0659 -0.0849 -0.1004  0.0701  0.0511 -0.0487  0.0537
#> Fold 6       9.9633 -0.0733 -0.0874 -0.0960  0.0555  0.0629 -0.0478        
#> Fold 7       9.9279 -0.0618 -0.0960 -0.0838  0.0533         -0.0464        
#> Fold 8       9.9453 -0.0610 -0.0811 -0.0818          0.0497 -0.0612  0.0560
#> Fold 9       9.9173 -0.0663 -0.0894 -0.1100  0.0504  0.0524          0.0747
#> Fold 10      9.9449 -0.0745 -0.0906 -0.0891  0.0535  0.0482 -0.0583  0.0642
#>            X.40    X.45    X.65    X.68    X.92    X.15    X.26    X.46    X.60
#> Fold 1                  -0.0590                 -0.0456  0.0658  0.0608        
#> Fold 2                                   0.0607          0.0487                
#> Fold 3  -0.0496         -0.0664          0.0494                                
#> Fold 4  -0.0597  0.0579 -0.0531          0.0519 -0.0566                 -0.0519
#> Fold 5                           0.0587                          0.0527 -0.0603
#> Fold 6  -0.0596  0.0552          0.0474                                        
#> Fold 7           0.0572          0.0595                                        
#> Fold 8           0.0547 -0.0617  0.0453  0.0493 -0.0613  0.0591  0.0703 -0.0588
#> Fold 9  -0.0552  0.0573 -0.0635  0.0492         -0.0513  0.0484         -0.0507
#> Fold 10 -0.0558                          0.0529                  0.0710        
#>            X.61     X.8    X.28    X.29    X.31    X.35    X.70    X.89    X.17
#> Fold 1  -0.0490          0.0616 -0.0537                  0.0638                
#> Fold 2           0.0671                  0.0568                  0.0523        
#> Fold 3  -0.0631          0.0616                                                
#> Fold 4           0.0659         -0.0549          0.0527                  0.0527
#> Fold 5           0.0425                  0.0672  0.0613          0.0493        
#> Fold 6           0.0559         -0.0629  0.0498          0.0487                
#> Fold 7                                                           0.0611  0.0472
#> Fold 8  -0.0719                                          0.0586                
#> Fold 9                   0.0525                                                
#> Fold 10 -0.0580                                  0.0603                        
#>            X.25     X.4    X.64    X.81    X.97    X.11     X.2    X.33    X.47
#> Fold 1                                   0.0604          0.0575                
#> Fold 2   0.0478          0.0532  0.0518                                        
#> Fold 3                           0.0574                          0.0473        
#> Fold 4                   0.0628                                                
#> Fold 5   0.0518                                                                
#> Fold 6                                           0.0521                        
#> Fold 7           0.0550                                                        
#> Fold 8                                                                         
#> Fold 9                                   0.0556                          0.0447
#> Fold 10          0.0516                                                        
#>             X.6    X.72    X.73    X.77    X.79 X.88
#> Fold 1   0.0476                                     
#> Fold 2                   0.0514                     
#> Fold 3                                              
#> Fold 4                                  -0.0473     
#> Fold 5           0.0586                         0.07
#> Fold 6                          -0.0489             
#> Fold 7                                              
#> Fold 8                                              
#> Fold 9                                              
#> Fold 10

Polynomial regression for the Auto data revisited: meta cross-validation

In the introductory vignette on cross-validating regression models, following James, Witten, Hastie, & Tibshirani (2021, secs. 5.1, 5.3), we fit polynomial regressions up to degree 10 to the relationship of mpg to horsepower for the Auto data, saving the results in m.1 through m.10. We then used cv() to compare the cross-validated MSE for the 10 models, discovering that the 7th degree polynomial had the smallest MSE (by a small margin); repeating the relevant graph:
Cross-validated 10-fold and LOO MSE as a function of polynomial degree, $p$ (repeated)

Cross-validated 10-fold and LOO MSE as a function of polynomial degree, \(p\) (repeated)

If we then select the 7th degree polynomial model, intending to use it for prediction, the CV estimate of the MSE for this model will be optimistic. One solution is to cross-validate the process of using CV to select the “best” model—that is, to apply CV to CV recursively, a process that we term “meta cross-validation.” The function selectModelList(), which is suitable for use with cv(), implements this idea.

Applying selectModelList() to the Auto polynomial-regression models, and using 10-fold CV, we obtain:

metaCV.auto <- cv(
  selectModelList,
  Auto,
  working.model = models(m.1, m.2, m.3, m.4, m.5,
                         m.6, m.7, m.8, m.9, m.10),
  save.model = TRUE,
  seed = 2120
)
#> R RNG seed set to 2120
summary(metaCV.auto)
#> 10-Fold Cross Validation
#> criterion: mse
#> cross-validation criterion = 19.105
#> bias-adjusted cross-validation criterion = 19.68
#> full-sample criterion = 18.746
(m.sel <- cvInfo(metaCV.auto, "selected model"))
#> 
#> Call:
#> lm(formula = mpg ~ poly(horsepower, 7), data = Auto)
#> 
#> Coefficients:
#>          (Intercept)  poly(horsepower, 7)1  poly(horsepower, 7)2  
#>                23.45               -120.14                 44.09  
#> poly(horsepower, 7)3  poly(horsepower, 7)4  poly(horsepower, 7)5  
#>                -3.95                 -5.19                 13.27  
#> poly(horsepower, 7)6  poly(horsepower, 7)7  
#>                -8.55                  7.98
cv(m.sel, seed = 2120) # same seed for same folds
#> R RNG seed set to 2120
#> cross-validation criterion (mse) = 18.898

As expected, meta CV produces a larger estimate of MSE for the selected 7th degree polynomial model than CV applied directly to this model.

We can equivalently call cv() with the list of models as the first argument and set meta=TRUE:

metaCV.auto.alt <- cv(
  models(m.1, m.2, m.3, m.4, m.5,
         m.6, m.7, m.8, m.9, m.10),
  data = Auto,
  seed = 2120,
  meta = TRUE,
  save.model = TRUE
)
#> R RNG seed set to 2120
all.equal(metaCV.auto, metaCV.auto.alt)
#> [1] "Component \"criterion\": 1 string mismatch"

Mroz’s logistic regression revisited

Next, let’s apply model selection to Mroz’s logistic regression for married women’s labor-force participation, also discussed in the introductory vignette on cross-validating regression models. First, recall the logistic regression model that we fit to the Mroz data:

data("Mroz", package = "carData")
m.mroz <- glm(lfp ~ ., data = Mroz, family = binomial)
summary(m.mroz)
#> 
#> Call:
#> glm(formula = lfp ~ ., family = binomial, data = Mroz)
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)  3.18214    0.64438    4.94  7.9e-07 ***
#> k5          -1.46291    0.19700   -7.43  1.1e-13 ***
#> k618        -0.06457    0.06800   -0.95  0.34234    
#> age         -0.06287    0.01278   -4.92  8.7e-07 ***
#> wcyes        0.80727    0.22998    3.51  0.00045 ***
#> hcyes        0.11173    0.20604    0.54  0.58762    
#> lwg          0.60469    0.15082    4.01  6.1e-05 ***
#> inc         -0.03445    0.00821   -4.20  2.7e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1029.75  on 752  degrees of freedom
#> Residual deviance:  905.27  on 745  degrees of freedom
#> AIC: 921.3
#> 
#> Number of Fisher Scoring iterations: 4

Applying stepwise model selection Mroz’s logistic regression, using BIC as the model-selection criterion (via the argument k=log(nrow(Mroz)) to stepAIC()) selects 5 of the 7 original predictors:

m.mroz.sel <- stepAIC(m.mroz, k = log(nrow(Mroz)),
                      trace = FALSE)
summary(m.mroz.sel)
#> 
#> Call:
#> glm(formula = lfp ~ k5 + age + wc + lwg + inc, family = binomial, 
#>     data = Mroz)
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)   2.9019     0.5429    5.35  9.0e-08 ***
#> k5           -1.4318     0.1932   -7.41  1.3e-13 ***
#> age          -0.0585     0.0114   -5.13  2.9e-07 ***
#> wcyes         0.8724     0.2064    4.23  2.4e-05 ***
#> lwg           0.6157     0.1501    4.10  4.1e-05 ***
#> inc          -0.0337     0.0078   -4.32  1.6e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1029.75  on 752  degrees of freedom
#> Residual deviance:  906.46  on 747  degrees of freedom
#> AIC: 918.5
#> 
#> Number of Fisher Scoring iterations: 3
BayesRule(Mroz$lfp == "yes",
          predict(m.mroz.sel, type = "response"))
#> [1] 0.31873
#> attr(,"casewise loss")
#> [1] "y != round(yhat)"

Bayes rule applied to the selected model misclassifies 32% of the cases in the Mroz data.

Cross-validating the selected model produces a similar, slightly larger, estimate of misclassification, about 33%:

summary(cv(m.mroz.sel, criterion = BayesRule, seed = 345266))
#> R RNG seed set to 345266
#> 10-Fold Cross Validation
#> method: exact
#> criterion: BayesRule
#> cross-validation criterion = 0.33068
#> bias-adjusted cross-validation criterion = 0.33332
#> 95% CI for bias-adjusted CV criterion = (0.2997, 0.36695)
#> full-sample criterion = 0.31873

Is this estimate of predictive performance optimistic?

We proceed to apply the model-selection procedure by cross-validation, producing more or less the same result:

m.mroz.sel.cv <- cv(
  selectStepAIC,
  Mroz,
  seed = 6681,
  criterion = BayesRule,
  working.model = m.mroz,
  AIC = FALSE
)
#> R RNG seed set to 6681
summary(m.mroz.sel.cv)
#> 10-Fold Cross Validation
#> criterion: BayesRule
#> cross-validation criterion = 0.33068
#> bias-adjusted cross-validation criterion = 0.33452
#> 95% CI for bias-adjusted CV criterion = (0.3009, 0.36815)
#> full-sample criterion = 0.31873

Setting AIC=FALSE in the call to cv() uses the BIC rather than the AIC as the model-selection criterion. As it turns out, exactly the same predictors are selected when each of the 10 folds are omitted, and the several coefficient estimates are very similar, as we show using compareFolds():

compareFolds(m.mroz.sel.cv)
#> CV criterion by folds:
#>  fold.1  fold.2  fold.3  fold.4  fold.5  fold.6  fold.7  fold.8  fold.9 fold.10 
#> 0.27632 0.40789 0.34211 0.36000 0.28000 0.37333 0.32000 0.29333 0.33333 0.32000 
#> 
#> Coefficients by folds:
#>         (Intercept)     age     inc      k5     lwg wcyes
#> Fold 1       2.5014 -0.0454 -0.0388 -1.3613  0.5653  0.85
#> Fold 2       3.0789 -0.0659 -0.0306 -1.5335  0.6923  0.79
#> Fold 3       3.0141 -0.0595 -0.0305 -1.3994  0.5428  0.86
#> Fold 4       2.7251 -0.0543 -0.0354 -1.4474  0.6298  1.09
#> Fold 5       2.7617 -0.0566 -0.0320 -1.4752  0.6324  0.74
#> Fold 6       3.0234 -0.0621 -0.0348 -1.4537  0.6618  0.94
#> Fold 7       2.9615 -0.0600 -0.0351 -1.4127  0.5835  0.97
#> Fold 8       2.9598 -0.0603 -0.0329 -1.3865  0.6210  0.69
#> Fold 9       3.2481 -0.0650 -0.0381 -1.4138  0.6093  0.94
#> Fold 10      2.7724 -0.0569 -0.0295 -1.4503  0.6347  0.85

In this example, therefore, we appear to obtain a realistic estimate of model performance directly from the selected model, because there is little added uncertainty induced by model selection.

Cross-validating choice of transformations in regression

The cv package also provides a cv() procedure, selectTrans(), for choosing transformations of the predictors and the response in regression.

Some background: As Weisberg (2014, sec. 8.2) explains, there are technical advantages to having (numeric) predictors in linear regression analysis that are themselves linearly related. If the predictors aren’t linearly related, then the relationships between them can often be straightened by power transformations. Transformations can be selected after graphical examination of the data, or by analytic methods. Once the relationships between the predictors are linearized, it can be advantageous similarly to transform the response variable towards normality.

Selecting transformations analytically raises the possibility of automating the process, as would be required for cross-validation. One could, in principle, apply graphical methods to select transformations for each fold, but because a data analyst couldn’t forget the choices made for previous folds, the process wouldn’t really be applied independently to the folds.

To illustrate, we adapt an example appearing in several places in Fox & Weisberg (2019) (for example in Chapter 3 on transforming data), using data on the prestige and other characteristics of 102 Canadian occupations circa 1970. The data are in the Prestige data frame in the carData package:

data("Prestige", package = "carData")
head(Prestige)
#>                     education income women prestige census type
#> gov.administrators      13.11  12351 11.16     68.8   1113 prof
#> general.managers        12.26  25879  4.02     69.1   1130 prof
#> accountants             12.77   9271 15.70     63.4   1171 prof
#> purchasing.officers     11.42   8865  9.11     56.8   1175 prof
#> chemists                14.62   8403 11.68     73.5   2111 prof
#> physicists              15.64  11030  5.13     77.6   2113 prof
summary(Prestige)
#>    education         income          women          prestige        census    
#>  Min.   : 6.38   Min.   :  611   Min.   : 0.00   Min.   :14.8   Min.   :1113  
#>  1st Qu.: 8.45   1st Qu.: 4106   1st Qu.: 3.59   1st Qu.:35.2   1st Qu.:3120  
#>  Median :10.54   Median : 5930   Median :13.60   Median :43.6   Median :5135  
#>  Mean   :10.74   Mean   : 6798   Mean   :28.98   Mean   :46.8   Mean   :5402  
#>  3rd Qu.:12.65   3rd Qu.: 8187   3rd Qu.:52.20   3rd Qu.:59.3   3rd Qu.:8312  
#>  Max.   :15.97   Max.   :25879   Max.   :97.51   Max.   :87.2   Max.   :9517  
#>    type   
#>  bc  :44  
#>  prof:31  
#>  wc  :23  
#>  NA's: 4  
#>           
#> 

The variables in the Prestige data set are:

The object of a regression analysis for the Prestige data (and their original purpose) is to predict occupational prestige from the other variables in the data set.

A scatterplot matrix (using the scatterplotMatrix() function in the car package) of the numeric variables in the data reveals that the distributions of income and women are positively skewed, and that some of the relationships among the three predictors, and between the predictors and the response (i.e., prestige), are nonlinear:

library("car")
#> Loading required package: carData
#> 
#> Attaching package: 'car'
#> The following object is masked from 'package:dplyr':
#> 
#>     recode
scatterplotMatrix(
  ~ prestige + income + education + women,
  data = Prestige,
  smooth = list(spread = FALSE)
)
Scatterplot matrix for the `Prestige` data.

Scatterplot matrix for the Prestige data.

The powerTransform() function in the car package transforms variables towards multivariate normality by a generalization of Box and Cox’s maximum-likelihood-like approach (Box & Cox, 1964). Several “families” of power transformations can be used, including the original Box-Cox family, simple powers (and roots), and two adaptations of the Box-Cox family to data that may include negative values and zeros: the Box-Cox-with-negatives family and the Yeo-Johnson family; see Weisberg (2014, Chapter 8), and Fox & Weisberg (2019, Chapter 3) for details. Because women has some zero values, we use the Yeo-Johnson family:

trans <- powerTransform(cbind(income, education, women) ~ 1,
                        data = Prestige,
                        family = "yjPower")
summary(trans)
#> yjPower Transformations to Multinormality 
#>           Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
#> income       0.2678        0.33       0.1051       0.4304
#> education    0.5162        1.00      -0.2822       1.3145
#> women        0.1630        0.16       0.0112       0.3149
#> 
#>  Likelihood ratio test that all transformation parameters are equal to 0
#>                              LRT df    pval
#> LR test, lambda = (0 0 0) 15.739  3 0.00128

We thus have evidence of the desirability of transforming income (by the \(1/3\) power) and women (by the \(0.16\) power—which is close to the “0” power, i.e., the log transformation), but not education. Applying the “rounded” power transformations makes the predictors better-behaved:

P <- Prestige[, c("prestige", "income", "education", "women")]
(lambdas <- trans$roundlam)
#>    income education     women 
#>   0.33000   1.00000   0.16302
names(lambdas) <- c("income", "education", "women")
for (var in c("income", "education", "women")) {
  P[, var] <- yjPower(P[, var], lambda = lambdas[var])
}
summary(P)
#>     prestige        income       education         women     
#>  Min.   :14.8   Min.   :22.2   Min.   : 6.38   Min.   :0.00  
#>  1st Qu.:35.2   1st Qu.:44.2   1st Qu.: 8.45   1st Qu.:1.73  
#>  Median :43.6   Median :50.3   Median :10.54   Median :3.36  
#>  Mean   :46.8   Mean   :50.8   Mean   :10.74   Mean   :3.50  
#>  3rd Qu.:59.3   3rd Qu.:56.2   3rd Qu.:12.65   3rd Qu.:5.59  
#>  Max.   :87.2   Max.   :83.6   Max.   :15.97   Max.   :6.83

scatterplotMatrix(
  ~ prestige + income + education + women,
  data = P,
  smooth = list(spread = FALSE)
)
Scatterplot matrix for the `Prestige` data with the predictors transformed.

Scatterplot matrix for the Prestige data with the predictors transformed.

Comparing the MSE for the regressions with the original and transformed predictors shows a advantage to the latter:

m.pres <- lm(prestige ~ income + education + women, data = Prestige)
m.pres.trans <- lm(prestige ~ income + education + women, data = P)
mse(Prestige$prestige, fitted(m.pres))
#> [1] 59.153
#> attr(,"casewise loss")
#> [1] "(y - yhat)^2"
mse(P$prestige, fitted(m.pres.trans))
#> [1] 50.6
#> attr(,"casewise loss")
#> [1] "(y - yhat)^2"

Similarly, component+residual plots for the two regressions, produced by the crPlots() function in the car package, suggest that the partial relationship of prestige to income is more nearly linear in the transformed data, but the transformation of women fails to capture what appears to be a slight quadratic partial relationship; the partial relationship of prestige to education is close to linear in both regressions:

crPlots(m.pres)
Component+residual plots for the `Prestige` regression with the original predictors.

Component+residual plots for the Prestige regression with the original predictors.

crPlots(m.pres.trans)
Component+residual plots for the `Prestige` regression with transformed predictors.

Component+residual plots for the Prestige regression with transformed predictors.

Having transformed the predictors towards multinormality, we now consider whether there’s evidence for transforming the response (using powerTransform() for Box and Cox’s original method), and we discover that there’s not:

summary(powerTransform(m.pres.trans))
#> bcPower Transformation to Normality 
#>    Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
#> Y1    1.0194           1       0.6773       1.3615
#> 
#> Likelihood ratio test that transformation parameter is equal to 0
#>  (log transformation)
#>                          LRT df     pval
#> LR test, lambda = (0) 32.217  1 1.38e-08
#> 
#> Likelihood ratio test that no transformation is needed
#>                            LRT df  pval
#> LR test, lambda = (1) 0.012384  1 0.911

The selectTrans() function in the cv package automates the process of selecting predictor and response transformations. The function takes a data set and “working” model as arguments, along with the candidate predictors and response for transformation, and the transformation family to employ. If the predictors argument is missing then only the response is transformed, and if the response argument is missing, only the supplied predictors are transformed. The default family for transforming the predictors is "bcPower"—the original Box-Cox family—as is the default family.y for transforming the response; here we specify family="yjPower because of the zeros in women. selectTrans() returns the result of applying a lack-of-fit criterion to the model after the selected transformation is applied, with the default criterion=mse:

selectTrans(
  data = Prestige,
  model = m.pres,
  predictors = c("income", "education", "women"),
  response = "prestige",
  family = "yjPower"
)
#> $criterion
#> [1] 50.6
#> attr(,"casewise loss")
#> [1] "(y - yhat)^2"
#> 
#> $model
#> NULL

selectTrans() also takes an optional indices argument, making it suitable for doing computations on a subset of the data (i.e., a CV fold), and hence for use with cv() (see ?selectTrans for details):

cvs <- cv(
  selectTrans,
  data = Prestige,
  working.model = m.pres,
  seed = 1463,
  predictors = c("income", "education", "women"),
  response = "prestige",
  family = "yjPower"
)
#> R RNG seed set to 1463
summary(cvs)
#> 10-Fold Cross Validation
#> criterion: mse
#> cross-validation criterion = 54.487
#> bias-adjusted cross-validation criterion = 54.308
#> full-sample criterion = 50.6

cv(m.pres, seed = 1463) # untransformed model with same folds
#> R RNG seed set to 1463
#> cross-validation criterion (mse) = 63.293

compareFolds(cvs)
#> CV criterion by folds:
#>  fold.1  fold.2  fold.3  fold.4  fold.5  fold.6  fold.7  fold.8  fold.9 fold.10 
#>  63.453  79.257  20.634  94.569  19.902  55.821  26.555  75.389  55.702  50.215 
#> 
#> Coefficients by folds:
#>         lam.education lam.income lam.women lambda
#> Fold 1          1.000      0.330     0.330      1
#> Fold 2          1.000      0.330     0.169      1
#> Fold 3          1.000      0.330     0.330      1
#> Fold 4          1.000      0.330     0.330      1
#> Fold 5          1.000      0.330     0.000      1
#> Fold 6          1.000      0.330     0.330      1
#> Fold 7          1.000      0.330     0.330      1
#> Fold 8          1.000      0.330     0.000      1
#> Fold 9          1.000      0.330     0.000      1
#> Fold 10         1.000      0.330     0.000      1

The results suggest that the predictive power of the transformed regression is reliably greater than that of the untransformed regression (though in both case, the cross-validated MSE is considerably higher than the MSE computed for the whole data). Examining the selected transformations for each fold reveals that the predictor education and the response prestige are never transformed; that the \(1/3\) power is selected for income in all of the folds; and that the transformation selected for women varies narrowly across the folds between the \(0\)th power (i.e., log) and the \(1/3\) power.

Selecting both transformations and predictors2

As we mentioned, Hastie et al. (2009, sec. 7.10.2: “The Wrong and Right Way to Do Cross-validation”) explain that honest cross-validation has to take account of model specification and selection. Statistical modeling is at least partly a craft, and one could imagine applying that craft to successive partial data sets, each with a fold removed. The resulting procedure would be tedious, though possibly worth the effort, but it would also be difficult to realize in practice: After all, we can hardly erase our memory of statistical modeling choices between analyzing partial data sets.

Alternatively, if we’re able to automate the process of model selection, then we can more realistically apply CV mechanically. That’s what we did in the preceding two sections, first for predictor selection and then for selection of transformations in regression. In this section, we consider the case where we both select variable transformations and then proceed to select predictors. It’s insufficient to apply these steps sequentially, first, for example, using cv() with selectTrans() and then with selectStepAIC(); rather we should apply the whole model-selection procedure with each fold omitted. The selectTransAndStepAIC() function, also supplied by the cv package, does exactly that.

To illustrate this process, we return to the Auto data set:

summary(Auto)
#>       mpg         cylinders     displacement   horsepower        weight    
#>  Min.   : 9.0   Min.   :3.00   Min.   : 68   Min.   : 46.0   Min.   :1613  
#>  1st Qu.:17.0   1st Qu.:4.00   1st Qu.:105   1st Qu.: 75.0   1st Qu.:2225  
#>  Median :22.8   Median :4.00   Median :151   Median : 93.5   Median :2804  
#>  Mean   :23.4   Mean   :5.47   Mean   :194   Mean   :104.5   Mean   :2978  
#>  3rd Qu.:29.0   3rd Qu.:8.00   3rd Qu.:276   3rd Qu.:126.0   3rd Qu.:3615  
#>  Max.   :46.6   Max.   :8.00   Max.   :455   Max.   :230.0   Max.   :5140  
#>                                                                            
#>   acceleration       year        origin                     name    
#>  Min.   : 8.0   Min.   :70   Min.   :1.00   amc matador       :  5  
#>  1st Qu.:13.8   1st Qu.:73   1st Qu.:1.00   ford pinto        :  5  
#>  Median :15.5   Median :76   Median :1.00   toyota corolla    :  5  
#>  Mean   :15.5   Mean   :76   Mean   :1.58   amc gremlin       :  4  
#>  3rd Qu.:17.0   3rd Qu.:79   3rd Qu.:2.00   amc hornet        :  4  
#>  Max.   :24.8   Max.   :82   Max.   :3.00   chevrolet chevette:  4  
#>                                             (Other)           :365
xtabs( ~ year, data = Auto)
#> year
#> 70 71 72 73 74 75 76 77 78 79 80 81 82 
#> 29 27 28 40 26 30 34 28 36 29 27 28 30
xtabs( ~ origin, data = Auto)
#> origin
#>   1   2   3 
#> 245  68  79
xtabs( ~ cylinders, data = Auto)
#> cylinders
#>   3   4   5   6   8 
#>   4 199   3  83 103

We previously used the Auto here in a preliminary example where we employed CV to inform the selection of the order of a polynomial regression of mpg on horsepower. Here, we consider more generally the problem of predicting mpg from the other variables in the Auto data. We begin with a bit of data management, and then examine the pairwise relationships among the numeric variables in the data set:

Auto$cylinders <- factor(Auto$cylinders,
                         labels = c("3.4", "3.4", "5.6", "5.6", "8"))
Auto$year <- as.factor(Auto$year)
Auto$origin <- factor(Auto$origin,
                      labels = c("America", "Europe", "Japan"))
rownames(Auto) <- make.names(Auto$name, unique = TRUE)
Auto$name <- NULL

scatterplotMatrix(
  ~ mpg + displacement + horsepower + weight + acceleration,
  smooth = list(spread = FALSE),
  data = Auto
)
Scatterplot matrix for the numeric variables in the `Auto` data

Scatterplot matrix for the numeric variables in the Auto data

A comment before we proceed: origin is clearly categorical and so converting it to a factor is natural, but we could imagine treating cylinders and year as numeric predictors. There are, however, only 5 distinct values of cylinders (ranging from 3 to 8), but cars with 3 or 5 cylinders are rare. and none of the cars has 7 cylinders. There are similarly only 13 distinct years between 1970 and 1982 in the data, and the relationship between mpg and year is difficult to characterize.3 It’s apparent that most these variables are positively skewed and that many of the pairwise relationships among them are nonlinear.

We begin with a “working model” that specifies linear partial relationships of the response to the numeric predictors:

m.auto <- lm(mpg ~ ., data = Auto)
summary(m.auto)
#> 
#> Call:
#> lm(formula = mpg ~ ., data = Auto)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -9.006 -1.745 -0.092  1.525 10.950 
#> 
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  37.034132   1.969393   18.80  < 2e-16 ***
#> cylinders5.6 -2.602941   0.655200   -3.97  8.5e-05 ***
#> cylinders8   -0.582458   1.171452   -0.50  0.61934    
#> displacement  0.017425   0.006734    2.59  0.01004 *  
#> horsepower   -0.041353   0.013379   -3.09  0.00215 ** 
#> weight       -0.005548   0.000632   -8.77  < 2e-16 ***
#> acceleration  0.061527   0.088313    0.70  0.48643    
#> year71        0.968058   0.837390    1.16  0.24841    
#> year72       -0.601435   0.825115   -0.73  0.46652    
#> year73       -0.687689   0.740272   -0.93  0.35351    
#> year74        1.375576   0.876500    1.57  0.11741    
#> year75        0.929929   0.859072    1.08  0.27974    
#> year76        1.559893   0.822505    1.90  0.05867 .  
#> year77        2.909416   0.841729    3.46  0.00061 ***
#> year78        3.175198   0.798940    3.97  8.5e-05 ***
#> year79        5.019299   0.845759    5.93  6.8e-09 ***
#> year80        9.099763   0.897293   10.14  < 2e-16 ***
#> year81        6.688660   0.885218    7.56  3.3e-13 ***
#> year82        8.071125   0.870668    9.27  < 2e-16 ***
#> originEurope  2.046664   0.517124    3.96  9.1e-05 ***
#> originJapan   2.144887   0.507717    4.22  3.0e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.92 on 371 degrees of freedom
#> Multiple R-squared:  0.867,  Adjusted R-squared:  0.86 
#> F-statistic:  121 on 20 and 371 DF,  p-value: <2e-16

Anova(m.auto)
#> Anova Table (Type II tests)
#> 
#> Response: mpg
#>              Sum Sq  Df F value  Pr(>F)    
#> cylinders       292   2   17.09 7.9e-08 ***
#> displacement     57   1    6.70  0.0100 *  
#> horsepower       82   1    9.55  0.0021 ** 
#> weight          658   1   76.98 < 2e-16 ***
#> acceleration      4   1    0.49  0.4864    
#> year           3017  12   29.40 < 2e-16 ***
#> origin          190   2   11.13 2.0e-05 ***
#> Residuals      3173 371                    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

crPlots(m.auto)