---
title: "Nested Resampling using rsample"
author: "Max Kuhn"
date: "9/1/2017"
output: html_document
---
```{r setup, include=FALSE}
library(knitr)
knitr::opts_chunk$set(echo = TRUE)
```
A typical scheme for splitting the data when developing a predictive model is to create an initial split of the data into a training and test set. If resampling is used, it is executed on the training set. A series of binary splits is created. In `rsample`, we use the term _analysis set_ for the data that are used to fit the model and the _assessment set_ is used to compute performance:
![](diagram.png)
A common method for tuning models is grid search where a candidate set of tuning parameters is created. The full set of models for every combination of the tuning parameter grid and the resamples is created. Each time, the assessment data are used to measure performance and the average value is determined for each tuning parameter.
The potential problem is, once we pick the tuning parameter associated with the best performance, this performance value is usually quoted as the performance of the model. There is serious potential for _optimization bias_ since we uses the same data to tune the model and quote performance. This would result in an optimistic estimate of performance.
Nested resampling does an additional layer of resampling that separates the tuning activities from the process used to estimate the efficacy of the model. An _outer_ resampling scheme is used and, for every split in the outer resample, another full set of resampling splits are created on the original analysis set. For example, if 10-fold cross-validation is used on the outside and 5-fold cross-validation on the inside, a total of 500 models will be fit. The parameter tuning will be conducted 10 times and the best parameters are determined from the average of the 5 assessment sets. This process occurs 10 times.
Once the tuning results are complete, a model is fit to each of the outer resampling splits using the best parameter associated with that resample. The average of the outer method's assessment sets are a unbiased estimate of the model.
To get started, let's load the packages that will be used in this post.
```{r load, warning=FALSE,message=FALSE}
library(rsample)
library(purrr)
library(dplyr)
library(ggplot2)
library(scales)
library(mlbench)
library(kernlab)
library(sessioninfo)
theme_set(theme_bw())
```
We will simulate some regression data to illustrate the methods. The function [`mlbench::mlbench.friedman1`] can simulate a complex regression data structure from the [original MARS publication](https://scholar.google.com/scholar?hl=en&q=%22Multivariate+adaptive+regression+splines%22&btnG=&as_sdt=1%2C7&as_sdtp=). A training set size of 100 data points are generated as well as a large set that will be used to characterize how well the resampling procedure performed.
```{r sim-data}
sim_data <- function(n) {
tmp <- mlbench.friedman1(n, sd=1)
tmp <- cbind(tmp$x, tmp$y)
tmp <- as.data.frame(tmp)
names(tmp)[ncol(tmp)] <- "y"
tmp
}
set.seed(9815)
train_dat <- sim_data(100)
large_dat <- sim_data(10^5)
```
To get started, the types of resampling methods need to be specified. This isn't a large data set, so 5 repeats of 10-fold cross validation will be used as the _outer_ resampling method that will be used to generate the estimate of overall performance. To tune the model, it would be good to have precise estimates for each of the values of the tuning parameter so 25 iterations of the bootstrap will be used. This means that there will eventually be `5 * 10 * 25 = 1250` models that are fit to the data _per tuning parameter_. These will be discarded once the performance of the model has been quantified.
To create the tibble with the resampling specifications:
```{r tibble-gen}
results <- nested_cv(train_dat,
outside = vfold_cv(repeats = 5),
inside = bootstraps(25))
results
```
The splitting information for each resample is contained in the `split` objects. Focusing on the second fold of the first repeat:
```{r split-example}
results$splits[[2]]
```
`<90/10/100>` indicates the number of data in the analysis set, assessment set, and the original data.
Each element of `inner_resamples` has its own tibble with the bootstrapping splits.
```{r inner-splits}
results$inner_resamples[[5]]
```
These are self-contained, meaning that the bootstrap sample is aware that it is a sample of a specific 90% of the data:
```{r inner-boot-split}
results$inner_resamples[[5]]$splits[[1]]
```
To start, we need to define how the model will be created and measured. For our example, a radial basis support vector machine model will be created using the function `kernlab::ksvm`. This model is generally thought of as having _two_ tuning parameters: the SVM cost value and the kernel parameter `sigma`. For illustration, only the cost value will be tuned and the function `kernlab:sigest` will be used to estimate `sigma` during each model fit. This is automatically done by `ksvm`.
After the model is fit to the analysis set, the root-mean squared error (RMSE) is computed on the assessment set. One important note: for this model, it is critical to center and scale the predictors before computing dot products. We don't do this operation here because `mlbench.friedman1` simulates all of the predictors to be standard uniform random variables.
Our function to fit the model and compute the RMSE is:
```{r rmse-func}
# `object` will be an `rsplit` object from our `results` tibble
# `cost` is the tuning parameter
svm_rmse <- function(object, cost = 1) {
y_col <- ncol(object$data)
mod <- ksvm(y ~ ., data = analysis(object), C = cost)
holdout_pred <- predict(mod, assessment(object)[-y_col])
rmse <- sqrt(mean((assessment(object)$y - holdout_pred)^2, na.rm = TRUE))
rmse
}
# In some case, we want to parameterize the function over the tuning parameter:
rmse_wrapper <- function(cost, object) svm_rmse(object, cost)
```
For the nested resampling, a model needs to be fit for each tuning parameter and each bootstrap split. To do this, a wrapper can be created:
```{r inner-tune-func}
# `object` will be an `rsplit` object for the bootstrap samples
tune_over_cost <- function(object) {
results <- tibble(cost = 2^seq(-2, 8, by = 1))
results$RMSE <- map_dbl(results$cost,
rmse_wrapper,
object = object)
results
}
```
Since this will be called across the set of outer cross-validation splits, another wrapper is required:
```{r inner-func}
# `object` is an `rsplit` object in `results$inner_resamples`
summarize_tune_results <- function(object) {
# Return row-bound tibble that has the 25 bootstrap results
map_df(object$splits, tune_over_cost) %>%
# For each value of the tuning parameter, compute the
# average RMSE which is the inner bootstrap estimate.
group_by(cost) %>%
summarize(mean_RMSE = mean(RMSE, na.rm = TRUE),
n = length(RMSE))
}
```
Now that those functions are defined, we can execute all the inner resampling loops:
```{r inner-runs}
tuning_results <- map(results$inner_resamples, summarize_tune_results)
```
`tuning_results` is a list of data frames for each of the 50 outer resamples.
Let's make a plot of the averaged results to see what the relationship is between the RMSE and the tuning parameters for each of the inner bootstrapping operations:
```{r rmse-plot, fig.height=4}
pooled_inner <- tuning_results %>% bind_rows
best_cost <- function(dat) dat[which.min(dat$mean_RMSE),]
p <- ggplot(pooled_inner, aes(x = cost, y = mean_RMSE)) +
scale_x_continuous(trans='log2') +
xlab("SVM Cost") + ylab("Inner RMSE")
for(i in 1:length(tuning_results))
p <- p +
geom_line(data = tuning_results[[i]], alpha = .2) +
geom_point(data = best_cost(tuning_results[[i]]), pch = 16)
p <- p + geom_smooth(data = pooled_inner, se = FALSE)
p
```
Each grey line is a separate bootstrap resampling curve created from a different 90% of the data. The blue line is a loess smooth of all the results pooled together.
To determine the best parameter estimate for each of the outer resampling iterations:
```{r choose, fig.height=4}
cost_vals <- tuning_results %>% map_df(best_cost) %>% select(cost)
results <- bind_cols(results, cost_vals)
ggplot(results, aes(x = factor(cost))) + geom_bar() + xlab("SVM Cost")
```
Most of the resamples produced and optimal cost values of 2.0 but the distribution is right-skewed due to the flat trend in the resampling profile once the cost value becomes 10 or larger.
Now that we have these estimates, we can compute the outer resampling results for each of the `r nrow(results)` splits using the corresponding tuning parameter value:
```{r run-outer}
results$RMSE <- map2_dbl(results$splits, results$cost, svm_rmse)
summary(results$RMSE)
```
The estimated RMSE for the model tuning process is `r round(mean(results$RMSE), 2)`.
What is the RMSE estimate for the non-nested procedure when only the outer resampling method is used? For each cost value in the tuning grid, `r nrow(results)` SVM models are fit and their RMSE values are averaged. The table of cost values and mean RMSE estimates is used to determine the best cost value. The associated RMSE is the biased estimate.
```{r not-nested, fig.height=4}
not_nested <- map(results$splits, tune_over_cost) %>%
bind_rows
outer_summary <- not_nested %>%
group_by(cost) %>%
summarize(outer_RMSE = mean(RMSE),
n = length(RMSE))
outer_summary
ggplot(outer_summary, aes(x = cost, y = outer_RMSE)) +
geom_point() +
geom_line() +
scale_x_continuous(trans='log2') +
xlab("SVM Cost") + ylab("RMSE")
```
The non-nested procedure estimates the RMSE to be `r round(min(outer_summary$outer_RMSE), 2)`. Both estimates are fairly close.
The approximately true RMSE for an SVM model with a cost value of 2.0 and be approximated with the large sample that was simulated at the beginning.
```{r large-sample-estimate}
finalModel <- ksvm(y ~ ., data = train_dat, C = 2)
large_pred <- predict(finalModel, large_dat[, -ncol(large_dat)])
sqrt(mean((large_dat$y - large_pred)^2, na.rm = TRUE))
```
The nested procedure produces a closer estimate to the approximate truth but the non-nested estimate is very similar.
The R markdown document used to create this post can be found here.
The session information is:
```{r session}
session_info()
```