Caret vs. tidymodels - comparing the old and new
In this post I will make a comparison between the most popular (by number of monthly downloads from Github) ML framework available for R to date: caret and its successor packages being written by the same author (Max Kuhn) that are wrapped together in a so called tidymodels framework. Tidymodels
is a collection of different packages such as: rsample, recipes, parsnip, dials and more, that allow running an entire ML project in a tidy format end-to-end.
Many of them are still in a development phase, which will still take a couple good months before they settle down, so I’ll try to keep this post up-to-date over time. Nevertheless, I’ve wanted to take a closer look at what tidymodels
have to offer for a while already, and thought a blogpost would be a great way to demonstrate that.
Update 16.02.2020 - the following parts of this blogpost were updated:
- since the initial write-up of this post many
tidymodels
packages were updated and the following, new packages were released: tune & workflows, which significantly simplified the overall modelling workflow - updated the entire tidymodels implementation using new functions instead of the handler functions I had to write before. Previous post content is kept in its original form at the very end if anyone is interested
- added variable importance for the tidymodel implementation using the
vip
package
In order to write this blog I’ve been reading carefully all individual package websites and this excellent blogpost from Alex Hayes helped me a lot to put things together.
Initial setup
In the beginning, let’s load all the required packages and the credit_data
dataset available from recipes
that we will use for modelling. Note also that I’m setting the random seed to make sampling reproducible, as well as set the furrr plan to multicore
. It’s important unless you want this script to run really long on your machine - we’ll be fitting many different models, so making sure you utilize all your local resources will speed things up a lot.
set.seed(42)
options(max.print = 150)
library(modeldata)
library(tidymodels)
library(tidyverse)
library(caret)
library(magrittr)
library(naniar)
library(furrr)
library(skimr)
library(vip)
library(workflows)
library(tune)
plan(multicore)
data("credit_data")
Data preparation
In this example, I’m building a classification model to distinguish between good and bad loans indicated by column ‘Status’. We have relatively many observations compared to the number of variables available for modelling. Before making any other steps let’s convert all columns to lowercase.
credit_data %<>%
set_names(., tolower(names(.)))
glimpse(credit_data)
## Observations: 4,454
## Variables: 14
## $ status <fct> good, good, bad, good, good, good, good, good, good, b…
## $ seniority <int> 9, 17, 10, 0, 0, 1, 29, 9, 0, 0, 6, 7, 8, 19, 0, 0, 15…
## $ home <fct> rent, rent, owner, rent, rent, owner, owner, parents, …
## $ time <int> 60, 60, 36, 60, 36, 60, 60, 12, 60, 48, 48, 36, 60, 36…
## $ age <int> 30, 58, 46, 24, 26, 36, 44, 27, 32, 41, 34, 29, 30, 37…
## $ marital <fct> married, widow, married, single, single, married, marr…
## $ records <fct> no, no, yes, no, no, no, no, no, no, no, no, no, no, n…
## $ job <fct> freelance, fixed, freelance, fixed, fixed, fixed, fixe…
## $ expenses <int> 73, 48, 90, 63, 46, 75, 75, 35, 90, 90, 60, 60, 75, 75…
## $ income <int> 129, 131, 200, 182, 107, 214, 125, 80, 107, 80, 125, 1…
## $ assets <int> 0, 0, 3000, 2500, 0, 3500, 10000, 0, 15000, 0, 4000, 3…
## $ debt <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2500, 260, 0, 0, 0…
## $ amount <int> 800, 1000, 2000, 900, 310, 650, 1600, 200, 1200, 1200,…
## $ price <int> 846, 1658, 2985, 1325, 910, 1645, 1800, 1093, 1957, 14…
With the help of the excellent naniar package I’m checking the percentage of missing data per each variable. For this particular dataset there are very few missing values so they won’t pose a problem for us during modelling.
credit_data %>% miss_var_summary()
## # A tibble: 14 x 3
## variable n_miss pct_miss
## <chr> <int> <dbl>
## 1 income 381 8.55
## 2 assets 47 1.06
## 3 debt 18 0.404
## 4 home 6 0.135
## 5 job 2 0.0449
## 6 marital 1 0.0225
## 7 status 0 0
## 8 seniority 0 0
## 9 time 0 0
## 10 age 0 0
## 11 records 0 0
## 12 expenses 0 0
## 13 amount 0 0
## 14 price 0 0
Another important step would be to make some basic numerical summaries of the data in order to catch any unusual observations. I will do it using the skimr package. Apart from the fact that many numerical variables show high skewness and some categorical variables have levels with very low frequency, it doesn’t seem that we will have to deal with any special encoded numbers or other problems.
credit_data %>% skim()
## Skim summary statistics
## n obs: 4454
## n variables: 14
##
## ── Variable type:factor ────────────────────────────────────────────────────────
## variable missing complete n n_unique
## home 6 4448 4454 6
## job 2 4452 4454 4
## marital 1 4453 4454 5
## records 0 4454 4454 2
## status 0 4454 4454 2
## top_counts ordered
## own: 2107, ren: 973, par: 783, oth: 319 FALSE
## fix: 2805, fre: 1024, par: 452, oth: 171 FALSE
## mar: 3241, sin: 977, sep: 130, wid: 67 FALSE
## no: 3681, yes: 773, NA: 0 FALSE
## goo: 3200, bad: 1254, NA: 0 FALSE
##
## ── Variable type:integer ───────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75
## age 0 4454 4454 37.08 10.98 18 28 36 45
## amount 0 4454 4454 1038.92 474.55 100 700 1000 1300
## assets 47 4407 4454 5403.98 11574.42 0 0 3000 6000
## debt 18 4436 4454 343.03 1245.99 0 0 0 0
## expenses 0 4454 4454 55.57 19.52 35 35 51 72
## income 381 4073 4454 141.69 80.75 6 90 125 170
## price 0 4454 4454 1462.78 628.13 105 1117.25 1400 1691.5
## seniority 0 4454 4454 7.99 8.17 0 2 5 12
## time 0 4454 4454 46.44 14.66 6 36 48 60
## p100 hist
## 68 ▅▇▇▇▅▃▂▁
## 5000 ▅▇▃▁▁▁▁▁
## 3e+05 ▇▁▁▁▁▁▁▁
## 30000 ▇▁▁▁▁▁▁▁
## 180 ▇▃▃▁▁▁▁▁
## 959 ▇▆▁▁▁▁▁▁
## 11140 ▇▆▁▁▁▁▁▁
## 48 ▇▃▂▁▁▁▁▁
## 72 ▁▁▂▃▁▃▇▁
Another point we need to keep in mind when dealing with credit scoring problems is something called a target class imbalance, but in this particular case it’s not that severe. For the sake of comparing programming frameworks and not implementing the best ML model I will ignore it.
table(credit_data$status)
##
## bad good
## 1254 3200
round(prop.table(table(credit_data$status)), 2)
##
## bad good
## 0.28 0.72
Data preparation
Let’s finally move on and start modelling! In the beginning I’ll start with dividing our dataset into training and testing sets with the help of the rsample
package. Let’s set an initial, stratified split where 80% of the data is dedicated to training and the rest to evaluating both models.
Furthermore, I’m creating cross-validation splits from the testing data of 5 folds. For compatibility with caret
I’m using the rsample2caret
function to make use of the same splits in both frameworks - otherwise both solutions wouldn’t be 100% comparable.
split <- initial_split(credit_data, prop = 0.80, strata = "status")
df_train <- training(split)
df_test <- testing(split)
train_cv <- vfold_cv(df_train, v = 5, strata = "status")
train_cv_caret <- rsample2caret(train_cv)
# write_rds(split, "split.rds")
# write_rds(train_cv, "train_cv.rds")
I would like to fit a Random Forest model for which I will specify a simple recipe
. In principle, tree-based models require very little preprocessing, and in this particular example I mainly focus on imputting missing data or assigning them a new categorical level, infrequent/ unobserved values and hot-encoding them. The same recipe will be used for both: caret
and tidymodels
model.
Normally I would do much more feature engineering, try to assess potential interactions etc., but I will write a separate post dedicated for that to so see how much further we can improve the model!
recipe <- df_train %>%
recipe(status ~ .) %>%
# Imputation: assigning NAs to a new level for categorical and median imputation for numeric
step_unknown(all_nominal(), -status) %>%
step_medianimpute(all_numeric()) %>%
# Combining infrequent categorical levels and introducing a new level for prediction time
step_other(all_nominal(), -status, other = "infrequent_combined") %>%
step_novel(all_nominal(), -status, new_level = "unrecorded_observation") %>%
# Hot-encoding categorical variables
step_dummy(all_nominal(), -status, one_hot = TRUE)
Let’s take a quick look at the output of the recipe:
(recipe_preped <- prep(recipe, retain = TRUE))
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 13
##
## Training data contained 3565 data points and 335 incomplete rows.
##
## Operations:
##
## Unknown factor level assignment for home, marital, records, job [trained]
## Median Imputation for seniority, time, age, expenses, ... [trained]
## Collapsing factor levels for home, marital, records, job [trained]
## Novel factor level assignment for home, marital, records, job [trained]
## Dummy variables from home, marital, records, job [trained]
tidy(recipe_preped)
## # A tibble: 5 x 6
## number operation type trained skip id
## <int> <chr> <chr> <lgl> <lgl> <chr>
## 1 1 step unknown TRUE FALSE unknown_XU845
## 2 2 step medianimpute TRUE FALSE medianimpute_tS05C
## 3 3 step other TRUE FALSE other_FHtHk
## 4 4 step novel TRUE FALSE novel_ncrtZ
## 5 5 step dummy TRUE FALSE dummy_VMDKC
Fitting our models
Caret
In the code below I’m setting control parameters for the caret
model fit, as well as the grid of hyperparameters that will be assessed in order to pick the best performing combination. Note that I’m using the very original observation indexes for cross-validation to ensure reproducability. The trainControl
function will also ensure that final hold-out predictions from cross-validation will be persisted for further assessment thanks to savePredictions = "final"
.
We have 5 different CV folds and 30 grid combinations to assess, which results in 150 models that will be fit and each comprising of 500 individual trees! All models will be assessed based on the prSummary
function which is know as the AUC.
control_caret <- trainControl(
method = "cv",
verboseIter = FALSE,
classProbs = TRUE,
summaryFunction = twoClassSummary,
returnResamp = "final",
savePredictions = "final",
index = train_cv_caret$index,
indexOut = train_cv_caret$indexOut,
)
(grid_caret <- expand.grid(
mtry = seq(1, ncol(df_train) - 1, 3),
splitrule = c("extratrees", "gini"),
min.node.size = c(1, 3, 5)
))
## mtry splitrule min.node.size
## 1 1 extratrees 1
## 2 4 extratrees 1
## 3 7 extratrees 1
## 4 10 extratrees 1
## 5 13 extratrees 1
## 6 1 gini 1
## 7 4 gini 1
## 8 7 gini 1
## 9 10 gini 1
## 10 13 gini 1
## 11 1 extratrees 3
## 12 4 extratrees 3
## 13 7 extratrees 3
## 14 10 extratrees 3
## 15 13 extratrees 3
## 16 1 gini 3
## 17 4 gini 3
## 18 7 gini 3
## 19 10 gini 3
## 20 13 gini 3
## 21 1 extratrees 5
## 22 4 extratrees 5
## 23 7 extratrees 5
## 24 10 extratrees 5
## 25 13 extratrees 5
## 26 1 gini 5
## 27 4 gini 5
## 28 7 gini 5
## 29 10 gini 5
## 30 13 gini 5
The great advantage of caret
is that it wraps a lot of small code pieces in just one, high-level API call that does all the job for you - fits all individual models across CV folds and resamples, selects the best one and fits it already on the entire training dataset. It also makes sure it’s done as fast as possible thanks to parallel processing whenever it’s an enabled option.
The drawback on the other hand is that it’s quite monolythic, untidy and at the end doesn’t offer a great deal of granularity to the end user.
model_caret <- train(
status ~ .,
data = juice(recipe_preped),
method = "ranger",
metric = "ROC",
trControl = control_caret,
tuneGrid = grid_caret,
importance = "impurity",
num.trees = 500
)
print(model_caret)
## Random Forest
##
## 3565 samples
## 29 predictor
## 2 classes: 'bad', 'good'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2851, 2852, 2852, 2852, 2853
## Resampling results across tuning parameters:
##
## mtry splitrule min.node.size ROC Sens Spec
## 1 extratrees 1 0.7901464 0.0000000000 1.0000000
## 1 extratrees 3 0.7902729 0.0000000000 1.0000000
## 1 extratrees 5 0.7897509 0.0000000000 1.0000000
## 1 gini 1 0.8222071 0.0009950249 1.0000000
## 1 gini 3 0.8203006 0.0000000000 1.0000000
## 1 gini 5 0.8194996 0.0000000000 1.0000000
## 4 extratrees 1 0.8066284 0.4492786070 0.9191734
## 4 extratrees 3 0.8077006 0.4422786070 0.9226852
## 4 extratrees 5 0.8064328 0.4442587065 0.9226844
## 4 gini 1 0.8326194 0.4671791045 0.9238563
## 4 gini 3 0.8336397 0.4602089552 0.9254188
## 4 gini 5 0.8351950 0.4711691542 0.9281509
## 7 extratrees 1 0.8089259 0.4841144279 0.9086280
## 7 extratrees 3 0.8101340 0.4831144279 0.9094100
## 7 extratrees 5 0.8108062 0.4791343284 0.9117515
## 7 gini 1 0.8325371 0.4960696517 0.9125335
## 7 gini 3 0.8334290 0.4910696517 0.9160491
## 7 gini 5 0.8321708 0.4990447761 0.9140975
## 10 extratrees 1 0.8088898 0.4880945274 0.9039405
## 10 extratrees 3 0.8097806 0.4821044776 0.9074569
## 10 extratrees 5 0.8116245 0.4821194030 0.9066787
## 10 gini 1 0.8304535 0.5050248756 0.9074622
## 10 gini 3 0.8318021 0.5040348259 0.9098022
## 10 gini 5 0.8326123 0.5070248756 0.9121429
## 13 extratrees 1 0.8081539 0.4880845771 0.9031615
## [ reached getOption("max.print") -- omitted 5 rows ]
##
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 4, splitrule = gini
## and min.node.size = 5.
Caret
also comes with built-in handy functions for assessing model’s individual predictors strength. By setting the importance = "impurity"
in the ranger
engine we ensure that variable importance will be returned by the final train object. As of now there is no such possibility directly within the tidymodels
ecosystem, but this can be solved using another great package called vip
.
# Accessing most predictive attributes from caret
varImp(model_caret, scale = TRUE)$importance %>%
rownames_to_column() %>%
arrange(-Overall)
## rowname Overall
## 1 income 100.000000
## 2 seniority 97.871716
## 3 amount 85.421899
## 4 price 79.249485
## 5 age 66.333280
## 6 assets 65.518786
## 7 records_yes 50.558920
## 8 records_no 50.378803
## 9 expenses 49.052474
## 10 time 34.953502
## 11 job_partime 29.341679
## 12 job_fixed 27.799490
## 13 debt 23.744981
## 14 home_owner 19.524707
## 15 home_rent 12.910345
## 16 job_freelance 11.051452
## 17 home_other 10.008306
## 18 home_parents 9.289318
## 19 marital_married 8.808904
## 20 marital_single 7.715428
## 21 marital_infrequent_combined 6.194835
## 22 home_priv 4.499495
## 23 job_infrequent_combined 3.441151
## 24 home_infrequent_combined 1.463759
## 25 home_unrecorded_observation 0.000000
## 26 marital_unrecorded_observation 0.000000
## 27 records_infrequent_combined 0.000000
## 28 records_unrecorded_observation 0.000000
## 29 job_unrecorded_observation 0.000000
Final cross-validated and test results are easily available with just a couple lines of code. Note that cross-validation performance is aggregated per each index (observation) and averaged out before the final performance metric is calculated.
Getting the test performance is a matter of baking
the test set with the already prepped recipe and then making the prediction using the train object. 83.1% AUC for cross-validated training performance and 82.1% for testing - not a bad result for so little preprocessing! Close results also suggest that our model is likely to generalize well to new samples.
df_train_pred_caret <- model_caret$pred %>%
group_by(rowIndex) %>%
summarise(bad = mean(bad)) %>%
transmute(estimate = bad) %>%
add_column(truth = df_train$status)
# Cross-validated training performance
percent(roc_auc(df_train_pred_caret, truth, estimate)$.estimate)
## [1] "83.5%"
df_test_pred_caret <- predict(
model_caret,
newdata = bake(recipe_preped, df_test),
type = "prob") %>%
as_tibble() %>%
transmute(estimate = bad) %>%
add_column(truth = df_test$status)
# Test performance
percent(roc_auc(df_test_pred_caret, truth, estimate)$.estimate)
## [1] "81.3%"
Tidymodels
First, let’s use parsnip
to define our ‘modelling engine’ - just like before we’re setting it as a classification problem, using Random Forest running on the ranger
engine. On top of that I’m using dials
to define a grid of parameters to optimize. dials
provides a set of handy functions, such as: grid_random
or grid_regular
, that let you choose the range of parameters in a very flexible way. Lastly, I will use tune
and workflows
to optimize parameters, build the overal modelling workflow and finalize it with the best parameter values.
From what I can see the parameters that could be optimized slightly differ between both frameworks: caret
allows for tunning the ‘min.node.size’ while keeping the ‘trees’ constant, while parsnip
allows for tuning ‘trees’ while keeping ‘min.node.size’ constant (I assume it’s using the default ranger
values). Nevertheless, the total amount of combinations is same in both cases and equal to 30.
# Specifying the modelling engine
(engine_tidym <- rand_forest(
mode = "classification",
mtry = tune(),
trees = tune(),
min_n = tune()
) %>%
set_engine("ranger", importance = "impurity")) # you can provide additional, engine specific arguments to '...'
## Random Forest Model Specification (classification)
##
## Main Arguments:
## mtry = tune()
## trees = tune()
## min_n = tune()
##
## Engine-Specific Arguments:
## importance = impurity
##
## Computational engine: ranger
# Specifying the grid of hyperparameters that should be tested
(gridy_tidym <- grid_random(
mtry() %>% range_set(c(1, 20)),
trees() %>% range_set(c(500, 1000)),
min_n() %>% range_set(c(2, 10)),
size = 30
))
## # A tibble: 30 x 3
## mtry trees min_n
## <int> <int> <int>
## 1 9 525 4
## 2 7 970 7
## 3 20 835 4
## 4 1 567 5
## 5 15 813 8
## 6 9 589 4
## 7 14 821 8
## 8 11 763 6
## 9 1 500 3
## 10 19 861 7
## # … with 20 more rows
Then we can combine the model recipe we specified before with the modelling engine to form a so called workflow
. A workflow
puts together all pieces of the overall modelling pipeline, which makes it easier to manipulate and control them.
wkfl_tidym <- workflow() %>%
add_recipe(recipe) %>%
add_model(engine_tidym)
Next, I combine the grid of parameters and workflow together for tuning to find the best performing combination of hyperparameters.
grid_tidym <- tune_grid(
wkfl_tidym,
resamples = train_cv,
grid = gridy_tidym,
metrics = metric_set(roc_auc),
control = control_grid(save_pred = TRUE)
)
print(grid_tidym)
## # 5-fold cross-validation using stratification
## # A tibble: 5 x 5
## splits id .metrics .notes .predictions
## * <list> <chr> <list> <list> <list>
## 1 <split [2.9K/71… Fold1 <tibble [30 × … <tibble [30 ×… <tibble [21,420 × …
## 2 <split [2.9K/71… Fold2 <tibble [30 × … <tibble [0 × … <tibble [21,390 × …
## 3 <split [2.9K/71… Fold3 <tibble [30 × … <tibble [30 ×… <tibble [21,390 × …
## 4 <split [2.9K/71… Fold4 <tibble [30 × … <tibble [30 ×… <tibble [21,390 × …
## 5 <split [2.9K/71… Fold5 <tibble [30 × … <tibble [30 ×… <tibble [21,360 × …
You can aggregate the performance metrics for each parameter combination across all cross-validation folds to find the best performing set, which I will use in the final model.
collect_metrics(grid_tidym)
## # A tibble: 30 x 8
## mtry trees min_n .metric .estimator mean n std_err
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 1 500 3 roc_auc binary 0.823 5 0.0101
## 2 1 567 5 roc_auc binary 0.820 5 0.0111
## 3 1 568 5 roc_auc binary 0.820 5 0.0109
## 4 1 644 2 roc_auc binary 0.822 5 0.0114
## 5 1 676 2 roc_auc binary 0.821 5 0.0102
## 6 2 591 4 roc_auc binary 0.831 5 0.0118
## 7 6 592 7 roc_auc binary 0.834 5 0.0109
## 8 7 970 7 roc_auc binary 0.834 5 0.0104
## 9 9 525 4 roc_auc binary 0.832 5 0.0102
## 10 9 589 4 roc_auc binary 0.832 5 0.0106
## # … with 20 more rows
We can propagate the best combination of parameters into the workflow by using the finalize_workflow
function. At this point, we’re almost finished with finalizing our pipeline. The last step involves refitting that workflow on the entire training data.
grid_tidym_best <- select_best(grid_tidym)
(wkfl_tidym_best <- finalize_workflow(wkfl_tidym, grid_tidym_best))
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 5 Recipe Steps
##
## ● step_unknown()
## ● step_medianimpute()
## ● step_other()
## ● step_novel()
## ● step_dummy()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Random Forest Model Specification (classification)
##
## Main Arguments:
## mtry = 7
## trees = 970
## min_n = 7
##
## Engine-Specific Arguments:
## importance = impurity
##
## Computational engine: ranger
This can be easily achieved using the last_fit
function which fits the finalized workflow on the entire training data and at the same time provides test data performance metrics. This makes the workflow object complete and provides the data scientist with comprehensive insights into overall model performnce, as well as a fully operational model pipeline that can be deployed to production.
(wkfl_tidym_final <- last_fit(wkfl_tidym_best, split = split))
## ! Resample1: model (predictions): Novel levels found in column 'home': NA. The l...
## # # Monte Carlo cross-validation (0.8/0.2) with 1 resamples
## # A tibble: 1 x 6
## splits id .metrics .notes .predictions .workflow
## * <list> <chr> <list> <list> <list> <list>
## 1 <split [3.6… train/test… <tibble [2 … <tibble [… <tibble [889 … <workflo…
We can then easily check both cross-validated training performance, as well as test set performance with just two lines of code.
# Cross-validated training performance
percent(show_best(grid_tidym, n = 1)$mean)
## [1] "83.4%"
# Test performance
percent(wkfl_tidym_final$.metrics[[1]]$.estimate[[2]])
## [1] "81.2%"
At the moment there’s no package in the tidymodels
universe for calculating model importance metrics (I assume that will change at some point), but this can be achieved either with the vip or DALEX package. It would be fantastic if either of these packages seemlessly worked with tidymodels
objects!
In this case I decided to use model-specific metrics (by passing importance = ‘impurity’ to the engine before) and then simply used the vip
function on the model
object extracted from the workflow. However, you might need to change this for every specific model type that you decide to use. Additionally, at this link you can find how to achieve the same using DALEX
.
vip(pull_workflow_fit(wkfl_tidym_final$.workflow[[1]]))$data
## # A tibble: 10 x 2
## Variable Importance
## <chr> <dbl>
## 1 income 150.
## 2 seniority 139.
## 3 amount 128.
## 4 price 123.
## 5 age 99.5
## 6 assets 89.8
## 7 expenses 71.5
## 8 records_no 63.4
## 9 records_yes 61.0
## 10 time 47.3
Wrapping up
I’ve fit a credit scoring classification Random Forest model using both caret
and tidymodels
frameworks. I need to admit that before I started writing this post I expected a lot more additional code to be written in the tidymodels
framework to achieve the same goal, but to my surprise those packages already offer a very concise (and tidy!) way of doing ML in R, and things will be even more streamlined in the upcoming months. That’s definitely a really big step-up for the entire R community when it comes to doing ML in R.
Future considerations
I still haven’t fully explored the
tidyposterior
andprobably
packages - I will do that in one of me next posts.On the `
rsample
page there’s an interesting article listed on so called: nested resampling. I’ve never used it in practice but I’m curious to check it out and compare my model’s current cross-validated performance estimate with the one obtained through nested resampling.There’s also a lot of buzz in the R community regarding a BETA release of the successor of the
mlr
package (second most popular ML framework in R) - mlr3.mlr3
could be very strong competition to thetidymodels
framework, and since I’ve never really usedmlr
it’s an excellent opportunity to put it to a test. It is also modular in design liketidymodels
, but is built on top of data.table and uses R6 object-oriented class system which could give it substantial speed advantage overtidymodels
at the expenses of ‘tidyness’.
Previous tidymodels
workflow
In case you’re inteterested you can find the original content of this blogpost below. Please note that sections below are not evaluated to avoid potential errors when renderring this blogpost due to deprecations/ changes.
In the beginning, when I saw some of the very first articles about doing ML the tidy way by combining recipes
and rsample
my thoughts were that it was all way too complicated compared to what caret
offered. I was very surprised now when I discovered how clean and simple it became over the last year, and apparently things will be further simplified over the next months (link)!
First let’s define two helper functions that will be used later during the modelling process. I imagine these might be wrapped into predefined helper functions in tidymodels packages instead of having to do that every time.
# Defining helper functions that will be used later on
fit_on_fold <- function(spec, prepped) {
x <- juice(prepped, all_predictors())
y <- juice(prepped, all_outcomes())
fit_xy(spec, x, y)
}
predict_helper <- function(split, recipe, fit) {
new_x <- bake(recipe, new_data = assessment(split), all_predictors())
predict(fit, new_x, type = "prob") %>%
bind_cols(assessment(split) %>% select(status))
}
First, let’s use parsnip
to define our ‘modelling engine’ - just like before we’re setting it as a classification problem, using Random Forest running on the ranger
engine. On top of that I’m using dials
to define a grid of parameters to optimize. dials
provides a set of handy functions, such as: grid_random
or grid_regular
, that let you choose the range of parameters in a very flexible way.
From what I can see the parameters that could be optimized slightly differ between both frameworks: caret
allows for tunning the ‘min.node.size’ while keeping the ‘trees’ constant, while parsnip
allows for tuning ‘trees’ while keeping ‘min.node.size’ constant (I assume it’s using the default ranger
values). Nevertheless, the total amount of combinations is same in both cases and equal to 30.
# Specifying the modelling engine
(engine_tidym <- rand_forest(mode = "classification") %>%
set_engine("ranger"))
# Specifying the grid of hyperparameters that should be tested
(gridy_tidym <- grid_random(
mtry %>% range_set(c( 1, 20)),
trees %>% range_set(c( 500, 1000)),
min_n %>% range_set(c(2, 10)),
size = 30
))
Now comes the really interesting part of tidymodels: we’re using a merge helper function from dials to bind our predefined ‘modelling engine’ with all grid combinations of the hyperparameters to tune.
merge(engine_tidym, gridy_tidym)[1:3] # just to see the top 3
Subsequently, I’m putting it into a tidy data frame structure where each model-parameters combination is bound together and assigned a model id that will be used later to make a distinction between consequtive fits.
# Merging all possibilities with our cross-validated data frame
(spec_tidym <- tibble(spec = merge(engine_tidym, gridy_tidym)) %>%
mutate(model_id = row_number()))
Lastly, I’m adding the last component into this tidy structure: all cross-validation splits that were specified before with the use of the crossing
function. This part is very likely to evolve and be simplified in the upcoming months. Now we’re all set to start the actual tidy-modelling!
(spec_tidym <- crossing(train_cv, spec_tidym))
To speed thigs up let’s use the furrr
package and fit many models simultaneously. In the following code our original recipe is first prepped on each split’s training set and than it’s used by the fit_on_fold
helper function to fit a given model-parameter combination.
# Fitting each model-fold pair
fits_tidym <- spec_tidym %>%
mutate(
prepped = future_map(splits, prepper, recipe),
fit = future_map2(spec, prepped, fit_on_fold)
)
The last step of modelling involves usage of the other predict_helper
function that bakes the already prepped split’s recipe and applies it on the testing set of the split, in order to make a prediction of the given model-parameters combination.
# Making predictions of each fitted model on the testing set
fits_pred_tidym <- fits_tidym %>%
mutate(
preds = future_pmap(list(splits, prepped, fit), predict_helper)
)
# Top row of the entire structure as example
fits_pred_tidym[1, ]
After training is done I would like to assess which model performs the best based on cross-validated hold-out performance. In order to do that, let’s calculate the AUC of all test sets across all model-parameter combinations. By averaging the results up, I can see the entire performance profile of all possibilities.
Tidymodels
includes also two very handy packages: probably and tidyposterior, which are very usefull for analysing model estimated probabilities and it’s resampled performance profile. I will make an introduction to those packages in one of my next posts.
# Assessing individual model-fold performance and averaging performance across all folds for each model
(perf_summary_tidym <- fits_pred_tidym %>%
unnest(preds) %>%
group_by(id, model_id) %>%
roc_auc(truth = status, .pred_bad) %>%
group_by(model_id, .metric, .estimator) %>%
summarize(mean = mean(.estimate, na.rm = TRUE))) %>%
arrange(-mean)
Just by sorting the previous results we can easly see what is the best performing model. Let’s now take a step back and filter only that model specification, and fit it on the entire training set. As of now I’m not 100% sure what the recommended and most efficient way of doing that would be, but I decided to go for something like that:
# Selecting the best model with:
# perf_summary_tidym$model_id[which.max(perf_summary_tidym$mean)]
# Fitting the best model on the full training set
(model_tidym <- tibble(spec = merge(engine_tidym, gridy_tidym)) %>%
mutate(model_id = row_number()) %>%
filter(model_id == perf_summary_tidym$model_id[which.max(perf_summary_tidym$mean)]) %>%
pull(spec) %>%
.[[1]] %>%
fit(status ~ ., juice(recipe_preped)))
Similarly like before with caret
, I can now summarize our cross-validated and test performances.
# Cross-validated training performance"
percent(perf_summary_tidym$mean[which.max(perf_summary_tidym$mean)])
# Test performance
df_train_pred_tidym <- predict(
model_tidym,
new_data = bake(recipe_preped, df_test),
type = "prob"
) %>%
transmute(estimate = .pred_bad) %>%
add_column(truth = df_test$status)
percent(roc_auc(df_train_pred_tidym, truth, estimate)$.estimate)
The entire tidymodels
code that was scattered across above sections could be easily squeezed in one longer pipeline. Note that I limited the grid to just one row gridy_tidym[1, ]
in order to demonstrate the solution and save on processing time.
df_tidym <- tibble(spec = merge(engine_tidym, gridy_tidym[1:2, ])) %>%
mutate(model_id = row_number()) %>%
crossing(train_cv, .) %>%
mutate(
prepped = future_map(splits, prepper, recipe),
fit = future_map2(spec, prepped, fit_on_fold),
preds = future_pmap(list(splits, prepped, fit), predict_helper)
)