In this post I will make a comparison between the most popular (by number of monthly downloads from Github) ML framework available for R to date: caret and its successor packages being written by the same author (Max Kuhn) that are wrapped together in a so called tidymodels framework. Tidymodels is a collection of different packages such as: rsample, recipes, parsnip, dials and more, that allow running an entire ML project in a tidy format end-to-end.

Many of them are still in a development phase, which will still take a couple good months before they settle down, so I’ll try to keep this post up-to-date over time. Nevertheless, I’ve wanted to take a closer look at what tidymodels have to offer for a while already, and thought a blogpost would be a great way to demonstrate that.

In order to write this blog I’ve been reading carefully all individual package websites and this excellent blogpost from Alex Hayes helped me a lot to put things together.

Initial setup

In the beginning, let’s load all the required packages and the credit_data dataset available from recipes that we will use for modelling. Note also that I’m setting the random seed to make sampling reproducible, as well as set the furrr plan to multicore. It’s important unless you want this script to run really long on your machine - we’ll be fitting many different models, so making sure you utilize all your local resources will speed things up a lot.

set.seed(42)
options(max.print = 150)

library(tidymodels)
library(tidyverse)
library(caret)
library(magrittr)
library(naniar)
library(furrr)
library(skimr)

plan(multicore)  
data("credit_data")

Data preparation

In this example, I’m building a classification model to distinguish between good and bad loans indicated by column ‘Status’. We have relatively many observations compared to the number of variables available for modelling. Before making any other steps let’s convert all columns to lowercase.

glimpse(credit_data)
## Observations: 4,454
## Variables: 14
## $ Status    <fct> good, good, bad, good, good, good, good, good, good, b…
## $ Seniority <int> 9, 17, 10, 0, 0, 1, 29, 9, 0, 0, 6, 7, 8, 19, 0, 0, 15…
## $ Home      <fct> rent, rent, owner, rent, rent, owner, owner, parents, …
## $ Time      <int> 60, 60, 36, 60, 36, 60, 60, 12, 60, 48, 48, 36, 60, 36…
## $ Age       <int> 30, 58, 46, 24, 26, 36, 44, 27, 32, 41, 34, 29, 30, 37…
## $ Marital   <fct> married, widow, married, single, single, married, marr…
## $ Records   <fct> no, no, yes, no, no, no, no, no, no, no, no, no, no, n…
## $ Job       <fct> freelance, fixed, freelance, fixed, fixed, fixed, fixe…
## $ Expenses  <int> 73, 48, 90, 63, 46, 75, 75, 35, 90, 90, 60, 60, 75, 75…
## $ Income    <int> 129, 131, 200, 182, 107, 214, 125, 80, 107, 80, 125, 1…
## $ Assets    <int> 0, 0, 3000, 2500, 0, 3500, 10000, 0, 15000, 0, 4000, 3…
## $ Debt      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2500, 260, 0, 0, 0…
## $ Amount    <int> 800, 1000, 2000, 900, 310, 650, 1600, 200, 1200, 1200,…
## $ Price     <int> 846, 1658, 2985, 1325, 910, 1645, 1800, 1093, 1957, 14…
credit_data %<>%
  set_names(., tolower(names(.)))

With the help of the excellent naniar package I’m checking the percentage of missing data per each variable. For this particular dataset there are very few missing values so they won’t pose a problem for us during modelling.

credit_data %>% miss_var_summary()
## # A tibble: 14 x 3
##    variable  n_miss pct_miss
##    <chr>      <int>    <dbl>
##  1 income       381   8.55  
##  2 assets        47   1.06  
##  3 debt          18   0.404 
##  4 home           6   0.135 
##  5 job            2   0.0449
##  6 marital        1   0.0225
##  7 status         0   0     
##  8 seniority      0   0     
##  9 time           0   0     
## 10 age            0   0     
## 11 records        0   0     
## 12 expenses       0   0     
## 13 amount         0   0     
## 14 price          0   0

Another important step would be to make some basic numerical summaries of the data in order to catch any unusual observations. I will do it using the skimr package. Apart from the fact that many numerical variables show high skewness and some categorical variables have levels with very low frequency, it doesn’t seem that we will have to deal with any special encoded numbers or other problems.

credit_data %>% skim()
## Skim summary statistics
##  n obs: 4454 
##  n variables: 14 
## 
## ── Variable type:factor ───────────────────────────────────────────────────────────────────────────────────
##  variable missing complete    n n_unique
##      home       6     4448 4454        6
##       job       2     4452 4454        4
##   marital       1     4453 4454        5
##   records       0     4454 4454        2
##    status       0     4454 4454        2
##                                top_counts ordered
##   own: 2107, ren: 973, par: 783, oth: 319   FALSE
##  fix: 2805, fre: 1024, par: 452, oth: 171   FALSE
##    mar: 3241, sin: 977, sep: 130, wid: 67   FALSE
##                 no: 3681, yes: 773, NA: 0   FALSE
##               goo: 3200, bad: 1254, NA: 0   FALSE
## 
## ── Variable type:integer ──────────────────────────────────────────────────────────────────────────────────
##   variable missing complete    n    mean       sd  p0     p25  p50    p75
##        age       0     4454 4454   37.08    10.98  18   28      36   45  
##     amount       0     4454 4454 1038.92   474.55 100  700    1000 1300  
##     assets      47     4407 4454 5403.98 11574.42   0    0    3000 6000  
##       debt      18     4436 4454  343.03  1245.99   0    0       0    0  
##   expenses       0     4454 4454   55.57    19.52  35   35      51   72  
##     income     381     4073 4454  141.69    80.75   6   90     125  170  
##      price       0     4454 4454 1462.78   628.13 105 1117.25 1400 1691.5
##  seniority       0     4454 4454    7.99     8.17   0    2       5   12  
##       time       0     4454 4454   46.44    14.66   6   36      48   60  
##   p100     hist
##     68 ▅▇▇▇▅▃▂▁
##   5000 ▅▇▃▁▁▁▁▁
##  3e+05 ▇▁▁▁▁▁▁▁
##  30000 ▇▁▁▁▁▁▁▁
##    180 ▇▃▃▁▁▁▁▁
##    959 ▇▆▁▁▁▁▁▁
##  11140 ▇▆▁▁▁▁▁▁
##     48 ▇▃▂▁▁▁▁▁
##     72 ▁▁▂▃▁▃▇▁

Another point we need to keep in mind when dealing with credit scoring problems is something called a target class imbalance, but in this particular case it’s not that severe. For the sake of comparing programming frameworks and not implementing the best ML model I will ignore it.

table(credit_data$status)
## 
##  bad good 
## 1254 3200
round(prop.table(table(credit_data$status)), 2)
## 
##  bad good 
## 0.28 0.72

Data preparation

Let’s finally move on and start modelling! In the beginning I’ll start with dividing our dataset into training and testing sets with the help of the rsample package. Let’s set an initial, stratified split where 80% of the data is dedicated to training and the rest to evaluating both models.

Furthermore, I’m creating cross-validation splits from the testing data of 5 folds. For compatibility with caret I’m using the rsample2caret function to make use of the same splits in both frameworks - otherwise both solutions wouldn’t be 100% comparable.

split <- initial_split(credit_data, prop = 0.80, strata = "status")

df_train <- training(split)
df_test  <- testing(split)

train_cv <- vfold_cv(df_train, v = 5, strata = "status")
train_cv_caret <- rsample2caret(train_cv)

# write_rds(split, "split.rds")
# write_rds(train_cv, "train_cv.rds")

I would like to fit a Random Forest model for which I will specify a simple recipe. In principle, tree-based models require very little preprocessing, and in this particular example I mainly focus on imputting missing data or assigning them a new categorical level, infrequent/ unobserved values and hot-encoding them. The same recipe will be used for both: caret and tidymodels model.

Normally I would do much more feature engineering, try to assess potential interactions etc., but I will write a separate post dedicated for that to so see how much further we can improve the model!

recipe <- df_train %>%
  recipe(status ~ .) %>%

  # Imputation: assigning NAs to a new level for categorical and median imputation for numeric
  step_unknown(all_nominal(), -status) %>% 
  step_medianimpute(all_numeric()) %>%

  # Combining infrequent categorical levels and introducing a new level for prediction time
  step_other(all_nominal(), -status, other = "infrequent_combined") %>%
  step_novel(all_nominal(), -status, new_level = "unrecorded_observation") %>%

  # Hot-encoding categorical variables
  step_dummy(all_nominal(), -status, one_hot = TRUE) %>%

  # Taking care of output consistency
  check_missing(all_predictors())

Let’s take a quick look at the output of the recipe:

(recipe_preped <- prep(recipe, retain = TRUE))
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         13
## 
## Training data contained 3565 data points and 333 incomplete rows. 
## 
## Operations:
## 
## Unknown factor level assignment for home, marital, records, job [trained]
## Median Imputation for seniority, time, age, expenses, ... [trained]
## Collapsing factor levels for home, marital, records, job [trained]
## Novel factor level assignment for home, marital, records, job [trained]
## Dummy variables from home, marital, records, job [trained]
## Check missing values for seniority, time, age, expenses, ... [trained]
tidy(recipe_preped)
## # A tibble: 6 x 6
##   number operation type         trained skip  id                
##    <int> <chr>     <chr>        <lgl>   <lgl> <chr>             
## 1      1 step      unknown      TRUE    FALSE unknown_5CFHt     
## 2      2 step      medianimpute TRUE    FALSE medianimpute_Hkncr
## 3      3 step      other        TRUE    FALSE other_tZVMD       
## 4      4 step      novel        TRUE    FALSE novel_KC08V       
## 5      5 step      dummy        TRUE    FALSE dummy_05QZl       
## 6      6 check     missing      TRUE    FALSE missing_pfTs9

Fitting our models

Caret

In the code below I’m setting control parameters for the caret model fit, as well as the grid of hyperparameters that will be assessed in order to pick the best performing combination. Note that I’m using the very original observation indexes for cross-validation to ensure reproducability. The trainControl function will also ensure that final hold-out predictions from cross-validation will be persisted for further assessment thanks to savePredictions = "final".

We have 5 different CV folds and 30 grid combinations to assess, which results in 150 models that will be fit and each comprising of 500 individual trees! All models will be assessed based on the prSummary function which is know as the AUC.

control_caret <- trainControl(
  method = "cv",
  verboseIter = FALSE,
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  returnResamp = "final",
  savePredictions = "final",
  index = train_cv_caret$index,
  indexOut = train_cv_caret$indexOut,
)

(grid_caret <- expand.grid(
  mtry = seq(1, ncol(df_train) - 1, 3),
  splitrule = c("extratrees", "gini"),
  min.node.size = c(1, 3, 5)
))
##    mtry  splitrule min.node.size
## 1     1 extratrees             1
## 2     4 extratrees             1
## 3     7 extratrees             1
## 4    10 extratrees             1
## 5    13 extratrees             1
## 6     1       gini             1
## 7     4       gini             1
## 8     7       gini             1
## 9    10       gini             1
## 10   13       gini             1
## 11    1 extratrees             3
## 12    4 extratrees             3
## 13    7 extratrees             3
## 14   10 extratrees             3
## 15   13 extratrees             3
## 16    1       gini             3
## 17    4       gini             3
## 18    7       gini             3
## 19   10       gini             3
## 20   13       gini             3
## 21    1 extratrees             5
## 22    4 extratrees             5
## 23    7 extratrees             5
## 24   10 extratrees             5
## 25   13 extratrees             5
## 26    1       gini             5
## 27    4       gini             5
## 28    7       gini             5
## 29   10       gini             5
## 30   13       gini             5

The great advantage of caret is that it wraps a lot of small code pieces in just one, high-level API call that does all the job for you - fits all individual models across CV folds and resamples, selects the best one and fits it already on the entire training dataset. It also makes sure it’s done as fast as possible thanks to parallel processing whenever it’s an enabled option.

The drawback on the other hand is that it’s quite monolythic, untidy and at the end doesn’t offer a great deal of granularity to the end user.

(
  model_caret <- train(
    status ~ .,
    data = juice(recipe_preped),
    method = "ranger",
    metric = "ROC",
    trControl = control_caret,
    tuneGrid = grid_caret,
    importance = "impurity",
    num.trees = 500
  )
)
## Random Forest 
## 
## 3565 samples
##   29 predictor
##    2 classes: 'bad', 'good' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2851, 2852, 2852, 2852, 2853 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   min.node.size  ROC        Sens       Spec     
##    1    extratrees  1              0.7891988  0.0000000  1.0000000
##    1    extratrees  3              0.7876724  0.0000000  1.0000000
##    1    extratrees  5              0.7899745  0.0000000  1.0000000
##    1    gini        1              0.8215118  0.0000000  1.0000000
##    1    gini        3              0.8208377  0.0000000  1.0000000
##    1    gini        5              0.8192741  0.0000000  1.0000000
##    4    extratrees  1              0.8040691  0.4074179  0.9219047
##    4    extratrees  3              0.8041195  0.4173781  0.9183914
##    4    extratrees  5              0.8047283  0.4103881  0.9203414
##    4    gini        1              0.8296663  0.4482388  0.9215141
##    4    gini        3              0.8299291  0.4522289  0.9215133
##    4    gini        5              0.8308536  0.4542239  0.9250297
##    7    extratrees  1              0.8046291  0.4522338  0.9094055
##    7    extratrees  3              0.8063292  0.4512239  0.9082343
##    7    extratrees  5              0.8083740  0.4562139  0.9078445
##    7    gini        1              0.8274552  0.4801095  0.9125358
##    7    gini        3              0.8271239  0.4850846  0.9078483
##    7    gini        5              0.8297186  0.4801045  0.9109718
##   10    extratrees  1              0.8057941  0.4701791  0.9078437
##   10    extratrees  3              0.8074283  0.4661542  0.9058906
##   10    extratrees  5              0.8098395  0.4651791  0.9090148
##   10    gini        1              0.8272193  0.5020299  0.9035545
##   10    gini        3              0.8271979  0.4910647  0.9058967
##   10    gini        5              0.8289744  0.4970348  0.9039443
##   13    extratrees  1              0.8065994  0.4681542  0.9031593
##  [ reached getOption("max.print") -- omitted 5 rows ]
## 
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 4, splitrule = gini
##  and min.node.size = 5.

Caret also comes with built-in handy functions for assessing model’s individual predictors strength. By setting the importance = "impurity" in the ranger engine we ensure that variable importance will be returned by the final train object. As of now there is no such possibility in the tidymodels packages and I’m curious to see how it will be solved!

# Accessing most predictive attributes from caret 
varImp(model_caret, scale = TRUE)$importance %>% 
  rownames_to_column() %>% 
  arrange(-Overall)
##                           rowname    Overall
## 1                          income 100.000000
## 2                       seniority  96.553085
## 3                          amount  87.403163
## 4                           price  80.094041
## 5                             age  66.985275
## 6                          assets  65.523351
## 7                      records_no  49.543132
## 8                        expenses  49.329710
## 9                     records_yes  46.802153
## 10                           time  34.894449
## 11                    job_partime  31.038063
## 12                      job_fixed  29.243620
## 13                           debt  22.225013
## 14                     home_owner  18.784470
## 15                      home_rent  12.899322
## 16                  job_freelance  10.660324
## 17                     home_other   9.827716
## 18                   home_parents   9.616754
## 19                marital_married   9.103606
## 20                 marital_single   7.485771
## 21    marital_infrequent_combined   6.411147
## 22                      home_priv   5.263018
## 23        job_infrequent_combined   3.994922
## 24       home_infrequent_combined   1.431941
## 25    home_unrecorded_observation   0.000000
## 26 marital_unrecorded_observation   0.000000
## 27    records_infrequent_combined   0.000000
## 28 records_unrecorded_observation   0.000000
## 29     job_unrecorded_observation   0.000000

Final cross-validated and test results are easily available with just a couple lines of code. Note that cross-validation performance is aggregated per each index (observation) and averaged out before the final performance metric is calculated.

Getting the test performance is a matter of baking the test set with the already prepped recipe and then making the prediction using the train object. 83.1% AUC for cross-validated training performance and 82.1% for testing - not a bad result for so little preprocessing! Close results also suggest that our model is likely to generalize well to new samples.

df_train_pred_caret <- model_caret$pred %>% 
  group_by(rowIndex) %>% 
  summarise(bad = mean(bad)) %>% 
  transmute(estimate = bad) %>% 
  add_column(truth = df_train$status)

# Cross-validated training performance
percent(roc_auc(df_train_pred_caret, truth, estimate)$.estimate)
## [1] "83.1%"
df_test_pred_caret <- predict(
    model_caret,
    newdata = bake(recipe_preped, df_test),
    type = "prob") %>%
  as_tibble() %>%
  transmute(estimate = bad) %>%
  add_column(truth = df_test$status)

# Test performance
percent(roc_auc(df_test_pred_caret, truth, estimate)$.estimate)
## [1] "82.1%"

Tidymodels

In the beginning, when I saw some of the very first articles about doing ML the tidy way by combining recipes and rsample my thoughts were that it was all way too complicated compared to what caret offered. I was very surprised now when I discovered how clean and simple it became over the last year, and apparently things will be further simplified over the next months (link)!

First let’s define two helper functions that will be used later during the modelling process. I imagine these might be wrapped into predefined helper functions in tidymodels packages instead of having to do that every time.

# Defining helper functions that will be used later on
fit_on_fold <- function(spec, prepped) {
  
  x <- juice(prepped, all_predictors())
  y <- juice(prepped, all_outcomes())
  
  fit_xy(spec, x, y)
}

predict_helper <- function(split, recipe, fit) {
  
  new_x <- bake(recipe, new_data = assessment(split), all_predictors())
  
  predict(fit, new_x, type = "prob") %>% 
    bind_cols(assessment(split) %>% select(status)) 
}

First, let’s use parsnip to define our ‘modelling engine’ - just like before we’re setting it as a classification problem, using Random Forest running on the ranger engine. On top of that I’m using dials to define a grid of parameters to optimize. Dials provides a set of handy functions, such as: grid_random or grid_regular, that let you choose the range of parameters in a very flexible way.

From what I can see the parameters that could be optimized slightly differ between both frameworks: caret allows for tunning the ‘min.node.size’ while keeping the ‘trees’ constant, while parsnip allows for tuning ‘trees’ while keeping ‘min.node.size’ constant (I assume it’s using the default ranger values). Nevertheless, the total amount of combinations is same in both cases and equal to 30.

# Specifying the modelling engine
(engine_tidym <- rand_forest(mode = "classification") %>% 
  set_engine("ranger"))
## Random Forest Model Specification (classification)
## 
## Computational engine: ranger
# Specifying the grid of hyperparameters that should be tested
(gridy_tidym <- grid_random(
  mtry %>% range_set(c( 1,  20)),
  trees %>% range_set(c( 500, 1000)), 
  min_n %>% range_set(c(2,  10)),
  size = 30
  ))
## # A tibble: 30 x 3
##     mtry trees min_n
##    <int> <int> <int>
##  1     1   806     4
##  2     6   681     6
##  3     7   589     2
##  4     3   923     6
##  5     4   822     5
##  6    19   681    10
##  7     5   572     3
##  8     2   744     2
##  9    12   903     5
## 10    16   616     3
## # … with 20 more rows

Now comes the really interesting part of tidymodels: we’re using a merge helper function from dials to bind our predefined ‘modelling engine’ with all grid combinations of the hyperparameters to tune.

merge(engine_tidym, gridy_tidym)[1:3] # just to see the top 3
## [[1]]
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = 1
##   trees = 806
##   min_n = 4
## 
## Computational engine: ranger 
## 
## 
## [[2]]
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = 6
##   trees = 681
##   min_n = 6
## 
## Computational engine: ranger 
## 
## 
## [[3]]
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = 7
##   trees = 589
##   min_n = 2
## 
## Computational engine: ranger

Subsequently, I’m putting it into a tidy data frame structure where each model-parameters combination is bound together and assigned a model id that will be used later to make a distinction between consequtive fits.

# Merging all possibilities with our cross-validated data frame
(spec_tidym <- tibble(spec = merge(engine_tidym, gridy_tidym)) %>% 
  mutate(model_id = row_number()))
## # A tibble: 30 x 2
##    spec      model_id
##    <list>       <int>
##  1 <spec[+]>        1
##  2 <spec[+]>        2
##  3 <spec[+]>        3
##  4 <spec[+]>        4
##  5 <spec[+]>        5
##  6 <spec[+]>        6
##  7 <spec[+]>        7
##  8 <spec[+]>        8
##  9 <spec[+]>        9
## 10 <spec[+]>       10
## # … with 20 more rows

Lastly, I’m adding the last component into this tidy structure: all cross-validation splits that were specified before with the use of the crossing function. This part is very likely to evolve and be simplified in the upcoming months. Now we’re all set to start the actual tidy-modelling!

(spec_tidym <- crossing(train_cv, spec_tidym))
## # A tibble: 150 x 4
##    splits             id    spec      model_id
##    <named list>       <chr> <list>       <int>
##  1 <split [2.9K/714]> Fold1 <spec[+]>        1
##  2 <split [2.9K/714]> Fold1 <spec[+]>        2
##  3 <split [2.9K/714]> Fold1 <spec[+]>        3
##  4 <split [2.9K/714]> Fold1 <spec[+]>        4
##  5 <split [2.9K/714]> Fold1 <spec[+]>        5
##  6 <split [2.9K/714]> Fold1 <spec[+]>        6
##  7 <split [2.9K/714]> Fold1 <spec[+]>        7
##  8 <split [2.9K/714]> Fold1 <spec[+]>        8
##  9 <split [2.9K/714]> Fold1 <spec[+]>        9
## 10 <split [2.9K/714]> Fold1 <spec[+]>       10
## # … with 140 more rows

To speed thigs up let’s use the furrr package and fit many models simultaneously. In the following code our original recipe is first prepped on each split’s training set and than it’s used by the fit_on_fold helper function to fit a given model-parameter combination.

# Fitting each model-fold pair
fits_tidym <- spec_tidym %>%
  mutate(
    prepped = future_map(splits, prepper, recipe),
    fit = future_map2(spec, prepped, fit_on_fold)
  )

The last step of modelling involves usage of the other predict_helper function that bakes the already prepped split’s recipe and applies it on the testing set of the split, in order to make a prediction of the given model-parameters combination.

# Making predictions of each fitted model on the testing set
fits_pred_tidym <- fits_tidym %>%
  mutate(
    preds = future_pmap(list(splits, prepped, fit), predict_helper)
  )
# Top row of the entire structure as example
fits_pred_tidym[1, ]
## # A tibble: 1 x 7
##   splits          id    spec     model_id prepped    fit     preds         
## * <named list>    <chr> <list>      <int> <named li> <list>  <named list>  
## 1 <split [2.9K/7… Fold1 <spec[+…        1 <recipe>   <fit[+… <tibble [714 …

After training is done I would like to assess which model performs the best based on cross-validated hold-out performance. In order to do that, let’s calculate the AUC of all test sets across all model-parameter combinations. By averaging the results up, I can see the entire performance profile of all possibilities.

Tidymodels includes also two very handy packages: probably and tidyposterior, which are very usefull for analysing model estimated probabilities and it’s resampled performance profile. I will make an introduction to those packages in one of my next posts.

# Assessing individual model-fold performance and averaging performance across all folds for each model
(perf_summary_tidym <- fits_pred_tidym %>% 
  unnest(preds) %>% 
  group_by(id, model_id) %>% 
  roc_auc(truth = status, .pred_bad) %>% 
  group_by(model_id, .metric, .estimator) %>% 
  summarize(mean = mean(.estimate, na.rm = TRUE))) %>% 
  arrange(-mean)
## # A tibble: 30 x 4
## # Groups:   model_id, .metric [30]
##    model_id .metric .estimator  mean
##       <int> <chr>   <chr>      <dbl>
##  1        5 roc_auc binary     0.831
##  2       21 roc_auc binary     0.831
##  3        4 roc_auc binary     0.831
##  4        2 roc_auc binary     0.830
##  5       14 roc_auc binary     0.830
##  6       29 roc_auc binary     0.830
##  7        7 roc_auc binary     0.830
##  8       27 roc_auc binary     0.830
##  9       28 roc_auc binary     0.829
## 10        8 roc_auc binary     0.829
## # … with 20 more rows

Just by sorting the previous results we can easly see what is the best performing model. Let’s now take a step back and filter only that model specification, and fit it on the entire training set. As of now I’m not 100% sure what the recommended and most efficient way of doing that would be, but I decided to go for something like that:

# Selecting the best model with:
# perf_summary_tidym$model_id[which.max(perf_summary_tidym$mean)]

# Fitting the best model on the full training set
(model_tidym <- tibble(spec = merge(engine_tidym, gridy_tidym)) %>% 
  mutate(model_id = row_number()) %>% 
  filter(model_id == perf_summary_tidym$model_id[which.max(perf_summary_tidym$mean)]) %>% 
  pull(spec) %>% 
  .[[1]] %>% 
  fit(status ~ ., juice(recipe_preped)))
## parsnip model object
## 
## Ranger result
## 
## Call:
##  ranger::ranger(formula = formula, data = data, mtry = ~4L, num.trees = ~822L,      min.node.size = ~5L, num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1), probability = TRUE) 
## 
## Type:                             Probability estimation 
## Number of trees:                  822 
## Sample size:                      3565 
## Number of independent variables:  29 
## Mtry:                             4 
## Target node size:                 5 
## Variable importance mode:         none 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.1431664

Similarly like before with caret, I can now summarize our cross-validated and test performances.

# Cross-validated training performance"
percent(perf_summary_tidym$mean[which.max(perf_summary_tidym$mean)])
## [1] "83.1%"
# Test performance
df_train_pred_tidym <- predict(
  model_tidym, 
  new_data = bake(recipe_preped, df_test), 
  type = "prob"
  ) %>% 
  transmute(estimate = .pred_bad) %>%
  add_column(truth = df_test$status)

percent(roc_auc(df_train_pred_tidym, truth, estimate)$.estimate)
## [1] "82.2%"

The entire tidymodels code that was scattered across above sections could be easily squeezed in one longer pipeline. Note that I limited the grid to just one row gridy_tidym[1, ] in order to demonstrate the solution and save on processing time.

df_tidym <- tibble(spec = merge(engine_tidym, gridy_tidym[1:2, ])) %>% 
  mutate(model_id = row_number()) %>% 
  crossing(train_cv, .) %>%
  mutate(
    prepped = future_map(splits, prepper, recipe),
    fit = future_map2(spec, prepped, fit_on_fold),
    preds = future_pmap(list(splits, prepped, fit), predict_helper)
  )

Wrapping up

I’ve fit a credit scoring classification Random Forest model using both caret and tidymodels frameworks. I need to admit that before I started writing this post I expected a lot more additional code to be written in the tidymodels framework to achieve the same goal, but to my surprise those packages already offer a very concise (and tidy!) way of doing ML in R, and things will be even more streamlined in the upcoming months. That’s definitely a really big step-up for the entire R community when it comes to doing ML in R. I think R users can expect most of the releases happening still this year and a bigger announcement on tidymodels taking place during the upcoming R Conference in SF in January 2020.

As long it’s already quite nice & easy to use tidymodels as your main ML tool in R, I probably wouldn’t recommend incorporating it in your day-to-day work as it’s still quite unstable, however, I’m definitely looking forward to do that in some time in the future!

Future considerations

  1. I still haven’t fully explored the tidyposterior and probably packages - I will do that in one of me next posts.

  2. On the `rsample page there’s an interesting article listed on so called: nested resampling. I’ve never used it in practice but I’m curious to check it out and compare my model’s current cross-validated performance estimate with the one obtained through nested resampling.

  3. There’s also a lot of buzz in the R community regarding a BETA release of the successor of the mlr package (second most popular ML framework in R) - mlr3. mlr3 could be very strong competition to the tidymodels framework, and since I’ve never really used mlr it’s an excellent opportunity to put it to a test. It is also modular in design like tidymodels, but is built on top of data.table and uses R6 object-oriented class system which could give it substantial speed advantage over tidymodels at the expenses of ‘tidyness’.