And how I placed in the top 10% of one of Europe’s largest machine learning competitions with them!
Image generated by DALL·E 3, depicting a stacked landscape
We all know that ensemble models tend to outperform any single model at predictive modeling. You’ve probably heard of Bagging and Boosting as common ensemble methods, with Random Forests and Gradient Boosting Machines as respective examples.
But what about ensembling different models together under a separate higher-level model? This is where stacked ensembles come in. This article is a step-by-step guide on how to train stacked ensembles using the popular machine learning library, H2O.
To demonstrate the power of stacked ensembles, I will walk through my full code for training a stacked ensemble of 40 Deep Neural Network, XGBoost, and LightGBM models for the prediction task posed in the 2023 Cloudflight Coding Competition (AI Category), one of the largest coding competitions in Europe, where I placed in the top 10% of the leaderboard with a training time of under 1 hour!
This guide will cover:
1. What are stacked ensembles and how do they work?
2. How to train stacked ensembles with H2O.ai
3. Comparing the performance of a stacked ensemble versus standalone models
1. What are Stacked Ensembles?
A stacked ensemble combines the predictions of multiple models through another, higher-level model, with the aim of increasing overall predictive performance by capitalizing on the unique strengths of each constituent model. It involves two stages:
Stage 1: Multiple Base Models
First, multiple base models are independently trained on the same training dataset. These models should ideally be diverse, ranging from simple linear regressions to complex deep learning models. The key is that they should differ from each other in some way, either in terms of algorithm or hyperparameter settings.
The more diverse the base models are, the more powerful the eventual stacked ensemble. This is because different models are able to capture different patterns in the data. For example, a tree-based model might be good at capturing non-linear relationships, while a linear model excels at understanding linear trends. When these diverse base models are combined, the stacked ensemble can then leverage the strengths of each base model while mitigating their individual weaknesses, increasing predictive performance.
Stage 2: One Meta-Model
After all the base models are trained, each base model’s predictions for the target are used as features for training a higher-level model, termed the meta-model. This means that the meta-model is not trained on the original dataset’s features, but on the predictions of the base models. If there are n base models, n sets of predictions are generated, and these are the n features used for training the meta-model.
While the training features differ between the base models and the meta-model, the target stays the same: the original target from the dataset.
The meta-model learns how to best combine the predictions from the base models to make a final, better prediction.
Detailed Steps for Training a Stacked Ensemble
For each base model:
1. Pick an algorithm (e.g., Random Forest).
2. Use cross-validation to obtain the best set of hyperparameters for the algorithm.
3. Obtain cross-validation predictions for the target in the training set. These will be used to train the meta-model subsequently.
To illustrate this, say a Random Forest algorithm was chosen in Step 1, and its optimal hyperparameters were determined as h in Step 2. The cross-validation predictions are then obtained as follows, assuming 5-fold cross-validation:
1. Train a Random Forest with hyperparameters h on Folds 1–4.
2. Use the trained Random Forest to make predictions for Fold 5. These are the cross-validation predictions for Fold 5.
3. Repeat the above to obtain cross-validation predictions for each fold. Once done, you have cross-validation predictions for the target across the entire training set (a minimal sketch of this appears below).
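To make this concrete, here is a minimal sketch of generating these out-of-fold predictions for one base model. It uses scikit-learn purely for illustration (the article itself uses H2O later), and X, y, and the hyperparameter dictionary h are placeholders for your own data and tuned values.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

# X, y: training features and target as numpy arrays (placeholders for your own data)
# estimator_cls, h: the algorithm picked in Step 1 and its tuned hyperparameters from Step 2
def cross_val_predictions(estimator_cls, h, X, y, n_splits=5, seed=1):
    oof_preds = np.zeros(len(y))  # one out-of-fold prediction per training row
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in kf.split(X):
        model = estimator_cls(**h)
        model.fit(X[train_idx], y[train_idx])           # train on the other folds
        oof_preds[val_idx] = model.predict(X[val_idx])  # predict the held-out fold
    return oof_preds  # these later become one feature column for the meta-model

# e.g. oof_rf = cross_val_predictions(RandomForestRegressor, {"n_estimators": 200}, X, y)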
For the meta-model:
1. Obtain the features for training the meta-model. These are the predictions of each of the base models.
2. Obtain the target for training the meta-model. This is the original target from the training set.
3. Pick an algorithm (e.g., Linear Regression).
4. Use cross-validation to obtain the best set of hyperparameters for the algorithm.
And voila! You now have:
– Multiple base models that are trained with optimal hyperparameters
– One meta-model that is also trained with optimal hyperparameters
Which means you have successfully trained a stacked ensemble!
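To tie the two stages together, here is a bare-bones sketch of the full recipe, again in scikit-learn and purely for illustration; cross_val_predictions is the helper sketched earlier, and the base model hyperparameters here are arbitrary placeholders rather than tuned values. The H2O version we actually use follows in the next section.

import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge

# Stage 1: out-of-fold predictions from each base model become the meta-model's features
base_configs = [
    (RandomForestRegressor, {"n_estimators": 300}),
    (GradientBoostingRegressor, {"n_estimators": 200, "max_depth": 3}),
]
meta_features = np.column_stack([
    cross_val_predictions(cls, h, X, y) for cls, h in base_configs
])  # shape: (n_rows, n_base_models)

# Stage 2: the meta-model is trained on the base models' predictions, not on X
meta_model = Ridge(alpha=1.0)
meta_model.fit(meta_features, y)

# At prediction time, each base model (refit on the full training set) predicts first,
# and the meta-model combines those predictions into the final prediction.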
2. How to Train Stacked Ensembles with H2O.ai
Now, let’s jump into coding it out!
As mentioned, this section covers my full code for training a stacked ensemble for the prediction task posed in the 2023 Cloudflight Coding Competition (AI Category), which is a regression task using tabular data. Within the competition’s time constraints, I created a stacked ensemble from 40 base models of 3 algorithm types — Deep Neural Network, XGBoost, and LightGBM, with these algorithms chosen as they often achieve superior performance in practice.
2.1. Data Preparation
First, let’s import the necessary libraries.
import pandas as pd
import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
from h2o.estimators import H2OXGBoostEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
import optuna
from tqdm import tqdm
seed = 1
And initialize the H2O cluster.
h2o.init()
Next, load in the dataset.
data = pd.read_csv('path_to_your_tabular_dataset')
Before moving on to model building using H2O, let’s first understand the following traits of H2O models:
- H2O models cannot take in Pandas DataFrame objects, so data must be converted from a Pandas DataFrame to its H2O equivalent, an H2OFrame.
- H2O models can encode categorical features automatically, which is great as it takes this preprocessing step out of our hands. To ensure that such features are understood by the models to be categorical, they must be explicitly converted into the factor (categorical) data type.

data_h2o = h2o.H2OFrame(data)

categorical_cols = [...]  # insert the names of the categorical features here
for col in categorical_cols:
    data_h2o[col] = data_h2o[col].asfactor()
Now we can proceed to split our dataset into train (90%) and validation (10%) sets, using the split_frame() method of H2OFrame objects.
splits = data_h2o.split_frame(ratios=[0.9], seed=seed)
train = splits[0]
val = splits[1]
Lastly, let’s obtain the features and target for modelling. Unlike Scikit-Learn models, which take the values of the features and target as input, H2O models take the names of the feature and target columns as input.
y = '...'  # insert the name of the target column here
x = list(train.columns)
x.remove(y)
Now, let the model training fun begin!
2.2. Training Deep Neural Networks (DNN) as Base Models
Let’s start by training the DNNs that will form our set of base models for the stacked ensemble, using H2O’s H2ODeepLearningEstimator.
Aside: Why train DNNs in H2O instead of TensorFlow, Keras, or PyTorch?

Before jumping into the code, you might be wondering why I chose to train DNNs using H2O’s H2ODeepLearningEstimator, as opposed to TensorFlow, Keras, or PyTorch, the libraries most commonly used to build DNNs.

The straightforward answer is that building a stacked ensemble in H2O uses the H2OStackedEnsembleEstimator, which can only accept base models that are part of the H2O model family. However, the more compelling reason is that H2O’s H2ODeepLearningEstimator makes tuning DNNs far easier than those other frameworks, and here’s why.

In TensorFlow, Keras, or PyTorch, regularization components like dropout layers must be manually added into the model architecture, for example with keras.layers.Dropout(). This allows for greater customization, but also requires more detailed knowledge and effort: you have to decide where, and how many times, to include a keras.layers.Dropout() layer within your architecture.

H2O’s H2ODeepLearningEstimator, on the other hand, is more abstracted and accessible. Regularization is enabled in a straightforward manner through model hyperparameters, and the default hyperparameters already include regularization. Common feature preprocessing steps, such as scaling of numerical features and encoding of categorical features, are also exposed as hyperparameters for automatic preprocessing. This makes tuning DNNs a far more straightforward process, without having to dive into the complexities of deep learning model architecture. Under the competition’s time crunch, this was extremely useful for me!
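To make the contrast concrete, here is a rough sketch (illustrative only, with arbitrary layer sizes and dropout rates) of how the same regularization choices look in Keras versus as plain hyperparameters of H2ODeepLearningEstimator:

# In Keras, dropout and weight regularization are wired into the architecture layer by layer,
# so you decide where and how often each one appears.
from tensorflow import keras

keras_dnn = keras.Sequential([
    keras.layers.Dense(200, activation="relu",
                       kernel_regularizer=keras.regularizers.l2(1e-4)),
    keras.layers.Dropout(0.2),   # manually placed dropout layer
    keras.layers.Dense(200, activation="relu"),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(1),
])

# In H2O, the equivalent choices are just hyperparameters of a single estimator:
# H2ODeepLearningEstimator(hidden=[200, 200], hidden_dropout_ratios=[0.2, 0.2],
#                          l2=1e-4, activation='rectifier_with_dropout')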
Now, let’s get into the code. We’ll use optuna to tune the hyperparameters of H2O’s H2ODeepLearningEstimator, and keep track of all the trained models inside the list dnn_models.
dnn_models = []

def objective(trial):
    # params to tune
    num_hidden_layers = trial.suggest_int('num_hidden_layers', 1, 10)
    hidden_layer_size = trial.suggest_int('hidden_layer_size', 100, 300, step=50)
    params = {
        'hidden': [hidden_layer_size] * num_hidden_layers,
        'epochs': trial.suggest_int('epochs', 5, 100),
        'input_dropout_ratio': trial.suggest_float('input_dropout_ratio', 0.1, 0.3),  # dropout for input layer
        'l1': trial.suggest_float('l1', 1e-5, 1e-1, log=True),  # l1 regularization
        'l2': trial.suggest_float('l2', 1e-5, 1e-1, log=True),  # l2 regularization
        'activation': trial.suggest_categorical('activation', ['rectifier', 'rectifier_with_dropout', 'tanh', 'tanh_with_dropout', 'maxout', 'maxout_with_dropout'])
    }

    # param 'hidden_dropout_ratios' is applicable only if the activation type is
    # rectifier_with_dropout, tanh_with_dropout, or maxout_with_dropout
    if params['activation'] in ['rectifier_with_dropout', 'tanh_with_dropout', 'maxout_with_dropout']:
        hidden_dropout_ratio = trial.suggest_float('hidden_dropout_ratio', 0.1, 1.0)
        params['hidden_dropout_ratios'] = [hidden_dropout_ratio] * num_hidden_layers  # dropout for hidden layers

    # train model
    model = H2ODeepLearningEstimator(**params,
                                     standardize=True,
                                     categorical_encoding='auto',
                                     nfolds=5,
                                     keep_cross_validation_predictions=True,  # need this for training the meta-model later
                                     seed=seed)
    model.train(x=x, y=y, training_frame=train)

    # store model
    dnn_models.append(model)

    # get cross-validation rmse
    cv_metrics_df = model.cross_validation_metrics_summary().as_data_frame()
    cv_rmse_row = cv_metrics_df[cv_metrics_df[''] == 'rmse']
    cv_rmse = float(cv_rmse_row['mean'].iloc[0])  # optuna expects a plain float
    return cv_rmse

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=20)
Above, an optuna study is created to search for the set of H2ODeepLearningEstimator hyperparameters that minimizes the cross-validation RMSE (as this is a regression task), with the optimization running for 20 trials via n_trials=20. This means that 20 DNNs are trained and stored in the list dnn_models for use as base models in the stacked ensemble later on. Given the competition’s time constraints, I chose to train 20 DNNs, but you can set n_trials to however many DNNs you wish to train for your stacked ensemble.
Importantly, the H2ODeepLearningEstimator must be trained with keep_cross_validation_predictions=True, as these cross-validation predictions will be used as features for training the meta-model later.
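As an optional sanity check, you can peek at what keep_cross_validation_predictions=True actually stores, assuming your H2O version exposes the cross_validation_holdout_predictions() method:

# Inspect one base model's stored out-of-fold (cross-validation) predictions;
# these only exist because keep_cross_validation_predictions=True was set above.
example_dnn = dnn_models[0]
oof_preds = example_dnn.cross_validation_holdout_predictions()  # an H2OFrame, one row per training row
print(oof_preds.head())

# The best DNN hyperparameter combination found by optuna:
print(study.best_params)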
2.3. Training XGBoost and LightGBM as Base Models
Next, let’s train the XGBoost and LightGBM models that will also form our set of base models for the stacked ensemble. We’ll again use optuna to tune the hyperparameters of H2O’s H2OXGBoostEstimator, and keep track of all the trained models inside the list xgboost_lightgbm_models.
Before diving into the code, we must first understand that H2OXGBoostEstimator is the integration of the XGBoost framework from the popular xgboost library into H2O. H2O does not integrate the lightgbm library; however, it does provide a method for emulating the LightGBM framework using a certain set of parameters within H2OXGBoostEstimator, and this is exactly what we will implement in order to train both XGBoost and LightGBM models using H2OXGBoostEstimator.
xgboost_lightgbm_models = []

def objective(trial):
    # common params between xgboost and lightgbm
    params = {
        'ntrees': trial.suggest_int('ntrees', 50, 5000),
        'max_depth': trial.suggest_int('max_depth', 1, 9),
        'min_rows': trial.suggest_int('min_rows', 1, 5),
        'sample_rate': trial.suggest_float('sample_rate', 0.8, 1.0),
        'col_sample_rate': trial.suggest_float('col_sample_rate', 0.2, 1.0),
        'col_sample_rate_per_tree': trial.suggest_float('col_sample_rate_per_tree', 0.5, 1.0)
    }

    grow_policy = trial.suggest_categorical('grow_policy', ['depthwise', 'lossguide'])

    # from H2OXGBoostEstimator's documentation (https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/xgboost.html),
    # lightgbm is emulated when grow_policy='lossguide' and tree_method='hist',
    # so we tune lightgbm-specific hyperparameters when this combination is used,
    # and xgboost-specific hyperparameters otherwise

    # add lightgbm-specific params
    if grow_policy == 'lossguide':
        tree_method = 'hist'
        params['max_bins'] = trial.suggest_int('max_bins', 20, 256)
        params['max_leaves'] = trial.suggest_int('max_leaves', 31, 1024)
    # add xgboost-specific params
    else:
        tree_method = 'auto'
        params['booster'] = trial.suggest_categorical('booster', ['gbtree', 'gblinear', 'dart'])
        params['reg_alpha'] = trial.suggest_float('reg_alpha', 0.001, 1)
        params['reg_lambda'] = trial.suggest_float('reg_lambda', 0.001, 1)
        params['min_split_improvement'] = trial.suggest_float('min_split_improvement', 1e-10, 1e-3, log=True)

    # add grow_policy and tree_method into params dict
    params['grow_policy'] = grow_policy
    params['tree_method'] = tree_method

    # train model
    model = H2OXGBoostEstimator(**params,
                                learn_rate=0.1,
                                categorical_encoding='auto',
                                nfolds=5,
                                keep_cross_validation_predictions=True,  # need this for training the meta-model later
                                seed=seed)
    model.train(x=x, y=y, training_frame=train)

    # store model
    xgboost_lightgbm_models.append(model)

    # get cross-validation rmse
    cv_metrics_df = model.cross_validation_metrics_summary().as_data_frame()
    cv_rmse_row = cv_metrics_df[cv_metrics_df[''] == 'rmse']
    cv_rmse = float(cv_rmse_row['mean'].iloc[0])  # optuna expects a plain float
    return cv_rmse

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=20)
Similarly, 20 XGBoost and LightGBM models are trained and stored for usage as base models for the stacked ensemble later on, but you can set n_trials to be however many XGBoost/LightGBM models you wish to train for your stacked ensemble.
Importantly, the H2OXGBoostEstimator must also be trained with keep_cross_validation_predictions=True, as these cross-validation predictions will be used as features for training the meta-model later.
2.4. Training the Meta-Model
We will use all of the Deep Neural Network, XGBoost and LightGBM models trained above as base models. However, this does not mean that all of them will be used in the stacked ensemble, as we will perform automatic base model selection when tuning our meta-model (more on this later)!
Recall that we had stored each trained base model inside the lists dnn_models (20 models) and xgboost_lightgbm_models (20 models), giving a total of 40 base models for our stacked ensemble. Let’s combine them into a final list of base models, base_models.
base_models = dnn_models + xgboost_lightgbm_models
Now, we are ready to train the meta-model using these base models. But first, we have to decide on the meta-model algorithm, where a few concepts come into play:
- Most academic papers on stacked ensembles recommend choosing a simple linear-based algorithm for the meta-model, to avoid the meta-model overfitting to the predictions from the base models.
- H2O recommends using a Generalized Linear Model (GLM) over a Linear Regression (for regression tasks) or Logistic Regression (for classification tasks). The GLM is a flexible linear model that does not impose the assumptions of normality and homoscedasticity that the latter do, allowing it to model the true behavior of the target better, since such assumptions are often difficult to meet in practice. Further explanations can be found in this academic thesis, on which H2O’s work was based.
As such, we will instantiate the meta-model using H2OStackedEnsembleEstimator with metalearner_algorithm='glm', and use optuna to tune the hyperparameters of the GLM meta-model to optimize performance.
def objective(trial):
    # GLM params to tune
    meta_model_params = {
        'alpha': trial.suggest_float('alpha', 0, 1),  # regularization distribution between L1 and L2
        'family': trial.suggest_categorical('family', ['gaussian', 'tweedie']),  # see the GLM documentation for which family suits your target: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/glm.html
        'standardize': trial.suggest_categorical('standardize', [True, False]),
        'non_negative': True  # predictions of each base model cannot be subtracted from one another
    }

    ensemble = H2OStackedEnsembleEstimator(metalearner_algorithm='glm',
                                           metalearner_params=meta_model_params,
                                           metalearner_nfolds=5,
                                           base_models=base_models,
                                           seed=seed)
    ensemble.train(x=x, y=y, training_frame=train)

    # get cross-validation rmse
    cv_metrics_df = ensemble.cross_validation_metrics_summary().as_data_frame()
    cv_rmse_row = cv_metrics_df[cv_metrics_df[''] == 'rmse']
    cv_rmse = float(cv_rmse_row['mean'].iloc[0])  # optuna expects a plain float
    return cv_rmse

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=20)
Notice that the cross-validation predictions of each base model were not explicitly passed into H2OStackedEnsembleEstimator. This is because H2O does this automatically under the hood, making things easier for us! All we had to do was set keep_cross_validation_predictions=True when training our base models previously, and instantiate H2OStackedEnsembleEstimator with the parameter base_models=base_models.
Now, we can finally build the best_ensemble model, using the optimal hyperparameters found by optuna.
best_meta_model_params = study.best_params

best_ensemble = H2OStackedEnsembleEstimator(metalearner_algorithm='glm',
                                            metalearner_params=best_meta_model_params,
                                            base_models=base_models,
                                            seed=seed)
best_ensemble.train(x=x, y=y, training_frame=train)
And voila, we have successfully trained a stacked ensemble in H2O! Let’s take a look at it.
best_ensemble.summary()
Notice that the stacked ensemble uses only 16 out of the 40 base models we passed to it, of which 3 are XGBoost/LightGBM and 13 are Deep Neural Networks. This is due to the hyperparameter alpha that we tuned for the GLM meta-model, which represents the distribution of regularization between L1 (LASSO) and L2 (Ridge). A value of 1 entails only L1 regularization, while a value of 0 entails only L2 regularization.
As reflected above, its optimal value was found to be alpha=0.16, so a mix of L1 and L2 regularization was employed. Under L1 regularization, some base models’ predictions had their coefficients in the regression set to 0, meaning those base models were not used in the stacked ensemble at all, which is why fewer than 40 base models ended up being used.
The key takeaway here is that our setup above also performs automatic selection of which base models to use for optimal performance, through the meta-model’s regularization hyperparameters, instead of simply using all 40 base models provided.
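If you would like to verify this selection yourself, one way is to inspect the GLM meta-model’s coefficients; base models whose coefficient was shrunk to exactly zero by L1 regularization are the ones that were dropped. Note that metalearner() returning the fitted model object directly is an assumption that holds in recent H2O versions; in some older versions you may need to fetch it via h2o.get_model() instead.

# Which base models did the GLM meta-model actually keep?
meta_learner = best_ensemble.metalearner()   # the fitted GLM meta-model
coefs = meta_learner.coef()                  # dict mapping each base model's prediction column to its coefficient
used = {name: c for name, c in coefs.items() if name != 'Intercept' and c != 0}
print(f'{len(used)} of {len(base_models)} base models received a non-zero coefficient')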
3. Comparing Performance: Stacked Ensemble Versus Standalone Base Models
To demonstrate the power of stacked ensembles, let’s use the ensemble to generate predictions for the validation set, which was held out from the beginning. The RMSE figures below are specific to the dataset I am using, but feel free to run this article’s code on your own dataset and see the difference in model performance for yourself!
ensemble_val_rmse = best_ensemble.model_performance(val).rmse()
ensemble_val_rmse #0.31475634111745304
The stacked ensemble produces an RMSE of 0.31 on the validation set.
Next, let’s dig into the performance of each of the base models on this same validation set.
base_val_rmse = []
for i in range(len(base_models)):
    base_val_rmse.append(base_models[i].model_performance(val).rmse())

models = ['H2ODeepLearningEstimator'] * len(dnn_models) + ['H2OXGBoostEstimator'] * len(xgboost_lightgbm_models)
base_val_rmse_df = pd.DataFrame([models, base_val_rmse]).T
base_val_rmse_df.columns = ['model', 'val_rmse']
base_val_rmse_df = base_val_rmse_df.sort_values(by='val_rmse', ascending=True).reset_index(drop=True)
base_val_rmse_df.head(15)  # show only the top 15 in terms of lowest val_rmse
Compared to the stacked ensemble which achieved an RMSE of 0.31, the best-performing standalone base model achieved an RMSE of 0.35.
This means that stacking improved predictive performance by roughly 11% on unseen data!
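For completeness, here is the small calculation behind that figure, using the objects defined above (the exact numbers will of course differ on your own dataset):

best_base_rmse = float(base_val_rmse_df['val_rmse'].iloc[0])  # best standalone base model (sorted ascending above)
improvement = (best_base_rmse - ensemble_val_rmse) / best_base_rmse
print(f'Relative RMSE improvement from stacking: {improvement:.1%}')  # roughly 11% on my dataset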
Now that you’ve witnessed the power of stacked ensembles, it’s your turn to try them out! Thank you for reading. I had a lot of fun writing this, and if you had fun reading, I would really appreciate if you took a second to leave some claps and a follow! 🤗
If you have any questions or thoughts, feel free to comment them here or reach out to me on LinkedIn. See you in my next article very soon!
Happy learning,
Sheila 🤓