Stacked Generalization

2019-04-22

Introduced by Wolpert in 1992 [1], this generalization technique consists of combining estimators nonlinearly to correct their biases on a given training set, adding up their capabilities for better prediction [2].

In a previous post I presented the linear combination of estimators: we fit \(N\) models \(M\) to a dataset \(D\) and define, a priori, weights \(W\) for them, combining everything in a single summation:

$$\sum_{i=1}^{N} w_{i}M_{i}$$

$$\text{given a priori} \quad W = (w_1, w_2, \dots, w_N) \quad \text{and} \quad \sum_{i=1}^{N} w_i = 1$$

With this, the weighted average of the predictions will in general be less biased in certain regions and may generalize better. But this method has two limitations: the weights cannot be changed after we verify performance (otherwise we would be acting as a meta-estimator on the test data), and it is an extremely simple combination, not leveraging the strengths of the \(M\) estimators in particular regions.
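As a minimal sketch of that fixed-weight combination (assuming generic arrays xtrain, ytrain, xtest; the function name and weights are illustrative, not from the original post):

import numpy as np

def linear_combination(models, weights, xtrain, ytrain, xtest):
    """Weighted average of base model predictions with fixed, a priori weights."""
    # the weights are chosen beforehand and must sum to 1
    preds = np.column_stack([m.fit(xtrain, ytrain).predict(xtest)
                             for m in models])
    return preds @ np.asarray(weights)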

Wolpert then proposes an alternative: what if we turn the weights \(W\) into a learning problem? Or better yet, what if we learn not just how to weight our predictions, but how to combine them nonlinearly using a meta-estimator?

Meta-estimators are models that use base models, combining or selecting them to improve on some performance metric; for example, you, the reader, act as a meta-estimator when you decide between a random forest and a logistic regression for your problem. But here the generalization problem arises: if you keep tuning your regression or random forest, you may end up overfitting the data and failing to generalize. It is therefore necessary to apply cross-validation techniques to select the model, and the same goes for stacking.
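For instance, a minimal sketch of that selection step using scikit-learn's cross_val_score on synthetic data (the candidate models here are placeholders):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=1000, noise=10, random_state=0)

# choose between candidates by cross-validated error, not by test-set peeking
for model in (RandomForestRegressor(), LinearRegression()):
    scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
    print(type(model).__name__, -scores.mean())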

For stacking, it is ideal that the dataset be relatively large; the author's advice is at least one thousand records. We start our example by loading a relatively large dataset of about 20,000 records. Its features are attributes of California houses and its target is their price; the data is already normalized and we will not make any changes to it.

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
import pandas as pd

dataset = fetch_california_housing()
df = pd.DataFrame(data=dataset.data, columns=dataset.feature_names)
df['Price'] = dataset.target
df.head()
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude Price
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23 4.526
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22 3.585
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24 3.521
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.25 3.413
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.25 3.422

Here we split the data into training and test sets in a (pseudo)random way, so that we can evaluate the final performance later.

xtrain, xtest, ytrain, ytest =\
    train_test_split(df.drop('Price', axis=1), df.Price, test_size=.3,
                     random_state=42)
xtrain.head()
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude
7061 4.1312 35.0 5.882353 0.975490 1218.0 2.985294 33.93 -118.02
14689 2.8631 20.0 4.401210 1.076613 999.0 2.014113 32.79 -117.09
17323 4.2026 24.0 5.617544 0.989474 731.0 2.564912 34.59 -120.14
10056 3.1094 14.0 5.869565 1.094203 302.0 2.188406 39.26 -121.00
15750 3.3068 52.0 4.801205 1.066265 1526.0 2.298193 37.77 -122.45

We now load the cross-validation utilities, specifically KFold, so that we don't "lose" a lot of data, along with the models that will be used. There is no rule of thumb for choosing the base models, it is up to you; for the meta-estimator, boosted trees are usually applied. Here I arbitrarily chose kNN and ElasticNet, and as the meta-estimator I will use XGBoost.

from sklearn.linear_model import ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import KFold

from xgboost import XGBRegressor

en = ElasticNet()
knn = KNeighborsRegressor()
# boosted-tree meta-estimator; we will early stop so it doesn't overfit
gbm = XGBRegressor(n_jobs=-1, objective='reg:squarederror')

Now the creation of the stacked attributes begins. To make sure there are no biases and that we don't have too little data to train the meta-estimator, we create them with k-folds: training and test subsets are generated, we train the model on the training folds and predict values for the held-out fold, as follows:

kf = KFold(20, shuffle=True)
xtrain['en'] = 0.0  # column 8 will hold out-of-fold ElasticNet predictions
for train_index, test_index in kf.split(xtrain):
    # fit on the 8 original attributes (everything except the 'en' column)
    en.fit(xtrain.iloc[train_index, :-1], ytrain.iloc[train_index])
    xtrain.iloc[test_index, 8] = en.predict(xtrain.iloc[test_index, :-1])

We do the same for the other model.

kf = KFold(20, shuffle=True)
xtrain['knn'] = 0.0  # column 9 will hold out-of-fold kNN predictions

for train_index, test_index in kf.split(xtrain):
    # fit on the 8 original attributes (excluding the 'en' and 'knn' columns)
    knn.fit(xtrain.iloc[train_index, :-2], ytrain.iloc[train_index])
    xtrain.iloc[test_index, 9] = knn.predict(xtrain.iloc[test_index, :-2])

xtrain.head()
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude en knn
7061 4.1312 35.0 5.882353 0.975490 1218.0 2.985294 33.93 -118.02 2.208931 2.108000
14689 2.8631 20.0 4.401210 1.076613 999.0 2.014113 32.79 -117.09 1.705684 1.809200
17323 4.2026 24.0 5.617544 0.989474 731.0 2.564912 34.59 -120.14 2.098392 1.683200
10056 3.1094 14.0 5.869565 1.094203 302.0 2.188406 39.26 -121.00 1.694140 1.792000
15750 3.3068 52.0 4.801205 1.066265 1526.0 2.298193 37.77 -122.45 2.194403 2.388002
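As an aside, scikit-learn's cross_val_predict builds the same out-of-fold columns more concisely; a sketch equivalent to the two loops above:

from sklearn.model_selection import cross_val_predict

base_cols = list(xtrain.columns[:8])  # the original attributes only
xtrain['en'] = cross_val_predict(en, xtrain[base_cols], ytrain, cv=kf)
xtrain['knn'] = cross_val_predict(knn, xtrain[base_cols], ytrain, cv=kf)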

Now that we have created the features, let's evaluate the models on the raw data, without the stacked features, to check their individual performance:

from sklearn.metrics import mean_squared_error
en.fit(xtrain.iloc[:,:-2], ytrain)
ypred_en = en.predict(xtest)
print(mean_squared_error(ytest, ypred_en))

knn.fit(xtrain.iloc[:,:-2], ytrain)
ypred_knn = knn.predict(xtest)
print(mean_squared_error(ytest, ypred_knn))
0.7562926012142382
1.136942049088978

Now we create the features with the trained models for the test data:

xtest['en'] = ypred_en
xtest['knn'] = ypred_knn
xtest.head()
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude en knn
20046 1.6812 25.0 4.192201 1.022284 1392.0 3.877437 36.06 -119.01 1.470084 1.6230
3024 2.5313 30.0 5.039384 1.193493 1565.0 2.679795 35.14 -119.46 1.744788 1.0822
15663 3.4801 52.0 3.977155 1.185877 1310.0 1.360332 37.80 -122.44 2.233643 2.8924
20484 5.7376 17.0 6.163636 1.020202 1705.0 3.444444 34.28 -118.72 2.413336 2.2456
9814 3.7250 34.0 5.492991 1.028037 1063.0 2.483645 36.62 -121.93 2.088660 1.6690

With the stacked attributes in hand, we now train two models: one without them, for comparison, and another with them. Let's compare the results:

# Without stacked features
gbm.fit(xtrain.iloc[:,:-2], ytrain.values,
        eval_set=[(xtest.iloc[:,:-2],ytest.values)],
        early_stopping_rounds=20,
        verbose=False)
ypred = gbm.predict(xtest.iloc[:,:-2])
print("Without stacked features", mean_squared_error(ytest, ypred))
# With stacked features
gbm.fit(xtrain, ytrain.values,
        eval_set=[(xtest,ytest.values)],
        early_stopping_rounds=20,
        verbose=False)
ypred = gbm.predict(xtest)
print("With stacked features", mean_squared_error(ytest, ypred))
Without stacked features 0.5828429815199971
With stacked features 0.5359477372727965

We obtained a significant improvement using the stacked attributes. In conclusion, our meta-estimator learns the best way to combine the predictions of the other estimators, learning their generalization errors and how to correct them, ensuring much better generalization.
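For reference, newer versions of scikit-learn (0.22+, released after this post) ship this whole procedure as StackingRegressor; a rough equivalent of what we built by hand, using the same base models, might look like:

from sklearn.ensemble import StackingRegressor

stack = StackingRegressor(
    estimators=[('en', ElasticNet()), ('knn', KNeighborsRegressor())],
    final_estimator=XGBRegressor(n_jobs=-1, objective='reg:squarederror'),
    cv=20,             # out-of-fold predictions, like our KFold loops
    passthrough=True,  # also feed the raw attributes to the meta-estimator
)
stack.fit(xtrain.iloc[:, :8], ytrain)
print(mean_squared_error(ytest, stack.predict(xtest.iloc[:, :8])))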

References

[1] WOLPERT, David H. Stacked generalization. Neural Networks, v. 5, n. 2, p. 241-259, 1992. https://www.sciencedirect.com/science/article/pii/S0893608005800231

[2] HASTIE, Trevor et al. The Elements of Statistical Learning: Data Mining, Inference and Prediction. p. 252, 2005.