Stacked Generalization
2019-04-22
Introduced by Wolpert in 1992 [1], this generalization technique consists of combining estimators, non-linearly, to correct their biases with respect to a given training set, adding up their individual capabilities for better predictions [2].
In a previous post I presented the linear combination of estimators: we fit \(N\) models to a dataset \(D\) and define, a priori, weights \(W\) for them, combining everything in a single summation:
$$\sum_{i=1}^{N} w_{i}M_{i}$$
$$\text{given a priori} \quad W = (w_1, w_2, \dots, w_N) \quad \text{and} \quad \sum W = 1$$
With this, the weighted average of the predictions will in general be less biased in certain regions and may generalize better, but the method has two limitations: the weights cannot be changed after we look at the performance (otherwise we would be acting as a meta-estimator on the test data), and it is an extremely simple combination, which does not exploit the strengths of the \(M_i\) estimators in particular regions.
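To make the fixed-weight combination concrete, here is a minimal sketch of my own (toy data and arbitrary models, not part of the original post): three regressors fitted to the same data and averaged with weights chosen a priori.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso
from sklearn.neighbors import KNeighborsRegressor

# toy data and arbitrary base models, just to illustrate the formula above
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
models = [Ridge().fit(X, y), Lasso().fit(X, y), KNeighborsRegressor().fit(X, y)]

# weights W fixed a priori and summing to 1
weights = np.array([0.5, 0.3, 0.2])

# weighted average of the predictions: sum_i w_i * M_i(x)
predictions = np.column_stack([m.predict(X) for m in models])
combined = predictions @ weights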
Wolpert then proposes an alternative: what if we turn the weights \(W\) into a learning problem? Or rather, what if we not only learn how to combine our predictions but also combine them nonlinearly using a meta-estimator?
Meta-estimators are models that use base models, combining or selecting them to improve a performance metric. For example, when you, the reader, decide between a random forest and a logistic regression for your problem, you are acting as a meta-estimator. But here the generalization problem arises: if you keep tuning your regression or random forest you may end up overfitting the data and failing to generalize, so cross-validation is needed to select the model. The same will apply to stacking.
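As a small illustration of that selection step (my own sketch, with arbitrary candidates; I use a linear regression instead of a logistic one since this is a regression problem), cross-validation lets us compare models without touching the test data:

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = fetch_california_housing(return_X_y=True)

# acting as a "meta-estimator" by hand: keep the candidate with the best
# cross-validated error, leaving the test set untouched
for model in (LinearRegression(), RandomForestRegressor(n_estimators=50)):
    scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
    print(type(model).__name__, -scores.mean())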
For stacking it is ideal that the dataset is relatively large; the author's advice is at least one thousand records. We start our example by loading a relatively large dataset (about 20,000 records) whose features are attributes of California houses and whose target is their price; we will use the data as provided, without any further transformation.
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
import pandas as pd
dataset = fetch_california_housing()
df = pd.DataFrame(data=dataset.data, columns=dataset.feature_names)
df['Price'] = dataset.target
df.head()
| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | Price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |
Here we split the data into training and test sets in a (pseudo)random way, so that we can evaluate performance at the end.
xtrain, xtest, ytrain, ytest = \
    train_test_split(df.drop('Price', axis=1), df.Price, test_size=.3,
                     random_state=42)
xtrain.head()
| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 7061 | 4.1312 | 35.0 | 5.882353 | 0.975490 | 1218.0 | 2.985294 | 33.93 | -118.02 |
| 14689 | 2.8631 | 20.0 | 4.401210 | 1.076613 | 999.0 | 2.014113 | 32.79 | -117.09 |
| 17323 | 4.2026 | 24.0 | 5.617544 | 0.989474 | 731.0 | 2.564912 | 34.59 | -120.14 |
| 10056 | 3.1094 | 14.0 | 5.869565 | 1.094203 | 302.0 | 2.188406 | 39.26 | -121.00 |
| 15750 | 3.3068 | 52.0 | 4.801205 | 1.066265 | 1526.0 | 2.298193 | 37.77 | -122.45 |
We now import KFold for cross-validation, so that we don't "lose" too much data, along with the models that will be used. There is no rule of thumb for the base models, it is up to you, but boosted trees are commonly applied as the meta-estimator. Here I arbitrarily chose kNN and ElasticNet as base models, and xgboost as the meta-estimator.
from sklearn.linear_model import ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import KFold
from xgboost import XGBRFRegressor
en = ElasticNet()
knn = KNeighborsRegressor()
# we will use early stopping later to avoid overfitting
gbm = XGBRFRegressor(n_jobs=-1, objective='reg:squarederror')
Now we begin creating the stacked attributes. To avoid biasing the meta-estimator and to avoid leaving it with too little data to train on, we generate them by k-fold: for each fold we train the model on the training subset and predict the values for the held-out subset, as follows:
kf = KFold(20, shuffle=True)
xtrain['en'] = 0.0  # new column (position 8) to hold ElasticNet's out-of-fold predictions
for train_index, test_index in kf.split(xtrain):
    # fit on the fold's training part, using only the original attributes (:-1 excludes 'en')
    en.fit(xtrain.iloc[train_index, :-1], ytrain.iloc[train_index])
    # predict the held-out part and store it in the 'en' column
    xtrain.iloc[test_index, 8] = en.predict(xtrain.iloc[test_index, :-1])
We do the same for the other model.
kf = KFold(20, shuffle=True)
xtrain['knn'] = 0.0  # new column (position 9) for kNN's out-of-fold predictions
for train_index, test_index in kf.split(xtrain):
    # :-2 excludes both stacked columns ('en' and 'knn') from the features
    knn.fit(xtrain.iloc[train_index, :-2], ytrain.iloc[train_index])
    xtrain.iloc[test_index, 9] = knn.predict(xtrain.iloc[test_index, :-2])
xtrain.head()
| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | en | knn |
|---|---|---|---|---|---|---|---|---|---|---|
| 7061 | 4.1312 | 35.0 | 5.882353 | 0.975490 | 1218.0 | 2.985294 | 33.93 | -118.02 | 2.208931 | 2.108000 |
| 14689 | 2.8631 | 20.0 | 4.401210 | 1.076613 | 999.0 | 2.014113 | 32.79 | -117.09 | 1.705684 | 1.809200 |
| 17323 | 4.2026 | 24.0 | 5.617544 | 0.989474 | 731.0 | 2.564912 | 34.59 | -120.14 | 2.098392 | 1.683200 |
| 10056 | 3.1094 | 14.0 | 5.869565 | 1.094203 | 302.0 | 2.188406 | 39.26 | -121.00 | 1.694140 | 1.792000 |
| 15750 | 3.3068 | 52.0 | 4.801205 | 1.066265 | 1526.0 | 2.298193 | 37.77 | -122.45 | 2.194403 | 2.388002 |
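As an aside, not part of the original workflow: the two loops above can be written more compactly with scikit-learn's cross_val_predict, which returns exactly these out-of-fold predictions.

from sklearn.model_selection import cross_val_predict

# same idea as the loops above, one call per base model;
# pass cv=KFold(20, shuffle=True) to mirror the folds used earlier
base_features = xtrain.iloc[:, :8]   # the 8 original attributes
xtrain['en'] = cross_val_predict(ElasticNet(), base_features, ytrain, cv=20)
xtrain['knn'] = cross_val_predict(KNeighborsRegressor(), base_features, ytrain, cv=20)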
Now that we have created the stacked features, let's evaluate the base models on the raw data, without the stacked features, to check their performance:
from sklearn.metrics import mean_squared_error

# ElasticNet refit on the full training set (original attributes only, excluding 'en' and 'knn')
en.fit(xtrain.iloc[:, :-2], ytrain)
ypred_en = en.predict(xtest)
print(mean_squared_error(ytest, ypred_en))

# kNN refit the same way
knn.fit(xtrain.iloc[:, :-2], ytrain)
ypred_knn = knn.predict(xtest)
print(mean_squared_error(ytest, ypred_knn))
0.7562926012142382
1.136942049088978
Now we create the same features for the test data, using the models trained on the full training set:
xtest['en'] = ypred_en
xtest['knn'] = ypred_knn
xtest.head()
| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | en | knn |
|---|---|---|---|---|---|---|---|---|---|---|
| 20046 | 1.6812 | 25.0 | 4.192201 | 1.022284 | 1392.0 | 3.877437 | 36.06 | -119.01 | 1.470084 | 1.6230 |
| 3024 | 2.5313 | 30.0 | 5.039384 | 1.193493 | 1565.0 | 2.679795 | 35.14 | -119.46 | 1.744788 | 1.0822 |
| 15663 | 3.4801 | 52.0 | 3.977155 | 1.185877 | 1310.0 | 1.360332 | 37.80 | -122.44 | 2.233643 | 2.8924 |
| 20484 | 5.7376 | 17.0 | 6.163636 | 1.020202 | 1705.0 | 3.444444 | 34.28 | -118.72 | 2.413336 | 2.2456 |
| 9814 | 3.7250 | 34.0 | 5.492991 | 1.028037 | 1063.0 | 2.483645 | 36.62 | -121.93 | 2.088660 | 1.6690 |
With the stacked attributes in hand, we now train the meta-estimator twice, once without them, for comparison, and once with them, and compare the results:
# Without stacked features
gbm.fit(xtrain.iloc[:, :-2], ytrain.values,
        eval_set=[(xtest.iloc[:, :-2], ytest.values)],
        early_stopping_rounds=20,
        verbose=False)
ypred = gbm.predict(xtest.iloc[:, :-2])
print("Without stacked features", mean_squared_error(ytest, ypred))

# With stacked features
gbm.fit(xtrain, ytrain.values,
        eval_set=[(xtest, ytest.values)],
        early_stopping_rounds=20,
        verbose=False)
ypred = gbm.predict(xtest)
print("With stacked features", mean_squared_error(ytest, ypred))
Without stacked features 0.5828429815199971
With stacked features 0.5359477372727965
We got a significant improvement using the "stacked" attributes. In conclusion, the meta-estimator learns the best way to combine the predictions of the other estimators, learning their generalization errors and how to correct them, which results in much better generalization.
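For completeness: scikit-learn versions released after this post (0.22+) package this whole recipe as StackingRegressor. A rough sketch with the same base models and meta-estimator, where passthrough=True also feeds the original attributes to the meta-estimator as we did by hand:

from sklearn.ensemble import StackingRegressor

stack = StackingRegressor(
    estimators=[('en', ElasticNet()), ('knn', KNeighborsRegressor())],
    final_estimator=XGBRFRegressor(n_jobs=-1, objective='reg:squarederror'),
    cv=20,              # builds the out-of-fold predictions internally
    passthrough=True)   # meta-estimator sees original attributes + base predictions
stack.fit(xtrain.iloc[:, :8], ytrain)
print(mean_squared_error(ytest, stack.predict(xtest.iloc[:, :8])))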
References
1. WOLPERT, David H. Stacked generalization. Neural Networks, v. 5, n. 2, p. 241-259, 1992. https://www.sciencedirect.com/science/article/pii/S0893608005800231
2. HASTIE, Trevor et al. The Elements of Statistical Learning: Data Mining, Inference and Prediction. p. 252, 2005.