Building Machine Learning Ensembles in Python

Ensemble learning is a technique used in machine learning wherein multiple models (sometimes called experts or base learners) are combined into an ensemble to improve predictive performance. Once trained, each individual model in the ensemble produces a prediction on an unseen data point, and these predictions are then aggregated in some way. For regression problems, the aggregation is typically the arithmetic mean of the predictions, while for classification problems it is usually the mode, i.e. the most frequently predicted class. The quintessential example of an ensemble model is the random forest: a multitude of decision tree classifiers are trained on different subsets of the training data (so that the trees' split points differ from model to model) and combined into a single ensemble to make predictions.
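As a minimal sketch, the two aggregation rules described above boil down to a few lines of plain Python (the member predictions here are made up for illustration):

import statistics

# Hypothetical predictions from individual ensemble members on one data point.
regression_preds = [2.4, 2.9, 2.6]  # outputs from three regression members
class_votes = [1, 0, 1, 1]          # class labels predicted by four classifiers

# Regression: aggregate with the arithmetic mean.
print(sum(regression_preds) / len(regression_preds))  # ~2.63

# Classification: aggregate with the mode (the most predicted class).
print(statistics.mode(class_votes))  # 1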

Ensemble methods provide an effective way to learn in environments with too much data as well as too little data. The 'majority vote' technique used to aggregate ensemble member predictions is also effective at reducing generalization error, because it is unlikely that every member of the ensemble will make the same errors on the test set or on unseen data.
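A quick simulation makes this concrete. Assuming each member errs independently with the same probability (a simplification, since real members' errors are correlated), the majority vote errs far less often than any single member:

import random

random.seed(0)
p, members, trials = 0.3, 11, 10000  # per-member error rate, ensemble size, trials

# Count the trials in which more than half of the members are wrong,
# i.e. the trials in which the majority vote itself is wrong.
majority_wrong = sum(
    sum(random.random() < p for _ in range(members)) > members // 2
    for _ in range(trials)
)
print(majority_wrong / trials)  # roughly 0.08, well below the 0.3 member rate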

PyML-Ensemble

This project uses a very basic framework called PyML-Ensemble. PyML-Ensemble provides a framework to create adaptable machine learning models. The ensembles are adaptable in that models can be added to or removed from the ensemble at any time, even after training is complete. This allows the creation of real-time ensembles that are retrained or otherwise modified as more data become available. Some examples of adaptability in ensemble methods are provided below.

The code for this project can be found on GitHub and is open to contributions to extend or fix its functionality.

Inspiration and Overview

Creating and using ensembles is fairly straightforward: create n subsets of the training data, train n different models, collect the models into an assembly (e.g. a Python list), pass each data point into every model and record its prediction, and finally aggregate the predictions in some way to get the ensemble's prediction.
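For reference, here is a minimal, framework-free sketch of that workflow using scikit-learn's decision tree as the base model (the helper names here are made up for illustration):

import statistics

from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# Train n models, each on a random bootstrap subset of the training data.
def train_ensemble(trainx, trainy, n=10):
    models = []
    for _ in range(n):
        subx, suby = resample(trainx, trainy)  # sample with replacement
        models.append(DecisionTreeClassifier().fit(subx, suby))
    return models

# Pass a data point to every model and aggregate the votes with the mode.
def ensemble_predict(models, x):
    votes = [model.predict([x])[0] for model in models]
    return statistics.mode(votes)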

In a past research project, Adaptive Deep Learning Ensembles (ADLE), I used ensemble learning in an attempt to improve the forecasting accuracy of neural networks trained on non-stationary time-series data. To do this, I used the Python programming language along with the typical suite of data science and machine learning libraries (e.g. Keras, NumPy, Pandas). Recently, I took another look at the project and found that managing the ensembles in plain Python, although straightforward, was a bit of a mess. Because of this, I decided to create an easy-to-use framework for building ensembles of any type of learning algorithm in Python. The framework provides some basic functionality for working with ensembles and a few abstract base classes (ABCs) that can be used to create custom aggregation techniques and base models.

The two ABCs provided by the framework are the Aggregator and Model classes.

import abc

class Aggregator(abc.ABC):
    def __init__(self):
        pass

    @abc.abstractmethod
    def combine(self, predictions):
        pass

As seen here, the Aggregator ABC requires that any derived aggregator class provide only a combine() method, which takes as input the predictions from each ensemble member and outputs the combined (aggregated) prediction.
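As an example of how little is needed to satisfy this interface, below is a hypothetical custom aggregator; the import path is assumed to mirror the examples later in this post.

import statistics

from pyml_ensemble.aggregator import Aggregator  # assumed import path

# A hypothetical aggregator that combines member predictions with the median,
# which is more robust to a single badly-wrong member than the mean.
class MedianAggregator(Aggregator):
    def combine(self, predictions):
        return statistics.median(predictions)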

import abc

class Model(abc.ABC):
    def __init__(self):
        pass

    @abc.abstractmethod
    def train(self, x, y):
        pass

    @abc.abstractmethod
    def get_prediction(self, x):
        pass

Similarly, the Model ABC forces any subclass to provide methods to create and manage a model. Subclasses must be able to train the underlying model given some training inputs (x) and targets (y) and to produce a prediction given additional input data.
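As an illustration, a scikit-learn estimator can be wrapped to conform to this interface in just a few lines. This KNNModel is a sketch, not part of the framework:

from sklearn.neighbors import KNeighborsClassifier

from pyml_ensemble.model import Model  # import path used later in this post

# A hypothetical model wrapping scikit-learn's k-nearest-neighbors classifier.
class KNNModel(Model):
    def __init__(self, k=5):
        super().__init__()
        self.knn = KNeighborsClassifier(n_neighbors=k)

    def train(self, x, y):
        self.knn.fit(x, y)

    def get_prediction(self, x):
        return self.knn.predict(x)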

Although simple, these two classes provide a basic framework that keeps the development of an ensemble focused on the task at hand. Furthermore, basic implementations of both the Model and Aggregator classes are provided by the framework. The implementations of the Aggregator ABC are the MeanAggregator, which averages the predictions (best suited to regression tasks), and the ModeAggregator, which uses the most frequently predicted value as the ensemble's prediction (created with classification tasks in mind). The framework provides one built-in model: the decision tree. This TreeModel is built on top of scikit-learn's decision tree model, which, unfortunately, forces the installation of scikit-learn. To keep dependencies limited, no other models were built into the framework.

Below, example usage of these base classes, the built-in implementations, and the Ensemble class is provided.

Creating a Basic Ensemble

Random Forest

Perhaps the most popular ensemble is the random forest: a collection (ensemble) of classification or regression trees trained on different subsets of the data which, when combined, typically yields more predictive power than a single decision tree. Although random forests are already implemented in popular machine learning libraries such as scikit-learn, building one using only the built-in TreeModel from pyml_ensemble makes a good first example. The breast cancer dataset, which is built into scikit-learn, will be used for training and testing.

For starters, the classes from pyml_ensemble and some helper methods from scikit-learn need to be imported.

# methods to manage the ensemble
from pyml_ensemble import Ensemble  
# classification task aggregator
from pyml_ensemble.aggregator import ModeAggregator 
# the tree model built from sklearn's DecisionTreeClassifier
from pyml_ensemble.model import TreeModel

# the built-in breast cancer dataset
from sklearn.datasets import load_breast_cancer
# easily split train and test data
from sklearn.model_selection import train_test_split
# determine how well the ensemble performs 
from sklearn import metrics

With the required classes imported, the ensemble can be built and trained and predictions can be found for the out-of-sample test data.

if __name__ == '__main__':
    bc = load_breast_cancer()  # load the dataset

    # split the data into test and train sets
    bc_trainx, bc_testx, bc_trainy, bc_testy = \
        train_test_split(bc.data, bc.target, test_size=0.33)

    # create an ensemble and set the aggregator to use the
    # most predicted class
    ensemble = Ensemble()
    ensemble.set_aggregator(ModeAggregator())

    # add 10 models to the ensemble
    number_models = 10
    for i in range(number_models):
        ensemble.add_model(TreeModel())
   
    # train the models, all on the same data for now
    ensemble.train([bc_trainx for _ in range(number_models)], \
                   [bc_trainy for _ in range(number_models)])

    # get the predictions from the ensemble, this method
    # uses the aggregator previously set on the ensemble 
    y_hat = ensemble.predict(bc_testx)

    # print the accuracy of the ensemble
    print(metrics.accuracy_score(bc_testy, y_hat))

As seen here, in just a few lines of code an ensemble of 10 decision trees (a random forest) can be created to make predictions on the dataset. In the following sections, more intricate ensembles and their implementations in this framework will be explored.

Neural Network Ensemble

Ensembles of more expressive base models are sometimes necessary for complicated machine learning tasks. In this section, a neural network model is created with the framework's Model ABC. This model is then used to create ensembles, each with a different number of neural network constituents, and the results of these ensembles are then compared.

To begin, the artificial neural network model (ANNModel) is created. Keras was used to build this model, which is why it is not included in the framework: doing so would require a slew of other large dependencies (e.g. TensorFlow and NumPy).

from keras.models import Sequential
from keras.layers import Dense, Activation

from pyml_ensemble.model import Model

class ANNModel(Model):  # "extends" the ABC
    def __init__(self, input_size, num_hidden_layers, hidden_layer_sizes,
                output_size, epochs=50, batch_size=1, fit_verbose=2,
                variables=None, weight_file=''):
        super().__init__()
        self.input_size = input_size
        self.num_hidden_layers = num_hidden_layers
        self.hidden_layer_sizes = hidden_layer_sizes
        self.output_size = output_size
        self.epochs = epochs
        self.batch_size = batch_size
        self.verbose = fit_verbose

        self.weight_file = weight_file

        self.build_model()

    def build_model(self):
        self.model = Sequential()
        # first hidden layer (also defines the network's input shape)
        self.model.add(Dense(self.hidden_layer_sizes[0],
                             input_shape=(self.input_size, ),
                             activation='sigmoid'))
        # remaining hidden layers
        for size in self.hidden_layer_sizes[1:]:
            self.model.add(Dense(size, activation='sigmoid'))
        # output layer
        self.model.add(Dense(self.output_size, activation='sigmoid'))
        self.model.compile(loss='mean_squared_error', optimizer='adam')

    def train(self, x, y):
        self.history = self.model.fit(x, y, epochs=self.epochs,
                                      batch_size=self.batch_size,
                                      verbose=self.verbose, shuffle=False)

    def get_prediction(self, x):
        return self.model.predict(x)

    def load_weights(self):
        self.model.load_weights(self.weight_file)

    def save_weights(self):
        self.model.save_weights(self.weight_file)

    def set_weight_filename(self, filename):
        self.weight_file = filename

The model is very straightforward. The input size, number of hidden layers, number of neurons per hidden layer, and output size are required as input and are used to build the neural network. In this case, the hidden layers and the output layer use sigmoidal activations to clamp values between 0 and 1. This is probably not the best way to build the network for classification tasks (especially the output layer and loss function) but works well enough for this quick example.
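As an aside, a more conventional binary-classification setup would look something like the sketch below (ReLU hidden layers, a single sigmoid output, and binary cross-entropy loss); this is not part of the framework or of the example that follows.

from keras.models import Sequential
from keras.layers import Dense

# A sketch of a more conventional binary classifier than the one above.
def build_binary_classifier(input_size, hidden_layer_sizes):
    model = Sequential()
    model.add(Dense(hidden_layer_sizes[0], input_shape=(input_size, ),
                    activation='relu'))
    for size in hidden_layer_sizes[1:]:
        model.add(Dense(size, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))  # probability of the positive class
    model.compile(loss='binary_crossentropy', optimizer='adam')
    return model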

Next, the ensemble is created using this model:

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn import metrics
import numpy as np

from pyml_ensemble import Ensemble
from pyml_ensemble.aggregator import MeanAggregator

from ann_model import ANNModel

ensemble = Ensemble()
aggregator = MeanAggregator()
ensemble.set_aggregator(aggregator)

bc = load_breast_cancer()
trainx, testx, trainy, testy = train_test_split(bc.data, bc.target, test_size=0.33)

num_models = 5
input_size = trainx.shape[1]
output_size = 1     # class
num_hidden_layers = 5  # 5 hidden layers
hidden_layer_sizes = 10  # 10 nodes per hidden layer
for i in range(num_models):
    ann = ANNModel(input_size, num_hidden_layers, \
                   [hidden_layer_sizes for _ in range(num_hidden_layers)], \
                   output_size, epochs=512, batch_size=8, fit_verbose=0)
    ensemble.add_model(ann)

ensemble.train([trainx for _ in range(num_models)], [trainy for _ in range(num_models)])
y_hat = ensemble.predict(testx)    # get predictions
y_hat = np.round(y_hat)   # set class to 0 or 1
print(metrics.accuracy_score(testy, y_hat))

With the ensemble created, the num_models parameter is set to 1, 5, or 10 and, for each setting, the ensemble is trained, predictions are made, and the accuracy is measured. A comparison of the results is shown below; the main advantage of ensemble learning is apparent, as adding more networks increases predictive accuracy, even on this small dataset. However, the improvement diminishes as still more networks are added. Note that a larger single network (more hidden layers with more nodes) could also have been used but may not have given results as good and could, potentially, take longer to train.

Number of Networks    Accuracy (% Predicted Correctly)
1                     88.83%
5                     92.71%
10                    92.77%

Results from creating an ensemble with 1, 5, and 10 neural network models. Each configuration is trained and evaluated 10 times; the average accuracy of these independent runs is reported above.

Adaptive Ensembles

For some applications, e.g. complex time-series forecasting, it is beneficial, and in some cases necessary, to have an adaptive ensemble. Two examples of adaptive ensembles are i) an ensemble that is retrained as new training data become available and ii) an ensemble where new models are added and trained on new training data. These two types of ensembles allow the models to remain strong predictors of the dataset as, for example, the statistical properties of the underlying data-generating function change, as is the case with non-stationary time-series data. PyML-Ensemble was created for just these kinds of ensembles, and examples of both are shown below. Note that, although these are the only types of dynamic ensembles shown, the real purpose of this section is to show how this framework can be used to create any type of dynamic ensemble rather than being stuck with a static set of trained models.

For simplicity’s sake, the ensembles are still trained using the built-in breast cancer dataset.

In the first adaptive ensemble, the training and test data are split into smaller subsets. Each network in the ensemble is trained on a subset of the data. Test data are predicted in chunks rather than en masse and, as new data become available (i.e. old test data are used as new training data, which could be done in practice), new networks are added to the ensemble in an attempt to improve predictive accuracy. Comparisons of performance are not presented here, due to the length of this post and because this setup is strictly demonstrative; it probably shouldn't be done on this dataset in the real world.

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn import metrics
import numpy as np

from pyml_ensemble import Ensemble
from pyml_ensemble.aggregator import MeanAggregator

from ann_model import ANNModel

# returns the model to be added to the ensemble
def get_ann_model(input_size, num_hidden_layers, hidden_layer_sizes, output_size):
    return ANNModel(input_size, num_hidden_layers, \
                      [hidden_layer_sizes for _ in range(num_hidden_layers)], \
                      output_size, epochs=1024, batch_size=4, fit_verbose=0)

ensemble = Ensemble()
aggregator = MeanAggregator()
ensemble.set_aggregator(aggregator)

bc = load_breast_cancer()
trainx, testx, trainy, testy = train_test_split(bc.data, bc.target, test_size=0.33)

num_models = 5
chunk_size = 80 # ~80 rows of data per network (all except last network)
trainx_chunks = [trainx[(i*chunk_size):((i+1)*chunk_size)] for i in range(num_models)]
trainy_chunks = [trainy[(i*chunk_size):((i+1)*chunk_size)] for i in range(num_models)]

input_size = trainx.shape[1]
output_size = 1     # class
num_hidden_layers = 5
hidden_layer_sizes = 10
for i in range(num_models):
    ann = get_ann_model(input_size, num_hidden_layers, hidden_layer_sizes, output_size)
    ensemble.add_model(ann)

# train on individual chunks rather than on the same data
ensemble.train(trainx_chunks, trainy_chunks)

# Iterate the test data. Get predictions and use the new chunks of data to add
# members to the ensemble.
#
# Unfortunately, the entire ensemble has to be re-trained when adding a
# new member. Training individual members is slated for a future release.
y_hat = None
for i in range(0, testx.shape[0], chunk_size):
    # get the current data subsets
    testx_chunk = testx[i:(i+chunk_size)]
    testy_chunk = testy[i:(i+chunk_size)]
    preds = ensemble.predict(testx_chunk) # get predictions
    y_hat = preds if y_hat is None else np.vstack([y_hat, preds])

    # create, add, and train the new ensemble member
    ann = get_ann_model(input_size, num_hidden_layers, hidden_layer_sizes, output_size)
    ensemble.add_model(ann)
    trainx_chunks.append(testx_chunk)
    trainy_chunks.append(testy_chunk)
    ensemble.train(trainx_chunks, trainy_chunks)

y_hat = np.round(y_hat)
print(metrics.accuracy_score(testy, y_hat))

In the next example of an adaptive ensemble, all of the neural networks in an ensemble will be trained on the entirety of the training dataset. Then subsets of the test dataset will be predicted and this newly seen data will then be added to the training dataset. All of the ensemble members will be re-trained using the newly available data combined with the pre-existing training data.

Essentially, this type of adaptation just updates the models as new training data becomes available. For example, an ensemble could be trained on historic stock price data and, as time passes with the ensemble in production, new price data can be recorded. This new price data can then be used to retrain the networks in the ensemble, potentially improving their predictive capabilities.

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn import metrics
import numpy as np

from pyml_ensemble import Ensemble
from pyml_ensemble.aggregator import MeanAggregator

from ann_model import ANNModel

# get model for ensemble
def get_ann_model(input_size, num_hidden_layers, hidden_layer_sizes, output_size):
    return ANNModel(input_size, num_hidden_layers, \
                      [hidden_layer_sizes for _ in range(num_hidden_layers)], \
                      output_size, epochs=1024, batch_size=4, fit_verbose=0)

# create ensemble structure
ensemble = Ensemble()
aggregator = MeanAggregator()
ensemble.set_aggregator(aggregator)

# load and split data
bc = load_breast_cancer()
trainx, testx, trainy, testy = train_test_split(bc.data, bc.target, test_size=0.33)

# build the ensemble by adding neural network models
num_models = 3
input_size = trainx.shape[1]
output_size = 1     # class
num_hidden_layers = 5
hidden_layer_sizes = 10
for i in range(num_models):
    ann = get_ann_model(input_size, num_hidden_layers, hidden_layer_sizes, output_size)
    ensemble.add_model(ann)

# train all available data
ensemble.train([trainx for _ in range(num_models)], [trainy for _ in range(num_models)])

# retrain on old + new data every 50 test data points
chunk_size = 50
y_hat = None
for i in range(0, testx.shape[0], chunk_size):
    # get the predictions and test data chunks
    testx_chunk = testx[i:(i+chunk_size)]
    testy_chunk = testy[i:(i+chunk_size)]
    preds = ensemble.predict(testx_chunk)
    y_hat = preds if y_hat is None else np.vstack([y_hat, preds])

    # append the newly available data to the training sets
    trainx = np.concatenate([trainx, testx_chunk])
    trainy = np.concatenate([trainy, testy_chunk])

    # retrain all existing members on the old + new data
    ensemble.train([trainx for _ in range(num_models)], [trainy for _ in range(num_models)])

y_hat = np.round(y_hat)
print(metrics.accuracy_score(testy, y_hat))

Chimera Ensembles

This section demonstrates the inclusion of different types of models in a single ensemble, aptly named a chimera ensemble. I'm not familiar with any good reasons to use ensembles this way, but it can easily be done with the PyML-Ensemble framework if need be.

The familiar TreeModel and ANNModel from above are used in the chimera ensemble. One small modification is made to the ANNModel to get the predictions into the right format (to match the other models) when predicting with the ensemble. This modification is shown, in short, below.

...
import numpy as np
...
class ANNModel(Model):
    ...
    def get_prediction(self, x):
        preds = np.round(self.model.predict(x)).tolist()
        return [pred[0] for pred in preds]

The prediction from the neural network is now rounded prior to being returned to the ensemble's aggregator. Due to how the ModeAggregator works, the list of predictions also has to be reshaped before being returned. Again, this isn't the best way to do classification with neural networks but is done here for simplicity's sake.

A new classifier is also created to be added alongside the neural network and decision tree models:

from pyml_ensemble.model import Model
import random

class UselessClassifier(Model):
    def __init__(self, num_classes):
        super().__init__()
        self.num_classes = num_classes

    def train(self, x, y):
        pass    # no training needed

    def get_prediction(self, x):
        # Randomly select a class. Assumes classes are labeled 0 through n-1 and
        # a list or numpy array is used as input.
        return [random.randint(0, self.num_classes-1) for _ in range(len(x))]

Clearly, as indicated by the name, the UselessClassifier doesn't really have a purpose. However, it is included to demonstrate how any classifier conforming to the Model ABC can easily be used in conjunction with other models in the ensemble. It also illustrates how, using the provided framework, a machine learning practitioner can stray from the beaten path (i.e. neural networks, trees, etc.) and create any type of classifier they deem fit.

With this modification and a new model the ensemble is created:

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn import metrics
import numpy as np

from pyml_ensemble import Ensemble
from pyml_ensemble.aggregator import ModeAggregator
from pyml_ensemble.model import TreeModel

from ann_model import ANNModel
from useless import UselessClassifier

# split train/test data 
bc = load_breast_cancer()
trainx, testx, trainy, testy = \
    train_test_split(bc.data, bc.target, test_size=0.33)

# set up the ensemble 
ensemble = Ensemble()
ensemble.set_aggregator(ModeAggregator())

num_models = 3
ensemble.add_model(TreeModel())  # add a decision tree classifier
# add a neural network
ensemble.add_model(ANNModel(trainx.shape[1], 5, [10 for _ in range(5)], \
                      1, epochs=512, batch_size=8, fit_verbose=0))
ensemble.add_model(UselessClassifier(2))  # add a useless classifier: 2 = two classes

# train the ensemble
ensemble.train([trainx for _ in range(num_models)], \
               [trainy for _ in range(num_models)])

# get predictions 
y_hat = ensemble.predict(testx)
print(metrics.accuracy_score(testy, y_hat))

Conclusion

In this post, the PyML-Ensemble framework has been introduced and described. Links to the PyPI page and the GitHub repository have also been provided for those interested. The examples of the framework's usage given here are really only the tip of the iceberg for such an open-ended framework. The goal of this project is to provide a basic framework that handles the busy-work behind building and managing machine learning ensembles. In future releases, I plan to extend this functionality by providing additional built-in models and aggregators while keeping dependencies to a minimum. There are also a few conveniences that I would like to add to ensemble prediction and training.
