Chapter 12. Model Selection

12.0 Introduction

In machine learning, we use training algorithms to learn the parameters of a model by minimizing some loss function. In addition, many learning algorithms (e.g., support vector classifiers and random forests) have hyperparameters that must be defined outside of the learning process. For example, random forests are collections of decision trees (hence the word forest), but the number of decision trees in the forest is not learned by the algorithm and must be set prior to fitting. Selecting these values is often referred to as hyperparameter tuning, hyperparameter optimization, or model selection. Additionally, we might want to try multiple learning algorithms (for example, trying both a support vector classifier and a random forest to see which learning method produces the best model).

While there is widespread variation in the use of terminology in this area, in this book we refer to both selecting the best learning algorithm and its best hyperparameters as model selection. The reason is straightforward: imagine we have data and want to train a support vector classifier with 10 candidate hyperparameter values and a random forest classifier with 10 candidate hyperparameter values. The result is that we are trying to select the best model from a set of 20 candidate models. In this chapter, we will cover techniques to efficiently select the best model from the set of candidates.

Throughout this chapter we will refer to specific hyperparameters, such as C (the inverse of regularization strength). Do not worry if you don’t know what these hyperparameters do; we will cover them in later chapters. For now, just treat hyperparameters as settings for the learning algorithm that we must choose before training begins.
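
For instance, here is a quick illustrative sketch (separate from the recipes that follow): the number of trees in a random forest is a hyperparameter we choose, while the split rules inside each tree are parameters learned during fitting.

# Load libraries
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

# Load data
iris = datasets.load_iris()

# Hyperparameter: we choose the number of trees before training
forest = RandomForestClassifier(n_estimators=100)

# Parameters: the split rules inside each tree are learned during fitting
forest.fit(iris.data, iris.target)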

12.3 Selecting Best Models from Multiple Learning Algorithms

Problem

You want to select the best model by searching over a range of learning algorithms and their respective hyperparameters.

Solution

Create a dictionary of candidate learning algorithms and their hyperparameters:

# Load libraries
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Set random seed
np.random.seed(0)

# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create a pipeline
pipe = Pipeline([("classifier", RandomForestClassifier())])

# Create dictionary with candidate learning algorithms and their hyperparameters
# (liblinear supports both the l1 and l2 penalties)
search_space = [{"classifier": [LogisticRegression(solver="liblinear")],
                 "classifier__penalty": ['l1', 'l2'],
                 "classifier__C": np.logspace(0, 4, 10)},
                {"classifier": [RandomForestClassifier()],
                 "classifier__n_estimators": [10, 100, 1000],
                 "classifier__max_features": [1, 2, 3]}]

# Create grid search
gridsearch = GridSearchCV(pipe, search_space, cv=5, verbose=0)

# Fit grid search
best_model = gridsearch.fit(features, target)

Discussion

In the previous two recipes we found the best model by searching over possible hyperparameter values of a single learning algorithm. However, what if we are not certain which learning algorithm to use? Recent versions of scikit-learn allow us to include learning algorithms as part of the search space. In our solution we define a search space that includes two learning algorithms: logistic regression and random forest classifier. Each learning algorithm has its own hyperparameters, and we define their candidate values using the format classifier__[hyperparameter name]. For example, for our logistic regression, to define the candidate values for the regularization strength, C, and the type of regularization penalty, penalty, we create a dictionary:

{'classifier': [LogisticRegression()],
 'classifier__penalty': ['l1', 'l2'],
 'classifier__C': np.logspace(0, 4, 10)}

We can also create a similar dictionary for the random forest hyperparameters:

{'classifier': [RandomForestClassifier()],
 'classifier__n_estimators': [10, 100, 1000],
 'classifier__max_features': [1, 2, 3]}

After the search is complete, we can use best_estimator_ to view the best model’s learning algorithm and hyperparameters:

# View best model
best_model.best_estimator_.get_params()["classifier"]
LogisticRegression(C=7.7426368268112693, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
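
If we only need the winning hyperparameter values or the best cross-validated score rather than the fitted estimator itself, GridSearchCV also exposes best_params_ and best_score_ (the exact values will depend on the search):

# View the hyperparameters of the best model
best_model.best_params_

# View the best model's mean cross-validated score
best_model.best_score_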

Just like with the last two recipes, once we have fit the model selection search, we can use this best model just like any other scikit-learn model:

# Predict target vector
best_model.predict(features)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

12.4 Selecting Best Models When Preprocessing

Problem

You want to include a preprocessing step during model selection.

Solution

Create a pipeline that includes the preprocessing step and any of its parameters:

# Load libraries
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Set random seed
np.random.seed(0)

# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create a preprocessing object that combines StandardScaler and PCA
preprocess = FeatureUnion([("std", StandardScaler()), ("pca", PCA())])

# Create a pipeline (liblinear supports both the l1 and l2 penalties)
pipe = Pipeline([("preprocess", preprocess),
                 ("classifier", LogisticRegression(solver="liblinear"))])

# Create space of candidate values
search_space = [{"preprocess__pca__n_components": [1, 2, 3],
                 "classifier__penalty": ["l1", "l2"],
                 "classifier__C": np.logspace(0, 4, 10)}]

# Create grid search
clf = GridSearchCV(pipe, search_space, cv=5, verbose=0, n_jobs=-1)

# Fit grid search
best_model = clf.fit(features, target)

Discussion

Very often we will need to preprocess our data before using it to train a model. We have to be careful to properly handle preprocessing when conducting model selection. First, GridSearchCV uses cross-validation to determine which model has the highest performance. However, in cross-validation we are in effect pretending that the held-out fold is unseen data, and therefore it should not be part of fitting any preprocessing steps (e.g., scaling or standardization). For this reason, we cannot preprocess the data and then run GridSearchCV. Rather, the preprocessing steps must be part of the set of actions taken by GridSearchCV. While this might appear complex, the reality is that scikit-learn makes it simple. FeatureUnion allows us to combine multiple preprocessing actions properly. In our solution we use FeatureUnion to combine two preprocessing steps: standardization of the feature values (StandardScaler) and principal component analysis (PCA). This object, called preprocess, contains both of our preprocessing steps. We then include preprocess in a pipeline with our learning algorithm. The end result is that we outsource the proper (and confusing) handling of fitting, transforming, and training the models with combinations of hyperparameters to scikit-learn.
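
To make the leakage point concrete, here is a minimal sketch contrasting the two approaches, reusing the objects from the solution (the leaky version is shown only as what to avoid):

# Leaky approach (do not do this): the scaler is fit on all of the data,
# including rows that GridSearchCV will later hold out as test folds
# features_scaled = StandardScaler().fit_transform(features)
# best_model = clf.fit(features_scaled, target)

# Leak-free approach: because the scaler sits inside the pipeline, it is
# refit on each training fold and only then applied to the held-out fold
best_model = clf.fit(features, target)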

Second, some preprocessing methods have their own parameters, which often have to be supplied by the user. For example, dimensionality reduction using PCA requires the user to define the number of principal components to use to produce the transformed feature set. Ideally, we would choose the number of components that produces a model with the greatest performance for some evaluation metric. Luckily, scikit-learn makes this easy. When we include candidate component values in the search space, they are treated like any other hyperparameter to be searched over. In our solution, we defined "preprocess__pca__n_components": [1, 2, 3] in the search space to indicate that we want to discover whether one, two, or three principal components produce the best model.

After model selection is complete, we can view the preprocessing values that produced the best model. For example, we can see the best number of principal components:

# View best model
best_model.best_estimator_.get_params()['preprocess__pca__n_components']
1

12.5 Speeding Up Model Selection with Parallelization

Problem

You need to speed up model selection.

Solution

Use all the cores in your machine by setting n_jobs=-1:

# Load libraries
import numpy as np
from sklearn import linear_model, datasets
from sklearn.model_selection import GridSearchCV

# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create logistic regression (liblinear supports both the l1 and l2 penalties)
logistic = linear_model.LogisticRegression(solver="liblinear")

# Create range of candidate regularization penalty hyperparameter values
penalty = ["l1", "l2"]

# Create range of candidate values for C
C = np.logspace(0, 4, 1000)

# Create hyperparameter options
hyperparameters = dict(C=C, penalty=penalty)

# Create grid search
gridsearch = GridSearchCV(logistic, hyperparameters, cv=5, n_jobs=-1, verbose=1)

# Fit grid search
best_model = gridsearch.fit(features, target)
Fitting 5 folds for each of 2000 candidates, totalling 10000 fits


[Parallel(n_jobs=-1)]: Done 1352 tasks      | elapsed:    7.2s
[Parallel(n_jobs=-1)]: Done 5120 tasks      | elapsed:   30.3s
[Parallel(n_jobs=-1)]: Done 10000 out of 10000 | elapsed:  1.1min finished

Discussion

In the recipes of this chapter, we have kept the number of candidate models small to make the code complete quickly. However, in the real world we will often have many thousands or tens of thousands of models to train, so it can take many hours to find the best model. To speed up the process, scikit-learn lets us train multiple models simultaneously, up to the number of cores on the machine. Most modern laptops have at least four cores, so (assuming you are currently on a laptop) we can potentially train four or more models at the same time, dramatically speeding up the model selection process. The parameter n_jobs defines the number of models to train in parallel.

In our solution, we set n_jobs to -1, which tells scikit-learn to use all cores. However, by default n_jobs uses only one core. To demonstrate this, if we run the same GridSearchCV as in the solution but with n_jobs=1, we can see it takes significantly longer to find the best model (the exact time will depend on your computer):

# Create grid search using one core
clf = GridSearchCV(logistic, hyperparameters, cv=5, n_jobs=1, verbose=1)

# Fit grid search
best_model = clf.fit(features, target)
Fitting 5 folds for each of 2000 candidates, totalling 10000 fits


[Parallel(n_jobs=1)]: Done 10000 out of 10000 | elapsed:  2.8min finished
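
If you want to keep a core free for other work, you can also set n_jobs to a specific number of cores. Here is a short sketch reusing logistic and hyperparameters from the solution, with Python's standard library used to check the core count:

# Load library
import os

# View the number of cores available on this machine
os.cpu_count()

# Create grid search that leaves one core free
gridsearch = GridSearchCV(logistic, hyperparameters, cv=5,
                          n_jobs=os.cpu_count() - 1, verbose=1)

# Fit grid search
best_model = gridsearch.fit(features, target)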

12.6 Speeding Up Model Selection Using Algorithm-Specific Methods

Problem

You need to speed up model selection.

Solution

If you are using one of a select group of learning algorithms, use scikit-learn’s model-specific cross-validation hyperparameter tuning. For example, LogisticRegressionCV:

# Load libraries
from sklearn import linear_model, datasets

# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create cross-validated logistic regression
logit = linear_model.LogisticRegressionCV(Cs=100)

# Train model
logit.fit(features, target)
LogisticRegressionCV(Cs=100, class_weight=None, cv=None, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
           refit=True, scoring=None, solver='lbfgs', tol=0.0001, verbose=0)

Discussion

Sometimes the characteristics of a learning algorithm allow us to search for the best hyperparameters significantly faster than either brute-force or randomized search methods. In scikit-learn, many learning algorithms (e.g., ridge, lasso, and elastic net regression) have an algorithm-specific cross-validation method that takes advantage of this. For example, LogisticRegression implements a standard logistic regression classifier, while LogisticRegressionCV implements an efficient cross-validated logistic regression classifier that can identify the optimum value of the hyperparameter C.

scikit-learn’s LogisticRegressionCV method includes a parameter, Cs. If supplied a list, Cs contains the candidate values of C to select from. If supplied an integer, Cs generates a list of that many candidate values, drawn logarithmically from a range between 0.0001 and 10,000 (a range of reasonable values for C).
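
For example, here is a brief sketch showing both ways of supplying Cs (the specific candidate values are only illustrative):

# Supply an explicit list of candidate values for C
logit_list = linear_model.LogisticRegressionCV(Cs=[0.01, 0.1, 1, 10, 100])

# Or supply an integer and let scikit-learn generate that many candidates
logit_int = linear_model.LogisticRegressionCV(Cs=20)

# Train models
logit_list.fit(features, target)
logit_int.fit(features, target)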

However, a major downside to LogisticRegressionCV is that it can only search a range of values for C. In Recipe 12.1 our possible hyperparameter space included both C and another hyperparameter (the regularization penalty norm). This limitation is common to many of scikit-learn’s model-specific cross-validated approaches.

12.7 Evaluating Performance After Model Selection

Problem

You want to evaluate the performance of a model found through model selection.

Solution

Use nested cross-validation to avoid biased evaluation:

# Load libraries
import numpy as np
from sklearn import linear_model, datasets
from sklearn.model_selection import GridSearchCV, cross_val_score

# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create logistic regression
logistic = linear_model.LogisticRegression()

# Create range of 20 candidate values for C
C = np.logspace(0, 4, 20)

# Create hyperparameter options
hyperparameters = dict(C=C)

# Create grid search
gridsearch = GridSearchCV(logistic, hyperparameters, cv=5, n_jobs=-1, verbose=0)

# Conduct nested cross-validation and output the average score
cross_val_score(gridsearch, features, target).mean()
0.95343137254901966

Discussion

Nested cross-validation during model selection is a difficult concept for many people to grasp the first time. Remember that in k-fold cross-validation, we train our model on k–1 folds of the data, use this model to make predictions on the remaining fold, and then evaluate our model based on how well its predictions compare to the true values. We then repeat this process k times.

In the model selection searches described in this chapter (i.e., GridSearchCV and RandomizedSearchCV), we used cross-validation to evaluate which hyperparameter values produced the best models. However, a nuanced and generally underappreciated problem arises: since we used the data to select the best hyperparameter values, we cannot use that same data to evaluate the model’s performance. The solution? Wrap the cross-validation used for model search in another cross-validation! In nested cross-validation, the “inner” cross-validation selects the best model, while the “outer” cross-validation provides us with an unbiased evaluation of the model’s performance. In our solution, the inner cross-validation is our GridSearchCV object, which we then wrap in an outer cross-validation using cross_val_score.

If you are confused, try a simple experiment. First, set verbose=1 so we can see what is happening:

gridsearch = GridSearchCV(logistic, hyperparameters, cv=5, verbose=1)

Next, run gridsearch.fit(features, target), which is our inner cross-validation used to find the best model:

best_model = gridsearch.fit(features, target)
Fitting 5 folds for each of 20 candidates, totalling 100 fits


[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.4s finished

From the output you can see the inner cross-validation trained 20 candidate models five times each, totaling 100 fits. Next, nest gridsearch inside a new cross-validation, which defaults to three folds:

scores = cross_val_score(gridsearch, features, target)
Fitting 5 folds for each of 20 candidates, totalling 100 fits


[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.5s finished


Fitting 5 folds for each of 20 candidates, totalling 100 fits


[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.7s finished


Fitting 5 folds for each of 20 candidates, totalling 100 fits


[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.5s finished

The output shows that the inner cross-validation trained 20 candidate models five times each (100 fits) to find the best model within each of the three outer folds, and each resulting best model was then evaluated on that outer fold’s held-out data, for a total of 300 models trained.
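
Finally, if you want to control the outer loop explicitly, cross_val_score accepts a cv argument. Here is a short sketch reusing the objects from the solution, with a five-fold outer cross-validation:

# Conduct nested cross-validation with an explicit five-fold outer loop
scores = cross_val_score(gridsearch, features, target, cv=5)

# View the average score across the outer folds
scores.mean()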