In machine learning, we use training algorithms to learn the parameters of a model by minimizing some loss function. In addition, many learning algorithms (e.g., support vector classifiers and random forests) have hyperparameters that must be defined outside of the learning process. For example, random forests are collections of decision trees (hence the word forest); however, the number of decision trees in the forest is not learned by the algorithm and must be set prior to fitting. The process of selecting the best hyperparameters is often referred to as hyperparameter tuning, hyperparameter optimization, or model selection. Additionally, we might want to try multiple learning algorithms (for example, trying both a support vector classifier and a random forest to see which produces the best model).
While there is widespread variation in the use of terminology in this area, in this book we refer to both selecting the best learning algorithm and its best hyperparameters as model selection. The reason is straightforward: imagine we have data and want to train a support vector classifier with 10 candidate hyperparameter values and a random forest classifier with 10 candidate hyperparameter values. The result is that we are trying to select the best model from a set of 20 candidate models. In this chapter, we will cover techniques to efficiently select the best model from the set of candidates.
Throughout this chapter we will refer to specific hyperparameters, such as C (the inverse of regularization strength). Do not worry if you don’t know what these hyperparameters are; we will cover them in later chapters. Instead, just think of hyperparameters as settings for the learning algorithm that we must choose before training begins.
Use scikit-learn’s GridSearchCV:
# Load libraries
import numpy as np
from sklearn import linear_model, datasets
from sklearn.model_selection import GridSearchCV

# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create logistic regression
logistic = linear_model.LogisticRegression()

# Create range of candidate penalty hyperparameter values
penalty = ['l1', 'l2']

# Create range of candidate regularization hyperparameter values
C = np.logspace(0, 4, 10)

# Create dictionary of hyperparameter candidates
hyperparameters = dict(C=C, penalty=penalty)

# Create grid search
gridsearch = GridSearchCV(logistic, hyperparameters, cv=5, verbose=0)

# Fit grid search
best_model = gridsearch.fit(features, target)
GridSearchCV is a brute-force approach to model selection using cross-validation. Specifically, a user defines sets of possible values for one or more hyperparameters, and then GridSearchCV trains a model using every combination of those values. The model with the best performance score is selected as the best model. For example, in our solution we used logistic regression as our learning algorithm, with two hyperparameters to tune: C and the regularization penalty. Don’t worry if you don’t know what C and regularization mean; we cover them in the next few chapters. Just realize that C and the regularization penalty can each take a range of values, which have to be specified prior to training. For C, we define 10 possible values:
np.logspace(0, 4, 10)
array([ 1.00000000e+00, 2.78255940e+00, 7.74263683e+00,
2.15443469e+01, 5.99484250e+01, 1.66810054e+02,
4.64158883e+02, 1.29154967e+03, 3.59381366e+03,
1.00000000e+04])
And similarly we define two possible values for the regularization
penalty: ['l1', 'l2']. For each combination of C and regularization
penalty values, we train the model and evaluate it using k-fold
cross-validation. In our solution, we had 10 possible values of C, 2
possible values of the regularization penalty, and 5 folds. Together these
define 10 × 2 = 20 candidate hyperparameter combinations, requiring
10 × 2 × 5 = 100 model fits, from which the best combination was selected.
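If it helps to see the grid explicitly, scikit-learn’s ParameterGrid (the same utility GridSearchCV uses internally) can enumerate the candidate combinations. A minimal sketch, assuming the hyperparameters dictionary from the solution is in scope; the candidates name is just for illustration:

# Load library
from sklearn.model_selection import ParameterGrid

# Enumerate every candidate combination of C and penalty
candidates = list(ParameterGrid(hyperparameters))

# Number of combinations (10 values of C x 2 penalties = 20)
len(candidates)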
Once GridSearchCV is complete, we can see the hyperparameters of
the best model:
# View best hyperparameters
print('Best Penalty:', best_model.best_estimator_.get_params()['penalty'])
print('Best C:', best_model.best_estimator_.get_params()['C'])
Best Penalty: l1 Best C: 7.74263682681
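The fitted GridSearchCV object also exposes the winning hyperparameter combination and its mean cross-validated score directly, which can be handier than calling get_params on the best estimator:

# View the best hyperparameter combination and its mean cross-validated score
best_model.best_params_
best_model.best_score_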
By default, after identifying the best hyperparameters, GridSearchCV
will retrain a model using the best hyperparameters on the entire
dataset (rather than leaving a fold out for cross-validation). We can
use this model to predict values like any other scikit-learn model:
# Predict target vector
best_model.predict(features)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
One GridSearchCV parameter is worth noting: verbose. While mostly
unnecessary, it can be reassuring during long searching processes to
receive an indication that the search is progressing. The verbose
parameter determines the number of messages output during the search,
with 0 showing no output and 1 to 3 printing messages of increasing detail.
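For example, to get a progress message during a long search, we could create the grid search with a higher verbosity level; a minor variation on the solution’s call:

# Create grid search that prints progress messages while fitting
gridsearch = GridSearchCV(logistic, hyperparameters, cv=5, verbose=1)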
Use scikit-learn’s RandomizedSearchCV:
# Load libraries
from scipy.stats import uniform
from sklearn import linear_model, datasets
from sklearn.model_selection import RandomizedSearchCV

# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create logistic regression
logistic = linear_model.LogisticRegression()

# Create range of candidate regularization penalty hyperparameter values
penalty = ['l1', 'l2']

# Create distribution of candidate regularization hyperparameter values
C = uniform(loc=0, scale=4)

# Create hyperparameter options
hyperparameters = dict(C=C, penalty=penalty)

# Create randomized search
randomizedsearch = RandomizedSearchCV(
    logistic, hyperparameters, random_state=1, n_iter=100, cv=5, verbose=0,
    n_jobs=-1)

# Fit randomized search
best_model = randomizedsearch.fit(features, target)
In Recipe 12.1, we used GridSearchCV on a user-defined set of
hyperparameter values to search for the best model according to a score
function. A more efficient method than GridSearchCV’s brute-force
search is to search over a specific number of random combinations of
hyperparameter values from user-supplied distributions (e.g., normal,
uniform). scikit-learn implements this randomized search technique
with RandomizedSearchCV.
With RandomizedSearchCV, if we specify a distribution, scikit-learn
will randomly sample hyperparameter values from that distribution.
As an example of the general concept, here we randomly
sample 10 values from a uniform distribution ranging from 0 to 4:
# Define a uniform distribution between 0 and 4, sample 10 values
uniform(loc=0, scale=4).rvs(10)
array([ 1.97118721, 1.34786898, 3.79867719, 1.15999967, 3.33485021,
1.20394179, 0.88117832, 0.47855329, 0.48521699, 1.5479854 ])
Alternatively, if we specify a list of values, such as the two regularization
penalty hyperparameter values ['l1', 'l2'], RandomizedSearchCV will
randomly sample a value from the list for each iteration.
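To see how candidates are drawn when a distribution and a list are mixed, scikit-learn’s ParameterSampler (which RandomizedSearchCV uses under the hood) lets us preview a few sampled combinations. A minimal sketch, assuming the hyperparameters dictionary from the solution is in scope:

# Load library
from sklearn.model_selection import ParameterSampler

# Preview three randomly sampled combinations of C and penalty
list(ParameterSampler(hyperparameters, n_iter=3, random_state=1))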
Just like with GridSearchCV, we can see the hyperparameter values of
the best model:
# View best hyperparameters
print('Best Penalty:', best_model.best_estimator_.get_params()['penalty'])
print('Best C:', best_model.best_estimator_.get_params()['C'])
Best Penalty: l1 Best C: 1.66808801881
And just like with GridSearchCV, after the search is complete
RandomizedSearchCV fits a new model using the best hyperparameters on
the entire dataset. We can use this model like any other in
scikit-learn; for example, to make predictions:
# Predict target vector
best_model.predict(features)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
The number of sampled combinations of hyperparameters (i.e., the number
of candidate models trained) is specified with the n_iter (number of
iterations) setting.
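For example, a quicker (but less thorough) search could sample only 10 combinations; a minimal variation on the solution (the quick_search name is just for illustration):

# Create a randomized search that samples only 10 hyperparameter combinations
quick_search = RandomizedSearchCV(
    logistic, hyperparameters, random_state=1, n_iter=10, cv=5, verbose=0,
    n_jobs=-1)

# Fit randomized search
best_model_quick = quick_search.fit(features, target)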
Create a dictionary of candidate learning algorithms and their hyperparameters:
# Load libraries
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Set random seed
np.random.seed(0)

# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create a pipeline
pipe = Pipeline([("classifier", RandomForestClassifier())])

# Create dictionary with candidate learning algorithms and their hyperparameters
search_space = [{"classifier": [LogisticRegression()],
                 "classifier__penalty": ['l1', 'l2'],
                 "classifier__C": np.logspace(0, 4, 10)},
                {"classifier": [RandomForestClassifier()],
                 "classifier__n_estimators": [10, 100, 1000],
                 "classifier__max_features": [1, 2, 3]}]

# Create grid search
gridsearch = GridSearchCV(pipe, search_space, cv=5, verbose=0)

# Fit grid search
best_model = gridsearch.fit(features, target)
In the previous two recipes we found the best model by searching over
possible hyperparameter values of a learning algorithm. However, what if
we are not certain which learning algorithm to use? Recent versions of
scikit-learn allow us to include learning algorithms as part of the
search space. In our solution we define a search space that includes two
learning algorithms: logistic regression and random forest classifier.
Each learning algorithm has its own hyperparameters, and we define
their candidate values using the format
classifier__[hyperparameter name]. For example, for our logistic regression, to define the set of possible values for the regularization hyperparameter C and the potential types of regularization penalty, penalty, we create a dictionary:
{'classifier': [LogisticRegression()],
'classifier__penalty': ['l1', 'l2'],
'classifier__C': np.logspace(0, 4, 10)}
We can also create a similar dictionary for the random forest hyperparameters:
{'classifier': [RandomForestClassifier()],
'classifier__n_estimators': [10, 100, 1000],
'classifier__max_features': [1, 2, 3]}
After the search is complete, we can use best_estimator_ to view the
best model’s learning algorithm and hyperparameters:
# View best model
best_model.best_estimator_.get_params()["classifier"]
LogisticRegression(C=7.7426368268112693, class_weight=None, dual=False,
fit_intercept=True, intercept_scaling=1, max_iter=100,
multi_class='ovr', n_jobs=1, penalty='l1', random_state=None,
solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
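We can also inspect the full set of winning parameters, including which candidate learning algorithm won, through the search object’s best_params_ attribute:

# View the winning learning algorithm and its hyperparameter values
best_model.best_params_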
Just like with the last two recipes, once we have fit the model selection search, we can use this best model just like any other scikit-learn model:
# Predict target vector
best_model.predict(features)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
Create a pipeline that includes the preprocessing step and any of its parameters:
# Load libraries
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Set random seed
np.random.seed(0)

# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create a preprocessing object that includes StandardScaler and PCA
preprocess = FeatureUnion([("std", StandardScaler()), ("pca", PCA())])

# Create a pipeline
pipe = Pipeline([("preprocess", preprocess),
                 ("classifier", LogisticRegression())])

# Create space of candidate values
search_space = [{"preprocess__pca__n_components": [1, 2, 3],
                 "classifier__penalty": ["l1", "l2"],
                 "classifier__C": np.logspace(0, 4, 10)}]

# Create grid search
clf = GridSearchCV(pipe, search_space, cv=5, verbose=0, n_jobs=-1)

# Fit grid search
best_model = clf.fit(features, target)
Very often we will need to preprocess our data before using it to train
a model. We have to be careful to properly handle preprocessing when
conducting model selection. First, GridSearchCV uses cross-validation to determine which model has the highest performance. However, in cross-validation we are in effect pretending that the held-out fold is unseen test data, and therefore it should not be used to fit any preprocessing steps (e.g., scaling or standardization). For this reason, we cannot preprocess the data and then run GridSearchCV. Rather, the preprocessing steps must be part of the set of actions taken by GridSearchCV. While this might appear complex, scikit-learn makes it simple. FeatureUnion allows us to combine multiple preprocessing actions properly. In our solution we use FeatureUnion to combine two preprocessing steps: standardizing the feature values (StandardScaler) and principal component analysis (PCA). This object is called preprocess and contains both of our preprocessing steps. We then include preprocess in a pipeline with our learning algorithm. The end result is that we outsource the proper (and confusing) handling of fitting, transforming, and training models with combinations of hyperparameters to scikit-learn.
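To make the first point concrete, here is a minimal sketch of the safe pattern: because the scaler sits inside a Pipeline, cross_val_score refits it on the training folds only, never on data that includes the held-out fold. This is an illustrative example, not part of the solution, and the leak_free_pipe name is just for illustration:

# Load libraries
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# The scaler is fit only on the training folds of each cross-validation
# split, so the held-out fold never leaks into preprocessing
leak_free_pipe = Pipeline([("std", StandardScaler()),
                           ("classifier", LogisticRegression())])
cross_val_score(leak_free_pipe, features, target, cv=5).mean()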
Second, some preprocessing methods have their own parameters, which often have to be supplied by the user. For example, dimensionality reduction using PCA requires the user to define the number of principal components to use to produce the transformed feature set. Ideally, we would choose the number of components that produces a model with the greatest performance on some evaluation metric. Luckily, scikit-learn makes this easy. When we include candidate component values in the search space, they are treated like any other hyperparameter to be searched over. In our solution, we defined "preprocess__pca__n_components": [1, 2, 3] in the search space to indicate that we wanted to discover whether one, two, or three principal components produced the best model.
After model selection is complete, we can view the preprocessing values that produced the best model. For example, we can see the best number of principal components:
# View best model
best_model.best_estimator_.get_params()['preprocess__pca__n_components']
1
Use all the cores in your machine by setting n_jobs=-1:
# Load libraries
import numpy as np
from sklearn import linear_model, datasets
from sklearn.model_selection import GridSearchCV

# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create logistic regression
logistic = linear_model.LogisticRegression()

# Create range of candidate regularization penalty hyperparameter values
penalty = ["l1", "l2"]

# Create range of candidate values for C
C = np.logspace(0, 4, 1000)

# Create hyperparameter options
hyperparameters = dict(C=C, penalty=penalty)

# Create grid search
gridsearch = GridSearchCV(logistic, hyperparameters, cv=5, n_jobs=-1, verbose=1)

# Fit grid search
best_model = gridsearch.fit(features, target)
Fitting 5 folds for each of 2000 candidates, totalling 10000 fits
[Parallel(n_jobs=-1)]: Done 1352 tasks      | elapsed:    7.2s
[Parallel(n_jobs=-1)]: Done 5120 tasks      | elapsed:   30.3s
[Parallel(n_jobs=-1)]: Done 10000 out of 10000 | elapsed:  1.1min finished
In the recipes of this chapter, we have kept the number of candidate
models small to make the code complete quickly. However, in the real
world we will often have many thousands or tens of thousands of models
to train. The end result is that it can take many hours to find the best
model. To speed up the process, scikit-learn lets us train multiple
models simultaneously. Without going into too much technical detail,
scikit-learn can train models in parallel, up to as many at a time as there are cores on the machine. Most modern laptops have at least four cores, so (assuming you are currently on a laptop) we can potentially train four or more models at the same time. This will dramatically increase the speed of our model selection process. The parameter n_jobs defines the number of models to train in
parallel.
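If you are not sure how many cores are available on your machine, you can check with Python’s standard library before deciding on a value for n_jobs:

# Load library
import os

# Number of CPU cores scikit-learn could use for parallel training
os.cpu_count()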
In our solution, we set n_jobs to -1, which tells scikit-learn to use
all cores. However, by default n_jobs is set to 1, meaning it only
uses one core. To demonstrate this, if we run the same GridSearchCV as
in the solution, but with n_jobs=1, we can see that it takes significantly longer to
find the best model (note that the exact time will depend on your computer):
# Create grid search using one core
clf = GridSearchCV(logistic, hyperparameters, cv=5, n_jobs=1, verbose=1)

# Fit grid search
best_model = clf.fit(features, target)
Fitting 5 folds for each of 2000 candidates, totalling 10000 fits
[Parallel(n_jobs=1)]: Done 10000 out of 10000 | elapsed:  2.8min finished
You need to speed up model selection.
If you are using a select number of learning algorithms, use
scikit-learn’s model-specific cross-validation hyperparameter tuning.
For example, LogisticRegressionCV:
# Load libraries
from sklearn import linear_model, datasets

# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create cross-validated logistic regression
logit = linear_model.LogisticRegressionCV(Cs=100)

# Train model
logit.fit(features, target)
LogisticRegressionCV(Cs=100, class_weight=None, cv=None, dual=False,
fit_intercept=True, intercept_scaling=1.0, max_iter=100,
multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
refit=True, scoring=None, solver='lbfgs', tol=0.0001, verbose=0)
Sometimes the characteristics of a learning algorithm allow us to
search for the best hyperparameters significantly faster than either
brute force or randomized model search methods. In scikit-learn, many
learning algorithms (e.g., ridge, lasso, and elastic net regression) have
an algorithm-specific cross-validation method to take advantage of this.
For example, LogisticRegression implements a standard logistic
regression classifier, while LogisticRegressionCV implements an
efficient cross-validated logistic regression classifier that has the
ability to identify the optimum value of the hyperparameter C.
scikit-learn’s LogisticRegressionCV method includes a parameter Cs.
If supplied a list, Cs is the candidate hyperparameter values to
select from. If supplied an integer, the parameter Cs generates a list of that many candidate values. The candidate values are drawn logarithmically from a range between 0.0001 and 10,000 (a range of reasonable values for C).
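For example, instead of an integer we could hand Cs an explicit list of candidate values and then read the value(s) selected by the internal cross-validation from the fitted model’s C_ attribute. A short sketch, assuming the iris data loaded in the solution; the logit_list name is just for illustration:

# Create cross-validated logistic regression with an explicit list of C values
logit_list = linear_model.LogisticRegressionCV(Cs=[0.001, 0.01, 0.1, 1, 10, 100])

# Train model
logit_list.fit(features, target)

# View the value(s) of C chosen by cross-validation (one per class here)
logit_list.C_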
However, a major downside to LogisticRegressionCV is that it can only
search a range of values for C. In Recipe 12.1 our possible hyperparameter space included both C and another hyperparameter (the
regularization penalty norm). This limitation is common to many of
scikit-learn’s model-specific cross-validated approaches.
Use nested cross-validation to avoid biased evaluation:
# Load libraries
import numpy as np
from sklearn import linear_model, datasets
from sklearn.model_selection import GridSearchCV, cross_val_score

# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create logistic regression
logistic = linear_model.LogisticRegression()

# Create range of 20 candidate values for C
C = np.logspace(0, 4, 20)

# Create hyperparameter options
hyperparameters = dict(C=C)

# Create grid search
gridsearch = GridSearchCV(logistic, hyperparameters, cv=5, n_jobs=-1, verbose=0)

# Conduct nested cross-validation and output the average score
cross_val_score(gridsearch, features, target).mean()
0.95343137254901966
Nested cross-validation during model selection is a difficult concept for many people to grasp the first time. Remember that in k-fold cross-validation, we train our model on k–1 folds of the data, use this model to make predictions on the remaining fold, and then evaluate our model based on how well its predictions compare to the true values. We then repeat this process k times.
In the model selection searches described in this chapter (i.e.,
GridSearchCV and RandomizedSearchCV), we used cross-validation to
evaluate which hyperparameter values produced the best models. However,
a nuanced and generally underappreciated problem arises: since we used
the data to select the best hyperparameter values, we cannot use that
same data to evaluate the model’s performance. The solution? Wrap the
cross-validation used for model search in another cross-validation! In
nested cross-validation, the “inner” cross-validation selects the best
model, while the “outer” cross-validation provides us with an unbiased
evaluation of the model’s performance. In our solution, the inner
cross-validation is our GridSearchCV object, which we then wrap in an
outer cross-validation using cross_val_score.
If you are confused, try a simple experiment. First, set verbose=1 so
we can see what is happening:
gridsearch = GridSearchCV(logistic, hyperparameters, cv=5, verbose=1)
Next, run gridsearch.fit(features, target), which is our inner
cross-validation used to find the best model:
best_model = gridsearch.fit(features, target)
Fitting 5 folds for each of 20 candidates, totalling 100 fits
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:  0.4s finished
From the output you can see the inner cross-validation trained 20
candidate models five times, totaling 100 models. Next, nest gridsearch inside
a new cross-validation, which defaults to three folds:
scores = cross_val_score(gridsearch, features, target)
Fitting 5 folds for each of 20 candidates, totalling 100 fits
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:  0.5s finished
Fitting 5 folds for each of 20 candidates, totalling 100 fits
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:  0.7s finished
Fitting 5 folds for each of 20 candidates, totalling 100 fits
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:  0.5s finished
The output shows that, for each of the three outer folds, the inner cross-validation trained 20 candidate models five times to find the best model, and each best model was then evaluated on its outer held-out fold, creating a total of 300 models trained.
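Finally, the outer scores stored in scores give us the unbiased performance estimate; we typically report their mean (and, optionally, their spread):

# Mean and standard deviation of the outer cross-validation scores
scores.mean()
scores.std()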