The K-Nearest Neighbors (KNN) classifier is one of the simplest yet most commonly used classifiers in supervised machine learning. KNN is often considered a lazy learner; it doesn’t technically train a model to make predictions. Instead, an observation is predicted to be the class held by the largest proportion of its k nearest observations. For example, if an observation with an unknown class is surrounded by observations of class 1, then it is classified as class 1. In this chapter we will explore how to use scikit-learn to create and use a KNN classifier.
Use scikit-learn’s NearestNeighbors:
# Load libraries
from sklearn import datasets
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Load data
iris = datasets.load_iris()
features = iris.data

# Create standardizer
standardizer = StandardScaler()

# Standardize features
features_standardized = standardizer.fit_transform(features)

# Two nearest neighbors
nearest_neighbors = NearestNeighbors(n_neighbors=2).fit(features_standardized)

# Create an observation
new_observation = [1, 1, 1, 1]

# Find distances and indices of the observation's nearest neighbors
distances, indices = nearest_neighbors.kneighbors([new_observation])

# View the nearest neighbors
features_standardized[indices]
array([[[ 1.03800476, 0.56925129, 1.10395287, 1.1850097 ],
[ 0.79566902, 0.33784833, 0.76275864, 1.05353673]]])
In our solution we used the dataset of Iris flowers. We created an
observation, new_observation, with some values and then found the two
observations that are closest to our observation. indices contains the
locations of the observations in our dataset that are closest, so
features_standardized[indices] displays the values of those observations. Intuitively,
distance can be thought of as a measure of similarity, so the two
closest observations are the two flowers most similar to the flower we
created.
How do we measure distance? scikit-learn offers a wide variety of distance metrics, $d$, including Euclidean:

$$d_{euclidean} = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

and Manhattan distance:

$$d_{manhattan} = \sum_{i=1}^{n} \left| x_i - y_i \right|$$

By default, NearestNeighbors uses Minkowski distance:

$$d_{minkowski} = \left( \sum_{i=1}^{n} \left| x_i - y_i \right|^p \right)^{1/p}$$

where $x_i$ and $y_i$ are the two observations we are calculating the distance between. Minkowski includes a hyperparameter, $p$, where $p = 1$ is Manhattan distance, $p = 2$ is Euclidean distance, and so on. By default in scikit-learn, $p = 2$.
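To make that relationship concrete, here is a small sketch (an addition to the original solution) that computes Minkowski distance directly with NumPy and confirms that $p = 1$ reproduces Manhattan distance while $p = 2$ reproduces Euclidean distance:

# Quick check of the Minkowski formula using NumPy (illustrative sketch)
import numpy as np

x = np.array([1.0, 1.0, 1.0, 1.0])
y = np.array([0.5, -0.2, 0.3, 0.8])

def minkowski(x, y, p):
    # (sum_i |x_i - y_i|^p)^(1/p)
    return np.sum(np.abs(x - y) ** p) ** (1 / p)

print(minkowski(x, y, p=1))            # Manhattan distance
print(np.sum(np.abs(x - y)))           # same result computed directly

print(minkowski(x, y, p=2))            # Euclidean distance
print(np.sqrt(np.sum((x - y) ** 2)))   # same result computed directly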
We can set the distance metric using the metric parameter:
# Find two nearest neighbors based on euclidean distance
nearestneighbors_euclidean = NearestNeighbors(
    n_neighbors=2, metric='euclidean').fit(features_standardized)
The distance variable we created contains the actual distance
measurement to each of the two nearest neighbors:
# View distances
distances
array([[ 0.48168828, 0.73440155]])
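As a sanity check (an addition to the original solution), we can recompute those distances by hand with NumPy; because the default Minkowski metric with $p = 2$ is Euclidean distance, the values should match the output of kneighbors:

# Recompute the reported distances manually (illustrative sketch)
import numpy as np

neighbors = features_standardized[indices[0]]   # the two nearest neighbors
manual = np.sqrt(np.sum((neighbors - new_observation) ** 2, axis=1))
print(manual)   # should match the distances array above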
In addition, we can use kneighbors_graph to create a matrix indicating
each observation’s nearest neighbors:
# Find each observation's three nearest neighbors
# based on euclidean distance (including itself)
nearestneighbors_euclidean = NearestNeighbors(
    n_neighbors=3, metric="euclidean").fit(features_standardized)

# List of lists indicating each observation's 3 nearest neighbors
# (including itself)
nearest_neighbors_with_self = nearestneighbors_euclidean.kneighbors_graph(
    features_standardized).toarray()

# Remove 1's marking an observation is a nearest neighbor to itself
for i, x in enumerate(nearest_neighbors_with_self):
    x[i] = 0

# View first observation's two nearest neighbors
nearest_neighbors_with_self[0]
array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0.])
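Because the full 150-element row is mostly zeros, it can be easier (this is an addition, not part of the original recipe) to pull out just the indices of the nonzero entries:

# Indices of the first observation's two nearest neighbors (illustrative sketch)
import numpy as np

print(np.flatnonzero(nearest_neighbors_with_self[0]))   # the 1's sit at indices 17 and 27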
When we are finding nearest neighbors or using any learning algorithm based on distance, it is important to transform features so that they are on the same scale. This is because the distance metrics treat all features as if they were on the same scale; if one feature is measured in millions of dollars and a second is measured in percentages, the calculated distance will be dominated by the former. In our solution we addressed this potential issue by standardizing the features using StandardScaler.
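As a quick illustration of why scaling matters (a hypothetical example, not part of the original solution), consider two features on wildly different scales; the raw Euclidean distance is driven almost entirely by the large-scale feature until we standardize:

# Hypothetical two-feature data: income in dollars, rate as a percentage
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1_000_000, 0.10],
              [1_050_000, 0.90],
              [1_000_500, 0.95]])

# Raw distances from the first observation: dominated by the income column
print(np.sqrt(np.sum((X - X[0]) ** 2, axis=1)))

# After standardization, both features contribute comparably
X_std = StandardScaler().fit_transform(X)
print(np.sqrt(np.sum((X_std - X_std[0]) ** 2, axis=1)))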
If the dataset is not very large, use KNeighborsClassifier:
# Load libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn import datasets

# Load data
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Create standardizer
standardizer = StandardScaler()

# Standardize features
X_std = standardizer.fit_transform(X)

# Train a KNN classifier with 5 neighbors
knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1).fit(X_std, y)

# Create two observations
new_observations = [[0.75, 0.75, 0.75, 0.75],
                    [1, 1, 1, 1]]

# Predict the class of two observations
knn.predict(new_observations)
array([1, 2])
In KNN, given an observation, $x_u$, with an unknown target class, the algorithm first identifies the $k$ closest observations (sometimes called $x_u$’s neighborhood) based on some distance metric (e.g., Euclidean distance). These $k$ observations then “vote” based on their class, and the class that wins the vote is $x_u$’s predicted class. More formally, the probability that $x_u$ is some class $j$ is:

$$\frac{1}{k} \sum_{i \in \nu} I(y_i = j)$$

where $\nu$ is the set of $k$ observations in $x_u$’s neighborhood, $y_i$ is the class of the $i$th observation, and $I$ is an indicator function (i.e., 1 if true, 0 otherwise). In scikit-learn we can see these probabilities using predict_proba:
# View probability each observation is one of three classes
knn.predict_proba(new_observations)
array([[ 0. , 0.6, 0.4],
[ 0. , 0. , 1. ]])
The class with the highest probability becomes the predicted class. For example, in the preceding output, the first observation should be class 1 (Pr = 0.6) while the second observation should be class 2 (Pr = 1), and this is just what we see:
knn.predict(new_observations)
array([1, 2])
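To connect this back to the voting formula, we can reproduce those probabilities by hand (a sketch added for illustration, reusing the knn, y, and new_observations objects from the solution): find the five nearest training observations for each new observation and tally their classes:

# Reproduce predict_proba by counting neighbor votes (illustrative sketch)
import numpy as np

# Indices of the 5 nearest training observations for each new observation
_, neighbor_indices = knn.kneighbors(new_observations)

for row in neighbor_indices:
    votes = np.bincount(y[row], minlength=3)   # class counts among the neighbors
    print(votes / votes.sum())                 # should match predict_proba's rows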
KNeighborsClassifier contains a number of important parameters to
consider. First, metric sets the distance metric used (discussed more in Recipe 14.1). Second, n_jobs determines how many of the computer’s cores to use. Because making a prediction requires calculating the distance from a point to every single point in the data, using multiple cores is highly recommended. Third, algorithm sets the method used to calculate the nearest neighbors. While there are real differences in the algorithms, by default KNeighborsClassifier attempts to auto-select the best algorithm so you often don’t need to
worry about this parameter. Fourth, by default KNeighborsClassifier
works as described previously, with each observation in the neighborhood getting one vote; however, if we set the weights parameter to distance, the votes of closer observations are weighted more heavily than those of observations farther away. Intuitively this makes sense, since more similar neighbors might tell us more about an observation’s class than less similar ones.
Finally, as discussed in Recipe 14.1, because distance calculations treat all features as if they are on the same scale, it is important to standardize the features prior to using a KNN classifier.
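For example, a distance-weighted version of the classifier from the solution might look like the following (a sketch, reusing X_std, y, and new_observations from above):

# KNN classifier where closer neighbors get larger vote weights (sketch)
from sklearn.neighbors import KNeighborsClassifier

knn_weighted = KNeighborsClassifier(
    n_neighbors=5, weights="distance", n_jobs=-1).fit(X_std, y)

# Probabilities now reflect distance-weighted votes rather than simple counts
knn_weighted.predict_proba(new_observations)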
Use model selection techniques like GridSearchCV:
# Load libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create standardizer
standardizer = StandardScaler()

# Standardize features
features_standardized = standardizer.fit_transform(features)

# Create a KNN classifier
knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)

# Create a pipeline
pipe = Pipeline([("standardizer", standardizer), ("knn", knn)])

# Create space of candidate values
search_space = [{"knn__n_neighbors": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}]

# Create grid search
classifier = GridSearchCV(
    pipe, search_space, cv=5, verbose=0).fit(features_standardized, target)
The size of k has real implications in KNN classifiers. In
machine learning we are trying to find a balance between bias and
variance, and in few places is that trade-off as explicit as in the value of k. If k = n, where n is the number of observations, we have high bias but low variance. If k = 1, we have low bias but high variance. The best model will come from finding the value of k that balances this bias-variance trade-off. In our solution, we used GridSearchCV to conduct five-fold cross-validation on KNN classifiers with different values of k. When that is complete, we can see the k that produces the best model:
# Best neighborhood size (k)
classifier.best_estimator_.get_params()["knn__n_neighbors"]
6
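The fitted GridSearchCV object can then be used directly for prediction with the best k it found, since it refits on the full data by default (a small usage sketch added here):

# Predict with the tuned pipeline (illustrative sketch)
classifier.predict([[1, 1, 1, 1]])

# Cross-validated accuracy achieved by the best k
classifier.best_score_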
Use RadiusNeighborsClassifier:
# Load libraries
from sklearn.neighbors import RadiusNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn import datasets

# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create standardizer
standardizer = StandardScaler()

# Standardize features
features_standardized = standardizer.fit_transform(features)

# Train a radius neighbors classifier
rnn = RadiusNeighborsClassifier(
    radius=.5, n_jobs=-1).fit(features_standardized, target)

# Create an observation
new_observations = [[1, 1, 1, 1]]

# Predict the class of the observation
rnn.predict(new_observations)
array([2])
In KNN classification, an observation’s class is predicted from the
classes of its k neighbors. A less common technique is classification with a radius-based nearest neighbor (RNN) classifier, in which an observation’s class is predicted from the classes of all observations within a given radius r. In scikit-learn, RadiusNeighborsClassifier is very similar to KNeighborsClassifier, with the exception of two parameters. First, in RadiusNeighborsClassifier we need to specify the radius of the fixed area used to determine whether an observation is a neighbor, using the radius parameter. Unless there is some substantive reason for setting radius to a particular value, it is best to treat it like any other hyperparameter and tune it during model selection. The second useful parameter is outlier_label, which indicates what label to give an observation that has no observations within the radius; this itself can often be a useful tool for identifying outliers.
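For example, here is a sketch of flagging such observations with a sentinel label (reusing features_standardized and target from the solution; the -1 label is our own choice, not something the library requires):

# Radius-based classifier that labels observations with no neighbors as -1 (sketch)
from sklearn.neighbors import RadiusNeighborsClassifier

rnn_outlier = RadiusNeighborsClassifier(
    radius=.5, outlier_label=-1, n_jobs=-1).fit(features_standardized, target)

# A point far from all training data falls outside every radius and gets -1
rnn_outlier.predict([[100, 100, 100, 100]])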