Chapter 15. K-Nearest Neighbors

15.0 Introduction

The K-Nearest Neighbors (KNN) classifier is one of the simplest yet most commonly used classifiers in supervised machine learning. KNN is often considered a lazy learner; it doesn’t technically train a model to make predictions. Instead, an observation is predicted to be the class of the largest proportion of its k nearest observations. For example, if an observation with an unknown class is surrounded by observations of class 1, then the observation is classified as class 1. In this chapter we will explore how to use scikit-learn to create and use a KNN classifier.

15.1 Finding an Observation’s Nearest Neighbors

Problem

You need to find an observation’s k nearest observations (neighbors).

Solution

Use scikit-learn’s NearestNeighbors:

# Load libraries
from sklearn import datasets
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Load data
iris = datasets.load_iris()
features = iris.data

# Create standardizer
standardizer = StandardScaler()

# Standardize features
features_standardized = standardizer.fit_transform(features)

# Two nearest neighbors
nearest_neighbors = NearestNeighbors(n_neighbors=2).fit(features_standardized)

# Create an observation
new_observation = [ 1,  1,  1,  1]

# Find distances and indices of the observation's nearest neighbors
distances, indices = nearest_neighbors.kneighbors([new_observation])

# View the nearest neighbors
features_standardized[indices]
array([[[ 1.03800476,  0.56925129,  1.10395287,  1.1850097 ],
        [ 0.79566902,  0.33784833,  0.76275864,  1.05353673]]])

Discussion

In our solution we used the dataset of Iris flowers. We created an observation, new_observation, with some values and then found the two observations that are closest to it. indices contains the locations of the closest observations in our dataset, so features_standardized[indices] displays the values of those observations. Intuitively, distance can be thought of as a measure of similarity, so the two closest observations are the two flowers most similar to the flower we created.

How do we measure distance? scikit-learn offers a wide variety of distance metrics, d, including Euclidean:

d_{\text{euclidean}} = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

and Manhattan distance:

d_{\text{manhattan}} = \sum_{i=1}^{n} |x_i - y_i|

By default, NearestNeighbors uses Minkowski distance:

d_{\text{minkowski}} = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}

where x_i and y_i are the values of the two observations we are calculating the distance between. Minkowski includes a hyperparameter, p, where p = 1 gives Manhattan distance, p = 2 gives Euclidean distance, and so on. By default in scikit-learn, p = 2.
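To make the formula concrete, the following is a minimal sketch (not part of the original solution) that computes the Minkowski distance with NumPy; with p = 2 it should reproduce the first value returned by kneighbors above.

# Load library
import numpy as np

# Nearest neighbor found earlier
nearest = features_standardized[indices[0][0]]

# Minkowski distance with p=2 (i.e., Euclidean distance)
p = 2
# Should match distances[0][0]
np.sum(np.abs(np.array(new_observation) - nearest) ** p) ** (1 / p)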

We can set the distance metric using the metric parameter:

# Find two nearest neighbors based on euclidean distance
nearestneighbors_euclidean = NearestNeighbors(
    n_neighbors=2, metric='euclidean').fit(features_standardized)

The distances variable we created contains the actual distance measurements to each of the two nearest neighbors:

# View distances
distances
array([[ 0.48168828,  0.73440155]])

In addition, we can use kneighbors_graph to create a matrix indicating each observation’s nearest neighbors:

# Find each observation's three nearest neighbors
# based on euclidean distance (including itself)
nearestneighbors_euclidean = NearestNeighbors(
    n_neighbors=3, metric="euclidean").fit(features_standardized)

# List of lists indicating each observation's 3 nearest neighbors
# (including itself)
nearest_neighbors_with_self = nearestneighbors_euclidean.kneighbors_graph(
    features_standardized).toarray()

# Remove 1's marking an observation is a nearest neighbor to itself
for i, x in enumerate(nearest_neighbors_with_self):
    x[i] = 0

# View first observation's two nearest neighbors
nearest_neighbors_with_self[0]
array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.])
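As an alternative sketch (an assumption, not part of the original recipe), we can work with neighbor indices directly instead of a graph. Assuming each observation is typically returned as its own first neighbor, dropping the first column of indices leaves only the other observations:

# Find each observation's neighbors (each observation is typically its own first neighbor)
distances, indices = nearestneighbors_euclidean.kneighbors(features_standardized)

# Drop the first column to exclude each observation itself
indices_without_self = indices[:, 1:]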

When we are finding nearest neighbors or using any learning algorithm based on distance, it is important to transform features so that they are on the same scale. The reason is that the distance metrics treat all features as if they were on the same scale, but if one feature is in millions of dollars and a second feature is in percentages, the calculated distance will be biased toward the former. In our solution we addressed this potential issue by standardizing the features using StandardScaler.
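To see why scaling matters, here is a minimal sketch (with hypothetical numbers, not from the original text) showing how a feature measured in dollars swamps a feature measured as a percentage when computing a squared Euclidean distance:

# Load library
import numpy as np

# Two hypothetical observations: [price in dollars, percentage feature]
observation_a = np.array([500000, 0.90])
observation_b = np.array([510000, 0.10])

# Per-feature contribution to the squared Euclidean distance:
# the dollar difference contributes 100,000,000 while the percentage contributes 0.64
contributions = (observation_a - observation_b) ** 2
contributions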

15.2 Creating a K-Nearest Neighbor Classifier

Problem

Given an observation of unknown class, you need to predict its class based on the class of its neighbors.

Solution

If the dataset is not very large, use KNeighborsClassifier:

# Load libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn import datasets

# Load data
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Create standardizer
standardizer = StandardScaler()

# Standardize features
X_std = standardizer.fit_transform(X)

# Train a KNN classifier with 5 neighbors
knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1).fit(X_std, y)

# Create two observations
new_observations = [[ 0.75,  0.75,  0.75,  0.75],
                    [ 1,  1,  1,  1]]

# Predict the class of two observations
knn.predict(new_observations)
array([1, 2])

Discussion

In KNN, given an observation, x_u, with an unknown target class, the algorithm first identifies the k closest observations (sometimes called x_u’s neighborhood) based on some distance metric (e.g., Euclidean distance); then these k observations “vote” based on their class, and the class that wins the vote becomes x_u’s predicted class. More formally, the probability that x_u is some class j is:

\frac{1}{k} \sum_{i \in \nu} I(y_i = j)

where ν is the set of k observations in x_u’s neighborhood, y_i is the class of the ith observation, and I is an indicator function (i.e., 1 if true, 0 otherwise). In scikit-learn we can see these probabilities using predict_proba:

# View probability each observation is one of three classes
knn.predict_proba(new_observations)
array([[ 0. ,  0.6,  0.4],
       [ 0. ,  0. ,  1. ]])

The class with the highest probability becomes the predicted class. For example, in the preceding output, the first observation should be class 1 (Pr = 0.6) while the second observation should be class 2 (Pr = 1), and this is just what we see:

knn.predict(new_observations)
array([1, 2])
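To connect the formula to this output, here is a minimal sketch (not part of the original solution) that reproduces the first observation’s probabilities by finding its five nearest neighbors and counting their votes directly:

# Load library
import numpy as np

# Find the indices of the first observation's 5 nearest neighbors
_, neighbor_indices = knn.kneighbors([new_observations[0]])

# Fraction of neighbors belonging to each class; should match predict_proba's first row
np.bincount(y[neighbor_indices[0]], minlength=3) / 5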

KNeighborsClassifier contains a number of important parameters to consider. First, metric sets the distance metric used (discussed more in Recipe 15.1). Second, n_jobs determines how many of the computer’s cores to use. Because making a prediction requires calculating the distance from a point to every single point in the data, using multiple cores is highly recommended. Third, algorithm sets the method used to calculate the nearest neighbors. While there are real differences among the algorithms, by default KNeighborsClassifier attempts to auto-select the best algorithm, so you often don’t need to worry about this parameter. Fourth, by default KNeighborsClassifier works as we described previously, with each observation in the neighborhood getting one vote; however, if we set the weights parameter to "distance", the votes of closer observations are weighted more heavily than those of observations farther away. Intuitively this makes sense, since more similar neighbors might tell us more about an observation’s class than less similar ones.
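For example, a distance-weighted classifier (a sketch, not part of the original recipe) can be created by changing only the weights parameter:

# Train a KNN classifier where closer neighbors get larger votes
knn_weighted = KNeighborsClassifier(
    n_neighbors=5, weights="distance", n_jobs=-1).fit(X_std, y)

# Predicted probabilities now reflect distance-weighted votes
knn_weighted.predict_proba(new_observations)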

Finally, as discussed in Recipe 15.1, because distance calculations treat all features as if they were on the same scale, it is important to standardize the features prior to using a KNN classifier.

15.3 Identifying the Best Neighborhood Size

Problem

You want to select the best value for k in a k-nearest neighbors classifier.

Solution

Use model selection techniques like GridSearchCV:

# Load libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create standardizer
standardizer = StandardScaler()

# Standardize features
features_standardized = standardizer.fit_transform(features)

# Create a KNN classifier
knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)

# Create a pipeline
pipe = Pipeline([("standardizer", standardizer), ("knn", knn)])

# Create space of candidate values
search_space = [{"knn__n_neighbors": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}]

# Create grid search
classifier = GridSearchCV(
    pipe, search_space, cv=5, verbose=0).fit(features_standardized, target)

Discussion

The size of k has real implications in KNN classifiers. In machine learning we are trying to find a balance between bias and variance, and in few places is that as explicit as the value of k. If k = n where n is the number of observations, then we have high bias but low variance. If k = 1, we will have low bias but high variance. The best model will come from finding the value of k that balances this bias-variance trade-off. In our solution, we used GridSearchCV to conduct five-fold cross-validation on KNN classifiers with different values of k. When that is completed, we can see the k that produces the best model:

# Best neighborhood size (k)
classifier.best_estimator_.get_params()["knn__n_neighbors"]
6
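We can also view the mean cross-validated accuracy achieved by that value of k (the exact number will depend on the data and the folds):

# Mean cross-validated accuracy of the best model
classifier.best_score_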

15.4 Creating a Radius-Based Nearest Neighbor Classifier

Problem

Given an observation of unknown class, you need to predict its class based on the class of all observations within a certain distance.

Solution

Use RadiusNeighborsClassifier:

# Load libraries
from sklearn.neighbors import RadiusNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn import datasets

# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create standardizer
standardizer = StandardScaler()

# Standardize features
features_standardized = standardizer.fit_transform(features)

# Train a radius neighbors classifier
rnn = RadiusNeighborsClassifier(
    radius=.5, n_jobs=-1).fit(features_standardized, target)

# Create an observation
new_observations = [[ 1,  1,  1,  1]]

# Predict the class of the observation
rnn.predict(new_observations)
array([2])

Discussion

In KNN classification, an observation’s class is predicted from the classes of its k neighbors. A less common technique is radius-based nearest neighbor (RNN) classification, in which an observation’s class is predicted from the classes of all observations within a given radius r. In scikit-learn, RadiusNeighborsClassifier is very similar to KNeighborsClassifier, with the exception of two parameters. First, in RadiusNeighborsClassifier we need to specify, using the radius parameter, the radius of the fixed area used to determine whether an observation is a neighbor. Unless there is some substantive reason for setting radius to a particular value, it is best to treat it like any other hyperparameter and tune it during model selection. The second useful parameter is outlier_label, which indicates what label to give an observation that has no observations within the radius, which itself can often be a useful tool for identifying outliers.
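For instance, here is a minimal sketch (an assumption, not part of the original recipe) of using outlier_label to flag observations that have no training observations within the radius:

# Train a radius neighbors classifier that labels observations with no neighbors as -1
rnn_outlier = RadiusNeighborsClassifier(
    radius=.5, outlier_label=-1, n_jobs=-1).fit(features_standardized, target)

# An observation far from all training observations should receive the -1 label
rnn_outlier.predict([[100, 100, 100, 100]])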