In much of this book we have looked at supervised machine learning—where we have access to both the features and the target. This is, unfortunately, not always the case. Frequently, we run into situations where we only know the features. For example, imagine we have records of sales from a grocery store and we want to break up sales by whether or not the shopper is a member of a discount club. This would be impossible using supervised learning because we don’t have a target to train and evaluate our models. However, there is another option: unsupervised learning. If the behavior of discount club members and nonmembers in the grocery store is actually disparate, then the average difference in behavior between two members will be smaller than the average difference in behavior between a member and nonmember shopper. Put another way, there will be two clusters of observations.
The goal of clustering algorithms is to identify those latent groupings of observations, which if done well, allow us to predict the class of observations even without a target vector. There are many clustering algorithms and they have a wide variety of approaches to identifying the clusters in data. In this chapter, we will cover a selection of clustering algorithms using scikit-learn and how to use them in practice.
You want to group observations into k groups.
# Load libraries
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Load data
iris = datasets.load_iris()
features = iris.data

# Standardize features
scaler = StandardScaler()
features_std = scaler.fit_transform(features)

# Create k-means object
cluster = KMeans(n_clusters=3, random_state=0, n_jobs=-1)

# Train model
model = cluster.fit(features_std)
k-means clustering is one of the most common clustering techniques. In k-means clustering, the algorithm attempts to group observations into k groups, with each group having roughly equal variance. The number of groups, k, is specified by the user as a hyperparameter. Specifically, in k-means:
k cluster “center” points are created at random locations.
For each observation:
The distance between each observation and the k center points is calculated.
The observation is assigned to the cluster of the nearest center point.
The center points are moved to the means (i.e., centers) of their respective clusters.
Steps 2 and 3 are repeated until no observation changes in cluster membership.
At this point the algorithm is considered converged and stops.
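For intuition, the steps above can be written out directly. What follows is a minimal NumPy sketch of that loop, not scikit-learn's optimized implementation; the function name run_kmeans, the choice of initial centers, and the iteration cap are all illustrative assumptions:

# Load library
import numpy as np

def run_kmeans(X, k, n_iterations=100, seed=0):
    # Step 1: create k center points at random locations
    # (here: k randomly chosen observations)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(n_iterations):
        # Step 2: assign each observation to its nearest center point
        distances = np.linalg.norm(X[:, np.newaxis] - centers, axis=2)
        labels = distances.argmin(axis=1)

        # Step 3: move each center to the mean of its assigned observations
        # (assumes no cluster becomes empty)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])

        # Converged: no center moved, so no membership will change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers

    return labels, centers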
It is important to note three things about k-means. First, k-means clustering assumes the clusters are convex shaped (e.g., a circle, a sphere). Second, it assumes all features are equally scaled; in our solution, we standardized the features to meet this assumption. Third, it assumes the groups are balanced (i.e., have roughly the same number of observations). If we suspect that we cannot meet these assumptions, we might try other clustering approaches.
In scikit-learn, k-means clustering is implemented in the KMeans
class. The most important parameter is n_clusters, which sets the
number of clusters k. In some situations, the nature of
the data will determine the value for k (e.g., data on a
school’s students will have one cluster per grade), but often we don’t
know the number of clusters. In these cases, we will want to select
k based on some criterion. For example, silhouette
coefficients (see Recipe 11.9) measure the similarity within clusters
compared with the similarity between clusters. Furthermore, because
k-means clustering is computationally expensive, we might want to take
advantage of all the cores on our computer. We can do this by setting
n_jobs=-1.
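For instance, one hedged way to pick k when it is not known in advance is to fit the model for several candidate values and compare their silhouette coefficients, keeping the value with the highest score. The candidate range below is an arbitrary illustration and reuses features_std from the solution:

# Load libraries
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Compare the mean silhouette coefficient across candidate values of k
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(features_std)
    print(k, silhouette_score(features_std, labels))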
In our solution, we cheated a little and used the Iris flower data, in
which we know there are three classes. Therefore, we set
k = 3. We can use labels_ to see the predicted classes of each observation:
# View predicted classes
model.labels_
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 0, 0, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2,
2, 0, 2, 2, 2, 2, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2, 2,
0, 0, 0, 0, 2, 0, 2, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0,
2, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 2], dtype=int32)
If we compare this to the observations' true classes, we can see that, despite the difference in label values (e.g., the first class is labeled 1 by k-means but 0 in the target vector), k-means did reasonably well:
# View true class
iris.target
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
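One hedged way to make that comparison concrete is to cross-tabulate the predicted clusters against the true classes; pandas is used here purely for illustration:

# Load library
import pandas as pd

# Cross-tabulate predicted cluster labels against true class labels
pd.crosstab(model.labels_, iris.target,
            rownames=['Cluster'], colnames=['True class'])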
However, as you might imagine, the performance of k-means drops considerably, even critically, if we select the wrong number of clusters.
Finally, as with other scikit-learn estimators, we can use the trained model to predict the cluster of new observations:
# Create new observation
new_observation = [[0.8, 0.8, 0.8, 0.8]]

# Predict observation's cluster
model.predict(new_observation)
array([0], dtype=int32)
The observation is predicted to belong to the cluster whose center point
is closest. We can even use cluster_centers_ to see those center
points:
# View cluster centers
model.cluster_centers_
array([[ 1.13597027, 0.09659843, 0.996271 , 1.01717187],
[-1.01457897, 0.84230679, -1.30487835, -1.25512862],
[-0.05021989, -0.88029181, 0.34753171, 0.28206327]])
Use mini-batch k-means:
# Load libraries
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import MiniBatchKMeans

# Load data
iris = datasets.load_iris()
features = iris.data

# Standardize features
scaler = StandardScaler()
features_std = scaler.fit_transform(features)

# Create mini-batch k-means object
cluster = MiniBatchKMeans(n_clusters=3, random_state=0, batch_size=100)

# Train model
model = cluster.fit(features_std)
Mini-batch k-means works similarly to the k-means algorithm discussed in Recipe 19.1. Without going into too much detail, the difference is that in mini-batch k-means the most computationally costly step is conducted on only a random sample of observations as opposed to all observations. This approach can significantly reduce the time required for the algorithm to find convergence (i.e., fit the data) with only a small cost in quality.
MiniBatchKMeans works similarly to KMeans, with one significant
difference: the batch_size parameter. batch_size controls the number of randomly selected observations in each batch. The larger the size of the batch, the more computationally costly the training process.
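As a hedged illustration of that trade-off, we could time both estimators on the same standardized features. On a dataset as small as Iris the difference may be negligible (or even reversed); the benefit shows up on much larger datasets:

# Load libraries
import time
from sklearn.cluster import KMeans, MiniBatchKMeans

# Time standard k-means
start = time.time()
KMeans(n_clusters=3, random_state=0).fit(features_std)
kmeans_seconds = time.time() - start

# Time mini-batch k-means with a batch of 100 observations
start = time.time()
MiniBatchKMeans(n_clusters=3, random_state=0, batch_size=100).fit(features_std)
minibatch_seconds = time.time() - start

print(kmeans_seconds, minibatch_seconds)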
Use meanshift clustering:
# Load libraries
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import MeanShift

# Load data
iris = datasets.load_iris()
features = iris.data

# Standardize features
scaler = StandardScaler()
features_std = scaler.fit_transform(features)

# Create meanshift object
cluster = MeanShift(n_jobs=-1)

# Train model
model = cluster.fit(features_std)
One of the disadvantages of k-means clustering we discussed previously is that we needed to set the number of clusters, k, prior to training, and the method made assumptions about the shape of the clusters. One clustering algorithm without these limitations is meanshift.
Meanshift is a simple concept, but somewhat difficult to explain. Therefore, an analogy might be the best approach. Imagine a very foggy football field (i.e., a two-dimensional feature space) with 100 people standing on it (i.e., our observations). Because it is foggy, a person can only see a short distance. Every minute each person looks around and takes a step in the direction of the most people they can see. As time goes on, people start to group up as they repeatedly take steps toward larger and larger crowds. The end result is clusters of people around the field. People are assigned to the clusters in which they end up.
scikit-learn’s actual implementation of meanshift, MeanShift, is more
complex but follows the same basic logic. MeanShift has two important
parameters we should be aware of. First, bandwidth sets the radius of the area (i.e., kernel) an observation uses to determine the direction to shift. In our analogy, bandwidth was how far a person could see through the fog. We can set this parameter manually, but by default a reasonable bandwidth is estimated automatically (with a significant increase in computational cost). Second, sometimes in meanshift there are no other observations within an observation’s kernel. That is, a person on our football field cannot see a single other person. By default, MeanShift assigns all these “orphan” observations to the kernel of the nearest observation. However, if we want to leave out these orphans, we can set cluster_all=False wherein orphan observations are given the label of -1.
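As a hedged sketch of those two parameters, we could set a bandwidth explicitly with scikit-learn's estimate_bandwidth helper and opt out of assigning orphans; the quantile value and variable names below are illustrative choices:

# Load library
from sklearn.cluster import MeanShift, estimate_bandwidth

# Estimate a bandwidth from the data (quantile is an illustrative choice)
bandwidth = estimate_bandwidth(features_std, quantile=0.2)

# Create a meanshift object that leaves orphan observations unassigned
cluster_manual = MeanShift(bandwidth=bandwidth, cluster_all=False)

# Train model; any orphan observations receive the label -1
model_manual = cluster_manual.fit(features_std)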
Use DBSCAN clustering:
# Load libraries
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# Load data
iris = datasets.load_iris()
features = iris.data

# Standardize features
scaler = StandardScaler()
features_std = scaler.fit_transform(features)

# Create DBSCAN object
cluster = DBSCAN(n_jobs=-1)

# Train model
model = cluster.fit(features_std)
DBSCAN is motivated by the idea that clusters will be areas where many observations are densely packed together and makes no assumptions of cluster shape. Specifically, in DBSCAN:
A random observation, xi, is chosen.
If xi has a minimum number of close neighbors, we consider it to be part of a cluster.
Step 2 is repeated recursively for all of xi's neighbors, then the neighbors' neighbors, and so on. These are the cluster's core observations.
Once step 3 runs out of nearby observations, a new random point is chosen (i.e., restarting step 1).
Once this is complete, we have a set of core observations for a number of clusters. Finally, any observation close to a cluster but not a core sample is considered part of a cluster, while any observation not close to the cluster is labeled an outlier.
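To see that distinction in the fitted model, one hedged option is to inspect the core_sample_indices_ attribute that DBSCAN exposes alongside labels_, counting core observations, non-core cluster members, and outliers:

# Load library
import numpy as np

# Number of core observations found during training
n_core = len(model.core_sample_indices_)

# Outliers are labeled -1; everything else that is not core is a border point
n_outliers = np.sum(model.labels_ == -1)
n_border = len(model.labels_) - n_core - n_outliers

print(n_core, n_border, n_outliers)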
DBSCAN has three main parameters to set (all three are written out explicitly in the sketch after this list):

eps
    The maximum distance from an observation for another observation to be considered its neighbor.

min_samples
    The minimum number of observations within eps distance of an observation for that observation to be considered a core observation.

metric
    The distance metric used by eps, for example minkowski or euclidean (note that if Minkowski distance is used, the parameter p can be used to set the power of the Minkowski metric).
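For reference, here is a sketch with all three parameters made explicit; the values shown are scikit-learn's defaults, so this fits the same model as the solution above:

# Load library
from sklearn.cluster import DBSCAN

# Create a DBSCAN object with the neighborhood settings made explicit
cluster = DBSCAN(eps=0.5, min_samples=5, metric='euclidean', n_jobs=-1)

# Train model
model = cluster.fit(features_std)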
If we look at the clusters in our training data we can see two clusters
have been identified, 0 and 1, while outlier observations are
labeled -1:
# Show cluster membership
model.labels_
array([ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, 0,
0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1,
0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
1, 1, 1, 1, 1, -1, -1, 1, -1, -1, 1, -1, 1, 1, 1, 1, 1,
-1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
-1, 1, -1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, -1, 1, -1, 1,
1, 1, 1, -1, -1, -1, -1, -1, 1, 1, 1, 1, -1, 1, 1, -1, -1,
-1, 1, 1, -1, 1, 1, -1, 1, 1, 1, -1, -1, -1, 1, 1, 1, -1,
-1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1])
Use agglomerative clustering:
# Load libraries
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

# Load data
iris = datasets.load_iris()
features = iris.data

# Standardize features
scaler = StandardScaler()
features_std = scaler.fit_transform(features)

# Create agglomerative clustering object
cluster = AgglomerativeClustering(n_clusters=3)

# Train model
model = cluster.fit(features_std)
Agglomerative clustering is a powerful, flexible hierarchical
clustering algorithm. In agglomerative clustering, all observations
start as their own clusters. Next, clusters meeting some criteria are
merged together. This process is repeated, growing clusters until some
end point is reached. In scikit-learn, AgglomerativeClustering uses
the linkage parameter to determine the merging strategy to minimize the following:
Variance of merged clusters (ward)
Average distance between observations from pairs of clusters (average)
Maximum distance between observations from pairs of clusters (complete)
Two other parameters are useful to know. First, the affinity parameter determines the distance metric used for linkage (e.g., euclidean, manhattan, cosine). Second, n_clusters sets the number of clusters the clustering algorithm will attempt to find. That is, clusters are successively merged until there are only n_clusters remaining.
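As a hedged sketch of those options, we might swap in a different merging strategy and distance metric; complete linkage with Manhattan distance below is just one arbitrary combination (ward, the default, works only with Euclidean distance). Distinct variable names are used so the model from the solution is left untouched:

# Load library
from sklearn.cluster import AgglomerativeClustering

# Create an agglomerative clustering object using complete linkage
# and Manhattan distance
cluster_complete = AgglomerativeClustering(n_clusters=3,
                                           linkage='complete',
                                           affinity='manhattan')

# Train model
model_complete = cluster_complete.fit(features_std)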
As with other clustering algorithms we have covered, we can use
labels_ to see the cluster to which every observation is assigned:
# Show cluster membership
model.labels_
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,
1, 1, 1, 1, 0, 0, 0, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 0, 2, 0, 2, 2, 2,
2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 0, 2, 0, 0, 2, 2, 2, 2, 0,
2, 2, 2, 2, 2, 0, 2, 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])