In statistics, the sum of squared errors (SSE) measures the difference between the values predicted by a model and the actual observed values; each individual difference is known as a residual. For clustering, the error for a point is measured as its distance from the center of the cluster it is assigned to.
We will use the Euclidean distance, that is, the straight-line distance between two points, as the measure for computing the sum of squared errors.
We define the Euclidean distance as follows:
import numpy as np

def euclidean_distance_points(x1, x2):
    x3 = x1 - x2
    return np.sqrt(x3.T.dot(x3))
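As a quick sanity check, the function can be applied to two small NumPy vectors; the definition is repeated here so the snippet runs standalone (the points are illustrative, not drawn from the dataset):

```python
import numpy as np

def euclidean_distance_points(x1, x2):
    x3 = x1 - x2
    return np.sqrt(x3.T.dot(x3))

# The distance between (0, 0) and (3, 4) is the classic 3-4-5 triangle
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
print(euclidean_distance_points(a, b))  # 5.0
```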
We will call the preceding function to compute the error:
from operator import add
import time

time1 = time.time()

def ss_error(k_clusters, point):
    nearest_center = k_clusters.centers[k_clusters.predict(point)]
    return euclidean_distance_points(nearest_center, point)**2

WSSSE = data.map(lambda point: ss_error(k_clusters, point)).reduce(add)
print("Within Set Sum of Squared Error = " + str(WSSSE))
print(time.time() - time1)
Within Set Sum of Squared Error = 3.05254895755e+18
15.861504316329956
Since the data is already labeled, we will now check how these labels are distributed across the two clusters that we have generated:
clusterLabel = labelsAndData.map(lambda row: ((k_clusters.predict(row[1]), row[0]), 1)).reduceByKey(add)

for items in clusterLabel.collect():
    print(items)
((0, 'rootkit.'), 10)
((0, 'multihop.'), 7)
((0, 'normal.'), 972781)
((0, 'phf.'), 4)
((0, 'nmap.'), 2316)
((0, 'pod.'), 264)
((0, 'back.'), 2203)
((0, 'ftp_write.'), 8)
((0, 'spy.'), 2)
((0, 'warezmaster.'), 20)
((1, 'portsweep.'), 5)
((0, 'perl.'), 3)
((0, 'land.'), 21)
((0, 'portsweep.'), 10408)
((0, 'smurf.'), 2807886)
((0, 'ipsweep.'), 12481)
((0, 'imap.'), 12)
((0, 'warezclient.'), 1020)
((0, 'loadmodule.'), 9)
((0, 'guess_passwd.'), 53)
((0, 'neptune.'), 1072017)
((0, 'teardrop.'), 979)
((0, 'buffer_overflow.'), 30)
((0, 'satan.'), 15892)
The preceding output confirms the imbalance in the data: nearly all of the label types have been grouped into the same cluster (cluster 0).
We will now cluster the entire dataset, and for that we need to choose the right value of k. Since the dataset has 23 labels, we could simply choose k=23, but there are other methods for computing the value of k. The following section describes them.
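One common heuristic worth previewing is to compute the WSSSE for a range of k values and look for the point where the error stops dropping sharply. The sketch below illustrates the idea on small synthetic data with scikit-learn's KMeans (whose `inertia_` attribute is the within-cluster sum of squared errors); the data and library choice here are illustrative, not the PySpark pipeline used above:

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative synthetic data: three well-separated 2-D blobs
rng = np.random.RandomState(0)
points = np.vstack([
    rng.normal(loc=center, scale=0.3, size=(50, 2))
    for center in [(0, 0), (5, 5), (10, 0)]
])

# Compute the WSSSE (sklearn's inertia_) for a range of k values
errors = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    errors[k] = model.inertia_

for k, e in sorted(errors.items()):
    print(k, round(e, 1))
# The error drops steeply up to k=3 and flattens afterwards,
# suggesting k=3 for this synthetic data
```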