K-means is a clustering algorithm that partitions a set of data points into K
groups based on similarity. It works by repeatedly assigning each data point
to the cluster with the closest mean and then updating the mean of each
cluster based on the new assignments. The process terminates when the cluster
assignments no longer change or a maximum number of iterations is reached.
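The assign-then-update loop described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not scikit-learn's implementation: the function name `kmeans_sketch`, its parameters, and the toy data are hypothetical, and it assumes random initialization from the data points rather than k-means++.

```python
import numpy as np

def kmeans_sketch(points, k, max_iter=100, seed=0):
    """Plain assign/update K-means loop; returns (labels, centroids)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random (an assumption;
    # scikit-learn defaults to k-means++ initialization instead).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Stop once no point changes cluster.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members
        # (an empty cluster keeps its previous centroid).
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Two well-separated blobs as toy data (made up for illustration).
toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centroids = kmeans_sketch(toy, k=2)
print(labels)
print(centroids)
```

The cluster numbers themselves are arbitrary: which blob ends up as cluster 0 depends on the random starting centroids.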
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
iris = load_iris()
X = pd.DataFrame(iris.data, columns = iris.feature_names)
X_scaled = (X - X.mean()) / X.std()
wss = []
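for i in range(1, 11):
    kmeans_fit = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
    kmeans_fit.fit(X_scaled)
    wss.append(kmeans_fit.inertia_)  # inertia_ is the within-cluster sum of squares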
plt.plot(range(1, 11), wss)
plt.title('Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('Within cluster sum of squares')
plt.show()  # display the elbow plot before starting the cluster plot
kmeans_fit = KMeans(n_clusters = 3, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
y_kmeans = kmeans_fit.fit_predict(X_scaled)
plt.scatter(X_scaled[y_kmeans == 0].iloc[:, 0], X_scaled[y_kmeans == 0].iloc[:, 1], s = 100, c = 'red', label = 'Cluster 1')
plt.scatter(X_scaled[y_kmeans == 1].iloc[:, 0], X_scaled[y_kmeans == 1].iloc[:, 1], s = 100, c = 'blue', label = 'Cluster 2')
plt.scatter(X_scaled[y_kmeans == 2].iloc[:, 0], X_scaled[y_kmeans == 2].iloc[:, 1], s = 100, c = 'green', label = 'Cluster 3')
plt.scatter(kmeans_fit.cluster_centers_[:, 0], kmeans_fit.cluster_centers_[:, 1], s = 200, c = 'yellow', label = 'Centroids')
plt.title('K-means Clustering')
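plt.xlabel('Sepal Length (standardized)')
plt.ylabel('Sepal Width (standardized)')
plt.legend()  # show the cluster and centroid labels set above
plt.show()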
We first imported the necessary Python packages: pandas, numpy, matplotlib, and scikit-learn. Then we loaded the Iris dataset with scikit-learn's load_iris function and preprocessed it by standardizing each feature to zero mean and unit variance.
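As a quick check on the scaling step, the sketch below compares the manual formula with scikit-learn's `StandardScaler`. The comparison is an addition for illustration; variable names such as `X_manual` and `X_sklearn` are hypothetical. One detail worth knowing is that pandas' `std()` uses the sample estimate (ddof=1) by default, whereas `StandardScaler` divides by the population standard deviation (ddof=0).

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)

# Manual z-score; pandas' std() defaults to the sample estimate (ddof=1).
X_manual = (X - X.mean()) / X.std()

# StandardScaler divides by the population standard deviation (ddof=0).
X_sklearn = StandardScaler().fit_transform(X)

n = len(X)
# Each column of the manual version has mean ~0 and sample std 1.
print(X_manual.mean().round(6).tolist())
print(X_manual.std().round(6).tolist())

# The two scalings differ only by the constant factor sqrt(n / (n - 1)).
same_up_to_factor = np.allclose(X_manual.to_numpy() * np.sqrt(n / (n - 1)), X_sklearn)
print(same_up_to_factor)
```

Because the factor is the same for every feature, either scaling should lead K-means to the same partition; only the absolute WSS values change.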
We then used the elbow method to choose the number of clusters for K-means. The elbow method plots the within-cluster sum of squares (WSS) for a range of k values and selects the k at which the rate of decrease in WSS begins to level off. Here, we chose k = 3.
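To make the WSS quantity concrete, the sketch below recomputes it by hand as the sum of squared distances from each point to its assigned centroid and checks it against the model's `inertia_` attribute. The names `manual_wss` and `drops` are hypothetical, and reading the elbow from the printed drops is still a judgment call rather than an automatic rule.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data
# Same standardization as the pandas version (sample std, ddof=1).
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

wss = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0).fit(X_scaled)
    # WSS by hand: squared distance from each point to its assigned centroid.
    assigned = model.cluster_centers_[model.labels_]
    manual_wss = float(((X_scaled - assigned) ** 2).sum())
    # Sanity check: the hand-computed value matches scikit-learn's inertia_.
    assert np.isclose(manual_wss, model.inertia_)
    wss.append(manual_wss)

# Drop in WSS gained by each extra cluster; the elbow is where this flattens.
drops = -np.diff(wss)
for k, drop in zip(range(2, 11), drops):
    print(f"k={k}: WSS drops by {drop:.1f}")
```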
Next, we fitted a K-means model with k = 3 and used its fit_predict method to assign each data point to one of the three clusters. Finally, we used matplotlib to plot the results on the first two features (standardized sepal length and sepal width), with different colours indicating distinct clusters and yellow dots representing cluster centroids.
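The snippet below is a small sketch of what the fitting step returns and of one possible way to inspect the result. It assumes the known species labels in `iris.target` are used only for inspection after clustering, never for fitting; the variable name `table` is hypothetical.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
X_scaled = (X - X.mean()) / X.std()

kmeans_fit = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0)
y_kmeans = kmeans_fit.fit_predict(X_scaled)

# fit_predict returns the training labels, which are also stored in labels_.
print((y_kmeans == kmeans_fit.labels_).all())

# Rows are the true species, columns are cluster ids. Cluster ids are
# arbitrary, so a good clustering need not put its counts on the diagonal.
table = confusion_matrix(iris.target, y_kmeans)
print(table)
```

Because the cluster numbering is arbitrary, passing these labels straight to accuracy_score would be misleading unless the clusters are first matched to species.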
Overall, this code shows how to run K-means clustering in Python concisely, and how the elbow method can guide the choice of the number of clusters.