K-means Clustering in Python

K-means is a clustering algorithm that partitions a set of data points into K groups based on similarity. It works by repeatedly assigning each data point to the cluster with the closest mean (the centroid) and then recomputing the mean of each cluster from the new assignments. The process stops when the cluster assignments no longer change or when a maximum number of iterations is reached.
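
To make the assign-and-update loop concrete, here is a minimal NumPy sketch of it. This is an illustration only, not scikit-learn's implementation: the function name `kmeans_sketch`, the random initialisation, and the toy data are assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(points, k, max_iter=100, seed=0):
    """Plain assign/update iteration; stops when labels no longer change."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct data points picked at random (simple init, not k-means++).
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.full(len(points), -1)
    for _ in range(max_iter):
        # Assignment step: squared Euclidean distance from every point to every center.
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable, so the means would not move either
        labels = new_labels
        # Update step: move each center to the mean of the points assigned to it.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster ends up empty
                centers[j] = members.mean(axis=0)
    return labels, centers

# Two well-separated toy blobs (made-up data for illustration only).
toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels, centers = kmeans_sketch(toy, k=2)
print(labels)
print(centers)
```

The cluster ids themselves are arbitrary; what matters is which points end up sharing an id.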

If you want to know more about the definition, advantages, disadvantages and limitations of K-means clustering, see AKSTATS' "K-means Clustering in R" article.

Now let's get to the Python code for K-means!

Load the required packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

Load the dataset

iris = load_iris()
X = pd.DataFrame(iris.data, columns = iris.feature_names)

Preprocess the data

X_scaled = (X - X.mean()) / X.std()
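
As a quick sanity check on this step, the sketch below confirms that the manual z-score gives each column zero mean and unit standard deviation, and compares it with scikit-learn's `StandardScaler`. The comparison is an addition for illustration, not part of the original workflow. Note that pandas' `std()` uses the sample standard deviation (ddof=1) while `StandardScaler` uses the population standard deviation (ddof=0), so the two results differ by a constant factor.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)

# Manual z-score as in the article; pandas' std() defaults to ddof=1.
X_scaled = (X - X.mean()) / X.std()

# StandardScaler divides by the population standard deviation (ddof=0).
X_sklearn = StandardScaler().fit_transform(X)

# The two scalings differ only by the factor sqrt((n - 1) / n).
n = len(X)
ratio = np.sqrt((n - 1) / n)
same_up_to_factor = np.allclose(X_scaled.to_numpy(), X_sklearn * ratio)

print(same_up_to_factor)
print(X_scaled.mean().abs().max())  # should be numerically zero
print(X_scaled.std().tolist())      # should all be 1.0
```

Because the factor is the same for every column, either scaling should lead K-means to the same grouping here.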

Find the optimal number of clusters using the elbow method

wss = []
for i in range(1, 11):
    kmeans_fit = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
    kmeans_fit.fit(X_scaled)
    wss.append(kmeans_fit.inertia_)
plt.plot(range(1, 11), wss)
plt.title('Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('Within cluster sum of squares')
plt.show()
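
To see what `inertia_` actually measures, the following sketch recomputes the within-cluster sum of squares by hand for k = 3 and compares it with the fitted model's value. The variable names (`data`, `model`, `wss_manual`) are chosen for this example only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Standardise the Iris features the same way as above (sample std, ddof=1).
data = load_iris().data
data = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

model = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0)
labels = model.fit_predict(data)

# Look up, for each sample, the centroid of the cluster it was assigned to,
# then sum the squared distances between samples and those centroids.
assigned_centers = model.cluster_centers_[labels]
wss_manual = ((data - assigned_centers) ** 2).sum()

print(wss_manual, model.inertia_)
```

The two printed numbers should agree up to floating-point rounding, which is why `inertia_` can be used directly as the WSS in the elbow plot.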

Fit a K-means model with the optimal number of clusters

kmeans_fit = KMeans(n_clusters = 3, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
y_kmeans = kmeans_fit.fit_predict(X_scaled)
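
The imports at the top include `accuracy_score` and `confusion_matrix`, although the walkthrough never uses them. One possible use, sketched here as an assumption rather than as the author's intended step, is to compare the clusters with the known Iris species. Cluster ids are arbitrary, so each cluster is first relabelled with its most common species; the names `cluster_to_species` and `predicted_species` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix

iris = load_iris()
data = (iris.data - iris.data.mean(axis=0)) / iris.data.std(axis=0, ddof=1)

clusters = KMeans(n_clusters=3, init='k-means++', max_iter=300,
                  n_init=10, random_state=0).fit_predict(data)

# Rows are the true species, columns are the (arbitrary) cluster ids.
cm = confusion_matrix(iris.target, clusters)
print(cm)

# Relabel each cluster with the species that occurs most often in it.
cluster_to_species = cm.argmax(axis=0)
predicted_species = cluster_to_species[clusters]

acc = accuracy_score(iris.target, predicted_species)
print(acc)
```

Keep in mind that K-means is unsupervised, so this comparison is only possible because Iris happens to come with labels.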

Plot the results

plt.scatter(X_scaled[y_kmeans == 0].iloc[:, 0], X_scaled[y_kmeans == 0].iloc[:, 1], s = 100, c = 'red', label = 'Cluster 1')
plt.scatter(X_scaled[y_kmeans == 1].iloc[:, 0], X_scaled[y_kmeans == 1].iloc[:, 1], s = 100, c = 'blue', label = 'Cluster 2')
plt.scatter(X_scaled[y_kmeans == 2].iloc[:, 0], X_scaled[y_kmeans == 2].iloc[:, 1], s = 100, c = 'green', label = 'Cluster 3')
plt.scatter(kmeans_fit.cluster_centers_[:, 0], kmeans_fit.cluster_centers_[:, 1], s = 200, c = 'yellow', label = 'Centroids')
plt.title('K-means Clustering')
plt.xlabel('Sepal Length (standardized)')
plt.ylabel('Sepal Width (standardized)')
plt.legend()
plt.show()

Summary

In this code, we first loaded the necessary Python packages, including pandas, numpy, matplotlib, and scikit-learn. Then we used scikit-learn's built-in load_iris function to load the Iris dataset and preprocessed it by scaling each feature to zero mean and unit variance.

The elbow method was then used to determine an appropriate number of clusters for K-means clustering. It consists of plotting the within-cluster sum of squares (WSS) for several values of k and selecting the value at which the rate of decrease in WSS begins to level off. In this case, we determined that k = 3 was the ideal number of clusters.

Next, we fitted a K-means model with k = 3 and used the fit_predict function to assign each data point to one of the three clusters in a single step. Finally, we used matplotlib to plot the results, with different colours indicating distinct clusters and yellow dots representing cluster centroids.

Overall, this code shows how to conduct K-means clustering in Python in a simple and succinct manner, as well as how the elbow technique may be used to determine the ideal number of clusters.
