K-Means Clustering

K-means clustering is one of the most popular and, at the same time, simplest types of unsupervised learning. Its purpose is to separate a given set of data points into groups called clusters, in which each data point is assigned to the cluster with the nearest mean value. This method is particularly useful for identifying natural groupings in the data, as shown in Figure 1. In K-means analysis, the number of clusters K is chosen first; the cluster centers are then initialized and repeatedly updated so as to reduce the variance within every cluster. It is a form of vector quantization that partitions the data into K clusters, with each data point assigned to the cluster whose mean is closest. Common applications include customer segmentation, image processing, and social network analysis.

Figure 1
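
Formally, given a chosen number of clusters K, K-means looks for centroids that minimize the total within-cluster sum of squared distances:

$$J = \sum_{k=1}^{K} \sum_{\mathbf{x}_i \in C_k} \|\mathbf{x}_i - \mu_k\|^2$$

The iterative procedure described below does not guarantee the global minimum of this objective, but it decreases it at every step.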

Here is how it works:

  • Initialize K centroids randomly. These are the K initial cluster centers (centroids), represented by $\mu_1, \mu_2, \ldots, \mu_K$.
  • Every data point is allocated to the nearest centroid, forming K clusters. $$C_k = \{ \mathbf{x}_i : \|\mathbf{x}_i - \mu_k\|^2 \leq \|\mathbf{x}_i - \mu_j\|^2 \text{ for all } j = 1, \ldots, K \}$$ where $C_k$ is the cluster to which data point $\mathbf{x}_i$ belongs and $\|\mathbf{x}_i - \mu_j\|^2$ is the squared Euclidean distance between the data point $\mathbf{x}_i$ and the centroid $\mu_j$.
  • Each centroid is recalculated as the average of all the points in its cluster. $$\mu_j = \frac{1}{|C_j|} \sum_{\mathbf{x}_i \in C_j} \mathbf{x}_i$$ where $C_j$ is the set of points assigned to the $j$-th cluster and $|C_j|$ is the number of points in the $j$-th cluster.
  • Repeat the assignment and update steps until the centroids no longer change significantly (a minimal NumPy sketch of this loop follows the list).
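
As a rough illustration of these steps, here is a minimal NumPy sketch of the assignment-update loop. The function name, tolerance, and random seed are placeholder choices for this example; in practice a library implementation such as scikit-learn's KMeans, used later in this section, is preferable.

import numpy as np

def kmeans_sketch(X, k, max_iters=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick k random data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assignment: squared Euclidean distance from every point to every centroid
        distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = distances.argmin(axis=1)
        # Update: each centroid becomes the mean of the points assigned to it
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when the centroids no longer change significantly
        if np.allclose(new_centroids, centroids, atol=tol):
            break
        centroids = new_centroids
    return centroids, labels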

K-means works well on small and medium-sized datasets, but its results depend on the initial placement of the centroids and are sensitive to outliers. This also highlights the contrast with the previously mentioned learning type, supervised learning, which uses labeled information to make predictions about a specific target variable; clustering, by contrast, aims to discover group structures or patterns in the data without any reference to labeled outcomes.
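
One common way to reduce the sensitivity to initialization is to run the algorithm several times from different random starting centroids and keep the run with the lowest within-cluster variance; scikit-learn's KMeans supports this through its n_init parameter. The small sketch below illustrates the idea (the data values, n_init=10, and random_state=42 are arbitrary choices for this example):

import numpy as np
from sklearn.cluster import KMeans

# A few illustrative customers: [spend on sports products, spend on digital products]
X = np.array([[105, 210], [125, 230], [215, 500], [225, 530], [315, 1000], [320, 1050]])

# Restart K-means from 10 random initializations and keep the best run,
# i.e. the one with the lowest inertia (within-cluster sum of squared distances)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X)
print(kmeans.inertia_)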

Practical example:

For instance, suppose the points in our space are the customers who visit a particular store; then, we have to cluster these points based on their shopping patterns.

  • Data: In our dataset, two features are present: the amount spent on sports products and the amount spent on digital products.
  • Goal: For the purpose of this example, the clusters for the sample dataset will be set to three (k=3).

Step-by-step:

  • Initialization: Three points are chosen at random from the dataset and act as the starting centroids.
  • Assignment: The distance from each data point to each centroid is computed, and every data point is assigned to its nearest centroid.
  • Update: Each new centroid is the mean of all the points falling in the same cluster.
  • Iteration: The assignment and update steps are repeated until the centroids no longer shift (a single iteration is traced in the sketch after this list).
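
To make these steps concrete, the short sketch below traces a single assignment-and-update iteration on a handful of hypothetical customers; the spending values and starting centroids are invented purely for illustration.

import numpy as np

# Hypothetical customers: [amount spent on sports products, amount spent on digital products]
points = np.array([[110, 220], [220, 510], [310, 1020], [130, 240]])

# Hypothetical starting centroids (k = 3), e.g. three customers picked at random
centroids = np.array([[110, 220], [220, 510], [310, 1020]])

# Assignment: index of the nearest centroid for every customer
distances = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
labels = distances.argmin(axis=1)
print(labels)            # [0 1 2 0]

# Update: each centroid moves to the mean of the customers assigned to it
new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(3)])
print(new_centroids[0])  # [120. 230.], the mean of [110, 220] and [130, 240]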

The following code is implemented in Python with the scikit-learn module. It partitions the data into K clusters, with each point assigned to the nearest cluster center. K-means minimizes the within-cluster variance, so it is most effective when the clusters are roughly spherical and of similar size.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Sample data: amount spent on sports and digital products
X = np.array([
    [105, 210], [125, 230], [140, 250], [160, 265], [210, 450], 
    [215, 500], [225, 530], [240, 560], [290, 990], [315, 1000],
    [320, 1050], [330, 1070]
])

# Number of clusters
k = 3

# Create KMeans instance (n_init restarts; fixed random_state for reproducible results)
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)

# Fit the model
kmeans.fit(X)

# Get cluster centroids
centroids = kmeans.cluster_centers_

# Get labels for each point
labels = kmeans.labels_

# Plot the data points with cluster assignments
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.scatter(centroids[:, 0], centroids[:, 1], s=300, c='red', marker='X')
plt.xlabel('Amount Spent on Sports Products')
plt.ylabel('Amount Spent on Digital Products')
plt.title('K-Means Clustering')
plt.show()

Explanation:

  • Data preparation: A small sample dataset of amounts spent on sports and digital products is created for the example.
  • Number of Clusters: The number of clusters k is set to 3.
  • KMeans Instance: A KMeans object is created with the number of clusters to be formed as its main parameter.
  • Fit the Model: The fit method computes the cluster centroids and assigns each data point to its nearest cluster.
  • Plotting: The matplotlib library is used to visualize the clusters. The data points are colored according to their cluster labels, and the centroids are marked with a red X.
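
As a small follow-up, the fitted model can also place a new customer into one of the discovered segments. The snippet below continues the earlier script (the kmeans object and centroids variable are assumed to still be in scope, and the spending values are made up for illustration):

# A new customer who spent 150 on sports products and 260 on digital products
new_customer = np.array([[150, 260]])

# predict() assigns the point to the nearest learned centroid
cluster = kmeans.predict(new_customer)[0]
print("New customer belongs to cluster", cluster)
print("Centroid of that cluster:", centroids[cluster])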