Introduction to Clustering Algorithms

Introduction
Clustering algorithms play an important role in data analysis. These unsupervised, exploratory tools support knowledge discovery by grouping data points into distinct clusters based on shared characteristics, revealing relationships and trends that may be hard to see in the raw data. By bringing structure to complex datasets, they enable more informed decision making.
In this article, we will cover the basics of three types of clustering algorithms: Hierarchical, Partitional, and Density-Based clustering models. We will begin by defining each of these categories. Next, we will dive into 10 different clustering algorithms, providing definitions, links to the original or otherwise helpful research papers, strengths of each algorithm, and Python code snippets.
Table of Contents
Hierarchical Clustering Algorithms
Partitional Clustering Algorithms
Density-Based Clustering Algorithms
Hierarchical Clustering Algorithms
Definition: Hierarchical clustering is a method of cluster analysis that builds a hierarchy of clusters. It can be visualized as a tree structure (dendrogram) where the leaves represent individual data points and the root represents a single cluster containing all data points.
Use Cases:
- Taxonomy problems.
- When vertical (parent-child) relationships are important in the data.
Strengths:
- Provides a hierarchical structure of clusters.
- No need to specify the number of clusters beforehand.
Weaknesses:
- Sensitive to noise and outliers.
- Computationally intensive for large datasets.
Birch
Summary:
The Birch algorithm, short for "Balanced Iterative Reducing and Clustering using Hierarchies," is a hierarchical clustering algorithm designed for scalability and efficiency, particularly suited for large datasets. It uses a two-step process to build clusters:
- Construct a Clustering Feature (CF) tree that summarizes the dataset's distribution. The CF tree allows for efficient memory usage and incremental updates.
- Apply a clustering mechanism based on the leaf nodes of the CF tree to form cohesive clusters.
Original Paper / Helpful Research:
BIRCH: an efficient data clustering method for very large databases: ACM SIGMOD Record: Vol 25, No…
Strengths:
Birch is known for its ability to handle large volumes of data and its resilience to outliers.
Python Code:
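Below is a minimal sketch using scikit-learn's Birch implementation; the toy dataset and parameter values are illustrative rather than tuned.

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Toy dataset (illustrative only)
X, _ = make_blobs(n_samples=1000, centers=4, random_state=42)

# threshold controls the radius of the CF subclusters; n_clusters sets the
# global clustering step applied to the CF tree's leaf entries
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=4)
labels = birch.fit_predict(X)
```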
Cure
Summary:
The CURE (Clustering Using Representatives) algorithm is an agglomerative hierarchical clustering method designed to address the limitations of traditional centroid-based algorithms like K-Means, especially when dealing with non-spherical and arbitrarily shaped clusters. CURE represents each cluster with a fixed number of well-scattered points, called representatives, which are then "shrunk" toward the cluster's center of mass, effectively capturing the cluster's geometry. CURE is known for its robustness against outliers, its ability to handle clusters of varying shapes and sizes, and its improved performance in scenarios where traditional centroid-based algorithms might fail.
Original Paper:
CURE: an efficient clustering algorithm for large databases: ACM SIGMOD Record: Vol 27, No 2
Strengths:
CURE is robust to outliers and is able to identify clusters with varied shapes.
Python Code:
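scikit-learn does not ship CURE, but the pyclustering library provides an implementation. A sketch assuming pyclustering's API (argument order and names may differ by version), on an illustrative toy dataset:

```python
from pyclustering.cluster.cure import cure
from sklearn.datasets import make_moons

# Toy dataset with non-spherical clusters (illustrative only)
X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)
data = X.tolist()  # pyclustering expects a plain list of points

# data, number of clusters, representatives per cluster, shrink (compression) factor
cure_instance = cure(data, 2, 5, 0.5)
cure_instance.process()
clusters = cure_instance.get_clusters()  # lists of point indices, one per cluster
```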
ROCK
Summary:
The ROCK (Robust Clustering using Links) algorithm is an agglomerative hierarchical clustering method designed to address challenges in clustering datasets with categorical attributes. It introduces the concept of "links," which measure the proximity between data points with categorical attributes. Utilizing a "goodness measure," ROCK aims to identify clusters by evaluating the similarity of objects within them. The algorithm is particularly useful for datasets with mixed attribute types, providing a global approach to clustering. However, ROCK may produce ambiguous results if parameter choices in the static model differ significantly from the dataset being clustered, and it may struggle to accurately define clusters with different sizes and shapes.
Original Paper / Helpful Research:
Rock: A robust clustering algorithm for categorical attributes
Strengths:
ROCK scales well with increasing dimensionality and can measure similarity within clusters using a global approach. This algorithm works well with categorical variables.
Python Code:
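As with CURE, one option is the pyclustering library's ROCK implementation. A sketch assuming pyclustering's API, with a tiny one-hot encoded categorical dataset; the data, the connectivity radius, and the number of clusters are all illustrative:

```python
from pyclustering.cluster.rock import rock

# Tiny categorical dataset, one-hot encoded so that neighbors/links can be
# derived from a connectivity radius (illustrative only)
samples = [
    [1, 0, 1, 0],  # e.g. {color=red, size=small}
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 1],  # e.g. {color=blue, size=large}
    [0, 1, 0, 1],
]

# data, connectivity radius (eps) used to define links, number of clusters
rock_instance = rock(samples, 1.5, 2)
rock_instance.process()
clusters = rock_instance.get_clusters()  # lists of point indices, one per cluster
```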
Chameleon
Summary:
Chameleon is a dynamic clustering algorithm designed to measure the similarity of two clusters based on a dynamic model. It works in two phases:
- Build a sparse graph in which each point is linked to its k-nearest neighbors.
- Split that graph with a graph partitioning algorithm into many small sub-clusters, then iteratively merge the two most similar sub-clusters, considering both their connectivity and their closeness.
This dynamic merging criterion makes Chameleon more effective than many other algorithms when dealing with arbitrary-shaped clusters.

Original Paper / Helpful Research:
Strengths:
Chameleon measures cluster similarity based on a dynamic model and is strong at dealing with arbitrary-shaped clusters.
Python Code:
There are no existing libraries to run this algorithm. However, Moonpuck has implemented the Chameleon algorithm, and their code can be explored here on GitHub.
Partitional Clustering Algorithms
Definition: Partitional clustering divides data into non-overlapping subsets (partitions) based on similarity. The most common method, which you've likely heard of, is the K-Means algorithm, where data points are assigned to the nearest centroid.
Use-Cases:
- Large datasets, since these algorithms have relatively low computational requirements.
- When the coherence of clustering is more critical than hierarchical structure.
Strengths:
- Efficient for large datasets.
- Suitable for spherical-shaped clusters.
Weaknesses:
- Sensitive to noise and outliers.
- Assumes hyper-ellipsoidal shapes for clusters.
K-Means
Summary:
The K-Means algorithm is a partitioning-based clustering technique widely used for grouping data points into K distinct clusters. The steps of the algorithm are as follows:
- Randomly select K centroids in the feature space.
- Assign each data point to the nearest centroid.
- Iteratively update the centroids based on the mean of the points within each cluster.
- Repeat the assignment and update steps until convergence, when the centroids no longer change significantly.
K-Means is computationally efficient and simple to implement, making it one of the most popular clustering algorithms; however, its performance can be sensitive to the initial placement of centroids and is influenced by outliers. Additionally, K-Means assumes clusters with a spherical shape and struggles with clusters of varying sizes and densities.
Original Paper / Helpful Research:
Strengths:
K-Means is efficient, widely studied and understood, applicable to various domains, and is strong for spherical clusters (an assumption of the model).
Python Code:
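A minimal sketch using scikit-learn's KMeans; the toy dataset and the choice of K are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy dataset (illustrative only)
X, _ = make_blobs(n_samples=500, centers=3, random_state=42)

# n_init reruns the algorithm with different centroid seeds to reduce
# sensitivity to initialization
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_
```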
K-Medoids (PAM)
Summary:
The K-Medoids algorithm, also known as Partitioning Around Medoids (PAM), is a clustering technique similar to K-Means but with a crucial difference: instead of using the mean (centroid) to represent a cluster, it employs the actual data point (medoid) that minimizes the average dissimilarity to all other points in the cluster. This makes K-Medoids more robust against outliers and resistant to the influence of extreme values. The algorithm iteratively refines cluster assignments by choosing data points as medoids and updating until a stable configuration is reached. While K-Medoids is more robust to noise than K-Means, it can still be computationally expensive due to its exhaustive search for the best medoids.
Original Paper / Helpful Research:
Strengths:
K-Medoids is more robust to outliers than the centroid-based approach used by K-Means, and it identifies cluster centers using actual data points.
Python Code:
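scikit-learn itself does not include K-Medoids; one option is the scikit-learn-extra package. A sketch assuming its KMedoids API, on an illustrative toy dataset:

```python
from sklearn_extra.cluster import KMedoids
from sklearn.datasets import make_blobs

# Toy dataset (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Medoids are actual data points; any pairwise distance metric can be used
kmedoids = KMedoids(n_clusters=3, metric="euclidean", random_state=42)
labels = kmedoids.fit_predict(X)
medoids = kmedoids.cluster_centers_  # coordinates of the chosen medoids
```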
CLARANS
Summary:
CLARANS (Clustering Large Applications based on Randomized Search) is a clustering algorithm that combines sampling techniques with the Partitioning Around Medoids (PAM) method. It employs a randomized search to discover clusters without relying on additional data structures. CLARANS is particularly robust in high-dimensional spaces, as it doesn't assume a specific distance function, and it can effectively identify clusters with non-convex shapes. However, its efficiency comes at the cost of increased computational complexity, making it potentially slower than other partitioning methods.
Original Paper / Helpful Research:
https://ieeexplore.ieee.org/abstract/document/1033770
Strengths:
CLARANS is robust against increases in dimensionality and can identify polygon-shaped objects effectively.
Python Code:
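A sketch using the pyclustering library's CLARANS implementation; the dataset and the search parameters are illustrative, and the argument order reflects pyclustering's API as I understand it, so check your installed version.

```python
from pyclustering.cluster.clarans import clarans
from sklearn.datasets import make_blobs

# Toy dataset (illustrative only); pyclustering expects a plain list of points
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
data = X.tolist()

# data, number of clusters, number of local minima to search (numlocal),
# neighbors examined per step of the randomized search (maxneighbor)
clarans_instance = clarans(data, 3, 4, 5)
clarans_instance.process()
clusters = clarans_instance.get_clusters()
medoids = clarans_instance.get_medoids()
```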
ISODATA
Summary:
The ISODATA (Iterative Self-Organizing Data Analysis Technique) algorithm is an iterative and adaptive clustering method, considered a variant of the K-Means algorithm. It dynamically adjusts clusters through splitting and merging based on user-defined thresholds, such as minimum points for each cluster, maximum variance for splitting clusters, and minimum distance for merging clusters. This adaptability allows ISODATA to handle outliers effectively through its splitting procedure and prevent the formation of elongated clusters. Despite its advantages in handling noise and varying cluster shapes, ISODATA's sensitivity to input parameters and the need for careful tuning are notable considerations.
Original Paper / Helpful Research:
Strengths:
ISODATA handles outliers better than K-Means and adjusts clusters dynamically during iterations.
Python Code:
ISODATA is a variant of K-Means that can be implemented by adapting the code in the K-Means section above. You can also use this GitHub code by PyRadar.
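For illustration, here is a heavily simplified ISODATA-style sketch built around scikit-learn's KMeans: it alternates K-Means runs with a split step for high-variance clusters and a merge step for nearby centroids. The helper function, thresholds, and stopping rule are hypothetical simplifications, not the full ISODATA procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def isodata_sketch(X, k_init=3, max_std=1.5, min_dist=1.0, n_iter=5):
    """Simplified ISODATA-style loop: run K-Means, split spread-out clusters,
    and merge centroids closer than min_dist (illustrative only)."""
    rng = np.random.default_rng(42)
    centroids = X[rng.choice(len(X), k_init, replace=False)]
    for _ in range(n_iter):
        km = KMeans(n_clusters=len(centroids), init=centroids, n_init=1).fit(X)
        labels, centroids = km.labels_, km.cluster_centers_

        # Split: replace a high-variance cluster with two centroids offset
        # along its dimension of largest spread
        new_centroids = []
        for j, c in enumerate(centroids):
            pts = X[labels == j]
            stds = pts.std(axis=0) if len(pts) > 1 else np.zeros(X.shape[1])
            if stds.max() > max_std:
                offset = np.eye(X.shape[1])[stds.argmax()] * stds.max() / 2
                new_centroids.extend([c + offset, c - offset])
            else:
                new_centroids.append(c)
        centroids = np.asarray(new_centroids)

        # Merge: average any pair of centroids closer than min_dist
        merged, used = [], set()
        for a in range(len(centroids)):
            if a in used:
                continue
            for b in range(a + 1, len(centroids)):
                if b not in used and np.linalg.norm(centroids[a] - centroids[b]) < min_dist:
                    merged.append((centroids[a] + centroids[b]) / 2)
                    used.add(b)
                    break
            else:
                merged.append(centroids[a])
        centroids = np.asarray(merged)
    return centroids
```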
Density-Based Clustering Algorithms
Definition: Density-based clustering identifies clusters as contiguous regions of high data point density separated by regions of lower density. It is based on the idea that clusters are areas of higher density compared to their surroundings.
Use-Cases:
- Dealing with arbitrary-shaped clusters.
- Working with noisy data.
Strengths:
- Capable of discovering clusters of arbitrary shapes.
- Robust against noise.
Weaknesses:
- Efficiency and accuracy can degrade on high-dimensional data.
DBSCAN
Summary:
The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm is a density-based clustering technique that identifies clusters based on the density of data points in the feature space. It classifies points into three categories:
- Core Points: points in dense regions, with at least a minimum number of neighbors within a given radius.
- Border Points: points on the outskirts of a cluster, within the radius of a core point but not dense themselves.
- Noise Points: isolated points that belong to no cluster.
DBSCAN efficiently discovers clusters with arbitrary shapes, requires no predefined number of clusters, and is robust to noise. However, its performance can be influenced by the choice of parameters such as the neighborhood radius and minimum number of points required to form a dense region.
Original Paper / Helpful Research:
A density-based algorithm for discovering clusters in large spatial databases with noise |…
Strengths:
DBSCAN is effective against noise and discovers clusters of arbitrary shapes.
Python Code:
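A minimal sketch using scikit-learn's DBSCAN; the eps and min_samples values are illustrative and typically need tuning for your data.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Toy dataset with non-spherical clusters (illustrative only)
X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

# eps: neighborhood radius; min_samples: points needed to form a dense region
dbscan = DBSCAN(eps=0.2, min_samples=5)
labels = dbscan.fit_predict(X)  # noise points are labeled -1
```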
DENCLUE
Summary:
The DENCLUE (Density Clustering) algorithm is a density-based clustering technique that determines clusters based on local density attractors, which are local maxima of an overall density function. It employs an influence function to model the impact of each data point on its neighborhood, and the density function is the cumulative sum of these influences. DENCLUE is designed to identify clusters with arbitrary shapes and exhibits good scalability, making it suitable for datasets with unpredictable structures. However, DENCLUE is sensitive to input parameters and can be affected by the curse of dimensionality.
Original Paper / Helpful Research:
An overview of various enhancements of DENCLUE algorithm | Proceedings of the Second International…
Strengths:
DENCLUE is scalable, identifies clusters with unpredictable shapes, and is effective against noise.
Python Code:
There are no established Python libraries that support DENCLUE. See mgarrett57's implementation on GitHub here.
Conclusion
In this article, we covered 10 of the most commonly used clustering algorithms within the areas of Hierarchical, Partitional, and Density-Based Clustering. While this article is meant to cover the basics, future articles dedicated to each algorithm will go into more depth, provide additional context, and review the surrounding literature more fully. We have also linked to research papers for each algorithm as helpful starting points for further exploration. I hope this article serves as a good starting point for your data exploration projects using clustering.