Scaling Agglomerative Clustering for Big Data

Introduction
Agglomerative clustering is one of the best clustering tools in data science, but traditional implementations fail to scale to large datasets.
In this article, I will take you through some background on agglomerative clustering, an introduction to reciprocal agglomerative clustering (RAC) based on 2021 research from Google, a runtime comparison between RAC++ and scikit-learn's AgglomerativeClustering, and finally a brief explanation of the theory behind RAC.
Background on Agglomerative Clustering
In data science, it is frequently useful to cluster unlabeled data. With applications ranging from grouping search engine results, to genotype classification, to banking anomaly detection, clustering is an essential piece of a data scientist's toolkit.
Agglomerative clustering is one of the most popular clustering approaches in data science, and for good reason. It:
✅ Is easy to use with little to no parameter tuning
✅ Creates meaningful taxonomies
✅ Works well with high-dimensional data
✅ Does not need to know the number of clusters beforehand
✅ Creates the same clusters every time
In comparison, partitioning methods like K-Means require the data scientist to guess at the number of clusters, the very popular density-based method DBSCAN requires parameters for the density calculation radius (epsilon) and minimum neighborhood size, and Gaussian mixture models make strong assumptions about the underlying cluster data distribution.
With agglomerative clustering, all you need to specify is a distance metric.
At a high level, agglomerative clustering follows the algorithm below:
1. Identify cluster distances between all pairs of clusters (each cluster begins as a single point)
2. Merge the two clusters which are closest to one another
3. Repeat
The result: a beautiful dendrogram that can then be partitioned based on domain expertise.
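To make this concrete, here is a minimal sketch (my own illustration, not code from this article's benchmarks) of producing such a dendrogram on the Iris dataset with scipy:
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

# Each point starts as its own cluster; "average" linkage repeatedly
# merges the two closest clusters until only one cluster remains.
X = load_iris().data
Z = linkage(X, method="average", metric="euclidean")

# The linkage matrix Z records every merge and its distance, so the final
# clustering cutoff can be chosen afterwards based on domain expertise.
dendrogram(Z)
plt.show()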
In fields like biology and natural language processing, clusters (of cells, genes, or words) naturally follow a hierarchical relationship. Agglomerative clustering therefore enables a more natural and data-driven selection of the final clustering cutoff.
Pictured below is a sample agglomerative clustering of the famous Iris Dataset.

So why not use agglomerative clustering for every unsupervised classification problem?
❌ Agglomerative clustering has a terrible runtime as datasets increase in size.
Unfortunately, traditional agglomerative clustering does not scale. The runtime is O(n³), or O(n²log(n)) if implemented with a min-heap. Even worse, agglomerative clustering runs sequentially on a single core and cannot be scaled up with compute.
In the field of NLP, agglomerative clustering is a top performer… for small datasets.
Reciprocal Agglomerative Clustering (RAC)
Reciprocal Agglomerative Clustering (RAC) is a method proposed by Google to scale the benefits of traditional Agglomerative clustering to larger datasets.
RAC decreases the runtime complexity while also parallelizing the operations to utilize a multi-core architecture. Despite these optimizations, RAC produces the exact same results as traditional Agglomerative clustering when the data is fully connected (see below).
Note: Fully connected data means that a distance metric can be calculated between any pair of points. Non-fully connected datasets have connectivity constraints (usually provided in the form of a linkage matrix) whereby some points are considered disconnected.

Even with connectivity constraints (data which is not fully connected), RAC and Agglomerative clustering are still typically identical, as seen in the second Swiss Roll dataset example above.
However, large discrepancies can emerge when there are very few possible clusters. The Noisy Moons dataset is a good example of this:

RAC++ scales to larger datasets than scikit-learn
We can compare [RAC++](https://github.com/porterehunley/RACplusplus) (an implementation of reciprocal agglomerative clustering) to its counterpart, AgglomerativeClustering, in scikit-learn.
Let's generate some example data with 25 dimensions, and test how long agglomerative clustering takes using racplusplus.rac vs. sklearn.cluster.AgglomerativeClustering for datasets ranging in size from 1,000 to 64,000 points.
Note: I am using a connectivity matrix to limit memory consumption.
import time

import numpy as np
import racplusplus
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

points = [1000, 2000, 4000, 6000, 10000, 14000, 18000, 22000, 26000, 32000, 64000]
sklearn_times = []
rac_times = []

for point_no in points:
    X = np.random.random((point_no, 25))
    distance_threshold = .17
    knn = kneighbors_graph(X, 30, include_self=False)
    # Matrix must be symmetric - done internally in scikit-learn
    symmetric = knn + knn.T

    start = time.time()
    model = AgglomerativeClustering(
        linkage="average",
        connectivity=knn,
        n_clusters=None,
        distance_threshold=distance_threshold,
        metric="cosine"
    )
    # fit_predict does the actual clustering work being timed
    sklearn_labels = model.fit_predict(X)
    sklearn_times.append(time.time() - start)

    start = time.time()
    rac_labels = racplusplus.rac(
        X, distance_threshold, symmetric,
        batch_size=1000, no_cores=8, metric="cosine"
    )
    rac_times.append(time.time() - start)
Here is a graph of the runtime results for each size dataset:

As we can see, there is a drastic difference in runtime between RAC++ and traditional agglomerative clustering.
At just over 30k points, RAC++ is around 100x faster! Even more importantly, scikit-learn's agglomerative clustering hits a time limit at ~35 thousand points, while RAC++ can scale to hundreds of thousands of points before hitting a reasonable time limit.
RAC++ can scale to high-dimensions
We can also compare how well RAC++ scales to high-dimensional data vs. its traditional counterpart.

Time taken to generate clusters vs dimensionality for 3,000 points
For 3,000 points, we can see that traditional agglomerative clustering is faster, but it scales linearly, while RAC++ is nearly constant. Working with 768- or 1536-dimensional embeddings has become the norm in the field of NLP, and so scaling dimensionality to meet those requirements is important.
RAC++ has a better runtime
Researchers at Google proved that RAC has a runtime of O(nk), where k is the connectivity constraint and n is the number of points, i.e. a linear runtime. However, this excludes the initial distance matrix calculation, which is O(n²), a quadratic runtime.
Our results, running a constant 30-neighbor connectivity constraint, do confirm an O(n²) runtime:
| Data points | Seconds |
|-------------|---------|
| 2000        | 0.051   |
| 4000        | 0.125   |
| 6000        | 0.245   |
| 10000       | 0.560   |
| 14000       | 1.013   |
| 18000       | 1.842   |
| 22000       | 2.800   |
| 26000       | 3.687   |
| 32000       | 5.590   |
| 64000       | 22.499  |
Doubling the number of data points roughly quadruples the runtime; going from 32,000 to 64,000 points, for example, takes us from 5.59 to 22.5 seconds, about a 4x increase.
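As a quick sanity check (this is my own verification snippet, not part of the benchmark above), we can fit a line to the timings from the table on a log-log scale; for the larger dataset sizes, where fixed overhead no longer dominates, the fitted exponent comes out close to 2, which is what a quadratic runtime predicts:
import numpy as np

# Timings copied from the table above (larger sizes only).
points = np.array([14000, 18000, 22000, 26000, 32000, 64000])
seconds = np.array([1.013, 1.842, 2.800, 3.687, 5.590, 22.499])

# If time ~ c * n^p, then log(time) = p * log(n) + log(c),
# so the slope of a linear fit in log-log space estimates the exponent p.
slope, intercept = np.polyfit(np.log(points), np.log(seconds), 1)
print(f"estimated exponent: {slope:.2f}")  # roughly 2.0, consistent with O(n²) scaling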
A quadratic runtime limits how well RAC++ will perform as datasets become truly massive; however, this runtime is already a big improvement over the traditional O(n³) or min-heap-optimized O(n²log(n)) runtime.
Note: the developers of RAC++ are working on passing the distance matrix as a parameter, which would give RAC++ a linear runtime.
How RAC Works
Why is RAC++ so much faster? We can reduce the underlying algorithm to a few steps:
1. Pair clusters with reciprocal nearest neighbors
2. Merge the pairs of clusters
3. Update neighbors
Note that the only difference between this and the traditional agglomerative clustering algorithm is that we make sure to pair reciprocal nearest neighbors together. This is where the name Reciprocal Agglomerative Clustering (RAC) comes from. As you'll see, this reciprocal pairing enables us to parallelize the most computationally expensive step of agglomerative clustering.
Pair clusters with reciprocal nearest neighbors
First we loop through to find clusters with reciprocal nearest neighbors, meaning that their closest neighbors are each other (remember, the nearest-neighbor relationship is directional: A's nearest neighbor being B does not mean that B's nearest neighbor is A!).
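As a small illustration (a minimal sketch of the idea, not the actual RAC++ implementation), reciprocal nearest-neighbor pairs can be identified from a distance matrix like this:
import numpy as np

def reciprocal_nn_pairs(dist):
    """Return index pairs (i, j) whose nearest neighbors are each other."""
    # Ignore self-distances when looking up each point's nearest neighbor.
    dist = dist.copy()
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)

    pairs = []
    for i, j in enumerate(nearest):
        # Keep the pair only if the relationship is mutual, counting each pair once.
        if nearest[j] == i and i < j:
            pairs.append((i, j))
    return pairs

# Tiny usage example on random points with Euclidean distances.
rng = np.random.default_rng(0)
X = rng.random((10, 3))
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
print(reciprocal_nn_pairs(dist))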

Merge pairs
RAC is parallelizable because it does not matter what order reciprocal nearest neighbors are merged in, as long as the linkage method is reducible.
A linkage method is the function that determines the distance between two clusters, based on pairwise distances between the points contained in each cluster. A reducible linkage method guarantees that the new merged cluster is not closer to any other cluster after the merge.

Fortunately, the four most popular linkage methods are reducible:
- Single linkage – min distance
- Average linkage – average of the distances
- Complete linkage – max distance
- Ward linkage – minimizing variance
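Formally, using notation of my own (d for the linkage distance, C_i for clusters) rather than anything quoted from the article or the paper, the reducibility guarantee described above can be written as:

$$ d(C_i \cup C_j,\; C_k) \;\ge\; \min\bigl(d(C_i, C_k),\, d(C_j, C_k)\bigr) \quad \text{for every other cluster } C_k, $$

so the newly merged cluster can never end up closer to a third cluster than its parts already were.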

Since we know that our identified reciprocal pairs are each other's nearest neighbor, and we know that reducible linkage merges will never make a newly merged cluster closer to another cluster, we can safely merge all reciprocal nearest neighbor pairs together at once. Each nearest-neighbor pair can be placed into an available thread to be merged according to the linkage method.
The fact that we can merge reciprocal nearest neighbors at the same time is fantastic, because merging clusters is the most computationally expensive step!

Update nearest neighbors
With reducible linkage, the order in which nearest neighbors are updated after merging also does not matter. Therefore, with some clever design, we can update the relevant neighbors in parallel as well.

Conclusion
With a few test datasets, we have shown that RAC++ produces the exact same results as traditional agglomerative clustering (i.e. sklearn) for fully connected datasets, at a much better runtime. With an understanding of reducible linkage metrics and a basic understanding of parallel programming, we can understand the logic that makes RAC++ so much faster.
For a more complete understanding (and proof) of the algorithm RAC++ has adopted into open source, take a look at the original Google research it was based on.
The future
Porter Hunley started building RAC++ to create taxonomies for clinical term endpoints produced via fine-tuned BERT embeddings. Each of these medical embeddings had 768 dimensions and out of the many clustering algorithms he tried, only agglomerative clustering gave good results.
All other high-scale clustering methods required reducing dimensionality to give any coherent results at all. There is, unfortunately, no fool-proof way to reduce dimensionality – you will always lose information.
After discovering Google's research around RAC, Porter decided to build a custom open-source clustering implementation to support his clinical term clustering research. Porter led development, and I co-developed portions of RAC, particularly wrapping the C++ implementation in Python, optimizing runtime, and packaging the software for distribution.
RAC++ enables tons of clustering applications which are too slow using traditional agglomerative clustering, and will eventually be scalable to millions of datapoints.
Although RAC++ can already be used to cluster large datasets, there are improvements to be made… RAC++ is still in development – please contribute!
Contributing authors:
- Porter Hunley, Senior Software Engineer at Daceflow.ai: github
- Daniel Frees, MS Stats & Data Science Student at Stanford, Data Scientist at IBM: github