Scaling Hierarchical Agglomerative Clustering to Billion-sized Datasets
- URL: http://arxiv.org/abs/2105.11653v1
- Date: Tue, 25 May 2021 04:14:21 GMT
- Title: Scaling Hierarchical Agglomerative Clustering to Billion-sized Datasets
- Authors: Baris Sumengen (1), Anand Rajagopalan (1), Gui Citovsky (1), David
Simcha (1), Olivier Bachem (1), Pradipta Mitra (1), Sam Blasiak (1), Mason
Liang (2), Sanjiv Kumar (1) ((1) Google Research, (2) 0x Labs)
- Abstract summary: We propose Reciprocal Agglomerative Clustering (RAC), a distributed algorithm for HAC, that uses a novel strategy to efficiently merge clusters in parallel.
In extensive experiments, we show that RAC can recover the HAC hierarchy on billions of data points connected by trillions of edges in less than an hour.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Hierarchical Agglomerative Clustering (HAC) is one of the oldest but still
most widely used clustering methods. However, HAC is notoriously hard to scale
to large data sets as the underlying complexity is at least quadratic in the
number of data points and many algorithms to solve HAC are inherently
sequential. In this paper, we propose Reciprocal Agglomerative Clustering
(RAC), a distributed algorithm for HAC that uses a novel strategy to
efficiently merge clusters in parallel. We prove theoretically that RAC
recovers the exact solution of HAC. Furthermore, under clusterability and
balancedness assumptions, we show provable speedups in total runtime due to the
parallelism. We also show that these speedups are achievable for certain
probabilistic data models. In extensive experiments, we show that this
parallelism is achieved on real world data sets and that the proposed RAC
algorithm can recover the HAC hierarchy on billions of data points connected by
trillions of edges in less than an hour.
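The core idea can be illustrated on a single machine: in each round, every pair of clusters that are mutual (reciprocal) nearest neighbours is merged, and with a reducible linkage these merges can all happen in the same round without changing the resulting hierarchy. The sketch below is an illustrative single-machine toy with average linkage (the function name and data layout are assumptions, not the paper's API); the actual RAC algorithm distributes this loop across machines.

```python
import numpy as np

def rac(points):
    """Toy reciprocal agglomerative clustering: repeatedly merge every
    pair of clusters that are mutual nearest neighbours.  With a
    reducible linkage (average linkage here, maintained via the
    Lance-Williams update) this produces the same merges as sequential
    HAC; RAC's contribution is doing each round in parallel at scale."""
    n = len(points)
    clusters = {i: [i] for i in range(n)}          # cluster id -> members
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    dist = {(i, j): d[i, j] for i in range(n) for j in range(i + 1, n)}
    merges = []                                    # (id_a, id_b, distance)

    def get(a, b):
        return dist[min(a, b), max(a, b)]

    while len(clusters) > 1:
        # nearest neighbour of each active cluster, computed once per round
        nn = {a: min((b for b in clusters if b != a), key=lambda b: get(a, b))
              for a in clusters}
        # all reciprocal pairs are merged within the same round
        for a in list(clusters):
            b = nn[a]
            if a in clusters and b in clusters and nn[b] == a and a < b:
                merges.append((a, b, get(a, b)))
                na, nb = len(clusters[a]), len(clusters[b])
                for c in clusters:
                    if c not in (a, b):
                        # average-linkage Lance-Williams distance update
                        dist[min(a, c), max(a, c)] = (
                            na * get(a, c) + nb * get(b, c)) / (na + nb)
                clusters[a] += clusters[b]
                del clusters[b]
    return merges
```

On well-separated data, many reciprocal pairs exist per round, so the number of rounds is far smaller than the number of merges; this is the parallelism the paper quantifies under clusterability and balancedness assumptions.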
Related papers
- Data Aggregation for Hierarchical Clustering [0.3626013617212666]
BETULA is a numerically stable version of the well known BIRCH data aggregation algorithm.
It can be used to make HAC viable on systems with constrained resources with only small losses on clustering quality.
arXiv Detail & Related papers (2023-09-05T19:39:43Z)
- Instance-Optimal Cluster Recovery in the Labeled Stochastic Block Model [79.46465138631592]
We devise an efficient algorithm that recovers clusters using the observed labels.
We present Instance-Adaptive Clustering (IAC), the first algorithm whose performance matches these lower bounds both in expectation and with high probability.
arXiv Detail & Related papers (2023-06-18T08:46:06Z)
- GBMST: An Efficient Minimum Spanning Tree Clustering Based on Granular-Ball Computing [78.92205914422925]
We propose a clustering algorithm that combines multi-granularity granular-balls and the minimum spanning tree (MST).
We construct coarse-grained granular-balls, and then use granular-balls and the MST to implement a clustering method based on "large-scale priority".
Experimental results on several data sets demonstrate the power of the algorithm.
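GBMST's granular-ball coarsening is not reproduced here, but the MST cut step that such methods build on is simple to state: build a minimum spanning tree over the data and remove its heaviest edges, each removal splitting off one cluster. A minimal sketch (function name and parameters are hypothetical, not from the paper):

```python
import itertools
import math

def mst_clusters(points, k):
    """Generic MST-based clustering: build a minimum spanning tree with
    Kruskal's algorithm, then drop the k-1 heaviest MST edges, leaving
    k connected components as the clusters."""
    n = len(points)
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in itertools.combinations(range(n), 2))
    parent = list(range(n))

    def find(x):
        # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Kruskal: collect the n-1 MST edges in increasing length order
    mst = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            mst.append((w, i, j))

    # keep only the n-k lightest MST edges -> k connected components
    parent = list(range(n))
    for w, i, j in sorted(mst)[: n - k]:
        parent[find(i)] = find(j)
    labels = [find(i) for i in range(n)]
    remap = {}                     # relabel component roots to 0..k-1
    return [remap.setdefault(r, len(remap)) for r in labels]
```

GBMST's granular-ball step reduces the number of points fed into this O(n^2) edge enumeration, which is where its efficiency gain comes from.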
arXiv Detail & Related papers (2023-03-02T09:04:35Z)
- Convex Clustering through MM: An Efficient Algorithm to Perform Hierarchical Clustering [1.0589208420411012]
We propose convex clustering through majorization-minimization (CCMM) -- an iterative algorithm that uses cluster fusions and a highly efficient updating scheme.
With a current desktop computer, CCMM efficiently solves convex clustering problems featuring over one million objects in seven-dimensional space.
arXiv Detail & Related papers (2022-11-03T15:07:51Z)
- Recovering Unbalanced Communities in the Stochastic Block Model With Application to Clustering with a Faulty Oracle [9.578056676899203]
The stochastic block model (SBM) is a fundamental model for studying graph clustering or community detection in networks.
We provide a simple SVD-based algorithm for recovering the communities in the SBM with communities of varying sizes.
arXiv Detail & Related papers (2022-02-17T08:51:19Z)
- Meta Clustering Learning for Large-scale Unsupervised Person Re-identification [124.54749810371986]
We propose a "small data for big task" paradigm dubbed Meta Clustering Learning (MCL).
MCL only pseudo-labels a subset of the entire unlabeled data via clustering to save computing for the first-phase training.
Our method significantly saves computational cost while achieving a comparable or even better performance compared to prior works.
arXiv Detail & Related papers (2021-11-19T04:10:18Z)
- (k, l)-Medians Clustering of Trajectories Using Continuous Dynamic Time Warping [57.316437798033974]
In this work we consider the problem of center-based clustering of trajectories.
We propose using a continuous version of DTW as the distance measure, which we call continuous dynamic time warping (CDTW).
We show a practical way to compute a center from a set of trajectories and subsequently iteratively improve it.
arXiv Detail & Related papers (2020-12-01T13:17:27Z)
- Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed.
We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z)
- Computationally efficient sparse clustering [67.95910835079825]
We provide a finite sample analysis of a new clustering algorithm based on PCA.
We show that it achieves the minimax optimal misclustering rate in the regime $\|\theta\| \to \infty$.
arXiv Detail & Related papers (2020-05-21T17:51:30Z)
- Fair Algorithms for Hierarchical Agglomerative Clustering [17.66340013352806]
Hierarchical Agglomerative Clustering (HAC) algorithms are extensively utilized in modern data science.
It is imperative to ensure that these algorithms are fair -- even if the dataset contains biases against certain protected groups.
We propose fair algorithms for performing HAC that enforce fairness constraints.
arXiv Detail & Related papers (2020-05-07T01:41:56Z)
- Probabilistic Partitive Partitioning (PPP) [0.0]
Clustering algorithms, in general, face two common problems.
First, they converge to different solutions under different initial conditions.
Second, the number of clusters has to be decided arbitrarily beforehand.
arXiv Detail & Related papers (2020-03-09T19:18:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.