BanditPAM: Almost Linear Time $k$-Medoids Clustering via Multi-Armed
Bandits
- URL: http://arxiv.org/abs/2006.06856v2
- Date: Sun, 6 Dec 2020 22:40:18 GMT
- Title: BanditPAM: Almost Linear Time $k$-Medoids Clustering via Multi-Armed
Bandits
- Authors: Mo Tiwari, Martin Jinye Zhang, James Mayclin, Sebastian Thrun, Chris
Piech, Ilan Shomorony
- Abstract summary: Current $k$-medoids clustering algorithms, such as Partitioning Around Medoids (PAM), are iterative and are quadratic in the dataset size $n$ for each iteration, making them prohibitively expensive for large datasets.
We propose BanditPAM, a randomized algorithm inspired by techniques from multi-armed bandits, that reduces the complexity of each PAM iteration from $O(n^2)$ to $O(n \log n)$ and returns the same results with high probability, under assumptions on the data that often hold in practice.
We empirically validate our results on several large real-world datasets, including a coding exercise submissions dataset.
- Score: 16.1767275655842
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Clustering is a ubiquitous task in data science. Compared to the commonly
used $k$-means clustering, $k$-medoids clustering requires the cluster centers
to be actual data points and supports arbitrary distance metrics, which permits
greater interpretability and the clustering of structured objects. Current
state-of-the-art $k$-medoids clustering algorithms, such as Partitioning Around
Medoids (PAM), are iterative and are quadratic in the dataset size $n$ for each
iteration, making them prohibitively expensive for large datasets. We propose
BanditPAM, a randomized algorithm inspired by techniques from multi-armed
bandits, that reduces the complexity of each PAM iteration from $O(n^2)$ to
$O(n \log n)$ and returns the same results with high probability, under
assumptions on the data that often hold in practice. As such, BanditPAM matches
state-of-the-art clustering loss while reaching solutions much faster. We
empirically validate our results on several large real-world datasets,
including a coding exercise submissions dataset, the 10x Genomics 68k PBMC
single-cell RNA sequencing dataset, and the MNIST handwritten digits dataset.
In these experiments, we observe that BanditPAM returns the same results as
state-of-the-art PAM-like algorithms up to 4x faster while performing up to
200x fewer distance computations. The improvements demonstrated by BanditPAM
enable $k$-medoids clustering on a wide range of applications, including
identifying cell types in large-scale single-cell data and providing scalable
feedback for students learning computer science online. We also release highly
optimized Python and C++ implementations of our algorithm.
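The complexity reduction comes from treating each candidate medoid as an arm in a multi-armed bandit: instead of computing a candidate's loss exactly against all $n$ points, BanditPAM estimates it from sampled batches of reference points and eliminates candidates whose confidence intervals are provably worse than the current best. Below is a minimal, illustrative Python sketch of this successive-elimination idea for selecting a single medoid (the BUILD step); the batch size, confidence schedule, and variance proxy are simplifying assumptions, not the paper's optimized implementation.

```python
import numpy as np

def build_one_medoid(X, batch_size=100, delta=1e-3, rng=None):
    """Illustrative sketch of a bandit-based BUILD step for one medoid.
    Each point is an 'arm'; its loss is its mean distance to all points,
    estimated from random batches of reference points. Simplified for
    exposition -- not the authors' optimized implementation."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    candidates = np.arange(n)        # arms still in contention
    mean_est = np.zeros(n)           # running mean-loss estimate per arm
    num_refs = np.zeros(n)           # reference points sampled per arm
    sigma = X.std()                  # crude sub-Gaussian scale (assumption)

    # Sample batches until one arm remains or ~n references have been seen.
    while len(candidates) > 1 and num_refs[candidates[0]] < n:
        refs = rng.integers(0, n, size=batch_size)
        # Distances from each remaining candidate to the sampled references:
        # shape (num_candidates, batch_size).
        diffs = X[candidates][:, None, :] - X[refs][None, :, :]
        batch_mean = np.linalg.norm(diffs, axis=2).mean(axis=1)
        # Update running mean-loss estimates with the new batch.
        m = num_refs[candidates]
        mean_est[candidates] = (mean_est[candidates] * m
                                + batch_mean * batch_size) / (m + batch_size)
        num_refs[candidates] = m + batch_size
        # Hoeffding-style confidence radius shrinks as 1/sqrt(num_refs).
        ci = sigma * np.sqrt(2 * np.log(1.0 / delta) / num_refs[candidates])
        best_ucb = (mean_est[candidates] + ci).min()
        # Eliminate arms whose lower bound cannot beat the best upper bound.
        candidates = candidates[mean_est[candidates] - ci <= best_ucb]
    return candidates[np.argmin(mean_est[candidates])]
```

The released implementations are far more engineered than this sketch. A hedged usage example, assuming the released `banditpam` Python package exposes a scikit-learn-style `KMedoids` class (interface details are an assumption; consult the released code):

```python
import numpy as np
from banditpam import KMedoids  # released Python bindings; interface assumed

X = np.random.randn(10000, 10)             # any dataset of n points
model = KMedoids(n_medoids=10, algorithm="BanditPAM")
model.fit(X, "L2")                          # arbitrary distance metrics supported
print(model.medoids)                        # medoids are actual data points
```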
Related papers
- A Scalable k-Medoids Clustering via Whale Optimization Algorithm [0.0]
We introduce WOA-kMedoids, a novel unsupervised clustering method that incorporates the Whale Optimization Algorithm (WOA).
By optimizing centroid selection, WOA-kMedoids reduces the computational complexity of the k-medoids algorithm from quadratic to near-linear with respect to the number of observations.
We evaluated the performance of WOA-kMedoids on 25 diverse time series datasets from the UCR archive.
arXiv Detail & Related papers (2024-08-30T03:43:37Z) - Distributed Collapsed Gibbs Sampler for Dirichlet Process Mixture Models
in Federated Learning [0.22499166814992444]
This paper proposes a new distributed Markov Chain Monte Carlo (MCMC) inference method for DPMMs (DisCGS) using sufficient statistics.
Our approach uses the collapsed Gibbs sampler and is specifically designed to work on distributed data across independent and heterogeneous machines.
For instance, with a dataset of 100K data points, the centralized algorithm requires approximately 12 hours to complete 100 iterations while our approach achieves the same number of iterations in just 3 minutes.
arXiv Detail & Related papers (2023-12-18T13:16:18Z) - BanditPAM++: Faster $k$-medoids Clustering [16.42816643809205]
In $k$-medoids clustering, cluster centers must be actual datapoints and arbitrary distance metrics may be used.
Recent research has proposed BanditPAM, a randomized $k$-medoids algorithm with state-of-the-art complexity and clustering accuracy.
We present BanditPAM++, which accelerates BanditPAM via two algorithmic improvements, and is $O(k)$ faster than BanditPAM in complexity and substantially faster than BanditPAM in wall-clock runtime.
arXiv Detail & Related papers (2023-10-28T23:11:16Z) - Instance-Optimal Cluster Recovery in the Labeled Stochastic Block Model [79.46465138631592]
We devise an efficient algorithm that recovers clusters using the observed labels.
We present Instance-Adaptive Clustering (IAC), the first algorithm whose performance matches the instance-specific lower bounds both in expectation and with high probability.
arXiv Detail & Related papers (2023-06-18T08:46:06Z) - GBMST: An Efficient Minimum Spanning Tree Clustering Based on
Granular-Ball Computing [78.92205914422925]
We propose a clustering algorithm that combines multi-granularity Granular-Balls and a minimum spanning tree (MST).
We construct coarse-grained granular-balls, and then use granular-balls and the MST to implement a clustering method based on "large-scale priority".
Experimental results on several data sets demonstrate the power of the algorithm.
arXiv Detail & Related papers (2023-03-02T09:04:35Z) - Optimal Clustering with Bandit Feedback [57.672609011609886]
This paper considers the problem of online clustering with bandit feedback.
It includes a novel stopping rule for sequential testing that circumvents the need to solve any NP-hard weighted clustering problem as its subroutines.
We show through extensive simulations on synthetic and real-world datasets that BOC's performance matches the lower bound asymptotically, and significantly outperforms a non-adaptive baseline algorithm.
arXiv Detail & Related papers (2022-02-09T06:05:05Z) - Fuzzy Clustering with Similarity Queries [56.96625809888241]
The fuzzy or soft $k$-means objective is a popular generalization of the well-known $k$-means problem.
We show that by making few queries, the problem becomes easier to solve.
arXiv Detail & Related papers (2021-06-04T02:32:26Z) - Exact and Approximate Hierarchical Clustering Using A* [51.187990314731344]
We introduce a new approach based on A* search for clustering.
We overcome the prohibitively large search space by combining A* with a novel trellis data structure.
We empirically demonstrate that our method achieves substantially higher quality results than baselines for a particle physics use case and other clustering benchmarks.
arXiv Detail & Related papers (2021-04-14T18:15:27Z) - Faster DBSCAN via subsampled similarity queries [42.93847281082316]
DBSCAN is a popular density-based clustering algorithm.
We propose SNG-DBSCAN, which clusters based on a subsampled $\epsilon$-neighborhood graph (see the sketch after this list).
arXiv Detail & Related papers (2020-06-11T18:57:54Z) - Computationally efficient sparse clustering [67.95910835079825]
We provide a finite sample analysis of a new clustering algorithm based on PCA.
We show that it achieves the minimax optimal misclustering rate in the regime $\|\theta\| \to \infty$.
arXiv Detail & Related papers (2020-05-21T17:51:30Z) - Efficient Algorithms for Generating Provably Near-Optimal Cluster
Descriptors for Explainability [36.11663695534294]
We study the problem of making clusters more interpretable by extending a recent approach of constructing succinct representations for clusters.
We develop approximation algorithms with provable performance guarantees for the problem.
We also show applications to explain clusters from datasets, including clusters of genomic sequences that represent different threat levels.
arXiv Detail & Related papers (2020-02-06T19:49:54Z)
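The SNG-DBSCAN entry above clusters on a subsampled $\epsilon$-neighborhood graph rather than the full one. A minimal Python sketch of that idea, with the sampling rate, core-point rule, and noise handling as illustrative assumptions rather than the paper's exact procedure:

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def sng_dbscan_sketch(X, eps, sample_rate=0.01, min_pts=5, rng=None):
    """Cluster on a subsampled eps-neighborhood graph (the SNG-DBSCAN idea).
    Simplified for exposition; sampling and thresholds are assumptions."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(X)
    num_pairs = int(sample_rate * n * (n - 1) / 2)
    i = rng.integers(0, n, size=num_pairs)      # random candidate edges
    j = rng.integers(0, n, size=num_pairs)
    # Keep only sampled pairs that are genuine eps-neighbors.
    keep = (i != j) & (np.linalg.norm(X[i] - X[j], axis=1) <= eps)
    i, j = i[keep], j[keep]
    # Sparse, symmetric subsampled neighborhood graph.
    g = coo_matrix((np.ones(len(i)), (i, j)), shape=(n, n))
    g = g + g.T
    # Connected components give cluster labels; low-degree points are noise.
    degree = np.asarray(g.sum(axis=1)).ravel()
    _, labels = connected_components(g, directed=False)
    labels[degree < min_pts] = -1
    return labels
```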