BanditPAM: Almost Linear Time $k$-Medoids Clustering via Multi-Armed
Bandits
- URL: http://arxiv.org/abs/2006.06856v2
- Date: Sun, 6 Dec 2020 22:40:18 GMT
- Title: BanditPAM: Almost Linear Time $k$-Medoids Clustering via Multi-Armed
Bandits
- Authors: Mo Tiwari, Martin Jinye Zhang, James Mayclin, Sebastian Thrun, Chris
Piech, Ilan Shomorony
- Abstract summary: Current $k$-medoids clustering algorithms, such as Partitioning Around Medoids (PAM), are iterative and are quadratic in the dataset size $n$ for each iteration, making them prohibitively expensive for large datasets.
We propose BanditPAM, a randomized algorithm inspired by techniques from multi-armed bandits, that reduces the complexity of each PAM iteration from $O(n^2)$ to $O(n \log n)$ and returns the same results with high probability, under assumptions on the data that often hold in practice.
We empirically validate our results on several large real-world datasets, including a coding exercise submissions dataset.
- Score: 16.1767275655842
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Clustering is a ubiquitous task in data science. Compared to the commonly
used $k$-means clustering, $k$-medoids clustering requires the cluster centers
to be actual data points and supports arbitrary distance metrics, which permits
greater interpretability and the clustering of structured objects. Current
state-of-the-art $k$-medoids clustering algorithms, such as Partitioning Around
Medoids (PAM), are iterative and are quadratic in the dataset size $n$ for each
iteration, making them prohibitively expensive for large datasets. We propose
BanditPAM, a randomized algorithm inspired by techniques from multi-armed
bandits, that reduces the complexity of each PAM iteration from $O(n^2)$ to
$O(n \log n)$ and returns the same results with high probability, under
assumptions on the data that often hold in practice. As such, BanditPAM matches
state-of-the-art clustering loss while reaching solutions much faster. We
empirically validate our results on several large real-world datasets,
including a coding exercise submissions dataset, the 10x Genomics 68k PBMC
single-cell RNA sequencing dataset, and the MNIST handwritten digits dataset.
In these experiments, we observe that BanditPAM returns the same results as
state-of-the-art PAM-like algorithms up to 4x faster while performing up to
200x fewer distance computations. The improvements demonstrated by BanditPAM
enable $k$-medoids clustering on a wide range of applications, including
identifying cell types in large-scale single-cell data and providing scalable
feedback for students learning computer science online. We also release highly
optimized Python and C++ implementations of our algorithm.
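The complexity reduction comes from treating each candidate medoid as an arm in a multi-armed bandit: instead of computing a candidate's loss exactly against all $n$ points, BanditPAM estimates it from sampled batches of reference points and eliminates candidates whose confidence intervals are provably worse than the current best. Below is a minimal, illustrative Python sketch of this successive-elimination idea for selecting a single medoid (the BUILD step); the batch size, confidence schedule, and variance proxy are simplifying assumptions, not the paper's optimized implementation.

```python
import numpy as np

def build_one_medoid(X, batch_size=100, delta=1e-3, rng=None):
    """Illustrative sketch of a bandit-based BUILD step for one medoid.
    Each point is an 'arm'; its loss is its mean distance to all points,
    estimated from random batches of reference points. Simplified for
    exposition -- not the authors' optimized implementation."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    candidates = np.arange(n)        # arms still in contention
    mean_est = np.zeros(n)           # running mean-loss estimate per arm
    num_refs = np.zeros(n)           # reference points sampled per arm
    sigma = X.std()                  # crude sub-Gaussian scale (assumption)

    # Sample batches until one arm remains or ~n references have been seen.
    while len(candidates) > 1 and num_refs[candidates[0]] < n:
        refs = rng.integers(0, n, size=batch_size)
        # Distances from each remaining candidate to the sampled references:
        # shape (num_candidates, batch_size).
        diffs = X[candidates][:, None, :] - X[refs][None, :, :]
        batch_mean = np.linalg.norm(diffs, axis=2).mean(axis=1)
        # Update running mean-loss estimates with the new batch.
        m = num_refs[candidates]
        mean_est[candidates] = (mean_est[candidates] * m
                                + batch_mean * batch_size) / (m + batch_size)
        num_refs[candidates] = m + batch_size
        # Hoeffding-style confidence radius shrinks as 1/sqrt(num_refs).
        ci = sigma * np.sqrt(2 * np.log(1.0 / delta) / num_refs[candidates])
        best_ucb = (mean_est[candidates] + ci).min()
        # Eliminate arms whose lower bound cannot beat the best upper bound.
        candidates = candidates[mean_est[candidates] - ci <= best_ucb]
    return candidates[np.argmin(mean_est[candidates])]
```

The released implementations are far more engineered than this sketch. A hedged usage example, assuming the released `banditpam` Python package exposes a scikit-learn-style `KMedoids` class (interface details are an assumption; consult the released code):

```python
import numpy as np
from banditpam import KMedoids  # released Python bindings; interface assumed

X = np.random.randn(10000, 10)             # any dataset of n points
model = KMedoids(n_medoids=10, algorithm="BanditPAM")
model.fit(X, "L2")                          # arbitrary distance metrics supported
print(model.medoids)                        # medoids are actual data points
```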
Related papers
- A Scalable k-Medoids Clustering via Whale Optimization Algorithm [0.0]
We introduce WOA-kMedoids, a novel unsupervised clustering method that incorporates the Whale Optimization Algorithm (WOA).
By optimizing centroid selection, WOA-kMedoids reduces the computational complexity of the k-medoids algorithm from quadratic to near-linear with respect to the number of observations.
We evaluated the performance of WOA-kMedoids on 25 diverse time series datasets from the UCR archive.
arXiv Detail & Related papers (2024-08-30T03:43:37Z) - Distributed Collapsed Gibbs Sampler for Dirichlet Process Mixture Models
in Federated Learning [0.22499166814992444]
This paper proposes a new distributed Markov Chain Monte Carlo (MCMC) inference method for DPMMs (DisCGS) using sufficient statistics.
Our approach uses the collapsed Gibbs sampler and is specifically designed to work on distributed data across independent and heterogeneous machines.
For instance, with a dataset of 100K data points, the centralized algorithm requires approximately 12 hours to complete 100 iterations while our approach achieves the same number of iterations in just 3 minutes.
arXiv Detail & Related papers (2023-12-18T13:16:18Z) - BanditPAM++: Faster $k$-medoids Clustering [16.42816643809205]
In $k$-medoids clustering, cluster centers must be actual datapoints and arbitrary distance metrics may be used.
Recent research has proposed BanditPAM, a randomized $k$-medoids algorithm with state-of-the-art complexity and clustering accuracy.
We present BanditPAM++, which accelerates BanditPAM via two algorithmic improvements, and is $O(k)$ faster than BanditPAM in complexity and substantially faster than BanditPAM in wall-clock runtime.
arXiv Detail & Related papers (2023-10-28T23:11:16Z) - Instance-Optimal Cluster Recovery in the Labeled Stochastic Block Model [79.46465138631592]
We devise an efficient algorithm that recovers clusters using the observed labels.
We present Instance-Adaptive Clustering (IAC), the first algorithm whose performance matches the instance-specific lower bounds both in expectation and with high probability.
arXiv Detail & Related papers (2023-06-18T08:46:06Z) - GBMST: An Efficient Minimum Spanning Tree Clustering Based on
Granular-Ball Computing [78.92205914422925]
We propose a clustering algorithm that combines multi-granularity Granular-Balls and a minimum spanning tree (MST).
We construct coarse-grained granular-balls, and then use granular-balls and the MST to implement a clustering method based on "large-scale priority".
Experimental results on several data sets demonstrate the power of the algorithm.
arXiv Detail & Related papers (2023-03-02T09:04:35Z) - Optimal Clustering with Bandit Feedback [57.672609011609886]
This paper considers the problem of online clustering with bandit feedback.
It includes a novel stopping rule for sequential testing that circumvents the need to solve any NP-hard weighted clustering problem as its subroutines.
We show through extensive simulations on synthetic and real-world datasets that BOC's performance matches the lower bound asymptotically, and significantly outperforms a non-adaptive baseline algorithm.
arXiv Detail & Related papers (2022-02-09T06:05:05Z) - Fuzzy Clustering with Similarity Queries [56.96625809888241]
The fuzzy or soft $k$-means objective is a popular generalization of the well-known $k$-means problem.
We show that by making few queries, the problem becomes easier to solve.
arXiv Detail & Related papers (2021-06-04T02:32:26Z) - Exact and Approximate Hierarchical Clustering Using A* [51.187990314731344]
We introduce a new approach based on A* search for clustering.
We overcome the prohibitively large search space by combining A* with a novel trellis data structure.
We empirically demonstrate that our method achieves substantially higher quality results than baselines for a particle physics use case and other clustering benchmarks.
arXiv Detail & Related papers (2021-04-14T18:15:27Z) - Faster DBSCAN via subsampled similarity queries [42.93847281082316]
DBSCAN is a popular density-based clustering algorithm.
We propose SNG-DBSCAN, which clusters based on a subsampled $\epsilon$-neighborhood graph (see the sketch after this list).
arXiv Detail & Related papers (2020-06-11T18:57:54Z) - Computationally efficient sparse clustering [67.95910835079825]
We provide a finite sample analysis of a new clustering algorithm based on PCA.
We show that it achieves the minimax optimal misclustering rate in the regime $\|\theta\| \to \infty$.
arXiv Detail & Related papers (2020-05-21T17:51:30Z) - Efficient Algorithms for Generating Provably Near-Optimal Cluster
Descriptors for Explainability [36.11663695534294]
We study the problem of making clusters more interpretable by extending a recent approach of constructing succinct representations for clusters.
We develop approximation algorithms with provable performance guarantees for the problem.
We also show applications to explain clusters from datasets, including clusters of genomic sequences that represent different threat levels.
arXiv Detail & Related papers (2020-02-06T19:49:54Z)
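The SNG-DBSCAN entry above clusters on a subsampled $\epsilon$-neighborhood graph rather than the full one. A minimal Python sketch of that idea, with the sampling rate, core-point rule, and noise handling as illustrative assumptions rather than the paper's exact procedure:

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def sng_dbscan_sketch(X, eps, sample_rate=0.01, min_pts=5, rng=None):
    """Cluster on a subsampled eps-neighborhood graph (the SNG-DBSCAN idea).
    Simplified for exposition; sampling and thresholds are assumptions."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(X)
    num_pairs = int(sample_rate * n * (n - 1) / 2)
    i = rng.integers(0, n, size=num_pairs)      # random candidate edges
    j = rng.integers(0, n, size=num_pairs)
    # Keep only sampled pairs that are genuine eps-neighbors.
    keep = (i != j) & (np.linalg.norm(X[i] - X[j], axis=1) <= eps)
    i, j = i[keep], j[keep]
    # Sparse, symmetric subsampled neighborhood graph.
    g = coo_matrix((np.ones(len(i)), (i, j)), shape=(n, n))
    g = g + g.T
    # Connected components give cluster labels; low-degree points are noise.
    degree = np.asarray(g.sum(axis=1)).ravel()
    _, labels = connected_components(g, directed=False)
    labels[degree < min_pts] = -1
    return labels
```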