BanditPAM++: Faster $k$-medoids Clustering
- URL: http://arxiv.org/abs/2310.18844v1
- Date: Sat, 28 Oct 2023 23:11:16 GMT
- Title: BanditPAM++: Faster $k$-medoids Clustering
- Authors: Mo Tiwari, Ryan Kang, Donghyun Lee, Sebastian Thrun, Chris Piech, Ilan
Shomorony, Martin Jinye Zhang
- Abstract summary: In $k$-medoids clustering, cluster centers must be actual datapoints and arbitrary distance metrics may be used.
Recent research has proposed BanditPAM, a randomized $k$-medoids algorithm with state-of-the-art complexity and clustering accuracy.
We present BanditPAM++, which accelerates BanditPAM via two algorithmic improvements, and is $O(k)$ faster than BanditPAM in complexity and substantially faster than BanditPAM in wall-clock runtime.
- Score: 16.42816643809205
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Clustering is a fundamental task in data science with wide-ranging
applications. In $k$-medoids clustering, cluster centers must be actual
datapoints and arbitrary distance metrics may be used; these features allow for
greater interpretability of the cluster centers and the clustering of exotic
objects in $k$-medoids clustering, respectively. $k$-medoids clustering has
recently grown in popularity due to the discovery of more efficient $k$-medoids
algorithms. In particular, recent research has proposed BanditPAM, a randomized
$k$-medoids algorithm with state-of-the-art complexity and clustering accuracy.
In this paper, we present BanditPAM++, which accelerates BanditPAM via two
algorithmic improvements, and is $O(k)$ faster than BanditPAM in complexity and
substantially faster than BanditPAM in wall-clock runtime. First, we
demonstrate that BanditPAM has a special structure that allows the reuse of
clustering information $\textit{within}$ each iteration. Second, we demonstrate
that BanditPAM has additional structure that permits the reuse of information
$\textit{across}$ different iterations. These observations inspire our proposed
algorithm, BanditPAM++, which returns the same clustering solutions as
BanditPAM but often several times faster. For example, on the CIFAR10 dataset,
BanditPAM++ returns the same results as BanditPAM but runs over 10$\times$
faster. Finally, we provide a high-performance C++ implementation of
BanditPAM++, callable from Python and R, that may be of interest to
practitioners at https://github.com/motiwari/BanditPAM. Auxiliary code to
reproduce all of our experiments via a one-line script is available at
https://github.com/ThrunGroup/BanditPAM_plusplus_experiments.
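For practitioners, the C++ implementation mentioned above can be called from Python. The snippet below is a minimal sketch based on the repository's README: it assumes the `banditpam` package is installed (e.g. via `pip install banditpam`) and exposes a `KMedoids` class with a `fit(X, loss)` method and a `medoids` attribute; if your installed version differs, adjust the names accordingly.

```python
import numpy as np
from banditpam import KMedoids  # assumes `pip install banditpam`

# Synthetic data standing in for a real dataset (e.g. flattened CIFAR10 images).
X = np.random.default_rng(0).normal(size=(1000, 10))

# Constructor arguments and the algorithm name follow the repository README;
# they are assumptions if your installed version exposes a different API.
kmed = KMedoids(n_medoids=5, algorithm="BanditPAM")
kmed.fit(X, "L2")  # arbitrary distance metrics are supported; L2 shown here

print(kmed.medoids)  # the selected medoids, which are actual datapoints of X
```

Because the medoids are required to be actual datapoints, the fitted cluster centers can be inspected directly, which is the interpretability advantage described in the abstract.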
Related papers
- Do you know what q-means? [50.045011844765185]
Clustering is one of the most important tools for analysis of large datasets.
We present an improved version of the "$q$-means" algorithm for clustering.
We also present a "dequantized" algorithm for $\varepsilon$-$k$-means which runs in $O\big(\frac{k^{2}}{\varepsilon^{2}}(\sqrt{kd} + \log(Nd))\big)$.
arXiv Detail & Related papers (2023-08-18T17:52:12Z) - GBMST: An Efficient Minimum Spanning Tree Clustering Based on
Granular-Ball Computing [78.92205914422925]
We propose a clustering algorithm that combines multi-granularity granular-ball computing and the minimum spanning tree (MST).
We first construct coarse-grained granular-balls, and then use the granular-balls and MST to implement a clustering method based on "large-scale priority".
Experimental results on several data sets demonstrate the power of the algorithm.
arXiv Detail & Related papers (2023-03-02T09:04:35Z) - Optimal Algorithms for Latent Bandits with Cluster Structure [50.44722775727619]
We consider the problem of latent bandits with cluster structure where there are multiple users, each with an associated multi-armed bandit problem.
We propose LATTICE, which allows exploitation of the latent cluster structure to provide the minimax optimal regret of $\widetilde{O}(\sqrt{(\mathsf{M}+\mathsf{N})\mathsf{T}})$.
arXiv Detail & Related papers (2023-01-17T17:49:04Z) - Faster Maximum Inner Product Search in High Dimensions [17.040520467777295]
Maximum Inner Product Search (MIPS) is a ubiquitous task in machine learning applications such as recommendation systems.
We present BanditMIPS, a novel randomized MIPS algorithm whose complexity is independent of $d$.
We provide theoretical guarantees that BanditMIPS returns the correct answer with high probability, while improving the complexity in $d$ from $O(\sqrt{d})$ to $O(1)$.
arXiv Detail & Related papers (2022-12-14T23:46:23Z) - Recovering Unbalanced Communities in the Stochastic Block Model With
Application to Clustering with a Faulty Oracle [9.578056676899203]
The stochastic block model (SBM) is a fundamental model for studying graph clustering or community detection in networks.
We provide a simple SVD-based algorithm for recovering the communities in the SBM with communities of varying sizes.
arXiv Detail & Related papers (2022-02-17T08:51:19Z) - Optimal Clustering with Bandit Feedback [57.672609011609886]
This paper considers the problem of online clustering with bandit feedback.
It includes a novel stopping rule for sequential testing that circumvents the need to solve any NP-hard weighted clustering problem as a subroutine.
We show through extensive simulations on synthetic and real-world datasets that BOC's performance matches the lower bound asymptotically, and significantly outperforms a non-adaptive baseline algorithm.
arXiv Detail & Related papers (2022-02-09T06:05:05Z) - Fuzzy Clustering with Similarity Queries [56.96625809888241]
The fuzzy or soft $k$-means objective is a popular generalization of the well-known $k$-means problem.
We show that by making few queries, the problem becomes easier to solve.
arXiv Detail & Related papers (2021-06-04T02:32:26Z) - Exact and Approximate Hierarchical Clustering Using A* [51.187990314731344]
We introduce a new approach based on A* search for clustering.
We overcome the prohibitively large search space by combining A* with a novel trellis data structure.
We empirically demonstrate that our method achieves substantially higher quality results than baselines for a particle physics use case and other clustering benchmarks.
arXiv Detail & Related papers (2021-04-14T18:15:27Z) - Fast and Eager k-Medoids Clustering: O(k) Runtime Improvement of the
PAM, CLARA, and CLARANS Algorithms [0.0]
Partitioning Around Medoids (PAM) is an algorithm for clustering non-Euclidean data.
We propose modifications to PAM that achieve an O(k)-fold speedup in the second ("SWAP") phase of the algorithm.
In experiments on real data with k=100 and k=200, we observed speedups of 458x and 1191x, respectively, compared to the original PAM SWAP algorithm.
arXiv Detail & Related papers (2020-08-12T08:37:50Z) - BanditPAM: Almost Linear Time $k$-Medoids Clustering via Multi-Armed
Bandits [16.1767275655842]
Current $k$-medoids clustering algorithms, such as Partitioning Around Medoids (PAM), are iterative and are quadratic in the dataset size $n$ for each iteration, making them prohibitively expensive for large datasets.
We propose BanditPAM, a randomized algorithm inspired by techniques from multi-armed bandits, that reduces the complexity of each PAM iteration from $O(n^2)$ to $O(n \log n)$ and returns the same results with high probability, under assumptions on the data that often hold in practice (an illustrative sketch of this bandit-based selection idea appears after this list).
We empirically validate our results on several large real-world datasets, including a coding exercise dataset.
arXiv Detail & Related papers (2020-06-11T22:17:16Z)
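To make the bandit connection in the BanditPAM entry above concrete, the following is an illustrative, self-contained sketch of the general idea: treat each candidate medoid as an arm, estimate its average distance to the rest of the dataset from a small sample of reference points, and successively eliminate candidates whose confidence intervals show they cannot be best. This is a simplified stand-in written for this summary, not the authors' algorithm; the function name, constants, and stopping rules are all hypothetical.

```python
import numpy as np

def find_best_candidate(X, batch=32, delta=1e-3, max_rounds=50):
    """Illustrative bandit-style search for the single best 1-medoid.

    Each point is an "arm" whose true loss is its mean distance to all
    other points. Instead of computing all O(n^2) pairwise distances, we
    estimate each arm's loss from sampled reference points and eliminate
    arms whose lower confidence bound exceeds the smallest upper confidence
    bound. This mirrors the spirit of BanditPAM's adaptive sampling, not
    its exact update or elimination rules.
    """
    rng = np.random.default_rng(0)
    n = len(X)
    alive = np.arange(n)   # candidate medoids still under consideration
    est = np.zeros(n)      # running mean of sampled distances per candidate
    pulls = np.zeros(n)    # number of reference samples drawn per candidate

    for _ in range(max_rounds):
        refs = rng.integers(0, n, size=batch)  # sample reference points
        # Distances from every surviving candidate to the sampled references.
        d = np.linalg.norm(X[alive][:, None, :] - X[refs][None, :, :], axis=-1)
        est[alive] = (est[alive] * pulls[alive] + d.sum(axis=1)) / (pulls[alive] + batch)
        pulls[alive] += batch

        # Hoeffding-style confidence radius; the scale is illustrative only.
        rad = np.sqrt(np.log(1.0 / delta) / (2.0 * pulls[alive]))
        best_ucb = (est[alive] + rad).min()
        alive = alive[est[alive] - rad <= best_ucb]  # drop provably worse arms
        if len(alive) == 1:
            break
    return alive[np.argmin(est[alive])]

# Toy usage: find an approximate 1-medoid of 500 random points in 5 dimensions.
X = np.random.default_rng(1).normal(size=(500, 5))
print("estimated best medoid index:", find_best_candidate(X))
```

BanditPAM applies adaptive sampling of this flavor inside PAM's iterations, and, per the abstract above, BanditPAM++ additionally reuses such information within and across iterations to obtain its further $O(k)$ speedup.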