Related papers: Scalable k-Means Clustering for Large k via Seeded Approximate Nearest-Neighbor Search

URL: http://arxiv.org/abs/2502.06163v1
Date: Mon, 10 Feb 2025 05:22:08 GMT
Title: Scalable k-Means Clustering for Large k via Seeded Approximate Nearest-Neighbor Search
Authors: Jack Spalding-Jamieson, Eliot Wong Robson, Da Wei Zheng
Abstract summary: For very large values of $k$, we consider methods for fast clustering of massive datasets with $10^7\sim10^9$ points in high-dimensions. All current practical methods for this problem have runtimes at least $\Omega(k^2)$. We propose a family of problems we call "Seeded Approximate Nearest-Neighbor Search", for which we propose "Seeded Search-Graph" methods as a solution.
Score: 0.6144680854063939
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: For very large values of $k$, we consider methods for fast $k$-means clustering of massive datasets with $10^7\sim10^9$ points in high-dimensions ($d\geq100$). All current practical methods for this problem have runtimes at least $\Omega(k^2)$. We find that initialization routines are not a bottleneck for this case. Instead, it is critical to improve the speed of Lloyd's local-search algorithm, particularly the step that reassigns points to their closest center. Attempting to improve this step naturally leads us to leverage approximate nearest-neighbor search methods, although this alone is not enough to be practical. Instead, we propose a family of problems we call "Seeded Approximate Nearest-Neighbor Search", for which we propose "Seeded Search-Graph" methods as a solution.
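
To make the bottleneck named in the abstract concrete, here is a minimal sketch of Lloyd's local search with a brute-force reassignment step. It is not the paper's Seeded Search-Graph method; the function names (`assign_points`, `update_centers`, `lloyd`) and the toy data are assumptions made for illustration. The reassignment step compares every point with every center, so it costs on the order of $n \cdot k \cdot d$ operations per iteration, which is the step the abstract proposes to speed up with approximate nearest-neighbor search.

```python
import numpy as np

def assign_points(points, centers):
    """Brute-force reassignment: index of the closest center for each point."""
    # Squared distances via ||x||^2 - 2 x.c + ||c||^2, giving an (n, k) matrix.
    sq_dists = (
        (points ** 2).sum(axis=1)[:, None]
        - 2.0 * points @ centers.T
        + (centers ** 2).sum(axis=1)[None, :]
    )
    return sq_dists.argmin(axis=1)

def update_centers(points, labels, centers):
    """Move each center to the mean of its assigned points (empty clusters stay put)."""
    k = centers.shape[0]
    sums = np.zeros_like(centers, dtype=float)
    np.add.at(sums, labels, points)            # per-cluster coordinate sums
    counts = np.bincount(labels, minlength=k)  # per-cluster sizes
    new_centers = centers.astype(float).copy()
    nonempty = counts > 0
    new_centers[nonempty] = sums[nonempty] / counts[nonempty, None]
    return new_centers

def lloyd(points, centers, max_iters=50):
    """Alternate center updates and reassignment until the labels stop changing."""
    labels = assign_points(points, centers)
    for _ in range(max_iters):
        centers = update_centers(points, labels, centers)
        new_labels = assign_points(points, centers)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return centers, labels

# Toy usage: two well-separated blobs, seeded with one point from each blob.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(0.0, 0.1, size=(50, 2)),
    rng.normal(5.0, 0.1, size=(50, 2)),
])
centers, labels = lloyd(points, points[[0, 50]].copy())
```

In this sketch the center update is linear in the number of points, while `assign_points` builds a full $n \times k$ distance matrix; for the scales mentioned in the abstract that matrix is what becomes impractical.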

Related papers

Almost-linear Time Approximation Algorithm to Euclidean $k$-median and $k$-means [4.271492285528115]
We focus on the Euclidean $k$-median and $k$-means problems, two of the standard ways to model the task of clustering. In this paper, we almost answer this question by presenting an almost linear-time algorithm to compute a constant-factor approximation.
arXiv Detail & Related papers (2024-07-15T20:04:06Z)
A Scalable Algorithm for Individually Fair K-means Clustering [77.93955971520549]
We present a scalable algorithm for the individually fair ($p$, $k$)-clustering problem introduced by Jung et al. and Mahabadi et al. A clustering is then called individually fair if it has centers within distance $\delta(x)$ of $x$ for each $x \in P$. We show empirically that not only is our algorithm much faster than prior work, but it also produces lower-cost solutions.
arXiv Detail & Related papers (2024-02-09T19:01:48Z)
Worst-case Performance of Popular Approximate Nearest Neighbor Search Implementations: Guarantees and Limitations [20.944914202453962]
We study the worst-case performance of graph-based approximate nearest neighbor search algorithms. For DiskANN, we show that its "slow preprocessing" version provably supports approximate nearest neighbor search query. We present a family of instances on which the empirical query time required to achieve a "reasonable" accuracy is linear in instance size.
arXiv Detail & Related papers (2023-10-29T19:25:48Z)
Multi-Swap $k$-Means++ [30.967186562175893]
The $k$-means++ algorithm of Arthur and Vassilvitskii (SODA 2007) is often the practitioners' choice algorithm for optimizing the popular $k$-means clustering objective. Lattanzi and Sohler (ICML) proposed augmenting $k$-means++ with $O(k \log \log k)$ local search steps to yield a $c$-approximation to the $k$-means clustering problem.
arXiv Detail & Related papers (2023-09-28T12:31:35Z)
Do you know what q-means? [50.045011844765185]
Clustering is one of the most important tools for analysis of large datasets. We present an improved version of the "$q$-means" algorithm for clustering. We also present a "dequantized" algorithm for $\varepsilon$ which runs in $O\big(\frac{k^{2}}{\varepsilon^{2}}(\sqrt{k}d + \log(Nd))\big)$.
arXiv Detail & Related papers (2023-08-18T17:52:12Z)
An Oblivious Stochastic Composite Optimization Algorithm for Eigenvalue Optimization Problems [76.2042837251496]
We introduce two oblivious mirror descent algorithms based on a complementary composite setting. Remarkably, both algorithms work without prior knowledge of the Lipschitz constant or smoothness of the objective function. We show how to extend our framework to scale and demonstrate the efficiency and robustness of our methods on large scale semidefinite programs.
arXiv Detail & Related papers (2023-06-30T08:34:29Z)
Computing Star Discrepancies with Numerical Black-Box Optimization Algorithms [56.08144272945755]
We compare 8 popular numerical black-box optimization algorithms on the $L_\infty$ star discrepancy problem. We show that all used solvers perform very badly on a large majority of the instances. We suspect that state-of-the-art numerical black-box optimization techniques fail to capture the global structure of the problem.
arXiv Detail & Related papers (2023-06-29T14:57:56Z)
Scalable Differentially Private Clustering via Hierarchically Separated Trees [82.69664595378869]
We show that our method computes a solution with cost at most $O(d^{3/2}\log n)\cdot OPT + O(k d^2 \log^2 n / \epsilon^2)$, where $\epsilon$ is the privacy guarantee. Although the worst-case guarantee is worse than that of state of the art private clustering methods, the algorithm we propose is practical.
arXiv Detail & Related papers (2022-06-17T09:24:41Z)
Systematically improving existing k-means initialization algorithms at nearly no cost, by pairwise-nearest-neighbor smoothing [1.2570180539670577]
We present a meta-method for initializing the $k$-means clustering algorithm called PNN-smoothing. It consists in splitting a given dataset into $J$ random subsets, clustering each of them individually, and merging the resulting clusterings with the pairwise-nearest-neighbor method.
arXiv Detail & Related papers (2022-02-08T15:56:30Z)
Exact and Approximate Hierarchical Clustering Using A* [51.187990314731344]
We introduce a new approach based on A* search for clustering. We overcome the prohibitively large search space by combining A* with a novel trellis data structure. We empirically demonstrate that our method achieves substantially higher quality results than baselines for a particle physics use case and other clustering benchmarks.
arXiv Detail & Related papers (2021-04-14T18:15:27Z)
Streaming Complexity of SVMs [110.63976030971106]
We study the space complexity of solving the bias-regularized SVM problem in the streaming model. We show that for both problems, for dimensions of $\frac{1}{\lambda\epsilon}$, one can obtain streaming algorithms with space smaller than $\frac{1}{\lambda\epsilon}$.
arXiv Detail & Related papers (2020-07-07T17:10:00Z)

This list is automatically generated from the titles and abstracts of the papers on this site.