Improved Outlier Robust Seeding for k-means
- URL: http://arxiv.org/abs/2309.02710v1
- Date: Wed, 6 Sep 2023 04:46:01 GMT
- Title: Improved Outlier Robust Seeding for k-means
- Authors: Amit Deshpande and Rameshwar Pratap
- Abstract summary: In the presence of adversarial noise or outliers, $D^2$ sampling is more likely to pick centers from distant outliers instead of inlier clusters.
We propose a simple variant of the $D^2$ sampling distribution, which makes it robust to the outliers.
Our algorithm can also be modified to output exactly $k$ clusters instead of $O(k)$ clusters.
- Score: 3.9973713691377646
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The $k$-means is a popular clustering objective, although it is inherently
non-robust and sensitive to outliers. Its popular seeding or initialization
called $k$-means++ uses $D^{2}$ sampling and comes with a provable $O(\log k)$
approximation guarantee \cite{AV2007}. However, in the presence of adversarial
noise or outliers, $D^{2}$ sampling is more likely to pick centers from distant
outliers instead of inlier clusters, and therefore its approximation guarantee
\textit{w.r.t.} the $k$-means solution on the inliers does not hold.
Assuming that the outliers constitute a constant fraction of the given data,
we propose a simple variant of the $D^2$ sampling distribution, which makes it
robust to the outliers. Our algorithm runs in $O(ndk)$ time, outputs $O(k)$
clusters, discards marginally more points than the optimal number of outliers,
and comes with a provable $O(1)$ approximation guarantee.
Our algorithm can also be modified to output exactly $k$ clusters instead of
$O(k)$ clusters, while keeping its running time linear in $n$ and $d$. This is
an improvement over previous results for robust $k$-means based on LP
relaxation and rounding \cite{Charikar}, \cite{KrishnaswamyLS18} and
\textit{robust $k$-means++} \cite{DeshpandeKP20}. Our empirical results show
the advantage of our algorithm over $k$-means++~\cite{AV2007}, uniform random
seeding, greedy sampling for $k$-means~\cite{tkmeanspp}, and robust
$k$-means++~\cite{DeshpandeKP20}, on standard real-world and synthetic data
sets used in previous work. Our proposal is easily amenable to scalable,
faster, parallel implementations of $k$-means++ \cite{Bahmani,BachemL017} and
is of independent interest for coreset constructions in the presence of
outliers \cite{feldman2007ptas,langberg2010universal,feldman2011unified}.
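To make the seeding idea concrete, here is a minimal Python sketch. It assumes the robust variant mixes a uniform component into the usual $D^2$ sampling weights, in the spirit of robust $k$-means++ \cite{DeshpandeKP20}; the function name, the mixing weight `alpha`, and the oversampling factor are illustrative assumptions, not the authors' exact construction.

```python
import numpy as np

def robust_d2_seeding(X, k, alpha=0.5, oversample=2, seed=None):
    """Outlier-robust D^2-style seeding (illustrative sketch).

    Plain k-means++ samples each new center with probability
    proportional to D^2(x), the squared distance from x to its
    nearest chosen center, so a single distant outlier can
    dominate the distribution. Mixing in a uniform component
    bounds the probability mass any one point can attract.
    NOTE: the mixture, `alpha`, and `oversample` are assumptions
    for illustration, not necessarily the paper's exact variant.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    m = oversample * k  # output O(k) centers, as in the abstract
    centers = [X[rng.integers(n)]]  # first center: uniform at random
    d2 = ((X - centers[0]) ** 2).sum(axis=1)
    for _ in range(m - 1):
        # Robust sampling distribution: uniform / D^2 mixture.
        p = alpha / n + (1.0 - alpha) * d2 / max(d2.sum(), 1e-12)
        idx = rng.choice(n, p=p / p.sum())
        centers.append(X[idx])
        # Update each point's squared distance to its nearest center.
        d2 = np.minimum(d2, ((X - X[idx]) ** 2).sum(axis=1))
    return np.asarray(centers)

# Example: 3 inlier clusters on a line plus a few far-away outliers.
rng = np.random.default_rng(0)
inliers = np.concatenate(
    [rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0.0, 10.0, 20.0)]
)
outliers = rng.normal(loc=200.0, scale=1.0, size=(5, 2))
X = np.concatenate([inliers, outliers])
centers = robust_d2_seeding(X, k=3)
```

After seeding, discarding the points farthest from the chosen centers would play the role of the outlier-removal step; how aggressively one discards corresponds to the abstract's guarantee of removing only marginally more points than the optimal number of outliers.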
Related papers
- Simple, Scalable and Effective Clustering via One-Dimensional
Projections [10.807367640692021]
Clustering is a fundamental problem in unsupervised machine learning with many applications in data analysis.
We introduce a simple randomized clustering algorithm that provably runs in expected time $O(\mathrm{nnz}(X) + n\log n)$ for arbitrary $k$.
We prove that our algorithm achieves approximation ratio $\widetilde{O}(k^{4})$ on any input dataset for the $k$-means objective.
arXiv Detail & Related papers (2023-10-25T16:37:45Z) - Multi-Swap $k$-Means++ [30.967186562175893]
The $k$-means++ algorithm of Arthur and Vassilvitskii (SODA 2007) is often the practitioners' choice algorithm for optimizing the popular $k$-means clustering objective.
Lattanzi and Sohler (ICML 2019) proposed augmenting $k$-means++ with $O(k \log \log k)$ local search steps to yield a $c$-approximation to the $k$-means clustering problem.
arXiv Detail & Related papers (2023-09-28T12:31:35Z) - Do you know what q-means? [50.045011844765185]
Clustering is one of the most important tools for analysis of large datasets.
We present an improved version of the "$q$-means" algorithm for clustering.
We also present a "dequantized" algorithm for $varepsilon which runs in $Obig(frack2varepsilon2(sqrtkd + log(Nd))big.
arXiv Detail & Related papers (2023-08-18T17:52:12Z) - Replicable Clustering [57.19013971737493]
We propose algorithms for the statistical $k$-medians, statistical $k$-means, and statistical $k$-centers problems by utilizing approximation routines for their counterparts in a black-box manner.
We also provide experiments on synthetic distributions in 2D using the $k$-means++ implementation from sklearn as a black-box that validate our theoretical results.
arXiv Detail & Related papers (2023-02-20T23:29:43Z) - Global $k$-means$++$: an effective relaxation of the global $k$-means
clustering algorithm [0.20305676256390928]
The $k$-means algorithm is a prevalent clustering method due to its simplicity, effectiveness, and speed.
We propose the global $k$-means++ clustering algorithm, which is an effective way of acquiring quality clustering solutions.
arXiv Detail & Related papers (2022-11-22T13:42:53Z) - Improved Learning-augmented Algorithms for k-means and k-medians
Clustering [8.04779839951237]
We consider the problem of clustering in the learning-augmented setting, where we are given a data set in $d$-dimensional Euclidean space.
We propose a deterministic $k$-means algorithm that produces centers with improved bound on clustering cost.
Our algorithm works even when the predictions are not very accurate, i.e., our bound holds for $\alpha$ up to $1/2$, an improvement over $\alpha$ being at most $1/7$ in the previous work.
arXiv Detail & Related papers (2022-10-31T03:00:11Z) - Random matrices in service of ML footprint: ternary random features with
no performance loss [55.30329197651178]
We show that the eigenspectrum of $\mathbf{K}$ is independent of the distribution of the i.i.d. entries of $\mathbf{w}$.
We propose a novel random technique called Ternary Random Feature (TRF).
The computation of the proposed random features requires no multiplication and a factor of $b$ less bits for storage compared to classical random features.
arXiv Detail & Related papers (2021-10-05T09:33:49Z) - Clustering Mixture Models in Almost-Linear Time via List-Decodable Mean
Estimation [58.24280149662003]
We study the problem of list-decodable mean estimation, where an adversary can corrupt a majority of the dataset.
We develop new algorithms for list-decodable mean estimation, achieving nearly-optimal statistical guarantees.
arXiv Detail & Related papers (2021-06-16T03:34:14Z) - List-Decodable Mean Estimation in Nearly-PCA Time [50.79691056481693]
We study the fundamental task of list-decodable mean estimation in high dimensions.
Our algorithm runs in time $\widetilde{O}(ndk)$ for all $k = O(\sqrt{d}) \cup \Omega(d)$, where $n$ is the size of the dataset.
A variant of our algorithm has runtime $\widetilde{O}(ndk)$ for all $k$, at the expense of an $O(\sqrt{\log k})$ factor in the recovery guarantee.
arXiv Detail & Related papers (2020-11-19T17:21:37Z) - Adapting $k$-means algorithms for outliers [1.9290392443571387]
We show how to adapt several simple sampling-based algorithms for the $k$-means problem to the setting with outliers.
Our algorithms output $(1+\varepsilon)z$ outliers while achieving an $O(1/\varepsilon)$-approximation to the objective function.
arXiv Detail & Related papers (2020-07-02T14:14:33Z) - Maximizing Determinants under Matroid Constraints [69.25768526213689]
We study the problem of finding a basis $S$ of $M$ such that $\det(\sum_{i \in S} v_i v_i^{\top})$ is maximized.
This problem appears in a diverse set of areas such as experimental design, fair allocation of goods, network design, and machine learning.
arXiv Detail & Related papers (2020-04-16T19:16:38Z)