Improved Outlier Robust Seeding for k-means
- URL: http://arxiv.org/abs/2309.02710v1
- Date: Wed, 6 Sep 2023 04:46:01 GMT
- Title: Improved Outlier Robust Seeding for k-means
- Authors: Amit Deshpande and Rameshwar Pratap
- Abstract summary: In the presence of adversarial noise or outliers, $D^2$ sampling is more likely to pick centers from distant outliers instead of inlier clusters.
We propose a simple variant of the $D^2$ sampling distribution, which makes it robust to the outliers.
Our algorithm can also be modified to output exactly $k$ clusters instead of $O(k)$ clusters.
- Score: 3.9973713691377646
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The $k$-means is a popular clustering objective, although it is inherently
non-robust and sensitive to outliers. Its popular seeding or initialization
called $k$-means++ uses $D^{2}$ sampling and comes with a provable $O(\log k)$
approximation guarantee \cite{AV2007}. However, in the presence of adversarial
noise or outliers, $D^{2}$ sampling is more likely to pick centers from distant
outliers instead of inlier clusters, and therefore its approximation guarantee
\textit{w.r.t.} the $k$-means solution on the inliers does not hold.
Assuming that the outliers constitute a constant fraction of the given data,
we propose a simple variant of the $D^2$ sampling distribution, which makes it
robust to the outliers. Our algorithm runs in $O(ndk)$ time, outputs $O(k)$
clusters, discards marginally more points than the optimal number of outliers,
and comes with a provable $O(1)$ approximation guarantee.
Our algorithm can also be modified to output exactly $k$ clusters instead of
$O(k)$ clusters, while keeping its running time linear in $n$ and $d$. This is
an improvement over previous results for robust $k$-means based on LP
relaxation and rounding \cite{Charikar}, \cite{KrishnaswamyLS18} and
\textit{robust $k$-means++} \cite{DeshpandeKP20}. Our empirical results show
the advantage of our algorithm over $k$-means++~\cite{AV2007}, uniform random
seeding, greedy sampling for $k$-means~\cite{tkmeanspp}, and robust
$k$-means++~\cite{DeshpandeKP20}, on standard real-world and synthetic data
sets used in previous work. Our proposal is easily amenable to scalable,
faster, parallel implementations of $k$-means++ \cite{Bahmani,BachemL017} and
is of independent interest for coreset constructions in the presence of
outliers \cite{feldman2007ptas,langberg2010universal,feldman2011unified}.
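To make the seeding idea concrete, here is a minimal Python sketch. It assumes the robust variant mixes a uniform component into the usual $D^2$ sampling weights, in the spirit of robust $k$-means++ \cite{DeshpandeKP20}; the function name, the mixing weight `alpha`, and the oversampling factor are illustrative assumptions, not the authors' exact construction.

```python
import numpy as np

def robust_d2_seeding(X, k, alpha=0.5, oversample=2, seed=None):
    """Outlier-robust D^2-style seeding (illustrative sketch).

    Plain k-means++ samples each new center with probability
    proportional to D^2(x), the squared distance from x to its
    nearest chosen center, so a single distant outlier can
    dominate the distribution. Mixing in a uniform component
    bounds the probability mass any one point can attract.
    NOTE: the mixture, `alpha`, and `oversample` are assumptions
    for illustration, not necessarily the paper's exact variant.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    m = oversample * k  # output O(k) centers, as in the abstract
    centers = [X[rng.integers(n)]]  # first center: uniform at random
    d2 = ((X - centers[0]) ** 2).sum(axis=1)
    for _ in range(m - 1):
        # Robust sampling distribution: uniform / D^2 mixture.
        p = alpha / n + (1.0 - alpha) * d2 / max(d2.sum(), 1e-12)
        idx = rng.choice(n, p=p / p.sum())
        centers.append(X[idx])
        # Update each point's squared distance to its nearest center.
        d2 = np.minimum(d2, ((X - X[idx]) ** 2).sum(axis=1))
    return np.asarray(centers)

# Example: 3 inlier clusters on a line plus a few far-away outliers.
rng = np.random.default_rng(0)
inliers = np.concatenate(
    [rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0.0, 10.0, 20.0)]
)
outliers = rng.normal(loc=200.0, scale=1.0, size=(5, 2))
X = np.concatenate([inliers, outliers])
centers = robust_d2_seeding(X, k=3)
```

After seeding, discarding the points farthest from the chosen centers would play the role of the outlier-removal step; how aggressively one discards corresponds to the abstract's guarantee of removing only marginally more points than the optimal number of outliers.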
Related papers
- Simple, Scalable and Effective Clustering via One-Dimensional
Projections [10.807367640692021]
Clustering is a fundamental problem in unsupervised machine learning with many applications in data analysis.
We introduce a simple randomized clustering algorithm that provably runs in expected time $O(\mathrm{nnz}(X) + n\log n)$ for arbitrary $k$.
We prove that our algorithm achieves approximation ratio $\widetilde{O}(k^{4})$ on any input dataset for the $k$-means objective.
arXiv Detail & Related papers (2023-10-25T16:37:45Z) - Multi-Swap $k$-Means++ [30.967186562175893]
The $k$-means++ algorithm of Arthur and Vassilvitskii (SODA 2007) is often the practitioners' choice algorithm for optimizing the popular $k$-means clustering objective.
Lattanzi and Sohler (ICML 2019) proposed augmenting $k$-means++ with $O(k \log \log k)$ local search steps to yield a $c$-approximation to the $k$-means clustering problem.
arXiv Detail & Related papers (2023-09-28T12:31:35Z) - Do you know what q-means? [50.045011844765185]
Clustering is one of the most important tools for analysis of large datasets.
We present an improved version of the "$q$-means" algorithm for clustering.
We also present a "dequantized" algorithm for $varepsilon which runs in $Obig(frack2varepsilon2(sqrtkd + log(Nd))big.
arXiv Detail & Related papers (2023-08-18T17:52:12Z) - Replicable Clustering [57.19013971737493]
We propose algorithms for the statistical $k$-medians, statistical $k$-means, and statistical $k$-centers problems by utilizing approximation routines for their counterparts in a black-box manner.
We also provide experiments on synthetic distributions in 2D using the $k$-means++ implementation from sklearn as a black-box that validate our theoretical results.
arXiv Detail & Related papers (2023-02-20T23:29:43Z) - Global $k$-means$++$: an effective relaxation of the global $k$-means
clustering algorithm [0.20305676256390928]
The $k$-means algorithm is a prevalent clustering method due to its simplicity, effectiveness, and speed.
We propose the global $k$-means++ clustering algorithm, which is an effective way of acquiring quality clustering solutions.
arXiv Detail & Related papers (2022-11-22T13:42:53Z) - Improved Learning-augmented Algorithms for k-means and k-medians
Clustering [8.04779839951237]
We consider the problem of clustering in the learning-augmented setting, where we are given a data set in $d$-dimensional Euclidean space.
We propose a deterministic $k$-means algorithm that produces centers with improved bound on clustering cost.
Our algorithm works even when the predictions are not very accurate, i.e., our bound holds for $\alpha$ up to $1/2$, an improvement over $\alpha$ being at most $1/7$ in the previous work.
arXiv Detail & Related papers (2022-10-31T03:00:11Z) - Random matrices in service of ML footprint: ternary random features with
no performance loss [55.30329197651178]
We show that the eigenspectrum of $\mathbf{K}$ is independent of the distribution of the i.i.d. entries of $\mathbf{w}$.
We propose a novel random technique called Ternary Random Feature (TRF).
The computation of the proposed random features requires no multiplication and a factor of $b$ less bits for storage compared to classical random features.
arXiv Detail & Related papers (2021-10-05T09:33:49Z) - Clustering Mixture Models in Almost-Linear Time via List-Decodable Mean
Estimation [58.24280149662003]
We study the problem of list-decodable mean estimation, where an adversary can corrupt a majority of the dataset.
We develop new algorithms for list-decodable mean estimation, achieving nearly-optimal statistical guarantees.
arXiv Detail & Related papers (2021-06-16T03:34:14Z) - List-Decodable Mean Estimation in Nearly-PCA Time [50.79691056481693]
We study the fundamental task of list-decodable mean estimation in high dimensions.
Our algorithm runs in time $\widetilde{O}(ndk)$ for all $k = O(\sqrt{d}) \cup \Omega(d)$, where $n$ is the size of the dataset.
A variant of our algorithm has runtime $\widetilde{O}(ndk)$ for all $k$, at the expense of an $O(\sqrt{\log k})$ factor in the recovery guarantee.
arXiv Detail & Related papers (2020-11-19T17:21:37Z) - Adapting $k$-means algorithms for outliers [1.9290392443571387]
We show how to adapt several simple sampling-based algorithms for the $k$-means problem to the setting with outliers.
Our algorithms output $(1+\varepsilon)z$ outliers while achieving an $O(1/\varepsilon)$-approximation to the objective function.
arXiv Detail & Related papers (2020-07-02T14:14:33Z) - Maximizing Determinants under Matroid Constraints [69.25768526213689]
We study the problem of finding a basis $S$ of $M$ such that $\det(\sum_{i \in S} v_i v_i^{\top})$ is maximized.
This problem appears in a diverse set of areas such as experimental design, fair allocation of goods, network design, and machine learning.
arXiv Detail & Related papers (2020-04-16T19:16:38Z)