Adapting $k$-means algorithms for outliers
- URL: http://arxiv.org/abs/2007.01118v2
- Date: Fri, 23 Sep 2022 14:26:00 GMT
- Title: Adapting $k$-means algorithms for outliers
- Authors: Christoph Grunau and Václav Rozhoň
- Abstract summary: We show how to adapt several simple sampling-based algorithms for the $k$-means problem to the setting with outliers.
Our algorithms output $(1+\varepsilon)z$ outliers while achieving an $O(1/\varepsilon)$-approximation to the objective function.
- Score: 1.9290392443571387
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper shows how to adapt several simple and classical sampling-based
algorithms for the $k$-means problem to the setting with outliers.
Recently, Bhaskara et al. (NeurIPS 2019) showed how to adapt the classical
$k$-means++ algorithm to the setting with outliers. However, their algorithm
needs to output $O(\log (k) \cdot z)$ outliers, where $z$ is the number of true
outliers, to match the $O(\log k)$-approximation guarantee of $k$-means++. In
this paper, we build on their ideas and show how to adapt several sequential
and distributed $k$-means algorithms to the setting with outliers, but with
substantially stronger theoretical guarantees: our algorithms output
$(1+\varepsilon)z$ outliers while achieving an $O(1 /
\varepsilon)$-approximation to the objective function. In the sequential world,
we achieve this by adapting a recent algorithm of Lattanzi and Sohler (ICML
2019). In the distributed setting, we adapt a simple algorithm of Guha et al.
(IEEE Trans. Know. and Data Engineering 2003) and the popular $k$-means$\|$ of
Bahmani et al. (PVLDB 2012).
A theoretical application of our techniques is an algorithm with running time
$\tilde{O}(nk^2/z)$ that achieves an $O(1)$-approximation to the objective
function while outputting $O(z)$ outliers, assuming $k \ll z \ll n$. This is
complemented with a matching lower bound of $\Omega(nk^2/z)$ for this problem
in the oracle model.
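For reference, the objective discussed above is the standard $k$-means-with-outliers cost: given a point set $X$, a set $C$ of $k$ centers, and an outlier budget $z$, the cost is $\mathrm{cost}_z(C) = \min_{Z \subseteq X,\, |Z| \le z} \sum_{x \in X \setminus Z} \min_{c \in C} \lVert x - c \rVert^2$ (this formulation is standard; the notation is ours, chosen to match the abstract's parameters, not copied from the paper). The guarantees above compare against the optimum that discards only $z$ points, while the algorithms are allowed to discard up to $(1+\varepsilon)z$ points and pay an extra $O(1/\varepsilon)$ factor in cost.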
Related papers
- A Scalable Algorithm for Individually Fair K-means Clustering [77.93955971520549]
We present a scalable algorithm for the individually fair ($p$, $k$)-clustering problem introduced by Jung et al. and Mahabadi et al.
A clustering is then called individually fair if it has a center within distance $\delta(x)$ of $x$ for each $x \in P$.
We show empirically that not only is our algorithm much faster than prior work, but it also produces lower-cost solutions.
arXiv Detail & Related papers (2024-02-09T19:01:48Z) - Simple, Scalable and Effective Clustering via One-Dimensional
Projections [10.807367640692021]
Clustering is a fundamental problem in unsupervised machine learning with many applications in data analysis.
We introduce a simple randomized clustering algorithm that provably runs in expected time $O(\mathrm{nnz}(X) + n\log n)$ for arbitrary $k$.
We prove that our algorithm achieves approximation ratio $\widetilde{O}(k^4)$ on any input dataset for the $k$-means objective.
arXiv Detail & Related papers (2023-10-25T16:37:45Z) - Improved Outlier Robust Seeding for k-means [3.9973713691377646]
In the presence of adversarial noise or outliers, $D^2$ sampling is more likely to pick centers from distant outliers instead of from inlier clusters.
We propose a simple variant of the $D^2$ sampling distribution which makes it robust to the outliers (a hedged sketch of one such modification follows this entry).
Our algorithm can also be modified to output exactly $k$ clusters instead of $O(k)$ clusters.
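As a rough illustration of how $D^2$-style ($k$-means++) seeding can be made less sensitive to outliers, here is a minimal sketch in Python. It caps each point's contribution to the sampling distribution so that the $z$ most distant points cannot absorb most of the sampling mass; the capping rule, the function name, and the parameters are our own assumptions for illustration, not the exact procedure of the papers above.

```python
import numpy as np

def robust_d2_seeding(X, k, z, rng=None):
    """Illustrative D^2-sampling (k-means++-style) seeding with a simple
    outlier-aware cap. NOT the exact algorithm of any cited paper: capping
    each point's sampling mass is a common heuristic to keep far-away
    outliers from dominating the D^2 distribution.

    X : (n, d) array of points
    k : number of centers to pick
    z : assumed number of outliers (used only to set the cap)
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]

    # First center: uniform over the data.
    centers = [X[rng.integers(n)]]
    d2 = np.sum((X - centers[0]) ** 2, axis=1)

    for _ in range(k - 1):
        # Cap each point's contribution at the (n - z)-th smallest squared
        # distance, so the z most distant points cannot dominate (heuristic).
        kth = max(n - z - 1, 0)
        cap = np.partition(d2, kth)[kth]
        weights = np.minimum(d2, cap)
        if weights.sum() == 0:
            break  # every point already coincides with some chosen center
        probs = weights / weights.sum()
        idx = rng.choice(n, p=probs)
        centers.append(X[idx])
        # Update each point's squared distance to its nearest chosen center.
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))

    return np.array(centers)
```

Usage would look like `centers = robust_d2_seeding(X, k=10, z=50)`, typically followed by Lloyd-style refinement that, in the outlier setting, is also allowed to ignore the points furthest from their nearest center in each step.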
arXiv Detail & Related papers (2023-09-06T04:46:01Z) - Do you know what q-means? [50.045011844765185]
Clustering is one of the most important tools for analysis of large datasets.
We present an improved version of the "$q$-means" algorithm for clustering.
We also present a "dequantized" algorithm for $varepsilon which runs in $Obig(frack2varepsilon2(sqrtkd + log(Nd))big.
arXiv Detail & Related papers (2023-08-18T17:52:12Z) - Fast $(1+\varepsilon)$-Approximation Algorithms for Binary Matrix
Factorization [54.29685789885059]
We introduce efficient $(1+\varepsilon)$-approximation algorithms for the binary matrix factorization (BMF) problem.
The goal is to approximate $\mathbf{A}$ as a product of low-rank factors.
Our techniques generalize to other common variants of the BMF problem.
arXiv Detail & Related papers (2023-06-02T18:55:27Z) - A Nearly Tight Analysis of Greedy k-means++ [1.452875650827562]
The $k$-means++ algorithm is known to return a $\Theta(\log k)$-approximate solution in expectation.
We present nearly matching lower and upper bounds for the greedy $k$-means++ algorithm.
arXiv Detail & Related papers (2022-07-16T13:49:07Z) - Sketching Algorithms and Lower Bounds for Ridge Regression [65.0720777731368]
We give a sketching-based iterative algorithm that computes $(1+\varepsilon)$-approximate solutions for the ridge regression problem.
We also show that this algorithm can be used to give faster algorithms for kernel ridge regression.
arXiv Detail & Related papers (2022-04-13T22:18:47Z) - Near-optimal Algorithms for Explainable k-Medians and k-Means [12.68470213641421]
We consider the problem of explainable $k$-medians and $k$-means introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian (ICML 2020).
Our goal is to find an optimal decision tree that partitions the data into $k$ clusters and minimizes the $k$-medians or $k$-means objective.
arXiv Detail & Related papers (2021-07-02T02:07:12Z) - Clustering Mixture Models in Almost-Linear Time via List-Decodable Mean
Estimation [58.24280149662003]
We study the problem of list-decodable mean estimation, where an adversary can corrupt a majority of the dataset.
We develop new algorithms for list-decodable mean estimation, achieving nearly-optimal statistical guarantees.
arXiv Detail & Related papers (2021-06-16T03:34:14Z) - List-Decodable Mean Estimation in Nearly-PCA Time [50.79691056481693]
We study the fundamental task of list-decodable mean estimation in high dimensions.
Our algorithm runs in time $\widetilde{O}(ndk)$ for all $k = O(\sqrt{d}) \cup \Omega(d)$, where $n$ is the size of the dataset.
A variant of our algorithm has runtime $\widetilde{O}(ndk)$ for all $k$, at the expense of an $O(\sqrt{\log k})$ factor in the recovery guarantee.
arXiv Detail & Related papers (2020-11-19T17:21:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.