Adversarially robust clustering with optimality guarantees
- URL: http://arxiv.org/abs/2306.09977v2
- Date: Wed, 14 Aug 2024 06:53:41 GMT
- Title: Adversarially robust clustering with optimality guarantees
- Authors: Soham Jana, Kun Yang, Sanjeev Kulkarni,
- Abstract summary: We consider the problem of clustering data points coming from sub-Gaussian mixtures.
Existing methods that provably achieve the optimal mislabeling error, such as the Lloyd algorithm, are usually vulnerable to outliers.
We propose a simple robust algorithm based on the coordinatewise median that obtains the optimal mislabeling rate even when we allow adversarial outliers to be present.
- Score: 7.0830450321883935
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the problem of clustering data points coming from sub-Gaussian mixtures. Existing methods that provably achieve the optimal mislabeling error, such as the Lloyd algorithm, are usually vulnerable to outliers. In contrast, clustering methods seemingly robust to adversarial perturbations are not known to satisfy the optimal statistical guarantees. We propose a simple robust algorithm based on the coordinatewise median that obtains the optimal mislabeling rate even when we allow adversarial outliers to be present. Our algorithm achieves the optimal error rate in constant iterations when a weak initialization condition is satisfied. In the absence of outliers, in fixed dimensions, our theoretical guarantees are similar to that of the Lloyd algorithm. Extensive experiments on various simulated and public datasets are conducted to support the theoretical guarantees of our method.
Related papers
- A provable initialization and robust clustering method for general mixture models [6.806940901668607]
Clustering is a fundamental tool in statistical machine learning in the presence of heterogeneous data.
Most recent results focus on optimal mislabeling guarantees when data are distributed around centroids with sub-Gaussian errors.
arXiv Detail & Related papers (2024-01-10T22:56:44Z) - Best Arm Identification with Fixed Budget: A Large Deviation Perspective [54.305323903582845]
We present sred, a truly adaptive algorithm that can reject arms in it any round based on the observed empirical gaps between the rewards of various arms.
In particular, we present sred, a truly adaptive algorithm that can reject arms in it any round based on the observed empirical gaps between the rewards of various arms.
arXiv Detail & Related papers (2023-12-19T13:17:43Z) - Binary Classification with Confidence Difference [100.08818204756093]
This paper delves into a novel weakly supervised binary classification problem called confidence-difference (ConfDiff) classification.
We propose a risk-consistent approach to tackle this problem and show that the estimation error bound the optimal convergence rate.
We also introduce a risk correction approach to mitigate overfitting problems, whose consistency and convergence rate are also proven.
arXiv Detail & Related papers (2023-10-09T11:44:50Z) - Scalable Clustering: Large Scale Unsupervised Learning of Gaussian
Mixture Models with Outliers [5.478764356647437]
This paper introduces a provably robust clustering algorithm based on loss minimization.
It provides theoretical guarantees that the algorithm obtains high accuracy with high probability.
Experiments on real-world large-scale datasets demonstrate the effectiveness of the algorithm.
arXiv Detail & Related papers (2023-02-28T14:39:18Z) - Robust Consensus Clustering and its Applications for Advertising
Forecasting [18.242055675730253]
We propose a novel algorithm -- robust consensus clustering that can find common ground truth among experts' opinions.
We apply the proposed method to the real-world advertising campaign segmentation and forecasting tasks.
arXiv Detail & Related papers (2022-12-27T21:49:04Z) - Mean-based Best Arm Identification in Stochastic Bandits under Reward
Contamination [80.53485617514707]
This paper proposes two algorithms, a gap-based algorithm and one based on the successive elimination, for best arm identification in sub-Gaussian bandits.
Specifically, for the gap-based algorithm, the sample complexity is optimal up to constant factors, while for the successive elimination, it is optimal up to logarithmic factors.
arXiv Detail & Related papers (2021-11-14T21:49:58Z) - Learning to Hash Robustly, with Guarantees [79.68057056103014]
In this paper, we design an NNS algorithm for the Hamming space that has worst-case guarantees essentially matching that of theoretical algorithms.
We evaluate the algorithm's ability to optimize for a given dataset both theoretically and practically.
Our algorithm has a 1.8x and 2.1x better recall on the worst-performing queries to the MNIST and ImageNet datasets.
arXiv Detail & Related papers (2021-08-11T20:21:30Z) - High Probability Complexity Bounds for Non-Smooth Stochastic Optimization with Heavy-Tailed Noise [51.31435087414348]
It is essential to theoretically guarantee that algorithms provide small objective residual with high probability.
Existing methods for non-smooth convex optimization have complexity bounds with dependence on confidence level.
We propose novel stepsize rules for two methods with gradient clipping.
arXiv Detail & Related papers (2021-06-10T17:54:21Z) - Sparse PCA: Algorithms, Adversarial Perturbations and Certificates [9.348107805982604]
We study efficient algorithms for Sparse PCA in standard statistical models.
Our goal is to achieve optimal recovery guarantees while being resilient to small perturbations.
arXiv Detail & Related papers (2020-11-12T18:58:51Z) - Large-Scale Methods for Distributionally Robust Optimization [53.98643772533416]
We prove that our algorithms require a number of evaluations gradient independent of training set size and number of parameters.
Experiments on MNIST and ImageNet confirm the theoretical scaling of our algorithms, which are 9--36 times more efficient than full-batch methods.
arXiv Detail & Related papers (2020-10-12T17:41:44Z) - Consistency of Anchor-based Spectral Clustering [0.0]
Anchor-based techniques reduce the computational complexity of spectral clustering algorithms.
We show that it is amenable to rigorous analysis, as well as being effective in practice.
We find that it is competitive with the state-of-the-art LSC method of Chen and Cai.
arXiv Detail & Related papers (2020-06-24T18:34:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.