Uniform Concentration Bounds toward a Unified Framework for Robust
Clustering
- URL: http://arxiv.org/abs/2110.14148v1
- Date: Wed, 27 Oct 2021 03:43:44 GMT
- Title: Uniform Concentration Bounds toward a Unified Framework for Robust
Clustering
- Authors: Debolina Paul, Saptarshi Chakraborty, Swagatam Das and Jason Xu
- Abstract summary: Recent advances in center-based clustering continue to improve upon the drawbacks of Lloyd's celebrated $k$-means algorithm.
Various methods seek to address poor local minima, sensitivity to outliers, and data that are not well-suited to Euclidean measures of fit.
This paper proposes a cohesive robust framework for center-based clustering under a general class of dissimilarity measures.
- Score: 21.789311405437573
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in center-based clustering continue to improve upon the
drawbacks of Lloyd's celebrated $k$-means algorithm over $60$ years after its
introduction. Various methods seek to address poor local minima, sensitivity to
outliers, and data that are not well-suited to Euclidean measures of fit, but
many are supported largely empirically. Moreover, combining such approaches in
a piecemeal manner can result in ad hoc methods, and the limited theoretical
results supporting each individual contribution may no longer hold. Toward
addressing these issues in a principled way, this paper proposes a cohesive
robust framework for center-based clustering under a general class of
dissimilarity measures. In particular, we present a rigorous theoretical
treatment within a Median-of-Means (MoM) estimation framework, showing that it
subsumes several popular $k$-means variants. In addition to unifying existing
methods, we derive uniform concentration bounds that complete their analyses,
and bridge these results to the MoM framework via Dudley's chaining arguments.
Importantly, we neither require any assumptions on the distribution of the
outlying observations nor on the relative number of observations $n$ to
features $p$. We establish strong consistency and an error rate of
$O(n^{-1/2})$ under mild conditions, surpassing the best-known results in the
literature. The methods are empirically validated thoroughly on real and
synthetic datasets.
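To make the MoM clustering idea concrete, the sketch below replaces the usual average of per-point losses with the median of block-wise average losses, so that a few contaminated blocks cannot dominate the fit. This is a minimal NumPy illustration of the general principle, assuming a squared-Euclidean dissimilarity and a simple Lloyd-style update driven by the median block; it is not the authors' exact algorithm.

```python
import numpy as np

def mom_clustering_objective(X, centers, n_blocks, rng):
    """Median-of-Means surrogate of the k-means objective.

    Points are split into n_blocks disjoint blocks; the objective is the
    median over blocks of the average point-to-nearest-center dissimilarity.
    """
    blocks = np.array_split(rng.permutation(len(X)), n_blocks)
    # squared Euclidean dissimilarity from each point to its nearest center
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(axis=1)
    return np.median([d[b].mean() for b in blocks])

def mom_lloyd_step(X, centers, n_blocks, rng):
    """One illustrative Lloyd-style step: only the median block moves centers."""
    blocks = np.array_split(rng.permutation(len(X)), n_blocks)
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # shape (n, k)
    labels = d.argmin(axis=1)
    block_losses = [d[b, labels[b]].mean() for b in blocks]
    median_block = blocks[np.argsort(block_losses)[len(blocks) // 2]]
    new_centers = centers.copy()
    for k in range(len(centers)):
        members = median_block[labels[median_block] == k]
        if len(members) > 0:
            new_centers[k] = X[members].mean(axis=0)
    return new_centers

# Usage sketch: a 2-cluster dataset with a handful of gross outliers
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(8, 1, (100, 2)),
               rng.normal(50, 1, (5, 2))])          # 5 outliers
centers = X[rng.choice(len(X), 2, replace=False)]
for _ in range(20):
    centers = mom_lloyd_step(X, centers, n_blocks=11, rng=rng)
print(centers)   # typically near (0, 0) and (8, 8), little affected by outliers
```

Updating centers from the median block alone is only one way to damp contaminated blocks; gradient-based or reweighted variants follow the same MoM principle.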
Related papers
- Robust Barycenter Estimation using Semi-Unbalanced Neural Optimal Transport [84.51977664336056]
We propose a novel, scalable approach for estimating the robust continuous barycenter.
Our method is framed as a $\min$-$\max$ optimization problem and is adaptable to general cost functions.
arXiv Detail & Related papers (2024-10-04T23:27:33Z) - Estimating Barycenters of Distributions with Neural Optimal Transport [93.28746685008093]
We propose a new scalable approach for solving the Wasserstein barycenter problem.
Our methodology is based on the recent Neural OT solver.
We also establish theoretical error bounds for our proposed approach.
arXiv Detail & Related papers (2024-02-06T09:17:07Z) - Can Evolutionary Clustering Have Theoretical Guarantees? [21.343803954998915]
Evolutionary clustering has found many successful applications, but all the results are empirical, lacking theoretical support.
This paper fills this gap by proving that the approximation performance of the GSEMO can be theoretically guaranteed.
arXiv Detail & Related papers (2022-12-04T09:00:35Z) - Rethinking Clustering-Based Pseudo-Labeling for Unsupervised
Meta-Learning [146.11600461034746]
CACTUs, a method for unsupervised meta-learning, is a clustering-based approach with pseudo-labeling.
This approach is model-agnostic and can be combined with supervised algorithms to learn from unlabeled data.
We prove that the core reason for this is the lack of a clustering-friendly property in the embedding space.
arXiv Detail & Related papers (2022-09-27T19:04:36Z) - A One-shot Framework for Distributed Clustered Learning in Heterogeneous
Environments [54.172993875654015]
The paper proposes a family of communication efficient methods for distributed learning in heterogeneous environments.
A one-shot approach, based on local computations at the users and a clustering-based aggregation step at the server, is shown to provide strong learning guarantees.
For strongly convex problems it is shown that, as long as the number of data points per user is above a threshold, the proposed approach achieves order-optimal mean-squared error rates in terms of the sample size.
arXiv Detail & Related papers (2022-09-22T09:04:10Z) - Bregman Power k-Means for Clustering Exponential Family Data [11.434503492579477]
We bridge algorithmic advances to classical work on hard clustering under Bregman divergences.
The elegant properties of Bregman divergences allow us to maintain closed form updates in a simple and transparent algorithm.
We consider thorough empirical analyses on simulated experiments and a case study on rainfall data, finding that the proposed method outperforms existing peer methods in a variety of non-Gaussian data settings.
arXiv Detail & Related papers (2022-06-22T06:09:54Z) - A Unified Framework for Multi-distribution Density Ratio Estimation [101.67420298343512]
Binary density ratio estimation (DRE) provides the foundation for many state-of-the-art machine learning algorithms.
We develop a general framework from the perspective of Bregman divergence minimization.
We show that our framework leads to methods that strictly generalize their counterparts in binary DRE.
arXiv Detail & Related papers (2021-12-07T01:23:20Z) - Improved Estimation of Concentration Under $\ell_p$-Norm Distance
Metrics Using Half Spaces [14.947511752748005]
Concentration of measure has been argued to be the fundamental cause of adversarial vulnerability.
We propose a method to estimate the concentration of any empirical dataset under $\ell_p$-norm distance metrics.
Our proposed algorithm is more efficient than Mahloujifar et al.'s, and our experiments on synthetic datasets and image benchmarks demonstrate that it is able to find much tighter intrinsic robustness bounds.
arXiv Detail & Related papers (2021-03-24T01:16:28Z) - Generalization Bounds in the Presence of Outliers: a Median-of-Means
Study [8.905677748354364]
The Median-of-Means (MoM) is an estimator of the mean $\theta$ of a square-integrable r.v. $Z$.
Thanks to the high confidence it achieves on heavy-tailed data, MoM has found various applications in machine learning.
A new line of work is now trying to characterize and leverage MoM's ability to deal with corrupted data (a minimal sketch of the basic estimator appears after this list).
arXiv Detail & Related papers (2020-06-09T13:21:39Z) - Distributional Robustness and Regularization in Reinforcement Learning [62.23012916708608]
We introduce a new regularizer for empirical value functions and show that it lower bounds the Wasserstein distributionally robust value function.
It suggests using regularization as a practical tool for dealing with external uncertainty in reinforcement learning.
arXiv Detail & Related papers (2020-03-05T19:56:23Z) - Entropy Regularized Power k-Means Clustering [21.013169939337583]
We propose a scalable majorization-minimization algorithm that enjoys closed-form updates and convergence guarantees.
Our method retains the same computational complexity as $k$-means and power $k$-means, but yields significant improvements over both.
arXiv Detail & Related papers (2020-01-10T14:05:44Z)
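As a companion to the Median-of-Means entry above, here is a minimal sketch of the basic MoM estimator of a mean: split the sample into blocks, average within each block, and return the median of the block means. The block count and the heavy-tailed example data are arbitrary illustrative choices, not taken from any of the papers listed.

```python
import numpy as np

def median_of_means(z, n_blocks=10, rng=None):
    """Median-of-Means estimate of E[Z] from an i.i.d. sample z.

    The sample is split into n_blocks roughly equal blocks; the estimate is
    the median of the block-wise sample means.
    """
    rng = np.random.default_rng() if rng is None else rng
    z = np.asarray(z, dtype=float)
    blocks = np.array_split(rng.permutation(len(z)), n_blocks)
    return np.median([z[b].mean() for b in blocks])

# Heavy-tailed sample (true mean 0) contaminated by a few gross outliers
rng = np.random.default_rng(0)
sample = np.concatenate([rng.standard_t(df=3, size=2000), [1e4, 1e4, -5e3]])
print(sample.mean())                        # dragged toward the outliers
print(median_of_means(sample, 20, rng))     # typically remains near 0
```

Because at most a few blocks can be spoiled by outliers, the median of the block means stays close to the true mean even when the plain sample mean does not.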