A general theory for robust clustering via trimmed mean
- URL: http://arxiv.org/abs/2401.05574v2
- Date: Fri, 2 Feb 2024 19:58:30 GMT
- Title: A general theory for robust clustering via trimmed mean
- Authors: Soham Jana, Jianqing Fan, Sanjeev Kulkarni
- Abstract summary: We introduce a hybrid clustering technique with a novel trimmed mean type centroid estimate to produce mislabeling guarantees.
Our results reduce to the sub-Gaussian case when errors follow sub-Gaussian distributions.
We show that these initial centroid estimates are sufficiently good for the subsequent clustering algorithm to achieve the optimal mislabeling rates.
- Score: 7.650319416775203
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Clustering is a fundamental tool in statistical machine learning in the
presence of heterogeneous data. Many recent results focus primarily on optimal
mislabeling guarantees, when data are distributed around centroids with
sub-Gaussian errors. Yet, the restrictive sub-Gaussian model is often invalid
in practice, since various real-world applications exhibit heavy tail
distributions around the centroids or suffer from possible adversarial attacks
that call for robust clustering with a robust data-driven initialization. In
this paper, we introduce a hybrid clustering technique with a novel
multivariate trimmed mean type centroid estimate to produce mislabeling
guarantees under a weak initialization condition for general error
distributions around the centroids. A matching lower bound is derived, up to
factors depending on the number of clusters. In addition, our approach
produces the optimal mislabeling rate even in the presence of adversarial
outliers.
Our results reduce to the sub-Gaussian case when errors follow sub-Gaussian
distributions. To solve the problem thoroughly, we also present novel
data-driven robust initialization techniques and show that, with probabilities
approaching one, these initial centroid estimates are sufficiently good for the
subsequent clustering algorithm to achieve the optimal mislabeling rates.
Furthermore, we demonstrate that the Lloyd algorithm is suboptimal for more
than two clusters even when errors are Gaussian, and for two clusters when
error distributions have heavy tails. Both simulated and real data examples
lend further support to our robust initialization procedure and clustering
algorithm.
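A minimal sketch of the overall recipe (ours, not the authors' exact algorithm): Lloyd-style iterations in which each centroid is re-estimated with a coordinatewise trimmed mean, so heavy-tailed errors and a bounded number of adversarial outliers cannot drag the centroids far. The trimming level `alpha`, the naive random initialization, and the function name `trimmed_mean_lloyd` are illustrative assumptions; the paper instead develops data-driven robust initialization with guarantees.

```python
# Hypothetical sketch: Lloyd-style clustering with trimmed-mean centroid
# updates. `alpha` and the random start are our illustrative choices, not
# the paper's estimator or its robust initialization procedure.
import numpy as np
from scipy.stats import trim_mean

def trimmed_mean_lloyd(X, k, alpha=0.1, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    # Naive start: k distinct data points chosen uniformly at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: a coordinatewise trimmed mean discards the alpha
        # fraction of extreme values at each end, limiting the influence
        # of heavy tails and outliers on the centroid estimate.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = trim_mean(members, proportiontocut=alpha, axis=0)
    return labels, centroids
```

With `alpha = 0` the update reduces to the ordinary Lloyd mean step, which, as the abstract notes, is suboptimal under heavy-tailed errors.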
Related papers
- Fuzzy K-Means Clustering without Cluster Centroids [79.19713746387337]
Fuzzy K-Means clustering is a critical technique in unsupervised data analysis.
This paper proposes a novel Fuzzy K-Means clustering algorithm that entirely eliminates the reliance on cluster centroids.
arXiv Detail & Related papers (2024-04-07T12:25:03Z)
- Near-Optimal Resilient Aggregation Rules for Distributed Learning Using 1-Center and 1-Mean Clustering with Outliers [24.88026399458157]
Byzantine machine learning has garnered considerable attention in light of the unpredictable faults that can occur.
The key to securing distributed learning systems is resilient aggregation mechanisms.
arXiv Detail & Related papers (2023-12-20T08:36:55Z)
- Robust and Automatic Data Clustering: Dirichlet Process meets Median-of-Means [18.3248037914529]
We present an efficient and automatic clustering technique by integrating the principles of model-based and centroid-based methodologies.
Statistical guarantees on the upper bound of clustering error suggest the advantages of our proposed method over existing state-of-the-art clustering algorithms.
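As a rough illustration of the robustness mechanism named in the title, here is a generic median-of-means (MoM) mean estimator; the bucketing scheme and `n_buckets` are our assumptions, and the paper's Dirichlet-process component is not reproduced.

```python
# Generic median-of-means sketch, not the paper's full algorithm: split
# the sample into buckets, average within each bucket, then take the
# coordinatewise median of the bucket means.
import numpy as np

def median_of_means(X, n_buckets=10, seed=0):
    rng = np.random.default_rng(seed)
    buckets = np.array_split(rng.permutation(len(X)), n_buckets)
    bucket_means = np.stack([X[b].mean(axis=0) for b in buckets])
    # A few corrupted buckets can shift some bucket means arbitrarily,
    # but the median across buckets stays close to the true mean.
    return np.median(bucket_means, axis=0)
```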
arXiv Detail & Related papers (2023-11-26T19:01:15Z)
- Instance-Optimal Cluster Recovery in the Labeled Stochastic Block Model [79.46465138631592]
We devise an efficient algorithm that recovers clusters using the observed labels.
We present Instance-Adaptive Clustering (IAC), the first algorithm whose performance matches the instance-specific lower bounds both in expectation and with high probability.
arXiv Detail & Related papers (2023-06-18T08:46:06Z)
- Compound Batch Normalization for Long-tailed Image Classification [77.42829178064807]
We propose a compound batch normalization method based on a Gaussian mixture.
It can model the feature space more comprehensively and reduce the dominance of head classes.
The proposed method outperforms existing methods on long-tailed image classification.
arXiv Detail & Related papers (2022-12-02T07:31:39Z)
- Likelihood Adjusted Semidefinite Programs for Clustering Heterogeneous Data [16.153709556346417]
Clustering is a widely deployed learning tool.
iLA-SDP is less sensitive than EM to initialization and more stable on high-dimensional data.
arXiv Detail & Related papers (2022-09-29T21:03:13Z)
- Clustering by the Probability Distributions from Extreme Value Theory [32.496691290725764]
This paper generalizes k-means to model the distribution of clusters.
We use GPD to establish a probability model for each cluster.
We also introduce a naive baseline, dubbed Generalized Extreme Value (GEV) k-means.
Notably, GEV k-means can also estimate the cluster structure and thus performs reasonably well relative to classical k-means.
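A loose sketch of one extreme-value-theoretic assignment step consistent with this summary (our reading, not necessarily the paper's exact procedure): fit a Generalized Pareto Distribution (GPD) to each cluster's centroid distances and reassign points by likelihood.

```python
# Hypothetical GPD-based reassignment step: model each cluster's centroid
# distances with a Generalized Pareto Distribution and let each point join
# the cluster under which its distance is most likely.
import numpy as np
from scipy.stats import genpareto

def gpd_reassign(X, centroids, labels):
    # Fit one GPD per cluster on the distances of its current members,
    # fixing the location parameter at zero.
    params = [
        genpareto.fit(np.linalg.norm(X[labels == j] - c, axis=1), floc=0.0)
        for j, c in enumerate(centroids)
    ]
    # Log-likelihood of every point's distance under every cluster's GPD.
    ll = np.stack([
        genpareto.logpdf(np.linalg.norm(X - c, axis=1), *p)
        for c, p in zip(centroids, params)
    ], axis=1)
    return ll.argmax(axis=1)  # maximum-likelihood cluster per point
```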
arXiv Detail & Related papers (2022-02-20T10:52:43Z)
- Anomaly Clustering: Grouping Images into Coherent Clusters of Anomaly Types [60.45942774425782]
We introduce anomaly clustering, whose goal is to group data into coherent clusters of anomaly types.
This is different from anomaly detection, whose goal is to divide anomalies from normal data.
We present a simple yet effective clustering framework using patch-based pretrained deep embeddings and off-the-shelf clustering methods.
arXiv Detail & Related papers (2021-12-21T23:11:33Z)
- Decorrelated Clustering with Data Selection Bias [55.91842043124102]
We propose a novel Decorrelation regularized K-Means algorithm (DCKM) for clustering with data selection bias.
Our DCKM algorithm achieves significant performance gains, indicating the necessity of removing unexpected feature correlations induced by selection bias.
arXiv Detail & Related papers (2020-06-29T08:55:50Z)
- Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z)
- Robust M-Estimation Based Bayesian Cluster Enumeration for Real Elliptically Symmetric Distributions [5.137336092866906]
Robustly determining the optimal number of clusters in a data set is an essential factor in a wide range of applications.
This article generalizes the Bayesian cluster enumeration criterion so that it can be used with any arbitrary Real Elliptically Symmetric (RES) distributed mixture model.
We derive a robust criterion for data sets with finite sample size, and also provide an approximation to reduce the computational cost at large sample sizes.
arXiv Detail & Related papers (2020-05-04T11:44:49Z)