Dirichlet Process-based Robust Clustering using the Median-of-Means Estimator
- URL: http://arxiv.org/abs/2311.15384v2
- Date: Wed, 29 Jan 2025 06:21:40 GMT
- Title: Dirichlet Process-based Robust Clustering using the Median-of-Means Estimator
- Authors: Supratik Basu, Jyotishka Ray Choudhury, Debolina Paul, Swagatam Das
- Abstract summary: We propose an efficient and automatic clustering technique by integrating the strengths of model-based and centroid-based methodologies.
Our method mitigates the effect of noise on the quality of clustering while, at the same time, estimating the number of clusters.
- Score: 16.774378814288806
- Abstract: Clustering stands as one of the most prominent challenges in unsupervised machine learning. Among centroid-based methods, the classic $k$-means algorithm, based on Lloyd's heuristic, is widely used. Nonetheless, it is well known that $k$-means and its variants face several challenges, including heavy reliance on initial cluster centroids, susceptibility to converging to local minima of the objective function, and sensitivity to outliers and noise in the data. When data contain noise or outliers, the Median-of-Means (MoM) estimator offers a robust alternative for stabilizing centroid-based methods. On a different note, another limitation of many commonly used clustering methods is the need to specify the number of clusters beforehand. Model-based approaches, such as Bayesian nonparametric models, address this issue by incorporating infinite mixture models, which eliminate the requirement for predefined cluster counts. Motivated by these facts, in this article we propose an efficient and automatic clustering technique that integrates the strengths of model-based and centroid-based methodologies. Our method mitigates the effect of noise on the quality of clustering while, at the same time, estimating the number of clusters. Statistical guarantees on an upper bound of the clustering error, together with rigorous assessments on simulated and real datasets, suggest the advantages of our proposed method over existing state-of-the-art clustering algorithms.
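The robustness ingredient named in the abstract is the Median-of-Means (MoM) estimator. Below is a minimal Python sketch of that estimator, assuming disjoint equal-sized blocks and a coordinate-wise median of block means; the function name `median_of_means` and the default `n_blocks` are illustrative choices, not the paper's exact construction.

```python
import numpy as np

def median_of_means(X, n_blocks=9, rng=None):
    """Median-of-Means estimate of the mean of the rows of X.

    Randomly partitions the rows into `n_blocks` disjoint blocks,
    computes each block's mean, and returns the coordinate-wise median
    of the block means. A small fraction of outliers can corrupt only
    a minority of blocks, so the median remains stable.
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    blocks = np.array_split(rng.permutation(len(X)), n_blocks)
    block_means = np.stack([X[b].mean(axis=0) for b in blocks])
    return np.median(block_means, axis=0)

# Toy comparison: 5% gross outliers drag the plain mean but barely
# move the MoM estimate (the true mean is (2, 2)).
rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, size=(500, 2))
X[:25] += 50.0
print("plain mean:", X.mean(axis=0))
print("MoM:       ", median_of_means(X, n_blocks=9, rng=1))
```

Replacing the per-cluster mean with such an estimator inside a Lloyd-type loop is the general route by which MoM stabilizes centroid updates against outliers.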
Related papers
- Self-Supervised Graph Embedding Clustering [70.36328717683297]
The K-means one-step dimensionality reduction clustering method has made some progress in addressing the curse of dimensionality in clustering tasks.
We propose a unified framework that integrates manifold learning with K-means, resulting in the self-supervised graph embedding framework.
arXiv Detail & Related papers (2024-09-24T08:59:51Z) - Fuzzy K-Means Clustering without Cluster Centroids [21.256564324236333]
Fuzzy K-Means clustering is a critical technique in unsupervised data analysis.
This paper proposes a novel Fuzzy K-Means clustering algorithm that entirely eliminates the reliance on cluster centroids.
arXiv Detail & Related papers (2024-04-07T12:25:03Z) - A provable initialization and robust clustering method for general mixture models [6.806940901668607]
Clustering is a fundamental tool in statistical machine learning in the presence of heterogeneous data.
Most recent results focus on optimal mislabeling guarantees when data are distributed around centroids with sub-Gaussian errors.
arXiv Detail & Related papers (2024-01-10T22:56:44Z) - Rethinking Clustering-Based Pseudo-Labeling for Unsupervised Meta-Learning [146.11600461034746]
CACTUs, a method for unsupervised meta-learning, is a clustering-based approach with pseudo-labeling.
This approach is model-agnostic and can be combined with supervised algorithms to learn from unlabeled data.
We prove that the core reason for this is the lack of a clustering-friendly property in the embedding space.
arXiv Detail & Related papers (2022-09-27T19:04:36Z) - A One-shot Framework for Distributed Clustered Learning in Heterogeneous Environments [54.172993875654015]
The paper proposes a family of communication-efficient methods for distributed learning in heterogeneous environments.
A one-shot approach, based on local computations at the users and a clustering-based aggregation step at the server, is shown to provide strong learning guarantees.
For strongly convex problems it is shown that, as long as the number of data points per user is above a threshold, the proposed approach achieves order-optimal mean-squared error rates in terms of the sample size.
arXiv Detail & Related papers (2022-09-22T09:04:10Z) - Gradient Based Clustering [72.15857783681658]
We propose a general approach for distance-based clustering, using the gradient of the cost function that measures clustering quality.
The approach is an iterative two-step procedure (alternating between cluster assignment and cluster center updates) and is applicable to a wide range of functions.
arXiv Detail & Related papers (2022-02-01T19:31:15Z) - Envelope Imbalance Learning Algorithm based on Multilayer Fuzzy C-means Clustering and Minimum Interlayer Discrepancy [14.339674126923903]
This paper proposes a deep instance envelope network-based imbalanced learning algorithm with multilayer fuzzy c-means (MlFCM) and a minimum interlayer discrepancy mechanism based on the maximum mean discrepancy (MIDMD).
This algorithm can guarantee high quality balanced instances using a deep instance envelope network in the absence of prior knowledge.
arXiv Detail & Related papers (2021-11-02T04:59:57Z) - Robust Trimmed k-means [70.88503833248159]
We propose Robust Trimmed k-means (RTKM) that simultaneously identifies outliers and clusters points.
We show RTKM performs competitively with other methods on single membership data with outliers and multi-membership data without outliers.
arXiv Detail & Related papers (2021-08-16T15:49:40Z) - A Deep Learning Object Detection Method for an Efficient Clusters Initialization [6.365889364810239]
Clustering has been used in numerous applications such as banking customer profiling, document retrieval, image segmentation, and e-commerce recommendation engines.
Existing clustering techniques present significant limitations, among which is the dependence of their stability on initialization parameters.
This paper proposes a solution that can provide near-optimal clustering parameters with low computational and resources overhead.
arXiv Detail & Related papers (2021-04-28T08:34:25Z) - Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed.
We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z) - K-bMOM: a robust Lloyd-type clustering algorithm based on bootstrap Median-of-Means [3.222802562733787]
We propose a new clustering algorithm that is robust to the presence of outliers in the dataset.
We build on the idea of median-of-means statistics to estimate the centroids, but allow for replacement while constructing the blocks (a simplified sketch of this idea appears after this list).
We prove its robustness to adversarial contamination by deriving robust rates of convergence for the K-means distortion.
arXiv Detail & Related papers (2020-02-10T16:08:08Z)
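As referenced in the K-bMOM entry above, here is a simplified Python sketch of the bootstrap Median-of-Means idea: one Lloyd-type update that draws blocks with replacement, keeps the block whose distortion is the median, and refreshes the centroids from that block alone. The function name `kbmom_step` and the parameters `n_blocks` and `block_size` are hypothetical, and details such as block sizing, tie-breaking, and empty-cluster handling differ from the actual algorithm.

```python
import numpy as np

def kbmom_step(X, centroids, n_blocks=11, block_size=120, rng=None):
    """One simplified Lloyd-type update with bootstrap Median-of-Means.

    Draws `n_blocks` blocks of rows *with replacement*, scores each block
    by its K-means distortion under the current centroids, keeps the block
    with the median score, and re-estimates the centroids from that block
    alone. Outlier-laden blocks receive extreme scores, so the median
    block tends to be clean.
    """
    rng = np.random.default_rng(rng)
    blocks = [rng.choice(len(X), size=block_size, replace=True)
              for _ in range(n_blocks)]

    def dists(pts):
        # Distances from each point to each centroid: shape (m, K).
        return np.linalg.norm(pts[:, None, :] - centroids[None, :, :], axis=2)

    # Pick the block whose distortion is the median of all block distortions.
    losses = [np.mean(dists(X[b]).min(axis=1) ** 2) for b in blocks]
    median_block = blocks[int(np.argsort(losses)[n_blocks // 2])]

    # Re-assign the median block's points and update the centroids from them.
    Xb = X[median_block]
    labels = dists(Xb).argmin(axis=1)
    new_centroids = centroids.copy()
    for k in range(len(centroids)):
        if np.any(labels == k):  # keep the old centroid if the cluster is empty
            new_centroids[k] = Xb[labels == k].mean(axis=0)
    return new_centroids
```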