Noise-robust Clustering
- URL: http://arxiv.org/abs/2110.08871v2
- Date: Tue, 19 Oct 2021 23:33:46 GMT
- Title: Noise-robust Clustering
- Authors: Rahmat Adesunkanmi, Ratnesh Kumar
- Abstract summary: This paper presents noise-robust clustering techniques in unsupervised machine learning.
The uncertainty about the noise, consistency, and other ambiguities can become severe obstacles in data analytics.
- Score: 2.0199917525888895
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents noise-robust clustering techniques in unsupervised
machine learning. The uncertainty about the noise, consistency, and other
ambiguities can become severe obstacles in data analytics. As a result, data
quality, cleansing, management, and governance remain critical disciplines when
working with Big Data. With this complexity, it is no longer sufficient to
treat data deterministically as in a classical setting, and it becomes
meaningful to account for noise distribution and its impact on data sample
values. Classical clustering methods group data into "similarity classes"
depending on their relative distances or similarities in the underlying space.
This paper addresses this problem via the extension of classical $K$-means and
$K$-medoids clustering over data distributions (rather than the raw data). This
involves measuring distances among distributions using two types of measures:
the optimal mass transport (also called Wasserstein distance, denoted $W_2$)
and a novel distance measure proposed in this paper, the expected value of
random variable distance (denoted ED). The presented distribution-based
$K$-means and $K$-medoids algorithms cluster the data distributions first and
then assign each raw data point to the cluster of its distribution.
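As a minimal sketch of this distribution-level clustering idea (not the paper's implementation): represent each distribution by an equal-size 1-D sample, compute pairwise $W_2$ distances with the 1-D closed form (the optimal coupling pairs sorted samples), and run plain K-medoids on the resulting distance matrix. The paper's ED measure is omitted, and the function names `w2_empirical` and `k_medoids` are illustrative.

```python
import numpy as np

def w2_empirical(x, y):
    """2-Wasserstein distance between two 1-D empirical distributions with
    equal sample sizes: in 1-D the optimal transport plan pairs sorted
    samples, so W2^2 is the mean squared gap between order statistics."""
    xs, ys = np.sort(x), np.sort(y)
    return float(np.sqrt(np.mean((xs - ys) ** 2)))

def k_medoids(dist, k, n_iter=100):
    """Plain K-medoids over a precomputed distance matrix, with a greedy
    farthest-point initialisation so the sketch is deterministic."""
    medoids = list(np.unravel_index(np.argmax(dist), dist.shape))
    while len(medoids) < k:
        medoids.append(int(np.argmax(dist[:, medoids].min(axis=1))))
    medoids = np.array(medoids[:k])
    for _ in range(n_iter):
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if members.size:
                # New medoid minimises total distance to its cluster members.
                costs = dist[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return labels, medoids

# Ten noisy sources: five around 0 and five around 10. Each source is a
# sample-based stand-in for a "data distribution".
rng = np.random.default_rng(1)
sources = [rng.normal(0.0, 1.0, 200) for _ in range(5)] + \
          [rng.normal(10.0, 1.0, 200) for _ in range(5)]
n = len(sources)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = w2_empirical(sources[i], sources[j])

# Cluster the distributions; every raw sample then inherits the cluster
# of the distribution it was drawn from.
labels, medoids = k_medoids(dist, k=2)
```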
Related papers
- A provable initialization and robust clustering method for general mixture models [6.806940901668607]
Clustering is a fundamental tool in statistical machine learning in the presence of heterogeneous data.
Most recent results focus on optimal mislabeling guarantees when data are distributed around centroids with sub-Gaussian errors.
arXiv Detail & Related papers (2024-01-10T22:56:44Z)
- Rethinking k-means from manifold learning perspective [122.38667613245151]
We present a new clustering algorithm which directly detects clusters of data without mean estimation.
Specifically, we construct distance matrix between data points by Butterworth filter.
To well exploit the complementary information embedded in different views, we leverage the tensor Schatten p-norm regularization.
arXiv Detail & Related papers (2023-05-12T03:01:41Z)
- Influence of Swarm Intelligence in Data Clustering Mechanisms [0.0]
Nature-inspired swarm-based algorithms are used for data clustering to cope with larger datasets that suffer from missing and inconsistent data.
This paper reviews the performance of these new approaches and compares which is best suited to particular problem situations.
arXiv Detail & Related papers (2023-05-07T08:40:50Z)
- Research on Efficient Fuzzy Clustering Method Based on Local Fuzzy Granular balls [67.33923111887933]
In this paper, the data are fuzzily iterated using granular-balls, and the membership degree of each data point considers only the two granular-balls in which it is located.
The formed fuzzy granular-balls set can use more processing methods in the face of different data scenarios.
arXiv Detail & Related papers (2023-03-07T01:52:55Z)
- Optimizing the Noise in Self-Supervised Learning: from Importance Sampling to Noise-Contrastive Estimation [80.07065346699005]
It is widely assumed that the optimal noise distribution should be made equal to the data distribution, as in Generative Adversarial Networks (GANs).
We turn to Noise-Contrastive Estimation which grounds this self-supervised task as an estimation problem of an energy-based model of the data.
We soberly conclude that the optimal noise may be hard to sample from, and the gain in efficiency can be modest compared to choosing the noise distribution equal to the data's.
arXiv Detail & Related papers (2023-01-23T19:57:58Z)
- Wasserstein $K$-means for clustering probability distributions [16.153709556346417]
In the Euclidean space, centroid-based and distance-based formulations of the $K$-means are equivalent.
In modern machine learning applications, data often arise as probability distributions and a natural generalization to handle measure-valued data is to use the optimal transport metric.
We show that the SDP-relaxed Wasserstein $K$-means can achieve exact recovery provided the clusters are well-separated under the $2$-Wasserstein metric.
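The $2$-Wasserstein metric named in this entry has a simple closed form for univariate Gaussians, which makes "well-separated clusters of distributions" concrete; the sketch below is an illustration of that metric, not code from the paper, and `w2_gaussian_1d` is an invented name.

```python
import numpy as np

def w2_gaussian_1d(m1, s1, m2, s2):
    """Closed-form 2-Wasserstein distance between univariate Gaussians
    N(m1, s1^2) and N(m2, s2^2): W2^2 = (m1 - m2)^2 + (s1 - s2)^2."""
    return float(np.sqrt((m1 - m2) ** 2 + (s1 - s2) ** 2))

# Two clusters of distributions: within-cluster W2 stays small while
# across-cluster W2 is large -- the kind of separation the recovery
# guarantee refers to.
within = w2_gaussian_1d(0.0, 1.0, 0.2, 1.1)   # ~0.224
across = w2_gaussian_1d(0.0, 1.0, 8.0, 1.0)   # 8.0
```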
arXiv Detail & Related papers (2022-09-14T23:43:16Z)
- No More Than 6ft Apart: Robust K-Means via Radius Upper Bounds [17.226362076527764]
Centroid-based clustering methods such as k-means, k-medoids and k-centers are heavily applied as go-to tools in exploratory data analysis.
In many cases, those methods are used to obtain representative centroids of the data manifold for visualization or summarization of a dataset.
We propose to remedy such a scenario by introducing a maximal radius constraint $r$ on the clusters formed by centroids.
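The maximal-radius idea in this entry can be sketched as a modified assignment step: a point joins its nearest centroid only if it lies within radius $r$, and is otherwise flagged as an outlier. This is a hedged illustration of the constraint, not the paper's algorithm, and `assign_with_radius` is an invented name.

```python
import numpy as np

def assign_with_radius(X, centroids, r):
    """Assign each point to its nearest centroid, but only if it lies
    within radius r; otherwise flag it as an outlier (label -1)."""
    # Pairwise Euclidean distances: (n_points, n_centroids).
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = np.argmin(d, axis=1)
    labels[d.min(axis=1) > r] = -1
    return labels

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [100.0, 100.0]])
C = np.array([[0.0, 0.0], [5.0, 5.0]])
labels = assign_with_radius(X, C, r=1.0)
# -> [0, 0, 1, -1]: the far point exceeds every radius and is left out.
```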
arXiv Detail & Related papers (2022-03-04T18:59:02Z)
- The Optimal Noise in Noise-Contrastive Learning Is Not What You Think [80.07065346699005]
We show that deviating from this assumption can actually lead to better statistical estimators.
In particular, the optimal noise distribution is different from the data's and even from a different family.
arXiv Detail & Related papers (2022-03-02T13:59:20Z)
- A Robust and Flexible EM Algorithm for Mixtures of Elliptical Distributions with Missing Data [71.9573352891936]
This paper tackles the problem of missing data imputation for noisy and non-Gaussian data.
A new EM algorithm is investigated for mixtures of elliptical distributions with the property of handling potential missing data.
Experimental results on synthetic data demonstrate that the proposed algorithm is robust to outliers and can be used with non-Gaussian data.
arXiv Detail & Related papers (2022-01-28T10:01:37Z)
- Differentially-Private Clustering of Easy Instances [67.04951703461657]
In differentially private clustering, the goal is to identify $k$ cluster centers without disclosing information on individual data points.
We provide implementable differentially private clustering algorithms that provide utility when the data is "easy".
We propose a framework that allows us to apply non-private clustering algorithms to the easy instances and privately combine the results.
arXiv Detail & Related papers (2021-12-29T08:13:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.