Wasserstein $K$-means for clustering probability distributions
- URL: http://arxiv.org/abs/2209.06975v1
- Date: Wed, 14 Sep 2022 23:43:16 GMT
- Title: Wasserstein $K$-means for clustering probability distributions
- Authors: Yubo Zhuang, Xiaohui Chen, Yun Yang
- Abstract summary: In the Euclidean space, centroid-based and distance-based formulations of the $K$-means are equivalent.
In modern machine learning applications, data often arise as probability distributions and a natural generalization to handle measure-valued data is to use the optimal transport metric.
We show that the SDP-relaxed Wasserstein $K$-means can achieve exact recovery provided the clusters are well-separated under the $2$-Wasserstein metric.
- Score: 16.153709556346417
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Clustering is an important exploratory data analysis technique to group
objects based on their similarity. The widely used $K$-means clustering method
relies on some notion of distance to partition data into a small number of
groups. In the Euclidean space, centroid-based and distance-based formulations
of the $K$-means are equivalent. In modern machine learning applications, data
often arise as probability distributions and a natural generalization to handle
measure-valued data is to use the optimal transport metric. Due to the non-negative
Alexandrov curvature of the Wasserstein space, barycenters suffer from
regularity and non-robustness issues. The peculiar behaviors of Wasserstein
barycenters may make the centroid-based formulation fail to represent the
within-cluster data points, while the more direct distance-based $K$-means
approach and its semidefinite program (SDP) relaxation are capable of
recovering the true cluster labels. In the special case of clustering Gaussian
distributions, we show that the SDP relaxed Wasserstein $K$-means can achieve
exact recovery provided the clusters are well-separated under the $2$-Wasserstein
metric. Our simulation and real data examples also demonstrate that
distance-based $K$-means can achieve better classification performance than the
standard centroid-based $K$-means for clustering probability distributions and
images.
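For Gaussian distributions the squared $2$-Wasserstein distance has the closed form $W_2^2(\mu_1,\mu_2)=\|m_1-m_2\|^2+\mathrm{tr}\big(\Sigma_1+\Sigma_2-2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}\big)$, and distance-based $K$-means only needs the resulting pairwise distance matrix. The following is a minimal sketch of that pipeline, not the authors' implementation: it assumes the distributions are given as (mean, covariance) pairs, uses the standard Peng-Wei-type SDP relaxation of the $K$-means objective solved with `cvxpy`, and rounds the relaxed solution by clustering its rows; all function names are illustrative.

```python
# Minimal sketch (not the paper's code): distance-based Wasserstein K-means
# for Gaussian distributions via an SDP relaxation of the K-means objective.
import numpy as np
from scipy.linalg import sqrtm
import cvxpy as cp
from sklearn.cluster import KMeans  # used only to round the relaxed solution

def w2_gaussian(m1, S1, m2, S2):
    """Squared 2-Wasserstein distance between N(m1, S1) and N(m2, S2):
    ||m1 - m2||^2 + tr(S1 + S2 - 2 (S1^{1/2} S2 S1^{1/2})^{1/2})."""
    rS1 = np.real(sqrtm(S1))
    cross = np.real(sqrtm(rS1 @ S2 @ rS1))
    return float(np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2 * cross))

def sdp_wasserstein_kmeans(means, covs, K):
    """Distance-based K-means on the pairwise squared W2 matrix, relaxed to
    the SDP: min <D, Z> s.t. Z PSD, Z >= 0 entrywise, Z 1 = 1, tr(Z) = K."""
    n = len(means)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = w2_gaussian(means[i], covs[i], means[j], covs[j])

    Z = cp.Variable((n, n), symmetric=True)
    constraints = [Z >> 0, Z >= 0, cp.sum(Z, axis=1) == 1, cp.trace(Z) == K]
    cp.Problem(cp.Minimize(cp.trace(D @ Z)), constraints).solve()

    # Round the relaxed membership matrix Z by clustering its rows.
    return KMeans(n_clusters=K, n_init=10).fit_predict(Z.value)
```

The SDP has an $n \times n$ matrix variable, so precomputing the distance matrix once and keeping $n$ moderate is the practical constraint; the rounding step above is one of several reasonable choices, not the paper's prescribed one.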
Related papers
- Self-Supervised Graph Embedding Clustering [70.36328717683297]
The K-means one-step dimensionality reduction clustering method has made some progress in addressing the curse of dimensionality in clustering tasks.
We propose a unified framework that integrates manifold learning with K-means, resulting in the self-supervised graph embedding framework.
arXiv Detail & Related papers (2024-09-24T08:59:51Z)
- A Unified Framework for Gradient-based Clustering of Distributed Data [51.904327888475606]
We develop a family of distributed clustering algorithms that work over networks of users.
DGC-$\mathcal{F}_\rho$ is specialized to popular clustering losses like $K$-means and Huber loss.
We show that consensus fixed points of DGC-$\mathcal{F}_\rho$ are equivalent to fixed points of gradient clustering over the full data.
arXiv Detail & Related papers (2024-02-02T10:44:42Z)
- Spectral Clustering for Discrete Distributions [22.450518079181542]
Traditionally, clustering of discrete distributions (D2C) has been approached using Wasserstein barycenter methods.
We show that spectral clustering combined with distribution affinity measures can be more accurate and efficient than barycenter methods; a minimal sketch of this pipeline appears after this entry.
We provide theoretical guarantees for the success of our methods in clustering distributions.
arXiv Detail & Related papers (2024-01-25T03:17:03Z)
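As a concrete illustration of the spectral approach in the entry above (a sketch under assumptions, not the paper's method), one can build an affinity matrix from pairwise distribution distances and feed it to off-the-shelf spectral clustering; the 1-D Wasserstein distance, Gaussian kernel, bandwidth, and function name below are illustrative choices.

```python
# Minimal sketch: spectral clustering of 1-D empirical distributions using
# a Gaussian kernel on pairwise Wasserstein distances (illustrative only).
import numpy as np
from scipy.stats import wasserstein_distance
from sklearn.cluster import SpectralClustering

def cluster_distributions(samples, K, bandwidth=1.0):
    """samples: list of 1-D arrays, each an empirical distribution."""
    n = len(samples)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            W[i, j] = W[j, i] = wasserstein_distance(samples[i], samples[j])
    affinity = np.exp(-(W ** 2) / (2 * bandwidth ** 2))  # distance -> similarity
    model = SpectralClustering(n_clusters=K, affinity="precomputed")
    return model.fit_predict(affinity)
```

Unlike barycenter-based D2 clustering, this pipeline only needs pairwise distances, which is the efficiency argument made in the entry above.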
- Rethinking k-means from manifold learning perspective [122.38667613245151]
We present a new clustering algorithm that directly detects clusters of data without mean estimation.
Specifically, we construct a distance matrix between data points using a Butterworth filter.
To fully exploit the complementary information embedded in different views, we leverage tensor Schatten $p$-norm regularization.
arXiv Detail & Related papers (2023-05-12T03:01:41Z)
- A One-shot Framework for Distributed Clustered Learning in Heterogeneous Environments [54.172993875654015]
The paper proposes a family of communication-efficient methods for distributed learning in heterogeneous environments.
A one-shot approach, based on local computations at the users and a clustering-based aggregation step at the server, is shown to provide strong learning guarantees.
For strongly convex problems it is shown that, as long as the number of data points per user is above a threshold, the proposed approach achieves order-optimal mean-squared error rates in terms of the sample size.
arXiv Detail & Related papers (2022-09-22T09:04:10Z)
- Clustering by the Probability Distributions from Extreme Value Theory [32.496691290725764]
This paper generalizes k-means to model the distribution of clusters.
We use the generalized Pareto distribution (GPD) to establish a probability model for each cluster.
We also introduce a naive baseline, dubbed Generalized Extreme Value (GEV) k-means.
Notably, GEV k-means can also estimate cluster structure and thus performs reasonably well compared with classical k-means.
arXiv Detail & Related papers (2022-02-20T10:52:43Z)
- Distributed k-Means with Outliers in General Metrics [0.6117371161379208]
We present a distributed coreset-based 3-round approximation algorithm for k-means with $z$ outliers for general metric spaces.
An important feature of our algorithm is that it obliviously adapts to the intrinsic complexity of the dataset, captured by the doubling dimension $D$ of the metric space.
arXiv Detail & Related papers (2022-02-16T16:24:31Z) - Differentially-Private Clustering of Easy Instances [67.04951703461657]
In differentially private clustering, the goal is to identify $k$ cluster centers without disclosing information on individual data points.
We provide implementable differentially private clustering algorithms that offer utility when the data is "easy".
We propose a framework that allows us to apply non-private clustering algorithms to the easy instances and privately combine the results.
arXiv Detail & Related papers (2021-12-29T08:13:56Z) - Sum-of-norms clustering does not separate nearby balls [49.1574468325115]
We study a continuous version of sum-of-norms clustering, in which the dataset is replaced by a general measure.
We state and prove a local-global characterization of the clustering that seems to be new even in the case of discrete datapoints.
arXiv Detail & Related papers (2021-04-28T13:35:17Z) - K-expectiles clustering [0.0]
We propose a novel partitioning clustering algorithm based on expectiles.
We suggest two schemes: fixed $\tau$ clustering and adaptive $\tau$ clustering.
arXiv Detail & Related papers (2021-03-16T21:14:56Z) - Quantizing Multiple Sources to a Common Cluster Center: An Asymptotic
Analysis [14.048989759890475]
We consider quantizing an $Ld$-dimensional sample, which is obtained by concatenating $L$ vectors from datasets of $d$-dimensional vectors, to a $d$-dimensional cluster center.
We find a formula for the average performance distortion in the regime where the number of cluster centers is large.
In terms of faithfulness to the original (noiseless) dataset, our clustering approach outperforms the naive approach that relies on quantizing the $Ld$-dimensional noisy observation vectors to $Ld$-dimensional centers.
arXiv Detail & Related papers (2020-10-23T17:14:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.