Spectral Clustering for Discrete Distributions
- URL: http://arxiv.org/abs/2401.13913v2
- Date: Fri, 16 Aug 2024 06:00:05 GMT
- Title: Spectral Clustering for Discrete Distributions
- Authors: Zixiao Wang, Dong Qiao, Jicong Fan
- Abstract summary: Traditionally, clustering of discrete distributions (D2C) has been approached using Wasserstein barycenter methods.
We show that spectral clustering combined with distribution affinity measures can be more accurate and efficient than barycenter methods.
We provide theoretical guarantees for the success of our methods in clustering distributions.
- Score: 22.450518079181542
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The discrete distribution is often used to describe complex instances in machine learning, such as images, sequences, and documents. Traditionally, clustering of discrete distributions (D2C) has been approached using Wasserstein barycenter methods. These methods operate under the assumption that clusters can be well-represented by barycenters, which is seldom true in many real-world applications. Additionally, these methods are not scalable for large datasets due to the high computational cost of calculating Wasserstein barycenters. In this work, we explore the feasibility of using spectral clustering combined with distribution affinity measures (e.g., maximum mean discrepancy and Wasserstein distance) to cluster discrete distributions. We demonstrate that these methods can be more accurate and efficient than barycenter methods. To further enhance scalability, we propose using linear optimal transport to construct affinity matrices efficiently for large datasets. We provide theoretical guarantees for the success of our methods in clustering distributions. Experiments on both synthetic and real data show that our methods outperform existing baselines.
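To make the proposed pipeline concrete, below is a minimal sketch of D2C via spectral clustering on an MMD-based affinity matrix. The Gaussian kernels, bandwidths, and uniform weights on support points are illustrative assumptions rather than the paper's exact configuration; the paper's scalable variant would replace the pairwise computation with linear-optimal-transport embeddings.

```python
# Minimal sketch of D2C via spectral clustering on an MMD affinity matrix.
# Assumptions (not the paper's exact setup): each distribution is a uniform
# empirical measure over its support points; Gaussian RBF kernel; bandwidths
# gamma and sigma chosen by hand.
import numpy as np
from sklearn.cluster import SpectralClustering

def mmd2(X, Y, gamma=1.0):
    """Squared maximum mean discrepancy between the empirical distributions
    whose support points are the rows of X and Y (Gaussian RBF kernel)."""
    def k(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

def cluster_distributions(dists, n_clusters, gamma=1.0, sigma=1.0):
    """Spectral clustering of a list of sample arrays (one per distribution)."""
    n = len(dists)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = mmd2(dists[i], dists[j], gamma)
    affinity = np.exp(-D / (2.0 * sigma ** 2))  # discrepancy -> similarity
    model = SpectralClustering(n_clusters=n_clusters, affinity="precomputed")
    return model.fit_predict(affinity)

# Toy usage: two groups of 1-D distributions with different means.
rng = np.random.default_rng(0)
dists = ([rng.normal(0.0, 1.0, (50, 1)) for _ in range(5)]
         + [rng.normal(5.0, 1.0, (50, 1)) for _ in range(5)])
print(cluster_distributions(dists, n_clusters=2))
```

Because the affinity matrix is precomputed, any distribution distance (entropic Wasserstein, sliced Wasserstein, or a linear-optimal-transport approximation) can be swapped in for MMD without touching the clustering step.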
Related papers
- Estimating Barycenters of Distributions with Neural Optimal Transport [93.28746685008093]
We propose a new scalable approach for solving the Wasserstein barycenter problem.
Our methodology is based on the recent Neural OT solver.
We also establish theoretical error bounds for our proposed approach.
arXiv Detail & Related papers (2024-02-06T09:17:07Z)
- Distributed Markov Chain Monte Carlo Sampling based on the Alternating Direction Method of Multipliers [143.6249073384419]
In this paper, we propose a distributed sampling scheme based on the alternating direction method of multipliers.
We provide both theoretical guarantees of our algorithm's convergence and experimental evidence of its superiority to the state-of-the-art.
In simulation, we deploy our algorithm on linear and logistic regression tasks and illustrate its fast convergence compared to existing gradient-based methods.
arXiv Detail & Related papers (2024-01-29T02:08:40Z)
- Dataset Distillation via the Wasserstein Metric [35.32856617593164]
We introduce the Wasserstein distance, a metric grounded in optimal transport theory, to enhance distribution matching in dataset distillation.
Our method achieves new state-of-the-art performance across a range of high-resolution datasets.
arXiv Detail & Related papers (2023-11-30T13:15:28Z)
- DECWA: Density-Based Clustering using Wasserstein Distance [1.4132765964347058]
We propose a new clustering algorithm based on spatial density and a probabilistic approach.
We show that our approach outperforms other state-of-the-art density-based clustering methods on a wide variety of datasets.
arXiv Detail & Related papers (2023-10-25T11:10:08Z)
- Approximating a RUM from Distributions on k-Slates [88.32814292632675]
We give a polynomial-time algorithm that finds the RUM that best approximates the given distribution on average.
Our theoretical result can also be made practical: we obtain a method that is effective and scales to real-world datasets.
arXiv Detail & Related papers (2023-05-22T17:43:34Z)
- Wasserstein $K$-means for clustering probability distributions [16.153709556346417]
In the Euclidean space, centroid-based and distance-based formulations of the $K$-means are equivalent.
In modern machine learning applications, data often arise as probability distributions and a natural generalization to handle measure-valued data is to use the optimal transport metric.
We show that the SDP-relaxed Wasserstein $K$-means can achieve exact recovery provided the clusters are well-separated under the $2$-Wasserstein metric.
arXiv Detail & Related papers (2022-09-14T23:43:16Z)
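As a hedged illustration of the distance-based formulation in the entry above, the following sketch runs a K-medoids-style loop on pairwise 2-Wasserstein distances between one-dimensional empirical distributions. The paper analyzes an SDP relaxation rather than this greedy procedure; all names and parameter choices here are illustrative.

```python
# Sketch of distance-based Wasserstein K-means (K-medoids style); the SDP
# relaxation analyzed in the paper is replaced by a simple greedy loop.
import numpy as np

def w2_1d(x, y):
    # 2-Wasserstein distance between 1-D empirical distributions with the
    # same number of samples: compare sorted order statistics.
    return np.sqrt(np.mean((np.sort(x) - np.sort(y)) ** 2))

def wasserstein_k_medoids(samples, k, n_iter=20, seed=0):
    n = len(samples)
    D = np.array([[w2_1d(a, b) for b in samples] for a in samples])
    rng = np.random.default_rng(seed)
    medoids = rng.choice(n, size=k, replace=False)
    labels = D[:, medoids].argmin(axis=1)
    for _ in range(n_iter):
        # Re-center: each cluster's medoid minimizes total within-cluster distance.
        for c in range(k):
            idx = np.where(labels == c)[0]
            if idx.size:
                medoids[c] = idx[D[np.ix_(idx, idx)].sum(axis=1).argmin()]
        labels = D[:, medoids].argmin(axis=1)
    return labels

# Toy usage: two groups of 1-D empirical distributions.
rng = np.random.default_rng(1)
samples = ([rng.normal(0.0, 1.0, 100) for _ in range(6)]
           + [rng.normal(4.0, 1.0, 100) for _ in range(6)])
print(wasserstein_k_medoids(samples, k=2))
```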
- Density Ratio Estimation via Infinitesimal Classification [85.08255198145304]
We propose DRE-infty, a divide-and-conquer approach that reduces density ratio estimation (DRE) to a series of easier subproblems.
Inspired by Monte Carlo methods, we smoothly interpolate between the two distributions via an infinite continuum of intermediate bridge distributions.
We show that our approach performs well on downstream tasks such as mutual information estimation and energy-based modeling on complex, high-dimensional datasets.
arXiv Detail & Related papers (2021-11-22T06:26:29Z)
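The divide-and-conquer idea in the entry above can be summarized by a telescoping identity over a chain of bridge distributions $p = p_0, p_1, \dots, p_K = q$; the sketch below states the finite-$K$ version, with the infinitesimal limit $K \to \infty$ being the paper's contribution:

```latex
% Telescoping a hard density ratio into easier consecutive ratios:
\log \frac{p(x)}{q(x)}
  \;=\; \sum_{k=0}^{K-1} \log \frac{p_k(x)}{p_{k+1}(x)},
\qquad p_0 = p, \quad p_K = q.
```

Each consecutive pair $(p_k, p_{k+1})$ is close by construction, so each term can be estimated by an easier classification problem than the original ratio.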
- Density-Based Clustering with Kernel Diffusion [59.4179549482505]
A naive density corresponding to the indicator function of a unit $d$-dimensional Euclidean ball is commonly used in density-based clustering algorithms.
We propose a new kernel diffusion density function, which is adaptive to data of varying local distributional characteristics and smoothness.
arXiv Detail & Related papers (2021-10-11T09:00:33Z)
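For reference, the "naive density" mentioned in the kernel-diffusion entry above is the kernel estimate whose kernel is the indicator of a Euclidean ball; writing $V_d$ for the volume of the unit $d$-dimensional ball and $h$ for the radius (bandwidth), a standard form is:

```latex
\hat{f}(x) \;=\; \frac{1}{n\, V_d\, h^{d}} \sum_{i=1}^{n}
  \mathbf{1}\!\left\{ \lVert x - x_i \rVert \le h \right\},
\qquad V_d = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)}.
```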
- Clustering Large Data Sets with Incremental Estimation of Low-density Separating Hyperplanes [16.3460693863947]
An efficient method for obtaining low-density hyperplane separators in the unsupervised context is proposed.
Experiments with the proposed approach show that it is highly competitive in terms of both speed and accuracy when compared with relevant benchmarks.
arXiv Detail & Related papers (2021-08-07T12:45:05Z)
- Continuous Regularized Wasserstein Barycenters [51.620781112674024]
We introduce a new dual formulation for the regularized Wasserstein barycenter problem.
We establish strong duality and use the corresponding primal-dual relationship to parametrize the barycenter implicitly using the dual potentials of regularized transport problems.
arXiv Detail & Related papers (2020-08-28T08:28:06Z)
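For orientation, the regularized barycenter problem that the dual formulation above targets can be sketched as follows; the weights $\lambda_k$ and the KL-based entropic regularizer with strength $\gamma$ are generic placeholders rather than the paper's exact setup:

```latex
% Weighted regularized Wasserstein barycenter of measures \mu_1, ..., \mu_K:
\min_{\nu}\; \sum_{k=1}^{K} \lambda_k\, W_{\gamma}(\mu_k, \nu),
\qquad
W_{\gamma}(\mu, \nu) \;=\; \min_{\pi \in \Pi(\mu,\nu)}
  \int c(x, y)\, d\pi(x, y) \;+\; \gamma\, \mathrm{KL}\!\left(\pi \,\middle\|\, \mu \otimes \nu\right).
```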
- Distributed Bayesian Matrix Decomposition for Big Data Mining and Clustering [13.491022200305824]
We propose a distributed matrix decomposition model for big data mining and clustering.
Specifically, we adopt three strategies to implement the distributed computation: 1) accelerated gradient descent, 2) the alternating direction method of multipliers (ADMM), and 3) statistical inference.
Our algorithms scale well to big data and achieve performance superior or comparable to that of other distributed methods.
arXiv Detail & Related papers (2020-02-10T13:10:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.