A Distribution Testing Approach to Clustering Distributions
- URL: http://arxiv.org/abs/2512.08376v1
- Date: Tue, 09 Dec 2025 09:01:41 GMT
- Title: A Distribution Testing Approach to Clustering Distributions
- Authors: Gunjan Kumar, Yash Pote, Jonathan Scarlett,
- Abstract summary: Given a hidden partition of $k$ distributions into two groups, the goal is to recover the partition.<n>We establish upper and lower bounds on the sample complexity for two fundamental cases.<n>In particular, we achieve tightness with respect to $(n,k,r,varepsilon)$ (up to an $O(log k)$ factor) for all regimes.
- Score: 35.016184519329194
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the following distribution clustering problem: Given a hidden partition of $k$ distributions into two groups, such that the distributions within each group are the same, and the two distributions associated with the two clusters are $\varepsilon$-far in total variation, the goal is to recover the partition. We establish upper and lower bounds on the sample complexity for two fundamental cases: (1) when one of the cluster's distributions is known, and (2) when both are unknown. Our upper and lower bounds characterize the sample complexity's dependence on the domain size $n$, number of distributions $k$, size $r$ of one of the clusters, and distance $\varepsilon$. In particular, we achieve tightness with respect to $(n,k,r,\varepsilon)$ (up to an $O(\log k)$ factor) for all regimes.
Related papers
- Near-Optimal Clustering in Mixture of Markov Chains [74.3828414695655]
We study the problem of clustering $T$ trajectories of length $H$, each generated by one of $K$ unknown ergodic Markov chains over a finite state space of size $S$.<n>We derive an instance-dependent, high-probability lower bound on the clustering error rate, governed by the weighted KL divergence between the transition kernels of the chains.<n>We then present a novel two-stage clustering algorithm.
arXiv Detail & Related papers (2025-06-02T05:10:40Z) - Exponentially Consistent Nonparametric Linkage-Based Clustering of Data Sequences [4.2603120588176635]
We consider nonparametric clustering of $M$ independent and identically distributed (i.i.d.) data sequences generated from em unknown distributions.<n>The distributions of the $M$ data sequences belong to $K$ underlying distribution clusters.
arXiv Detail & Related papers (2024-11-21T08:18:04Z) - Clustering Mixtures of Bounded Covariance Distributions Under Optimal
Separation [44.25945344950543]
We study the clustering problem for mixtures of bounded covariance distributions.
We give the first poly-time algorithm for this clustering task.
Our algorithm is robust to $Omega(alpha)$-fraction of adversarial outliers.
arXiv Detail & Related papers (2023-12-19T01:01:53Z) - Testing with Non-identically Distributed Samples [19.421846478881363]
We examine the extent to which sublinear-sample property testing and estimation apply to settings where samples are independently but not identically distributed.<n>We show that a linear number of samples in $k$ is necessary given $c=1$ samples from each distribution.<n>We extend our techniques to the problem of testing "closeness" of two distributions.
arXiv Detail & Related papers (2023-11-19T01:25:50Z) - Replicable Clustering [57.19013971737493]
We propose algorithms for the statistical $k$-medians, statistical $k$-means, and statistical $k$-centers problems by utilizing approximation routines for their counterparts in a black-box manner.
We also provide experiments on synthetic distributions in 2D using the $k$-means++ implementation from sklearn as a black-box that validate our theoretical results.
arXiv Detail & Related papers (2023-02-20T23:29:43Z) - Stochastic Approximation Approaches to Group Distributionally Robust Optimization and Beyond [89.72693227960274]
This paper investigates group distributionally robust optimization (GDRO) with the goal of learning a model that performs well over $m$ different distributions.
To reduce the number of samples in each round from $m$ to 1, we cast GDRO as a two-player game, where one player conducts and the other executes an online algorithm for non-oblivious multi-armed bandits.
In the second scenario, we propose to optimize the average top-$k$ risk instead of the maximum risk, thereby mitigating the impact of distributions.
arXiv Detail & Related papers (2023-02-18T09:24:15Z) - Sum-of-norms clustering does not separate nearby balls [49.1574468325115]
We show a continuous version of sum-of-norms clustering, where the dataset is replaced by a general measure.
We state and prove a local-global characterization of the clustering that seems to be new even in the case of discrete datapoints.
arXiv Detail & Related papers (2021-04-28T13:35:17Z) - $k$-Variance: A Clustered Notion of Variance [23.57925128327]
We introduce $k$-variance, a generalization of variance built on the machinery of random bipartite matchings.
We provide in-depth analysis of this quantity in several key cases, including one-dimensional measures, clustered measures, and measures concentrated on low-dimensional subsets.
arXiv Detail & Related papers (2020-12-13T04:25:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.