Related papers: A Distribution Testing Approach to Clustering Distributions

A Distribution Testing Approach to Clustering Distributions

URL: http://arxiv.org/abs/2512.08376v1
Date: Tue, 09 Dec 2025 09:01:41 GMT
Title: A Distribution Testing Approach to Clustering Distributions
Authors: Gunjan Kumar, Yash Pote, Jonathan Scarlett,
Abstract summary: Given a hidden partition of $k$ distributions into two groups, the goal is to recover the partition.<n>We establish upper and lower bounds on the sample complexity for two fundamental cases.<n>In particular, we achieve tightness with respect to $(n,k,r,varepsilon)$ (up to an $O(log k)$ factor) for all regimes.
Score: 35.016184519329194
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We study the following distribution clustering problem: Given a hidden partition of $k$ distributions into two groups, such that the distributions within each group are the same, and the two distributions associated with the two clusters are $\varepsilon$-far in total variation, the goal is to recover the partition. We establish upper and lower bounds on the sample complexity for two fundamental cases: (1) when one of the cluster's distributions is known, and (2) when both are unknown. Our upper and lower bounds characterize the sample complexity's dependence on the domain size $n$, number of distributions $k$, size $r$ of one of the clusters, and distance $\varepsilon$. In particular, we achieve tightness with respect to $(n,k,r,\varepsilon)$ (up to an $O(\log k)$ factor) for all regimes.

Related papers

Near-Optimal Clustering in Mixture of Markov Chains [74.3828414695655]
We study the problem of clustering $T$ trajectories of length $H$, each generated by one of $K$ unknown ergodic Markov chains over a finite state space of size $S$.<n>We derive an instance-dependent, high-probability lower bound on the clustering error rate, governed by the weighted KL divergence between the transition kernels of the chains.<n>We then present a novel two-stage clustering algorithm.
arXiv Detail & Related papers (2025-06-02T05:10:40Z)
Exponentially Consistent Nonparametric Linkage-Based Clustering of Data Sequences [4.2603120588176635]
We consider nonparametric clustering of $M$ independent and identically distributed (i.i.d.) data sequences generated from em unknown distributions.<n>The distributions of the $M$ data sequences belong to $K$ underlying distribution clusters.
arXiv Detail & Related papers (2024-11-21T08:18:04Z)
Clustering Mixtures of Bounded Covariance Distributions Under Optimal Separation [44.25945344950543]
We study the clustering problem for mixtures of bounded covariance distributions. We give the first poly-time algorithm for this clustering task. Our algorithm is robust to $Omega(alpha)$-fraction of adversarial outliers.
arXiv Detail & Related papers (2023-12-19T01:01:53Z)
Testing with Non-identically Distributed Samples [19.421846478881363]
We examine the extent to which sublinear-sample property testing and estimation apply to settings where samples are independently but not identically distributed.<n>We show that a linear number of samples in $k$ is necessary given $c=1$ samples from each distribution.<n>We extend our techniques to the problem of testing "closeness" of two distributions.
arXiv Detail & Related papers (2023-11-19T01:25:50Z)
Replicable Clustering [57.19013971737493]
We propose algorithms for the statistical $k$-medians, statistical $k$-means, and statistical $k$-centers problems by utilizing approximation routines for their counterparts in a black-box manner. We also provide experiments on synthetic distributions in 2D using the $k$-means++ implementation from sklearn as a black-box that validate our theoretical results.
arXiv Detail & Related papers (2023-02-20T23:29:43Z)
Stochastic Approximation Approaches to Group Distributionally Robust Optimization and Beyond [89.72693227960274]
This paper investigates group distributionally robust optimization (GDRO) with the goal of learning a model that performs well over $m$ different distributions. To reduce the number of samples in each round from $m$ to 1, we cast GDRO as a two-player game, where one player conducts and the other executes an online algorithm for non-oblivious multi-armed bandits. In the second scenario, we propose to optimize the average top-$k$ risk instead of the maximum risk, thereby mitigating the impact of distributions.
arXiv Detail & Related papers (2023-02-18T09:24:15Z)
Sum-of-norms clustering does not separate nearby balls [49.1574468325115]
We show a continuous version of sum-of-norms clustering, where the dataset is replaced by a general measure. We state and prove a local-global characterization of the clustering that seems to be new even in the case of discrete datapoints.
arXiv Detail & Related papers (2021-04-28T13:35:17Z)
$k$-Variance: A Clustered Notion of Variance [23.57925128327]
We introduce $k$-variance, a generalization of variance built on the machinery of random bipartite matchings. We provide in-depth analysis of this quantity in several key cases, including one-dimensional measures, clustered measures, and measures concentrated on low-dimensional subsets.
arXiv Detail & Related papers (2020-12-13T04:25:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.