Semidefinite programming on population clustering: a global analysis
- URL: http://arxiv.org/abs/2301.00344v1
- Date: Sun, 1 Jan 2023 04:52:25 GMT
- Title: Semidefinite programming on population clustering: a global analysis
- Authors: Shuheng Zhou
- Abstract summary: We consider the problem of partitioning a small data sample of size $n$ drawn from a mixture of $2$ sub-gaussian distributions.
We are interested in the case that individual features are of low average quality $\gamma$, and we want to use as few of them as possible to correctly partition the sample.
- Score: 0.6472434306724609
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we consider the problem of partitioning a small data sample of
size $n$ drawn from a mixture of $2$ sub-gaussian distributions. Our work is
motivated by the application of clustering individuals according to their
population of origin using markers, when the divergence between the two
populations is small. We are interested in the case that individual features
are of low average quality $\gamma$, and we want to use as few of them as
possible to correctly partition the sample. We consider semidefinite relaxation
of an integer quadratic program which is formulated essentially as finding the
maximum cut on a graph where edge weights in the cut represent dissimilarity
scores between two nodes based on their features. A small simulation result in
Blum, Coja-Oghlan, Frieze and Zhou (2007, 2009) shows that even when the sample
size $n$ is small, by increasing $p$ so that $np= \Omega(1/\gamma^2)$, one can
classify a mixture of two product populations using the spectral method therein
with success rate reaching an ``oracle'' curve. There the ``oracle'' was
computed assuming that distributions were known, where success rate means the
ratio between correctly classified individuals and the sample size $n$. In this
work, we show the theoretical underpinning of this observed concentration of
measure phenomenon in high dimensions, simultaneously for the semidefinite
optimization goal and the spectral method, where the input is based on the Gram
matrix computed from centered data. We allow a full range of tradeoffs between
the sample size and the number of features such that the product of these two
is lower bounded by $1/{\gamma^2}$ so long as the number of features $p$ is
lower bounded by $1/\gamma$.
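As a rough illustration of the spectral method described above (a minimal sketch, not the paper's exact procedure; the instance sizes, noise model, and variable names below are my own assumptions), one can simulate two product populations with per-feature mean separation governed by $\gamma$, form the Gram matrix of the centered data, and partition by the signs of its leading eigenvector:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy instance: n individuals, p features, two product
# populations with per-feature means +/- gamma (low average feature
# quality). Chosen so that n*p >> 1/gamma^2 and p >> 1/gamma.
n, p, gamma = 60, 5000, 0.1
labels = rng.integers(0, 2, size=n)                    # true population of origin
means = np.where(labels[:, None] == 0, -gamma, gamma)  # +/- gamma per feature
X = means + rng.standard_normal((n, p))                # sub-Gaussian (Gaussian) noise

# Spectral method on the Gram matrix of centered data: the leading
# eigenvector of the n x n Gram matrix concentrates around the cluster
# membership vector, so its signs recover the partition.
Xc = X - X.mean(axis=0)                                # center each feature
G = Xc @ Xc.T                                          # Gram matrix
_, eigvecs = np.linalg.eigh(G)                         # eigenvalues ascending
pred = (eigvecs[:, -1] > 0).astype(int)                # partition by sign

# Success rate = fraction correctly classified, up to a global label swap.
rate = max(np.mean(pred == labels), np.mean(pred != labels))
```

With $np\gamma^2$ well above the constant threshold, the success rate approaches 1; shrinking $p$ or $\gamma$ degrades it toward the random-guessing rate of $1/2$.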
Related papers
- Dimension-free Private Mean Estimation for Anisotropic Distributions [55.86374912608193]
Previous private estimators on distributions over $\mathbb{R}^d$ suffer from a curse of dimensionality.
We present an algorithm whose sample complexity has improved dependence on dimension.
arXiv Detail & Related papers (2024-11-01T17:59:53Z) - Distribution of lowest eigenvalue in $k$-body bosonic random matrix ensembles [0.8999666725996978]
We numerically study the distribution of the lowest eigenvalue of finite many-boson systems with $k$-body interactions.
The first four moments of the distribution of lowest eigenvalues have been analyzed as a function of the $q$ parameter.
arXiv Detail & Related papers (2024-04-30T20:44:31Z) - Learning general Gaussian mixtures with efficient score matching [16.06356123715737]
We study the problem of learning mixtures of $k$ Gaussians in $d$ dimensions.
We make no separation assumptions on the underlying mixture components.
We give an algorithm that draws $d^{\mathrm{poly}(k/\varepsilon)}$ samples from the target mixture, runs in sample-polynomial time, and constructs a sampler.
arXiv Detail & Related papers (2024-04-29T17:30:36Z) - Minimax Optimality of Score-based Diffusion Models: Beyond the Density Lower Bound Assumptions [11.222970035173372]
We show that a kernel-based score estimator achieves an optimal mean square error of $\widetilde{O}\left(n^{-1} t^{-\frac{d+2}{2}} (t^{\frac{d}{2}} \vee 1)\right)$.
This yields an $\widetilde{O}(n^{-1/2} t^{-\frac{d}{4}})$ upper bound for the total variation error of the distribution of the sample generated by the diffusion model under a mere sub-Gaussian assumption.
arXiv Detail & Related papers (2024-02-23T20:51:31Z) - Debiasing and a local analysis for population clustering using
semidefinite programming [1.9761774213809036]
We consider the problem of partitioning a small data sample of size $n$ drawn from a mixture of $2$ sub-gaussian distributions.
This work is motivated by the application of clustering individuals according to their population of origin.
arXiv Detail & Related papers (2024-01-16T03:14:24Z) - Stochastic Approximation Approaches to Group Distributionally Robust Optimization and Beyond [89.72693227960274]
This paper investigates group distributionally robust optimization (GDRO) with the goal of learning a model that performs well over $m$ different distributions.
To reduce the number of samples required in each round from $m$ to 1, we cast GDRO as a two-player game, where one player performs mirror descent and the other executes an online algorithm for non-oblivious multi-armed bandits.
In the second scenario, we propose to optimize the average top-$k$ risk instead of the maximum risk, thereby mitigating the impact of outlier distributions.
arXiv Detail & Related papers (2023-02-18T09:24:15Z) - Causal Bandits for Linear Structural Equation Models [58.2875460517691]
This paper studies the problem of designing an optimal sequence of interventions in a causal graphical model.
It is assumed that the graph's structure is known and has $N$ nodes.
Two algorithms are proposed for the frequentist (UCB-based) and Bayesian settings.
arXiv Detail & Related papers (2022-08-26T16:21:31Z) - Distributed Sparse Regression via Penalization [5.990069843501885]
We study linear regression over a network of agents, modeled as an undirected graph with no centralized node.
The estimation problem is formulated as the minimization of the sum of the local LASSO loss functions plus a quadratic penalty of the consensus constraint.
We show that the proximal-gradient algorithm applied to the penalized problem converges linearly up to a tolerance of the order of the centralized statistical error.
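As a hedged sketch of this penalized formulation (the toy sizes, the path graph, and all parameter values below are my own assumptions, not taken from the paper), each agent keeps its own copy of the parameter, and proximal gradient descent is applied to the sum of local LASSO losses plus a quadratic consensus penalty built from the graph Laplacian:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical instance: m agents on a path graph, each holding local
# data (A_i, b_i) generated from a shared sparse signal theta_star.
m, n_i, d = 3, 40, 20
theta_star = np.zeros(d)
theta_star[:3] = [2.0, -1.5, 1.0]
A = [rng.standard_normal((n_i, d)) for _ in range(m)]
b = [A_i @ theta_star + 0.1 * rng.standard_normal(n_i) for A_i in A]

# Path-graph Laplacian; the consensus penalty is (rho/2) * sum of
# squared differences of neighboring agents' copies.
L = np.array([[1.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 1.0]])
lam, rho, step, iters = 0.1, 5.0, 1e-3, 3000

def soft(u, t):
    """Soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

# Proximal-gradient on: sum_i 0.5*||A_i x_i - b_i||^2 + lam*||x_i||_1
#                       + (rho/2) * consensus penalty.
X = np.zeros((m, d))  # one parameter copy per agent
for _ in range(iters):
    grad = np.stack([A[i].T @ (A[i] @ X[i] - b[i]) for i in range(m)])
    grad += rho * (L @ X)
    X = soft(X - step * grad, step * lam)

# Agents' copies approximately agree and recover the sparse signal.
disagreement = np.max(np.abs(X - X.mean(axis=0)))
err = np.linalg.norm(X.mean(axis=0) - theta_star)
```

Because the consensus constraint is only penalized (not enforced exactly), the copies agree up to a tolerance controlled by `rho`, mirroring the "converges linearly up to a tolerance" statement above.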
arXiv Detail & Related papers (2021-11-12T01:51:50Z) - Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and
Variance Reduction [63.41789556777387]
Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP)
We show that the number of samples needed to yield an entrywise $\varepsilon$-accurate estimate of the Q-function is at most on the order of $\frac{1}{\mu_{\min}(1-\gamma)^5\varepsilon^2} + \frac{t_{\mathrm{mix}}}{\mu_{\min}(1-\gamma)}$ up to some logarithmic factor.
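For context on what is being analyzed (a generic illustration only, not the paper's variance-reduced algorithm; the two-state MDP and all constants are hypothetical), asynchronous Q-learning updates a single (state, action) entry per step along one behavior trajectory:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical 2-state, 2-action MDP: P[s, a] is the next-state
# distribution, R[s, a] the deterministic reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0], [0.0, 1.0]])
discount, alpha, steps = 0.9, 0.1, 50_000

# Asynchronous Q-learning: follow a single trajectory under a uniform
# behavior policy and update only the entry actually visited.
Q = np.zeros((2, 2))
s = 0
for _ in range(steps):
    a = rng.integers(2)                     # uniform behavior policy
    s_next = rng.choice(2, p=P[s, a])       # environment transition
    target = R[s, a] + discount * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])   # update visited entry only
    s = s_next
```

The $\mu_{\min}$ and $t_{\mathrm{mix}}$ quantities in the bound above correspond to how often the least-visited entry is updated and how fast the behavior trajectory mixes.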
arXiv Detail & Related papers (2020-06-04T17:51:00Z) - Computationally efficient sparse clustering [67.95910835079825]
We provide a finite sample analysis of a new clustering algorithm based on PCA.
We show that it achieves the minimax optimal misclustering rate in the regime $\|\theta\| \to \infty$.
arXiv Detail & Related papers (2020-05-21T17:51:30Z) - Robustly Learning any Clusterable Mixture of Gaussians [55.41573600814391]
We study the efficient learnability of high-dimensional Gaussian mixtures in the adversarial-robust setting.
We provide an algorithm that learns the components of an $\epsilon$-corrupted $k$-mixture within information-theoretically near-optimal error of $\tilde{O}(\epsilon)$.
Our main technical contribution is a new robust identifiability proof of clusters from a Gaussian mixture, which can be captured by the constant-degree Sum of Squares proof system.
arXiv Detail & Related papers (2020-05-13T16:44:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.