Local geometry of high-dimensional mixture models: Effective spectral theory and dynamical transitions
- URL: http://arxiv.org/abs/2502.15655v1
- Date: Fri, 21 Feb 2025 18:26:25 GMT
- Title: Local geometry of high-dimensional mixture models: Effective spectral theory and dynamical transitions
- Authors: Gerard Ben Arous, Reza Gheissari, Jiaoyang Huang, Aukosh Jagannath
- Abstract summary: We study the local geometry of empirical risks in high dimensions via the spectral theory of their Hessian and information matrices. We prove exact formulas for the limits of the empirical spectral distribution and outlier eigenvalues. We demonstrate our general results by analyzing the effective dynamics in the case of multi-class logistic regression.
- Score: 11.143337341980978
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the local geometry of empirical risks in high dimensions via the spectral theory of their Hessian and information matrices. We focus on settings where the data, $(Y_\ell)_{\ell =1}^n\in \mathbb R^d$, are i.i.d. draws of a $k$-component Gaussian mixture model, and the loss depends on the projection of the data into a fixed number of vectors, namely $\mathbf{x}^\top Y$, where $\mathbf{x}\in \mathbb{R}^{d\times C}$ are the parameters, and $C$ need not equal $k$. This setting captures a broad class of problems such as classification by one and two-layer networks and regression on multi-index models. We prove exact formulas for the limits of the empirical spectral distribution and outlier eigenvalues and eigenvectors of such matrices in the proportional asymptotics limit, where the number of samples and dimension $n,d\to\infty$ and $n/d=\phi \in (0,\infty)$. These limits depend on the parameters $\mathbf{x}$ only through the summary statistic of the $(C+k)\times (C+k)$ Gram matrix of the parameters and class means, $\mathbf{G} = (\mathbf{x},\mathbf{\mu})^\top(\mathbf{x},\mathbf{\mu})$. It is known that under general conditions, when $\mathbf{x}$ is trained by stochastic gradient descent, the evolution of these same summary statistics along training converges to the solution of an autonomous system of ODEs, called the effective dynamics. This enables us to connect the spectral theory to the training dynamics. We demonstrate our general results by analyzing the effective spectrum along the effective dynamics in the case of multi-class logistic regression. In this setting, the empirical Hessian and information matrices have substantially different spectra, each with their own static and even dynamical spectral transitions.
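To make the setting concrete, here is a minimal numerical sketch (not the authors' code): it samples a $k$-component Gaussian mixture, forms the empirical Hessian of the multi-class logistic risk at a random parameter $\mathbf{x}\in\mathbb{R}^{d\times C}$, and computes its spectrum together with the Gram summary statistic $\mathbf{G}$. The sizes $d$, $\phi$, and the unit-norm means are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sizes: ambient dimension d, sample ratio phi = n/d, k mixture components,
# C parameter columns (classes). Small values chosen for illustration.
d, phi, k, C = 200, 4.0, 2, 2
n = int(phi * d)

# Unit-norm class means and balanced labels; Y_ell ~ N(mu_{c_ell}, I_d).
mu = rng.standard_normal((d, k))
mu /= np.linalg.norm(mu, axis=0)
labels = rng.integers(0, k, size=n)
Y = mu[:, labels].T + rng.standard_normal((n, d))

# Parameters x in R^{d x C}; the risk only sees the projections Y x.
x = rng.standard_normal((d, C)) / np.sqrt(d)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Empirical Hessian of the multi-class logistic (cross-entropy) risk w.r.t.
# vec(x): per sample it is kron(diag(p) - p p^T, y y^T), averaged over n.
P = softmax(Y @ x)
H = np.zeros((d * C, d * C))
for ell in range(n):
    A = np.diag(P[ell]) - np.outer(P[ell], P[ell])
    H += np.kron(A, np.outer(Y[ell], Y[ell]))
H /= n

evals = np.linalg.eigvalsh(H)
print("top Hessian eigenvalues:", np.round(evals[-3:], 3))

# The (C+k) x (C+k) Gram summary statistic G = (x, mu)^T (x, mu): the limiting
# spectrum and outliers depend on x only through this matrix.
M = np.hstack([x, mu])
G = M.T @ M
print("G =\n", np.round(G, 3))
```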
Related papers
- Conditional regression for the Nonlinear Single-Variable Model [4.565636963872865]
We consider a model $F(X):=f(\Pi_\gamma(X)):\mathbb{R}^d\to\mathbb{R}$, where $\Pi_\gamma:\mathbb{R}^d\to[0,\mathrm{len}_\gamma]$ is the projection onto a curve $\gamma:[0,\mathrm{len}_\gamma]\to\mathbb{R}^d$ and $f:[0,\mathrm{len}_\gamma]\to\mathbb{R}^1$.
We propose a nonparametric estimator, based on conditional regression, and show that it can achieve the one-dimensional optimal minimax rate.
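As a toy illustration of this effectively one-dimensional structure (a sketch under simplifying assumptions: the curve $\gamma$ is known here, whereas the paper's conditional-regression estimator does not assume this), one can project onto the curve's arc-length parameter and run a 1-D regressogram:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy instance of F(X) = f(Pi_gamma(X)) with a *known* curve: gamma is the
# upper unit half-circle in R^2, parametrized by arc length t in [0, pi].
ts = np.linspace(0.0, np.pi, 1000)
curve = np.stack([np.cos(ts), np.sin(ts)], axis=1)

def pi_gamma(X):
    """Closest-point projection: arc-length parameter of the nearest curve point."""
    d2 = (X ** 2).sum(1)[:, None] - 2 * X @ curve.T + (curve ** 2).sum(1)[None, :]
    return ts[d2.argmin(axis=1)]

f = lambda t: np.sin(3 * t)                      # illustrative link function

n = 2000
X = rng.uniform(-1.5, 1.5, size=(n, 2))
z = f(pi_gamma(X)) + 0.1 * rng.standard_normal(n)

# The problem is effectively one-dimensional: regress z on the projected
# variable t = Pi_gamma(X) with a 1-D bin smoother (regressogram).
t = pi_gamma(X)
bins = np.linspace(0.0, np.pi, 30)
idx = np.digitize(t, bins)
fhat = {b: z[idx == b].mean() for b in np.unique(idx)}

t_new = pi_gamma(rng.uniform(-1.5, 1.5, size=(500, 2)))
pred = np.array([fhat.get(b, z.mean()) for b in np.digitize(t_new, bins)])
print("RMSE against the true f:", np.sqrt(np.mean((pred - f(t_new)) ** 2)))
```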
arXiv Detail & Related papers (2024-11-14T18:53:51Z)
- Which exceptional low-dimensional projections of a Gaussian point cloud can be found in polynomial time? [8.74634652691576]
We study the subset $\mathscr{F}_{m,\alpha}$ of distributions that can be realized by a class of iterative algorithms.
Non-rigorous methods from statistical physics yield an indirect characterization of $\mathscr{F}_{m,\alpha}$ in terms of a generalized Parisi formula.
arXiv Detail & Related papers (2024-06-05T05:54:56Z)
- Provably learning a multi-head attention layer [55.2904547651831]
The multi-head attention layer is one of the key components of the transformer architecture that sets it apart from traditional feed-forward models.
In this work, we initiate the study of provably learning a multi-head attention layer from random examples.
We prove computational lower bounds showing that in the worst case, exponential dependence on the number of heads $m$ is unavoidable.
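For orientation, here is a compact forward pass of a standard $m$-head attention layer in numpy (the textbook definition, not the paper's learning algorithm; shapes, scalings, and the single-sequence restriction are illustrative choices):

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    """Standard m-head attention forward pass on one sequence X (T x d).

    Wq, Wk, Wv: lists of m per-head projections (d x d_h); Wo: (m*d_h x d).
    """
    heads = []
    for q, k, v in zip(Wq, Wk, Wv):
        Q, K, V = X @ q, X @ k, X @ v                # T x d_h each
        scores = Q @ K.T / np.sqrt(K.shape[1])       # T x T scaled dot products
        scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
        A = np.exp(scores)
        A /= A.sum(axis=1, keepdims=True)
        heads.append(A @ V)                          # T x d_h
    return np.concatenate(heads, axis=1) @ Wo       # T x d

rng = np.random.default_rng(0)
T, d, m, dh = 8, 16, 4, 4
X = rng.standard_normal((T, d))
Wq = [rng.standard_normal((d, dh)) / np.sqrt(d) for _ in range(m)]
Wk = [rng.standard_normal((d, dh)) / np.sqrt(d) for _ in range(m)]
Wv = [rng.standard_normal((d, dh)) / np.sqrt(d) for _ in range(m)]
Wo = rng.standard_normal((m * dh, d)) / np.sqrt(m * dh)
print(multi_head_attention(X, Wq, Wk, Wv, Wo).shape)  # (8, 16)
```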
arXiv Detail & Related papers (2024-02-06T15:39:09Z)
- A Unified Framework for Uniform Signal Recovery in Nonlinear Generative Compressed Sensing [68.80803866919123]
Under nonlinear measurements, most prior results are non-uniform, i.e., they hold with high probability for a fixed $\mathbf{x}^*$ rather than for all $\mathbf{x}^*$ simultaneously.
Our framework accommodates GCS with 1-bit/uniformly quantized observations and single index models as canonical examples.
We also develop a concentration inequality that produces tighter bounds for product processes whose index sets have low metric entropy.
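A hedged toy example of the 1-bit setting (the classical correlation estimator, not the paper's framework): for Gaussian measurement vectors, $\mathbb{E}[a\,\mathrm{sign}(a\cdot x^*)] = \sqrt{2/\pi}\,x^*$ when $\|x^*\|=1$, so averaging $a_i y_i$ recovers the direction of one fixed $x^*$, which is exactly the non-uniform guarantee contrasted above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 100

# Unit-norm signal x*; 1-bit nonlinear measurements y_i = sign(<a_i, x*>).
x_star = rng.standard_normal(d)
x_star /= np.linalg.norm(x_star)
A = rng.standard_normal((n, d))
y = np.sign(A @ x_star)

# Simple linear (correlation) estimator: xhat ∝ A^T y / n. For Gaussian a_i,
# E[a sign(a . x*)] = sqrt(2/pi) x*, so xhat estimates the direction of x*.
xhat = A.T @ y / n
xhat /= np.linalg.norm(xhat)
print("direction error ||xhat - x*||:", np.linalg.norm(xhat - x_star))

# A non-uniform guarantee controls this error for the fixed x* above; a
# uniform one would control it simultaneously over all structured x*.
```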
arXiv Detail & Related papers (2023-09-25T17:54:19Z)
- Statistical Learning under Heterogeneous Distribution Shift [71.8393170225794]
The ground-truth predictor is additive: $\mathbb{E}[\mathbf{z} \mid \mathbf{x},\mathbf{y}] = f_\star(\mathbf{x}) + g_\star(\mathbf{y})$.
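A minimal sketch of this additive structure (classical backfitting with bin smoothers, an illustrative stand-in rather than the paper's method):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.uniform(-1, 1, n)
y = rng.uniform(-1, 1, n)
f_star, g_star = np.sin, lambda t: t ** 2
z = f_star(x) + g_star(y) + 0.1 * rng.standard_normal(n)

# Backfitting with bin-average smoothers: alternately regress the current
# residual on x, then on y, until the additive fit stabilizes.
bins = np.linspace(-1, 1, 25)

def smooth(u, r):
    idx = np.digitize(u, bins)
    means = {b: r[idx == b].mean() for b in np.unique(idx)}
    return np.array([means[b] for b in idx])

fhat = np.zeros(n)
ghat = np.zeros(n)
for _ in range(10):
    fhat = smooth(x, z - ghat)
    ghat = smooth(y, z - fhat)
    ghat -= ghat.mean()          # fix the additive identifiability constant

resid = z - fhat - ghat
print("residual std:", resid.std())   # ~0.1, the noise level
```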
arXiv Detail & Related papers (2023-02-27T16:34:21Z)
- Universality laws for Gaussian mixtures in generalized linear models [22.154969876570238]
We investigate the joint statistics of the family of generalized linear estimators $(\Theta_1, \dots, \Theta_M)$.
This allows us to prove the universality of different quantities of interest, such as the training and generalization errors.
We discuss the applications of our results to different machine learning tasks of interest, such as ensembling and uncertainty quantification.
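A quick numerical illustration of such universality (a toy experiment of ours, not from the paper): ridge regression on a two-component mixture with Rademacher fluctuations versus the Gaussian mixture with matched means and covariances gives nearly identical training errors:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 4000, 400, 0.5

# Two-component mixture in R^d: means +/- mu plus i.i.d. unit-variance
# fluctuations. Universality: swapping non-Gaussian (Rademacher) fluctuations
# for Gaussian ones with the same mean/covariance barely moves the errors.
mu = rng.standard_normal(d)
mu /= np.linalg.norm(mu)
s = rng.choice([-1.0, 1.0], size=n)          # mixture component labels
theta = rng.standard_normal(d)
theta /= np.linalg.norm(theta)               # teacher direction

def ridge_train_error(Z):
    X = np.outer(s, mu) + Z                  # mixture covariates
    y = X @ theta + 0.1 * rng.standard_normal(n)
    w = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
    return np.mean((X @ w - y) ** 2)

print("Rademacher mixture:", round(ridge_train_error(rng.choice([-1.0, 1.0], size=(n, d))), 4))
print("Gaussian   mixture:", round(ridge_train_error(rng.standard_normal((n, d))), 4))
```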
arXiv Detail & Related papers (2023-02-17T15:16:06Z)
- Learning a Single Neuron with Adversarial Label Noise via Gradient Descent [50.659479930171585]
We study a function of the form $\mathbf{x}\mapsto\sigma(\mathbf{w}\cdot\mathbf{x})$ for monotone activations $\sigma$.
The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w}) \le C\cdot\mathrm{opt} + \epsilon$ with high probability, where $C$ is a constant and $\mathrm{opt}$ is the loss of the best-fitting neuron.
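A hedged sketch of this setting with a ReLU activation (random, rather than truly adversarial, label corruption; the initialization, step size, and corruption model are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lr, steps = 5000, 50, 0.5, 500

relu = lambda t: np.maximum(t, 0.0)
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)

X = rng.standard_normal((n, d))
y = relu(X @ w_star)
corrupt = rng.random(n) < 0.05               # random stand-in for the adversary
y[corrupt] += 2.0 * rng.standard_normal(corrupt.sum())

w = rng.standard_normal(d) / np.sqrt(d)      # random init (GD from 0 is stuck)
for _ in range(steps):
    pred = relu(X @ w)
    # Gradient of F(w) = (1/2n) sum (relu(w.x) - y)^2.
    grad = X.T @ ((pred - y) * (X @ w > 0)) / n
    w -= lr * grad

F = 0.5 * np.mean((relu(X @ w) - y) ** 2)
opt = 0.5 * np.mean((relu(X @ w_star) - y) ** 2)   # error of the best neuron
print(f"F(w) = {F:.4f}  vs  opt = {opt:.4f}")       # goal: F(w) <= C*opt + eps
```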
arXiv Detail & Related papers (2022-06-17T17:55:43Z)
- Random matrices in service of ML footprint: ternary random features with no performance loss [55.30329197651178]
We show that the eigenspectrum of $\mathbf{K}$ is independent of the distribution of the i.i.d. entries of $\mathbf{w}$.
We propose a novel random features technique, called Ternary Random Feature (TRF).
The computation of the proposed random features requires no multiplication and a factor of $b$ fewer bits for storage compared to classical random features.
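A sketch of the multiplication-free idea (a simplified rendering of ours; the paper's exact TRF construction and thresholds may differ): ternary weights make each product a signed addition, and a ternary activation makes the features themselves 2-bit storable, while the eigenspectrum claim above motivates swapping Gaussian weights for ternary ones:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, N = 512, 128, 1024     # samples, input dim, number of random features

X = rng.standard_normal((n, d)) / np.sqrt(d)

# Ternary random weights w_ij in {-1, 0, +1}: products with x reduce to
# signed additions, so the feature map needs no multiplications.
W = rng.choice([-1.0, 0.0, 1.0], size=(N, d), p=[0.25, 0.5, 0.25])

# Ternary activation: sign with a dead zone, so features lie in {-1, 0, +1}
# and can be stored in 2 bits instead of a float.
def ternary_act(z, thresh=0.3):
    return np.sign(z) * (np.abs(z) > thresh)

F_trf = ternary_act(X @ W.T)                 # n x N ternary feature matrix
K = F_trf @ F_trf.T / N                      # random-feature kernel matrix
evals = np.linalg.eigvalsh(K)
print("kernel eigenvalue range:", round(evals[0], 3), round(evals[-1], 3))
```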
arXiv Detail & Related papers (2021-10-05T09:33:49Z)
- Deformed semicircle law and concentration of nonlinear random matrices for ultra-wide neural networks [29.03095282348978]
We study the limiting spectral distributions of two empirical kernel matrices associated with $f(X)$.
We show that random feature regression induced by the empirical kernel achieves the same performance as its limiting kernel regression under the ultra-wide regime.
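As a small numerical companion (an illustration of ours, assuming a tanh activation and i.i.d. Gaussian weights), one can watch the spectrum of an empirical kernel matrix stabilize as the width grows, the ultra-wide phenomenon in miniature:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 80

X = rng.standard_normal((d, n)) / np.sqrt(d)    # n inputs in R^d, normalized
f = np.tanh

def empirical_ck(width):
    """Empirical kernel (1/width) f(W X)^T f(W X) for one random weight draw."""
    W = rng.standard_normal((width, d))
    Phi = f(W @ X)                              # width x n post-activations
    return Phi.T @ Phi / width

# As the width grows (ultra-wide regime), the empirical kernel's spectrum
# concentrates around that of its limiting kernel.
for width in (200, 2000, 20000):
    ev = np.linalg.eigvalsh(empirical_ck(width))
    print(f"width={width:6d}  top eigenvalue={ev[-1]:.4f}  smallest={ev[0]:.4f}")
```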
arXiv Detail & Related papers (2021-09-20T05:25:52Z)
- Spectral properties of sample covariance matrices arising from random matrices with independent non identically distributed columns [50.053491972003656]
It was previously shown that the functionals $\mathrm{tr}(AR(z))$, for $R(z) = (\frac{1}{n}XX^T - zI_p)^{-1}$ and $A\in \mathcal{M}_p$ deterministic, have a standard deviation of order $O(\|A\|_* / \sqrt{n})$.
Here, we show a bound on $\|\mathbb{E}[R(z)] - \tilde{R}(z)\|_F$, where $\tilde{R}(z)$ is a deterministic equivalent of the resolvent.
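In the blurb's notation, a quick numerical check of the quoted concentration (a toy experiment of ours; $A = I_p$, so $\|A\|_* = p$, and $z = -1$ are illustrative choices):

```python
import numpy as np

p, n, z = 200, 600, -1.0        # spectral parameter z chosen off the spectrum

A = np.eye(p)                   # deterministic test matrix, ||A||_* = p

def tr_AR(seed):
    X = np.random.default_rng(seed).standard_normal((p, n))
    R = np.linalg.inv(X @ X.T / n - z * np.eye(p))   # R(z) = (XX^T/n - zI_p)^{-1}
    return np.trace(A @ R)

samples = np.array([tr_AR(s) for s in range(200)])
print("mean tr(A R(z)):", round(samples.mean(), 3))
# The observed spread sits within the O(||A||_* / sqrt(n)) bound quoted above.
print("std  tr(A R(z)):", round(samples.std(), 4))
```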
arXiv Detail & Related papers (2021-09-06T14:21:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.