Related papers: Learning single index model with gradient descent: spectral initialization and precise asymptotics

Learning single index model with gradient descent: spectral initialization and precise asymptotics

URL: http://arxiv.org/abs/2509.23527v1
Date: Sat, 27 Sep 2025 23:27:24 GMT
Title: Learning single index model with gradient descent: spectral initialization and precise asymptotics
Authors: Yuchen Chen, Yandi Shen,
Abstract summary: We show that for learning problems with large enough sample size, there exists a region around the true signal with benign data.<n>Motivated by many variables, a widely used strategy is a two-stage algorithm, where first apply a spectral gradient descent.<n>We demonstrate our general theory in the example of regularized Wirtinger flow for retrieval.
Score: 6.142981584296888
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Non-convex optimization plays a central role in many statistics and machine learning problems. Despite the landscape irregularities for general non-convex functions, some recent work showed that for many learning problems with random data and large enough sample size, there exists a region around the true signal with benign landscape. Motivated by this observation, a widely used strategy is a two-stage algorithm, where we first apply a spectral initialization to plunge into the region, and then run gradient descent for further refinement. While this two-stage algorithm has been extensively analyzed for many non-convex problems, the precise distributional property of both its transient and long-time behavior remains to be understood. In this work, we study this two-stage algorithm in the context of single index models under the proportional asymptotics regime. We derive a set of dynamical mean field equations, which describe the precise behavior of the trajectory of spectral initialized gradient descent in the large system limit. We further show that when the spectral initialization successfully lands in a region of benign landscape, the above equation system is asymptotically time translation invariant and exponential converging, and thus admits a set of long-time fixed points that represents the mean field characterization of the limiting point of the gradient descent dynamic. As a proof of concept, we demonstrate our general theory in the example of regularized Wirtinger flow for phase retrieval.

Related papers

Long-time dynamics and universality of nonconvex gradient descent [0.7614628596146601]
This paper develops a general approach to characterize the long-time behavior of non gradient descent in single-index models.<n>Our approach reveals that gradient descents are in general approximately independent of the data and strongly incoherent with the feature vectors.
arXiv Detail & Related papers (2025-09-14T20:36:18Z)
Gradient descent inference in empirical risk minimization [1.1510009152620668]
Gradient descent is one of the most widely used iterative algorithms in modern statistical learning.<n>This paper provides a precise, non-asymotical characterization of gradient descent in a broad class of empirical risk minimization problems.
arXiv Detail & Related papers (2024-12-12T17:47:08Z)
Limit Theorems for Stochastic Gradient Descent with Infinite Variance [51.4853131023238]
We show that the gradient descent algorithm can be characterized as the stationary distribution of a suitably defined Ornstein-rnstein process driven by an appropriate L'evy process.<n>We also explore the applications of these results in linear regression and logistic regression models.
arXiv Detail & Related papers (2024-10-21T09:39:10Z)
A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparametricized two-layer neural networks. We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks. Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by the magnitude of $O(alpha-1)$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z)
Stochastic Gradient Descent for Gaussian Processes Done Right [86.83678041846971]
We show that when emphdone right -- by which we mean using specific insights from optimisation and kernel communities -- gradient descent is highly effective. We introduce a emphstochastic dual descent algorithm, explain its design in an intuitive manner and illustrate the design choices. Our method places Gaussian process regression on par with state-of-the-art graph neural networks for molecular binding affinity prediction.
arXiv Detail & Related papers (2023-10-31T16:15:13Z)
Sampling from Gaussian Process Posteriors using Stochastic Gradient Descent [43.097493761380186]
gradient algorithms are an efficient method of approximately solving linear systems. We show that gradient descent produces accurate predictions, even in cases where it does not converge quickly to the optimum. Experimentally, gradient descent achieves state-of-the-art performance on sufficiently large-scale or ill-conditioned regression tasks.
arXiv Detail & Related papers (2023-06-20T15:07:37Z)
On the Benefits of Large Learning Rates for Kernel Methods [110.03020563291788]
We show that a phenomenon can be precisely characterized in the context of kernel methods. We consider the minimization of a quadratic objective in a separable Hilbert space, and show that with early stopping, the choice of learning rate influences the spectral decomposition of the obtained solution.
arXiv Detail & Related papers (2022-02-28T13:01:04Z)
Nonconvex Stochastic Scaled-Gradient Descent and Generalized Eigenvector Problems [98.34292831923335]
Motivated by the problem of online correlation analysis, we propose the emphStochastic Scaled-Gradient Descent (SSD) algorithm. We bring these ideas together in an application to online correlation analysis, deriving for the first time an optimal one-time-scale algorithm with an explicit rate of local convergence to normality.
arXiv Detail & Related papers (2021-12-29T18:46:52Z)
Stochasticity helps to navigate rough landscapes: comparing gradient-descent-based algorithms in the phase retrieval problem [8.164433158925593]
We investigate how analytically-based algorithms such as dynamical descent, its persistent gradient, and Langevin landscape descent can be studied. We apply statistical-field theory from statistical trajectories to these algorithms in their full-time limit, with a start, and for large system sizes.
arXiv Detail & Related papers (2021-03-08T17:06:18Z)
Dynamical mean-field theory for stochastic gradient descent in Gaussian mixture classification [25.898873960635534]
We analyze in a closed learning dynamics of gradient descent (SGD) for a single-layer neural network classifying a high-dimensional landscape. We define a prototype process for which can be extended to a continuous-dimensional gradient flow. In the full-batch limit, we recover the standard gradient flow.
arXiv Detail & Related papers (2020-06-10T22:49:41Z)
Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks. In this paper we analyze a variant of OptimisticOA algorithm for nonconcave minmax problems. Our experiments show that adaptive GAN non-adaptive gradient algorithms can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.