Nonlinear discretizations and Newton's method: characterizing stationary points of regression objectives
- URL: http://arxiv.org/abs/2510.11987v1
- Date: Mon, 13 Oct 2025 22:29:52 GMT
- Title: Nonlinear discretizations and Newton's method: characterizing stationary points of regression objectives
- Authors: Conor Rowan
- Abstract summary: We show that neural network training reliably fails when relying on exact curvature information. The failure modes provide insight into both the geometry of nonlinear discretizations and the distribution of stationary points in the loss landscape.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Second-order methods are emerging as promising alternatives to standard first-order optimizers such as gradient descent and ADAM for training neural networks. Though the advantages of including curvature information in computing optimization steps have been celebrated in the scientific machine learning literature, the only second-order methods that have been studied are quasi-Newton, meaning that the Hessian matrix of the objective function is approximated. Though one would expect only to gain from using the true Hessian in place of its approximation, we show that neural network training reliably fails when relying on exact curvature information. The failure modes provide insight into both the geometry of nonlinear discretizations and the distribution of stationary points in the loss landscape, leading us to question the conventional wisdom that the loss landscape is replete with local minima.
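The abstract contrasts quasi-Newton approximations with the true Hessian. For concreteness, here is a minimal sketch of one damped Newton step on a toy regression objective, using JAX to form the exact Hessian; the network architecture, data, and damping constant are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (illustrative, not the paper's code): one damped Newton step
# on a toy regression objective, with the exact Hessian computed by JAX.
import jax
import jax.numpy as jnp

def mlp(params, x):
    # One-hidden-layer tanh network with all 31 parameters packed into a
    # flat vector, so the Hessian of the loss is a single dense 31x31 matrix.
    W1, b1 = params[:10].reshape(10, 1), params[10:20]
    w2, b2 = params[20:30], params[30]
    return jnp.tanh(x @ W1.T + b1) @ w2 + b2

def loss(params, x, y):
    # Mean-squared-error regression objective.
    return jnp.mean((mlp(params, x) - y) ** 2)

key = jax.random.PRNGKey(0)
x = jax.random.uniform(key, (50, 1))
y = jnp.sin(2.0 * jnp.pi * x[:, 0])          # toy 1D regression target
params = 0.1 * jax.random.normal(key, (31,))

g = jax.grad(loss)(params, x, y)
H = jax.hessian(loss)(params, x, y)          # exact curvature, no approximation

# Newton step: solve H p = -g. Away from a minimum H is generally indefinite,
# so the step can be attracted to saddle points; the small damping term below
# only stabilizes the linear solve, it does not repair the indefiniteness.
p = jnp.linalg.solve(H + 1e-6 * jnp.eye(H.shape[0]), -g)
params = params + p
```

Because the exact Hessian is indefinite at generic points of the loss landscape, undamped Newton iterations are attracted to stationary points of any index rather than to minima alone, which is consistent with the failure modes the abstract describes.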
Related papers
- Curvature Learning for Generalization of Hyperbolic Neural Networks [51.888534247573894]
Hyperbolic neural networks (HNNs) have demonstrated notable efficacy in representing real-world data with hierarchical structures. Inappropriate curvatures may cause HNNs to converge to suboptimal parameters, degrading overall performance. We propose a sharpness-aware curvature learning method to smooth the loss landscape, thereby improving the generalization of HNNs.
arXiv Detail & Related papers (2025-08-24T07:14:30Z) - Ultra-fast feature learning for the training of two-layer neural networks in the two-timescale regime [26.47265060394168]
We study the convergence of training methods for mean-field two-layer neural networks. We show that a two-timescale strategy enables provable convergence rates for recovering a teacher feature distribution.
arXiv Detail & Related papers (2025-04-25T09:40:10Z) - Symmetric Rank-One Quasi-Newton Methods for Deep Learning Using Cubic Regularization [0.5120567378386615]
Gradient descent and other first-order variants, such as Adam and AdaGrad, are commonly used in the field of deep learning. However, these methods do not exploit curvature information. Quasi-Newton methods re-use previously computed low-rank Hessian approximations.
arXiv Detail & Related papers (2025-02-17T20:20:11Z) - Debiasing Mini-Batch Quadratics for Applications in Deep Learning [22.90473935350847]
Quadratic approximations form a fundamental building block of machine learning methods. When computations on the entire training set are intractable, as is typical in deep learning, the relevant quantities are computed on mini-batches. We (i) show that this introduces a systematic bias, (ii) provide a theoretical explanation for it, (iii) explain its relevance for second-order optimization and for uncertainty estimation via the Laplace approximation in deep learning, and (iv) develop and evaluate debiasing strategies.
arXiv Detail & Related papers (2024-10-18T09:37:05Z) - Trust-Region Sequential Quadratic Programming for Stochastic Optimization with Random Models [57.52124921268249]
We propose a Trust-Region Sequential Quadratic Programming method to find both first- and second-order stationary points.
To converge to first-order stationary points, our method computes a gradient step in each iteration, defined by minimizing a quadratic approximation of the objective subject to a trust-region constraint.
To converge to second-order stationary points, our method additionally computes an eigenstep to explore the negative curvature of the reduced Hessian matrix.
arXiv Detail & Related papers (2024-09-24T04:39:47Z) - Stochastic Gradient Descent for Gaussian Processes Done Right [86.83678041846971]
We show that when done right, by which we mean using specific insights from the optimisation and kernel communities, stochastic gradient descent is highly effective.
We introduce a stochastic dual descent algorithm, explain its design in an intuitive manner and illustrate the design choices.
Our method places Gaussian process regression on par with state-of-the-art graph neural networks for molecular binding affinity prediction.
arXiv Detail & Related papers (2023-10-31T16:15:13Z) - Dual Gauss-Newton Directions for Deep Learning [16.77273032202006]
Inspired by Gauss-Newton-like methods, we study the benefit of leveraging the structure of deep learning objectives.
We propose to compute Gauss-Newton-like direction oracles via their dual formulation, leading to both computational benefits and new insights.
arXiv Detail & Related papers (2023-08-17T09:44:05Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias of homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - On the Benefits of Large Learning Rates for Kernel Methods [110.03020563291788]
We show that the benefit of large learning rates can be precisely characterized in the context of kernel methods.
We consider the minimization of a quadratic objective in a separable Hilbert space, and show that with early stopping, the choice of learning rate influences the spectral decomposition of the obtained solution.
arXiv Detail & Related papers (2022-02-28T13:01:04Z) - Error-Correcting Neural Networks for Two-Dimensional Curvature Computation in the Level-Set Method [0.0]
We present an error-correcting neural-network-based strategy for approximating two-dimensional curvature in the level-set method.
Our main contribution is a redesigned hybrid solver that relies on numerical schemes to enable machine-learning operations on demand.
arXiv Detail & Related papers (2022-01-22T05:14:40Z) - Deep learning: a statistical viewpoint [120.94133818355645]
Deep learning has revealed some major surprises from a theoretical perspective.
In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems.
We conjecture that specific principles underlie these phenomena.
arXiv Detail & Related papers (2021-03-16T16:26:36Z)