A Teacher-Student Perspective on the Dynamics of Learning Near the Optimal Point
- URL: http://arxiv.org/abs/2512.15606v1
- Date: Wed, 17 Dec 2025 17:17:12 GMT
- Title: A Teacher-Student Perspective on the Dynamics of Learning Near the Optimal Point
- Authors: Carlos Couto, José Mourão, Mário A. T. Figueiredo, Pedro Ribeiro
- Abstract summary: We show that the learning performance of gradient descent dynamics is dictated by the Hessian matrix of the loss function with respect to the network parameters. For a generic non-linear activation function, such as the error function, we empirically observe that the Hessian matrix is always full rank.
- Score: 2.6704011101972136
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Near an optimal learning point of a neural network, the learning performance of gradient descent dynamics is dictated by the Hessian matrix of the loss function with respect to the network parameters. We characterize the Hessian eigenspectrum for some classes of teacher-student problems, when the teacher and student networks have matching weights, showing that the smaller eigenvalues of the Hessian determine long-time learning performance. For linear networks, we analytically establish that for large networks the spectrum asymptotically follows a convolution of a scaled chi-square distribution with a scaled Marchenko-Pastur distribution. We numerically analyse the Hessian spectrum for polynomial and other non-linear networks. Furthermore, we show that the rank of the Hessian matrix can be seen as an effective number of parameters for networks using polynomial activation functions. For a generic non-linear activation function, such as the error function, we empirically observe that the Hessian matrix is always full rank.
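The central object in the abstract is the loss Hessian at the matching (teacher = student) point. Below is a minimal numerical sketch of that setup, assuming a two-layer linear student f(x) = vᵀWx with squared loss on Gaussian inputs (an illustrative toy, not the paper's code): at the optimum the residuals vanish, so the Hessian reduces to the Gauss-Newton matrix JᵀJ/n, whose eigenvalues and numerical rank can be inspected directly.

```python
# Minimal numerical sketch (not the paper's code), assuming a two-layer linear
# student f(x) = v^T W x whose weights match an identical teacher, with squared
# loss on Gaussian inputs. At the matching point the residuals vanish, so the
# Hessian of the loss equals the Gauss-Newton matrix J^T J / n, where row i of
# J is the gradient of f(x_i) with respect to all parameters (vec(W), v).
import numpy as np

rng = np.random.default_rng(0)
d, h, n = 30, 20, 5000                       # input dim, hidden width, samples
W = rng.normal(size=(h, d)) / np.sqrt(d)     # teacher = student weights
v = rng.normal(size=h) / np.sqrt(h)
X = rng.normal(size=(n, d))                  # Gaussian inputs

# Per-sample gradients: df/dW = v x^T (flattened row-major), df/dv = W x.
J_W = np.einsum("a,ib->iab", v, X).reshape(n, h * d)
J_v = X @ W.T
J = np.hstack([J_W, J_v])                    # shape (n, h*d + h)

H = J.T @ J / n                              # loss Hessian at the optimum
eigs = np.linalg.eigvalsh(H)
nonzero = eigs[eigs > 1e-10 * eigs.max()]
print("parameters            :", H.shape[0])
print("numerical Hessian rank:", nonzero.size)
print("smallest / largest nonzero eigenvalue:", nonzero.min(), nonzero.max())
```

For this single-output linear toy the numerical rank equals the input dimension d, far below the parameter count, which is the sense in which the abstract treats the Hessian rank as an effective number of parameters; per the abstract, the smaller (nonzero) eigenvalues are the ones that govern long-time learning performance.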
Related papers
- Geometry and Optimization of Shallow Polynomial Networks [37.10914374024599]
We study shallow neural networks with polynomial activations, focusing on the relationship between width and optimization. We then consider teacher-student problems, which can be viewed as a problem of low-rank tensor approximation. In particular, we present a variation of the Eckart-Young Theorem characterizing all critical points and their Hessian signatures.
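For reference, the sketch below illustrates only the classical matrix form of the Eckart-Young theorem (best rank-k approximation via a truncated SVD), not the paper's tensor variation for polynomial teacher-student networks.

```python
# Classical (matrix) Eckart-Young: truncating the SVD at rank k gives the best
# rank-k approximation in Frobenius norm. This is only the standard matrix
# statement, not the paper's variation for polynomial teacher-student tensors.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 40))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 5
A_k = (U[:, :k] * s[:k]) @ Vt[:k]            # rank-k truncation
print("rank-5 approximation error:", np.linalg.norm(A - A_k, "fro"))
print("sqrt of discarded spectrum :", np.sqrt(np.sum(s[k:] ** 2)))  # equal by the theorem
```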
arXiv Detail & Related papers (2025-01-10T16:11:27Z)
- A Random Matrix Theory Perspective on the Spectrum of Learned Features and Asymptotic Generalization Capabilities [30.737171081270322]
We study how fully-connected two-layer neural networks adapt to the target function after a single, but aggressive, gradient descent step.
This provides a sharp description of the impact of feature learning on the generalization of two-layer neural networks, beyond the random features and lazy training regimes.
arXiv Detail & Related papers (2024-10-24T17:24:34Z)
- Deep Learning without Global Optimization by Random Fourier Neural Networks [0.0]
We introduce a new training algorithm for deep neural networks that utilize random complex exponential activation functions. Our approach employs a Markov Chain Monte Carlo sampling procedure to iteratively train network layers. It consistently attains the theoretical approximation rate for residual networks with complex exponential activation functions.
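As background only, the sketch below fits a shallow model with random complex-exponential features, solving just for the readout by ridge-regularized least squares; the frequencies, target function, and regularization are illustrative assumptions, and this stands in for, rather than reproduces, the paper's MCMC layer-wise training.

```python
# Shallow model with random complex-exponential features, readout fitted by
# ridge-regularized least squares. Frequencies, target, and regularization are
# illustrative assumptions; this is not the paper's MCMC training procedure.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 500, 1, 200                          # samples, input dim, random features
X = rng.uniform(-np.pi, np.pi, size=(n, d))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=n)   # toy target

Omega = rng.normal(scale=2.0, size=(d, m))     # random frequencies
Phi = np.exp(1j * X @ Omega)                   # complex exponential activations
lam = 1e-3
w = np.linalg.solve(Phi.conj().T @ Phi + lam * np.eye(m), Phi.conj().T @ y)
pred = (Phi @ w).real
print("train RMSE:", np.sqrt(np.mean((pred - y) ** 2)))
```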
arXiv Detail & Related papers (2024-07-16T16:23:40Z)
- Coding schemes in neural networks learning classification tasks [52.22978725954347]
We investigate fully-connected, wide neural networks learning classification tasks.
We show that the networks acquire strong, data-dependent features.
Surprisingly, the nature of the internal representations depends crucially on the neuronal nonlinearity.
arXiv Detail & Related papers (2024-06-24T14:50:05Z)
- Implicit Regularization via Spectral Neural Networks and Non-linear Matrix Sensing [2.171120568435925]
Spectral Neural Networks (SNNs) are particularly suitable for matrix learning problems.
We show that the SNN architecture is inherently much more amenable to theoretical analysis than vanilla neural nets.
We believe that the SNN architecture has the potential to be of wide applicability in a broad class of matrix learning scenarios.
arXiv Detail & Related papers (2024-02-27T15:28:01Z)
- Asymptotics of Learning with Deep Structured (Random) Features [9.366617422860543]
For a large class of feature maps we provide a tight characterisation of the test error associated with learning the readout layer.
In some cases our results can capture feature maps learned by deep, finite-width neural networks trained under gradient descent.
arXiv Detail & Related papers (2024-02-21T18:35:27Z)
- Online Network Source Optimization with Graph-Kernel MAB [62.6067511147939]
We propose Grab-UCB, a graph-kernel multi-armed bandit algorithm that learns online the optimal source placement in large-scale networks.
We describe the network processes with an adaptive graph dictionary model, which typically leads to sparse spectral representations.
We derive the performance guarantees that depend on network parameters, which further influence the learning curve of the sequential decision strategy.
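For background only, the sketch below runs the standard UCB1 rule on a toy stochastic bandit with made-up Bernoulli arms; it illustrates the upper-confidence-bound principle that Grab-UCB builds on, not the paper's graph-kernel algorithm or its guarantees.

```python
# Standard UCB1 on a toy stochastic bandit; a generic illustration of the
# upper-confidence-bound principle, not Grab-UCB. Arm means are made up.
import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.2, 0.5, 0.7, 0.65])      # hypothetical arm means
K, T = len(means), 5000

counts, sums, regret = np.zeros(K), np.zeros(K), 0.0
for t in range(1, T + 1):
    if t <= K:                                # play each arm once first
        a = t - 1
    else:
        ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
        a = int(np.argmax(ucb))
    reward = float(rng.random() < means[a])   # Bernoulli reward draw
    counts[a] += 1
    sums[a] += reward
    regret += means.max() - means[a]

print("pulls per arm    :", counts.astype(int))
print("cumulative regret:", round(regret, 1))
```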
arXiv Detail & Related papers (2023-07-07T15:03:42Z)
- Joint Feature and Differentiable $k$-NN Graph Learning using Dirichlet Energy [103.74640329539389]
We propose a deep feature selection (FS) method that simultaneously conducts feature selection and differentiable $k$-NN graph learning.
We employ Optimal Transport theory to address the non-differentiability issue of learning $k$-NN graphs in neural networks.
We validate the effectiveness of our model with extensive experiments on both synthetic and real-world datasets.
arXiv Detail & Related papers (2023-05-21T08:15:55Z)
- Analytical aspects of non-differentiable neural networks [0.0]
We discuss the expressivity of quantized neural networks and approximation techniques for non-differentiable networks.
We show that QNNs have the same expressivity as DNNs in terms of approximation of Lipschitz functions in the $L^\infty$ norm.
We also consider networks defined by means of Heaviside-type activation functions, and prove for them a pointwise approximation result by means of smooth networks.
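The sketch below shows the generic flavour of such a pointwise statement, assuming a logistic unit with a sharpness parameter as the smooth approximant; it illustrates the kind of result mentioned, not the paper's construction.

```python
# Pointwise approximation of a Heaviside-type activation by a smooth unit:
# sigma(t*x) -> H(x) as t -> infinity for every x != 0. A generic illustration
# with a logistic unit of sharpness t, not the paper's construction.
import numpy as np

def heaviside(x):
    return (x > 0).astype(float)

def smooth_step(x, t):
    z = np.clip(t * x, -60, 60)               # avoid overflow in exp
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-1, 1, 9)
x = x[x != 0]                                  # convergence is pointwise away from 0
for t in (1, 10, 100, 1000):
    err = np.max(np.abs(smooth_step(x, t) - heaviside(x)))
    print(f"t = {t:5d}, max error on the grid: {err:.3e}")
```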
arXiv Detail & Related papers (2020-11-03T17:20:43Z)
- Can Temporal-Difference and Q-Learning Learn Representation? A Mean-Field Theory [110.99247009159726]
Temporal-difference and Q-learning play a key role in deep reinforcement learning, where they are empowered by expressive nonlinear function approximators such as neural networks.
In particular, temporal-difference learning converges when the function approximator is linear in a feature representation, which is fixed throughout learning, and possibly diverges otherwise.
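To make the linear-approximation setting concrete, here is a small semi-gradient TD(0) run on a randomly generated Markov reward process with fixed features; the chain, features, and step size are illustrative assumptions, and the example shows the classical convergent regime rather than the paper's mean-field analysis.

```python
# Semi-gradient TD(0) with a fixed linear feature map on a small random Markov
# reward process; an illustration of the classical convergent regime, not the
# paper's mean-field theory. Chain, features, and step size are made up.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_feat, gamma, alpha = 20, 5, 0.9, 0.02
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)              # row-stochastic transition matrix
r = rng.normal(size=n_states)                  # per-state reward
Phi = rng.normal(size=(n_states, n_feat))      # fixed feature representation

w, s = np.zeros(n_feat), 0
for _ in range(100_000):
    s_next = rng.choice(n_states, p=P[s])
    td_error = r[s] + gamma * Phi[s_next] @ w - Phi[s] @ w
    w += alpha * td_error * Phi[s]             # semi-gradient TD(0) update
    s = s_next

v_exact = np.linalg.solve(np.eye(n_states) - gamma * P, r)   # true values
print("correlation(TD estimate, exact values):",
      round(np.corrcoef(Phi @ w, v_exact)[0, 1], 3))
```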
arXiv Detail & Related papers (2020-06-08T17:25:22Z)
- Eigendecomposition-Free Training of Deep Networks for Linear Least-Square Problems [107.3868459697569]
We introduce an eigendecomposition-free approach to training a deep network.
We show that our approach is much more robust than explicit differentiation of the eigendecomposition.
Our method has better convergence properties and yields state-of-the-art results.
arXiv Detail & Related papers (2020-04-15T04:29:34Z)
- Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.