Why Deep Jacobian Spectra Separate: Depth-Induced Scaling and Singular-Vector Alignment
- URL: http://arxiv.org/abs/2602.12384v2
- Date: Mon, 16 Feb 2026 11:03:14 GMT
- Title: Why Deep Jacobian Spectra Separate: Depth-Induced Scaling and Singular-Vector Alignment
- Authors: Nathanaël Haas, François Gatine, Augustin M Cosse, Zied Bouraoui,
- Abstract summary: We show that depth-induced exponential scaling of ordered singular values and strong spectral separation can be used to study deep Jacobians. We further show that sufficiently strong separation forces singular-vector alignment in matrix products, yielding an approximately shared singular basis for intermediate Jacobians. Experiments in fixed-gates settings validate the predicted scaling, alignment, and resulting dynamics.
- Score: 10.515277266852838
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding why gradient-based training in deep networks exhibits strong implicit bias remains challenging, in part because tractable singular-value dynamics are typically available only for balanced deep linear models. We propose an alternative route based on two theoretically grounded and empirically testable signatures of deep Jacobians: depth-induced exponential scaling of ordered singular values and strong spectral separation. Adopting a fixed-gates view of piecewise-linear networks, where Jacobians reduce to products of masked linear maps within a single activation region, we prove the existence of Lyapunov exponents governing the top singular values at initialization, give closed-form expressions in a tractable masked model, and quantify finite-depth corrections. We further show that sufficiently strong separation forces singular-vector alignment in matrix products, yielding an approximately shared singular basis for intermediate Jacobians. Together, these results motivate an approximation regime in which singular-value dynamics become effectively decoupled, mirroring classical balanced deep-linear analyses without requiring balancing. Experiments in fixed-gates settings validate the predicted scaling, alignment, and resulting dynamics, supporting a mechanistic account of emergent low-rank Jacobian structure as a driver of implicit bias.
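A minimal NumPy sketch of the two signatures under the fixed-gates view (the width, depth, gate probability, and i.i.d. Gaussian initialization below are illustrative assumptions, not the paper's exact experimental setup): within a single activation region the Jacobian reduces to a product of masked linear maps, so we can track how its ordered singular values scale with depth and whether its top singular vectors stabilize.

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth = 200, 40

# Fixed-gates view: inside one activation region of a ReLU network, the
# input-output Jacobian is the product J = D_L W_L ... D_1 W_1, where each
# D_l is a fixed 0/1 diagonal gate matrix (here: i.i.d. fair coin gates).
Ws = [rng.normal(0.0, 1.0 / np.sqrt(width), (width, width)) for _ in range(depth)]
gates = [(rng.random(width) < 0.5).astype(float) for _ in range(depth)]

partial_products = []
J = np.eye(width)
for W, g in zip(Ws, gates):
    J = (g[:, None] * W) @ J          # apply D_l @ W_l (row masking = diag gate)
    partial_products.append(J.copy())

# Signature 1: depth-induced exponential scaling. The finite-depth estimates
# log(sigma_i) / l of the top Lyapunov exponents should flatten out with
# depth, and distinct limits produce strong spectral separation.
for l in (9, 19, 39):
    s = np.linalg.svd(partial_products[l], compute_uv=False)[:4]
    print(f"depth {l + 1:2d}: exponent estimates", np.round(np.log(s) / (l + 1), 3))

# Signature 2: singular-vector alignment. Under strong separation, the top
# right-singular vector of the partial products stabilizes with depth, i.e.
# intermediate Jacobians approximately share a singular basis on the input side.
def top_right_singular_vector(M):
    return np.linalg.svd(M)[2][0]

v_half = top_right_singular_vector(partial_products[depth // 2 - 1])
v_full = top_right_singular_vector(partial_products[-1])
print("alignment |<v_half, v_full>| =", round(abs(v_half @ v_full), 3))
```

If the exponent estimates separate cleanly, the alignment score should approach 1; that is the regime in which the paper's approximately decoupled singular-value dynamics become a reasonable approximation.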
Related papers
- Orthogonalized Policy Optimization: Decoupling Sampling Geometry from Optimization Geometry in RLHF [0.0]
Large language model alignment objectives are often presented as a collection of distinct algorithms, such as PPO, DPO, IPO, and their variants. In this work, we argue that this diversity obscures a simpler underlying structure. We show that this entanglement is not merely a modeling convenience but a source of systematic instability.
arXiv Detail & Related papers (2026-01-18T13:57:44Z) - VIKING: Deep variational inference with stochastic projections [48.946143517489496]
Variational mean field approximations tend to struggle with contemporary overparametrized deep neural networks. We propose a simple variational family that considers two independent linear subspaces of the parameter space. This allows us to build a fully-correlated approximate posterior reflecting the overparametrization.
arXiv Detail & Related papers (2025-10-27T15:38:35Z) - Variational Deep Learning via Implicit Regularization [11.296548737163599]
Modern deep learning models generalize remarkably well in-distribution, despite being overparametrized and trained with little to no explicit regularization. We propose to regularize variational neural networks solely by relying on the implicit bias of (stochastic) gradient descent.
arXiv Detail & Related papers (2025-05-26T17:15:57Z) - An Analytical Characterization of Sloppiness in Neural Networks: Insights from Linear Models [18.99511760351873]
Recent experiments have shown that training trajectories of multiple deep neural networks evolve on a remarkably low-dimensional "hyper-ribbon-like" manifold. Inspired by the similarities in the training trajectories of deep networks and linear networks, we analytically characterize this phenomenon for the latter. We show that the geometry of this low-dimensional manifold is controlled by (i) the decay rate of the eigenvalues of the input correlation matrix of the training data, (ii) the relative scale of the ground-truth output to the weights at the beginning of training, and (iii) the number of steps of gradient descent.
arXiv Detail & Related papers (2025-05-13T19:20:19Z) - Dynamical heterogeneity and large deviations in the open quantum East glass model from tensor networks [0.0]
We study the non-equilibrium dynamics of the dissipative quantum East model via numerical tensor networks.
We use matrix product states to represent evolution under quantum-jump unravellings for sizes beyond those accessible to exact diagonalisation.
arXiv Detail & Related papers (2024-04-04T18:41:18Z) - Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
arXiv Detail & Related papers (2023-10-20T12:45:12Z) - Towards Training Without Depth Limits: Batch Normalization Without Gradient Explosion [83.90492831583997]
We show that a batch-normalized network can keep the optimal signal propagation properties, but avoid exploding gradients in depth.
We use a Multi-Layer Perceptron (MLP) with linear activations and batch-normalization that provably has bounded gradients at any depth.
We also design an activation shaping scheme that empirically achieves the same properties for certain non-linear activations.
arXiv Detail & Related papers (2023-10-03T12:35:02Z) - Convex Analysis of the Mean Field Langevin Dynamics [49.66486092259375]
A convergence rate analysis of the mean field Langevin dynamics is presented.
A proximal Gibbs distribution $p_q$ associated with the dynamics allows us to develop a convergence theory parallel to classical results in convex optimization.
arXiv Detail & Related papers (2022-01-25T17:13:56Z) - Efficient Semi-Implicit Variational Inference [65.07058307271329]
We propose an efficient and scalable semi-implicit variational inference (SIVI) framework.
Our method optimizes a rigorous lower bound on SIVI's model evidence.
arXiv Detail & Related papers (2021-01-15T11:39:09Z) - Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)