Closing the Curvature Gap: Full Transformer Hessians and Their Implications for Scaling Laws
- URL: http://arxiv.org/abs/2510.16927v1
- Date: Sun, 19 Oct 2025 16:54:00 GMT
- Title: Closing the Curvature Gap: Full Transformer Hessians and Their Implications for Scaling Laws
- Authors: Egor Petrov, Nikita Kiselev, Vladislav Meshkov, Andrey Grabovoy
- Abstract summary: We extend Hessian theory to the full Transformer architecture. This work establishes a new foundation for theoretical and empirical investigations of optimization in large-scale deep learning.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The lack of theoretical results for Layer Normalization and feedforward Hessians has left a gap in the study of Transformer optimization landscapes. We address this by deriving explicit second-order expressions for these components, thereby completing the Hessian characterization of full Transformer blocks. Our results generalize prior self-attention analyses and yield estimates of the role of each sublayer in curvature propagation. We demonstrate how these Hessian structures inform both convergence dynamics and the empirical scaling laws governing large-model performance. Further, we propose a Taylor-expansion-based framework for analyzing loss differences to quantify convergence trajectories. By extending Hessian theory to the full Transformer architecture, this work establishes a new foundation for theoretical and empirical investigations of optimization in large-scale deep learning.
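To make the second-order machinery concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' code): it forms the exact parameter Hessian of a toy LayerNorm-plus-feedforward sublayer with a residual connection via autograd, and checks the second-order Taylor approximation of the loss difference, $\Delta\mathcal{L} \approx g^\top \delta + \tfrac{1}{2}\delta^\top H \delta$, against the true change using a Hessian-vector product. All dimensions, the MSE loss, and the parameter names are illustrative assumptions.

```python
import math
import torch
from torch.autograd.functional import hessian, hvp

torch.manual_seed(0)
d, n, h = 4, 6, 8                      # embedding dim, sequence length, FFN width (toy sizes)
x = torch.randn(n, d)                  # token representations entering the block
y = torch.randn(n, d)                  # illustrative regression targets

# Parameter layout: LayerNorm gain/bias, then the two feedforward layers.
shapes = [(d,), (d,), (h, d), (h,), (d, h), (d,)]   # gamma, beta, W1, b1, W2, b2

def unpack(theta):
    params, i = [], 0
    for s in shapes:
        k = math.prod(s)
        params.append(theta[i:i + k].reshape(s))
        i += k
    return params

def loss(theta):
    gamma, beta, W1, b1, W2, b2 = unpack(theta)
    z = torch.nn.functional.layer_norm(x, (d,), gamma, beta)      # LayerNorm sublayer
    ffn = torch.nn.functional.gelu(z @ W1.T + b1) @ W2.T + b2     # feedforward sublayer
    out = x + ffn                                                 # residual connection
    return torch.nn.functional.mse_loss(out, y)

theta = torch.cat([torch.ones(d), torch.zeros(d),
                   0.1 * torch.randn(h, d).reshape(-1), torch.zeros(h),
                   0.1 * torch.randn(d, h).reshape(-1), torch.zeros(d)])

# Exact parameter Hessian of the LayerNorm + feedforward block, via autograd.
H = hessian(loss, theta)                                   # (num_params, num_params)
print("Hessian shape:", tuple(H.shape))

# Second-order Taylor approximation of the loss difference under a small step,
# dL ~ g.delta + 0.5 * delta.H.delta, using a Hessian-vector product.
delta = 1e-2 * torch.randn(theta.numel())
theta_g = theta.clone().requires_grad_(True)
g = torch.autograd.grad(loss(theta_g), theta_g)[0]
_, Hdelta = hvp(loss, theta, delta)                        # H @ delta
taylor = (g @ delta + 0.5 * delta @ Hdelta).item()
exact = (loss(theta + delta) - loss(theta)).item()
print(f"exact dL = {exact:.3e}   second-order Taylor = {taylor:.3e}")
```

For models of realistic size the explicit Hessian is infeasible (its size is quadratic in the number of parameters), so curvature quantities of this kind are typically probed through Hessian-vector products alone, as in the last few lines of the sketch.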
Related papers
- Mixture-of-Transformers Learn Faster: A Theoretical Study on Classification Problems [59.94955550958074]
We study a tractable theoretical framework in which each transformer block acts as an expert governed by a continuously trained gating network. We show that expert specialization reduces gradient conflicts and makes each subtask strongly convex. We prove that training drives the expected prediction loss to near zero in $O(\log(\epsilon^{-1}))$ steps, significantly improving over the $O(\epsilon^{-1})$ rate for a single transformer.
arXiv Detail & Related papers (2025-10-30T21:07:36Z) - Diffusion Bridge or Flow Matching? A Unifying Framework and Comparative Analysis [57.614436689939986]
Diffusion Bridge and Flow Matching have both demonstrated compelling empirical performance in transformation between arbitrary distributions. We recast their frameworks through the lens of Optimal Control and prove that the cost function of the Diffusion Bridge is lower. To corroborate these theoretical claims, we propose a novel, powerful architecture for Diffusion Bridge built on a latent Transformer.
arXiv Detail & Related papers (2025-09-29T09:45:22Z) - Provable In-Context Vector Arithmetic via Retrieving Task Concepts [53.685764040547625]
We show how nonlinear residual transformers trained via gradient descent on cross-entropy loss perform factual-recall ICL tasks via vector arithmetic. These results elucidate the advantages of transformers over static embedding predecessors.
arXiv Detail & Related papers (2025-08-13T13:54:44Z) - On the Convergence of Gradient Descent on Learning Transformers with Residual Connections [26.02176724426513]
We analyze the convergence behavior of a structurally complete yet single-layer Transformer, comprising self-attention, a feedforward network, and residual connections. Residual connections serve to ameliorate the ill-conditioning of the output matrix, an issue stemming from the low-rank structure imposed by the softmax operation.
arXiv Detail & Related papers (2025-06-05T17:10:22Z) - Born a Transformer -- Always a Transformer? On the Effect of Pretraining on Architectural Abilities [58.742178800799614]
We study a family of retrieval and copying tasks inspired by Liu et al. We observe an induction-versus-anti-induction asymmetry, where pretrained models are better at retrieving tokens to the right (induction) than to the left (anti-induction) of a query token. Mechanistic analysis reveals that this asymmetry is connected to differences in the strength of induction versus anti-induction circuits within pretrained transformers.
arXiv Detail & Related papers (2025-05-27T21:36:50Z) - A Theoretical Framework for OOD Robustness in Transformers using Gevrey Classes [5.236910203359897]
We study the robustness of Transformer language models under semantic out-of-distribution shifts. We derive sub-exponential upper bounds on prediction error using the Wasserstein-1 distance and Gevrey-class smoothness.
arXiv Detail & Related papers (2025-04-17T14:59:29Z) - Beyond Progress Measures: Theoretical Insights into the Mechanism of Grokking [50.465604300990904]
Grokking refers to the abrupt improvement in test accuracy after extended overfitting. In this work, we investigate the grokking mechanism underlying the Transformer in the task of prime number operations.
arXiv Detail & Related papers (2025-04-04T04:42:38Z) - Constrained belief updates explain geometric structures in transformer representations [1.1666234644810893]
We integrate the model-agnostic theory of optimal prediction with mechanistic interpretability to analyze transformers trained on a tractable family of hidden Markov models. Our analysis focuses on single-layer transformers, revealing how the first attention layer implements constrained updates. We show how both the algorithmic behavior and the underlying geometry of these representations can be theoretically predicted in detail.
arXiv Detail & Related papers (2025-02-04T03:03:54Z) - Dynamics of Transient Structure in In-Context Linear Regression Transformers [0.5242869847419834]
We show that when transformers are trained on in-context linear regression tasks with intermediate task diversity, they behave like ridge regression before specializing to the tasks in their training distribution. This transition from a general solution to a specialized solution is revealed by joint trajectory principal component analysis. We empirically validate this explanation by measuring the model complexity of our transformers as defined by the local learning coefficient.
arXiv Detail & Related papers (2025-01-29T16:32:14Z) - Unraveling the Gradient Descent Dynamics of Transformers [37.096572564254515]
Gradient Descent (GD) can train a Transformer model to achieve a globally optimal solution, especially when the input embedding dimension is large.
We analyze the loss landscape of a single Transformer layer using Softmax and Gaussian attention kernels.
arXiv Detail & Related papers (2024-11-12T04:33:56Z) - Interpreting Affine Recurrence Learning in GPT-style Transformers [54.01174470722201]
In-context learning allows GPT-style transformers to generalize during inference without modifying their weights.
This paper focuses specifically on their ability to learn and predict affine recurrences as an ICL task.
We analyze the model's internal operations using both empirical and theoretical approaches.
arXiv Detail & Related papers (2024-10-22T21:30:01Z) - Non-asymptotic Convergence of Training Transformers for Next-token Prediction [48.9399496805422]
Transformers have achieved extraordinary success in modern machine learning due to their excellent ability to handle sequential data.
This paper provides a fine-grained non-asymptotic analysis of the training dynamics of a one-layer transformer.
We show that the trained transformer retains non-trivial next-token prediction ability under dataset shift.
arXiv Detail & Related papers (2024-09-25T20:22:06Z) - Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
arXiv Detail & Related papers (2023-10-20T12:45:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.