Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds
- URL: http://arxiv.org/abs/2512.22473v1
- Date: Sat, 27 Dec 2025 05:31:44 GMT
- Title: Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds
- Authors: Naman Aggarwal, Siddhartha R. Dalal, Vishal Misra
- Abstract summary: We show how cross-entropy training reshapes attention scores and value vectors in a transformer attention head. Our core result is an advantage-based routing law for attention scores. We show that this coupled specialization behaves like a two-timescale EM procedure.
- Score: 0.4779196219827507
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers empirically perform precise probabilistic reasoning in carefully constructed ``Bayesian wind tunnels'' and in large-scale language models, yet the mechanisms by which gradient-based learning creates the required internal geometry remain opaque. We provide a complete first-order analysis of how cross-entropy training reshapes attention scores and value vectors in a transformer attention head. Our core result is an \emph{advantage-based routing law} for attention scores, \[ \frac{\partial L}{\partial s_{ij}} = \alpha_{ij}\bigl(b_{ij}-\mathbb{E}_{\alpha_i}[b]\bigr), \qquad b_{ij} := u_i^\top v_j, \] coupled with a \emph{responsibility-weighted update} for values, \[ \Delta v_j = -\eta\sum_i \alpha_{ij} u_i, \] where $u_i$ is the upstream gradient at position $i$ and $\alpha_{ij}$ are the attention weights. These equations induce a positive feedback loop in which routing and content specialize together: queries route more strongly to values that are above average for their error signal, and those values are pulled toward the queries that use them. We show that this coupled specialization behaves like a two-timescale EM procedure: attention weights implement an E-step (soft responsibilities), while values implement an M-step (responsibility-weighted prototype updates), with queries and keys adjusting the hypothesis frame. Through controlled simulations, including a sticky Markov-chain task where we compare a closed-form EM-style update to standard SGD, we demonstrate that the same gradient dynamics that minimize cross-entropy also sculpt the low-dimensional manifolds identified in our companion work as implementing Bayesian inference. This yields a unified picture in which optimization (gradient flow) gives rise to geometry (Bayesian manifolds), which in turn supports function (in-context probabilistic reasoning).
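The routing law in the abstract follows from differentiating the softmax: with $o_i=\sum_k \alpha_{ik} v_k$, the score gradient reduces to $\alpha_{ij}(b_{ij}-\mathbb{E}_{\alpha_i}[b])$. This can be checked numerically against finite differences; the sketch below uses a toy quadratic loss, and the dimensions, targets, and learning rate are illustrative stand-ins rather than the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 3
s = rng.normal(size=(n, n))   # attention scores s_ij
V = rng.normal(size=(n, d))   # value vectors v_j
T = rng.normal(size=(n, d))   # arbitrary targets for a toy quadratic loss

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def loss(scores):
    O = softmax(scores) @ V   # o_i = sum_j alpha_ij v_j
    return 0.5 * np.sum((O - T) ** 2)

alpha = softmax(s)
U = alpha @ V - T             # upstream gradient u_i = dL/do_i
B = U @ V.T                   # b_ij = u_i . v_j
# advantage-based routing law: dL/ds_ij = alpha_ij (b_ij - E_{alpha_i}[b])
analytic = alpha * (B - np.sum(alpha * B, axis=1, keepdims=True))

# central finite-difference check of dL/ds_ij
eps = 1e-6
numeric = np.zeros_like(s)
for i in range(n):
    for j in range(n):
        sp = s.copy(); sp[i, j] += eps
        sm = s.copy(); sm[i, j] -= eps
        numeric[i, j] = (loss(sp) - loss(sm)) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-5)

# responsibility-weighted value update: Delta v_j = -eta * sum_i alpha_ij u_i
eta = 0.1
delta_V = -eta * alpha.T @ U
```

The last line is the M-step-like update from the abstract: each value vector is pulled along the (negated) upstream gradients of the positions that attend to it, weighted by their responsibilities $\alpha_{ij}$.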
Related papers
- From Core to Detail: Unsupervised Disentanglement with Entropy-Ordered Flows [8.351253396371686]
Entropy-ordered flows (EOFlows) order latent dimensions by their explained entropy, analogous to PCA's explained variance. EOFlows build on insights from Independent Mechanism Analysis, Principal Component Flows, and Manifold Entropic Metrics. We combine likelihood-based training with local Jacobian regularization and noise augmentation into a method that scales well to high-dimensional data such as images.
arXiv Detail & Related papers (2026-02-06T18:41:03Z) - Process-Tensor Tomography of SGD: Measuring Non-Markovian Memory via Back-Flow of Distinguishability [1.078600700827543]
We build a simple, model-agnostic witness of training memory based on back-flow of distinguishability. We observe consistent positive back-flow with tight bootstrap confidence intervals, amplification under higher momentum, and more micro-steps. We position this as a principled diagnostic and empirical evidence that practical SGD deviates from the Markov idealization.
arXiv Detail & Related papers (2026-01-23T09:03:25Z) - The Bayesian Geometry of Transformer Attention [0.4779196219827507]
We build controlled environments where the true posterior is known in closed form and memorization is provably impossible. Small transformers reproduce Bayesian posteriors with $10^{-3}$--$10^{-4}$ bit accuracy, while capacity-matched baselines fail by orders of magnitude.
arXiv Detail & Related papers (2025-12-27T05:28:58Z) - Two-Scale Latent Dynamics for Recurrent-Depth Transformers [18.852161704625562]
We study the geometry of recurrent-depth transformers, which scale test-time compute by iterating latent computations before emitting tokens. Across checkpoints, our measurements show that loop steps become smaller and increasingly orthogonal to one another. These dynamics motivate an early-exit mechanism based on the model's second-order difference in step size.
arXiv Detail & Related papers (2025-09-27T14:01:40Z) - Tracing the Representation Geometry of Language Models from Pretraining to Post-training [22.18942718274405]
We take a spectral approach to investigate the geometry of learned representations across pretraining and post-training. We uncover a consistent non-monotonic sequence of three geometric phases during autoregressive pretraining. Post-training further transforms geometry: SFT and DPO drive "entropy-seeking" dynamics to integrate specific instructional or preferential data.
arXiv Detail & Related papers (2025-09-27T00:46:29Z) - Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data.
We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
arXiv Detail & Related papers (2024-09-09T18:10:26Z) - CWF: Consolidating Weak Features in High-quality Mesh Simplification [50.634070540791555]
We propose a smooth functional that simultaneously considers all of these requirements.
The functional comprises a normal anisotropy term and a Centroidal Voronoi Tessellation (CVT) energy term.
arXiv Detail & Related papers (2024-04-24T05:37:17Z) - Implicit Bias and Fast Convergence Rates for Self-attention [26.766649949420746]
We study the fundamental optimization principles of self-attention, the defining mechanism of transformers. We analyze the implicit bias of gradient-based methods in a self-attention layer with a decoder in a linear classification setting.
arXiv Detail & Related papers (2024-02-08T15:15:09Z) - Transformers as Support Vector Machines [54.642793677472724]
We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem.
We characterize the implicit bias of 1-layer transformers optimized with gradient descent.
We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
arXiv Detail & Related papers (2023-08-31T17:57:50Z) - Graph Signal Sampling for Inductive One-Bit Matrix Completion: a Closed-form Solution [112.3443939502313]
We propose a unified graph signal sampling framework which enjoys the benefits of graph signal analysis and processing.
The key idea is to transform each user's ratings on the items to a function (signal) on the vertices of an item-item graph.
For the online setting, we develop a Bayesian extension, i.e., BGS-IMC which considers continuous random Gaussian noise in the graph Fourier domain.
arXiv Detail & Related papers (2023-02-08T08:17:43Z) - Hybrid Routing Transformer for Zero-Shot Learning [83.64532548391]
This paper presents a novel transformer encoder-decoder model, called the hybrid routing transformer (HRT). In the HRT encoder, we embed an active attention, constructed from both bottom-up and top-down dynamic routing pathways, to generate the attribute-aligned visual feature. In the HRT decoder, we use static routing to calculate the correlation among the attribute-aligned visual features, the corresponding attribute semantics, and the class attribute vectors to generate the final class label predictions.
arXiv Detail & Related papers (2022-03-29T07:55:08Z) - Gaussian MRF Covariance Modeling for Efficient Black-Box Adversarial Attacks [86.88061841975482]
We study the problem of generating adversarial examples in a black-box setting, where we only have access to a zeroth order oracle.
We use this setting to find fast one-step adversarial attacks, akin to a black-box version of the Fast Gradient Sign Method (FGSM).
We show that the method uses fewer queries and achieves higher attack success rates than the current state of the art.
arXiv Detail & Related papers (2020-10-08T18:36:51Z)
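The black-box FGSM entry above relies on querying only loss values from a zeroth-order oracle. A minimal sketch of that idea is a coordinate-wise two-point gradient-sign estimate followed by a single FGSM step; note this is a naive illustration, not the paper's Gaussian-MRF covariance estimator, and `oracle_loss` is a hypothetical stand-in for the black-box model.

```python
import numpy as np

rng = np.random.default_rng(1)

def oracle_loss(x):
    # Hypothetical stand-in for the black-box model's loss:
    # we may only query values, never gradients.
    return float(np.sum(np.sin(x) + 0.5 * x ** 2))

def zo_fgsm(x, eps_attack=0.1, delta=1e-4, query_budget=200):
    """One-step FGSM-style attack using a central finite-difference
    estimate of the gradient sign (two oracle queries per coordinate)."""
    g = np.zeros_like(x)
    n_coords = min(len(x), query_budget // 2)
    for i in range(n_coords):
        e = np.zeros_like(x)
        e[i] = delta
        g[i] = (oracle_loss(x + e) - oracle_loss(x - e)) / (2 * delta)
    # step in the estimated ascent direction to increase the loss
    return x + eps_attack * np.sign(g)

x = rng.normal(size=8)
x_adv = zo_fgsm(x)
```

Each coordinate costs two queries, which is exactly the inefficiency the paper's covariance modeling targets: a structured gradient prior lets the sign be estimated from far fewer oracle calls.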
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.