Related papers: The Bayesian Geometry of Transformer Attention

URL: http://arxiv.org/abs/2512.22471v1
Date: Sat, 27 Dec 2025 05:28:58 GMT
Title: The Bayesian Geometry of Transformer Attention
Authors: Naman Aggarwal, Siddhartha R. Dalal, Vishal Misra
Abstract summary: We build controlled environments where the true posterior is known in closed form and memorization is provably impossible. Small transformers reproduce Bayesian posteriors with $10^{-3}$--$10^{-4}$ bit accuracy, while capacity-matched MLPs fail by orders of magnitude.
Score: 0.4779196219827507
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Transformers often appear to perform Bayesian reasoning in context, but verifying this rigorously has been impossible: natural data lack analytic posteriors, and large models conflate reasoning with memorization. We address this by constructing Bayesian wind tunnels -- controlled environments where the true posterior is known in closed form and memorization is provably impossible. In these settings, small transformers reproduce Bayesian posteriors with $10^{-3}$--$10^{-4}$ bit accuracy, while capacity-matched MLPs fail by orders of magnitude, establishing a clear architectural separation. Across two tasks -- bijection elimination and Hidden Markov Model (HMM) state tracking -- we find that transformers implement Bayesian inference through a consistent geometric mechanism: residual streams serve as the belief substrate, feed-forward networks perform the posterior update, and attention provides content-addressable routing. Geometric diagnostics reveal orthogonal key bases, progressive query--key alignment, and a low-dimensional value manifold parameterized by posterior entropy. During training this manifold unfurls while attention patterns remain stable, a frame--precision dissociation predicted by recent gradient analyses. Taken together, these results demonstrate that hierarchical attention realizes Bayesian inference by geometric design, explaining both the necessity of attention and the failure of flat architectures. Bayesian wind tunnels provide a foundation for mechanistically connecting small, verifiable systems to reasoning phenomena observed in large language models.
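Illustration (not from the paper): the abstract's central idea is that the true posterior is available in closed form, so a model's prediction can be scored against it in bits. The sketch below assumes one plausible reading of the bijection elimination task: a hidden bijection on {0, ..., n-1} is drawn uniformly, some input-output pairs are revealed, and the posterior over the output of an unseen input is uniform over the outputs not yet used. The function names, and the choice of KL divergence as the error measure, are assumptions made for illustration and may differ from the authors' setup.

```python
import math

def elimination_posterior(n, observed):
    """Posterior over pi(x) for an unseen input x.

    Assumes pi is a uniformly random bijection on range(n) and
    `observed` maps already-revealed inputs to their outputs.
    """
    used = set(observed.values())
    remaining = n - len(used)
    # Outputs already used are eliminated; the rest are equally likely.
    return [0.0 if y in used else 1.0 / remaining for y in range(n)]

def entropy_bits(p):
    """Shannon entropy of distribution p, in bits."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def kl_bits(p, q):
    """KL(p || q) in bits; assumes q > 0 wherever p > 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

# Four symbols, one revealed pair (0 -> 2): three outputs remain possible.
posterior = elimination_posterior(4, {0: 2})
```

Under these assumptions the posterior entropy after k revealed pairs is log2(n - k) bits, and `kl_bits(posterior, model_prediction)` gives one way to express a model's deviation from the exact posterior in bits.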

Related papers

Latent Object Permanence: Topological Phase Transitions, Free-Energy Principles, and Renormalization Group Flows in Deep Transformer Manifolds [0.5729426778193398]
We study the emergence of multi-step reasoning in deep Transformer language models through a geometric and statistical-physics lens. We formalize the forward pass as a discrete coarse-graining map and relate the appearance of stable "concept basins" to fixed points of this renormalization-like dynamics. The resulting low-entropy regime is characterized by a spectral tail collapse and by the formation of transient, reusable object-like structures in representation space.
arXiv Detail & Related papers (2026-01-16T23:11:02Z)
Geometric Scaling of Bayesian Inference in LLMs [0.4779196219827507]
Recent work has shown that small transformers trained in controlled "wind-tunnel" settings can implement exact Bayesian inference. We investigate whether this geometric signature persists in production-grade language models.
arXiv Detail & Related papers (2025-12-27T05:29:55Z)
Reconstructing Multi-Scale Physical Fields from Extremely Sparse Measurements with an Autoencoder-Diffusion Cascade [38.28865883904372]
Cascaded Sensing (Cas-Sensing) is a hierarchical reconstruction framework that integrates an autoencoder-diffusion cascade. A conditional diffusion model, trained with a mask-cascade strategy, generates fine-scale details conditioned on large-scale structures. Experiments on both simulation and real-world datasets demonstrate that Cas-Sensing generalizes well across varying sensor configurations and geometric boundaries.
arXiv Detail & Related papers (2025-12-01T11:46:14Z)
Manifold Percolation: from generative model to Reinforce learning [0.26905021039717986]
Generative modeling is typically framed as learning mapping rules, but from an observer's perspective without access to these rules, the task becomes disentangling the geometric support from the probability distribution. We propose that continuum percolation is uniquely suited to this support analysis, as the sampling process effectively projects high-dimensional density estimation onto a geometric counting problem on the support.
arXiv Detail & Related papers (2025-11-25T17:12:42Z)
VIKING: Deep variational inference with stochastic projections [48.946143517489496]
Variational mean field approximations tend to struggle with contemporary overparametrized deep neural networks. We propose a simple variational family that considers two independent linear subspaces of the parameter space. This allows us to build a fully-correlated approximate posterior reflecting the overparametrization.
arXiv Detail & Related papers (2025-10-27T15:38:35Z)
Generative Model Inversion Through the Lens of the Manifold Hypothesis [98.37040155914595]
Model inversion attacks (MIAs) aim to reconstruct class-representative samples from trained models. Recent generative MIAs utilize generative adversarial networks to learn image priors that guide the inversion process.
arXiv Detail & Related papers (2025-09-24T14:39:25Z)
Cycle-Consistent Helmholtz Machine: Goal-Seeded Simulation via Inverted Inference [5.234742752529437]
We introduce the Cycle-Consistent Helmholtz Machine (C$^2$HM). C$^2$HM reframes inference as a goal-seeded, asymmetric process grounded in structured internal priors. By offering a biologically inspired alternative to classical amortized inference, C$^2$HM reconceives generative modeling as intentional simulation.
arXiv Detail & Related papers (2025-07-03T17:24:27Z)
Generalized Linear Mode Connectivity for Transformers [87.32299363530996]
A striking phenomenon is linear mode connectivity (LMC), where independently trained models can be connected by low- or zero-loss paths. Prior work has predominantly focused on neuron re-ordering through permutations, but such approaches are limited in scope. We introduce a unified framework that captures four symmetry classes: permutations, semi-permutations, transformations, and general invertible maps. This generalization enables, for the first time, the discovery of low- and zero-barrier linear paths between independently trained Vision Transformers and GPT-2 models.
arXiv Detail & Related papers (2025-06-28T01:46:36Z)
Mapping the Edge of Chaos: Fractal-Like Boundaries in The Trainability of Decoder-Only Transformer Models [0.0]
Recent evidence from miniature neural networks suggests that the boundary separating these outcomes displays fractal characteristics. This study extends them to medium-sized, decoder-only transformer architectures by employing a more consistent convergence measure. The results show that the trainability frontier is not a simple threshold; rather, it forms a self-similar yet seemingly random structure at multiple scales.
arXiv Detail & Related papers (2025-01-08T05:24:11Z)
Curve Your Attention: Mixed-Curvature Transformers for Graph Representation Learning [77.1421343649344]
We propose a generalization of Transformers towards operating entirely on the product of constant curvature spaces. We also provide a kernelized approach to non-Euclidean attention, which enables our model to run in time and memory cost linear to the number of nodes and edges.
arXiv Detail & Related papers (2023-09-08T02:44:37Z)
3D Human Pose and Shape Regression with Pyramidal Mesh Alignment Feedback Loop [128.07841893637337]
Regression-based methods have recently shown promising results in reconstructing human meshes from monocular images. Minor deviation in parameters may lead to noticeable misalignment between the estimated meshes and image evidences. We propose a Pyramidal Mesh Alignment Feedback (PyMAF) loop to leverage a feature pyramid and rectify the predicted parameters.
arXiv Detail & Related papers (2021-03-30T17:07:49Z)
Supporting Optimal Phase Space Reconstructions Using Neural Network Architecture for Time Series Modeling [68.8204255655161]
We propose an artificial neural network with a mechanism to implicitly learn the properties of phase spaces. Our approach is either as competitive as or better than most state-of-the-art strategies.
arXiv Detail & Related papers (2020-06-19T21:04:47Z)

This list is automatically generated from the titles and abstracts of the papers on this site.