Directional Optimization Asymmetry in Transformers: A Synthetic Stress Test
- URL: http://arxiv.org/abs/2511.19997v1
- Date: Tue, 25 Nov 2025 07:03:20 GMT
- Title: Directional Optimization Asymmetry in Transformers: A Synthetic Stress Test
- Authors: Mihir Sahasrabudhe
- Abstract summary: Transformers are theoretically reversal-invariant: their function class does not prefer left-to-right over right-to-left mappings. Recent work on temporal asymmetry in LLMs suggests that real-world corpora carry their own arrow of time. This leaves an unresolved question: do directional failures stem from linguistic statistics, or from the architecture itself?
- Score: 0.15229257192293197
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers are theoretically reversal-invariant: their function class does not prefer left-to-right over right-to-left mappings. Yet empirical studies on natural language repeatedly report a "reversal curse," and recent work on temporal asymmetry in LLMs suggests that real-world corpora carry their own arrow of time. This leaves an unresolved question: do directional failures stem from linguistic statistics, or from the architecture itself? We cut through this ambiguity with a fully synthetic, entropy-controlled benchmark designed as a clean-room stress test for directional learning. Using random string mappings with tunable branching factor K, we construct forward tasks with zero conditional entropy and inverse tasks with analytically determined entropy floors. Excess loss above these floors reveals that even scratch-trained GPT-2 models exhibit a strong, reproducible directional optimization gap (e.g., 1.16 nats at K=5), far larger than that of an MLP trained on the same data. Pre-trained initializations shift optimization behavior but do not eliminate this gap, while LoRA encounters a sharp capacity wall on high-entropy inverse mappings. Together, these results isolate a minimal, semantics-free signature of directional friction intrinsic to causal Transformer training, one that persists even when linguistic priors, token frequencies, and corpus-level temporal asymmetries are removed. Our benchmark provides a controlled instrument for dissecting directional biases in modern sequence models and motivates deeper mechanistic study of why inversion remains fundamentally harder for Transformers.
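One natural reading of this construction lends itself to a compact illustration. The following is a minimal sketch, not the authors' released code, of a K-to-1 random string mapping: the forward direction is deterministic (zero conditional entropy), while each target has K candidate sources, so the inverse direction carries an analytic floor of log K nats per prediction and excess loss can be read off directly. String length, alphabet, and dataset size are illustrative assumptions.

```python
# Minimal sketch of the benchmark idea: forward map source -> target is
# deterministic, while K distinct sources share each target, so the inverse
# map has an entropy floor of log(K) nats (assuming a uniform choice of source).
import math
import random

random.seed(0)

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def random_string(length=8):
    return "".join(random.choice(ALPHABET) for _ in range(length))

def build_mapping(num_targets=1000, K=5):
    """Return (forward, inverse_groups): K distinct sources share one target."""
    forward, inverse_groups = {}, {}
    for _ in range(num_targets):
        target = random_string()
        sources = {random_string() for _ in range(K)}
        for s in sources:
            forward[s] = target           # forward: deterministic, H(target|source) = 0
        inverse_groups[target] = sources  # inverse: K candidates, H(source|target) = log K
    return forward, inverse_groups

K = 5
forward, inverse_groups = build_mapping(K=K)
entropy_floor = math.log(K)  # nats; the best achievable inverse loss
print(f"K={K}, inverse entropy floor = {entropy_floor:.3f} nats")
# Excess loss = measured inverse cross-entropy (in nats) - entropy_floor.
```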
Related papers
- $\nabla$-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space [71.23672814629448]
$\nabla$-Reasoner is an iterative generation framework that integrates differentiable optimization over token logits into the decoding loop. $\nabla$-Reasoner achieves over 20% accuracy improvement on a challenging mathematical reasoning benchmark.
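A rough, hypothetical sketch of the general idea rather than the paper's method: treat the next-token logits as free parameters and take a few gradient steps on a differentiable objective inside the decoding loop. The toy "verifier" scores, learning rate, and regularizer below are pure assumptions for illustration.

```python
import torch

torch.manual_seed(0)
vocab_size = 50
logits = torch.randn(vocab_size, requires_grad=True)   # decoder's proposed logits
target_scores = torch.randn(vocab_size)                # toy differentiable verifier signal

optimizer = torch.optim.SGD([logits], lr=0.5)
for step in range(20):
    probs = torch.softmax(logits, dim=-1)
    # maximize expected verifier score; entropy term keeps the distribution from collapsing
    loss = -(probs * target_scores).sum() + 0.01 * (probs * probs.log()).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

next_token = torch.argmax(logits).item()
print("refined next token id:", next_token)
```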
arXiv Detail & Related papers (2026-03-05T08:42:54Z) - Memorization, Emergence, and Explaining Reversal Failures: A Controlled Study of Relational Semantics in LLMs [43.414287127130684]
We propose a synthetic framework that generates text from symmetric/inverse triples, trains GPT-style autoregressive models from scratch, and evaluates memorization, logical inference, and in-context generalization. We find that relational semantics emerge with sufficient logic-bearing supervision, even in shallow (2-3 layer) models, and that successful generalization aligns with stable intermediate-layer signals.
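A minimal sketch, with assumed relation names and templates, of how such a corpus of symmetric/inverse triples might be generated, holding out the implied reversed statements for evaluation:

```python
import random

random.seed(0)
ENTITIES = [f"E{i}" for i in range(100)]
SYMMETRIC = ["is_coauthor_of"]                 # r(a,b) implies r(b,a)
INVERSE = {"is_parent_of": "is_child_of"}      # r(a,b) implies r_inv(b,a)

def make_corpus(num_facts=200):
    seen, held_out = [], []
    for _ in range(num_facts):
        a, b = random.sample(ENTITIES, 2)
        if random.random() < 0.5:
            r = random.choice(SYMMETRIC)
            seen.append(f"{a} {r} {b}")
            held_out.append(f"{b} {r} {a}")            # implied by symmetry
        else:
            r, r_inv = random.choice(list(INVERSE.items()))
            seen.append(f"{a} {r} {b}")
            held_out.append(f"{b} {r_inv} {a}")        # implied by inversion
    return seen, held_out

train_text, eval_text = make_corpus()
print(train_text[0], "|", eval_text[0])
```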
arXiv Detail & Related papers (2026-01-06T11:20:38Z) - Block-Recurrent Dynamics in Vision Transformers [42.261020313952976]
We argue that trained ViTs admit a block-recurrent depth structure such that the computation of the original $L$ blocks can be accurately rewritten using only $k \ll L$ distinct blocks applied recurrently. We train a Raptor model to recover 96% of DINOv2 ImageNet-1k linear-probe accuracy using only 2 blocks at equivalent computational cost.
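A toy sketch of the block-recurrent idea (block internals and sizes are placeholders, not the Raptor architecture): k shared blocks are applied repeatedly so that a depth-L computation is expressed with k << L distinct parameter sets.

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.mlp(self.norm(x))   # residual update, stands in for attn + MLP

class BlockRecurrentEncoder(nn.Module):
    def __init__(self, dim=64, k=2, total_depth=12):
        super().__init__()
        assert total_depth % k == 0
        self.blocks = nn.ModuleList(ToyBlock(dim) for _ in range(k))
        self.repeats = total_depth // k      # each shared block is applied this many times

    def forward(self, x):
        for block in self.blocks:
            for _ in range(self.repeats):
                x = block(x)
        return x

x = torch.randn(1, 16, 64)                   # (batch, tokens, dim)
print(BlockRecurrentEncoder()(x).shape)
```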
arXiv Detail & Related papers (2025-12-23T00:18:23Z) - Pay Attention Later: From Vector Space Diffusion to Linearithmic Spectral Phase-Locking [0.0]
Standard Transformers suffer from a "Semantic Alignment Tax". We introduce the Phase-Resonant Intelligent Spectral Model (PRISM). PRISM encodes semantic identity as resonant frequencies in the complex domain ($\mathbb{C}^d$) and replaces quadratic self-attention with linearithmic O(N log N) Gated Harmonic Convolutions.
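A generic sketch of a gated FFT-based token-mixing layer with O(N log N) cost; the actual PRISM parameterization is not reproduced here, so the filter and gate shapes are assumptions.

```python
import torch

def gated_harmonic_conv(x, spectral_filter, gate_weight):
    """x: (batch, seq, dim); spectral_filter: (seq, dim) complex; gate_weight: (dim, dim)."""
    X = torch.fft.fft(x.to(torch.complex64), dim=1)        # to frequency domain, O(N log N)
    Y = torch.fft.ifft(X * spectral_filter, dim=1).real    # pointwise spectral mixing
    gate = torch.sigmoid(x @ gate_weight)                   # content-dependent gating
    return gate * Y

batch, seq, dim = 2, 128, 32
x = torch.randn(batch, seq, dim)
filt = torch.randn(seq, dim, dtype=torch.complex64)
gate_w = torch.randn(dim, dim)
print(gated_harmonic_conv(x, filt, gate_w).shape)  # (2, 128, 32)
```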
arXiv Detail & Related papers (2025-12-01T02:46:15Z) - Test time training enhances in-context learning of nonlinear functions [51.56484100374058]
Test-time training (TTT) enhances model performance by explicitly updating designated parameters prior to each prediction. We investigate the combination of TTT with in-context learning (ICL), where the model is given a few examples from the target distribution at inference time.
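A toy sketch of the TTT-plus-ICL setting under simple assumptions (linear model, squared loss): a copy of the designated parameters is adapted on the prompt's in-context (x, y) pairs before the query is predicted.

```python
import torch

torch.manual_seed(0)
d = 8
w_star = torch.randn(d)                                  # unknown target function
ctx_x = torch.randn(16, d)
ctx_y = ctx_x @ w_star + 0.01 * torch.randn(16)          # in-context examples
query = torch.randn(d)

w = torch.zeros(d, requires_grad=True)                   # "designated parameters"
opt = torch.optim.SGD([w], lr=0.1)
for _ in range(100):                                     # test-time training loop
    loss = ((ctx_x @ w - ctx_y) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print("TTT prediction:", (query @ w).item(), "target:", (query @ w_star).item())
```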
arXiv Detail & Related papers (2025-09-30T03:56:44Z) - Two failure modes of deep transformers and how to avoid them: a unified theory of signal propagation at initialisation [8.973965016201822]
Finding the right initialisation for neural networks is crucial to ensure smooth training and good performance. In transformers, the wrong initialisation can lead to one of two failure modes of self-attention layers: rank collapse, where all tokens collapse into similar representations, and entropy collapse, where highly concentrated attention scores lead to instability. Here, we provide an analytical theory of signal propagation through deep transformers with self-attention, layer normalisation, skip connections and gradients.
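A small numerical sketch of the two failure modes on a toy single-head attention map (the scale parameter below is an assumption standing in for initialisation choices): small scales drive rank collapse (outputs cluster around their mean), large scales drive entropy collapse (near one-hot attention rows).

```python
import torch

def diagnostics(x, scale):
    q = k = x * scale                                    # toy query/key maps
    attn = torch.softmax(q @ k.T / x.shape[-1] ** 0.5, dim=-1)
    out = attn @ x
    # rank-collapse proxy: spread of outputs around their mean (0 => collapsed)
    spread = (out - out.mean(0, keepdim=True)).norm() / out.norm()
    # entropy-collapse proxy: mean attention entropy (0 => one-hot rows)
    entropy = -(attn * attn.clamp_min(1e-12).log()).sum(-1).mean()
    return spread.item(), entropy.item()

x = torch.randn(32, 64)                                  # 32 tokens, width 64
for scale in (0.1, 1.0, 10.0):
    s, h = diagnostics(x, scale)
    print(f"scale={scale:5.1f}  token-spread={s:.3f}  attn-entropy={h:.3f}")
```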
arXiv Detail & Related papers (2025-05-30T08:18:23Z) - Born a Transformer -- Always a Transformer? On the Effect of Pretraining on Architectural Abilities [58.742178800799614]
We study a family of $\textit{retrieval}$ and $\textit{copying}$ tasks inspired by Liu et al. We observe an $\textit{induction-versus-anti-induction}$ asymmetry, where pretrained models are better at retrieving tokens to the right (induction) than the left (anti-induction) of a query token. Mechanistic analysis reveals that this asymmetry is connected to the differences in the strength of induction versus anti-induction circuits within pretrained transformers.
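A minimal sketch of the kind of probe involved, with an assumed prompt format: given a query token that occurs in a random sequence, ask for the token immediately to its right (induction) or immediately to its left (anti-induction).

```python
import random

random.seed(0)
VOCAB = [chr(c) for c in range(ord("a"), ord("z") + 1)]

def make_example(direction="right", length=20):
    seq = random.sample(VOCAB, length)                 # distinct tokens, so the query is unique
    pos = random.randrange(1, length - 1)
    query = seq[pos]
    answer = seq[pos + 1] if direction == "right" else seq[pos - 1]
    prompt = " ".join(seq) + f" | {query} ->"
    return prompt, answer

for d in ("right", "left"):
    p, a = make_example(d)
    print(d, "|", p, a)
```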
arXiv Detail & Related papers (2025-05-27T21:36:50Z) - Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms such as low-rank computation achieve impressive performance when learning Transformer-based adaptation.
We analyze how magnitude-based pruning affects generalization while improving computational efficiency.
We conclude that proper magnitude-based pruning has only a slight effect on the testing performance.
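A small numpy sketch of the two mechanisms the analysis concerns, low-rank computation and magnitude-based pruning, applied to a toy weight update; matrix sizes, rank, and sparsity level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W_update = rng.standard_normal((64, 64)) @ rng.standard_normal((64, 64)) * 0.01

# low-rank computation: keep only the top-r singular directions
U, S, Vt = np.linalg.svd(W_update, full_matrices=False)
r = 8
W_lowrank = U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]

# magnitude-based pruning: zero out the smallest 90% of entries
threshold = np.quantile(np.abs(W_update), 0.9)
W_sparse = np.where(np.abs(W_update) >= threshold, W_update, 0.0)

print("low-rank rel. error:", np.linalg.norm(W_update - W_lowrank) / np.linalg.norm(W_update))
print("sparse rel. error:  ", np.linalg.norm(W_update - W_sparse) / np.linalg.norm(W_update))
```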
arXiv Detail & Related papers (2024-06-24T23:00:58Z) - On Mesa-Optimization in Autoregressively Trained Transformers: Emergence and Capability [34.43255978863601]
Several works suggest that transformers learn a mesa-optimizer during autoregressive training.
We show that a stronger assumption related to the moments of the data is a sufficient and necessary condition for the learned mesa-optimizer to perform.
arXiv Detail & Related papers (2024-05-27T05:41:06Z) - Transformers as Support Vector Machines [54.642793677472724]
We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem.
We characterize the implicit bias of 1-layer transformers optimized with gradient descent.
We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
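Schematically, and with notation assumed rather than copied from the paper, the claimed equivalence says that gradient descent on the attention weights converges in direction to a max-margin separator between each sequence's selected token and the rest:

```latex
% Schematic form of the attention-SVM equivalence; notation is assumed here.
% W is the (combined) key-query attention weight matrix, x_{i,t} the tokens of
% sequence i, z_i its query token, and alpha_i the index of the selected token.
\begin{equation*}
  W^{\mathrm{SVM}} \;=\; \arg\min_{W}\ \|W\|
  \quad \text{s.t.} \quad
  \bigl(x_{i,\alpha_i} - x_{i,t}\bigr)^{\top} W\, z_i \;\ge\; 1
  \quad \forall\, t \neq \alpha_i,\ \forall\, i,
\end{equation*}
% with the norm taken as Frobenius or nuclear depending on how the attention
% weights are parameterized; gradient descent is claimed to converge in
% direction to this max-margin solution.
```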
arXiv Detail & Related papers (2023-08-31T17:57:50Z) - Exposing Attention Glitches with Flip-Flop Language Modeling [55.0688535574859]
This work identifies and analyzes the phenomenon of attention glitches in large language models.
We introduce flip-flop language modeling (FFLM), a family of synthetic benchmarks designed to probe the extrapolative behavior of neural language models.
We find that Transformer FFLMs suffer from a long tail of sporadic reasoning errors, some of which we can eliminate using various regularization techniques.
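A minimal sketch of FFLM-style data following the write/read/ignore scheme the benchmark describes; sequence length and symbol frequencies are illustrative assumptions. At every read, the only correct continuation is the most recently written bit, which is what the sporadic errors are measured against.

```python
import random

random.seed(0)

def make_fflm_sequence(num_ops=12, p_write=0.3, p_read=0.2):
    seq = ["w", random.choice("01")]                     # always start with a write
    memory = seq[-1]
    for _ in range(num_ops):
        u = random.random()
        if u < p_write:
            bit = random.choice("01"); seq += ["w", bit]; memory = bit
        elif u < p_write + p_read:
            seq += ["r", memory]                         # must echo the last written bit
        else:
            seq += ["i", random.choice("01")]            # ignore: bit is a distractor
    return " ".join(seq)

print(make_fflm_sequence())
```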
arXiv Detail & Related papers (2023-06-01T17:44:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.