Scaling Bidirectional Spans and Span Violations in Attention Mechanism
- URL: http://arxiv.org/abs/2512.13033v1
- Date: Mon, 15 Dec 2025 07:03:24 GMT
- Title: Scaling Bidirectional Spans and Span Violations in Attention Mechanism
- Authors: Jongwook Kim, Sangheon Yun, Sukjin Yoon
- Abstract summary: The canonical $O(N^2)$ Transformer remains the empirical performance frontier in sequence modeling. We propose an optimization framework that leverages an asymmetric projection to decompose the backward-pass gradients into parallel spans. We demonstrate that selectively scaling these components, focusing primarily on $0^{th}$ order bidirectional parallel spans, yields the most effective learning signal.
- Score: 5.755498052202004
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The canonical $O(N^2)$ Transformer remains the empirical performance frontier in sequence modeling, and its training can be further optimized by addressing geometric inefficiency. We propose an optimization framework that leverages an asymmetric projection to decompose the backward-pass gradients into parallel spans and orthogonal violations, while keeping the canonical forward-pass $QKV$ structure intact. Through consistent experimental validation across various decomposition and projection setups, we provide strong evidence that the standard attention gradient is suboptimal. We demonstrate that selectively scaling these components, focusing primarily on $0^{th}$ order bidirectional parallel spans, yields the most effective learning signal. On the limited WikiText-2 dataset, and using a crude configuration, this method achieved a $0.56\%$ reduction in validation loss, confirming the framework's fundamental validity and suggesting significant potential gains on larger datasets and deeper training regimes.
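The abstract describes the mechanism only at a high level, so the following PyTorch sketch is an illustration under stated assumptions rather than the authors' implementation: the canonical softmax($QK^T/\sqrt{d}$)$V$ forward pass is left untouched, while the backward pass splits the gradient with respect to $V$ into a component inside the span of the value rows (a stand-in for the paper's "parallel span") and the orthogonal remainder (a stand-in for the "violation"), each rescaled by its own factor. The knobs `alpha` and `beta`, the choice of $V$'s row space as the reference span, and the class name are all assumptions made for illustration.

```python
# Minimal sketch (not the paper's code): canonical attention forward pass,
# with the gradient w.r.t. V decomposed into span-parallel and violation parts.
import torch


class SpanScaledAttention(torch.autograd.Function):
    @staticmethod
    def forward(ctx, q, k, v, alpha, beta):
        # Canonical forward pass: softmax(QK^T / sqrt(d)) V, kept intact.
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        ctx.save_for_backward(q, k, v, attn)
        ctx.alpha, ctx.beta = alpha, beta
        return attn @ v

    @staticmethod
    def backward(ctx, grad_out):
        q, k, v, attn = ctx.saved_tensors
        scale = q.shape[-1] ** 0.5
        # Standard gradients of the canonical attention computation.
        grad_v = attn.transpose(-2, -1) @ grad_out
        grad_attn = grad_out @ v.transpose(-2, -1)
        d_scores = attn * (grad_attn - (grad_attn * attn).sum(-1, keepdim=True))
        grad_q = d_scores @ k / scale
        grad_k = d_scores.transpose(-2, -1) @ q / scale
        # Assumed decomposition: project each row of grad_v onto the row space
        # of V (the "parallel span"); the remainder is the orthogonal "violation".
        proj = torch.linalg.pinv(v) @ v          # (d, d) projector onto row space of V
        parallel = grad_v @ proj
        violation = grad_v - parallel
        grad_v = ctx.alpha * parallel + ctx.beta * violation
        return grad_q, grad_k, grad_v, None, None


# Usage: mildly up-weight the span-parallel component of the V-gradient.
q = torch.randn(2, 4, 16, 32, requires_grad=True)   # (batch, heads, seq, head_dim)
k = torch.randn(2, 4, 16, 32, requires_grad=True)
v = torch.randn(2, 4, 16, 32, requires_grad=True)
out = SpanScaledAttention.apply(q, k, v, 1.05, 1.0)
out.sum().backward()
```

Because the forward pass is unchanged, the model's predictions are identical to standard attention; only the learning signal is reshaped, which mirrors the paper's stated design constraint.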
Related papers
- Preconditioned Score and Flow Matching [15.378063594292955]
We show that the covariance $\Sigma_t$ of $p_t$ governs optimization bias. We propose reversible, label-conditional preconditioning maps that reshape the geometry of $p_t$. We show that preconditioning consistently yields better-trained models by avoiding suboptimal plateaus.
arXiv Detail & Related papers (2026-03-02T19:09:15Z) - PRISM: Parallel Residual Iterative Sequence Model [52.26239951489612]
We propose PRISM (Parallel Residual Iterative Sequence Model) to resolve this tension. PRISM introduces a solver-inspired inductive bias that captures key structural properties of multi-step refinement in a parallelizable form. We prove that this formulation achieves Rank-$L$ accumulation, structurally expanding the update manifold beyond the single-step Rank-$1$ bottleneck.
arXiv Detail & Related papers (2026-02-11T12:39:41Z) - Differentiable Sparsity via $D$-Gating: Simple and Versatile Structured Penalization [22.883367233817836]
We show that $D$-Gating is theoretically equivalent to solving the original group sparsity problem. We validate our theory across vision, language, and other tasks, where $D$-Gating consistently delivers strong performance-sparsity tradeoffs.
arXiv Detail & Related papers (2025-09-28T14:08:29Z) - PT$^2$-LLM: Post-Training Ternarization for Large Language Models [52.4629647715623]
Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment. We propose PT$^2$-LLM, a post-training ternarization framework tailored for LLMs. At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline.
arXiv Detail & Related papers (2025-09-27T03:01:48Z) - Visual Prompt Tuning in Null Space for Continual Learning [51.96411454304625]
Existing prompt-tuning methods have demonstrated impressive performance in continual learning (CL).
This paper aims to learn each task by tuning the prompts in the direction orthogonal to the subspace spanned by previous tasks' features.
In practice, an effective null-space-based approximation has been proposed to implement the prompt gradient projection (a generic sketch of this projection appears after this list).
arXiv Detail & Related papers (2024-06-09T05:57:40Z) - Implicit Bias and Fast Convergence Rates for Self-attention [26.766649949420746]
We study the fundamental optimization principles of self-attention, the defining mechanism of transformers. We analyze the implicit bias of gradient-based optimization in a self-attention layer with a linear decoder for classification.
arXiv Detail & Related papers (2024-02-08T15:15:09Z) - Geometry-Aware Normalizing Wasserstein Flows for Optimal Causal Inference [0.0]
This paper presents a groundbreaking approach to causal inference by integrating continuous normalizing flows with parametric submodels.
We leverage optimal transport and Wasserstein gradient flows to develop causal inference methodologies with minimal variance in finite-sample settings.
Preliminary experiments showcase our method's superiority, yielding lower mean-squared errors compared to standard flows.
arXiv Detail & Related papers (2023-11-30T18:59:05Z) - Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
arXiv Detail & Related papers (2023-10-20T12:45:12Z) - PROMISE: Preconditioned Stochastic Optimization Methods by Incorporating Scalable Curvature Estimates [17.777466668123886]
We introduce PROMISE ($\textbf{Pr}$econditioned $\textbf{O}$ptimization $\textbf{M}$ethods by $\textbf{I}$ncorporating $\textbf{S}$calable Curvature $\textbf{E}$stimates), a suite of sketching-based preconditioned gradient algorithms.
PROMISE includes preconditioned versions of SVRG, SAGA, and Katyusha.
arXiv Detail & Related papers (2023-09-05T07:49:10Z) - Transformers as Support Vector Machines [54.642793677472724]
We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem.
We characterize the implicit bias of 1-layer transformers optimized with gradient descent.
We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
arXiv Detail & Related papers (2023-08-31T17:57:50Z) - Efficient Semi-Implicit Variational Inference [65.07058307271329]
We propose an efficient and scalable semi-implicit variational inference (SIVI) method.
Our method maps SIVI's evidence lower bound to a rigorous inference objective with lower-variance gradient estimates.
arXiv Detail & Related papers (2021-01-15T11:39:09Z)
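For the Visual Prompt Tuning in Null Space entry above, the following is a brief sketch of the generic null-space gradient projection recipe: collect previous tasks' features, treat their low-eigenvalue directions as the approximate null space, and project each prompt gradient into it before the optimizer step. The cited paper's exact approximation may differ; the helper names (`null_space_projector`, `project_prompt_grad`) and the threshold `eps` are hypothetical, not taken from the original work.

```python
# Hedged sketch of null-space gradient projection for continual prompt tuning.
import torch


def null_space_projector(prev_feats: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    """prev_feats: (num_samples, dim) features from earlier tasks.
    Returns a (dim, dim) projector onto their approximate null space."""
    cov = prev_feats.T @ prev_feats / prev_feats.shape[0]
    eigvals, eigvecs = torch.linalg.eigh(cov)               # ascending eigenvalues
    null_basis = eigvecs[:, eigvals < eps * eigvals.max()]  # near-zero directions
    return null_basis @ null_basis.T


def project_prompt_grad(prompt: torch.Tensor, projector: torch.Tensor) -> None:
    """Keep only the null-space component of the prompt's gradient, in place."""
    if prompt.grad is not None:
        prompt.grad = prompt.grad @ projector   # prompt: (num_prompts, dim)


# Usage per training step (sketch):
#   loss.backward(); project_prompt_grad(prompt, P); optimizer.step()
```

Applying the projector after `loss.backward()` and before `optimizer.step()` keeps prompt updates approximately orthogonal to directions that earlier tasks relied on, which is the mechanism the summary above describes.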