Related papers: Orthogonal Self-Attention

Orthogonal Self-Attention

URL: http://arxiv.org/abs/2602.05996v1
Date: Thu, 05 Feb 2026 18:42:57 GMT
Title: Orthogonal Self-Attention
Authors: Leo Zhang, James Martens,
Abstract summary: Softmax Self-Attention (SSA) is a key component of Transformer architectures.<n>Recent work has highlighted the inherent instability of SSA due to inducing rank collapse and poorly-conditioned Jacobians.<n>We design a novel attention mechanism: Orthogonal Self-Attention (OSA), which aims to bypass these issues.
Score: 4.235348155087336
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Softmax Self-Attention (SSA) is a key component of Transformer architectures. However, when utilised within skipless architectures, which aim to improve representation learning, recent work has highlighted the inherent instability of SSA due to inducing rank collapse and poorly-conditioned Jacobians. In this work, we design a novel attention mechanism: Orthogonal Self-Attention (OSA), which aims to bypass these issues with SSA, in order to allow for (non-causal) Transformers without skip connections and normalisation layers to be more easily trained. In particular, OSA parametrises the attention matrix to be orthogonal via mapping a skew-symmetric matrix, formed from query-key values, through the matrix exponential. We show that this can be practically implemented, by exploiting the low-rank structure of our query-key values, resulting in the computational complexity and memory cost of OSA scaling linearly with sequence length. Furthermore, we derive an initialisation scheme for which we prove ensures that the Jacobian of OSA is well-conditioned.

Related papers

PRISM: Parallel Residual Iterative Sequence Model [52.26239951489612]
We propose PRISM (Parallel Residual Iterative Sequence Model) to resolve this tension.<n>PRISM introduces a solver-inspired inductive bias that captures key structural properties of multi-step refinement in a parallelizable form.<n>We prove that this formulation achieves Rank-$L$ accumulation, structurally expanding the update manifold beyond the single-step Rank-$1$ bottleneck.
arXiv Detail & Related papers (2026-02-11T12:39:41Z)
Deep Delta Learning [91.75868893250662]
We introduce Deep Delta Learning (DDL), a novel architecture that generalizes the standard residual connection.<n>We provide a spectral analysis of this operator, demonstrating that the gate $(mathbfX)$ enables dynamic between identity mapping, projection, and geometric reflection.<n>This unification empowers the network to explicitly control the spectrum of its layer-wise transition operator, enabling the modeling of complex, non-monotonic dynamics.
arXiv Detail & Related papers (2026-01-01T18:11:38Z)
Vision Transformers are Circulant Attention Learners [30.300457741980846]
Self-attention mechanism has been a key factor in the advancement of vision Transformers.<n>We present a novel attention paradigm termed textbfCirculant Attention by exploiting the inherent efficient pattern of self-attention.
arXiv Detail & Related papers (2025-12-25T07:28:33Z)
Higher-order Linear Attention [59.92962330635185]
quadratic cost of scaled dot-product attention is a central obstacle to scaling autoregressive language models to long contexts.<n>We introduce Higher-order Linear Attention (HLA), a causal, streaming mechanism that realizes higher interactions via compact prefix sufficient statistics.
arXiv Detail & Related papers (2025-10-31T07:54:37Z)
Structured Sparse Transition Matrices to Enable State Tracking in State-Space Models [68.31088463716269]
We propose a structured sparse parametrization of transition matrices in state-space models (SSMs)<n>Our method, PD-SSM, parametrizes the transition matrix as the product of a column one-hot matrix ($P$) and a complex-valued diagonal matrix ($D$)<n>The model significantly outperforms a wide collection of modern SSM variants on various FSA state tracking tasks.
arXiv Detail & Related papers (2025-09-26T12:46:30Z)
Scaling Probabilistic Circuits via Monarch Matrices [109.65822339230853]
Probabilistic Circuits (PCs) are tractable representations of probability distributions.<n>We propose a novel sparse and structured parameterization for the sum blocks in PCs.
arXiv Detail & Related papers (2025-06-14T07:39:15Z)
Token Statistics Transformer: Linear-Time Attention via Variational Rate Reduction [29.12836710966048]
We propose a novel transformer attention operator whose computational complexity scales linearly with the number of tokens.<n>Our results call into question the conventional wisdom that pairwise similarity style attention mechanisms are critical to the success of transformer architectures.
arXiv Detail & Related papers (2024-12-23T18:59:21Z)
Reducing the Transformer Architecture to a Minimum [5.352839075466439]
Transformers are a widespread and successful model architecture, particularly in Natural Language Processing (NLP) and Computer Vision (CV) The attention mechanism itself is nonlinear through its internal use of similarity measures. We have laid the groundwork by testing widespread CV benchmarks: MNIST and CIFAR-10.
arXiv Detail & Related papers (2024-10-17T16:36:14Z)
Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences [60.489682735061415]
We propose CHELA, which replaces state space models with short-long convolutions and implements linear attention in a divide-and-conquer manner. Our experiments on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2024-06-12T12:12:38Z)
Rank Reduction Autoencoders [3.180674374101366]
We introduce a new class of deterministic autoencoders, Rank Reduction Autoencoders (RRAEs)<n>In RRAEs, the bottleneck is defined by the rank of the latent matrix, thereby alleviating the dependence of the encoder/decoder architecture on the bottleneck size.<n>We empirically demonstrate that both RRAEs and aRRAEs are stable, scalable, and reliable.
arXiv Detail & Related papers (2024-05-22T20:33:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.