Affine-Scaled Attention: Towards Flexible and Stable Transformer Attention
- URL: http://arxiv.org/abs/2602.23057v1
- Date: Thu, 26 Feb 2026 14:42:16 GMT
- Title: Affine-Scaled Attention: Towards Flexible and Stable Transformer Attention
- Authors: Jeongin Bae, Baeseong Park, Gunho Park, Minsub Kim, Joonhyung Lee, Junhee Yoo, Sunghyeon Woo, Jiwon Ryu, Se Jung Kwon, Dongsoo Lee
- Abstract summary: Transformer attention is typically implemented with softmax normalization, which constrains the attention weights for each query to sum to one. We propose Affine-Scaled Attention, a simple extension to standard attention that introduces input-dependent scaling and a corresponding bias term applied to the softmax-normalized attention weights.
- Score: 14.827874140211328
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer attention is typically implemented using softmax normalization, which constrains the attention weights for each query to sum to one. While effective in many settings, this constraint can limit flexibility in controlling attention magnitudes and may contribute to overly concentrated or unstable attention patterns during training. Prior work has explored modifications such as attention sinks or gating mechanisms, but these approaches provide only limited or indirect control over attention reweighting. We propose Affine-Scaled Attention, a simple extension to standard attention that introduces input-dependent scaling and a corresponding bias term applied to the softmax-normalized attention weights. This design relaxes the strict normalization constraint while maintaining aggregation of value representations, allowing the model to adjust both the relative distribution and the scale of attention in a controlled manner. We empirically evaluate Affine-Scaled Attention in large-scale language model pretraining across multiple model sizes. Experimental results show consistent improvements in training stability, optimization behavior, and downstream task performance compared to standard softmax attention and attention sink baselines. These findings suggest that modest reweighting of attention outputs provides a practical and effective way to improve attention behavior in Transformer models.
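As a rough illustration, the affine reweighting described in the abstract can be sketched as below. The tanh-based parameterization of the per-query scale and bias (and the names `w_s`, `w_b`) are assumptions for the sketch, not the paper's exact formulation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def affine_scaled_attention(q, k, v, w_s, w_b):
    """Single-head sketch of affine-scaled attention.

    q, k, v: (seq, d) arrays. w_s, w_b: (d,) projections producing the
    input-dependent scale and bias -- hypothetical forms chosen for
    illustration; the paper's parameterization may differ.
    """
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))         # (seq, seq); rows sum to 1
    scale = 1.0 + np.tanh(q @ w_s)[:, None]      # per-query scale (assumed form)
    bias = np.tanh(q @ w_b)[:, None]             # per-query bias (assumed form)
    # Relax the unit-sum constraint: rows no longer need to sum to one,
    # but the output is still an aggregation of the value vectors.
    attn = scale * attn + bias / attn.shape[-1]
    return attn @ v
```

When `w_s` and `w_b` are zero, the scale is 1 and the bias is 0, so the sketch reduces exactly to standard softmax attention.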
Related papers
- AMPS: Adaptive Modality Preference Steering via Functional Entropy [66.69992693275061]
We introduce an instance-aware diagnostic metric that quantifies each modality's information contribution and reveals sample-specific susceptibility to steering. Experimental results show that our instance-aware steering outperforms conventional steering in modulating modality preference.
arXiv Detail & Related papers (2026-02-13T02:29:06Z)
- A Unified View of Attention and Residual Sinks: Outlier-Driven Rescaling is Essential for Transformer Training [86.64715217940274]
Outliers function jointly with normalization. They serve as rescale factors rather than direct contributors, and can be absorbed into learnable parameters or mitigated via explicit gated rescaling.
arXiv Detail & Related papers (2026-01-30T13:29:45Z)
- From Fake Focus to Real Precision: Confusion-Driven Adversarial Attention Learning in Transformers [0.0]
Transformer-based models have been widely adopted for sentiment analysis tasks due to their exceptional ability to capture contextual information. We observe that existing models tend to allocate attention primarily to common words, overlooking less popular yet highly task-relevant terms. We propose an Adversarial Feedback for Attention (AFA) training mechanism that enables the model to automatically redistribute attention weights to appropriate focal points.
arXiv Detail & Related papers (2025-12-19T01:48:25Z)
- FAIR: Focused Attention Is All You Need for Generative Recommendation [43.65370600297507]
We propose the first generative recommendation framework with focused attention, which enhances attention scores to relevant context while suppressing those to irrelevant context. Specifically, we propose a focused attention mechanism integrated into the standard Transformer, which learns two separate sets of Q and K attention weights and computes their difference as the final attention scores. We validate the effectiveness of FAIR on four public benchmarks, demonstrating its superior performance compared to existing methods.
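The two-map difference described in this summary could be sketched as follows; this is one plausible reading of "two separate sets of Q and K attention weights", with all projection names hypothetical and the exact FAIR formulation possibly differing:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def focused_attention(x, Wq1, Wk1, Wq2, Wk2, Wv):
    """Illustrative difference-of-attention-maps sketch.

    x: (seq, d) input; Wq1/Wk1 and Wq2/Wk2: two separate (d, d) Q/K
    projection pairs; Wv: (d, d) value projection. All hypothetical.
    """
    d = Wq1.shape[1]
    # Two attention maps from the two Q/K sets; their difference can
    # amplify scores on relevant context and suppress irrelevant ones.
    a1 = softmax((x @ Wq1) @ (x @ Wk1).T / np.sqrt(d))
    a2 = softmax((x @ Wq2) @ (x @ Wk2).T / np.sqrt(d))
    return (a1 - a2) @ (x @ Wv)
```

Note that each row of the difference map sums to zero, so this score matrix behaves more like a contrast between two attention patterns than a probability distribution.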
arXiv Detail & Related papers (2025-12-12T03:25:12Z)
- Transformers Learn Faster with Semantic Focus [57.97235825738412]
We study sparse transformers in terms of learnability and generalization. We find that input-dependent sparse attention models appear to converge faster and generalize better than standard attention models.
arXiv Detail & Related papers (2025-06-17T01:19:28Z)
- Softplus Attention with Re-weighting Boosts Length Extrapolation in Large Language Models [7.80071686970278]
This paper proposes a new design principle for attention, viewing it as a two-stage process. In the first stage, we replace the standard exponential function with the more numerically stable Softplus activation. In the second stage, we introduce a re-weighting mechanism that sharpens the attention distribution.
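A minimal sketch of the two stages this summary describes is given below. The softplus stage follows directly from the text; the power-sharpening re-weighting with exponent `gamma` is an assumed stand-in, since the summary does not specify the actual mechanism:

```python
import numpy as np

def softplus_attention(q, k, v, gamma=2.0):
    """Two-stage attention sketch: softplus scores, then sharpening.

    q, k, v: (seq, d) arrays. gamma is a hypothetical sharpening
    exponent; the paper's actual re-weighting may differ.
    """
    d = q.shape[-1]
    # Stage 1: softplus(x) = log(1 + e^x) replaces the exponential;
    # np.logaddexp(0, x) computes it in a numerically stable way.
    scores = np.logaddexp(0.0, q @ k.T / np.sqrt(d))
    # Stage 2: re-weighting that sharpens the distribution; raising
    # positive weights to a power gamma > 1 is one simple choice.
    w = scores ** gamma
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v
```

Because softplus grows only linearly for large scores, this avoids the overflow-prone exponentials of standard softmax, at the cost of a flatter raw distribution that the second stage then re-sharpens.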
arXiv Detail & Related papers (2025-01-23T07:21:08Z)
- Continual Low-Rank Scaled Dot-product Attention [67.11704350478475]
We introduce a new formulation of Scaled Dot-product Attention based on the Nyström approximation that is suitable for Continual Inference. In experiments on Online Audio Classification and Online Action Detection tasks, the proposed Continual Scaled Dot-product Attention can lower the number of operations by up to three orders of magnitude.
arXiv Detail & Related papers (2024-12-04T11:05:01Z)
- More Expressive Attention with Negative Weights [36.40344438470477]
We propose a novel attention mechanism, named Cog Attention, that enables attention weights to be negative for enhanced expressiveness. Our approach suggests a promising research direction for rethinking and breaking the entrenched constraints of traditional softmax attention.
arXiv Detail & Related papers (2024-11-11T17:56:28Z)
- Linear Log-Normal Attention with Unbiased Concentration [3.034257650900382]
We study the self-attention mechanism by analyzing the distribution of the attention matrix and its concentration ability.
We propose instruments to measure these quantities and introduce a novel self-attention mechanism, Linear Log-Normal Attention.
Our experimental results on popular natural language benchmarks reveal that our proposed Linear Log-Normal Attention outperforms other linearized attention alternatives.
arXiv Detail & Related papers (2023-11-22T17:30:41Z)
- Calibrating Undisciplined Over-Smoothing in Transformer for Weakly Supervised Semantic Segmentation [51.14107156747967]
Weakly supervised semantic segmentation (WSSS) has attracted considerable attention because it requires fewer annotations than fully supervised approaches. We propose an Adaptive Re-Activation Mechanism (AReAM) that curbs undisciplined over-smoothing in deep-level attention. AReAM substantially improves segmentation performance compared with existing WSSS methods, reducing noise while sharpening focus on relevant semantic regions.
arXiv Detail & Related papers (2023-05-04T19:11:33Z)
- Bayesian Attention Modules [65.52970388117923]
We propose a scalable version of attention that is easy to implement and optimize.
Our experiments show the proposed method brings consistent improvements over the corresponding baselines.
arXiv Detail & Related papers (2020-10-20T20:30:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.