Softplus Attention with Re-weighting Boosts Length Extrapolation in Large Language Models
- URL: http://arxiv.org/abs/2501.13428v3
- Date: Mon, 12 May 2025 03:16:04 GMT
- Title: Softplus Attention with Re-weighting Boosts Length Extrapolation in Large Language Models
- Authors: Bo Gao, Michael W. Spratling
- Abstract summary: Traditional Softmax attention suffers from numerical instability and reduced performance as the length of inference tokens increases. This paper addresses these issues by decomposing the Softmax operation into a non-linear transformation and the $l_1$-norm. We create a novel attention mechanism with performance better than conventional Softmax attention across various inference lengths.
- Score: 7.80071686970278
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models have achieved remarkable success in recent years, primarily due to the implementation of self-attention mechanisms. However, traditional Softmax attention suffers from numerical instability and reduced performance as the length of inference tokens increases. This paper addresses these issues by decomposing the Softmax operation into a non-linear transformation and the $l_1$-norm. We identify the latter as essential for maintaining model performance. By replacing the non-linear transformation with the Softplus activation function and introducing a dynamic scale factor for different token lengths based on invariance entropy, we create a novel attention mechanism with performance better than conventional Softmax attention across various inference lengths. To further improve the length extrapolation ability of the proposed attention mechanism, we introduce a novel re-weighting mechanism that amplifies significant attention weights while diminishing weaker ones, enabling the model to concentrate more effectively on relevant tokens. When combined with our proposed attention mechanism, this approach maintains nearly constant validation loss even at 16$\times$ the training token length, ensures numerical stability, and achieves superior results on downstream benchmarks.
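A minimal sketch of the recipe described in the abstract: attention scores pass through Softplus rather than the exponential, are normalized with the $l_1$-norm, scaled by a length-dependent factor, and re-weighted so that strong weights are amplified and weak ones suppressed. The log-length scale and the power-based re-weighting below are illustrative assumptions; the paper's exact formulas are not reproduced here.

```python
import math
import torch
import torch.nn.functional as F

def softplus_attention(q, k, v, train_len=2048, gamma=2.0):
    """Softmax-free attention sketch: Softplus scores + l1 normalization.

    q, k, v: (batch, heads, seq_len, head_dim). The log-length scale and the
    power re-weighting (gamma) are illustrative assumptions, not the paper's
    exact formulas.
    """
    n, d = q.size(-2), q.size(-1)
    # Length-dependent scale so that score magnitudes (and hence the attention
    # entropy) stay roughly stable as the context grows beyond the training length.
    scale = (math.log(n) / math.log(train_len)) / math.sqrt(d)
    scores = torch.einsum("bhqd,bhkd->bhqk", q, k) * scale

    # Replace the exponential of Softmax with Softplus; keep the l1-norm part.
    w = F.softplus(scores)
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool, device=q.device))
    w = w.masked_fill(~causal, 0.0)

    # Re-weighting: amplify strong weights, suppress weak ones (illustrative power).
    w = w.pow(gamma)
    # l1 normalization over keys (the component the paper identifies as essential).
    w = w / w.sum(dim=-1, keepdim=True).clamp_min(1e-6)
    return torch.einsum("bhqk,bhkd->bhqd", w, v)
```

Because the $l_1$-norm keeps each row a proper weight distribution at any length, extending the context changes only the scalar scale factor, which is consistent with the nearly constant validation loss reported at 16$\times$ the training length.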
Related papers
- Transformers Learn Faster with Semantic Focus [57.97235825738412]
We study sparse transformers in terms of learnability and generalization. We find that input-dependent sparse attention models appear to converge faster and generalize better than standard attention models.
arXiv Detail & Related papers (2025-06-17T01:19:28Z) - Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free [81.65559031466452]
We conduct experiments to investigate gating-augmented softmax attention variants. We find that a simple modification, applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA), consistently improves performance.
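A minimal sketch of the gate described above, assuming a standard multi-head layer; the gate projection and the point at which it multiplies the attention output are inferred from the summary rather than taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedSDPA(nn.Module):
    """Scaled dot-product attention followed by a head-specific, input-dependent
    sigmoid gate (a sketch; layer details are assumptions)."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.h, self.dh = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.gate = nn.Linear(dim, num_heads)   # one gate logit per head
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                        # x: (batch, seq, dim)
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, n, self.h, self.dh).transpose(1, 2) for t in (q, k, v))
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        g = torch.sigmoid(self.gate(x)).transpose(1, 2).unsqueeze(-1)  # (b, h, n, 1)
        return self.out((attn * g).transpose(1, 2).reshape(b, n, -1))
```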
arXiv Detail & Related papers (2025-05-10T17:15:49Z) - Sigmoid Self-Attention has Lower Sample Complexity than Softmax Self-Attention: A Mixture-of-Experts Perspective [69.72942835553228]
This paper theoretically demonstrates that sigmoid self-attention is more sample-efficient than its softmax counterpart. We represent the self-attention matrix as a mixture of experts and show that "experts" in sigmoid self-attention require significantly less data to achieve the same approximation error as those in softmax self-attention.
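For comparison, a minimal sketch of sigmoid self-attention itself: the row-wise softmax is replaced by an element-wise sigmoid, so weights are not normalized across keys. The $-\log n$ bias below is a common stabilizing choice assumed for illustration, not a detail from the paper.

```python
import math
import torch

def sigmoid_self_attention(q, k, v):
    """Element-wise sigmoid in place of the row-wise softmax; attention weights
    are not normalized across keys. The -log(n) bias is an assumed stabilizer."""
    n, d = q.size(-2), q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)
    w = torch.sigmoid(scores - math.log(n))
    return w @ v
```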
arXiv Detail & Related papers (2025-02-01T02:36:14Z) - Towards Scalable and Deep Graph Neural Networks via Noise Masking [59.058558158296265]
Graph Neural Networks (GNNs) have achieved remarkable success in many graph mining tasks.
However, scaling them to large graphs is challenging due to high computational and storage costs.
We present random walk with noise masking (RMask), a plug-and-play module compatible with existing model-simplification methods.
arXiv Detail & Related papers (2024-12-19T07:48:14Z) - Bridging the Divide: Reconsidering Softmax and Linear Attention [116.34723260730405]
We present two key perspectives to understand and alleviate the limitations of linear attention. First, we prove that linear attention is not injective, making it prone to assigning identical attention weights to different query vectors. Second, we confirm that effective local modeling is essential for the success of Softmax attention, an area in which linear attention falls short.
arXiv Detail & Related papers (2024-12-09T15:44:22Z) - RecurFormer: Not All Transformer Heads Need Self-Attention [14.331807060659902]
Transformer-based large language models (LLMs) excel in modeling complex language patterns but face significant computational costs during inference.
We propose RecurFormer, a novel architecture that replaces certain attention heads with linear recurrent neural networks.
arXiv Detail & Related papers (2024-10-10T15:24:12Z) - Cottention: Linear Transformers With Cosine Attention [2.762180345826837]
We introduce Cottention, a novel attention mechanism that replaces the softmax operation with cosine similarity.
Cottention achieves native linear memory complexity with respect to sequence length, making it inherently more memory-efficient than softmax attention.
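A minimal sketch of the idea: with unit-normalized queries and keys and no softmax, the score-value product can be reassociated so that the full attention matrix is never materialized, giving linear memory in sequence length. Causal masking and any learned scaling are omitted here.

```python
import torch
import torch.nn.functional as F

def cosine_attention(q, k, v):
    """Cosine-similarity attention sketch: normalize queries and keys, drop the
    softmax, and reassociate the product for linear memory in sequence length."""
    q_hat = F.normalize(q, dim=-1)                       # unit-norm queries
    k_hat = F.normalize(k, dim=-1)                       # unit-norm keys
    # (q_hat @ k_hat^T) @ v == q_hat @ (k_hat^T @ v); the right-hand form never
    # builds the (seq_len x seq_len) attention matrix.
    return q_hat @ (k_hat.transpose(-2, -1) @ v)
```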
arXiv Detail & Related papers (2024-09-27T13:38:36Z) - ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.
This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.
We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z) - Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query.
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
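A rough sketch of the selection step, with a hard top-k standing in for SPARSEK's differentiable mask and scoring network (both simplified away here): each query attends to a constant number of key-value pairs.

```python
import math
import torch

def topk_sparse_attention(q, k, v, k_keep=64):
    """Hard top-k selection as a stand-in for SPARSEK's differentiable mask:
    every query keeps only its k_keep highest-scoring keys, so the softmax and
    the weighted sum run over a constant number of entries per query."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)       # (..., n_q, n_k)
    k_keep = min(k_keep, scores.size(-1))
    top = scores.topk(k_keep, dim=-1)
    masked = torch.full_like(scores, float("-inf"))
    masked.scatter_(-1, top.indices, top.values)          # keep only top-k scores
    return torch.softmax(masked, dim=-1) @ v
```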
arXiv Detail & Related papers (2024-06-24T15:55:59Z) - SinkLoRA: Enhanced Efficiency and Chat Capabilities for Long-Context Large Language Models [4.497551890206997]
The self-attention mechanism scales quadratically with sequence length.
LongLoRA proposed shifted sparse attention (S$^2$-Attn), effectively enabling context extension.
SinkLoRA is still not as efficient as vanilla attention, reaching only 39% of the perplexity improvement compared to full attention.
arXiv Detail & Related papers (2024-06-09T07:23:34Z) - LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory [63.41820940103348]
The self-attention mechanism's computational cost limits its practicality for long sequences.
We propose a new method called LongVQ to compress the global abstraction into a fixed-length codebook.
LongVQ effectively maintains dynamic global and local patterns, which helps to compensate for the lack of long-range dependencies.
arXiv Detail & Related papers (2024-04-17T08:26:34Z) - Is It a Free Lunch for Removing Outliers during Pretraining? [7.621880623381026]
We introduce a novel softmax function aimed at pretraining models in an outlier-free manner.
We show that such an approach leads to performance degradation in full precision.
We enhance the method by ensuring its normalization is invariant to sequence length.
arXiv Detail & Related papers (2024-02-19T12:45:52Z) - FAST: Factorizable Attention for Speeding up Transformers [1.3637227185793512]
We present a linearly scaled attention mechanism that maintains the full representation of the attention matrix without compromising on sparsification.
Results indicate that our attention mechanism has a robust performance and holds significant promise for diverse applications where self-attention is used.
arXiv Detail & Related papers (2024-02-12T18:59:39Z) - How Smooth Is Attention? [26.322030088685928]
We provide a detailed study of the Lipschitz constant of self-attention in several practical scenarios.
We show that for inputs of length $n$ in any compact set, the Lipschitz constant of self-attention is bounded by $\sqrt{n}$ up to a constant factor.
Our mean-field framework for masked self-attention is novel and of independent interest.
arXiv Detail & Related papers (2023-12-22T16:47:10Z) - Linear Log-Normal Attention with Unbiased Concentration [3.034257650900382]
We study the self-attention mechanism by analyzing the distribution of the attention matrix and its concentration ability.
We propose instruments to measure these quantities and introduce a novel self-attention mechanism, Linear Log-Normal Attention.
Our experimental results on popular natural language benchmarks reveal that our proposed Linear Log-Normal Attention outperforms other linearized attention alternatives.
arXiv Detail & Related papers (2023-11-22T17:30:41Z) - Superiority of Softmax: Unveiling the Performance Edge Over Linear Attention [28.98187418889448]
Large transformer models have achieved state-of-the-art results in numerous natural language processing tasks.
The attention mechanism plays a crucial role in capturing token interactions within sequences through the use of the softmax function.
Linear attention presents a more computationally efficient alternative by approximating the softmax operation with linear complexity.
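A minimal sketch of the kernelized linear attention being compared against softmax, using the common $\phi(x)=\mathrm{elu}(x)+1$ feature map (one standard choice, assumed here for illustration): reassociating the product gives linear complexity in sequence length.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Kernelized linear attention sketch: softmax(QK^T)V is approximated by
    phi(Q) (phi(K)^T V) with a matching normalizer, phi(x) = elu(x) + 1."""
    phi_q = F.elu(q) + 1.0
    phi_k = F.elu(k) + 1.0
    kv = phi_k.transpose(-2, -1) @ v                      # (..., d, d_v), linear in n
    z = phi_q @ phi_k.sum(dim=-2, keepdim=True).transpose(-2, -1)  # normalizer, (..., n, 1)
    return (phi_q @ kv) / (z + eps)
```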
arXiv Detail & Related papers (2023-10-18T03:17:57Z) - TransNormerLLM: A Faster and Better Large Language Model with Improved TransNormer [34.790081960470964]
We present TransNormerLLM, the first linear attention-based Large Language Model (LLM).
We make advanced modifications that include positional embedding, linear attention acceleration, gating mechanisms, tensor normalization, and inference acceleration and stabilization.
We validate our model design through a series of ablations and train models with sizes of 385M, 1B, and 7B on our self-collected corpus.
arXiv Detail & Related papers (2023-07-27T16:45:33Z) - cosFormer: Rethinking Softmax in Attention [60.557869510885205]
Kernel methods are often adopted to reduce complexity by approximating the softmax operator.
Due to approximation errors, their performance varies across tasks and corpora and can suffer significant drops.
We propose a linear transformer called cosFormer that can achieve comparable or better accuracy to the vanilla transformer.
arXiv Detail & Related papers (2022-02-17T17:53:48Z) - Sparse Attention with Linear Units [60.399814410157425]
We introduce a novel, simple method for achieving sparsity in attention: we replace the softmax activation with a ReLU.
Our model, which we call Rectified Linear Attention (ReLA), is easy to implement and more efficient than previously proposed sparse attention mechanisms.
Our analysis shows that ReLA delivers high sparsity rate and head diversity, and the induced cross attention achieves better accuracy with respect to source-target word alignment.
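A minimal sketch of the core substitution: ReLU on the scaled scores in place of softmax, which allows exactly-zero weights and hence sparse attention. ReLA's output normalization is simplified to an RMS-style rescaling here.

```python
import math
import torch
import torch.nn.functional as F

def rectified_linear_attention(q, k, v, eps=1e-6):
    """ReLU replaces the softmax, so attention weights can be exactly zero
    (sparse). The output rescaling below is a simplification of ReLA's norm."""
    d = q.size(-1)
    w = F.relu(q @ k.transpose(-2, -1) / math.sqrt(d))    # sparse, unnormalized weights
    out = w @ v
    return out / (out.pow(2).mean(dim=-1, keepdim=True).sqrt() + eps)
```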
arXiv Detail & Related papers (2021-04-14T17:52:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.