Stick-breaking Attention
- URL: http://arxiv.org/abs/2410.17980v1
- Date: Wed, 23 Oct 2024 15:51:13 GMT
- Title: Stick-breaking Attention
- Authors: Shawn Tan, Yikang Shen, Songlin Yang, Aaron Courville, Rameswar Panda
- Abstract summary: The self-attention mechanism traditionally relies on the softmax operator.
Current methods still face length generalisation challenges.
We propose an alternative attention mechanism based on the stick-breaking process.
- Score: 38.492552119793
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The self-attention mechanism traditionally relies on the softmax operator, necessitating positional embeddings like RoPE or position biases to account for token order. But current methods using these still face length generalisation challenges. We propose an alternative attention mechanism based on the stick-breaking process: for each token before the current one, we determine a break point $\beta_{i,j}$, which represents the proportion of the remaining stick to allocate to that token. We repeat the process until the stick is fully allocated, resulting in a sequence of attention weights. This process naturally incorporates recency bias, which has linguistic motivations for grammar parsing (Shen et al., 2017). We study the implications of replacing the conventional softmax-based attention mechanism with stick-breaking attention. We then discuss a numerically stable implementation of stick-breaking attention and adapt Flash Attention to accommodate this mechanism. When used as a drop-in replacement for current softmax+RoPE attention systems, we find that stick-breaking attention performs competitively on length generalisation and downstream tasks: a model trained with a $2^{11}$ context window performs well at $2^{14}$, with perplexity improvements.
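To make the allocation concrete, below is a minimal NumPy sketch of the stick-breaking weights described above. It assumes a single head, that the break points $\beta_{i,j}$ come from a sigmoid over scaled query-key dot products (an assumption; the abstract does not spell this out), and it ignores the numerical-stability and Flash Attention concerns the paper addresses.

```python
import numpy as np

def stickbreaking_attention(q, k, v):
    """Single-head, causal stick-breaking attention (naive O(T^2) sketch).

    For query position i, earlier tokens j < i are visited from most recent
    to oldest; each break point beta[i, j] takes a fraction of the stick
    left over by the more recent tokens:
        A[i, j] = beta[i, j] * prod_{j < m < i} (1 - beta[i, m])
    """
    T, d = q.shape
    # Assumption: break points come from a sigmoid over scaled dot products.
    logits = q @ k.T / np.sqrt(d)
    beta = 1.0 / (1.0 + np.exp(-logits))

    attn = np.zeros((T, T))
    for i in range(T):
        remaining = 1.0                      # unallocated portion of the stick
        for j in range(i - 1, -1, -1):       # nearest token first -> recency bias
            attn[i, j] = beta[i, j] * remaining
            remaining *= 1.0 - beta[i, j]
    return attn @ v, attn

# Toy usage
rng = np.random.default_rng(0)
T, d = 6, 8
q, k, v = rng.normal(size=(3, T, d))
out, attn = stickbreaking_attention(q, k, v)
print(attn.sum(axis=-1))  # each row sums to at most 1; leftover stick stays unallocated
```

In practice the products of $(1-\beta)$ terms are better accumulated in log space (using $\log(1-\sigma(x)) = -\mathrm{softplus}(x)$), which is presumably the kind of reformulation behind the numerically stable and Flash Attention variants mentioned in the abstract.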
Related papers
- Learning to Attribute with Attention [75.61481181755744]
We propose treating attention weights of different attention heads as features.
This way, we can learn how to effectively leverage attention weights for attribution.
Our resulting method, Attribution with Attention (AT2), reliably performs on par with approaches that involve many ablations.
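As a rough illustration of "attention weights as features", the hypothetical sketch below scores source tokens by a weighted combination of per-head attention weights; the head weights stand in for whatever AT2 actually learns, and none of the names here come from the paper.

```python
import numpy as np

def attribution_scores(attn_to_sources, head_weights):
    """Score each source token by a learned combination of per-head
    attention weights (illustrative only, not the paper's AT2 procedure).

    attn_to_sources: (num_heads, num_sources) attention from the token
        being explained to each source token.
    head_weights:    (num_heads,) learned importance of each head.
    """
    return head_weights @ attn_to_sources    # (num_sources,)

rng = np.random.default_rng(1)
H, S = 12, 20
attn = rng.dirichlet(np.ones(S), size=H)     # each head's attention over sources
w = np.ones(H) / H                           # placeholder for learned head weights
print(attribution_scores(attn, w).argsort()[::-1][:3])   # top-3 attributed sources
```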
arXiv Detail & Related papers (2025-04-18T15:36:28Z) - Softplus Attention with Re-weighting Boosts Length Extrapolation in Large Language Models [7.80071686970278]
Traditional Softmax attention suffers from numerical instability and reduced performance as the length of inference tokens increases.
This paper addresses these issues by decomposing the Softmax operation into a non-linear transformation and the $l_1$-norm.
We create a novel attention mechanism with performance better than conventional Softmax attention across various inference lengths.
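A minimal sketch of that decomposition, assuming the non-linearity is softplus (as the title suggests) and leaving out the paper's re-weighting step:

```python
import numpy as np

def softplus_l1_attention(q, k, v):
    """Softmax seen as exp followed by l1-normalisation, with exp swapped
    for a softplus non-linearity. The re-weighting described in the paper
    is omitted; this only shows the decomposition itself."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    pos = np.logaddexp(0.0, scores)                        # softplus(scores), stable
    attn = pos / (pos.sum(axis=-1, keepdims=True) + 1e-9)  # l1-normalise each row
    return attn @ v

rng = np.random.default_rng(2)
q, k, v = rng.normal(size=(3, 5, 8))
print(softplus_l1_attention(q, k, v).shape)                # (5, 8)
```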
arXiv Detail & Related papers (2025-01-23T07:21:08Z) - Recycled Attention: Efficient inference for long-context language models [54.00118604124301]
We propose Recycled Attention, an inference-time method which alternates between full context attention and attention over a subset of input tokens.
When performing partial attention, we recycle the attention pattern of a previous token that has performed full attention and attend only to the top K most attended tokens.
Compared to previously proposed inference-time acceleration methods, which attend only to the local context or to tokens with high accumulated attention scores, our approach flexibly chooses tokens that are relevant to the current decoding step.
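A simplified sketch of the recycling step, under the assumption that the top-K selection is done per query from the most recent full-attention pattern (names and shapes here are illustrative, not the authors' code):

```python
import numpy as np

def recycled_indices(full_attn_row, k_top):
    """Indices of the top-K most-attended tokens from the last full-attention step."""
    return np.argsort(full_attn_row)[::-1][:k_top]

def partial_attention(q, k_cache, v_cache, keep_idx):
    """Attend only over the recycled subset of the KV cache."""
    ks, vs = k_cache[keep_idx], v_cache[keep_idx]
    scores = q @ ks.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ vs

rng = np.random.default_rng(3)
T, d = 128, 16
k_cache, v_cache = rng.normal(size=(2, T, d))
last_full_attn = rng.dirichlet(np.ones(T))     # attention row from the last full step
idx = recycled_indices(last_full_attn, k_top=8)
q = rng.normal(size=d)
print(partial_attention(q, k_cache, v_cache, idx).shape)   # (d,)
```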
arXiv Detail & Related papers (2024-11-08T18:57:07Z) - When Attention Sink Emerges in Language Models: An Empirical View [39.36282162213973]
Language Models (LMs) assign significant attention to the first token, even if it is not semantically important.
This attention sink phenomenon has been widely exploited in applications such as streaming/long-context generation, KV cache optimization, inference acceleration, model quantization, and others.
We first demonstrate that attention sinks exist universally in LMs with various inputs, even in small models.
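A small diagnostic in the same spirit: measure how much attention mass query positions place on the first token (a value far above a uniform baseline would indicate a sink). This is just a measurement sketch, not the paper's analysis code.

```python
import numpy as np

def first_token_attention_share(attn):
    """Average attention mass placed on the first token.

    attn: (num_heads, T, T) row-stochastic causal attention matrices.
    """
    return attn[:, 1:, 0].mean()      # skip row 0, which can only attend to itself

rng = np.random.default_rng(4)
H, T = 4, 32
attn = np.tril(rng.random((H, T, T)))           # toy causal attention
attn /= attn.sum(axis=-1, keepdims=True)
print(first_token_attention_share(attn))        # toy value; per the summary, real LMs put far more mass here
```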
arXiv Detail & Related papers (2024-10-14T17:50:28Z) - SinkLoRA: Enhanced Efficiency and Chat Capabilities for Long-Context Large Language Models [4.497551890206997]
The self-attention mechanism scales quadratically with sequence length.
LongLoRA proposed shifted sparse attention (S(2)-Attn), effectively enabling context extension.
However, S(2)-Attn is still not as efficient as vanilla attention, reaching only 39% of the perplexity improvement of full attention; SinkLoRA is proposed to close this gap.
arXiv Detail & Related papers (2024-06-09T07:23:34Z) - FAST: Factorizable Attention for Speeding up Transformers [1.3637227185793512]
We present a linearly scaling attention mechanism that maintains the full representation of the attention matrix without resorting to sparsification.
Results indicate that our attention mechanism has a robust performance and holds significant promise for diverse applications where self-attention is used.
arXiv Detail & Related papers (2024-02-12T18:59:39Z) - Landmark Attention: Random-Access Infinite Context Length for Transformers [45.69864961773124]
We present a novel approach that allows access to the complete context while retaining random-access flexibility.
Our method uses a landmark token to represent each block of the input and trains the attention to use it for selecting relevant blocks.
We demonstrate that our method can obtain comparable performance with Transformer-XL while significantly reducing the number of retrieved tokens in each step.
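A simplified sketch of the block-selection step: each block is represented by one landmark key, the query scores the landmarks, and only the tokens of the top-scoring blocks are attended. The mean-of-keys landmark below is a stand-in for the trained landmark token.

```python
import numpy as np

def retrieve_blocks(q, landmark_keys, n_blocks):
    """Score each block via its landmark key and keep the top-n blocks."""
    scores = q @ landmark_keys.T / np.sqrt(q.shape[-1])
    return np.argsort(scores)[::-1][:n_blocks]

rng = np.random.default_rng(5)
num_blocks, block_len, d = 16, 64, 32
block_keys = rng.normal(size=(num_blocks, block_len, d))    # per-token keys, per block
landmarks = block_keys.mean(axis=1)                         # stand-in for trained landmark keys
q = rng.normal(size=d)
picked = retrieve_blocks(q, landmarks, n_blocks=2)
retrieved_keys = block_keys[picked].reshape(-1, d)          # attend only over these tokens
print(picked, retrieved_keys.shape)                         # e.g. [..] (128, 32)
```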
arXiv Detail & Related papers (2023-05-25T17:53:42Z) - Input-length-shortening and text generation via attention values [1.8222946691865871]
We show that the first layer's attention sums can be used to filter tokens in a given sequence.
We also show that retaining approximately 6% of the original sequence is sufficient to obtain 86.5% accuracy.
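A sketch of that filtering rule, assuming "attention sums" means the total attention each token receives in the first layer, summed over heads and query positions:

```python
import numpy as np

def filter_by_first_layer_attention(first_layer_attn, keep_fraction=0.06):
    """Keep the token positions that receive the most attention in layer 1.

    first_layer_attn: (num_heads, T, T) attention matrices of the first layer.
    Returns retained positions in their original order.
    """
    received = first_layer_attn.sum(axis=(0, 1))        # attention received per token
    n_keep = max(1, int(round(keep_fraction * received.shape[0])))
    return np.sort(np.argsort(received)[::-1][:n_keep])

rng = np.random.default_rng(6)
attn = rng.dirichlet(np.ones(100), size=(8, 100))       # (heads, queries, keys)
print(filter_by_first_layer_attention(attn))            # ~6 positions retained
```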
arXiv Detail & Related papers (2023-03-14T02:11:24Z) - Guiding Visual Question Answering with Attention Priors [76.21671164766073]
We propose to guide the attention mechanism using explicit linguistic-visual grounding.
This grounding is derived by connecting structured linguistic concepts in the query to their referents among the visual objects.
The resultant algorithm is capable of probing attention-based reasoning models, injecting relevant associative knowledge, and regulating the core reasoning process.
arXiv Detail & Related papers (2022-05-25T09:53:47Z) - Coarse-to-fine Q-attention with Tree Expansion [95.00518278458908]
Coarse-to-fine Q-attention enables sample-efficient robot manipulation by discretizing the translation space in a coarse-to-fine manner.
Q-attention suffers from "coarse ambiguity": when the voxelization is very coarse, it is not feasible to distinguish similar-looking objects without first inspecting them at a finer resolution.
We propose to envision Q-attention as a tree that can be expanded and used to accumulate value estimates across the top-k voxels at each Q-attention depth.
arXiv Detail & Related papers (2022-04-26T17:41:28Z) - Sparse Attention with Linear Units [60.399814410157425]
We introduce a novel, simple method for achieving sparsity in attention: we replace the softmax activation with a ReLU.
Our model, which we call Rectified Linear Attention (ReLA), is easy to implement and more efficient than previously proposed sparse attention mechanisms.
Our analysis shows that ReLA delivers a high sparsity rate and head diversity, and that the induced cross-attention achieves better accuracy with respect to source-target word alignment.
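The core change is small enough to show directly; the sketch below only swaps softmax for an elementwise ReLU and leaves out everything else in the paper.

```python
import numpy as np

def rectified_linear_attention(q, k, v):
    """ReLU in place of softmax: weights are unnormalised and exactly zero
    wherever the score is negative, which is the source of the sparsity."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.maximum(scores, 0.0)
    return weights @ v, weights

rng = np.random.default_rng(7)
q, k, v = rng.normal(size=(3, 5, 8))
out, w = rectified_linear_attention(q, k, v)
print((w == 0).mean())        # fraction of exactly-zero attention weights
```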
arXiv Detail & Related papers (2021-04-14T17:52:38Z) - Learning Hard Retrieval Decoder Attention for Transformers [69.40942736249397]
The Transformer translation model is based on the multi-head attention mechanism, which can be easily parallelized.
We show that our hard retrieval attention mechanism is 1.43 times faster in decoding.
arXiv Detail & Related papers (2020-09-30T13:18:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.