Breaking the Attention Bottleneck
- URL: http://arxiv.org/abs/2406.10906v1
- Date: Sun, 16 Jun 2024 12:06:58 GMT
- Title: Breaking the Attention Bottleneck
- Authors: Kalle Hilsenbek
- Abstract summary: This paper develops a generative function as an attention or activation replacement.
It retains the auto-regressive character by comparing each token with the previous one.
The concept of attention replacement is distributed under the AGPL v3 license at https://gitlab.com/Bachstelze/causal_generation.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Attention-based transformers have become the standard architecture in many deep learning fields, primarily due to their ability to model long-range dependencies and handle variable-length input sequences. However, the attention mechanism, with its quadratic complexity, is a significant bottleneck in the transformer architecture. Moreover, the mechanism is only uni-directional in the decoder and converges to a static pattern in over-parametrized decoder-only models. I address these issues by developing a generative function as an attention or activation replacement. It retains the auto-regressive character by comparing each token with the previous one. In my test setting with nanoGPT, this yields a lower loss with a smaller model. The loss drops further when an average context vector is incorporated. This concept of attention replacement is distributed under the GNU AGPL v3 license at https://gitlab.com/Bachstelze/causal_generation.
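The repository above contains the reference implementation; purely as an illustration of the mechanism the abstract describes (each token compared with its direct predecessor, plus an average context vector), a minimal PyTorch sketch might look as follows. The class name, layer shapes, and nonlinearity are assumptions, not the author's code.

```python
import torch
import torch.nn as nn

class CausalGeneration(nn.Module):
    """Hypothetical sketch of the attention replacement: each token is
    combined with its direct predecessor and a causal average context
    vector. One plausible reading of the abstract, not the released code."""

    def __init__(self, d_model: int):
        super().__init__()
        self.pair = nn.Linear(2 * d_model, d_model)  # current + previous token
        self.ctx = nn.Linear(d_model, d_model)       # average-context term

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        counts = torch.arange(1, x.size(1) + 1, device=x.device).view(1, -1, 1)
        avg_ctx = x.cumsum(dim=1) / counts           # causal mean over tokens 0..t
        return torch.tanh(self.pair(torch.cat([x, prev], dim=-1)) + self.ctx(avg_ctx))
```

Each update touches only the current token, its predecessor, and a running mean, so the cost is linear in sequence length, in contrast to the quadratic attention bottleneck the abstract targets.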
Related papers
- Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform in-context learning (ICL) on $n$-gram Markov chain data.
We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
arXiv Detail & Related papers (2024-09-09T18:10:26Z)
- Attention as an RNN [66.5420926480473]
We show that attention can be viewed as a special Recurrent Neural Network (RNN) with the ability to compute its many-to-one RNN output efficiently.
We introduce a new efficient method of computing attention's many-to-many RNN output based on the parallel prefix scan algorithm.
We show Aarens achieve comparable performance to Transformers on $38$ datasets spread across four popular sequential problem settings.
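The many-to-one view can be made concrete: softmax attention for a single query is computable as a recurrence over the key-value pairs, carrying only a running maximum, numerator, and denominator. The sketch below shows this standard reformulation (the general technique the paper builds on, not the authors' code).

```python
import torch

def attention_as_rnn(q, K, V):
    """Softmax attention for one query, evaluated as a many-to-one recurrence.
    State: running max m, numerator num, denominator den (numerically stable)."""
    # q: (d,), K: (t, d), V: (t, d)
    m = torch.tensor(float("-inf"))
    num = torch.zeros(V.size(-1))
    den = torch.tensor(0.0)
    for k, v in zip(K, V):                      # one recurrent step per token
        s = (q @ k) / q.size(0) ** 0.5
        m_new = torch.maximum(m, s)
        scale = torch.exp(m - m_new)            # rescale old state to the new max
        num = num * scale + torch.exp(s - m_new) * v
        den = den * scale + torch.exp(s - m_new)
        m = m_new
    return num / den                            # equals softmax-weighted sum of V
```

The parallel prefix scan the paper introduces evaluates this same recurrence for every prefix at once, which yields the efficient many-to-many form.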
arXiv Detail & Related papers (2024-05-22T19:45:01Z)
- Harnessing Attention Mechanisms: Efficient Sequence Reduction using Attention-based Autoencoders [14.25761027376296]
We introduce a novel attention-based method that allows for the direct manipulation of sequence lengths.
We show that the autoencoder retains all the significant information when reducing the original sequence to half its original size.
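One concrete way to realize direct manipulation of sequence lengths is cross-attention pooling: a bank of learned queries, half as many as the input tokens, attends to the full sequence. The sketch below is a hedged illustration; the class name, layer choices, and head count are assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class AttentionPooler(nn.Module):
    """Hypothetical encoder half: n_out learned queries cross-attend to the
    input sequence, producing a shorter sequence (e.g. n_out = n // 2)."""

    def __init__(self, d_model: int, n_out: int, n_heads: int = 4):
        super().__init__()
        # d_model must be divisible by n_heads
        self.queries = nn.Parameter(torch.randn(n_out, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, d_model) -> (batch, n_out, d_model)
        q = self.queries.unsqueeze(0).expand(x.size(0), -1, -1)
        out, _ = self.attn(q, x, x)
        return out
```

A symmetric decoder with n learned queries attending to the reduced sequence would restore the original length, completing the autoencoder.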
arXiv Detail & Related papers (2023-10-23T11:57:44Z)
- Are We Using Autoencoders in a Wrong Way? [3.110260251019273]
Autoencoders are used for dimensionality reduction, anomaly detection and feature extraction.
We revisit the standard training of the undercomplete autoencoder, modifying the shape of the latent space.
We also explore the behaviour of the latent space when reconstructing a random sample from the whole dataset.
arXiv Detail & Related papers (2023-09-04T11:22:43Z)
- DAE-Former: Dual Attention-guided Efficient Transformer for Medical Image Segmentation [3.9548535445908928]
We propose DAE-Former, a novel method that seeks to provide an alternative perspective by efficiently designing the self-attention mechanism.
Our method outperforms state-of-the-art methods on multi-organ cardiac and skin lesion segmentation datasets without requiring pre-training weights.
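Dual-attention designs typically pair spatial (token) attention with channel (feature) attention, the latter being cheap because its score matrix is d x d rather than n x n. Below is a generic sketch of the channel-attention half, assuming a single-head form; it is not the exact DAE-Former block.

```python
import torch

def channel_attention(x: torch.Tensor) -> torch.Tensor:
    """Attention across feature channels instead of tokens: the d x d score
    matrix avoids the quadratic cost in sequence length. Generic sketch only."""
    # x: (n, d) token embeddings
    scores = torch.softmax(x.T @ x / x.size(0) ** 0.5, dim=-1)  # (d, d)
    return x @ scores                                           # (n, d)
```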
arXiv Detail & Related papers (2022-12-27T14:39:39Z)
- ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
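A minimal sketch of the cluster-then-aggregate idea: a few k-means steps over the keys, mean-pooled values per cluster, and dense attention over the shortened centroid sequence. The cluster count and iteration budget below are illustrative assumptions, not ClusTR's actual procedure.

```python
import torch

def clustered_attention(q, k, v, n_clusters: int = 64, iters: int = 3):
    """Content-based token reduction: k-means over keys, mean-pooled values
    per cluster, dense attention over the reduced (n_clusters-long) sequence.
    Assumes k has at least n_clusters rows."""
    # q: (m, d); k, v: (n, d)
    cent = k[torch.randperm(k.size(0))[:n_clusters]].clone()
    for _ in range(iters):                              # Lloyd iterations
        assign = torch.cdist(k, cent).argmin(dim=1)     # nearest centroid per key
        for c in range(n_clusters):
            members = assign == c
            if members.any():
                cent[c] = k[members].mean(dim=0)
    v_agg = torch.stack([v[assign == c].mean(dim=0) if (assign == c).any()
                         else torch.zeros_like(v[0]) for c in range(n_clusters)])
    w = torch.softmax(q @ cent.T / q.size(1) ** 0.5, dim=-1)
    return w @ v_agg                                    # (m, d)
```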
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
- Predicting Attention Sparsity in Transformers [0.9786690381850356]
We propose Sparsefinder, a model trained to identify the sparsity pattern of entmax attention before computing it.
Our work provides a new angle to study model efficiency by doing extensive analysis of the tradeoff between the sparsity and recall of the predicted attention graph.
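The idea of predicting sparsity before computing attention can be illustrated with a simple bucketing scheme: project queries and keys to a scalar with a cheap learned map and score only pairs that share a bucket. Everything below (the 1-D projection, bucket count, diagonal fallback) is a simplified stand-in for Sparsefinder's learned predictor.

```python
import torch

def bucketed_sparse_attention(q, k, v, proj, n_buckets: int = 8):
    """Compute attention only between positions whose cheap 1-D projections
    land in the same bucket; other scores are masked before the softmax."""
    # q, k, v: (n, d); proj: (d,) hypothetical learned projection vector
    qb = ((q @ proj).sigmoid() * n_buckets).long().clamp_(max=n_buckets - 1)
    kb = ((k @ proj).sigmoid() * n_buckets).long().clamp_(max=n_buckets - 1)
    skip = qb.view(-1, 1) != kb.view(1, -1)             # True where we don't score
    scores = (q @ k.T) / q.size(1) ** 0.5
    scores = scores.masked_fill(skip, float("-inf"))
    scores.fill_diagonal_(0.0)          # fallback: every row keeps its own token
    return torch.softmax(scores, dim=-1) @ v
```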
arXiv Detail & Related papers (2021-09-24T20:51:21Z)
- Combiner: Full Attention Transformer with Sparse Computation Cost [142.10203598824964]
We propose Combiner, which provides full attention capability in each attention head while maintaining low computation complexity.
We show that most sparse attention patterns used in existing sparse transformers can inspire the design of such a factorization for full attention.
An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach.
arXiv Detail & Related papers (2021-07-12T22:43:11Z)
- Vector Embeddings with Subvector Permutation Invariance using a Triplet Enhanced Autoencoder [0.0]
In this paper, we use an autoencoder enhanced with triplet loss to promote the clustering of vectors that are related through permutations of constituent subvectors.
We can then use these invariant embeddings as inputs to other problems, like classification and clustering, and improve detection accuracy in those problems.
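The training objective is easy to sketch: a reconstruction loss plus a triplet loss whose positive example is a subvector-permuted copy of the anchor. The dimensions and layer sizes below are invented for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

SUB, N_SUB = 4, 8                        # assumed: 8 subvectors of 4 dims each
enc = nn.Sequential(nn.Linear(SUB * N_SUB, 16), nn.ReLU(), nn.Linear(16, 8))
dec = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, SUB * N_SUB))
triplet = nn.TripletMarginLoss(margin=1.0)

def permute_subvectors(x: torch.Tensor) -> torch.Tensor:
    """Positive sample: the same vector with its subvectors reordered."""
    blocks = x.view(x.size(0), N_SUB, SUB)
    return blocks[:, torch.randperm(N_SUB)].reshape(x.size(0), -1)

x = torch.randn(64, SUB * N_SUB)         # anchors
pos, neg = permute_subvectors(x), torch.randn(64, SUB * N_SUB)
z = enc(x)
# reconstruction keeps the embedding informative; the triplet term pulls
# permuted variants together and pushes unrelated vectors apart
loss = F.mse_loss(dec(z), x) + triplet(z, enc(pos), enc(neg))
loss.backward()
```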
arXiv Detail & Related papers (2020-11-18T21:24:07Z)
- Learning to Encode Position for Transformer with Continuous Dynamical Model [88.69870971415591]
We introduce a new way of learning to encode position information for non-recurrent models, such as Transformer models.
We model the evolution of the encoded results along the position index with a continuous dynamical system.
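The continuous dynamical model can be sketched with an Euler discretization: a small network defines the dynamics, and integrating it along the position index generates the encodings. The paper uses a Neural-ODE-style formulation; the step size and network below are assumptions.

```python
import torch
import torch.nn as nn

class DynamicalPositionEncoder(nn.Module):
    """Position embeddings generated by integrating learned dynamics
    p_{t+1} = p_t + h * f(p_t) along the position index (Euler sketch)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(d_model, d_model), nn.Tanh(),
                               nn.Linear(d_model, d_model))
        self.p0 = nn.Parameter(torch.zeros(d_model))   # initial state

    def forward(self, seq_len: int, h: float = 0.1) -> torch.Tensor:
        p, out = self.p0, []
        for _ in range(seq_len):
            out.append(p)
            p = p + h * self.f(p)                      # one Euler step
        return torch.stack(out)                        # (seq_len, d_model)
```

Because the encoding is a function of a continuous index, it can be evaluated at positions beyond those seen in training, one motivation for this kind of construction.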
arXiv Detail & Related papers (2020-03-13T00:41:41Z)
- Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation [73.11214377092121]
We propose to replace all but one attention head of each encoder layer with simple fixed -- non-learnable -- attentive patterns.
Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality.
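A fixed attentive pattern is simple to write down: instead of learned query-key scores, each position attends with a hard-coded profile, for example a Gaussian centered on the previous token. The offset and width below are illustrative choices, not the paper's exact patterns.

```python
import torch

def fixed_pattern_head(v: torch.Tensor, offset: int = -1, width: float = 1.0):
    """Non-learnable attention head: position i attends with a fixed Gaussian
    profile centered at position i + offset (offset=-1: the previous token)."""
    n = v.size(0)
    pos = torch.arange(n, dtype=torch.float)
    dist = pos.view(1, -1) - (pos.view(-1, 1) + offset)    # (n, n)
    w = torch.softmax(-dist.pow(2) / (2 * width ** 2), dim=-1)
    return w @ v                                           # values mixed per pattern
```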
arXiv Detail & Related papers (2020-02-24T13:53:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.