Related papers: TransformerFAM: Feedback attention is working memory

TransformerFAM: Feedback attention is working memory

URL: http://arxiv.org/abs/2404.09173v3
Date: Tue, 7 May 2024 13:23:46 GMT
Title: TransformerFAM: Feedback attention is working memory
Authors: Dongseong Hwang, Weiran Wang, Zhuoyuan Huo, Khe Chai Sim, Pedro Moreno Mengibar,
Abstract summary: We propose a novel Transformer architecture that leverages a feedback loop to enable the network to attend to its own latent representations. TransformerFAM requires no additional weights, enabling seamless integration with pre-trained models.
Score: 18.005034679674274
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While Transformers have revolutionized deep learning, their quadratic attention complexity hinders their ability to process infinitely long inputs. We propose Feedback Attention Memory (FAM), a novel Transformer architecture that leverages a feedback loop to enable the network to attend to its own latent representations. This design fosters the emergence of working memory within the Transformer, allowing it to process indefinitely long sequences. TransformerFAM requires no additional weights, enabling seamless integration with pre-trained models. Our experiments show that TransformerFAM significantly improves Transformer performance on long-context tasks across various model sizes (1B, 8B, and 24B). These results showcase the potential to empower Large Language Models (LLMs) to process sequences of unlimited length.

Related papers

A temporal scale transformer framework for precise remaining useful life prediction in fuel cells [10.899223392837936]
Temporal Scale Transformer (TSTransformer) is an enhanced version of the inverted Transformer (iTransformer)<n>Unlike traditional Transformers that treat each timestep as an input token, TSTransformer maps sequences of varying lengths into tokens at different stages for inter-sequence modeling.<n>It improves local feature extraction, captures temporal scale characteristics, and reduces token count and computational costs.
arXiv Detail & Related papers (2025-04-08T23:42:54Z)
Forgetting Transformer: Softmax Attention with a Forget Gate [4.484298224007183]
We name this attention mechanism Forgetting Attention and the resulting model the Forgetting Transformer (FoX) FoX outperforms the Transformer on long-context language modeling, length extrapolation, and short-context downstream tasks. FoX is compatible with the FlashAttention algorithm and does not require any positional embeddings.
arXiv Detail & Related papers (2025-03-03T23:35:23Z)
Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory [11.3128832831327]
Increasing the size of a Transformer model does not always lead to enhanced performance. improved generalization ability occurs as the model memorizes the training samples. We present a theoretical framework that sheds light on the memorization process and performance dynamics of transformer-based language models.
arXiv Detail & Related papers (2024-05-14T15:48:36Z)
Do Efficient Transformers Really Save Computation? [32.919672616480135]
We focus on the capabilities and limitations of efficient Transformers, specifically the Sparse Transformer and the Linear Transformer. Our results show that while these models are expressive enough to solve general DP tasks, contrary to expectations, they require a model size that scales with the problem size. We identify a class of DP problems for which these models can be more efficient than the standard Transformer.
arXiv Detail & Related papers (2024-02-21T17:00:56Z)
iTransformer: Inverted Transformers Are Effective for Time Series Forecasting [62.40166958002558]
We propose iTransformer, which simply applies the attention and feed-forward network on the inverted dimensions. The iTransformer model achieves state-of-the-art on challenging real-world datasets.
arXiv Detail & Related papers (2023-10-10T13:44:09Z)
Blockwise Parallel Transformer for Large Context Models [70.97386897478238]
Blockwise Parallel Transformer (BPT) is a blockwise computation of self-attention and feedforward network fusion to minimize memory costs. By processing longer input sequences while maintaining memory efficiency, BPT enables training sequences 32 times longer than vanilla Transformers and up to 4 times longer than previous memory-efficient methods.
arXiv Detail & Related papers (2023-05-30T19:25:51Z)
Fourier Transformer: Fast Long Range Modeling by Removing Sequence Redundancy with FFT Operator [24.690247474891958]
Fourier Transformer is able to significantly reduce computational costs while retain the ability to inherit from various large pretrained models. Our model achieves state-of-the-art performances among all transformer-based models on the long-range modeling benchmark LRA. For generative seq-to-seq tasks including CNN/DailyMail and ELI5, by inheriting the BART weights our model outperforms the standard BART.
arXiv Detail & Related papers (2023-05-24T12:33:06Z)
Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks [126.33843752332139]
We introduce Group-wise Transformation towards a universal yet lightweight Transformer for vision-and-language tasks, termed as LW-Transformer. We apply LW-Transformer to a set of Transformer-based networks, and quantitatively measure them on three vision-and-language tasks and six benchmark datasets. Experimental results show that while saving a large number of parameters and computations, LW-Transformer achieves very competitive performance against the original Transformer networks for vision-and-language tasks.
arXiv Detail & Related papers (2022-04-16T11:30:26Z)
Glance-and-Gaze Vision Transformer [13.77016463781053]
We propose a new vision Transformer, named Glance-and-Gaze Transformer (GG-Transformer) It is motivated by the Glance and Gaze behavior of human beings when recognizing objects in natural scenes. We empirically demonstrate our method achieves consistently superior performance over previous state-of-the-art Transformers.
arXiv Detail & Related papers (2021-06-04T06:13:47Z)
Scalable Transformers for Neural Machine Translation [86.4530299266897]
Transformer has been widely adopted in Neural Machine Translation (NMT) because of its large capacity and parallel training of sequence generation. We propose a novel scalable Transformers, which naturally contains sub-Transformers of different scales and have shared parameters. A three-stage training scheme is proposed to tackle the difficulty of training the scalable Transformers.
arXiv Detail & Related papers (2021-06-04T04:04:10Z)
Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, which is abbreviated from the Vision-friendly Transformer' With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z)
Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks. We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.