A Study on ReLU and Softmax in Transformer
- URL: http://arxiv.org/abs/2302.06461v1
- Date: Mon, 13 Feb 2023 15:41:20 GMT
- Title: A Study on ReLU and Softmax in Transformer
- Authors: Kai Shen, Junliang Guo, Xu Tan, Siliang Tang, Rui Wang, Jiang Bian
- Abstract summary: The Transformer architecture consists of self-attention and feed-forward networks (FFNs) which can be viewed as key-value memories.
We first rebuild the connections between FFN and key-value memory by conducting extensive studies on ReLU and Softmax.
In addition, ReLU outperforms Softmax on both FFN and key-value memory when the number of value slots is large.
- Score: 51.0740713922741
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The Transformer architecture consists of self-attention and feed-forward
networks (FFNs), which previous work has viewed as key-value memories.
However, FFN and traditional memory use different activation functions
(i.e., ReLU and Softmax, respectively), which makes them non-equivalent.
In this paper, we first rebuild the connections between FFN and
key-value memory by conducting extensive studies on ReLU and Softmax, and find
they are equivalent once an additional layer normalization module is added on top of
Softmax. In addition, ReLU outperforms Softmax on both FFN and key-value memory
when the number of value slots is large. We analyze the reasons and then
explore this good property of ReLU on the self-attention network where the
original Softmax activation performs poorly on long input sequences. We then
propose a fully ReLU-based architecture named ReLUFormer, which performs better than
the baseline Transformer on long sequence tasks such as document translation.
This paper sheds light on the following points: 1) Softmax and ReLU apply
different normalization methods over their elements, which leads to different
variances in the results, and ReLU is better at handling a large number of key-value slots;
2) FFN and key-value memory are equivalent, and thus the Transformer can be
viewed as a memory network where FFNs and self-attention networks are both
key-value memories.
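The core claims above can be made concrete with a minimal numpy sketch of the two memory readouts; the dimensions, initialization, and exact LayerNorm placement here are illustrative assumptions, not the paper's precise formulation.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the last dimension
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    # layer normalization over the last dimension (no learned scale/shift)
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
d, n_slots = 64, 4096                    # hidden size, number of key-value slots (assumed)
x = rng.standard_normal(d)               # a single hidden state (the "query")
K = rng.standard_normal((n_slots, d))    # keys, rows of the first FFN weight
V = rng.standard_normal((n_slots, d))    # values, rows of the second FFN weight

# FFN-style memory: ReLU over the slot scores, no normalization across slots
ffn_out = np.maximum(x @ K.T, 0.0) @ V

# key-value-memory style: Softmax across slots, then LayerNorm on the output
# (the paper argues this extra LayerNorm is what makes the two forms equivalent)
mem_out = layer_norm(softmax(x @ K.T) @ V)

# Softmax weights sum to 1, so each weight shrinks as n_slots grows, while ReLU
# keeps per-slot magnitudes; this is the variance gap the paper ties to ReLU's
# advantage when the number of value slots is large.
print(ffn_out.std(), mem_out.std())
```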
Related papers
- Sliding Window Attention Training for Efficient Large Language Models [55.56483740523027]
We introduce SWAT, which enables efficient long-context handling via Sliding Window Attention Training.
This paper first attributes the inefficiency of Transformers to the attention sink phenomenon resulting from the high variance of the softmax operation.
Experiments demonstrate that SWAT outperforms state-of-the-art linear recurrent architectures on eight benchmarks (a toy sliding-window mask is sketched below).
arXiv Detail & Related papers (2025-02-26T05:31:44Z)
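As an illustration of the sliding-window idea summarized above (not SWAT's actual training recipe), a banded causal attention mask might look as follows; the window size and masking convention are assumptions.

```python
import numpy as np

def sliding_window_causal_mask(seq_len: int, window: int) -> np.ndarray:
    # position i may attend to positions j with i - window < j <= i
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

def masked_softmax(scores: np.ndarray, mask: np.ndarray) -> np.ndarray:
    scores = np.where(mask, scores, -np.inf)
    e = np.exp(scores - scores.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d, window = 8, 16, 3            # toy sizes, chosen for illustration
q = rng.standard_normal((seq_len, d))
k = rng.standard_normal((seq_len, d))
v = rng.standard_normal((seq_len, d))

mask = sliding_window_causal_mask(seq_len, window)
attn = masked_softmax(q @ k.T / np.sqrt(d), mask)
out = attn @ v                           # each position mixes only its local window
print(mask.astype(int))
```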
- Reducing Fine-Tuning Memory Overhead by Approximate and Memory-Sharing Backpropagation [29.139579820699495]
This work strives to reduce memory overhead in fine-tuning from the perspectives of the activation function and layer normalization.
We apply our Approx-BP theory to backpropagation training and derive memory-efficient alternatives of GELU and SiLU activation functions.
In addition, we introduce a Memory-Sharing Backpropagation strategy, which enables the activation memory to be shared by two adjacent layers.
arXiv Detail & Related papers (2024-06-24T03:09:15Z)
- MetaMixer Is All You Need [6.8410780175245165]
Transformer, composed of self-attention and Feed-Forward Network, has revolutionized the landscape of network design across various vision tasks.
Recent works also show that FFN functions like key-value memories.
We propose converting self-attention into a more FFN-like efficient token mixer with only convolutions.
arXiv Detail & Related papers (2024-06-04T07:00:14Z)
- Empirical Study on Updating Key-Value Memories in Transformer Feed-forward Layers [27.636372947415186]
The feed-forward networks (FFNs) in transformers are recognized as a group of key-value neural memories that store abstract high-level knowledge.
We conduct an empirical ablation study on updating keys (the first linear layer in the FFN) or values (the second).
We compare the two choices across various knowledge editing and fine-tuning tasks on large language models to better understand FFNs (a schematic sketch of the key/value split follows below).
arXiv Detail & Related papers (2024-02-19T15:42:54Z)
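Under the key-value reading of the FFN, "updating keys or values" amounts to choosing which of the two weight matrices receives gradient updates. The following numpy sketch is schematic, under that reading only, and is not the study's actual procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 16, 64                                  # hidden size, number of FFN slots (assumed)
W_keys = rng.standard_normal((m, d)) * 0.1     # first FFN layer  ("keys")
W_vals = rng.standard_normal((d, m)) * 0.1     # second FFN layer ("values")

def ffn(x):
    h = np.maximum(W_keys @ x, 0.0)            # slot activations
    return W_vals @ h, h

x = rng.standard_normal(d)
target = rng.standard_normal(d)                # toy editing target
lr, update = 0.1, "values"                     # choose "keys" or "values" to edit

y, h = ffn(x)
grad_y = 2 * (y - target)                      # gradient of an MSE editing loss w.r.t. y

if update == "values":
    # only the second layer (the stored "values") is edited
    W_vals -= lr * np.outer(grad_y, h)
else:
    # only the first layer (the "keys") is edited
    grad_h = (W_vals.T @ grad_y) * (h > 0)
    W_keys -= lr * np.outer(grad_h, x)
```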
- HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator.
We demonstrate that, on a one-billion-parameter model, HiRE applied to both the softmax and feedforward layers achieves almost matching pretraining and downstream accuracy, and speeds up inference latency by $1.47\times$ on a single TPUv5e device (a toy two-stage top-$k$ sketch follows below).
arXiv Detail & Related papers (2024-02-14T18:04:36Z)
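The two-stage idea summarized above, cheap candidate prediction followed by exact computation on the predicted subset, can be sketched roughly as follows; the random low-rank projection used for the cheap scores is an illustrative stand-in, not HiRE's actual compression scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab, k, r = 256, 8000, 64, 64       # hidden size, output rows, top-k, sketch rank (assumed)

x = rng.standard_normal(d)               # hidden state entering the output/softmax layer
W = rng.standard_normal((vocab, d))      # full output projection (expensive to apply)

# Stage 1: cheap approximate scores via a random low-rank sketch
# (W @ P would be precomputed once in practice); over-select candidates to keep recall high.
P = rng.standard_normal((d, r)) / np.sqrt(r)
approx_scores = (W @ P) @ (P.T @ x)
candidates = np.argpartition(approx_scores, -4 * k)[-4 * k:]

# Stage 2: exact computation restricted to the predicted candidate rows.
exact_subset = W[candidates] @ x
topk_ids = candidates[np.argpartition(exact_subset, -k)[-k:]]

# Recall check against the (expensive) exact top-k.
exact_topk = set(np.argpartition(W @ x, -k)[-k:].tolist())
recall = len(exact_topk & set(topk_ids.tolist())) / k
print(f"recall of approximate top-{k}: {recall:.2f}")
```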
- Revisiting the Architectures like Pointer Networks to Efficiently Improve the Next Word Distribution, Summarization Factuality, and Beyond [37.96043934146189]
We propose several softmax alternatives by simplifying the pointer networks and accelerating the word-by-word rerankers.
In GPT-2, our proposals are significantly better and more efficient than the mixture-of-softmax baseline.
Our best method based on T5-Small improves the factCC score by 2 points on the CNN/DM and XSUM datasets, and improves MAUVE scores by 30% on the BookSum paragraph-level dataset.
arXiv Detail & Related papers (2023-05-20T21:52:24Z)
- Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT).
arXiv Detail & Related papers (2022-07-05T03:08:27Z)
- Recurrent Dynamic Embedding for Video Object Segmentation [54.52527157232795]
We propose a Recurrent Dynamic Embedding (RDE) to build a memory bank of constant size.
We propose an unbiased guidance loss during the training stage, which makes SAM more robust in long videos.
We also design a novel self-correction strategy so that the network can repair the embeddings of masks with different qualities in the memory bank.
arXiv Detail & Related papers (2022-05-08T02:24:43Z)
- Breaking the Softmax Bottleneck for Sequential Recommender Systems with Dropout and Decoupling [0.0]
We show that there are more aspects to the Softmax bottleneck in SBRSs.
We propose a simple yet effective method, Dropout and Decoupling (D&D), to alleviate these problems.
Our method significantly improves the accuracy of a variety of Softmax-based SBRS algorithms.
arXiv Detail & Related papers (2021-10-11T16:52:23Z)
- Sparse Attention with Linear Units [60.399814410157425]
We introduce a novel, simple method for achieving sparsity in attention: we replace the softmax activation with a ReLU.
Our model, which we call Rectified Linear Attention (ReLA), is easy to implement and more efficient than previously proposed sparse attention mechanisms.
Our analysis shows that ReLA delivers a high sparsity rate and head diversity, and the induced cross-attention achieves better accuracy with respect to source-target word alignment (a minimal ReLU-attention sketch follows below).
arXiv Detail & Related papers (2021-04-14T17:52:38Z)
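A minimal sketch of the softmax-to-ReLU swap in attention, in the spirit of ReLA as summarized above but omitting its scaling and normalization details:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 10, 32
q = rng.standard_normal((seq_len, d))
k = rng.standard_normal((seq_len, d))
v = rng.standard_normal((seq_len, d))

scores = q @ k.T / np.sqrt(d)

# standard attention: softmax makes every weight strictly positive
soft_w = np.exp(scores - scores.max(-1, keepdims=True))
soft_w = soft_w / soft_w.sum(-1, keepdims=True)

# rectified linear attention: ReLU zeroes out negative scores, giving exact sparsity
relu_w = np.maximum(scores, 0.0)

print("softmax zero fraction:", (soft_w == 0).mean())   # 0.0, no exact zeros
print("ReLU    zero fraction:", (relu_w == 0).mean())   # roughly half the entries

out = relu_w @ v    # attention output under the (unnormalized) ReLU weights
```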
- Memformer: A Memory-Augmented Transformer for Sequence Modeling [55.780849185884996]
We present Memformer, an efficient neural network for sequence modeling.
Our model achieves linear time complexity and constant memory space complexity when processing long sequences.
arXiv Detail & Related papers (2020-10-14T09:03:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.