One Wide Feedforward is All You Need
- URL: http://arxiv.org/abs/2309.01826v2
- Date: Sat, 21 Oct 2023 08:33:44 GMT
- Title: One Wide Feedforward is All You Need
- Authors: Telmo Pessoa Pires, António V. Lopes, Yannick Assogba, Hendra Setiawan
- Abstract summary: The Transformer architecture has two main non-embedding components: Attention and the Feed Forward Network (FFN).
In this work we explore the role of the FFN, and find that despite taking up a significant fraction of the model's parameters, it is highly redundant.
We are able to substantially reduce the number of parameters with only a modest drop in accuracy by removing the FFN on the decoder layers and sharing a single FFN across the encoder.
- Score: 3.043080042012617
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Transformer architecture has two main non-embedding components: Attention
and the Feed Forward Network (FFN). Attention captures interdependencies
between words regardless of their position, while the FFN non-linearly
transforms each input token independently. In this work we explore the role of
the FFN, and find that despite taking up a significant fraction of the model's
parameters, it is highly redundant. Concretely, we are able to substantially
reduce the number of parameters with only a modest drop in accuracy by removing
the FFN on the decoder layers and sharing a single FFN across the encoder.
Finally we scale this architecture back to its original size by increasing the
hidden dimension of the shared FFN, achieving substantial gains in both
accuracy and latency with respect to the original Transformer Big.
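To make the recipe concrete, here is a minimal PyTorch sketch: every encoder layer shares one FFN object whose hidden dimension is widened to recover the parameter budget freed by dropping the per-layer encoder FFNs and all decoder FFNs, while decoder layers keep only self- and cross-attention. The pre-norm layout, class names, and the widening-by-layer-count factor are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class FFN(nn.Module):
    """Position-wise feed-forward network: Linear -> ReLU -> Linear."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class EncoderLayer(nn.Module):
    """Pre-norm encoder layer; the FFN module is injected so it can be shared."""
    def __init__(self, d_model: int, n_heads: int, ffn: nn.Module):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = ffn  # the *same* FFN object is handed to every layer
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.ffn(self.ln2(x))

class NoFFNDecoderLayer(nn.Module):
    """Decoder layer with self- and cross-attention; the FFN sublayer is removed."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, y: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        h = self.ln1(y)
        y = y + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.ln2(y)
        return y + self.cross_attn(h, memory, memory, need_weights=False)[0]

d_model, n_heads, n_layers, d_ff = 512, 8, 6, 2048
# "One wide FFN": a single shared module, widened to recover the parameters
# freed by removing the per-layer encoder FFNs and all decoder FFNs.
shared_ffn = FFN(d_model, d_ff * n_layers)
encoder = nn.ModuleList(EncoderLayer(d_model, n_heads, shared_ffn) for _ in range(n_layers))
decoder = nn.ModuleList(NoFFNDecoderLayer(d_model, n_heads) for _ in range(n_layers))

src, tgt = torch.randn(2, 10, d_model), torch.randn(2, 7, d_model)
memory = src
for layer in encoder:
    memory = layer(memory)
out = tgt
for layer in decoder:
    out = layer(out, memory)
print(out.shape)  # torch.Size([2, 7, 512])
```

Because the shared FFN is one module, the encoder stores a single set of FFN weights instead of one per layer; widening its hidden dimension then trades that saving back for capacity, which is the trade the abstract describes.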
Related papers
- Unity is Strength: Unifying Convolutional and Transformeral Features for Better Person Re-Identification [60.9670254833103]
Person Re-identification (ReID) aims to retrieve a specific person across non-overlapping cameras.
We propose a novel fusion framework called FusionReID to unify the strengths of CNNs and Transformers for image-based person ReID.
arXiv Detail & Related papers (2024-12-23T03:19:19Z)
- FFNet: MetaMixer-based Efficient Convolutional Mixer Design [6.8410780175245165]
We propose MetaMixer, a general mixer architecture that does not specify sub-operations within the query-key-value framework, and present a family of Fast-Forward Networks (FFNet) built on it.
Despite being composed of only simple operators, FFNet outperforms sophisticated and highly specialized methods in each domain.
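The summary gives only the framing, so the following is a hypothetical reading rather than the paper's definition: keep the query-key-value skeleton but leave its sub-operations pluggable, so that softmax attention and pure-convolution mixers are both instances. All names and the convolutional example below are assumptions.

```python
import torch
import torch.nn as nn

class QKVMixer(nn.Module):
    """Hypothetical sketch: keep the query-key-value *skeleton* but leave the
    sub-operations unspecified, so attention and convolution are both instances."""
    def __init__(self, q_op: nn.Module, k_op: nn.Module, v_op: nn.Module, mix):
        super().__init__()
        self.q_op, self.k_op, self.v_op, self.mix = q_op, k_op, v_op, mix

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mix(self.q_op(x), self.k_op(x), self.v_op(x))

# A convolution-only instantiation (illustrative): q/k/v are 1x1 convs and the
# "mixing" is a gated depth-wise average instead of softmax attention.
c = 64
mixer = QKVMixer(
    nn.Conv2d(c, c, 1), nn.Conv2d(c, c, 1), nn.Conv2d(c, c, 1),
    mix=lambda q, k, v: torch.sigmoid(q) * nn.functional.conv2d(
        k * v, torch.ones(c, 1, 3, 3) / 9, padding=1, groups=c))
x = torch.randn(2, c, 16, 16)
print(mixer(x).shape)  # torch.Size([2, 64, 16, 16])
```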
arXiv Detail & Related papers (2024-06-04T07:00:14Z)
- Progressive Token Length Scaling in Transformer Encoders for Efficient Universal Segmentation [67.85309547416155]
A powerful architecture for universal segmentation relies on transformers that encode multi-scale image features and decode object queries into mask predictions.
Mask2Former spends 50% of its compute on the transformer encoder alone.
This is due to the retention of a full-length token-level representation of all backbone feature scales at each encoder layer.
We propose PRO-SCALE to reduce computations by a large margin with minimal sacrifice in performance.
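As a rough illustration of why shrinking the token length in early encoder layers saves compute, consider a linear growth schedule (the schedule and numbers below are assumptions, not PRO-SCALE's actual configuration):

```python
def progressive_token_lengths(full_len: int, n_layers: int) -> list[int]:
    """Illustrative linear schedule: early layers see few tokens, the last sees all."""
    return [max(1, round(full_len * (i + 1) / n_layers)) for i in range(n_layers)]

lens = progressive_token_lengths(full_len=1000, n_layers=6)
print(lens)  # [167, 333, 500, 667, 833, 1000]

# Self-attention cost grows quadratically with token count, so relative cost:
baseline = 6 * 1000 ** 2
scaled = sum(n ** 2 for n in lens)
print(f"{scaled / baseline:.0%} of baseline attention FLOPs")  # ~42%
```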
arXiv Detail & Related papers (2024-04-23T01:34:20Z)
- How Powerful Potential of Attention on Image Restoration? [97.9777639562205]
We conduct an empirical study to explore the potential of attention mechanisms without using FFN.
We propose Continuous Scaling Attention (CSAttn), a method that computes attention continuously in three stages without using FFN.
Our designs provide a closer look at the attention mechanism and reveal that some simple operations can significantly affect the model performance.
arXiv Detail & Related papers (2024-03-15T14:23:12Z)
- FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference [57.119047493787185]
In practice, our method can reduce model size by 43.1% and bring a $1.25\sim 1.56\times$ wall-clock time speedup on different hardware with negligible accuracy drop.
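The split itself can be sketched exactly: because ReLU acts independently per hidden neuron, an FFN decomposes into a small part holding the most important neurons plus a remainder, and only the remainder needs aggressive compression. The importance score below is a simple proxy, not the paper's criterion:

```python
import torch

def split_ffn(w1: torch.Tensor, w2: torch.Tensor, k: int):
    """Split y = relu(x @ w1) @ w2 into an 'important' part (top-k neurons)
    and a remainder; their outputs sum exactly to the original output."""
    score = w1.norm(dim=0) * w2.norm(dim=1)  # proxy importance per hidden neuron
    order = score.argsort(descending=True)
    top, rest = order[:k], order[k:]
    return (w1[:, top], w2[top]), (w1[:, rest], w2[rest])

d_model, d_ff = 16, 64
w1, w2 = torch.randn(d_model, d_ff), torch.randn(d_ff, d_model)
(w1a, w2a), (w1b, w2b) = split_ffn(w1, w2, k=8)

x = torch.randn(4, d_model)
full = torch.relu(x @ w1) @ w2
split = torch.relu(x @ w1a) @ w2a + torch.relu(x @ w1b) @ w2b
print(torch.allclose(full, split, atol=1e-5))  # True: the split is exact
# In FFSplit-style compression, only the remainder (w1b, w2b) would be
# low-rank-approximated or quantized, preserving the important neurons.
```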
arXiv Detail & Related papers (2024-01-08T17:29:16Z)
- PartialFormer: Modeling Part Instead of Whole for Machine Translation [40.67489864907433]
We introduce PartialFormer, a parameter-efficient Transformer architecture utilizing multiple smaller FFNs.
These smaller FFNs are integrated into a multi-head attention mechanism for effective collaboration.
Experiments on 9 translation tasks and 1 abstractive summarization task validate the effectiveness of our PartialFormer approach.
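A minimal sketch of the stated idea, several small FFNs attached head-wise instead of one full-width FFN, could look as follows (sizes are assumptions; the paper's exact integration with attention differs):

```python
import torch
import torch.nn as nn

class PerHeadFFN(nn.Module):
    """One small FFN per attention head, applied to that head's slice of the
    hidden state, instead of one monolithic FFN over the full width."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff_head: int = 256):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_head = d_model // n_heads
        self.ffns = nn.ModuleList(
            nn.Sequential(nn.Linear(self.d_head, d_ff_head), nn.ReLU(),
                          nn.Linear(d_ff_head, self.d_head))
            for _ in range(n_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # (batch, seq, d_model)
        heads = x.split(self.d_head, dim=-1)               # one slice per head
        return torch.cat([f(h) for f, h in zip(self.ffns, heads)], dim=-1)

layer = PerHeadFFN()
print(layer(torch.randn(2, 10, 512)).shape)  # torch.Size([2, 10, 512])
```

With these sizes, the eight small FFNs hold roughly 0.26M parameters versus about 2.1M for a single 512-to-2048 FFN, which is the parameter-efficiency angle the summary points to.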
arXiv Detail & Related papers (2023-10-23T13:25:54Z)
- Full Transformer Framework for Robust Point Cloud Registration with Deep Information Interaction [9.431484068349903]
Recent Transformer-based methods have achieved advanced performance in point cloud registration.
Recent CNNs fail to model global relations due to their local receptive fields.
The shallow-wide architecture of Transformers and the lack of positional encoding lead to indistinct feature extraction.
arXiv Detail & Related papers (2021-12-17T08:40:52Z)
- Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers [55.90468016961356]
We propose an efficient token mixer that learns to mix in the Fourier domain.
AFNO is based on a principled foundation of operator learning.
It can handle a sequence size of 65k and outperforms other efficient self-attention mechanisms.
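The core operation is sketchable in a few lines: FFT over the token dimension, a learned transform per frequency, inverse FFT back. AFNO proper uses block-diagonal weights, MLPs, and soft-thresholding; the version below is a simplified illustration, not the paper's operator.

```python
import torch
import torch.nn as nn

class FourierTokenMixer(nn.Module):
    """Simplified Fourier-domain token mixer: FFT over tokens, learned complex
    channel mixing shared across frequencies, inverse FFT back to tokens."""
    def __init__(self, d_model: int):
        super().__init__()
        self.weight = nn.Parameter(
            0.02 * torch.randn(d_model, d_model, dtype=torch.cfloat))

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # (batch, seq, d_model)
        freq = torch.fft.rfft(x, dim=1)                     # mix tokens globally
        freq = freq @ self.weight                           # learned channel mix
        return torch.fft.irfft(freq, n=x.size(1), dim=1)    # back to token space

mixer = FourierTokenMixer(d_model=64)
print(mixer(torch.randn(2, 128, 64)).shape)  # torch.Size([2, 128, 64])
```

The FFT costs O(n log n) in sequence length n versus O(n^2) for self-attention, which is how such mixers scale to sequences around 65k tokens.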
arXiv Detail & Related papers (2021-11-24T05:44:31Z)
- Towards Deep and Efficient: A Deep Siamese Self-Attention Fully Efficient Convolutional Network for Change Detection in VHR Images [28.36808011351123]
We present a very deep and efficient CD network, named EffCDNet.
In EffCDNet, an efficient convolution consisting of depth-wise convolution and group convolution with a channel shuffle mechanism is introduced.
On two challenging CD datasets, our approach outperforms other SOTA FCN-based methods.
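That efficient convolution (depth-wise convolution plus group convolution with a channel shuffle) is a standard recipe and can be sketched directly; channel counts and kernel sizes below are illustrative:

```python
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across groups so group convolutions exchange information."""
    b, c, h, w = x.shape
    return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

class EfficientConv(nn.Module):
    """Depth-wise 3x3 conv + 1x1 group conv, followed by a channel shuffle."""
    def __init__(self, channels: int = 64, groups: int = 4):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.groupconv = nn.Conv2d(channels, channels, 1, groups=groups)
        self.groups = groups

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return channel_shuffle(self.groupconv(self.depthwise(x)), self.groups)

block = EfficientConv()
print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```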
arXiv Detail & Related papers (2021-08-18T14:02:38Z)
- Unifying Global-Local Representations in Salient Object Detection with Transformer [55.23033277636774]
We introduce a new attention-based encoder, the vision transformer, into salient object detection.
With its global view even in very shallow layers, the transformer encoder preserves more local representations.
Our method significantly outperforms other FCN-based and transformer-based methods on five benchmarks.
arXiv Detail & Related papers (2021-08-05T17:51:32Z)
- ResT: An Efficient Transformer for Visual Recognition [5.807423409327807]
This paper presents an efficient multi-scale vision Transformer, called ResT, that capably serves as a general-purpose backbone for image recognition.
We show that the proposed ResT can outperform recent state-of-the-art backbones by a large margin, demonstrating its potential as a strong backbone.
arXiv Detail & Related papers (2021-05-28T08:53:54Z)
- Rate Region for Indirect Multiterminal Source Coding in Federated Learning [49.574683687858126]
A large number of edge devices send their local model updates to the edge server at each round.
Existing works do not exploit the correlations in the information transmitted by different edges.
This paper studies the rate region for indirect multiterminal source coding in FL.
arXiv Detail & Related papers (2021-01-21T16:17:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.