Fast-FNet: Accelerating Transformer Encoder Models via Efficient Fourier
Layers
- URL: http://arxiv.org/abs/2209.12816v2
- Date: Tue, 16 May 2023 13:16:00 GMT
- Title: Fast-FNet: Accelerating Transformer Encoder Models via Efficient Fourier
Layers
- Authors: Nurullah Sevim, Ege Ozan Özyedek, Furkan Şahinuç, Aykut Koç
- Abstract summary: Transformer-based language models utilize the attention mechanism for substantial performance improvements in almost all natural language processing (NLP) tasks.
Recent works have focused on eliminating this computational inefficiency and showed that transformer-based models can still reach competitive results without the attention layer.
A pioneering study proposed the FNet, which replaces the attention layer with the Fourier Transform (FT) in the transformer encoder architecture.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based language models utilize the attention mechanism for
substantial performance improvements in almost all natural language processing
(NLP) tasks. Similar attention structures are also extensively studied in
several other areas. Although the attention mechanism enhances model
performance significantly, its quadratic complexity prevents efficient
processing of long sequences. Recent works have focused on eliminating this
computational inefficiency and showed that transformer-based models can still
reach competitive results without the attention layer. A
pioneering study proposed the FNet, which replaces the attention layer with the
Fourier Transform (FT) in the transformer encoder architecture. FNet achieves
competitive performance relative to the original transformer encoder model
while accelerating the training process by removing the computational burden of
the attention mechanism. However, the FNet model ignores essential properties
of the FT from classical signal processing that can be leveraged to increase
model efficiency further. We propose different methods to deploy FT efficiently
in transformer encoder models. Our proposed architectures have a smaller number
of model parameters, shorter training times, and lower memory usage, with some
additional performance improvements. We demonstrate these improvements through
extensive experiments on common benchmarks.
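For readers unfamiliar with the FNet baseline that this work builds on, the sketch below shows the core idea in PyTorch: the self-attention sublayer is replaced by a parameter-free 2D FFT over the sequence and hidden dimensions, keeping only the real part. This is a minimal illustration, not the authors' implementation; module names and dimensions are assumptions.

```python
import torch
import torch.nn as nn


class FourierMixing(nn.Module):
    """FNet-style token mixing: replace self-attention with a 2D FFT.

    Applies an FFT over the hidden and sequence dimensions and keeps the
    real part, so the mixing step has no learnable parameters.
    """

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden) -> same shape, real-valued
        return torch.fft.fft2(x, dim=(-2, -1)).real


class FNetEncoderBlock(nn.Module):
    """One encoder block: Fourier mixing + feed-forward, each with residual + LayerNorm."""

    def __init__(self, hidden: int, ff_dim: int):
        super().__init__()
        self.mixing = FourierMixing()
        self.norm1 = nn.LayerNorm(hidden)
        self.ff = nn.Sequential(nn.Linear(hidden, ff_dim), nn.GELU(), nn.Linear(ff_dim, hidden))
        self.norm2 = nn.LayerNorm(hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.norm1(x + self.mixing(x))
        return self.norm2(x + self.ff(x))
```

For example, FNetEncoderBlock(768, 3072) can be applied to a (batch, seq_len, 768) tensor in place of a standard attention-based encoder block.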
Related papers
- Differential Transformer [99.5117269150629]
Transformer tends to overallocate attention to irrelevant context.
We introduce Diff Transformer, which amplifies attention to relevant context while canceling noise.
It offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers.
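As a rough illustration of the mechanism summarized above (not the authors' implementation), differential attention can be sketched as the difference of two softmax attention maps, weighted by a learnable scalar so that attention noise common to both maps cancels:

```python
import torch
import torch.nn as nn


class DiffAttentionSketch(nn.Module):
    """Illustrative single-head differential attention.

    Two independent softmax attention maps are computed and subtracted,
    weighted by a learnable scalar, before being applied to the values.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.q1, self.k1 = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.q2, self.k2 = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.lam = nn.Parameter(torch.tensor(0.5))  # learnable weight of the second map
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a1 = torch.softmax(self.q1(x) @ self.k1(x).transpose(-2, -1) * self.scale, dim=-1)
        a2 = torch.softmax(self.q2(x) @ self.k2(x).transpose(-2, -1) * self.scale, dim=-1)
        return (a1 - self.lam * a2) @ self.v(x)
```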
arXiv Detail & Related papers (2024-10-07T17:57:38Z)
- Investigating Low-Rank Training in Transformer Language Models: Efficiency and Scaling Analysis [16.253898272659242]
This study focuses on Transformer-based LLMs, specifically applying low-rank parametrization to feedforward networks (FFNs).
Experiments on the large RefinedWeb dataset show that low-rank parametrization is both efficient (e.g., a 2.6× FFN speed-up with 32% of the parameters) and effective during training.
Motivated by this finding, we develop wide and structured networks that surpass current medium-sized and large Transformers in perplexity and throughput.
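A minimal sketch of the kind of low-rank FFN parametrization discussed above, assuming a standard Transformer feed-forward block: each dense weight is factored into two thin matrices of rank r (module names are illustrative, not from the paper's code).

```python
import torch.nn as nn


class LowRankLinear(nn.Module):
    """Replace a d_in x d_out weight with a rank-r factorization (two thin matrices)."""

    def __init__(self, d_in: int, d_out: int, rank: int):
        super().__init__()
        self.down = nn.Linear(d_in, rank, bias=False)  # d_in -> r
        self.up = nn.Linear(rank, d_out, bias=True)    # r -> d_out

    def forward(self, x):
        return self.up(self.down(x))


class LowRankFFN(nn.Module):
    """Feed-forward block with both projections parametrized in low rank."""

    def __init__(self, hidden: int, ff_dim: int, rank: int):
        super().__init__()
        self.net = nn.Sequential(
            LowRankLinear(hidden, ff_dim, rank),
            nn.GELU(),
            LowRankLinear(ff_dim, hidden, rank),
        )

    def forward(self, x):
        return self.net(x)
```

The parameter count drops from roughly 2 * hidden * ff_dim to 2 * rank * (hidden + ff_dim), which is where the reported speed-up and parameter savings come from.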
arXiv Detail & Related papers (2024-07-13T10:08:55Z)
- Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms based on low-rank computation achieve impressive performance for learning Transformer-based adaptation.
We analyze how magnitude-based pruning affects generalization while improving adaptation.
We conclude that proper magnitude-based pruning has only a slight effect on the testing performance.
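For illustration only, a minimal sketch of magnitude-based pruning as commonly defined (not the paper's exact procedure): entries with the smallest absolute values are masked to zero, keeping a target fraction of the weights.

```python
import torch


def magnitude_prune(weight: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Zero out all but the largest-magnitude entries of a weight tensor."""
    k = max(1, int(keep_ratio * weight.numel()))
    # k-th largest magnitude = (numel - k + 1)-th smallest magnitude.
    threshold = weight.abs().flatten().kthvalue(weight.numel() - k + 1).values
    mask = (weight.abs() >= threshold).to(weight.dtype)
    return weight * mask


# Example: keep the top 30% of entries by magnitude.
pruned = magnitude_prune(torch.randn(512, 512), keep_ratio=0.3)
```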
arXiv Detail & Related papers (2024-06-24T23:00:58Z)
- Fourier Controller Networks for Real-Time Decision-Making in Embodied Learning [42.862705980039784]
The Transformer has shown promise in reinforcement learning for modeling time-varying features.
However, it still suffers from low data efficiency and high inference latency.
In this paper, we propose to investigate the task from a new perspective of the frequency domain.
arXiv Detail & Related papers (2024-05-30T09:43:59Z)
- Efficient Language Model Architectures for Differentially Private Federated Learning [21.280600854272716]
Cross-device federated learning (FL) is a technique that trains a model on data distributed across typically millions of edge devices without data leaving the devices.
In centralized training of language models, adaptive optimizers are preferred as they offer improved stability and performance.
We propose a scale-invariant Coupled Input Forget Gate (SI CIFG) recurrent network by modifying the sigmoid and tanh activations in the recurrent cell.
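For context, a Coupled Input Forget Gate (CIFG) cell is an LSTM variant whose input gate is tied to the forget gate (input gate = 1 - forget gate). The sketch below shows a plain CIFG cell; the paper's scale-invariant modification of the sigmoid and tanh activations is not reproduced here, and the class name is an assumption.

```python
import torch
import torch.nn as nn


class CIFGCell(nn.Module):
    """Coupled Input Forget Gate cell: the input gate is 1 - forget gate."""

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        # One projection producing forget-gate, output-gate and candidate pre-activations.
        self.proj = nn.Linear(input_size + hidden_size, 3 * hidden_size)

    def forward(self, x, state):
        h, c = state
        f, o, g = self.proj(torch.cat([x, h], dim=-1)).chunk(3, dim=-1)
        f, o, g = torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c + (1.0 - f) * g  # coupled gates: input gate = 1 - forget gate
        h = o * torch.tanh(c)
        return h, (h, c)
```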
arXiv Detail & Related papers (2024-03-12T22:21:48Z)
- Fourier Transformer: Fast Long Range Modeling by Removing Sequence Redundancy with FFT Operator [24.690247474891958]
The Fourier Transformer is able to significantly reduce computational costs while retaining the ability to inherit from various large pretrained models.
Our model achieves state-of-the-art performance among all transformer-based models on the long-range modeling benchmark LRA.
For generative seq-to-seq tasks, including CNN/DailyMail and ELI5, our model outperforms the standard BART by inheriting the BART weights.
arXiv Detail & Related papers (2023-05-24T12:33:06Z)
- LegoNet: A Fast and Exact Unlearning Architecture [59.49058450583149]
Machine unlearning aims to erase the impact of specific training samples from a trained model upon deletion requests.
We present a novel network, namely LegoNet, which adopts the framework of "fixed encoder + multiple adapters".
We show that LegoNet accomplishes fast and exact unlearning while maintaining acceptable performance, synthetically outperforming unlearning baselines.
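As a purely hypothetical sketch of a "fixed encoder + multiple adapters" design (not LegoNet's actual architecture): the shared encoder is frozen, each adapter is trained on its own data shard, and unlearning a deleted sample reduces to resetting and retraining only the affected adapter.

```python
import torch
import torch.nn as nn


class FixedEncoderWithAdapters(nn.Module):
    """Frozen shared encoder followed by several small, independently trainable adapters."""

    def __init__(self, encoder: nn.Module, hidden: int, num_adapters: int, num_classes: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():  # the encoder is fixed and never updated
            p.requires_grad = False
        self.adapters = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, num_classes))
            for _ in range(num_adapters)
        ])

    def forward(self, x: torch.Tensor, adapter_idx: int) -> torch.Tensor:
        # Assumes the encoder maps x to features of size `hidden`.
        return self.adapters[adapter_idx](self.encoder(x))

    def reset_adapter(self, adapter_idx: int) -> None:
        """Exact-unlearning sketch: discard and reinitialize only the affected adapter."""
        for layer in self.adapters[adapter_idx]:
            if isinstance(layer, nn.Linear):
                layer.reset_parameters()
```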
arXiv Detail & Related papers (2022-10-28T09:53:05Z)
- Global Filter Networks for Image Classification [90.81352483076323]
We present a conceptually simple yet computationally efficient architecture that learns long-term spatial dependencies in the frequency domain with log-linear complexity.
Our results demonstrate that GFNet can be a very competitive alternative to transformer-style models and CNNs in efficiency, generalization ability and robustness.
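A minimal sketch of the global filtering idea described above: features are moved to the frequency domain with a 2D FFT, multiplied elementwise by a learnable complex filter, and transformed back, giving token mixing with log-linear complexity. Shapes and names are assumptions, not the released GFNet code.

```python
import torch
import torch.nn as nn


class GlobalFilterLayer(nn.Module):
    """Frequency-domain token mixing with a learnable elementwise complex filter."""

    def __init__(self, height: int, width: int, channels: int):
        super().__init__()
        # Learnable complex filter stored as (real, imag) pairs for the rfft2 output grid.
        self.filter = nn.Parameter(torch.randn(height, width // 2 + 1, channels, 2) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, height, width, channels)
        freq = torch.fft.rfft2(x, dim=(1, 2))
        freq = freq * torch.view_as_complex(self.filter)
        return torch.fft.irfft2(freq, s=x.shape[1:3], dim=(1, 2))
```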
arXiv Detail & Related papers (2021-07-01T17:58:16Z)
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT).
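A minimal sketch of the underlying numerical trick: a Toeplitz matrix-vector product can be embedded into a circulant matrix and evaluated with the FFT in O(n log n). This NumPy example is illustrative and independent of the paper's kernelized-attention formulation.

```python
import numpy as np


def toeplitz_matvec_fft(first_col: np.ndarray, first_row: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Multiply a Toeplitz matrix (given by its first column/row) with x via FFT.

    The Toeplitz matrix is embedded in a 2n-point circulant matrix, whose
    matvec is a cyclic convolution and therefore diagonalized by the FFT.
    """
    n = len(x)
    # First column of the circulant embedding: [column, 0, reversed trailing row].
    c = np.concatenate([first_col, [0.0], first_row[:0:-1]])
    x_pad = np.concatenate([x, np.zeros(n)])
    y = np.fft.ifft(np.fft.fft(c) * np.fft.fft(x_pad))
    return y[:n].real


# Quick check against the dense product.
col, row = np.random.randn(4), np.random.randn(4)
row[0] = col[0]
T = np.array([[col[i - j] if i >= j else row[j - i] for j in range(4)] for i in range(4)])
x = np.random.randn(4)
assert np.allclose(T @ x, toeplitz_matvec_fft(col, row, x))
```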
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
- Efficient pre-training objectives for Transformers [84.64393460397471]
We study several efficient pre-training objectives for Transformer-based models.
We prove that eliminating the MASK token and computing the loss over the whole output are essential choices for improving performance.
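A minimal sketch of those two choices, assuming a standard masked-language-modeling setup: corrupt inputs by random token substitution instead of a special MASK token, and compute the cross-entropy over every output position rather than only the corrupted ones. The function and model interface here are hypothetical.

```python
import torch
import torch.nn.functional as F


def whole_output_loss(model, input_ids: torch.Tensor, vocab_size: int, corrupt_prob: float = 0.15):
    """Corrupt tokens by random substitution (no MASK token) and score all positions."""
    corrupted = input_ids.clone()
    noise_mask = torch.rand_like(input_ids, dtype=torch.float) < corrupt_prob
    random_tokens = torch.randint_like(input_ids, vocab_size)
    corrupted[noise_mask] = random_tokens[noise_mask]

    logits = model(corrupted)  # assumed shape: (batch, seq_len, vocab_size)
    # Loss over the whole output, not just the corrupted positions.
    return F.cross_entropy(logits.reshape(-1, vocab_size), input_ids.reshape(-1))
```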
arXiv Detail & Related papers (2021-04-20T00:09:37Z)
- Efficient Trainable Front-Ends for Neural Speech Enhancement [22.313111311130665]
We present an efficient, trainable front-end based on the butterfly mechanism to compute the Fast Fourier Transform.
We show its accuracy and efficiency benefits for low-compute neural speech enhancement models.
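For reference, the sketch below shows the radix-2 Cooley-Tukey butterfly that such a front-end builds on, in plain NumPy; the paper's trainable front-end presumably parameterizes such butterfly stages with learnable weights, which is not reproduced here.

```python
import numpy as np


def fft_radix2(x: np.ndarray) -> np.ndarray:
    """Recursive radix-2 Cooley-Tukey FFT; the length of x must be a power of two."""
    n = len(x)
    if n == 1:
        return x.astype(complex)
    even, odd = fft_radix2(x[0::2]), fft_radix2(x[1::2])
    twiddle = np.exp(-2j * np.pi * np.arange(n // 2) / n)
    # Butterfly: combine the two half-size transforms with twiddle factors.
    return np.concatenate([even + twiddle * odd, even - twiddle * odd])


x = np.random.randn(8)
assert np.allclose(fft_radix2(x), np.fft.fft(x))
```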
arXiv Detail & Related papers (2020-02-20T01:51:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences.