A Practical Survey on Faster and Lighter Transformers
- URL: http://arxiv.org/abs/2103.14636v2
- Date: Mon, 27 Mar 2023 15:10:28 GMT
- Title: A Practical Survey on Faster and Lighter Transformers
- Authors: Quentin Fournier, Gaétan Marceau Caron, and Daniel Aloise
- Abstract summary: The Transformer is a model solely based on the attention mechanism that is able to relate any two positions of the input sequence.
It has improved the state-of-the-art across numerous sequence modelling tasks.
However, its effectiveness comes at the expense of a quadratic computational and memory complexity with respect to the sequence length.
- Score: 0.9176056742068811
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recurrent neural networks are effective models to process sequences. However,
they are unable to learn long-term dependencies because of their inherent
sequential nature. As a solution, Vaswani et al. introduced the Transformer, a
model solely based on the attention mechanism that is able to relate any two
positions of the input sequence, hence modelling arbitrarily long dependencies.
The Transformer has improved the state-of-the-art across numerous sequence
modelling tasks. However, its effectiveness comes at the expense of a quadratic
computational and memory complexity with respect to the sequence length,
hindering its adoption. Fortunately, the deep learning community has always
been interested in improving the models' efficiency, leading to a plethora of
solutions such as parameter sharing, pruning, mixed-precision, and knowledge
distillation. Recently, researchers have directly addressed the Transformer's
limitation by designing lower-complexity alternatives such as the Longformer,
Reformer, Linformer, and Performer. However, due to the wide range of
solutions, it has become challenging for researchers and practitioners to
determine which methods to apply in practice in order to meet the desired
trade-off between capacity, computation, and memory. This survey addresses this
issue by investigating popular approaches to make Transformers faster and
lighter and by providing a comprehensive explanation of the methods' strengths,
limitations, and underlying assumptions.
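To make the complexity argument concrete, the sketch below contrasts standard scaled dot-product attention, whose score matrix grows quadratically with the sequence length, with a Linformer-style low-rank approximation of the kind covered by the survey. This is a minimal, single-head illustration in PyTorch, not the survey's or the Linformer's reference implementation: the function names, the projection matrix `proj`, and the sizes `n`, `d`, and `r` are assumptions chosen for clarity.

```python
import torch
import torch.nn.functional as F

def full_attention(q, k, v):
    """Standard scaled dot-product attention.

    The score matrix has shape (n, n), so time and memory grow
    quadratically with the sequence length n.
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5            # (n, n)
    return F.softmax(scores, dim=-1) @ v                   # (n, d)

def linformer_style_attention(q, k, v, proj):
    """Low-rank approximation in the spirit of the Linformer.

    Keys and values are projected from length n down to a fixed
    length r with a matrix `proj` of shape (r, n) (learned in the
    real model), so the score matrix is only (n, r): linear in n.
    """
    d = q.size(-1)
    k_r = proj @ k                                          # (r, d)
    v_r = proj @ v                                          # (r, d)
    scores = q @ k_r.transpose(-2, -1) / d ** 0.5           # (n, r)
    return F.softmax(scores, dim=-1) @ v_r                  # (n, d)

if __name__ == "__main__":
    n, d, r = 4096, 64, 256                                 # illustrative sizes
    q, k, v = (torch.randn(n, d) for _ in range(3))
    proj = torch.randn(r, n) / n ** 0.5                     # stand-in for a learned projection
    out_full = full_attention(q, k, v)                      # builds an O(n^2) score matrix
    out_low_rank = linformer_style_attention(q, k, v, proj) # builds an O(n * r) score matrix
    print(out_full.shape, out_low_rank.shape)               # both (4096, 64)
```

The point of the comparison is only the shape of the intermediate score matrix; the quality of the approximation depends on how well the keys and values can be compressed, which is exactly the kind of assumption the survey examines.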
Related papers
- State-Space Modeling in Long Sequence Processing: A Survey on Recurrence in the Transformer Era [59.279784235147254]
This survey provides an in-depth summary of the latest approaches that are based on recurrent models for sequential data processing.
The emerging picture suggests that there is room for thinking of novel routes, constituted by learning algorithms which depart from the standard Backpropagation Through Time.
arXiv Detail & Related papers (2024-06-13T12:51:22Z) - On the Resurgence of Recurrent Models for Long Sequences -- Survey and Research Opportunities in the Transformer Era [59.279784235147254]
This survey is aimed at providing an overview of these trends framed under the unifying umbrella of Recurrence.
It emphasizes novel research opportunities that become prominent when abandoning the idea of processing long sequences.
arXiv Detail & Related papers (2024-02-12T23:55:55Z) - Large Sequence Models for Sequential Decision-Making: A Survey [33.35835438923926]
The Transformer has attracted increasing interest in the RL communities, spawning numerous approaches with notable effectiveness and generalizability.
This paper puts forth various potential avenues for future research intending to improve the effectiveness of large sequence models for sequential decision-making.
arXiv Detail & Related papers (2023-06-24T12:06:26Z) - Decision S4: Efficient Sequence-Based RL via State Spaces Layers [87.3063565438089]
We present an off-policy training procedure that works with trajectories, while still maintaining the training efficiency of the S4 model.
We also present an on-policy training procedure that is trained in a recurrent manner, benefits from long-range dependencies, and is based on a novel stable actor-critic mechanism.
arXiv Detail & Related papers (2023-06-08T13:03:53Z) - Robust representations of oil wells' intervals via sparse attention mechanism [2.604557228169423]
We introduce the class of efficient Transformers named Regularized Transformers (Reguformers).
The focus of our experiments is on oil and gas data, namely, well logs.
To evaluate our models for such problems, we work with an industry-scale open dataset consisting of well logs of more than 20 wells.
arXiv Detail & Related papers (2022-12-29T09:56:33Z) - Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to $\times 3$ -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z) - Continual Learning with Transformers for Image Classification [12.028617058465333]
In computer vision, neural network models struggle to continually learn new concepts without forgetting what has been learnt in the past.
We develop a solution called Adaptive Distillation of Adapters (ADA) to perform continual learning.
We empirically demonstrate on different classification tasks that this method maintains a good predictive performance without retraining the model.
arXiv Detail & Related papers (2022-06-28T15:30:10Z) - Adaptive Multi-Resolution Attention with Linear Complexity [18.64163036371161]
We propose a novel structure named Adaptive Multi-Resolution Attention (AdaMRA for short).
We leverage a multi-resolution multi-head attention mechanism, enabling attention heads to capture long-range contextual information in a coarse-to-fine fashion.
To facilitate AdaMRA utilization by the scientific community, the code implementation will be made publicly available.
arXiv Detail & Related papers (2021-08-10T23:17:16Z) - Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart (a minimal sketch of the underlying recurrence appears after this list).
arXiv Detail & Related papers (2021-03-24T10:50:43Z) - The Cascade Transformer: an Application for Efficient Answer Sentence Selection [116.09532365093659]
We introduce the Cascade Transformer, a technique to adapt transformer-based models into a cascade of rankers.
When compared to a state-of-the-art transformer model, our approach reduces computation by 37% with almost no impact on accuracy.
arXiv Detail & Related papers (2020-05-05T23:32:01Z)
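Several entries above, notably "Finetuning Pretrained Transformers into RNNs", rest on the idea that a kernelised, linear-complexity attention can be evaluated as a recurrence, so the model behaves like an RNN at generation time. The sketch below shows that recurrence in isolation, under assumptions: the feature map `phi` (an ELU-plus-one map, as used in linear-attention work) and all tensor sizes are illustrative, and this is not the cited paper's conversion or finetuning procedure.

```python
import torch
import torch.nn.functional as F

def phi(x):
    # Illustrative positive feature map; the kernel actually used when
    # converting a pretrained Transformer is a modelling choice.
    return F.elu(x) + 1.0

def linear_attention_recurrent(q, k, v, eps=1e-6):
    """Causal linear attention computed step by step.

    Instead of materialising an (n, n) score matrix, two running
    statistics are updated per position: S (d x d), the sum of
    outer products phi(k_t) v_t^T, and z (d), the sum of phi(k_t).
    Time is linear in n and memory is constant in n.
    """
    n, d = q.shape
    S = torch.zeros(d, d)
    z = torch.zeros(d)
    outputs = []
    for t in range(n):
        qt, kt, vt = phi(q[t]), phi(k[t]), v[t]
        S = S + torch.outer(kt, vt)
        z = z + kt
        outputs.append((qt @ S) / (qt @ z + eps))
    return torch.stack(outputs)    # (n, d)

if __name__ == "__main__":
    n, d = 128, 64                 # illustrative sizes
    q, k, v = (torch.randn(n, d) for _ in range(3))
    print(linear_attention_recurrent(q, k, v).shape)  # (128, 64)
```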
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.