A Survey on Efficient Training of Transformers
- URL: http://arxiv.org/abs/2302.01107v3
- Date: Thu, 4 May 2023 01:23:12 GMT
- Title: A Survey on Efficient Training of Transformers
- Authors: Bohan Zhuang, Jing Liu, Zizheng Pan, Haoyu He, Yuetian Weng, Chunhua Shen
- Abstract summary: This survey provides the first systematic overview of the efficient training of Transformers.
We analyze and compare methods that save computation and memory costs for intermediate tensors during training, together with techniques on hardware/algorithm co-design.
- Score: 72.31868024970674
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in Transformers have come with a huge requirement on
computing resources, highlighting the importance of developing efficient
training techniques to make Transformer training faster, at lower cost, and with
higher accuracy by the efficient use of computation and memory resources. This
survey provides the first systematic overview of the efficient training of
Transformers, covering the recent progress in acceleration arithmetic and
hardware, with a focus on the former. We analyze and compare methods that save
computation and memory costs for intermediate tensors during training, together
with techniques on hardware/algorithm co-design. We finally discuss challenges
and promising areas for future research.
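As one concrete illustration of the kind of techniques the survey covers for reducing the cost of intermediate tensors, the following is a minimal PyTorch sketch (not taken from the paper) that combines activation checkpointing with automatic mixed precision; the model, shapes, and hyperparameters are placeholder assumptions.

```python
# A minimal sketch of two widely used ways to shrink the memory held by
# intermediate tensors during Transformer training: activation checkpointing
# (recompute activations in the backward pass instead of storing them) and
# automatic mixed precision (keep activations in fp16/bf16).
# The model, shapes, and hyperparameters are illustrative placeholders.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

device = "cuda" if torch.cuda.is_available() else "cpu"
blocks = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True) for _ in range(4)]
).to(device)
opt = torch.optim.AdamW(blocks.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.randn(8, 128, 256, device=device)          # (batch, sequence, hidden)
with torch.autocast(device_type=device,
                    dtype=torch.float16 if device == "cuda" else torch.bfloat16):
    h = x
    for blk in blocks:
        # Recompute this block's activations in the backward pass rather than caching them.
        h = checkpoint(blk, h, use_reentrant=False)
    loss = h.float().pow(2).mean()                    # dummy loss, just to drive backward()

scaler.scale(loss).backward()
scaler.step(opt)
scaler.update()
```

Checkpointing trades extra recomputation for activation memory, while mixed precision roughly halves the storage of the activations that are kept.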
Related papers
- Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms such as low-rank computation achieve impressive performance for learning Transformer-based adaptation.
We analyze how magnitude-based pruning affects generalization while improving adaptation.
We conclude that proper magnitude-based pruning has only a slight effect on the testing performance.
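As a hedged illustration of the kind of low-rank computation this summary refers to, here is a generic low-rank adapter sketch in PyTorch; the wrapped layer, rank, and initialization are assumptions for demonstration, not the paper's construction.

```python
# Hedged sketch of low-rank adaptation: a frozen dense weight W is adapted with
# a trainable low-rank update B @ A, so training touches far fewer parameters.
# Shapes, rank, and the wrapped layer are illustrative assumptions only.
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    def __init__(self, linear: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = linear
        self.base.weight.requires_grad_(False)           # freeze the pretrained weight
        d_out, d_in = linear.weight.shape
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))  # zero-init: training starts at the base model

    def forward(self, x):
        return self.base(x) + x @ self.A.t() @ self.B.t()

adapted = LowRankAdapter(nn.Linear(256, 256), rank=8)
out = adapted(torch.randn(4, 256))
print(out.shape)  # torch.Size([4, 256])
```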
arXiv Detail & Related papers (2024-06-24T23:00:58Z)
- Early Transformers: A study on Efficient Training of Transformer Models through Early-Bird Lottery Tickets [0.0]
This paper investigates the applicability of the early-bird ticket hypothesis to optimize the training efficiency of Transformer models.
We propose a methodology that combines iterative pruning, masked distance calculation, and selective retraining to identify early-bird tickets.
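A minimal sketch of the masked-distance idea mentioned above, assuming magnitude-based pruning masks and a hand-picked distance threshold: consecutive masks are compared, and dense training can stop once the mask stabilizes (an "early-bird ticket").

```python
# Hedged sketch of the early-bird idea: draw a magnitude-based pruning mask at
# intervals during training and compare consecutive masks; once the mask stops
# changing (small distance), an "early-bird ticket" has emerged.
# The sparsity level and threshold are illustrative choices.
import torch
import torch.nn as nn

def magnitude_mask(model: nn.Module, sparsity: float = 0.5) -> torch.Tensor:
    weights = torch.cat([p.detach().abs().flatten()
                         for n, p in model.named_parameters() if "weight" in n])
    k = int(sparsity * weights.numel())
    threshold = weights.kthvalue(k).values
    return (weights > threshold).float()               # 1 = kept, 0 = pruned

def mask_distance(m1: torch.Tensor, m2: torch.Tensor) -> float:
    return (m1 != m2).float().mean().item()            # normalized Hamming distance

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))
prev_mask = None
for epoch in range(20):
    # ... one epoch of normal training would go here ...
    mask = magnitude_mask(model, sparsity=0.5)
    if prev_mask is not None and mask_distance(mask, prev_mask) < 0.01:
        print(f"early-bird ticket detected at epoch {epoch}")
        break
    prev_mask = mask
```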
arXiv Detail & Related papers (2024-05-02T23:03:45Z)
- LOTUS: Improving Transformer Efficiency with Sparsity Pruning and Data Lottery Tickets [0.0]
Vision transformers have revolutionized computer vision, but their computational demands present challenges for training and deployment.
This paper introduces LOTUS, a novel method that leverages data lottery ticket selection and sparsity pruning to accelerate vision transformer training while maintaining accuracy.
arXiv Detail & Related papers (2024-05-01T23:30:12Z)
- From Coarse to Fine: Efficient Training for Audio Spectrogram Transformers [16.90294414874585]
We introduce multi-phase training of audio spectrogram transformers by connecting the idea of coarse-to-fine with transformer models.
With one of these coarse-to-fine methods, the transformer model learns from lower-resolution (coarse) data in the initial phases and is then fine-tuned on high-resolution data in a curriculum learning strategy.
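A minimal sketch of such a coarse-to-fine curriculum for spectrogram inputs, assuming a hypothetical phase schedule and bilinear downsampling rather than the paper's exact recipe:

```python
# Hedged sketch: early training phases see time/frequency-downsampled (coarse)
# spectrograms, later phases see full resolution. The phase schedule and scale
# factors are illustrative assumptions.
import torch
import torch.nn.functional as F

def coarsen(spec: torch.Tensor, scale: float) -> torch.Tensor:
    """Downsample a (batch, channel, freq, time) spectrogram by `scale` in both axes."""
    if scale >= 1.0:
        return spec
    return F.interpolate(spec, scale_factor=scale, mode="bilinear", align_corners=False)

phase_schedule = [(0, 0.25), (10, 0.5), (20, 1.0)]    # (start_epoch, resolution scale)

def resolution_for_epoch(epoch: int) -> float:
    scale = phase_schedule[0][1]
    for start, s in phase_schedule:
        if epoch >= start:
            scale = s
    return scale

spec = torch.randn(4, 1, 128, 1000)                   # (batch, channel, mel bins, frames)
for epoch in [0, 12, 25]:
    coarse = coarsen(spec, resolution_for_epoch(epoch))
    print(epoch, tuple(coarse.shape))                 # the model would be trained on `coarse`
```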
arXiv Detail & Related papers (2024-01-16T14:59:37Z) - Supervised Pretraining Can Learn In-Context Reinforcement Learning [96.62869749926415]
In this paper, we study the in-context learning capabilities of transformers in decision-making problems.
We introduce and study Decision-Pretrained Transformer (DPT), a supervised pretraining method where the transformer predicts an optimal action.
We find that the pretrained transformer can be used to solve a range of RL problems in-context, exhibiting both exploration online and conservatism offline.
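A hedged sketch of the supervised pretraining objective described above: a transformer reads placeholder in-context interaction tokens plus a query and is trained with cross-entropy to predict the optimal action. The tokenization, sizes, and data below are illustrative assumptions only.

```python
# Hedged sketch of supervised pretraining for in-context decision-making:
# the transformer sees an in-context dataset of interactions plus a query and
# is trained to output the optimal action for that query.
import torch
import torch.nn as nn

n_actions, d_model, ctx_len = 5, 64, 16
embed = nn.Linear(4, d_model)                         # embed placeholder interaction tuples
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True), num_layers=2)
head = nn.Linear(d_model, n_actions)

# Fake pretraining batch: random context tokens and random "optimal" action labels.
tokens = torch.randn(8, ctx_len, 4)                   # in-context interactions + query token
optimal_action = torch.randint(0, n_actions, (8,))    # supervision from a task distribution

logits = head(encoder(embed(tokens))[:, -1])          # predict from the final (query) position
loss = nn.functional.cross_entropy(logits, optimal_action)
loss.backward()
```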
arXiv Detail & Related papers (2023-06-26T17:58:50Z) - A Memory Transformer Network for Incremental Learning [64.0410375349852]
We study class-incremental learning, a training setup in which new classes of data are observed over time for the model to learn from.
Despite the straightforward problem formulation, the naive application of classification models to class-incremental learning results in the "catastrophic forgetting" of previously seen classes.
One of the most successful existing approaches is an exemplar memory, which mitigates catastrophic forgetting by saving a subset of past data into a memory bank and replaying it when training on subsequent tasks.
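A minimal sketch of exemplar replay, assuming random exemplar selection and a fixed per-class budget (real methods, e.g. herding-based selection, are more elaborate):

```python
# Hedged sketch of exemplar replay: keep a small memory of past examples per
# class and mix them into batches for the current task to mitigate forgetting.
# Selection is random here; capacity and sampling are illustrative.
import random
import torch

class ExemplarMemory:
    def __init__(self, per_class: int = 20):
        self.per_class = per_class
        self.store = {}                                # class id -> list of tensors

    def add_class(self, label: int, examples: list):
        self.store[label] = random.sample(examples, min(self.per_class, len(examples)))

    def sample(self, n: int):
        pool = [(x, y) for y, xs in self.store.items() for x in xs]
        batch = random.sample(pool, min(n, len(pool)))
        xs, ys = zip(*batch)
        return torch.stack(xs), torch.tensor(ys)

memory = ExemplarMemory(per_class=5)
memory.add_class(0, [torch.randn(3, 32, 32) for _ in range(50)])   # stored after task 1
memory.add_class(1, [torch.randn(3, 32, 32) for _ in range(50)])
replay_x, replay_y = memory.sample(8)                 # concatenate with the current task's batch
print(replay_x.shape, replay_y.tolist())
```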
arXiv Detail & Related papers (2022-10-10T08:27:28Z) - Efficient Few-Shot Object Detection via Knowledge Inheritance [62.36414544915032]
Few-shot object detection (FSOD) aims at learning a generic detector that can adapt to unseen tasks with scarce training samples.
We present an efficient pretrain-transfer framework (PTF) baseline that adds no extra computation.
We also propose an adaptive length re-scaling (ALR) strategy to alleviate the vector length inconsistency between the predicted novel weights and the pretrained base weights.
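A hedged sketch of length re-scaling as described in the summary: novel-class weight vectors are rescaled so their norms match the average norm of the pretrained base-class weights. The shapes and the exact scaling rule are assumptions, not the paper's formulation.

```python
# Hedged sketch: rescale predicted novel-class classifier weights so their
# vector lengths (norms) match the average norm of the pretrained base weights.
import torch

base_weights = torch.randn(60, 256)                   # pretrained base-class classifier weights
novel_weights = 0.1 * torch.randn(5, 256)             # predicted novel-class weights (smaller norm)

target_norm = base_weights.norm(dim=1).mean()         # reference "vector length" from base classes
scale = target_norm / novel_weights.norm(dim=1, keepdim=True)
rescaled = novel_weights * scale                      # novel weights now match the base length scale

print(base_weights.norm(dim=1).mean().item(), rescaled.norm(dim=1).mean().item())
```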
arXiv Detail & Related papers (2022-03-23T06:24:31Z) - Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
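A minimal sketch of the linear-complexity recurrent formulation such a conversion targets: kernelized attention maintained as a running state, so each autoregressive step costs O(d^2) instead of attending over the whole history. The feature map and sizes are common choices used here as assumptions.

```python
# Hedged sketch of linear (kernelized) attention run as a recurrence: a running
# state replaces the growing key/value cache, giving constant memory per step.
import torch

def phi(x):                                           # simple positive feature map
    return torch.nn.functional.elu(x) + 1.0

d = 32
q = torch.randn(100, d)                               # (time, dim) queries
k = torch.randn(100, d)
v = torch.randn(100, d)

S = torch.zeros(d, d)                                 # running sum of phi(k_t) v_t^T
z = torch.zeros(d)                                    # running sum of phi(k_t)
outputs = []
for t in range(q.shape[0]):
    S = S + torch.outer(phi(k[t]), v[t])              # O(d^2) state update per step
    z = z + phi(k[t])
    o = (phi(q[t]) @ S) / (phi(q[t]) @ z + 1e-6)      # attention output for step t
    outputs.append(o)
out = torch.stack(outputs)                            # (time, dim), computed in O(T d^2)
print(out.shape)
```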
arXiv Detail & Related papers (2021-03-24T10:50:43Z) - Accelerating Training of Transformer-Based Language Models with
Progressive Layer Dropping [24.547833264405355]
The proposed method achieves a 24% time reduction on average per sample and allows the pre-training to be 2.5 times faster than the baseline.
While being faster, our pre-trained models retain strong knowledge transferability, achieving comparable and sometimes higher GLUE scores than the baseline.
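A hedged sketch of stochastic layer dropping with a progressive, depth-dependent schedule; the schedule constants below are illustrative and not the paper's exact values.

```python
# Hedged sketch: the probability of skipping a Transformer block grows over
# training and with depth, so early training sees the full model and later
# training a thinner one. Schedule constants are illustrative assumptions.
import torch
import torch.nn as nn

blocks = nn.ModuleList([nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True)
                        for _ in range(6)])

def keep_prob(step: int, depth_frac: float, total_steps: int = 10_000, floor: float = 0.5) -> float:
    global_theta = 1.0 - (1.0 - floor) * min(step / total_steps, 1.0)  # anneal 1.0 -> floor
    return 1.0 - depth_frac * (1.0 - global_theta)                     # deeper blocks drop more

def forward(x: torch.Tensor, step: int, training: bool = True) -> torch.Tensor:
    for i, blk in enumerate(blocks):
        p = keep_prob(step, depth_frac=(i + 1) / len(blocks))
        if training and torch.rand(()) > p:
            continue                                   # skip the block: identity shortcut
        x = blk(x)
    return x

out = forward(torch.randn(4, 32, 128), step=8_000)
print(out.shape)
```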
arXiv Detail & Related papers (2020-10-26T06:50:07Z) - Data Movement Is All You Need: A Case Study on Optimizing Transformers [16.62346773613343]
We present a recipe for globally optimizing data movement in transformers.
We reduce data movement by up to 22.91% and overall achieve a 1.30x performance improvement over state-of-the-art frameworks.
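A hedged sketch of the underlying data-movement point: chains of cheap pointwise ops are memory-bound, so fusing them avoids materializing intermediate tensors. Here torch.jit.script stands in as a readily available fuser; the paper's own recipe and measured gains go well beyond this.

```python
# Hedged sketch: an unfused pointwise chain writes and re-reads full-size
# intermediates, while a scripted version lets the JIT fuser combine the ops
# into fewer kernels on GPU, reducing memory traffic.
import torch

def bias_gelu_residual_unfused(x, bias, residual):
    t = x + bias                                       # materializes a full intermediate tensor
    t = torch.nn.functional.gelu(t)                    # and another one here
    return t + residual

@torch.jit.script
def bias_gelu_residual_fused(x: torch.Tensor, bias: torch.Tensor, residual: torch.Tensor) -> torch.Tensor:
    # Same pointwise chain in one scripted function, eligible for kernel fusion.
    return torch.nn.functional.gelu(x + bias) + residual

x = torch.randn(8, 512, 1024)
bias = torch.randn(1024)
residual = torch.randn(8, 512, 1024)
assert torch.allclose(bias_gelu_residual_unfused(x, bias, residual),
                      bias_gelu_residual_fused(x, bias, residual), atol=1e-6)
```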
arXiv Detail & Related papers (2020-06-30T19:26:36Z)