AutoTrans: Automating Transformer Design via Reinforced Architecture Search
- URL: http://arxiv.org/abs/2009.02070v2
- Date: Sun, 30 May 2021 12:45:31 GMT
- Title: AutoTrans: Automating Transformer Design via Reinforced Architecture Search
- Authors: Wei Zhu, Xiaoling Wang, Xipeng Qiu, Yuan Ni, Guotong Xie
- Abstract summary: This paper empirically explores how to set layer-norm, whether to scale, the number of layers, the number of heads, the activation function, etc., so that one can obtain a transformer architecture that better suits the task at hand.
Experiments on CoNLL03, Multi-30k, IWSLT14, and WMT-14 show that the searched transformer model can outperform the standard transformers.
- Score: 52.48985245743108
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Though transformer architectures have shown dominance in many natural language understanding tasks, there are still unsolved issues in the training of transformer models, especially the need for a principled warm-up scheme, which has proven important for stable transformer training, and the question of whether the task at hand prefers the attention product to be scaled. In this paper, we empirically explore automating the design choices in the transformer model, i.e., how to set layer-norm, whether to scale, the number of layers, the number of heads, the activation function, etc., so that one can obtain a transformer architecture that better suits the task at hand. Reinforcement learning (RL) is employed to navigate the search space, and special parameter-sharing strategies are designed to accelerate the search. We show that sampling a proportion of the training data per epoch during the search helps to improve the search quality. Experiments on CoNLL03, Multi-30k, IWSLT14, and WMT-14 show that the searched transformer model can outperform the standard transformers. In particular, we show that our learned model can be trained more robustly with large learning rates without warm-up.
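The abstract describes an RL controller that navigates a discrete space of transformer design choices (layer-norm placement, attention-product scaling, number of layers and heads, activation function), with parameter sharing and per-epoch subsampling of the training data to speed up the search. The sketch below is a minimal, illustrative REINFORCE-style controller over such a space; the `SEARCH_SPACE` discretization, the `proxy_reward` stub, the data fraction, and the moving-average baseline are assumptions for illustration, not the paper's actual search algorithm or parameter-sharing scheme.

```python
# Minimal sketch (not the authors' code) of an RL controller sampling
# transformer design choices like those listed in the abstract.
import random
import torch

# Hypothetical discretization of the design choices named in the abstract.
SEARCH_SPACE = {
    "layer_norm": ["pre", "post"],       # where to place layer-norm
    "scale_attention": [True, False],    # whether to scale the attention product
    "num_layers": [2, 4, 6, 8],
    "num_heads": [4, 8, 16],
    "activation": ["relu", "gelu", "swish"],
}

# One independent categorical distribution per design dimension.
logits = {k: torch.nn.Parameter(torch.zeros(len(v))) for k, v in SEARCH_SPACE.items()}
optimizer = torch.optim.Adam(logits.values(), lr=0.05)


def sample_architecture():
    """Sample one candidate configuration and keep its log-probability."""
    config, log_prob = {}, 0.0
    for name, choices in SEARCH_SPACE.items():
        dist = torch.distributions.Categorical(logits=logits[name])
        idx = dist.sample()
        config[name] = choices[idx.item()]
        log_prob = log_prob + dist.log_prob(idx)
    return config, log_prob


def proxy_reward(config, data_fraction=0.3):
    """Placeholder: train the candidate (with shared parameters) on a
    `data_fraction` sample of the training set and return a validation score."""
    return random.random()  # stand-in for dev-set F1 / BLEU


baseline = 0.0
for step in range(200):
    config, log_prob = sample_architecture()
    reward = proxy_reward(config)
    baseline = 0.9 * baseline + 0.1 * reward   # moving-average baseline
    loss = -(reward - baseline) * log_prob     # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In a full implementation, `proxy_reward` would train the sampled candidate with shared weights on the subsampled training data and return its validation score (e.g., F1 on CoNLL03 or BLEU on the translation benchmarks).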
Related papers
- Comprehensive Performance Modeling and System Design Insights for Foundation Models [1.4455936781559149]
Generative AI, in particular large transformer models, is increasingly driving HPC system design in science and industry.
We analyze performance characteristics of such transformer models and discuss their sensitivity to the transformer type, parallelization strategy, and HPC system features.
Our analysis emphasizes the need for closer performance modeling of different transformer types keeping system features in mind.
arXiv Detail & Related papers (2024-09-30T22:56:42Z)
- Do Efficient Transformers Really Save Computation? [32.919672616480135]
We focus on the capabilities and limitations of efficient Transformers, specifically the Sparse Transformer and the Linear Transformer.
Our results show that while these models are expressive enough to solve general DP tasks, contrary to expectations, they require a model size that scales with the problem size.
We identify a class of DP problems for which these models can be more efficient than the standard Transformer.
arXiv Detail & Related papers (2024-02-21T17:00:56Z)
- Transformers in Reinforcement Learning: A Survey [7.622978576824539]
Transformers have impacted domains like natural language processing, computer vision, and robotics, where they improve performance compared to other neural networks.
This survey explores how transformers are used in reinforcement learning (RL), where they are seen as a promising solution for addressing challenges such as unstable training, credit assignment, lack of interpretability, and partial observability.
arXiv Detail & Related papers (2023-07-12T07:51:12Z)
- Emergent Agentic Transformer from Chain of Hindsight Experience [96.56164427726203]
We show, for the first time, that a simple transformer-based model performs competitively with both temporal-difference and imitation-learning-based approaches.
arXiv Detail & Related papers (2023-05-26T00:43:02Z)
- Efficient Vision Transformers via Fine-Grained Manifold Distillation [96.50513363752836]
Vision transformer architectures have shown extraordinary performance on many computer vision tasks.
Although network performance is boosted, transformers often require more computational resources.
We propose to excavate useful information from the teacher transformer through the relationship between images and the divided patches.
arXiv Detail & Related papers (2021-07-03T08:28:34Z)
- Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, which is abbreviated from 'Vision-friendly Transformer'.
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z)
- Spatiotemporal Transformer for Video-based Person Re-identification [102.58619642363958]
We show that, despite the strong learning ability, the vanilla Transformer suffers from an increased risk of over-fitting.
We propose a novel pipeline where the model is pre-trained on a set of synthesized video data and then transferred to the downstream domains.
The derived algorithm achieves significant accuracy gain on three popular video-based person re-identification benchmarks.
arXiv Detail & Related papers (2021-03-30T16:19:27Z)
- Toward Transformer-Based Object Detection [12.704056181392415]
Vision Transformers can be used as a backbone by a common detection task head to produce competitive COCO results.
ViT-FRCNN demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance.
We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.
arXiv Detail & Related papers (2020-12-17T22:33:14Z)
- Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.