Foundation Transformers
- URL: http://arxiv.org/abs/2210.06423v1
- Date: Wed, 12 Oct 2022 17:16:27 GMT
- Title: Foundation Transformers
- Authors: Hongyu Wang, Shuming Ma, Shaohan Huang, Li Dong, Wenhui Wang, Zhiliang
Peng, Yu Wu, Payal Bajaj, Saksham Singhal, Alon Benhaim, Barun Patra, Zhun
Liu, Vishrav Chaudhary, Xia Song, Furu Wei
- Abstract summary: We call for the development of Foundation Transformer for true general-purpose modeling.
In this work, we introduce a Transformer variant, named Magneto, to fulfill the goal.
- Score: 105.06915886136524
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A big convergence of model architectures across language, vision, speech, and
multimodal is emerging. However, under the same name "Transformers", the above
areas use different implementations for better performance, e.g.,
Post-LayerNorm for BERT, and Pre-LayerNorm for GPT and vision Transformers. We
call for the development of Foundation Transformer for true general-purpose
modeling, which serves as a go-to architecture for various tasks and modalities
with guaranteed training stability. In this work, we introduce a Transformer
variant, named Magneto, to fulfill the goal. Specifically, we propose
Sub-LayerNorm for good expressivity, and the initialization strategy
theoretically derived from DeepNet for stable scaling up. Extensive experiments
demonstrate its superior performance and better stability than the de facto
Transformer variants designed for various applications, including language
modeling (i.e., BERT, and GPT), machine translation, vision pretraining (i.e.,
BEiT), speech recognition, and multimodal pretraining (i.e., BEiT-3).
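The three LayerNorm placements named in the abstract can be compared on a single feed-forward sublayer. The following is a minimal NumPy sketch, not the authors' implementation. The abstract does not spell out where Sub-LayerNorm puts its norms or what the initialization gain is, so two things here are assumptions to check against the paper: the Sub-LN placement (one norm on the sublayer input, a second before the output projection) and the gain sqrt(log(2N)) for an N-layer encoder-only or decoder-only stack. The function names and the helper `subln_gain` are hypothetical, and learned LayerNorm scale and bias are omitted for brevity.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean and unit variance (no learned scale or bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def relu(x):
    return np.maximum(x, 0.0)


def ffn_post_ln(x, w1, w2):
    # Post-LayerNorm (BERT-style): normalize after the residual addition.
    return layer_norm(x + relu(x @ w1) @ w2)


def ffn_pre_ln(x, w1, w2):
    # Pre-LayerNorm (GPT / vision Transformer style): normalize only the sublayer input.
    return x + relu(layer_norm(x) @ w1) @ w2


def ffn_sub_ln(x, w1, w2):
    # Sub-LayerNorm (assumed placement): norm on the input, plus a second norm
    # right before the output projection; the residual path stays un-normalized.
    hidden = relu(layer_norm(x) @ w1)
    return x + layer_norm(hidden) @ w2


def subln_gain(num_layers):
    # Assumed initialization gain for an N-layer encoder-only or decoder-only stack.
    return np.sqrt(np.log(2 * num_layers))


rng = np.random.default_rng(0)
d_model, d_ff, num_layers = 8, 32, 12

# Scale the projection weights by the assumed gain at initialization.
gain = subln_gain(num_layers)
w1 = rng.normal(0.0, d_model ** -0.5, (d_model, d_ff)) * gain
w2 = rng.normal(0.0, d_ff ** -0.5, (d_ff, d_model)) * gain

x = rng.normal(size=(4, d_model))
outputs = {f.__name__: f(x, w1, w2) for f in (ffn_post_ln, ffn_pre_ln, ffn_sub_ln)}
for name, out in outputs.items():
    print(name, out.shape)
```

All three variants keep the input shape, so they are drop-in alternatives for one another. The difference lies in which activations are normalized and whether the residual stream itself passes through a norm.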
Related papers
- Unveil Benign Overfitting for Transformer in Vision: Training Dynamics, Convergence, and Generalization [88.5582111768376]
We study the optimization of a Transformer composed of a self-attention layer with softmax followed by a fully connected layer under gradient descent on a certain data distribution model.
Our results establish a sharp condition that can distinguish between the small test error phase and the large test error regime, based on the signal-to-noise ratio in the data model.
arXiv Detail & Related papers (2024-09-28T13:24:11Z)
- Enhanced Transformer Architecture for Natural Language Processing [2.6071653283020915]
Transformer is a state-of-the-art model in the field of natural language processing (NLP).
In this paper, a novel structure of Transformer is proposed. It is featured by full layer normalization, weighted residual connection, positional encoding exploiting reinforcement learning, and zero masked self-attention.
The proposed Transformer model, which is called Enhanced Transformer, is validated by the bilingual evaluation understudy (BLEU) score obtained with the Multi30k translation dataset.
arXiv Detail & Related papers (2023-10-17T01:59:07Z)
- Semantic-visual Guided Transformer for Few-shot Class-incremental Learning [6.300141694311465]
We develop a semantic-visual guided Transformer (SV-T) to enhance the feature extracting capacity of the pre-trained feature backbone on incremental classes.
Our SV-T can take full advantage of more supervision information from base classes and further enhance the training robustness of the feature backbone.
arXiv Detail & Related papers (2023-03-27T15:06:49Z)
- Full Stack Optimization of Transformer Inference: a Survey [58.55475772110702]
Transformer models achieve superior accuracy across a wide range of applications.
The amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate.
There has been an increased focus on making Transformer models more efficient.
arXiv Detail & Related papers (2023-02-27T18:18:13Z)
- TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer [188.00681648113223]
We explore neat yet effective Transformer-based frameworks for visual grounding.
TransVG establishes multi-modal correspondences by Transformers and localizes referred regions by directly regressing box coordinates.
We upgrade our framework to a purely Transformer-based one by leveraging Vision Transformer (ViT) for vision feature encoding.
arXiv Detail & Related papers (2022-06-14T06:27:38Z)
- Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks [126.33843752332139]
We introduce Group-wise Transformation towards a universal yet lightweight Transformer for vision-and-language tasks, termed as LW-Transformer.
We apply LW-Transformer to a set of Transformer-based networks, and quantitatively measure them on three vision-and-language tasks and six benchmark datasets.
Experimental results show that while saving a large number of parameters and computations, LW-Transformer achieves very competitive performance against the original Transformer networks for vision-and-language tasks.
arXiv Detail & Related papers (2022-04-16T11:30:26Z)
- A Survey of Visual Transformers [30.082304742571598]
Transformer, an attention-based encoder-decoder architecture, has revolutionized the field of natural language processing.
Some pioneering works have recently been done on adapting Transformer architectures to Computer Vision (CV) fields.
We have provided a comprehensive review of over one hundred different visual Transformers for three fundamental CV tasks.
arXiv Detail & Related papers (2021-11-11T07:56:04Z)
- Hierarchical Transformers Are More Efficient Language Models [19.061388006885686]
Transformer models yield impressive results on many NLP and sequence modeling tasks.
Remarkably, Transformers can handle long sequences which allows them to produce long coherent outputs.
We postulate that having an explicit hierarchical architecture is the key to Transformers that efficiently handle long sequences.
arXiv Detail & Related papers (2021-10-26T14:00:49Z)
- Segatron: Segment-Aware Transformer for Language Modeling and Understanding [79.84562707201323]
We propose a segment-aware Transformer (Segatron) to generate better contextual representations from sequential tokens.
We first introduce the segment-aware mechanism to Transformer-XL, which is a popular Transformer-based language model.
We find that our method can further improve the Transformer-XL base model and large model, achieving 17.1 perplexity on the WikiText-103 dataset.
arXiv Detail & Related papers (2020-04-30T17:38:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.