Do Transformer Modifications Transfer Across Implementations and Applications?
- URL: http://arxiv.org/abs/2102.11972v1
- Date: Tue, 23 Feb 2021 22:44:54 GMT
- Title: Do Transformer Modifications Transfer Across Implementations and Applications?
- Authors: Sharan Narang, Hyung Won Chung, Yi Tay, William Fedus, Thibault Fevry,
Michael Matena, Karishma Malkan, Noah Fiedel, Noam Shazeer, Zhenzhong Lan,
Yanqi Zhou, Wei Li, Nan Ding, Jake Marcus, Adam Roberts, Colin Raffel
- Abstract summary: We comprehensively evaluate many of these modifications in a shared experimental setting.
We find that most modifications do not meaningfully improve performance.
Most of the Transformer variants we found beneficial were either developed in the same codebase that we used or are relatively minor changes.
- Score: 52.09138231841911
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The research community has proposed copious modifications to the Transformer
architecture since it was introduced over three years ago, relatively few of
which have seen widespread adoption. In this paper, we comprehensively evaluate
many of these modifications in a shared experimental setting that covers most
of the common uses of the Transformer in natural language processing.
Surprisingly, we find that most modifications do not meaningfully improve
performance. Furthermore, most of the Transformer variants we found beneficial
were either developed in the same codebase that we used or are relatively minor
changes. We conjecture that performance improvements may strongly depend on
implementation details and correspondingly make some recommendations for
improving the generality of experimental results.
Related papers
- Interpretation of the Transformer and Improvement of the Extractor [3.9693969407364427]
It has been over six years since the Transformer architecture was put forward.
Surprisingly, the vanilla Transformer architecture is still widely used today.
The lack of deep understanding and comprehensive interpretation of the Transformer architecture makes it more challenging to improve the Transformer architecture.
arXiv Detail & Related papers (2023-11-21T15:36:20Z)
- Full Stack Optimization of Transformer Inference: a Survey [58.55475772110702]
Transformer models achieve superior accuracy across a wide range of applications.
The amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate.
There has been an increased focus on making Transformer models more efficient.
arXiv Detail & Related papers (2023-02-27T18:18:13Z)
- Foundation Transformers [105.06915886136524]
We call for the development of a Foundation Transformer for true general-purpose modeling.
In this work, we introduce a Transformer variant, named Magneto, to fulfill the goal.
arXiv Detail & Related papers (2022-10-12T17:16:27Z)
- Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks [126.33843752332139]
We introduce Group-wise Transformation towards a universal yet lightweight Transformer for vision-and-language tasks, termed LW-Transformer.
We apply LW-Transformer to a set of Transformer-based networks and quantitatively evaluate them on three vision-and-language tasks and six benchmark datasets.
Experimental results show that while saving a large number of parameters and computations, LW-Transformer achieves very competitive performance against the original Transformer networks for vision-and-language tasks (a sketch of the group-wise idea follows this entry).
arXiv Detail & Related papers (2022-04-16T11:30:26Z)
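The parameter and compute savings reported in the entry above stem from grouped projections. The following is a minimal, hypothetical sketch only, not the paper's code: PyTorch and the class name GroupWiseLinear are assumptions, and the point is simply that splitting one dim-wide projection into independent groups divides its weight count by the number of groups.

    # Hypothetical sketch (not LW-Transformer's actual code): a group-wise
    # linear layer. Splitting a dim-wide projection into `groups` independent
    # blocks cuts the weight matrix size from dim*dim to groups*(dim/groups)^2
    # = dim^2/groups.
    import torch
    import torch.nn as nn

    class GroupWiseLinear(nn.Module):
        def __init__(self, dim: int, groups: int):
            super().__init__()
            assert dim % groups == 0, "dim must be divisible by groups"
            self.groups = groups
            # One small projection per group instead of a single dense matrix.
            self.projs = nn.ModuleList(
                nn.Linear(dim // groups, dim // groups) for _ in range(groups)
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, seq, dim); split the feature axis into groups,
            # project each chunk independently, and re-concatenate.
            chunks = x.chunk(self.groups, dim=-1)
            return torch.cat([p(c) for p, c in zip(self.projs, chunks)], dim=-1)

For example, with dim=512 and groups=4, a dense projection's 512*512 = 262,144 weights shrink to 4*128*128 = 65,536 per layer, the kind of saving the entry above attributes to LW-Transformer's group-wise design.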
- A Survey of Visual Transformers [30.082304742571598]
Transformer, an attention-based encoder-decoder architecture, has revolutionized the field of natural language processing.
Some pioneering works have recently been done on adapting Transformer architectures to Computer Vision (CV) fields.
We have provided a comprehensive review of over one hundred different visual Transformers for three fundamental CV tasks.
arXiv Detail & Related papers (2021-11-11T07:56:04Z)
- Optimizing Inference Performance of Transformers on CPUs [0.0]
Transformer-based models (e.g., BERT) power many important Web services, such as search, translation, and question answering.
This paper presents an empirical analysis of scalability and performance of inferencing a Transformer-based model on CPUs.
arXiv Detail & Related papers (2021-02-12T17:01:35Z)
- Efficient Transformers: A Survey [98.23264445730645]
Transformer model architectures have garnered immense interest lately due to their effectiveness across a range of domains like language, vision and reinforcement learning.
This paper characterizes a large and thoughtful selection of recent efficiency-flavored "X-former" models.
arXiv Detail & Related papers (2020-09-14T20:38:14Z)
- Transformer on a Diet [81.09119185568296]
Transformer has been widely used thanks to its ability to capture sequence information in an efficient way.
Recent developments, such as BERT and GPT-2, deliver only heavy architectures with a focus on effectiveness.
We explore three carefully designed light Transformer architectures to determine whether a Transformer with less computation can produce competitive results.
arXiv Detail & Related papers (2020-02-14T18:41:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.