Efficient pre-training objectives for Transformers
- URL: http://arxiv.org/abs/2104.09694v1
- Date: Tue, 20 Apr 2021 00:09:37 GMT
- Title: Efficient pre-training objectives for Transformers
- Authors: Luca Di Liello, Matteo Gabburo, Alessandro Moschitti
- Abstract summary: We study several efficient pre-training objectives for Transformer-based models.
We prove that eliminating the MASK token and considering the whole output during the loss computation are essential choices to improve performance.
- Score: 84.64393460397471
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Transformer architecture has profoundly changed natural language
processing, outperforming all previous state-of-the-art models. However,
well-known Transformer models like BERT, RoBERTa, and GPT-2 require a huge
compute budget to create high-quality contextualised representations. In this
paper, we study several efficient pre-training objectives for Transformer-based
models. By testing these objectives on different tasks, we determine which of
the ELECTRA model's new features is the most relevant. We confirm that
Transformer pre-training is improved when the input does not contain masked
tokens and that using the whole output to compute the loss reduces training
time.
Moreover, inspired by ELECTRA, we study a model composed of two blocks: a
discriminator and a simple generator based on a statistical model, which has no
impact on computational cost. In addition, we prove that eliminating the MASK
token and considering the whole output during the loss computation are
essential choices to improve performance. Furthermore, we show that it is
possible to efficiently train BERT-like models using a discriminative approach
as in ELECTRA, but without a complex, expensive generator. Finally, we show
that ELECTRA benefits heavily from a state-of-the-art hyper-parameter search.
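The snippet below is a minimal sketch, not the authors' implementation, of the kind of objective the abstract describes: tokens are corrupted by a cheap statistical (unigram) sampler rather than a neural generator, no MASK token is introduced, and a discriminator is trained with a replaced-token-detection loss computed over every output position. It assumes PyTorch; the names UnigramCorruptor and rtd_loss are illustrative and not taken from the paper.

```python
# Sketch of ELECTRA-style replaced-token detection with a statistical generator.
# Assumptions: PyTorch; names are illustrative, not from the paper.
import torch
import torch.nn.functional as F


class UnigramCorruptor:
    """Replaces a fraction of tokens with samples from a unigram distribution."""

    def __init__(self, token_probs: torch.Tensor, corrupt_prob: float = 0.15):
        self.token_probs = token_probs      # (vocab_size,), e.g. corpus frequencies
        self.corrupt_prob = corrupt_prob

    def __call__(self, input_ids: torch.Tensor):
        corrupt_mask = torch.rand(input_ids.shape) < self.corrupt_prob
        replacements = torch.multinomial(
            self.token_probs, input_ids.numel(), replacement=True
        ).view_as(input_ids)
        corrupted = torch.where(corrupt_mask, replacements, input_ids)
        # A position is labelled "replaced" only if the sampled token differs.
        labels = (corrupted != input_ids).float()
        return corrupted, labels


def rtd_loss(discriminator_logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Binary replaced-token-detection loss over the whole output sequence."""
    return F.binary_cross_entropy_with_logits(discriminator_logits, labels)


# Usage: any BERT-like encoder with a per-token binary head can serve as the
# discriminator; a random tensor stands in for its output here.
vocab_size, batch, seq_len = 30522, 8, 128
corruptor = UnigramCorruptor(torch.ones(vocab_size) / vocab_size)  # uniform stand-in
input_ids = torch.randint(0, vocab_size, (batch, seq_len))
corrupted_ids, labels = corruptor(input_ids)
logits = torch.randn(batch, seq_len)            # placeholder discriminator output
loss = rtd_loss(logits, labels)
```

Because every position contributes to the loss and no neural generator is needed, each training step does roughly the same work as a plain encoder forward pass, which is the efficiency argument the abstract makes.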
Related papers
- Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms, such as low-rank computation, achieve impressive performance for learning Transformer-based adaptation.
We analyze how magnitude-based pruning affects generalization while improving adaptation.
We conclude that proper magnitude-based pruning has only a slight effect on testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z)
- Emergent Agentic Transformer from Chain of Hindsight Experience [96.56164427726203]
We show that a simple transformer-based model performs competitively with both temporal-difference and imitation-learning-based approaches.
To our knowledge, this is the first time a simple transformer-based model has done so.
arXiv Detail & Related papers (2023-05-26T00:43:02Z)
- CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z)
- Effective Pre-Training Objectives for Transformer-based Autoencoders [97.99741848756302]
We study trade-offs between efficiency, cost and accuracy of Transformer encoders.
We combine features of common objectives and create new effective pre-training approaches.
arXiv Detail & Related papers (2022-10-24T18:39:44Z)
- Taming Sparsely Activated Transformer with Stochastic Experts [76.0711573018493]
Sparsely activated models (SAMs) can easily scale to have outrageously large amounts of parameters without a significant increase in computational cost.
In this paper, we propose a new expert-based model, THOR (Transformer witH StOchastic ExpeRts).
Unlike classic expert-based models, such as the Switch Transformer, experts in THOR are randomly activated for each input during training and inference.
arXiv Detail & Related papers (2021-10-08T17:15:47Z)
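As a rough illustration of the random routing described in the THOR summary above, the sketch below shows a feed-forward block whose expert is chosen uniformly at random for each forward pass instead of by a learned gating network. This is a simplified reading of the idea, not THOR's actual implementation, and the names StochasticExpertFFN and num_experts are illustrative.

```python
# Simplified stochastic-expert routing: one expert picked at random per call.
# Assumptions: PyTorch; this omits THOR's consistency regularizer and other details.
import torch
import torch.nn as nn


class StochasticExpertFFN(nn.Module):
    """Feed-forward block whose expert is selected uniformly at random."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
             for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Uniform random choice replaces the learned router of e.g. Switch Transformer.
        idx = torch.randint(len(self.experts), (1,)).item()
        return self.experts[idx](x)


layer = StochasticExpertFFN(d_model=768, d_ff=3072)
hidden = torch.randn(8, 128, 768)               # (batch, seq_len, d_model)
out = layer(hidden)                             # routed through one random expert
```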
- DoT: An efficient Double Transformer for NLP tasks with tables [3.0079490585515343]
DoT is a double transformer model that decomposes the problem into two sub-tasks.
We show that for a small drop in accuracy, DoT improves training and inference time by at least 50%.
arXiv Detail & Related papers (2021-06-01T13:33:53Z)
- Transformer-based Conditional Variational Autoencoder for Controllable Story Generation [39.577220559911055]
We investigate large-scale latent variable models (LVMs) for neural story generation with objectives in two threads: generation effectiveness and controllability.
We advocate reviving latent variable modeling, essentially the power of representation learning, in the era of Transformers.
Specifically, we integrate latent representation vectors with a Transformer-based pre-trained architecture to build a conditional variational autoencoder (CVAE).
arXiv Detail & Related papers (2021-01-04T08:31:11Z)
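For the CVAE entry above, the snippet below sketches one generic way latent vectors can be coupled to a Transformer: an encoder's pooled state is mapped to a latent z via the reparameterisation trick, and z is added to the decoder's token embeddings as a conditioning signal. This is a standard CVAE pattern under assumed names (LatentBridge), not the paper's actual architecture.

```python
# Generic CVAE-with-Transformer coupling sketch; not the paper's architecture.
# Assumptions: PyTorch; the pre-trained encoder/decoder are stubbed with tensors.
import torch
import torch.nn as nn


class LatentBridge(nn.Module):
    """Maps a pooled encoder state to a latent z plus a KL penalty."""

    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.to_mu = nn.Linear(d_model, d_latent)
        self.to_logvar = nn.Linear(d_model, d_latent)
        self.to_model = nn.Linear(d_latent, d_model)

    def forward(self, pooled: torch.Tensor):
        mu, logvar = self.to_mu(pooled), self.to_logvar(pooled)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterisation
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return self.to_model(z), kl


bridge = LatentBridge(d_model=768, d_latent=32)
pooled_state = torch.randn(8, 768)              # e.g. [CLS] state of a pre-trained encoder
z_proj, kl_loss = bridge(pooled_state)
token_embeddings = torch.randn(8, 64, 768)      # decoder input embeddings
conditioned = token_embeddings + z_proj.unsqueeze(1)   # broadcast z over the sequence
```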
- AxFormer: Accuracy-driven Approximation of Transformers for Faster, Smaller and more Accurate NLP Models [4.247712017691596]
AxFormer is a framework that applies accuracy-driven approximations to create optimized transformer models for a given downstream task.
Our experiments show that AxFormer models are up to 4.5% more accurate, while also being up to 2.5X faster and up to 3.2X smaller than conventional fine-tuned models.
arXiv Detail & Related papers (2020-10-07T23:29:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.