Transformers: "The End of History" for NLP?
- URL: http://arxiv.org/abs/2105.00813v1
- Date: Fri, 9 Apr 2021 08:29:42 GMT
- Title: Transformers: "The End of History" for NLP?
- Authors: Anton Chernyavskiy, Dmitry Ilvovsky, Preslav Nakov
- Abstract summary: We shed light on some important theoretical limitations of pre-trained BERT-style models.
We show that addressing these limitations can yield sizable improvements over vanilla RoBERTa and XLNet.
We offer a more general discussion on desiderata for future additions to the Transformer architecture.
- Score: 17.36054090232896
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in neural architectures, such as the Transformer, coupled
with the emergence of large-scale pre-trained models such as BERT, have
revolutionized the field of Natural Language Processing (NLP), pushing the
state-of-the-art for a number of NLP tasks. A rich family of variations of
these models has been proposed, such as RoBERTa, ALBERT, and XLNet, but
fundamentally, they all remain limited in their ability to model certain kinds
of information, and they cannot cope with certain information sources, which
was easy for pre-existing models. Thus, here we aim to shed some light on some
important theoretical limitations of pre-trained BERT-style models that are
inherent in the general Transformer architecture. First, we demonstrate in
practice on two general types of tasks -- segmentation and segment labeling --
and four datasets that these limitations are indeed harmful and that addressing
them, even in some very simple and naive ways, can yield sizable improvements
over vanilla RoBERTa and XLNet. Then, we offer a more general discussion on
desiderata for future additions to the Transformer architecture that would
increase its expressiveness, which we hope could help in the design of the next
generation of deep NLP architectures.
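The abstract does not spell out its "simple and naive" fixes, but one illustration of the kind of remedy it alludes to is conditioning each token's label on the previous prediction, a dependency that a vanilla encoder-only tagger ignores. The sketch below is hypothetical; the class name, hidden size, and greedy decoding are assumptions, not the paper's method:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: a segment-labeling head that conditions each token's
# prediction on the previous token's label, one "simple and naive" way to
# inject the inter-label dependencies that a vanilla Transformer encoder
# does not model. The base encoder and label set are placeholders.
class LabelConditionedTagger(nn.Module):
    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.label_emb = nn.Embedding(num_labels + 1, hidden_size)  # +1 for BOS
        self.classifier = nn.Linear(2 * hidden_size, num_labels)
        self.bos = num_labels  # index of the "no previous label" embedding

    def forward(self, token_states: torch.Tensor) -> torch.Tensor:
        # token_states: (seq_len, hidden) contextual vectors from e.g. RoBERTa
        prev = torch.tensor(self.bos)
        logits = []
        for h in token_states:                 # greedy left-to-right decode
            feat = torch.cat([h, self.label_emb(prev)], dim=-1)
            step_logits = self.classifier(feat)
            prev = step_logits.argmax(dim=-1)  # condition on the prediction
            logits.append(step_logits)
        return torch.stack(logits)             # (seq_len, num_labels)

tagger = LabelConditionedTagger(hidden_size=768, num_labels=5)
print(tagger(torch.randn(10, 768)).shape)  # torch.Size([10, 5])
```

At training time one would teacher-force the gold previous label rather than the greedy argmax.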
Related papers
- Introduction to Transformers: an NLP Perspective [59.0241868728732]
We introduce basic concepts of Transformers and present key techniques that form the recent advances of these models.
This includes a description of the standard Transformer architecture, a series of model refinements, and common applications.
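As a pointer to what the "standard Transformer architecture" refers to, here is a minimal NumPy sketch of scaled dot-product attention, the core operation of every Transformer layer:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(QK^T / sqrt(d_k)) V, the core Transformer operation."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n_q, n_k) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)    # (4, 8)
```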
arXiv Detail & Related papers (2023-11-29T13:51:04Z)
- Converting Transformers to Polynomial Form for Secure Inference Over Homomorphic Encryption [45.00129952368691]
Homomorphic Encryption (HE) has emerged as one of the most promising approaches for privacy-preserving deep learning.
We introduce the first polynomial transformer, providing the first demonstration of secure inference over HE with transformers.
Our models yield results comparable to traditional methods, bridging the performance gap with transformers of similar scale and underscoring the viability of HE for state-of-the-art applications.
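The constraint driving this line of work is that HE schemes evaluate only additions and multiplications, so non-polynomial operations such as GELU must be swapped for polynomial surrogates. A minimal sketch of that substitution (the fitting range and degree are illustrative, not the paper's procedure):

```python
import numpy as np

# HE evaluates only additions and multiplications, so a non-polynomial
# activation like GELU is replaced by a polynomial approximation. Here we
# fit a degree-4 polynomial to GELU on [-4, 4]; the paper's actual
# conversion procedure and coefficients are not reproduced.
def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

xs = np.linspace(-4, 4, 2001)
coeffs = np.polyfit(xs, gelu(xs), deg=4)   # least-squares polynomial fit
poly_gelu = np.poly1d(coeffs)              # HE-friendly replacement

err = np.max(np.abs(poly_gelu(xs) - gelu(xs)))
print(f"max |poly - gelu| on [-4, 4]: {err:.4f}")
```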
arXiv Detail & Related papers (2023-11-15T00:23:58Z)
- Learning to Grow Pretrained Models for Efficient Transformer Training [72.20676008625641]
We learn to grow pretrained transformers, linearly mapping the parameters of the smaller model to initialize the larger model.
Experiments across both language and vision transformers demonstrate that our learned Linear Growth Operator (LiGO) can save up to 50% of the computational cost of training from scratch.
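A minimal sketch of the growth step, assuming a single dense expansion per weight matrix (LiGO's actual operator is factorized across width and depth):

```python
import torch
import torch.nn as nn

# Sketch of the "learn to grow" idea: a larger weight matrix is initialized
# as a learned linear map of the smaller model's weights. The single dense
# expansion below is an illustrative simplification of LiGO's operator.
class LinearGrowth(nn.Module):
    def __init__(self, d_small: int, d_large: int):
        super().__init__()
        self.row_map = nn.Parameter(torch.eye(d_large, d_small))  # width expansion
        self.col_map = nn.Parameter(torch.eye(d_small, d_large))

    def forward(self, w_small: torch.Tensor) -> torch.Tensor:
        # w_small: (d_small, d_small) -> (d_large, d_large)
        return self.row_map @ w_small @ self.col_map

grow = LinearGrowth(d_small=256, d_large=512)
w_large = grow(torch.randn(256, 256))
print(w_large.shape)  # torch.Size([512, 512])
```

In LiGO the mapping itself is learned with a short optimization before full training of the larger model resumes.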
arXiv Detail & Related papers (2023-03-02T05:21:18Z)
- Speculative Decoding with Big Little Decoder [108.95187338417541]
Big Little Decoder (BiLD) is a framework that can improve inference efficiency and latency for a wide range of text generation applications.
On an NVIDIA T4 GPU, our framework achieves a speedup of up to 2.12x with minimal degradation in generation quality.
Our framework is fully plug-and-play and can be applied without any modifications in the training process or model architecture.
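A hedged sketch of the big-little pattern: the small model decodes cheaply and defers to the big model only on low-confidence steps. The threshold rule below is a generic placeholder, not BiLD's exact fallback and rollback policies:

```python
from typing import Callable, List

# Generic big-little decoding sketch: a small model drafts tokens and a
# large model steps in on low confidence. Models and the confidence rule
# are placeholders, not the BiLD policies themselves.
def big_little_decode(
    small_next: Callable[[List[int]], tuple],  # returns (token, confidence)
    big_next: Callable[[List[int]], int],
    prompt: List[int],
    max_new_tokens: int,
    confidence_threshold: float = 0.9,
) -> List[int]:
    seq = list(prompt)
    for _ in range(max_new_tokens):
        token, conf = small_next(seq)
        if conf < confidence_threshold:
            token = big_next(seq)  # fall back to the big model
        seq.append(token)
    return seq

# Toy stand-ins: the "small model" loses confidence on every 5th step.
small = lambda seq: (len(seq) % 100, 0.5 if len(seq) % 5 == 0 else 0.99)
big = lambda seq: (len(seq) % 100) + 1
print(big_little_decode(small, big, prompt=[1, 2, 3], max_new_tokens=10))
```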
arXiv Detail & Related papers (2023-02-15T18:55:29Z)
- N-Grammer: Augmenting Transformers with latent n-grams [35.39961549040385]
We propose a simple yet effective modification to the Transformer architecture inspired by the literature in statistical language modeling, by augmenting the model with n-grams that are constructed from a discrete latent representation of the text sequence.
We evaluate our model, N-Grammer, on language modeling on the C4 dataset as well as text classification on the SuperGLUE dataset, and find that it outperforms several strong baselines such as the Transformer and the Primer.
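A sketch of the augmentation idea: derive bigram ids from a discrete id sequence, hash them into a fixed-size table, and add the looked-up embeddings to the token embeddings. N-Grammer derives the ids from a learned discrete latent representation; for brevity this sketch hashes the token ids directly:

```python
import torch
import torch.nn as nn

# Sketch of n-gram augmentation: hash bigram ids into a fixed-size table
# and add the resulting embeddings to the token embeddings. N-Grammer uses
# a learned discrete latent representation; token ids stand in for it here.
class BigramAugmenter(nn.Module):
    def __init__(self, vocab_size: int, table_size: int, dim: int):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.bigram_emb = nn.Embedding(table_size, dim)
        self.table_size = table_size

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # ids: (seq_len,) integer token ids
        prev = torch.cat([ids.new_zeros(1), ids[:-1]])            # shift right
        bigram_ids = (prev * 1_000_003 + ids) % self.table_size   # cheap hash
        return self.tok_emb(ids) + self.bigram_emb(bigram_ids)

aug = BigramAugmenter(vocab_size=32000, table_size=2**16, dim=64)
print(aug(torch.randint(0, 32000, (12,))).shape)  # torch.Size([12, 64])
```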
arXiv Detail & Related papers (2022-07-13T17:18:02Z)
- Sparse*BERT: Sparse Models Generalize to New Tasks and Domains [79.42527716035879]
This paper studies how models pruned using Gradual Unstructured Magnitude Pruning can transfer between domains and tasks.
We demonstrate that our general sparse model Sparse*BERT can become SparseBioBERT simply by pretraining the compressed architecture on unstructured biomedical text.
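For reference, a minimal sketch of gradual unstructured magnitude pruning with the common cubic sparsity ramp (generic schedule and hyperparameters, not the paper's):

```python
import torch

# Sketch of Gradual Unstructured Magnitude Pruning: sparsity ramps up over
# training and, at each step, the smallest-magnitude weights are zeroed.
# The cubic schedule follows the common Zhu & Gupta recipe; the paper's
# exact hyperparameters are not reproduced.
def sparsity_at(step: int, total_steps: int, final_sparsity: float) -> float:
    frac = min(step / total_steps, 1.0)
    return final_sparsity * (1.0 - (1.0 - frac) ** 3)  # cubic ramp

def magnitude_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    k = int(sparsity * weight.numel())
    if k == 0:
        return torch.ones_like(weight, dtype=torch.bool)
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight.abs() > threshold                    # keep large weights

w = torch.randn(768, 768)
for step in (0, 500, 1000):
    mask = magnitude_mask(w, sparsity_at(step, 1000, final_sparsity=0.9))
    print(step, f"{1 - mask.float().mean().item():.2f} sparse")
```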
arXiv Detail & Related papers (2022-05-25T02:51:12Z)
- Generalization in NLI: Ways (Not) To Go Beyond Simple Heuristics [78.6177778161625]
We conduct a case study of generalization in NLI in a range of BERT-based architectures.
We report 2 successful and 3 unsuccessful strategies, all providing insights into how Transformer-based models learn to generalize.
arXiv Detail & Related papers (2021-10-04T15:37:07Z)
- Decision Transformer: Reinforcement Learning via Sequence Modeling [102.86873656751489]
We present a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem.
We present Decision Transformer, an architecture that casts the problem of RL as conditional sequence modeling.
Despite its simplicity, Decision Transformer matches or exceeds the performance of state-of-the-art offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.
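A toy sketch of the input layout: each trajectory is flattened into an interleaved (return-to-go, state, action) sequence, so action selection becomes ordinary conditional sequence modeling:

```python
import numpy as np

# Decision Transformer input layout: trajectories become interleaved
# (return-to-go, state, action) sequences. Rewards and states here are
# toy values, not from any benchmark.
def returns_to_go(rewards):
    # R_t = sum of rewards from timestep t to the end of the episode
    return np.cumsum(rewards[::-1])[::-1]

rewards = np.array([1.0, 0.0, 2.0, 1.0])
states = ["s0", "s1", "s2", "s3"]
actions = ["a0", "a1", "a2", "a3"]

rtg = returns_to_go(rewards)  # [4., 3., 3., 1.]
sequence = [tok for t in range(len(rewards))
            for tok in (("rtg", rtg[t]), ("state", states[t]), ("action", actions[t]))]
print(sequence[:6])  # first two timesteps of the interleaved sequence
```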
arXiv Detail & Related papers (2021-06-02T17:53:39Z)
- Updater-Extractor Architecture for Inductive World State Representations [0.0]
We propose a transformer-based Updater-Extractor architecture and a training procedure that can work with sequences of arbitrary length.
We explicitly train the model to incorporate incoming information into its world state representation.
Empirically, we investigate the model performance on three different tasks, demonstrating its promise.
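A minimal sketch of the pattern, assuming a GRU-style update in place of the paper's transformer modules: a fixed-size state is updated chunk by chunk, so the input length is unbounded:

```python
import torch
import torch.nn as nn

# Updater-Extractor pattern sketch: a fixed-size world-state vector is
# updated as each input chunk arrives, and a separate extractor reads
# answers off the state. The GRU-style update and shapes are illustrative
# choices, not the paper's architecture.
class UpdaterExtractor(nn.Module):
    def __init__(self, input_dim: int, state_dim: int, output_dim: int):
        super().__init__()
        self.updater = nn.GRUCell(input_dim, state_dim)  # state <- f(state, x)
        self.extractor = nn.Linear(state_dim, output_dim)

    def forward(self, chunks: torch.Tensor) -> torch.Tensor:
        # chunks: (num_chunks, input_dim); only the state is carried forward,
        # so the sequence can be arbitrarily long.
        state = chunks.new_zeros(1, self.updater.hidden_size)
        for x in chunks:
            state = self.updater(x.unsqueeze(0), state)
        return self.extractor(state)

model = UpdaterExtractor(input_dim=32, state_dim=128, output_dim=10)
print(model(torch.randn(100, 32)).shape)  # torch.Size([1, 10])
```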
arXiv Detail & Related papers (2021-04-12T14:30:11Z)
- E.T.: Entity-Transformers. Coreference augmented Neural Language Model for richer mention representations via Entity-Transformer blocks [3.42658286826597]
We present an extension over the Transformer-block architecture used in neural language models, specifically in GPT2.
Our model, GPT2E, extends the Transformer-layer architecture of GPT2 to Entity-Transformers, an architecture designed to handle coreference information when present.
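The summary leaves the block's internals unspecified, so the sketch below is only a guess at the flavor of the idea: inject a shared entity embedding into each token that mentions an entity before attention, so repeated mentions share signal. All names and shapes are assumptions:

```python
import torch
import torch.nn as nn

# Illustrative sketch only: one way a Transformer block can consume
# coreference information is to add a shared entity embedding to every
# token that mentions that entity. This is a guess at the flavor of the
# idea, not GPT2E's actual Entity-Transformer block.
class EntityAugmentedLayer(nn.Module):
    def __init__(self, dim: int, num_entities: int):
        super().__init__()
        self.entity_emb = nn.Embedding(num_entities + 1, dim, padding_idx=0)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, h: torch.Tensor, entity_ids: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, dim); entity_ids: (batch, seq), 0 = no mention
        h = h + self.entity_emb(entity_ids)  # inject coreference signal
        out, _ = self.attn(h, h, h)
        return out

layer = EntityAugmentedLayer(dim=64, num_entities=100)
h = torch.randn(2, 8, 64)
ids = torch.randint(0, 101, (2, 8))
print(layer(h, ids).shape)  # torch.Size([2, 8, 64])
```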
arXiv Detail & Related papers (2020-11-10T22:28:00Z)
- Compression of Deep Learning Models for Text: A Survey [6.532867867011488]
In recent years, the fields of natural language processing (NLP) and information retrieval (IR) have made tremendous progress.
Deep learning models like Recurrent Neural Networks (RNNs), Gated Recurrent Units (GRUs), and Long Short-Term Memory (LSTM) networks, and Transformer [120]-based models like Bidirectional Encoder Representations from Transformers (BERT) [24], Generative Pre-training Transformer (GPT-2) [94], Multi-task Deep Neural Network (MT-DNN) [73], Extra-Long Network (XLNet) [134], Text-to-text transfer Transformer (T5) ...
arXiv Detail & Related papers (2020-08-12T10:42:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.