Transformers: "The End of History" for NLP?
- URL: http://arxiv.org/abs/2105.00813v1
- Date: Fri, 9 Apr 2021 08:29:42 GMT
- Title: Transformers: "The End of History" for NLP?
- Authors: Anton Chernyavskiy, Dmitry Ilvovsky, Preslav Nakov
- Abstract summary: We shed light on some important theoretical limitations of pre-trained BERT-style models.
We show that addressing these limitations can yield sizable improvements over vanilla RoBERTa and XLNet.
We offer a more general discussion on desiderata for future additions to the Transformer architecture.
- Score: 17.36054090232896
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in neural architectures, such as the Transformer, coupled
with the emergence of large-scale pre-trained models such as BERT, have
revolutionized the field of Natural Language Processing (NLP), pushing the
state-of-the-art for a number of NLP tasks. A rich family of variations of
these models has been proposed, such as RoBERTa, ALBERT, and XLNet, but
fundamentally, they all remain limited in their ability to model certain kinds
of information, and they cannot cope with certain information sources, which
was easy for pre-existing models. Thus, here we aim to shed some light on some
important theoretical limitations of pre-trained BERT-style models that are
inherent in the general Transformer architecture. First, we demonstrate in
practice on two general types of tasks -- segmentation and segment labeling --
and four datasets that these limitations are indeed harmful and that addressing
them, even in some very simple and naive ways, can yield sizable improvements
over vanilla RoBERTa and XLNet. Then, we offer a more general discussion on
desiderata for future additions to the Transformer architecture that would
increase its expressiveness, which we hope could help in the design of the next
generation of deep NLP architectures.
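The abstract does not spell out its "simple and naive" fixes, but one illustration of the kind of remedy it alludes to is conditioning each token's label on the previous prediction, a dependency that a vanilla encoder-only tagger ignores. The sketch below is hypothetical; the class name, hidden size, and greedy decoding are assumptions, not the paper's method:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: a segment-labeling head that conditions each token's
# prediction on the previous token's label, one "simple and naive" way to
# inject the inter-label dependencies that a vanilla Transformer encoder
# does not model. The base encoder and label set are placeholders.
class LabelConditionedTagger(nn.Module):
    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.label_emb = nn.Embedding(num_labels + 1, hidden_size)  # +1 for BOS
        self.classifier = nn.Linear(2 * hidden_size, num_labels)
        self.bos = num_labels  # index of the "no previous label" embedding

    def forward(self, token_states: torch.Tensor) -> torch.Tensor:
        # token_states: (seq_len, hidden) contextual vectors from e.g. RoBERTa
        prev = torch.tensor(self.bos)
        logits = []
        for h in token_states:                 # greedy left-to-right decode
            feat = torch.cat([h, self.label_emb(prev)], dim=-1)
            step_logits = self.classifier(feat)
            prev = step_logits.argmax(dim=-1)  # condition on the prediction
            logits.append(step_logits)
        return torch.stack(logits)             # (seq_len, num_labels)

tagger = LabelConditionedTagger(hidden_size=768, num_labels=5)
print(tagger(torch.randn(10, 768)).shape)  # torch.Size([10, 5])
```

At training time one would teacher-force the gold previous label rather than the greedy argmax.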
Related papers
- Introduction to Transformers: an NLP Perspective [59.0241868728732]
We introduce basic concepts of Transformers and present key techniques that form the recent advances of these models.
This includes a description of the standard Transformer architecture, a series of model refinements, and common applications.
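As a pointer to what the "standard Transformer architecture" refers to, here is a minimal NumPy sketch of scaled dot-product attention, the core operation of every Transformer layer:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(QK^T / sqrt(d_k)) V, the core Transformer operation."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n_q, n_k) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)    # (4, 8)
```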
arXiv Detail & Related papers (2023-11-29T13:51:04Z)
- Converting Transformers to Polynomial Form for Secure Inference Over Homomorphic Encryption [45.00129952368691]
Homomorphic Encryption (HE) has emerged as one of the most promising approaches for privacy-preserving deep learning.
We introduce the first polynomial transformer, providing the first demonstration of secure inference over HE with transformers.
Our models yield results comparable to traditional methods, bridging the performance gap with transformers of similar scale and underscoring the viability of HE for state-of-the-art applications.
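The constraint driving this line of work is that HE schemes evaluate only additions and multiplications, so non-polynomial operations such as GELU must be swapped for polynomial surrogates. A minimal sketch of that substitution (the fitting range and degree are illustrative, not the paper's procedure):

```python
import numpy as np

# HE evaluates only additions and multiplications, so a non-polynomial
# activation like GELU is replaced by a polynomial approximation. Here we
# fit a degree-4 polynomial to GELU on [-4, 4]; the paper's actual
# conversion procedure and coefficients are not reproduced.
def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

xs = np.linspace(-4, 4, 2001)
coeffs = np.polyfit(xs, gelu(xs), deg=4)   # least-squares polynomial fit
poly_gelu = np.poly1d(coeffs)              # HE-friendly replacement

err = np.max(np.abs(poly_gelu(xs) - gelu(xs)))
print(f"max |poly - gelu| on [-4, 4]: {err:.4f}")
```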
arXiv Detail & Related papers (2023-11-15T00:23:58Z)
- Learning to Grow Pretrained Models for Efficient Transformer Training [72.20676008625641]
We learn to grow pretrained transformers, linearly mapping the parameters of the smaller model to initialize the larger model.
Experiments across both language and vision transformers demonstrate that our learned Linear Growth Operator (LiGO) can save up to 50% of the computational cost of training from scratch.
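A minimal sketch of the growth step, assuming a single dense expansion per weight matrix (LiGO's actual operator is factorized across width and depth):

```python
import torch
import torch.nn as nn

# Sketch of the "learn to grow" idea: a larger weight matrix is initialized
# as a learned linear map of the smaller model's weights. The single dense
# expansion below is an illustrative simplification of LiGO's operator.
class LinearGrowth(nn.Module):
    def __init__(self, d_small: int, d_large: int):
        super().__init__()
        self.row_map = nn.Parameter(torch.eye(d_large, d_small))  # width expansion
        self.col_map = nn.Parameter(torch.eye(d_small, d_large))

    def forward(self, w_small: torch.Tensor) -> torch.Tensor:
        # w_small: (d_small, d_small) -> (d_large, d_large)
        return self.row_map @ w_small @ self.col_map

grow = LinearGrowth(d_small=256, d_large=512)
w_large = grow(torch.randn(256, 256))
print(w_large.shape)  # torch.Size([512, 512])
```

In LiGO the mapping itself is learned with a short optimization before full training of the larger model resumes.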
arXiv Detail & Related papers (2023-03-02T05:21:18Z)
- Speculative Decoding with Big Little Decoder [108.95187338417541]
Big Little Decoder (BiLD) is a framework that can improve inference efficiency and latency for a wide range of text generation applications.
On an NVIDIA T4 GPU, our framework achieves a speedup of up to 2.12x with minimal degradation in generation quality.
Our framework is fully plug-and-play and can be applied without any modifications in the training process or model architecture.
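A hedged sketch of the big-little pattern: the small model decodes cheaply and defers to the big model only on low-confidence steps. The threshold rule below is a generic placeholder, not BiLD's exact fallback and rollback policies:

```python
from typing import Callable, List

# Generic big-little decoding sketch: a small model drafts tokens and a
# large model steps in on low confidence. Models and the confidence rule
# are placeholders, not the BiLD policies themselves.
def big_little_decode(
    small_next: Callable[[List[int]], tuple],  # returns (token, confidence)
    big_next: Callable[[List[int]], int],
    prompt: List[int],
    max_new_tokens: int,
    confidence_threshold: float = 0.9,
) -> List[int]:
    seq = list(prompt)
    for _ in range(max_new_tokens):
        token, conf = small_next(seq)
        if conf < confidence_threshold:
            token = big_next(seq)  # fall back to the big model
        seq.append(token)
    return seq

# Toy stand-ins: the "small model" loses confidence on every 5th step.
small = lambda seq: (len(seq) % 100, 0.5 if len(seq) % 5 == 0 else 0.99)
big = lambda seq: (len(seq) % 100) + 1
print(big_little_decode(small, big, prompt=[1, 2, 3], max_new_tokens=10))
```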
arXiv Detail & Related papers (2023-02-15T18:55:29Z)
- N-Grammer: Augmenting Transformers with latent n-grams [35.39961549040385]
We propose a simple yet effective modification to the Transformer architecture inspired by the literature in statistical language modeling, by augmenting the model with n-grams that are constructed from a discrete latent representation of the text sequence.
We evaluate our model, N-Grammer, on language modeling on the C4 dataset as well as text classification on the SuperGLUE dataset, and find that it outperforms several strong baselines such as the Transformer and the Primer.
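A sketch of the augmentation idea: derive bigram ids from a discrete id sequence, hash them into a fixed-size table, and add the looked-up embeddings to the token embeddings. N-Grammer derives the ids from a learned discrete latent representation; for brevity this sketch hashes the token ids directly:

```python
import torch
import torch.nn as nn

# Sketch of n-gram augmentation: hash bigram ids into a fixed-size table
# and add the resulting embeddings to the token embeddings. N-Grammer uses
# a learned discrete latent representation; token ids stand in for it here.
class BigramAugmenter(nn.Module):
    def __init__(self, vocab_size: int, table_size: int, dim: int):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.bigram_emb = nn.Embedding(table_size, dim)
        self.table_size = table_size

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # ids: (seq_len,) integer token ids
        prev = torch.cat([ids.new_zeros(1), ids[:-1]])            # shift right
        bigram_ids = (prev * 1_000_003 + ids) % self.table_size   # cheap hash
        return self.tok_emb(ids) + self.bigram_emb(bigram_ids)

aug = BigramAugmenter(vocab_size=32000, table_size=2**16, dim=64)
print(aug(torch.randint(0, 32000, (12,))).shape)  # torch.Size([12, 64])
```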
arXiv Detail & Related papers (2022-07-13T17:18:02Z)
- Sparse*BERT: Sparse Models Generalize to New Tasks and Domains [79.42527716035879]
This paper studies how models pruned using Gradual Unstructured Magnitude Pruning can transfer between domains and tasks.
We demonstrate that our general sparse model Sparse*BERT can become SparseBioBERT simply by pretraining the compressed architecture on unstructured biomedical text.
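For reference, a minimal sketch of gradual unstructured magnitude pruning with the common cubic sparsity ramp (generic schedule and hyperparameters, not the paper's):

```python
import torch

# Sketch of Gradual Unstructured Magnitude Pruning: sparsity ramps up over
# training and, at each step, the smallest-magnitude weights are zeroed.
# The cubic schedule follows the common Zhu & Gupta recipe; the paper's
# exact hyperparameters are not reproduced.
def sparsity_at(step: int, total_steps: int, final_sparsity: float) -> float:
    frac = min(step / total_steps, 1.0)
    return final_sparsity * (1.0 - (1.0 - frac) ** 3)  # cubic ramp

def magnitude_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    k = int(sparsity * weight.numel())
    if k == 0:
        return torch.ones_like(weight, dtype=torch.bool)
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight.abs() > threshold                    # keep large weights

w = torch.randn(768, 768)
for step in (0, 500, 1000):
    mask = magnitude_mask(w, sparsity_at(step, 1000, final_sparsity=0.9))
    print(step, f"{1 - mask.float().mean().item():.2f} sparse")
```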
arXiv Detail & Related papers (2022-05-25T02:51:12Z)
- Generalization in NLI: Ways (Not) To Go Beyond Simple Heuristics [78.6177778161625]
We conduct a case study of generalization in NLI in a range of BERT-based architectures.
We report 2 successful and 3 unsuccessful strategies, all providing insights into how Transformer-based models learn to generalize.
arXiv Detail & Related papers (2021-10-04T15:37:07Z)
- Decision Transformer: Reinforcement Learning via Sequence Modeling [102.86873656751489]
We present a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem.
We present Decision Transformer, an architecture that casts the problem of RL as conditional sequence modeling.
Despite its simplicity, Decision Transformer matches or exceeds the performance of state-of-the-art offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.
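A toy sketch of the input layout: each trajectory is flattened into an interleaved (return-to-go, state, action) sequence, so action selection becomes ordinary conditional sequence modeling:

```python
import numpy as np

# Decision Transformer input layout: trajectories become interleaved
# (return-to-go, state, action) sequences. Rewards and states here are
# toy values, not from any benchmark.
def returns_to_go(rewards):
    # R_t = sum of rewards from timestep t to the end of the episode
    return np.cumsum(rewards[::-1])[::-1]

rewards = np.array([1.0, 0.0, 2.0, 1.0])
states = ["s0", "s1", "s2", "s3"]
actions = ["a0", "a1", "a2", "a3"]

rtg = returns_to_go(rewards)  # [4., 3., 3., 1.]
sequence = [tok for t in range(len(rewards))
            for tok in (("rtg", rtg[t]), ("state", states[t]), ("action", actions[t]))]
print(sequence[:6])  # first two timesteps of the interleaved sequence
```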
arXiv Detail & Related papers (2021-06-02T17:53:39Z)
- Updater-Extractor Architecture for Inductive World State Representations [0.0]
We propose a transformer-based Updater-Extractor architecture and a training procedure that can work with sequences of arbitrary length.
We explicitly train the model to incorporate incoming information into its world state representation.
Empirically, we investigate the model performance on three different tasks, demonstrating its promise.
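A minimal sketch of the pattern, assuming a GRU-style update in place of the paper's transformer modules: a fixed-size state is updated chunk by chunk, so the input length is unbounded:

```python
import torch
import torch.nn as nn

# Updater-Extractor pattern sketch: a fixed-size world-state vector is
# updated as each input chunk arrives, and a separate extractor reads
# answers off the state. The GRU-style update and shapes are illustrative
# choices, not the paper's architecture.
class UpdaterExtractor(nn.Module):
    def __init__(self, input_dim: int, state_dim: int, output_dim: int):
        super().__init__()
        self.updater = nn.GRUCell(input_dim, state_dim)  # state <- f(state, x)
        self.extractor = nn.Linear(state_dim, output_dim)

    def forward(self, chunks: torch.Tensor) -> torch.Tensor:
        # chunks: (num_chunks, input_dim); only the state is carried forward,
        # so the sequence can be arbitrarily long.
        state = chunks.new_zeros(1, self.updater.hidden_size)
        for x in chunks:
            state = self.updater(x.unsqueeze(0), state)
        return self.extractor(state)

model = UpdaterExtractor(input_dim=32, state_dim=128, output_dim=10)
print(model(torch.randn(100, 32)).shape)  # torch.Size([1, 10])
```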
arXiv Detail & Related papers (2021-04-12T14:30:11Z)
- E.T.: Entity-Transformers. Coreference augmented Neural Language Model for richer mention representations via Entity-Transformer blocks [3.42658286826597]
We present an extension over the Transformer-block architecture used in neural language models, specifically in GPT2.
Our model, GPT2E, extends the Transformer-layer architecture of GPT2 to Entity-Transformers, an architecture designed to handle coreference information when present.
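The summary leaves the block's internals unspecified, so the sketch below is only a guess at the flavor of the idea: inject a shared entity embedding into each token that mentions an entity before attention, so repeated mentions share signal. All names and shapes are assumptions:

```python
import torch
import torch.nn as nn

# Illustrative sketch only: one way a Transformer block can consume
# coreference information is to add a shared entity embedding to every
# token that mentions that entity. This is a guess at the flavor of the
# idea, not GPT2E's actual Entity-Transformer block.
class EntityAugmentedLayer(nn.Module):
    def __init__(self, dim: int, num_entities: int):
        super().__init__()
        self.entity_emb = nn.Embedding(num_entities + 1, dim, padding_idx=0)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, h: torch.Tensor, entity_ids: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, dim); entity_ids: (batch, seq), 0 = no mention
        h = h + self.entity_emb(entity_ids)  # inject coreference signal
        out, _ = self.attn(h, h, h)
        return out

layer = EntityAugmentedLayer(dim=64, num_entities=100)
h = torch.randn(2, 8, 64)
ids = torch.randint(0, 101, (2, 8))
print(layer(h, ids).shape)  # torch.Size([2, 8, 64])
```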
arXiv Detail & Related papers (2020-11-10T22:28:00Z)
- Compression of Deep Learning Models for Text: A Survey [6.532867867011488]
In recent years, the fields of natural language processing (NLP) and information retrieval (IR) have made tremendous progress.
Deep learning models like Recurrent Neural Networks (RNNs), Gated Recurrent Units (GRUs), and Long Short-Term Memory (LSTM) networks, and Transformer [120]-based models like Bidirectional Encoder Representations from Transformers (BERT) [24], Generative Pre-training Transformer (GPT-2) [94], Multi-task Deep Neural Network (MT-DNN) [73], Extra-Long Network (XLNet) [134], Text-to-text transfer Transformer (T5) ...
arXiv Detail & Related papers (2020-08-12T10:42:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.