RealFormer: Transformer Likes Residual Attention
- URL: http://arxiv.org/abs/2012.11747v2
- Date: Wed, 23 Dec 2020 20:44:30 GMT
- Title: RealFormer: Transformer Likes Residual Attention
- Authors: Ruining He and Anirudh Ravula and Bhargav Kanagal and Joshua Ainslie
- Abstract summary: RealFormer is a simple Residual Attention Layer Transformer architecture.
It significantly outperforms canonical Transformers on a spectrum of tasks including Masked Language Modeling, GLUE, and SQuAD.
- Score: 5.841046725396454
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer is the backbone of modern NLP models. In this paper, we propose
RealFormer, a simple Residual Attention Layer Transformer architecture that
significantly outperforms canonical Transformers on a spectrum of tasks
including Masked Language Modeling, GLUE, and SQuAD. Qualitatively, RealFormer
is easy to implement and requires minimal hyper-parameter tuning. It also
stabilizes training and leads to models with sparser attentions. Code will be
open-sourced upon paper acceptance.
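Until that code is released, the following is a minimal NumPy sketch of the residual attention idea as the abstract describes it: each layer adds the raw pre-softmax attention scores from the layer below to its own scores, then passes the summed scores up the stack. Names are illustrative and the query/key/value projections are elided; this is not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def residual_attention(q, k, v, prev_scores=None):
    """Single-head attention with a RealFormer-style residual score link.

    q, k, v: (seq_len, d) arrays. prev_scores: raw pre-softmax scores
    from the layer below, or None in the first layer.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)      # standard scaled dot-product scores
    if prev_scores is not None:
        scores = scores + prev_scores  # the residual "edge" over attention
    return softmax(scores) @ v, scores

# Thread the raw scores through a stack of layers.
rng = np.random.default_rng(0)
x, scores = rng.normal(size=(8, 16)), None
for _ in range(4):
    x, scores = residual_attention(x, x, x, scores)
```

Because only the raw scores are threaded between layers, the change amounts to a one-line addition to a standard attention implementation, which matches the abstract's claim that RealFormer is easy to implement.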
Related papers
- ConvMixFormer- A Resource-efficient Convolution Mixer for Transformer-based Dynamic Hand Gesture Recognition [5.311735227179715]
We explore and devise a novel ConvMixFormer architecture for dynamic hand gestures.
The proposed method is evaluated on NVidia Dynamic Hand Gesture and Briareo datasets.
Our model has achieved state-of-the-art results on single and multimodal inputs.
arXiv Detail & Related papers (2024-11-11T16:45:18Z)
- Transformers Get Stable: An End-to-End Signal Propagation Theory for Language Models [6.809572275782338]
We develop a unified signal propagation theory and provide formulae that govern the moments of the forward and backward signal through the transformer model.
Our framework can be used to understand and mitigate vanishing/exploding gradients, rank collapse, and instability associated with high attention scores.
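As a hedged illustration of one instability this summary mentions (not the paper's actual formulae), the snippet below shows how high attention scores saturate the softmax: as the logits grow, the output approaches one-hot and the Jacobian diag(p) - p p^T shrinks toward zero, starving the backward signal.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def softmax_jacobian(p):
    # d softmax / d logits = diag(p) - p p^T
    return np.diag(p) - np.outer(p, p)

logits = np.random.default_rng(0).normal(size=8)
for scale in (0.1, 1.0, 10.0, 100.0):
    p = softmax(scale * logits)
    # Larger scores -> p closer to one-hot -> smaller Jacobian norm,
    # i.e. a weaker gradient flowing back through the attention weights.
    print(scale, np.linalg.norm(softmax_jacobian(p)))
```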
arXiv Detail & Related papers (2024-03-14T17:59:14Z)
- Self-Supervised Pre-Training for Table Structure Recognition Transformer [25.04573593082671]
We propose a self-supervised pre-training (SSP) method for table structure recognition transformers.
We discover that the performance gap between the linear projection transformer and the hybrid CNN-transformer can be mitigated by SSP of the visual encoder in the TSR model.
arXiv Detail & Related papers (2024-02-23T19:34:06Z)
- Repeat After Me: Transformers are Better than State Space Models at Copying [53.47717661441142]
We show that while generalized state space models are promising in terms of inference-time efficiency, they are limited compared to transformer models on tasks that require copying from the input context.
arXiv Detail & Related papers (2024-02-01T21:44:11Z)
- A Closer Look at In-Context Learning under Distribution Shifts [24.59271215602147]
We aim to better understand the generality and limitations of in-context learning from the lens of the simple yet fundamental task of linear regression.
We find that both transformers and set-based MLPs exhibit in-context learning under in-distribution evaluations, but transformers more closely emulate the performance of ordinary least squares (OLS).
Transformers also display better resilience to mild distribution shifts, where set-based MLPs falter.
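For reference, a sketch of the kind of comparison this summary implies (the dimensionality, noise level, and form of the shift are invented here): the OLS predictor that transformers are reported to emulate can be computed directly from the prompt's (x, y) examples.

```python
import numpy as np

# Hypothetical in-context prompt: labeled pairs from a noisy linear model,
# plus a query point the in-context learner must label.
rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
X = rng.normal(size=(20, 5))
y = X @ w_true + 0.1 * rng.normal(size=20)
x_query = rng.normal(size=5)

# The OLS baseline fitted on the prompt examples.
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print("in-distribution prediction:", x_query @ w_ols)

# A mild covariate shift: the query is drawn from a scaled distribution.
# One then compares how each learner's error degrades relative to OLS.
print("shifted-query prediction:", (3.0 * x_query) @ w_ols)
```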
arXiv Detail & Related papers (2023-05-26T07:47:21Z)
- Vision Transformer with Quadrangle Attention [76.35955924137986]
We propose a novel quadrangle attention (QA) method that extends the window-based attention to a general quadrangle formulation.
Our method employs an end-to-end learnable quadrangle regression module that predicts a transformation matrix to transform default windows into target quadrangles.
We integrate QA into plain and hierarchical vision transformers to create a new architecture named QFormer, which offers minor code modifications and negligible extra computational cost.
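A minimal sketch of the geometric step this summary describes, under the assumption that the predicted transformation is a 3x3 projective matrix applied to a window's corners (the paper's actual parameterization may differ; all names here are illustrative):

```python
import numpy as np

def window_to_quadrangle(corners, T):
    """Map a default window's corners through a predicted transform T (3x3).
    In QFormer this transform would be regressed per window by a learnable
    module; here it is simply an input."""
    homog = np.hstack([corners, np.ones((4, 1))])  # homogeneous coordinates
    mapped = homog @ T.T
    return mapped[:, :2] / mapped[:, 2:3]          # back to 2-D points

# A unit window and a mild shear-plus-translation standing in for the
# transform a regression head might predict.
window = np.array([[0., 0.], [1., 0.], [1., 1.], [0., 1.]])
T = np.array([[1.0, 0.3, 0.2],
              [0.0, 1.0, 0.1],
              [0.0, 0.0, 1.0]])
print(window_to_quadrangle(window, T))
```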
arXiv Detail & Related papers (2023-03-27T11:13:50Z)
- Decision Transformer: Reinforcement Learning via Sequence Modeling [102.86873656751489]
We present a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem.
Concretely, Decision Transformer is an architecture that casts the problem of RL as conditional sequence modeling.
Despite its simplicity, Decision Transformer matches or exceeds the performance of state-of-the-art offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.
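A hedged sketch of the sequence framing (the return-to-go interleaving follows the Decision Transformer setup; the trajectory data and helper names are fabricated for illustration): each timestep contributes a return-to-go, a state, and an action token, and a causal transformer is trained to predict actions conditioned on the desired return.

```python
import numpy as np

def returns_to_go(rewards):
    # R_t = sum of rewards from step t to the end of the trajectory.
    return np.cumsum(rewards[::-1])[::-1]

def build_sequence(states, actions, rewards):
    """Interleave (return-to-go, state, action) per timestep: the token
    layout a Decision Transformer-style model is trained on."""
    rtg = returns_to_go(rewards)
    return [tok for t in range(len(rewards))
            for tok in (("rtg", rtg[t]), ("state", states[t]), ("action", actions[t]))]

# Toy 3-step trajectory with a single terminal reward.
seq = build_sequence(states=[0, 1, 2], actions=[1, 0, 1], rewards=[0.0, 0.0, 1.0])
print(seq)
```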
arXiv Detail & Related papers (2021-06-02T17:53:39Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds on the observation that convolutions, fully-connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
- Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, which is abbreviated from 'Vision-friendly Transformer'.
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z)
- Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)