Wide Attention Is The Way Forward For Transformers
- URL: http://arxiv.org/abs/2210.00640v1
- Date: Sun, 2 Oct 2022 21:49:54 GMT
- Title: Wide Attention Is The Way Forward For Transformers
- Authors: Jason Ross Brown, Yiren Zhao, Ilia Shumailov, Robert D Mullins
- Abstract summary: We show that wide single layer Transformer models can compete with or outperform deeper ones in a variety of Natural Language Processing (NLP) tasks.
Our results suggest that the critical direction for building better Transformers for NLP is their width, and that their depth is less relevant.
- Score: 9.252523881586054
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The Transformer is an extremely powerful and prominent deep learning
architecture. In this work, we challenge the commonly held belief in deep
learning that going deeper is better, and show an alternative design approach:
building wider attention Transformers. We demonstrate that wide single
layer Transformer models can compete with or outperform deeper ones in a
variety of Natural Language Processing (NLP) tasks when both are trained from
scratch. The impact of changing the model aspect ratio on Transformers is then
studied systematically. This ratio balances the number of layers and the number
of attention heads per layer while keeping the total number of attention heads
and all other hyperparameters constant. On average, across 4 NLP tasks and 10
attention types, single layer wide models perform 0.3% better than their deep
counterparts. We present an in-depth evaluation and demonstrate that wide models
require a far smaller memory footprint and can run faster on commodity
hardware; in addition, these wider models are also more interpretable. For
example, a single layer Transformer on the IMDb byte-level text classification task
has 3.1x faster inference latency on a CPU than its equally accurate deeper
counterpart, and is half the size. Our results suggest that the critical
direction for building better Transformers for NLP is their width, and that
their depth is less relevant.
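To make the aspect-ratio trade-off concrete, the sketch below (not the authors' code; all hyperparameters are illustrative) builds a wide single-layer encoder and a deep six-layer encoder with the same total of 12 attention heads, then compares parameter counts and CPU inference latency. It uses standard PyTorch layers, which tie the per-head dimension to d_model / nhead, whereas the paper keeps the head size fixed, so this simplifies rather than reproduces the paper's setup.

```python
# Minimal sketch, not the authors' code: contrast a wide single-layer encoder
# with a deep counterpart, holding the total number of attention heads at 12.
# Standard PyTorch layers tie head size to d_model / nhead, which is a
# simplification of the paper's fixed-head-size setup.
import time
import torch
import torch.nn as nn

D_MODEL = 384  # divisible by both head counts used below

def make_encoder(num_layers: int, nhead: int) -> nn.Module:
    layer = nn.TransformerEncoderLayer(
        d_model=D_MODEL, nhead=nhead,
        dim_feedforward=4 * D_MODEL, batch_first=True,
    )
    return nn.TransformerEncoder(layer, num_layers=num_layers)

wide = make_encoder(num_layers=1, nhead=12)  # 1 layer x 12 heads
deep = make_encoder(num_layers=6, nhead=2)   # 6 layers x 2 heads

def n_params(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())

x = torch.randn(8, 128, D_MODEL)  # (batch, sequence length, d_model)

def cpu_latency_ms(model: nn.Module, reps: int = 20) -> float:
    model.eval()
    with torch.no_grad():
        model(x)  # warm-up
        start = time.perf_counter()
        for _ in range(reps):
            model(x)
    return 1e3 * (time.perf_counter() - start) / reps

print(f"wide: {n_params(wide):,} params, {cpu_latency_ms(wide):.1f} ms/batch")
print(f"deep: {n_params(deep):,} params, {cpu_latency_ms(deep):.1f} ms/batch")
```

In this toy comparison the single-layer model naturally ends up with roughly a sixth of the deep model's parameters and a lower CPU latency; the abstract's 3.1x latency and 2x size figures come from the paper's IMDb byte-level experiments, not from this sketch.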
Related papers
- An Empirical Study of Mamba-based Language Models [69.74383762508805]
Selective state-space models (SSMs) like Mamba overcome some shortcomings of Transformers.
We present a direct comparison between 8B-parameter Mamba, Mamba-2, and Transformer models trained on the same datasets.
We find that the 8B Mamba-2-Hybrid exceeds the 8B Transformer on all 12 standard tasks.
arXiv Detail & Related papers (2024-06-12T05:25:15Z) - Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles [65.54857068975068]
In this paper, we argue that the extra vision-specific components added to modern hierarchical vision transformers are unnecessary bulk.
By pretraining with a strong visual pretext task (MAE), we can strip out all the bells-and-whistles from a state-of-the-art multi-stage vision transformer.
We create Hiera, an extremely simple hierarchical vision transformer that is more accurate than previous models.
arXiv Detail & Related papers (2023-06-01T17:59:58Z) - Brainformers: Trading Simplicity for Efficiency [39.53511089374572]
We develop a complex block, named Brainformer, that consists of a diverse set of layers.
Brainformer consistently outperforms the state-of-the-art dense and sparse Transformers.
A Brainformer model with 8 billion activated parameters per token demonstrates 2x faster training convergence and 5x faster step time.
arXiv Detail & Related papers (2023-05-29T18:42:01Z) - A Length-Extrapolatable Transformer [98.54835576985664]
We focus on length extrapolation, i.e., training on short texts while evaluating longer sequences.
We introduce a relative position embedding to explicitly maximize attention resolution.
We evaluate different Transformer variants with language modeling.
arXiv Detail & Related papers (2022-12-20T18:56:20Z) - Video Transformers: A Survey [42.314208650554264]
We study the contributions and trends for adapting Transformers to model video data.
Specifically, we delve into how videos are embedded and tokenized, finding very widespread use of large CNN backbones.
Also, we analyse the self-supervised losses used to train Video Transformers, which to date are mostly constrained to contrastive approaches.
arXiv Detail & Related papers (2022-01-16T07:31:55Z) - Sparse is Enough in Scaling Transformers [12.561317511514469]
Large Transformer models yield impressive results on many tasks, but are expensive to train, or even fine-tune, and so slow at decoding that their use and study becomes out of reach.
We propose Scaling Transformers, a family of next generation Transformer models that use sparse layers to scale efficiently and perform unbatched decoding much faster than the standard Transformer.
arXiv Detail & Related papers (2021-11-24T19:53:46Z) - Pay Attention to MLPs [84.54729425918164]
We show that gMLP can perform as well as Transformers in key language and vision applications.
Our comparisons show that self-attention is not critical for Vision Transformers, as gMLP can achieve the same accuracy.
In general, our experiments show that gMLP can scale as well as Transformers over increased data and compute.
arXiv Detail & Related papers (2021-05-17T17:55:04Z) - DeepViT: Towards Deeper Vision Transformer [92.04063170357426]
Vision transformers (ViTs) have been successfully applied in image classification tasks recently.
We show that, unlike convolutional neural networks (CNNs), which can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when scaled to be deeper.
We propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity.
arXiv Detail & Related papers (2021-03-22T14:32:07Z) - AutoTrans: Automating Transformer Design via Reinforced Architecture
Search [52.48985245743108]
This paper empirically explores how to set layer normalization, whether to scale, the number of layers, the number of heads, the activation function, and so on, so that one can obtain a Transformer architecture that better suits the task at hand.
Experiments on CoNLL03, Multi-30k, IWSLT14 and WMT-14 show that the searched Transformer model can outperform standard Transformers.
arXiv Detail & Related papers (2020-09-04T08:46:22Z)