Wide Attention Is The Way Forward For Transformers
- URL: http://arxiv.org/abs/2210.00640v1
- Date: Sun, 2 Oct 2022 21:49:54 GMT
- Title: Wide Attention Is The Way Forward For Transformers
- Authors: Jason Ross Brown, Yiren Zhao, Ilia Shumailov, Robert D Mullins
- Abstract summary: We show that wide single layer Transformer models can compete with or outperform deeper ones in a variety of Natural Language Processing (NLP) tasks.
Our results suggest that the critical direction for building better Transformers for NLP is their width, and that their depth is less relevant.
- Score: 9.252523881586054
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The Transformer is an extremely powerful and prominent deep learning
architecture. In this work, we challenge the commonly held belief in deep
learning that going deeper is better, and show an alternative design approach:
building wider attention Transformers. We demonstrate that wide single
layer Transformer models can compete with or outperform deeper ones in a
variety of Natural Language Processing (NLP) tasks when both are trained from
scratch. The impact of changing the model aspect ratio on Transformers is then
studied systematically. This ratio balances the number of layers and the number
of attention heads per layer while keeping the total number of attention heads
and all other hyperparameters constant. On average, across 4 NLP tasks and 10
attention types, single layer wide models perform 0.3% better than their deep
counterparts. We present an in-depth evaluation and demonstrate that wide models
require a far smaller memory footprint and can run faster on commodity
hardware; in addition, these wider models are also more interpretable. For
example, a single layer Transformer on the IMDb byte-level text classification task
has 3.1x faster inference latency on a CPU than its equally accurate deeper
counterpart, and is half the size. Our results suggest that the critical
direction for building better Transformers for NLP is their width, and that
their depth is less relevant.
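To make the aspect-ratio trade-off concrete, the sketch below (not the authors' code; all hyperparameters are illustrative) builds a wide single-layer encoder and a deep six-layer encoder with the same total of 12 attention heads, then compares parameter counts and CPU inference latency. It uses standard PyTorch layers, which tie the per-head dimension to d_model / nhead, whereas the paper keeps the head size fixed, so this simplifies rather than reproduces the paper's setup.

```python
# Minimal sketch, not the authors' code: contrast a wide single-layer encoder
# with a deep counterpart, holding the total number of attention heads at 12.
# Standard PyTorch layers tie head size to d_model / nhead, which is a
# simplification of the paper's fixed-head-size setup.
import time
import torch
import torch.nn as nn

D_MODEL = 384  # divisible by both head counts used below

def make_encoder(num_layers: int, nhead: int) -> nn.Module:
    layer = nn.TransformerEncoderLayer(
        d_model=D_MODEL, nhead=nhead,
        dim_feedforward=4 * D_MODEL, batch_first=True,
    )
    return nn.TransformerEncoder(layer, num_layers=num_layers)

wide = make_encoder(num_layers=1, nhead=12)  # 1 layer x 12 heads
deep = make_encoder(num_layers=6, nhead=2)   # 6 layers x 2 heads

def n_params(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())

x = torch.randn(8, 128, D_MODEL)  # (batch, sequence length, d_model)

def cpu_latency_ms(model: nn.Module, reps: int = 20) -> float:
    model.eval()
    with torch.no_grad():
        model(x)  # warm-up
        start = time.perf_counter()
        for _ in range(reps):
            model(x)
    return 1e3 * (time.perf_counter() - start) / reps

print(f"wide: {n_params(wide):,} params, {cpu_latency_ms(wide):.1f} ms/batch")
print(f"deep: {n_params(deep):,} params, {cpu_latency_ms(deep):.1f} ms/batch")
```

In this toy comparison the single-layer model naturally ends up with roughly a sixth of the deep model's parameters and a lower CPU latency; the abstract's 3.1x latency and 2x size figures come from the paper's IMDb byte-level experiments, not from this sketch.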
Related papers
- An Empirical Study of Mamba-based Language Models [69.74383762508805]
Selective state-space models (SSMs) like Mamba overcome some shortcomings of Transformers.
We present a direct comparison between 8B-parameter Mamba, Mamba-2, and Transformer models trained on the same datasets.
We find that the 8B Mamba-2-Hybrid exceeds the 8B Transformer on all 12 standard tasks.
arXiv Detail & Related papers (2024-06-12T05:25:15Z) - Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles [65.54857068975068]
In this paper, we argue that the extra vision-specific components added to modern hierarchical vision transformers are unnecessary bulk.
By pretraining with a strong visual pretext task (MAE), we can strip out all the bells-and-whistles from a state-of-the-art multi-stage vision transformer.
We create Hiera, an extremely simple hierarchical vision transformer that is more accurate than previous models.
arXiv Detail & Related papers (2023-06-01T17:59:58Z) - Brainformers: Trading Simplicity for Efficiency [39.53511089374572]
We develop a complex block, named Brainformer, that consists of a diverse set of layers.
Brainformer consistently outperforms the state-of-the-art dense and sparse Transformers.
A Brainformer model with 8 billion activated parameters per token demonstrates 2x faster training convergence and 5x faster step time.
arXiv Detail & Related papers (2023-05-29T18:42:01Z) - A Length-Extrapolatable Transformer [98.54835576985664]
We focus on length extrapolation, i.e., training on short texts while evaluating longer sequences.
We introduce a relative position embedding to explicitly maximize attention resolution.
We evaluate different Transformer variants with language modeling.
arXiv Detail & Related papers (2022-12-20T18:56:20Z) - Video Transformers: A Survey [42.314208650554264]
We study the contributions and trends for adapting Transformers to model video data.
Specifically, we delve into how videos are embedded and tokenized, finding very widespread use of large CNN backbones.
Also, we analyse the self-supervised losses used to train Video Transformers, which to date are mostly constrained to contrastive approaches.
arXiv Detail & Related papers (2022-01-16T07:31:55Z) - Sparse is Enough in Scaling Transformers [12.561317511514469]
Large Transformer models yield impressive results on many tasks, but are expensive to train, or even fine-tune, and so slow at decoding that their use and study becomes out of reach.
We propose Scaling Transformers, a family of next generation Transformer models that use sparse layers to scale efficiently and perform unbatched decoding much faster than the standard Transformer.
arXiv Detail & Related papers (2021-11-24T19:53:46Z) - Pay Attention to MLPs [84.54729425918164]
We show that gMLP can perform as well as Transformers in key language and vision applications.
Our comparisons show that self-attention is not critical for Vision Transformers, as gMLP can achieve the same accuracy.
In general, our experiments show that gMLP can scale as well as Transformers over increased data and compute.
arXiv Detail & Related papers (2021-05-17T17:55:04Z) - DeepViT: Towards Deeper Vision Transformer [92.04063170357426]
Vision transformers (ViTs) have been successfully applied in image classification tasks recently.
We show that, unlike convolutional neural networks (CNNs), which can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when scaled to be deeper.
We propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity.
arXiv Detail & Related papers (2021-03-22T14:32:07Z) - AutoTrans: Automating Transformer Design via Reinforced Architecture
Search [52.48985245743108]
This paper empirically explores how to set layer normalization, whether to scale, the number of layers, the number of heads, the activation function, and so on, so that one can obtain a Transformer architecture that better suits the task at hand.
Experiments on CoNLL03, Multi-30k, IWSLT14 and WMT-14 show that the searched Transformer model can outperform standard Transformers.
arXiv Detail & Related papers (2020-09-04T08:46:22Z)