Leaner Transformers: More Heads, Less Depth
- URL: http://arxiv.org/abs/2505.20802v1
- Date: Tue, 27 May 2025 07:06:54 GMT
- Title: Leaner Transformers: More Heads, Less Depth
- Authors: Hemanth Saratchandran, Damien Teney, Simon Lucey
- Abstract summary: Transformers have reshaped machine learning by utilizing attention mechanisms to capture complex patterns in large datasets. This paper challenges the belief that bigger means better by showing that many existing transformers might be unnecessarily oversized. We exploit a theoretical insight about multi-head attention and redesign popular architectures with an increased number of heads.
- Score: 39.80661571556767
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have reshaped machine learning by utilizing attention mechanisms to capture complex patterns in large datasets, leading to significant improvements in performance. This success has contributed to the belief that "bigger means better", leading to ever-increasing model sizes. This paper challenges this ideology by showing that many existing transformers might be unnecessarily oversized. We discover a theoretical principle that redefines the role of multi-head attention. An important benefit of multiple heads is that they improve the conditioning of the attention block. We exploit this theoretical insight and redesign popular architectures with an increased number of heads. The improvement in conditioning proves so significant in practice that model depth can be decreased, reducing the parameter count by up to 30-50% while maintaining accuracy. We obtain consistent benefits across a variety of transformer-based architectures of various scales, on tasks in computer vision (ImageNet-1k) as well as language and sequence modeling (GLUE benchmark, TinyStories, and the Long-Range Arena benchmark).
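As a rough illustration of the heads-versus-depth trade-off described in the abstract, the sketch below compares the parameter count of a ViT-Base-style encoder with a shallower variant that uses more heads. The specific settings (768-dim model, 12 vs. 24 heads, 12 vs. 8 layers) are illustrative assumptions, not the paper's exact recipe; since the model width is unchanged, adding heads is essentially free in parameters, so the savings come entirely from the reduced depth.

```python
# Sketch: count encoder parameters for a deep/few-heads baseline vs. a
# shallower/many-heads variant at the same model width (illustrative settings).
import torch.nn as nn

def encoder(d_model: int, n_heads: int, n_layers: int) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(
        d_model=d_model,
        nhead=n_heads,
        dim_feedforward=4 * d_model,  # standard 4x MLP expansion
        batch_first=True,
    )
    return nn.TransformerEncoder(layer, num_layers=n_layers)

def n_params(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())

baseline = encoder(d_model=768, n_heads=12, n_layers=12)  # ViT-Base-like
leaner   = encoder(d_model=768, n_heads=24, n_layers=8)   # more heads, less depth

print(f"baseline: {n_params(baseline) / 1e6:.1f}M parameters")
print(f"leaner:   {n_params(leaner) / 1e6:.1f}M parameters")
# Extra heads do not add parameters (the per-head dimension shrinks instead),
# so the ~33% saving here comes entirely from removing layers.
```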
Related papers
- Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing [18.673619610942197]
Modern transformer models tend to learn strong outliers in their activations, making them difficult to quantize.
We show that strong outliers are related to very specific behavior of attention heads that try to learn a "no-op" or just a partial update of the residual.
We propose two simple, independent modifications to the attention mechanism: clipped softmax and gated attention (both sketched below).
arXiv Detail & Related papers (2023-06-22T14:39:04Z)
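A rough sketch of the two modifications named above, under assumed parameter values: clipped softmax stretches the softmax output slightly past [0, 1] and clips it back, so a head can assign exactly zero attention without driving its logits to extreme magnitudes; gated attention scales a head's output by a learned sigmoid gate so the head can cheaply "do nothing". The zeta/gamma values and the placement of the gate here are assumptions, not necessarily the paper's configuration.

```python
# Sketch of clipped softmax and a per-head output gate (illustrative values).
import torch
import torch.nn.functional as F

def clipped_softmax(logits: torch.Tensor,
                    zeta: float = 1.003, gamma: float = -0.003) -> torch.Tensor:
    """Stretch softmax output to [gamma, zeta], then clip back to [0, 1] so
    exact zero attention weights are reachable with finite logits."""
    probs = F.softmax(logits, dim=-1)
    return torch.clamp((zeta - gamma) * probs + gamma, min=0.0, max=1.0)

class GatedHeadOutput(torch.nn.Module):
    """Scale a head's output by a learned sigmoid gate computed from the
    input, letting the head suppress its residual update ("do nothing")."""

    def __init__(self, d_model: int):
        super().__init__()
        self.gate = torch.nn.Linear(d_model, 1)

    def forward(self, x: torch.Tensor, head_out: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.gate(x)) * head_out
```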
- Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles [65.54857068975068]
In this paper, we argue that this additional bulk is unnecessary.
By pretraining with a strong visual pretext task (MAE), we can strip out all the bells-and-whistles from a state-of-the-art multi-stage vision transformer.
We create Hiera, an extremely simple hierarchical vision transformer that is more accurate than previous models.
arXiv Detail & Related papers (2023-06-01T17:59:58Z)
- Wide Attention Is The Way Forward For Transformers [9.252523881586054]
We show that wide single layer Transformer models can compete with or outperform deeper ones in a variety of Natural Language Processing (NLP) tasks.
Our results suggest that the critical direction for building better Transformers for NLP is their width, and that their depth is less relevant.
arXiv Detail & Related papers (2022-10-02T21:49:54Z)
- AdaViT: Adaptive Vision Transformers for Efficient Image Recognition [78.07924262215181]
We introduce AdaViT, an adaptive framework that learns to derive usage policies on which patches, self-attention heads and transformer blocks to use.
Our method obtains more than a 2x improvement in efficiency compared to state-of-the-art vision transformers, with only a 0.8% drop in accuracy.
arXiv Detail & Related papers (2021-11-30T18:57:02Z)
- Language Modeling using LMUs: 10x Better Data Efficiency or Improved Scaling Compared to Transformers [4.899818550820576]
We construct a Legendre Memory Unit based model that introduces a general prior for sequence processing.
We show that our new architecture attains the same accuracy as transformers with 10x fewer tokens.
arXiv Detail & Related papers (2021-10-05T23:20:37Z)
- Armour: Generalizable Compact Self-Attention for Vision Transformers [0.0]
This paper introduces a compact self-attention mechanism that is fundamental and highly generalizable.
We show its drop-in applicability for both the regular attention mechanism and some of the most recent variants in vision transformers.
arXiv Detail & Related papers (2021-08-03T22:33:58Z)
- Local-to-Global Self-Attention in Vision Transformers [130.0369761612812]
Transformers have demonstrated great potential in computer vision tasks.
Some recent Transformer models adopt a hierarchical design, where self-attentions are only computed within local windows.
This design significantly improves the efficiency but lacks global feature reasoning in early stages.
In this work, we design a multi-path structure of the Transformer, which enables local-to-global reasoning at multiple granularities in each stage.
arXiv Detail & Related papers (2021-07-10T02:34:55Z)
- IA-RED$^2$: Interpretability-Aware Redundancy Reduction for Vision Transformers [81.31885548824926]
The self-attention-based transformer is rapidly becoming the leading backbone in computer vision.
We present an Interpretability-Aware REDundancy REDuction framework (IA-RED$^2$).
We include extensive experiments on both image and video tasks, where our method delivers up to a 1.4x speed-up.
arXiv Detail & Related papers (2021-06-23T18:29:23Z)
- Low-Rank Bottleneck in Multi-head Attention Models [74.83235382203604]
We argue that the scaling between the number of heads and the size of each head in the current architecture gives rise to a low-rank bottleneck in attention heads.
We propose to set the head size of an attention unit to the input sequence length, independent of the number of heads, resulting in multi-head attention layers with provably more expressive power (see the sketch below).
arXiv Detail & Related papers (2020-02-17T16:16:40Z)
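A minimal sketch of the fixed-head-size idea from the last entry, assuming the per-head dimension is tied to an expected sequence length (here 128) rather than to d_model / num_heads; the projection shapes are the point, not a faithful reimplementation of the paper.

```python
# Sketch: multi-head attention whose per-head dimension is decoupled from
# d_model / num_heads and instead set to the (expected) sequence length.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class FixedHeadSizeAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int, head_dim: int):
        super().__init__()
        self.h, self.d = num_heads, head_dim
        # Projections map d_model -> num_heads * head_dim, so head_dim no
        # longer shrinks as heads are added (avoiding the low-rank bottleneck).
        self.q = nn.Linear(d_model, num_heads * head_dim)
        self.k = nn.Linear(d_model, num_heads * head_dim)
        self.v = nn.Linear(d_model, num_heads * head_dim)
        self.out = nn.Linear(num_heads * head_dim, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, _ = x.shape
        def split(t):  # (b, n, h*d) -> (b, h, n, d)
            return t.view(b, n, self.h, self.d).transpose(1, 2)
        q, k, v = split(self.q(x)), split(self.k(x)), split(self.v(x))
        attn = F.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d), dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(b, n, self.h * self.d)
        return self.out(y)

# e.g. head_dim set to an assumed sequence length of 128
# instead of d_model // num_heads = 32
layer = FixedHeadSizeAttention(d_model=256, num_heads=8, head_dim=128)
```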