What Matters in Transformers? Not All Attention is Needed
- URL: http://arxiv.org/abs/2406.15786v6
- Date: Thu, 17 Oct 2024 02:43:35 GMT
- Title: What Matters in Transformers? Not All Attention is Needed
- Authors: Shwai He, Guoheng Sun, Zheyu Shen, Ang Li
- Abstract summary: Scaling Transformer-based large language models (LLMs) has demonstrated promising performance across various tasks.
It also introduces redundant architectures, posing efficiency challenges for real-world deployment.
We investigate redundancy across different modules within Transformers, including Blocks, MLP, and Attention layers, using a similarity-based metric.
- Score: 7.857824255138334
- License:
- Abstract: While scaling Transformer-based large language models (LLMs) has demonstrated promising performance across various tasks, it also introduces redundant architectures, posing efficiency challenges for real-world deployment. Despite some recognition of redundancy in LLMs, the variability of redundancy across different architectures in transformers, such as MLP and Attention layers, is under-explored. In this work, we investigate redundancy across different modules within Transformers, including Blocks, MLP, and Attention layers, using a similarity-based metric. Surprisingly, despite the critical role of attention layers in distinguishing transformers from other architectures, we found that a large portion of these layers exhibit excessively high similarity and can be pruned without degrading performance. For instance, Llama-2-70B achieved a 48.4% speedup with only a 2.4% performance drop by pruning half of the attention layers. Furthermore, by tracing model checkpoints throughout the training process, we observed that attention layer redundancy is inherent and consistent across training stages. Additionally, we further propose a method that jointly drops Attention and MLP layers, allowing us to more aggressively drop additional layers. For instance, when dropping 31 layers (Attention + MLP), Llama-2-13B still retains 90% of the performance on the MMLU task. Our work provides valuable insights for future network architecture design. The code is released at: https://github.com/Shwai-He/LLM-Drop.
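The abstract describes the redundancy analysis only at a high level. The sketch below illustrates one plausible reading of a similarity-based metric: score each attention sublayer by how similar its output is to its input on a calibration batch, then replace the highest-scoring (most redundant) sublayers with a pass-through. The `model.layers[i].self_attn` module path, the `SkipAttention` wrapper, and the cosine-similarity formulation are illustrative assumptions, not the released LLM-Drop code.

```python
# Minimal sketch of a similarity-based redundancy probe for attention layers
# (assumed module paths and metric, not the authors' exact implementation).
import torch
import torch.nn.functional as F


class SkipAttention(torch.nn.Module):
    """Pass-through that returns the incoming hidden states unchanged.

    Wraps the result in a tuple to mimic the (attn_output, ...) convention
    used by many decoder blocks; adjust to the target model's interface.
    """
    def forward(self, hidden_states, *args, **kwargs):
        return (hidden_states,)


@torch.no_grad()
def attention_redundancy_scores(model, calib_batch):
    """Score each attention sublayer by the mean cosine similarity between
    its input and output hidden states on one calibration batch.
    A score near 1.0 means the layer barely transforms its input."""
    scores, hooks = {}, []

    def make_hook(idx):
        def hook(module, inputs, output):
            hidden_in = inputs[0]                          # (batch, seq, dim)
            hidden_out = output[0] if isinstance(output, tuple) else output
            scores[idx] = F.cosine_similarity(hidden_in, hidden_out, dim=-1).mean().item()
        return hook

    # `model.layers[i].self_attn` is a placeholder path; point it at the
    # attention sublayer of whatever decoder block structure is in use.
    for idx, layer in enumerate(model.layers):
        hooks.append(layer.self_attn.register_forward_hook(make_hook(idx)))

    model(calib_batch)                                     # one calibration forward pass
    for h in hooks:
        h.remove()
    return scores


def drop_most_redundant_attention(model, scores, k):
    """Replace the k attention sublayers with the highest similarity scores."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    for idx in ranked[:k]:
        model.layers[idx].self_attn = SkipAttention()
```

In practice this amounts to one forward pass over a small calibration set followed by swapping out the selected sublayers; the Llama-2-70B result quoted above (48.4% speedup for a 2.4% performance drop) corresponds to removing roughly half of the attention layers.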
Related papers
- Value Residual Learning For Alleviating Attention Concentration In Transformers [14.898656879574622]
Stacking multiple attention layers leads to attention concentration.
One natural way to address this issue is to use cross-layer attention, allowing information from earlier layers to be directly accessible to later layers.
We propose the Transformer with residual value (ResFormer), which approximates cross-layer attention by adding a residual connection from the values of the first layer to all subsequent layers.
arXiv Detail & Related papers (2024-10-23T14:15:07Z)
- How Lightweight Can A Vision Transformer Be [0.0]
We explore a strategy that uses Mixture-of-Experts (MoE) to streamline, rather than augment, vision transformers.
Each expert in an MoE layer is a SwiGLU feedforward network, where V and W2 are shared across the layer.
We found that the architecture is competitive even at a size of 0.67M parameters.
arXiv Detail & Related papers (2024-07-25T05:23:20Z)
- Attention Is All You Need But You Don't Need All Of It For Inference of Large Language Models [14.957045047543405]
We find that dropping deeper attention layers only marginally decreases performance but leads to the best speedups.
We also observe that skipping layers other than the final ones degrades performance further as more layers are skipped, except when only the attention layers are skipped.
arXiv Detail & Related papers (2024-07-22T10:09:05Z)
- Transformer Layers as Painters [16.43731831488477]
We show that lower and final layers of pretrained transformers differ from middle layers, but that middle layers have a surprising amount of uniformity.
We also show that some classes of problems are robust to skipping layers, running the layers in an order different from how they were trained, or running the layers in parallel.
Our observations suggest that even frozen pretrained models may gracefully trade accuracy for latency by skipping layers or running layers in parallel.
arXiv Detail & Related papers (2024-07-12T14:31:05Z)
- FinerCut: Finer-grained Interpretable Layer Pruning for Large Language Models [54.787308652357794]
FinerCut is a new form of fine-grained layer pruning for transformer networks.
Our approach retains 90% performance of Llama3-8B with 25% layers removed, and 95% performance of Llama3-70B with 30% layers removed, all without fine-tuning or post-pruning reconstruction.
arXiv Detail & Related papers (2024-05-28T14:21:15Z)
- MLP Can Be A Good Transformer Learner [73.01739251050076]
The self-attention mechanism is the key component of the Transformer but is often criticized for its computational demands.
This paper introduces a novel strategy that simplifies vision transformers and reduces computational load through the selective removal of non-essential attention layers.
arXiv Detail & Related papers (2024-04-08T16:40:15Z)
- Masked Image Modeling with Local Multi-Scale Reconstruction [54.91442074100597]
Masked Image Modeling (MIM) achieves outstanding success in self-supervised representation learning.
Existing MIM models conduct the reconstruction task only at the top layer of the encoder.
We design local multi-scale reconstruction, where the lower and upper layers reconstruct fine-scale and coarse-scale supervision signals respectively.
arXiv Detail & Related papers (2023-03-09T13:42:04Z)
- DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition [62.95223898214866]
We explore effective Vision Transformers to pursue a preferable trade-off between the computational complexity and size of the attended receptive field.
With a pyramid architecture, we construct a Multi-Scale Dilated Transformer (DilateFormer) by stacking Multi-Scale Dilated Attention (MSDA) blocks at low-level stages and global multi-head self-attention blocks at high-level stages.
Our experiment results show that our DilateFormer achieves state-of-the-art performance on various vision tasks.
arXiv Detail & Related papers (2023-02-03T14:59:31Z)
- The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers [59.87030906486969]
This paper studies the curious phenomenon that the activation maps of machine learning models with Transformer architectures are sparse.
We show that sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks.
We discuss how sparsity immediately implies a way to significantly reduce the FLOP count and improve efficiency for Transformers.
arXiv Detail & Related papers (2022-10-12T15:25:19Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attention layers have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.