Related papers: What Matters in Transformers? Not All Attention is Needed

What Matters in Transformers? Not All Attention is Needed

URL: http://arxiv.org/abs/2406.15786v3
Date: Fri, 19 Jul 2024 18:31:44 GMT
Title: What Matters in Transformers? Not All Attention is Needed
Authors: Shwai He, Guoheng Sun, Zheyu Shen, Ang Li,
Abstract summary: Scaling Transformer-based large language models (LLMs) has demonstrated promising performance across various tasks. However, it also introduces redundant structures, posing challenges for real-world deployment. We investigate the varying redundancy across different modules, including Blocks, Transformers and Attention layers, using a similarity-based metric.
Score: 7.857824255138334
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Scaling Transformer-based large language models (LLMs) has demonstrated promising performance across various tasks. However, it also introduces redundant structures, posing challenges for real-world deployment. Despite some recognition of redundancy in LLMs, the variability of redundancy across different modules, such as MLP and Attention layers, is under-explored. In this work, we investigate the varying redundancy across different modules within Transformers, including Blocks, MLP, and Attention layers, using a similarity-based metric. This metric operates on the premise that redundant structures produce outputs highly similar to their inputs. Surprisingly, while attention layers are essential for transformers and distinguish them from other mainstream architectures, we found that a large proportion of attention layers exhibit excessively high similarity and can be safely pruned without degrading performance, leading to reduced memory and computation costs. Additionally, we further propose a method that jointly drops Attention and MLP layers, achieving improved performance and dropping ratios. Extensive experiments demonstrate the effectiveness of our methods, e.g., Llama-3-70B maintains comparable performance even after pruning half of the attention layers. Our findings provide valuable insights for future network architecture design. The code is released at: \url{https://github.com/Shwai-He/LLM-Drop}.

Related papers

Enhancing Layer Attention Efficiency through Pruning Redundant Retrievals [0.0]
We propose a novel approach to quantify redundancy by leveraging the Kullback-Leibler (KL) divergence between adjacent layers. We also introduce an Enhanced Beta Quantile Mapping (EBQM) method that accurately identifies and skips redundant layers. Our proposed Efficient Layer Attention (ELA) architecture, improves both training efficiency and overall performance, achieving a 30% reduction in training time.
arXiv Detail & Related papers (2025-03-09T06:20:11Z)
UniMoD: Efficient Unified Multimodal Transformers with Mixture-of-Depths [17.68867710994329]
UniMoD is a task-aware token pruning method that employs a separate router for each task to determine which tokens should be pruned. We apply our method to Show-o and Emu3, reducing training FLOPs by approximately 15% in Show-o and 40% in Emu3, while maintaining or improving performance on several benchmarks.
arXiv Detail & Related papers (2025-02-10T13:52:52Z)
Value Residual Learning For Alleviating Attention Concentration In Transformers [14.898656879574622]
stacking multiple attention layers leads to attention concentration. One natural way to address this issue is to use cross-layer attention, allowing information from earlier layers to be directly accessible to later layers. We propose Transformer with residual value (ResFormer) which approximates cross-layer attention through adding a residual connection from the values of the the first layer to all subsequent layers.
arXiv Detail & Related papers (2024-10-23T14:15:07Z)
How Lightweight Can A Vision Transformer Be [0.0]
We explore a strategy that uses Mixture-of-Experts (MoE) to streamline, rather than augment, vision transformers. Each expert in an MoE layer is a SwiGLU feedforward network, where V and W2 are shared across the layer. We found that the architecture is competitive even at a size of 0.67M parameters.
arXiv Detail & Related papers (2024-07-25T05:23:20Z)
Attention Is All You Need But You Don't Need All Of It For Inference of Large Language Models [14.957045047543405]
We find that dropping dreeper attention layers only marginally decreases performance but leads to the best speedups. We also observe that skipping layers except the latter layers reduces performances for more layers skipped, except for skipping the attention layers.
arXiv Detail & Related papers (2024-07-22T10:09:05Z)
Transformer Layers as Painters [16.43731831488477]
We show that lower and final layers of pretrained transformers differ from middle layers, but that middle layers have a surprising amount of uniformity. We also show that some classes of problems have to robustness to skipping layers, running the layers in an order different from how they were trained, or running the layers in parallel. Our observations suggest that even frozen pretrained models may gracefully trade accuracy for latency by skipping layers or running layers in parallel.
arXiv Detail & Related papers (2024-07-12T14:31:05Z)
FinerCut: Finer-grained Interpretable Layer Pruning for Large Language Models [54.787308652357794]
FinerCut is a new form of fine-grained layer pruning for transformer networks. Our approach retains 90% performance of Llama3-8B with 25% layers removed, and 95% performance of Llama3-70B with 30% layers removed, all without fine-tuning or post-pruning reconstruction.
arXiv Detail & Related papers (2024-05-28T14:21:15Z)
MLP Can Be A Good Transformer Learner [73.01739251050076]
Self-attention mechanism is the key of the Transformer but often criticized for its computation demands. This paper introduces a novel strategy that simplifies vision transformers and reduces computational load through the selective removal of non-essential attention layers.
arXiv Detail & Related papers (2024-04-08T16:40:15Z)
Masked Image Modeling with Local Multi-Scale Reconstruction [54.91442074100597]
Masked Image Modeling (MIM) achieves outstanding success in self-supervised representation learning. Existing MIM models conduct reconstruction task only at the top layer of encoder. We design local multi-scale reconstruction, where the lower and upper layers reconstruct fine-scale and coarse-scale supervision signals respectively.
arXiv Detail & Related papers (2023-03-09T13:42:04Z)
DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition [62.95223898214866]
We explore effective Vision Transformers to pursue a preferable trade-off between the computational complexity and size of the attended receptive field. With a pyramid architecture, we construct a Multi-Scale Dilated Transformer (DilateFormer) by stacking MSDA blocks at low-level stages and global multi-head self-attention blocks at high-level stages. Our experiment results show that our DilateFormer achieves state-of-the-art performance on various vision tasks.
arXiv Detail & Related papers (2023-02-03T14:59:31Z)
The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers [59.87030906486969]
This paper studies the curious phenomenon for machine learning models with Transformer architectures that their activation maps are sparse. We show that sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks. We discuss how sparsity immediately implies a way to significantly reduce the FLOP count and improve efficiency for Transformers.
arXiv Detail & Related papers (2022-10-12T15:25:19Z)
Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
Less attention vIsion Transformer builds upon the fact that convolutions, fully-connected layers, and self-attentions have almost equivalent mathematical expressions for processing image patch sequences. The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)

This list is automatically generated from the titles and abstracts of the papers in this site.