Attention Is All You Need But You Don't Need All Of It For Inference of Large Language Models
- URL: http://arxiv.org/abs/2407.15516v1
- Date: Mon, 22 Jul 2024 10:09:05 GMT
- Title: Attention Is All You Need But You Don't Need All Of It For Inference of Large Language Models
- Authors: Georgy Tyukin, Gbetondji J-S Dovonon, Jean Kaddour, Pasquale Minervini,
- Abstract summary: We find that dropping the deeper attention layers only marginally decreases performance but leads to the best speedups.
We also observe that skipping layers other than the later ones degrades performance further as more layers are skipped, except when only the attention layers are skipped.
- Score: 14.957045047543405
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The inference demand for LLMs has skyrocketed in recent months, and serving models with low latencies remains challenging due to the quadratic input length complexity of the attention layers. In this work, we investigate the effect of dropping MLP and attention layers at inference time on the performance of Llama-v2 models. We find that dropping the deeper attention layers only marginally decreases performance but leads to the best speedups alongside dropping entire layers. For example, removing 33% of attention layers in a 13B Llama2 model results in a 1.8% drop in average performance over the OpenLLM benchmark. We also observe that skipping layers other than the later ones degrades performance further as more layers are skipped, except when only the attention layers are skipped.
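As a rough illustration of the setting studied in the abstract (a minimal, self-contained PyTorch sketch, not the paper's implementation; the block layout, hidden sizes, and module names below are assumptions), the snippet drops the attention sublayer in the deepest third of a pre-norm decoder stack, so the residual stream simply bypasses the quadratic-cost attention in those blocks while the MLP sublayers are kept:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Pre-norm decoder block with switches to drop the attention and/or MLP sublayer."""
    def __init__(self, d_model, n_heads, skip_attn=False, skip_mlp=False):
        super().__init__()
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp_norm = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.skip_attn = skip_attn
        self.skip_mlp = skip_mlp

    def forward(self, x):
        if not self.skip_attn:   # dropped attention: the residual passes through untouched
            h = self.attn_norm(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
        if not self.skip_mlp:
            x = x + self.mlp(self.mlp_norm(x))
        return x

# Drop attention in the deepest third of a 40-block stack (Llama2-13B has 40 layers),
# mirroring the "remove 33% of attention layers" setting described in the abstract.
n_blocks = 40
cutoff = n_blocks - n_blocks // 3
blocks = nn.ModuleList([DecoderBlock(512, 8, skip_attn=(i >= cutoff))
                        for i in range(n_blocks)])

x = torch.randn(1, 16, 512)  # (batch, sequence, hidden) dummy activations
with torch.no_grad():
    for block in blocks:
        x = block(x)
print(x.shape)  # torch.Size([1, 16, 512])
```

The same switch on skip_mlp covers the MLP-dropping variant, and setting both flags in a block corresponds to dropping the entire layer.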
Related papers
- Reassessing Layer Pruning in LLMs: New Insights and Methods [24.394438652261982]
We show that a simple approach, i.e., pruning the final 25% of layers followed by fine-tuning the lm_head and the last three remaining layers, yields remarkably strong performance (a hedged sketch of this recipe appears after this list).
We release the optimal model weights on Hugging Face, and the code is available on GitHub.
arXiv Detail & Related papers (2024-11-23T13:31:16Z) - Llama SLayer 8B: Shallow Layers Hold the Key to Knowledge Injection [73.06596715100859]
We study the importance of each layer in finding the optimal layer range for knowledge injection.
We propose the S strategy, a post-pretraining strategy that selectively enhances shallow layers while pruning the less effective deep ones.
Based on this strategy, we introduce Llama Slayer-8B and Llama Slayer-8B-Instruct.
arXiv Detail & Related papers (2024-10-03T09:28:59Z) - Investigating Layer Importance in Large Language Models [28.156622049937216]
Large language models (LLMs) have gained increasing attention due to their prominent ability to understand and process texts.
The lack of understanding of LLMs has obstructed their deployment in safety-critical scenarios and hindered the development of better models.
This study identifies cornerstone layers in LLMs and underscores their critical role for future research.
arXiv Detail & Related papers (2024-09-22T09:53:13Z) - Cross-layer Attention Sharing for Large Language Models [44.53618643180393]
LiSA is a lightweight substitute for self-attention in well-trained large language models.
Our implementations achieve a 6X compression of Q and K, with maximum throughput improvements of 19.5% for LLaMA3-8B and 32.3% for LLaMA2-7B.
arXiv Detail & Related papers (2024-08-04T00:38:34Z) - What Matters in Transformers? Not All Attention is Needed [7.857824255138334]
Transformer-based large language models (LLMs) have demonstrated promising performance across various tasks.
However, they also introduce redundant architectures, posing efficiency challenges for real-world deployment.
We investigate redundancy across different modules within Transformers, including blocks and attention layers, using a similarity-based metric.
arXiv Detail & Related papers (2024-06-22T08:41:48Z) - OwLore: Outlier-weighed Layerwise Sampled Low-Rank Projection for Memory-Efficient LLM Fine-tuning [18.102930806071978]
Outlier-weighed Layerwise Sampled Low-Rank Projection (OwLore) is a memory-efficient fine-tuning approach.
OwLore consistently outperforms baseline approaches, including full fine-tuning.
arXiv Detail & Related papers (2024-05-28T17:22:22Z) - FinerCut: Finer-grained Interpretable Layer Pruning for Large Language Models [54.787308652357794]
FinerCut is a new form of fine-grained layer pruning for transformer networks.
Our approach retains 90% performance of Llama3-8B with 25% layers removed, and 95% performance of Llama3-70B with 30% layers removed, all without fine-tuning or post-pruning reconstruction.
arXiv Detail & Related papers (2024-05-28T14:21:15Z) - An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find that the attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs.
We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency.
arXiv Detail & Related papers (2024-03-11T14:35:32Z) - Not All Layers of LLMs Are Necessary During Inference [68.88671495401483]
We show that for some tasks, Large Language Models can achieve results comparable to the final output at some intermediate layers.
We propose a simple yet effective algorithm named AdaInfer to adaptively terminate the inference process for an input instance.
arXiv Detail & Related papers (2024-03-04T16:23:58Z) - Layer Grafted Pre-training: Bridging Contrastive Learning And Masked
Image Modeling For Label-Efficient Representations [130.05189514598996]
Masked Image Modeling (MIM) and Contrastive Learning (CL) demonstrate that self-supervision is a powerful way to learn good representations.
In this paper, we make the empirical observation that a naive joint optimization of CL and MIM losses leads to conflicting gradient directions.
Inspired by these experimental observations, we find that MIM and CL are suited to the lower and higher layers, respectively.
We propose a surprisingly simple "sequential cascade" approach: the early layers are first trained under an MIM loss, on top of which the later layers continue to be trained under a CL loss.
arXiv Detail & Related papers (2023-02-27T20:52:10Z)
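For the layer-pruning recipe summarized in "Reassessing Layer Pruning in LLMs" above (prune the final 25% of layers, then fine-tune only lm_head and the last three remaining layers), a hedged sketch using the Hugging Face Llama layout might look as follows; the model name and attribute paths are illustrative assumptions, not the cited paper's code:

```python
import torch
from transformers import AutoModelForCausalLM

# Illustrative model choice; loading it requires the usual Hugging Face access/auth.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
                                             torch_dtype=torch.float16)

layers = model.model.layers            # decoder blocks of a Llama-style model
keep = int(len(layers) * 0.75)         # prune the final 25% of blocks
model.model.layers = layers[:keep]
model.config.num_hidden_layers = keep

# Freeze everything, then unfreeze only lm_head and the last three remaining blocks.
for p in model.parameters():
    p.requires_grad = False
for p in model.lm_head.parameters():
    p.requires_grad = True
for block in model.model.layers[-3:]:
    for p in block.parameters():
        p.requires_grad = True
# ...fine-tune the unfrozen parameters with the standard causal-LM objective...
```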