Efficient and Economic Large Language Model Inference with Attention Offloading
- URL: http://arxiv.org/abs/2405.01814v1
- Date: Fri, 3 May 2024 02:15:15 GMT
- Title: Efficient and Economic Large Language Model Inference with Attention Offloading
- Authors: Shaoyuan Chen, Yutong Lin, Mingxing Zhang, Yongwei Wu,
- Abstract summary: Transformer-based large language models (LLMs) exhibit impressive performance in generative tasks but introduce significant challenges in real-world serving.
This mismatch arises from the autoregressive nature of LLMs, where the generation phase comprises operators with varying resource demands.
To enhance the efficiency and cost-effectiveness of LLM serving, we introduce the concept of attention offloading.
- Score: 11.698376311689456
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based large language models (LLMs) exhibit impressive performance in generative tasks but introduce significant challenges in real-world serving due to inefficient use of the expensive, computation-optimized accelerators. This mismatch arises from the autoregressive nature of LLMs, where the generation phase comprises operators with varying resource demands. Specifically, the attention operator is memory-intensive, exhibiting a memory access pattern that clashes with the strengths of modern accelerators, especially as context length increases. To enhance the efficiency and cost-effectiveness of LLM serving, we introduce the concept of attention offloading. This approach leverages a collection of cheap, memory-optimized devices for the attention operator while still utilizing high-end accelerators for other parts of the model. This heterogeneous setup ensures that each component is tailored to its specific workload, maximizing overall performance and cost efficiency. Our comprehensive analysis and experiments confirm the viability of splitting the attention computation over multiple devices. Also, the communication bandwidth required between heterogeneous devices proves to be manageable with prevalent networking technologies. To further validate our theory, we develop Lamina, an LLM inference system that incorporates attention offloading. Experimental results indicate that Lamina can provide 1.48x-12.1x higher estimated throughput per dollar than homogeneous solutions.
Related papers
- GUIDE: A Global Unified Inference Engine for Deploying Large Language Models in Heterogeneous Environments [1.0558515062670693]
Large language models (LLMs) in real-world scenarios remains a critical challenge.
These challenges often lead to inefficiencies in memory utilization, latency, and throughput.
We develop a framework to address these issues, achieving prediction errors between 9.9% and 42.3% for key metrics such as batch latency, TTFT, and decode throughput.
arXiv Detail & Related papers (2024-12-06T05:46:43Z) - CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction [77.8576094863446]
We propose a new detextbfCoupled dutextbfAl-interactive lineatextbfR atttextbfEntion (CARE) mechanism.
We first propose an asymmetrical feature decoupling strategy that asymmetrically decouples the learning process for local inductive bias and long-range dependencies.
By adopting a decoupled learning way and fully exploiting complementarity across features, our method can achieve both high efficiency and accuracy.
arXiv Detail & Related papers (2024-11-25T07:56:13Z) - MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices [24.1144641404561]
We propose a scheme for exact attention inference acceleration on memory-constrained edge accelerators.
We show up to 2.75x speedup and 54% reduction in energy consumption as compared to the state-of-the-art attention fusion method (FLAT) in the edge computing scenario.
arXiv Detail & Related papers (2024-11-20T19:44:26Z) - Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design [59.00758127310582]
We propose a novel framework Read-ME that transforms pre-trained dense LLMs into smaller MoE models.
Our approach employs activation sparsity to extract experts.
Read-ME outperforms other popular open-source dense models of similar scales.
arXiv Detail & Related papers (2024-10-24T19:48:51Z) - FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models [50.331708897857574]
We introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications.
FactorLLM achieves comparable performance to the source model securing up to 85% model performance while obtaining over a 30% increase in inference speed.
arXiv Detail & Related papers (2024-08-15T16:45:16Z) - SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z) - Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query.
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
arXiv Detail & Related papers (2024-06-24T15:55:59Z) - Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization [0.6445087473595953]
Large language models (LLMs) demonstrate outstanding performance in various tasks in machine learning.
deploying LLM inference poses challenges due to the high compute and memory requirements.
We present Tender, an algorithm-hardware co-design solution that enables efficient deployment of LLM inference at low precision.
arXiv Detail & Related papers (2024-06-16T09:51:55Z) - MELTing point: Mobile Evaluation of Language Transformers [8.238355633015068]
We explore the current state of mobile execution of Large Language Models (LLMs)
We have created our own automation infrastructure, MELT, which supports the headless execution and benchmarking of LLMs on device.
We evaluate popular instruction fine-tuned LLMs and leverage different frameworks to measure their end-to-end and granular performance.
arXiv Detail & Related papers (2024-03-19T15:51:21Z) - FLatten Transformer: Vision Transformer using Focused Linear Attention [80.61335173752146]
Linear attention offers a much more efficient alternative with its linear complexity.
Current linear attention approaches either suffer from significant performance degradation or introduce additional computation overhead.
We propose a novel Focused Linear Attention module to achieve both high efficiency and expressiveness.
arXiv Detail & Related papers (2023-08-01T10:37:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.