MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
- URL: http://arxiv.org/abs/2407.02490v2
- Date: Wed, 30 Oct 2024 14:53:22 GMT
- Title: MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
- Authors: Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu,
- Abstract summary: MInference (Million-tokens Inference) is a sparse-computation method designed to accelerate the pre-filling stage of long-sequence processing.
We demonstrate that MInference effectively reduces inference latency by up to 10x for pre-filling on an A100, while maintaining accuracy.
- Score: 36.49445805074941
- License:
- Abstract: The computational challenges of Large Language Model (LLM) inference remain a significant barrier to their widespread deployment, especially as prompt lengths continue to increase. Due to the quadratic complexity of the attention computation, it takes 30 minutes for an 8B LLM to process a prompt of 1M tokens (i.e., the pre-filling stage) on a single A100 GPU. Existing methods for speeding up pre-filling often fail to maintain acceptable accuracy or efficiency when applied to long-context LLMs. To address this gap, we introduce MInference (Million-tokens Inference), a sparse computation method designed to accelerate pre-filling of long-sequence processing. Specifically, we identify three unique patterns in long-context attention matrices (the A-shape, Vertical-Slash, and Block-Sparse) that can be leveraged for efficient sparse computation on GPUs. We determine the optimal pattern for each attention head offline and dynamically build sparse indices based on the assigned pattern during inference. With the pattern and sparse indices, we perform efficient sparse attention calculations via our optimized GPU kernels to significantly reduce the latency in the pre-filling stage of long-context LLMs. Our proposed technique can be directly applied to existing LLMs without any modifications to the pre-training setup or additional fine-tuning. By evaluating on a wide range of downstream tasks, including InfiniteBench, RULER, PG-19, and Needle In A Haystack, and models including LLaMA-3-1M, GLM4-1M, Yi-200K, Phi-3-128K, and Qwen2-128K, we demonstrate that MInference effectively reduces inference latency by up to 10x for pre-filling on an A100, while maintaining accuracy. Our code is available at https://aka.ms/MInference.
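The pattern-plus-index pipeline described in the abstract can be illustrated with a small, self-contained sketch. The code below emulates the Vertical-Slash pattern for a single attention head in PyTorch: a few trailing queries probe the full key set, the strongest vertical columns and slash diagonals are kept as a sparse index, and attention is then computed only on those positions. The function name, the probe size `last_q`, and the budgets `top_v`/`top_s` are illustrative assumptions; the released MInference implementation uses offline per-head pattern assignment and custom GPU kernels rather than the dense boolean mask used here for readability.

```python
# Minimal sketch (not the authors' kernels): Vertical-Slash dynamic sparse attention
# for one head, with sparsity emulated by a dense boolean mask.
import torch

def vertical_slash_attention(q, k, v, last_q=64, top_v=128, top_s=32):
    """q, k, v: [n, d] for one attention head assigned the Vertical-Slash pattern."""
    n, d = q.shape
    scale = d ** -0.5
    qpos = torch.arange(n)

    # 1) Probe step: full attention of only the last `last_q` queries (cheap, O(last_q * n)).
    causal = qpos[None, :] <= qpos[-last_q:, None]                    # [last_q, n]
    logits = (q[-last_q:] @ k.T) * scale
    probe = torch.softmax(logits.masked_fill(~causal, float("-inf")), dim=-1)

    # 2) Score vertical lines (key columns attended by most queries) and slash lines
    #    (diagonals at a fixed distance behind the query), then keep the top ones.
    v_idx = probe.sum(0).topk(min(top_v, n)).indices                  # vertical column indices
    offsets = qpos[-last_q:, None] - qpos[None, :]                    # query_pos - key_pos
    slash = torch.zeros(n).scatter_add_(0, offsets.clamp(min=0).reshape(-1), probe.reshape(-1))
    s_off = slash.topk(min(top_s, n)).indices                         # slash (diagonal) offsets

    # 3) Build the sparse index as a boolean mask (a real kernel would never materialize it).
    rel = qpos[:, None] - qpos[None, :]                               # [n, n] key offset
    keep = torch.zeros(n, n, dtype=torch.bool)
    keep[:, v_idx] = True                                             # vertical lines
    keep |= (rel[..., None] == s_off).any(-1)                         # slash lines
    keep |= rel == 0                                                  # always keep self
    keep &= rel >= 0                                                  # causality

    # 4) "Sparse" attention, emulated densely for clarity.
    scores = ((q @ k.T) * scale).masked_fill(~keep, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```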
Related papers
- MagicPIG: LSH Sampling for Efficient LLM Generation [41.75038064509643]
We show that TopK attention itself suffers from quality degradation in certain downstream tasks because attention is not always as sparse as expected.
We propose MagicPIG, a heterogeneous system based on Locality Sensitive Hashing (LSH).
MagicPIG significantly reduces the workload of attention while preserving high accuracy for diverse tasks.
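As a rough illustration of the LSH idea, the sketch below samples keys via SimHash (random-hyperplane) collisions and attends only to the sampled set. The function name, the bucket scheme, and the fallback are assumptions for illustration; MagicPIG's actual heterogeneous CPU/GPU design and its sampling corrections are not reproduced here.

```python
# Illustrative sketch of LSH-sampled attention (SimHash collisions), not the MagicPIG system.
import torch

def lsh_sampled_attention(q, K, V, n_bits=8, n_tables=4):
    """q: [d] single decoding query; K, V: [n, d] cached keys/values."""
    n, d = K.shape
    keep = torch.zeros(n, dtype=torch.bool)
    for _ in range(n_tables):
        planes = torch.randn(d, n_bits)                # one table of random hyperplanes
        bits_k = (K @ planes) > 0                      # [n, n_bits] SimHash code per key
        bits_q = (q @ planes) > 0                      # [n_bits]   SimHash code of the query
        keep |= (bits_k == bits_q).all(dim=1)          # keys colliding with the query
    idx = keep.nonzero(as_tuple=True)[0]
    if idx.numel() == 0:                               # degenerate case: fall back to all keys
        idx = torch.arange(n)
    w = torch.softmax((K[idx] @ q) / d ** 0.5, dim=0)
    return w @ V[idx]
```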
arXiv Detail & Related papers (2024-10-21T16:44:51Z) - Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction [47.38471103190534]
Large Language Models (LLMs) have demonstrated remarkable capabilities in handling long context inputs, but this comes at the cost of increased computational resources and latency.
Our research introduces a novel approach for the long context bottleneck to accelerate LLM inference and reduce GPU memory consumption.
We propose an algorithm that uses early layers of an LLM as filters to select and compress input tokens, significantly reducing the context length for subsequent processing.
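A minimal sketch of the filtering idea, under stated assumptions: tokens are scored by the attention the final token pays to them in the early layers, and only the top fraction is kept for the remaining layers. The scoring rule, the `keep_ratio` default, and the interface are illustrative, not the paper's exact selection and compression algorithm.

```python
# Hypothetical early-layer token filter: keep only the tokens the last query attends to most.
import torch

def filter_context_with_early_layers(hidden, early_attn, keep_ratio=0.01):
    """hidden: [n, d] token states; early_attn: list of [heads, n, n] attention maps
    taken from the first few transformer layers."""
    n = hidden.shape[0]
    # Score each input token by the attention mass the final (query) token assigns to it,
    # averaged over heads and over the early layers that were actually run.
    score = torch.stack([a[:, -1, :].mean(dim=0) for a in early_attn]).mean(dim=0)   # [n]
    k = max(1, int(n * keep_ratio))
    idx = score.topk(k).indices.sort().values          # keep surviving tokens in original order
    return hidden[idx], idx                            # compressed sequence for the later layers
```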
arXiv Detail & Related papers (2024-09-25T23:14:47Z) - Enabling Efficient On-Device Fine-Tuning of LLMs Using Only Inference Engines [17.539008562641303]
Large Language Models (LLMs) are currently pre-trained and fine-tuned on large cloud servers.
The next frontier is LLM personalization, where a foundation model can be fine-tuned with user/task-specific data.
Fine-tuning on resource-constrained edge devices presents significant challenges due to substantial memory and computational demands.
arXiv Detail & Related papers (2024-09-23T20:14:09Z) - Empowering 1000 tokens/second on-device LLM prefilling with mllm-NPU [10.80559106452755]
mllm-NPU is an algorithm-system co-design that tackles a few semantic gaps between the LLM architecture and contemporary NPU design.
For the first time, mllm-NPU achieves more than 1,000 tokens/sec prefilling for a billion-parameter model.
arXiv Detail & Related papers (2024-07-08T12:20:45Z) - Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z) - InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory [93.20588235940453]
In this paper, we introduce a training-free memory-based method, InfLLM.
InfLLM stores distant contexts into additional memory units and employs an efficient mechanism to lookup token-relevant units for attention.
Even when the sequence length is scaled to 1,024K, InfLLM still effectively captures long-distance dependencies.
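The sketch below illustrates a block-based context memory in this spirit: distant key/value pairs are grouped into fixed-size units, each summarized here by its mean key (a simplifying assumption), and only the top-scoring units plus a dense local window are attended to. The unit size, window size, and representative-key choice are illustrative rather than InfLLM's exact design.

```python
# Illustrative block-memory lookup for long-context attention (not InfLLM's exact mechanism).
import torch

def memory_attention(q, K, V, local=512, unit=128, top_units=4):
    """q: [d] current query; K, V: [n, d] full key/value cache for a causal model."""
    n, d = K.shape
    if n <= local + unit:                              # not enough distant context: stay dense
        w = torch.softmax((K @ q) / d ** 0.5, dim=0)
        return w @ V

    K_far, V_far = K[:-local], V[:-local]              # distant context goes into memory units
    m = K_far.shape[0] // unit
    reps = K_far[: m * unit].reshape(m, unit, d).mean(dim=1)     # one representative key per unit
    chosen = (reps @ q).topk(min(top_units, m)).indices          # units relevant to this query
    gather = torch.cat([torch.arange(i * unit, (i + 1) * unit) for i in chosen.tolist()])

    K_sel = torch.cat([K_far[gather], K[-local:]])     # retrieved units + dense local window
    V_sel = torch.cat([V_far[gather], V[-local:]])
    w = torch.softmax((K_sel @ q) / d ** 0.5, dim=0)
    return w @ V_sel
```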
arXiv Detail & Related papers (2024-02-07T06:50:42Z) - BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
For the first time, it achieves high-accuracy inference (e.g., 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
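For context, plain 1-bit weight binarization keeps only the sign of each weight plus a per-row scale; the sketch below shows that generic baseline, not BiLLM's actual scheme, which additionally separates salient weights and binarizes residuals.

```python
# Generic sign-plus-scale binarization, shown only to indicate what ~1-bit weights approximate.
import torch

def binarize_rows(W):
    """W: [out, in] weight matrix -> (signs in {-1, +1}, per-row scale) with W ~ alpha[:, None] * B."""
    alpha = W.abs().mean(dim=1)            # per-row scale minimizing the L2 error of sign(W)
    B = torch.sign(W)
    B[B == 0] = 1.0                        # avoid zeros so every weight is exactly +/-1
    return B, alpha

W = torch.randn(4096, 4096)
B, alpha = binarize_rows(W)
W_hat = alpha[:, None] * B                 # dequantized approximation used at inference time
```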
arXiv Detail & Related papers (2024-02-06T09:26:34Z) - RSC: Accelerating Graph Neural Networks Training via Randomized Sparse Computations [56.59168541623729]
Training graph neural networks (GNNs) is time-consuming because sparse graph-based operations are hard to accelerate in hardware.
We explore trading off the computational precision to reduce the time complexity via sampling-based approximation.
We propose Randomized Sparse Computation, which for the first time demonstrates the potential of training GNNs with approximated operations.
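As a generic illustration of sampling-based approximation, the sketch below estimates an aggregation product A @ H by importance-sampling a subset of the inner dimension. This is standard randomized matrix multiplication under illustrative assumptions, not necessarily the exact estimator used in RSC.

```python
# Sampled matrix product: trade precision for speed when aggregating node features.
import torch

def sampled_spmm(A, H, n_samples):
    """A: [n, n] aggregation matrix (dense here for readability); H: [n, d] node features."""
    n = A.shape[1]
    p = A.norm(dim=0) * H.norm(dim=1)          # importance of each inner-dimension index
    p = p / p.sum()
    idx = torch.multinomial(p, n_samples, replacement=True)
    scale = 1.0 / (n_samples * p[idx])         # rescaling that keeps the estimator unbiased
    return (A[:, idx] * scale) @ H[idx]        # ~ A @ H using only n_samples columns/rows
```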
arXiv Detail & Related papers (2022-10-19T17:25:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.