HGCA: Hybrid GPU-CPU Attention for Long Context LLM Inference
- URL: http://arxiv.org/abs/2507.03153v1
- Date: Thu, 03 Jul 2025 20:20:33 GMT
- Title: HGCA: Hybrid GPU-CPU Attention for Long Context LLM Inference
- Authors: Weishu Deng, Yujie Yang, Peiran Du, Lingfeng Xiang, Zhen Lin, Chen Zhong, Song Jiang, Hui Lu, Jia Rao
- Abstract summary: We present HGCA, a hybrid CPU-GPU attention mechanism for large language models. Experiments across diverse models and workloads show that HGCA achieves superior scalability, supports longer sequences and larger batch sizes, and outperforms existing sparse attention baselines in both performance and accuracy.
- Score: 8.826966369389893
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scaling inference for large language models (LLMs) is increasingly constrained by limited GPU memory, especially due to growing key-value (KV) caches required for long-context generation. While existing approaches offload KV caches to CPU memory or apply sparse attention to reduce GPU load, they often underutilize CPU compute resources and compromise accuracy. We present HGCA, a hybrid CPU-GPU attention mechanism that enables scalable, high-throughput LLM inference with near-full attention quality. HGCA performs dense attention on recently generated KV entries retained in GPU memory and parallel sparse attention on selected, salient KV entries in CPU memory. The attention outputs are efficiently merged using log-sum-exp fusion, minimizing PCIe transfer overhead. HGCA also introduces a fine-grained, per-head sparsification strategy optimized for CPU execution, preserving contextual relevance while reducing computation. Our implementation seamlessly integrates into existing LLM frameworks without requiring model retraining. Experiments across diverse models and workloads show that HGCA achieves superior scalability, supports longer sequences and larger batch sizes, and outperforms existing sparse attention baselines in both performance and accuracy -- all on commodity GPU hardware.
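To make the merging step concrete, the sketch below is a minimal PyTorch illustration (not the authors' implementation) of hybrid attention for a single decode token: dense attention over GPU-resident recent KV entries, sparse attention over a per-head top-k selection of CPU-resident entries, and an exact log-sum-exp merge of the two partial outputs. The function names, the |q·k| saliency criterion, and the tensor layout are illustrative assumptions; HGCA's actual per-head sparsification rule, memory placement, and PCIe scheduling are described in the paper.

```python
import torch


def partial_attention(q, k, v):
    """Attention of one query token over a single KV partition.

    q: [heads, d]; k, v: [heads, kv_len, d].
    Returns the softmax-weighted value sum and the log-sum-exp of the
    logits, which is all that is needed to merge partitions later.
    """
    scale = q.shape[-1] ** -0.5
    logits = torch.einsum("hd,hkd->hk", q, k) * scale        # [heads, kv_len]
    lse = torch.logsumexp(logits, dim=-1)                    # [heads]
    out = torch.einsum("hk,hkd->hd", torch.softmax(logits, dim=-1), v)
    return out, lse


def select_salient_per_head(q, k_cpu, v_cpu, top_k):
    """Hypothetical per-head saliency filter: for each head, keep the
    top_k CPU-resident KV entries with the largest |q·k| score
    (a stand-in for HGCA's per-head sparsification, whose exact rule
    is not reproduced here)."""
    scores = torch.einsum("hd,hkd->hk", q, k_cpu).abs()      # [heads, kv_len]
    idx = scores.topk(min(top_k, scores.shape[-1]), dim=-1).indices
    gather_idx = idx.unsqueeze(-1).expand(-1, -1, k_cpu.shape[-1])
    return torch.gather(k_cpu, 1, gather_idx), torch.gather(v_cpu, 1, gather_idx)


def hybrid_attention(q, k_gpu, v_gpu, k_cpu, v_cpu, top_k=256):
    """Dense attention over recent (GPU-resident) KV entries plus sparse
    attention over salient older (CPU-resident) entries, merged with
    log-sum-exp fusion."""
    out_g, lse_g = partial_attention(q, k_gpu, v_gpu)
    k_sel, v_sel = select_salient_per_head(q, k_cpu, v_cpu, top_k)
    out_c, lse_c = partial_attention(q, k_sel, v_sel)
    # softmax over the two lse values gives each partition's share of the
    # total attention mass; rescaling and summing is exact for the union
    # of the two KV subsets.
    w = torch.softmax(torch.stack([lse_g, lse_c], dim=-1), dim=-1)  # [heads, 2]
    return w[..., :1] * out_g + w[..., 1:] * out_c
```

Because each partition contributes only its output vector and one log-sum-exp scalar per head, the merge never needs the other partition's keys or values, which is what keeps the CPU-to-GPU transfer small.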
Related papers
- ScalableHD: Scalable and High-Throughput Hyperdimensional Computing Inference on Multi-Core CPUs [0.0]
Hyperdimensional Computing (HDC) represents and manipulates information using high-dimensional vectors, called hypervectors (HVs). Traditional HDC methods rely on single-pass, non-parametric training and often suffer from low accuracy. Inference, however, remains lightweight and well-suited for real-time execution.
arXiv Detail & Related papers (2025-06-10T22:46:12Z) - SpecOffload: Unlocking Latent GPU Capacity for LLM Inference on Resource-Constrained Devices [16.407669822378487]
SpecOffload embeds speculative decoding into offloading. Compared to the best baseline, SpecOffload improves GPU core utilization by 4.49x and boosts inference throughput by 2.54x.
arXiv Detail & Related papers (2025-05-15T13:10:31Z) - RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference [27.69137902678418]
RetroInfer is a novel system that exploits the inherent attention sparsity to accelerate long-context inference. We show up to 4.5X speedup over full attention within GPU memory limits and up to 10.5X over sparse attention baselines when the KV cache is extended to CPU memory.
arXiv Detail & Related papers (2025-05-05T18:01:17Z) - Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference [4.497936996651617]
Large language models have been widely adopted across different tasks, but their auto-regressive generation often leads to inefficient resource utilization during inference. In this paper, through an in-depth GPU-level analysis, we reveal that large-batch inference remains memory-bound, with most GPU compute capabilities underutilized due to DRAM bandwidth saturation as the primary bottleneck.
arXiv Detail & Related papers (2025-03-11T11:21:35Z) - HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading [79.38548165722229]
HEADINFER offloads the KV cache to CPU RAM while avoiding the need to fully store the KV cache for any transformer layer on the GPU. We demonstrate that HEADINFER maintains computational efficiency while significantly reducing the memory footprint.
arXiv Detail & Related papers (2025-02-18T06:26:05Z) - KVPR: Efficient LLM Inference with I/O-Aware KV Cache Partial Recomputation [7.204881999658682]
The key-value (KV) cache is used to store intermediate activations for large language models. The memory required for the KV cache grows rapidly, often exceeding the capacity of GPU memory. Existing methods attempt to address these issues by overlapping GPU computation with I/O or employing CPU-GPU heterogeneous execution. We introduce KVPR, an efficient I/O-aware LLM inference method where the CPU first transfers a partial set of activations. KVPR achieves up to 35.8% lower latency and 46.2% higher throughput during decoding compared to state-of-the-art approaches.
arXiv Detail & Related papers (2024-11-26T04:03:14Z) - ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z) - vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z) - CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache, which considerably minimizes its memory footprint.
CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning.
Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
arXiv Detail & Related papers (2024-04-24T16:11:54Z) - FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential of the vast, untapped pool of consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and the variability and heterogeneity of peer devices.
arXiv Detail & Related papers (2023-09-03T13:27:56Z) - Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous Multi-GPU Servers [65.60007071024629]
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z)