HPU: High-Bandwidth Processing Unit for Scalable, Cost-effective LLM Inference via GPU Co-processing
- URL: http://arxiv.org/abs/2504.16112v1
- Date: Fri, 18 Apr 2025 03:31:08 GMT
- Title: HPU: High-Bandwidth Processing Unit for Scalable, Cost-effective LLM Inference via GPU Co-processing
- Authors: Myunghyun Rhee, Joonseop Sim, Taeyoung Ahn, Seungyong Lee, Daegun Yoon, Euiseok Kim, Kyoung Park, Youngpyo Joo, Hosik Kim
- Abstract summary: We propose a memory-intensive co-processor that enhances GPU resource utilization during large-batched LLM inference. By offloading memory-bound operations, the HPU allows the GPU to focus on compute-intensive tasks, increasing overall efficiency. Our novel GPU-HPU heterogeneous system demonstrates up to 4.1x performance gains and 4.6x energy efficiency improvements over a GPU-only system.
- Score: 3.50604837678178
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The attention layer, a core component of Transformer-based LLMs, exposes inefficiencies in current GPU systems due to its low operational intensity and the substantial memory requirements of its KV caches. We propose a High-bandwidth Processing Unit (HPU), a memory-intensive co-processor that enhances GPU resource utilization during large-batched LLM inference. By offloading memory-bound operations, the HPU allows the GPU to focus on compute-intensive tasks, increasing overall efficiency. Moreover, the HPU, as an add-on card, scales out to accommodate the surging memory demands driven by large batch sizes and extended sequence lengths. In this paper, we present an HPU prototype implemented with PCIe-based FPGA cards mounted on a GPU system. Our novel GPU-HPU heterogeneous system demonstrates up to 4.1x performance gains and 4.6x energy efficiency improvements over a GPU-only system, providing scalability without increasing the number of GPUs.
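To make the operational-intensity argument concrete, here is a minimal back-of-envelope sketch. It is not taken from the paper: the hardware figures, model shape, and routing rule are illustrative assumptions. It estimates the FLOPs-per-byte of batched decode attention over a KV cache and shows why such memory-bound work would be routed to a bandwidth-optimized co-processor while compute-bound GEMMs stay on the GPU.

```python
# Back-of-envelope sketch of why decode attention is memory-bound and can be
# offloaded to a bandwidth-optimized co-processor. Illustrative only: the
# hardware numbers and the routing rule are assumptions, not the paper's design.

GPU_TFLOPS = 300.0        # assumed peak FP16 throughput (TFLOP/s)
GPU_HBM_TBPS = 2.0        # assumed HBM bandwidth (TB/s)
RIDGE = GPU_TFLOPS / GPU_HBM_TBPS   # FLOPs/byte needed to stay compute-bound


def decode_attention_intensity(batch, heads, head_dim, seq_len, bytes_per_elem=2):
    """Arithmetic intensity (FLOPs/byte) of one decode step of attention.

    Each request reads its whole KV cache once (2 * seq_len * heads * head_dim
    elements) and performs about 4 * seq_len * heads * head_dim FLOPs
    (QK^T plus the weighted sum over V), so the intensity stays near
    1 FLOP/byte no matter how large the batch is.
    """
    flops = 4.0 * batch * heads * head_dim * seq_len
    bytes_moved = 2.0 * batch * heads * head_dim * seq_len * bytes_per_elem
    return flops / bytes_moved


def route(op_name, intensity):
    """Toy dispatcher: memory-bound work goes to the add-on co-processor,
    compute-bound work stays on the GPU."""
    target = "co-processor" if intensity < RIDGE else "GPU"
    print(f"{op_name}: {intensity:6.1f} FLOPs/byte -> {target}")


if __name__ == "__main__":
    batch = 256
    route("decode attention (seq=4096)",
          decode_attention_intensity(batch, heads=32, head_dim=128, seq_len=4096))

    # An FFN GEMM reuses its weights across the whole batch, so its intensity
    # grows roughly with batch size and it remains on the GPU.
    d_model, d_ff = 4096, 14336
    gemm_flops = 2.0 * batch * d_model * d_ff
    gemm_bytes = 2.0 * (batch * d_model + d_model * d_ff + batch * d_ff)
    route("FFN GEMM", gemm_flops / gemm_bytes)
```

The point is only the ordering of magnitudes: decode attention sits near 1 FLOP/byte regardless of batch size, so scaling by adding GPUs buys mostly idle compute, which is the utilization gap the HPU add-on card targets.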
Related papers
- Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference [4.497936996651617]
Large language models have been widely adopted across different tasks, but the auto-regressive nature of their generation often leads to inefficient resource utilization during inference.
In this paper, through an in-depth GPU-level analysis, we reveal that large-batch inference remains memory-bound, with most GPU compute capabilities underutilized due to DRAM bandwidth saturation as the primary bottleneck.
arXiv Detail & Related papers (2025-03-11T11:21:35Z)
- HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading [79.38548165722229]
HEADINFER offloads the KV cache to CPU RAM while avoiding the need to fully store the KV cache for any transformer layer on the GPU.
We demonstrate HEADINFER maintains computational efficiency while significantly reducing memory footprint.
arXiv Detail & Related papers (2025-02-18T06:26:05Z)
- Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation [7.204881999658682]
Inference for Large Language Models (LLMs) is computationally demanding.
To reduce the cost of auto-regressive decoding, Key-Value (KV) caching is used to store intermediate activations.
The memory required for KV caching grows rapidly, often exceeding the capacity of GPU memory.
A cost-effective alternative is to offload the KV cache to CPU memory, which alleviates GPU memory pressure but shifts the bottleneck to the limited bandwidth of the PCIe connection between the CPU and GPU (see the back-of-envelope sketch after this list).
arXiv Detail & Related papers (2024-11-26T04:03:14Z)
- MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs [55.95879347182669]
The MoE architecture is renowned for its ability to increase model capacity without a proportional increase in inference cost.
MoE-Lightning introduces a novel CPU-GPU-I/O pipelining schedule, CGOPipe, with paged weights to achieve high resource utilization.
MoE-Lightning can achieve up to 10.3x higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16 GB).
arXiv Detail & Related papers (2024-11-18T01:06:12Z)
- Multi-GPU RI-HF Energies and Analytic Gradients - Towards High Throughput Ab Initio Molecular Dynamics [0.0]
This article presents an optimized algorithm and implementation for calculating resolution-of-the-identity Hartree-Fock energies and analytic gradients using multiple Graphics Processing Units (GPUs).
The algorithm is especially designed for high-throughput ab initio molecular dynamics simulations of small and medium-sized molecules (10-100 atoms).
arXiv Detail & Related papers (2024-07-29T00:14:10Z)
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system that unlocks the vast, untapped potential of consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and high variability across heterogeneous peers and devices.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
- EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction [67.11722682878722]
This work presents EfficientViT, a new family of high-resolution vision models with novel multi-scale linear attention.
Our multi-scale linear attention achieves a global receptive field and multi-scale learning.
EfficientViT delivers remarkable performance gains over previous state-of-the-art models.
arXiv Detail & Related papers (2022-05-29T20:07:23Z)
- PARIS and ELSA: An Elastic Scheduling Algorithm for Reconfigurable Multi-GPU Inference Servers [0.9854614058492648]
NVIDIA's Ampere GPU architecture provides features to "reconfigure" one large, monolithic GPU into multiple smaller "GPU partitions".
In this paper, we study this emerging GPU architecture with reconfigurability to develop a high-performance multi-GPU ML inference server.
arXiv Detail & Related papers (2022-02-27T23:30:55Z)
- Providing Meaningful Data Summarizations Using Examplar-based Clustering in Industry 4.0 [67.80123919697971]
We show that our GPU implementation provides speedups of up to 72x using single-precision and up to 452x using half-precision compared to conventional CPU algorithms.
We apply our algorithm to real-world data from injection molding manufacturing processes and discuss how the found summaries help steer this specific process to cut costs and reduce the manufacturing of defective parts.
arXiv Detail & Related papers (2021-05-25T15:55:14Z)
- GPU Domain Specialization via Composable On-Package Architecture [0.8240720472180706]
We propose a Composable On-PAckage GPU (COPA-GPU) architecture to provide domain-specialized GPU products.
We show how a COPA-GPU enables DL-specialized products by modular augmentation of the baseline GPU architecture with up to 4x higher off-die bandwidth, 32x larger on-package cache, 2.3x higher DRAM bandwidth and capacity, while conveniently supporting scaled-down HPC-oriented designs.
arXiv Detail & Related papers (2021-04-05T23:06:50Z)
- The Architectural Implications of Distributed Reinforcement Learning on CPU-GPU Systems [45.479582612113205]
We show how to improve the performance and power efficiency of RL training on CPU-GPU systems.
We quantify the overall hardware utilization on a state-of-the-art distributed RL training framework.
We also introduce a new system design metric, CPU/GPU ratio, and show how to find the optimal balance between CPU and GPU resources.
arXiv Detail & Related papers (2020-12-08T04:50:05Z)
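Several of the KV-cache offloading entries above (HEADINFER and the I/O-aware partial KV cache recomputation work) hinge on the same quantitative point: the KV cache grows large relative to GPU memory, and streaming it over PCIe is far slower than reading it from HBM. The sketch below, referenced from those entries, uses illustrative model and bandwidth assumptions only, not numbers from the listed papers.

```python
# Rough estimate of KV-cache size and of streaming it over PCIe vs. HBM.
# Model shape and bandwidth figures are illustrative assumptions, not values
# taken from any of the papers listed above.

def kv_cache_bytes(layers, heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes of one request's KV cache (K and V for every layer, FP16)."""
    return 2 * layers * heads * head_dim * seq_len * bytes_per_elem


layers, heads, head_dim = 32, 32, 128     # roughly a 7B-parameter model (assumed)
seq_len, batch = 4096, 64

per_req = kv_cache_bytes(layers, heads, head_dim, seq_len)
total = per_req * batch
print(f"KV cache per request      : {per_req / 2**30:5.2f} GiB")
print(f"KV cache for batch of {batch:3d} : {total / 2**30:5.1f} GiB (exceeds one GPU's HBM)")

# Reading the whole batch's KV cache once per generated token:
HBM_BPS = 2e12    # assumed HBM bandwidth of one GPU, bytes/s
PCIE_BPS = 32e9   # assumed PCIe 4.0 x16 bandwidth, bytes/s
print(f"per-token read from HBM (if it fit): {total / HBM_BPS * 1e3:7.1f} ms")
print(f"per-token read over PCIe           : {total / PCIE_BPS * 1e3:7.1f} ms")
```

At these assumed rates the PCIe path is roughly 60x slower per token than HBM, which is why the offloading papers above recompute, page, or restructure transfers rather than streaming the full cache on every decode step.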
This list is automatically generated from the titles and abstracts of the papers in this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.