AccLLM: Accelerating Long-Context LLM Inference Via Algorithm-Hardware Co-Design
- URL: http://arxiv.org/abs/2505.03745v1
- Date: Mon, 07 Apr 2025 02:52:30 GMT
- Title: AccLLM: Accelerating Long-Context LLM Inference Via Algorithm-Hardware Co-Design
- Authors: Yanbiao Liang, Huihong Shi, Haikuo Shao, Zhongfeng Wang
- Abstract summary: Large language models (LLMs) have achieved huge success in the natural language processing (NLP) field. We propose AccLLM, a comprehensive acceleration framework that enables efficient and fast long-context LLM inference. We validate AccLLM on the Xilinx Alveo U280 FPGA, demonstrating 4.07x higher energy efficiency and 2.98x higher throughput than the state-of-the-art FlightLLM.
- Score: 3.64182026498686
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, large language models (LLMs) have achieved huge success in the natural language processing (NLP) field, driving a growing demand to extend their deployment from the cloud to edge devices. However, deploying LLMs on resource-constrained edge devices poses significant challenges, including (1) intensive computations and huge model sizes, (2) substantial memory and bandwidth demands introduced by the autoregressive generation process, and (3) limited scalability for handling long sequences. To address these challenges, we propose AccLLM, a comprehensive acceleration framework that enables efficient and fast long-context LLM inference through algorithm and hardware co-design. At the algorithmic level, we integrate (1) pruning, (2) Λ-shaped attention, and (3) an innovative W2A8KV4 (2-bit weights, 8-bit activations, and 4-bit KV cache) quantization scheme, thus effectively reducing memory and bandwidth requirements while facilitating LLMs' long-sequence generation. At the hardware level, we design a dedicated FPGA-based accelerator with a reconfigurable computing engine to effectively and flexibly accommodate the diverse operations arising from our compression algorithm, thereby fully translating the algorithmic innovations into tangible hardware efficiency. We validate AccLLM on the Xilinx Alveo U280 FPGA, demonstrating 4.07x higher energy efficiency and 2.98x higher throughput compared to the state-of-the-art work FlightLLM.
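
The two most distinctive algorithmic pieces above can be pictured concretely. Below is a minimal, hedged sketch (not the authors' implementation): a boolean attention mask that keeps a few initial "sink" tokens plus a sliding local window, which traces the letter Λ, together with a generic symmetric quantizer parameterized by bit-width in the spirit of W2A8KV4. The `num_sink` and `window` hyperparameters are assumptions not specified in the abstract.

```python
import numpy as np

def lambda_mask(seq_len: int, num_sink: int = 4, window: int = 256) -> np.ndarray:
    """Boolean causal mask: each query attends to the first `num_sink`
    (sink) tokens plus the most recent `window` tokens, tracing a Λ shape."""
    i = np.arange(seq_len)[:, None]  # query positions
    j = np.arange(seq_len)[None, :]  # key positions
    causal = j <= i
    sink = j < num_sink              # vertical stripe of the Λ
    local = (i - j) < window         # diagonal band of the Λ
    return causal & (sink | local)

def quantize_symmetric(x: np.ndarray, bits: int):
    """Uniform symmetric quantization to `bits` bits (2 for weights, 8 for
    activations, 4 for the KV cache under a W2A8KV4-style scheme)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(x).max(), 1e-12) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale                  # approximate dequantization: q * scale

# Usage (hypothetical shapes): mask for a 1024-token sequence, 2-bit weights.
mask = lambda_mask(1024)
w_q, w_scale = quantize_symmetric(np.random.randn(64, 64), bits=2)
```

In AccLLM these pieces would act together on FPGA-resident buffers; here they are independent NumPy helpers purely to illustrate the mask shape and the bit-widths.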
Related papers
- Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding [144.70522923640095]
Large language models (LLMs) face low hardware efficiency during decoding. This paper introduces Step-3, a hardware-aware model-system co-design optimized for minimizing decoding costs. Step-3 significantly reduces theoretical decoding costs compared with models like DeepSeek-V3 and Qwen3 MoE 235B.
arXiv Detail & Related papers (2025-07-25T16:53:13Z) - Sliding Window Attention Training for Efficient Large Language Models [55.56483740523027]
We introduce SWAT, which enables efficient long-context handling via Sliding Window Attention Training. This paper first attributes the inefficiency of Transformers to the attention sink phenomenon resulting from the high variance of the softmax operation. Experiments demonstrate that SWAT achieves state-of-the-art performance compared with linear recurrent architectures on eight benchmarks.
arXiv Detail & Related papers (2025-02-26T05:31:44Z) - Progressive Mixed-Precision Decoding for Efficient LLM Inference [49.05448842542558]
We introduce Progressive Mixed-Precision Decoding (PMPD) to address the memory-boundedness of decoding. PMPD achieves a 1.4x-12.2x speedup in matrix-vector multiplications over fp16 models. Our approach delivers a throughput gain of 3.8x-8.0x over fp16 models and up to 1.54x over uniform quantization approaches. (A schematic sketch of a progressive precision schedule appears after this list.)
arXiv Detail & Related papers (2024-10-17T11:46:33Z) - Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores [3.6385567224218556]
Large language models (LLMs) have been widely applied but face challenges in efficient inference.
We introduce a novel bipolar-INT data format that facilitates parallel computing and supports symmetric quantization.
We implement an arbitrary-precision matrix multiplication scheme that decomposes operands and recovers results at the bit level, enabling flexible precision. (A minimal illustration of this bit-plane identity appears after this list.)
arXiv Detail & Related papers (2024-09-26T14:17:58Z) - Dynamic Range Reduction via Branch-and-Bound [1.533133219129073]
A key strategy for enhancing hardware accelerators is reducing the precision of arithmetic operations.
This paper introduces a fully principled Branch-and-Bound algorithm for reducing precision needs in QUBO problems.
Experiments validate our algorithm's effectiveness on an actual quantum annealer.
arXiv Detail & Related papers (2024-09-17T03:07:56Z) - Designing Efficient LLM Accelerators for Edge Devices [1.4128048241287314]
Large Language Models (LLMs) can be deployed on resource-constrained edge devices to reduce reliance on network connections and provide more privacy.
Efficient inference on such constrained hardware remains difficult, so designing new and efficient edge accelerators for LLM inference is crucial.
We propose SECDA-LLM, which utilizes the SECDA methodology to streamline the process of designing, integrating, and deploying efficient FPGA-based LLM accelerators.
arXiv Detail & Related papers (2024-08-01T11:06:05Z) - ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching [9.884452250478216]
We propose ALISA, a novel algorithm-system co-design solution to address the challenges imposed by KV caching.
On the algorithm level, ALISA prioritizes tokens that are most important in generating a new token via a Sparse Window Attention (SWA) algorithm.
On the system level, ALISA employs three-phase token-level dynamic scheduling and optimizes the trade-off between caching and recomputation.
arXiv Detail & Related papers (2024-03-26T01:46:34Z) - FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential of vast untapped consumer-level GPUs. This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and the variability and heterogeneity of peers and devices.
arXiv Detail & Related papers (2023-09-03T13:27:56Z) - Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
M³ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE). MoE achieves better accuracy and over 80% computation reduction but leaves challenges for efficient deployment on FPGA.
Our work, dubbed Edge-MoE, solves the challenges to introduce the first end-to-end FPGA accelerator for multi-task ViT with a collection of architectural innovations.
arXiv Detail & Related papers (2023-05-30T02:24:03Z) - Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks.
The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources.
This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs. (A minimal sketch of a butterfly-structured layer appears after this list.)
arXiv Detail & Related papers (2022-09-20T09:28:26Z) - A Length Adaptive Algorithm-Hardware Co-design of Transformer on FPGA Through Sparse Attention and Dynamic Pipelining [28.336502115532905]
This paper proposes a coherent sequence length adaptive algorithm-hardware co-design for Transformer acceleration.
We develop a hardware-friendly sparse attention operator and a length-aware hardware resource scheduling algorithm.
Our design incurs very small accuracy loss and achieves 80.2x and 2.6x speedups compared to CPU and GPU implementations, respectively.
arXiv Detail & Related papers (2022-08-07T05:48:38Z)
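
For the PMPD entry above, the core idea, progressively lowering arithmetic precision as decoding proceeds, can be pictured as a simple schedule. This is an illustrative reading of the abstract, not the paper's algorithm; the phase boundaries and bit-widths below are assumptions:

```python
def precision_schedule(step: int, total_steps: int,
                       phases=((0.25, 8), (0.75, 4), (1.0, 2))) -> int:
    """Return the weight bit-width to use at decoding `step`.
    `phases` maps a fraction of the generation budget to a bit-width;
    precision only ever decreases as decoding progresses."""
    frac = step / max(total_steps, 1)
    for boundary, bits in phases:
        if frac <= boundary:
            return bits
    return phases[-1][1]

# Example: an 8/4/2-bit schedule over 100 generated tokens.
bits_per_step = [precision_schedule(t, 100) for t in range(100)]
```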
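
For the arbitrary-precision tensor-core entry above, the decompose-and-recover scheme rests on the identity a @ b = sum_k 2^k * (plane_k(a) @ b), where plane_k extracts the k-th bit of each element. Here is a plain NumPy sketch of that identity; the actual work maps the 1-bit partial products onto GPU tensor cores, which this sketch does not attempt:

```python
import numpy as np

def bitplane_matmul(a: np.ndarray, b: np.ndarray, bits: int) -> np.ndarray:
    """Multiply unsigned integer matrices by decomposing `a` into 1-bit
    planes: a @ b == sum over k of 2^k * (plane_k(a) @ b)."""
    acc = np.zeros((a.shape[0], b.shape[1]), dtype=np.int64)
    for k in range(bits):
        plane = (a >> k) & 1  # k-th bit plane of `a`
        acc += (plane.astype(np.int64) @ b.astype(np.int64)) << k
    return acc

# Sanity check against direct multiplication on 4-bit operands.
rng = np.random.default_rng(0)
a = rng.integers(0, 16, size=(8, 8))
b = rng.integers(0, 16, size=(8, 8))
assert np.array_equal(bitplane_matmul(a, b, bits=4), a @ b)
```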
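
For the butterfly accelerator entry above, a butterfly-structured linear layer replaces a dense n-by-n weight with log2(n) sparse factors of 2x2 blocks. A minimal NumPy sketch of applying such factors follows; the pairing and shapes are the standard FFT-style butterfly, assumed rather than taken from the paper:

```python
import numpy as np

def butterfly_layer(x: np.ndarray, factors: list) -> np.ndarray:
    """Apply a butterfly-structured linear map to x (shape (..., n)).
    factors[s] has shape (n // 2, 2, 2): one 2x2 block per position pair
    at stride 2**s, so a dense n*n weight becomes n*log2(n) parameters."""
    n = x.shape[-1]
    y = x.copy()
    for s, blocks in enumerate(factors):
        stride = 2 ** s
        pos = np.arange(n)
        lo = pos[((pos >> s) & 1) == 0]  # first element of each pair
        hi = lo + stride                 # its partner at this stride
        a, b = blocks[:, 0, 0], blocks[:, 0, 1]
        c, d = blocks[:, 1, 0], blocks[:, 1, 1]
        y_lo, y_hi = y[..., lo], y[..., hi]  # fancy indexing copies
        y[..., lo] = a * y_lo + b * y_hi
        y[..., hi] = c * y_lo + d * y_hi
    return y

# Usage with n = 8: log2(8) = 3 factors, each holding 4 random 2x2 blocks.
rng = np.random.default_rng(0)
x = rng.standard_normal(8)
factors = [rng.standard_normal((4, 2, 2)) for _ in range(3)]
y = butterfly_layer(x, factors)
```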