FPGA Co-Design for Efficient N:M Sparse and Quantized Model Inference
- URL: http://arxiv.org/abs/2512.24713v1
- Date: Wed, 31 Dec 2025 08:27:40 GMT
- Title: FPGA Co-Design for Efficient N:M Sparse and Quantized Model Inference
- Authors: Fen-Yu Hsieh, Yun-Chang Teng, Ding-Yong Hong, Jan-Jan Wu,
- Abstract summary: Large language models (LLMs) have demonstrated remarkable performance across a wide range of language processing tasks.<n>This work introduces an automation framework that leverages weight pruning and low-bit quantization.<n>We present a hardware-software co-design method that generates accelerators on the Field-Programmable Gate Array (FPGA) platform.
- Score: 0.8749675983608171
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have demonstrated remarkable performance across a wide range of language processing tasks. However, this success comes at the cost of substantial computation and memory requirements, which significantly impedes their deployment in resource-constrained environments. To address this challenge, this work introduces an automation framework that leverages weight pruning and low-bit quantization, and presents a hardware-software co-design method that generates accelerators on the Field-Programmable Gate Array (FPGA) platform. In particular, we implement a unified pipeline that applies N:M structured pruning and 4-bit integer quantization to reduce the memory footprint, followed by optimized dequantization and matrix multiplication to enhance LLM inference on several hardware platforms, including CPUs, NVIDIA GPUs with Dense and 2:4 Sparse Tensor Cores, and a custom systolic-array-based FPGA accelerator. Utilizing 2:4 sparsity combined with quantization on $4096 \times 4096$ matrices, our approach achieves a reduction of up to $4\times$ in weight storage and a $1.71\times$ speedup in matrix multiplication, yielding a $1.29\times$ end-to-end latency reduction compared to dense GPU baselines. Scaling analysis on the LLaMA-7B model further shows that structured sparsity enhances the throughput per token by $1.36\times$. These results demonstrate the synergy of fine-grained N:M sparsity and quantization for enabling efficient and deployable LLM inference, while the proposed FPGA accelerator offers a flexible architectural path for supporting a broader class of sparsity patterns beyond the fixed 2:4 hardware constraints.
Related papers
- TeLLMe v2: An Efficient End-to-End Ternary LLM Prefill and Decode Accelerator with Table-Lookup Matmul on Edge FPGAs [9.646882213709814]
TeLLMe is a table-lookup-based ternary LLM accelerator for low-power edge FPGAs.<n>It supports both prefill and autoregressive decoding using 1.58-bit weights and 8-bit activations.<n>Under a 5W power budget, TeLLMe delivers up to 25tokens/s decoding throughput.
arXiv Detail & Related papers (2025-10-03T05:37:51Z) - PT$^2$-LLM: Post-Training Ternarization for Large Language Models [52.4629647715623]
Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment.<n>We propose PT$2$-LLM, a post-training ternarization framework tailored for LLMs.<n>At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline.
arXiv Detail & Related papers (2025-09-27T03:01:48Z) - APT-LLM: Exploiting Arbitrary-Precision Tensor Core Computing for LLM Acceleration [5.075697428779204]
Large language models (LLMs) have revolutionized AI applications, yet their enormous computational demands severely limit deployment and real-time performance.<n>This is primarily due to the limited support for the GPU Cores, inefficient memory management, and inflexible kernel optimizations.<n>We propose a comprehensive acceleration scheme for arbitrary precision LLMs, namely APT-LLM.
arXiv Detail & Related papers (2025-08-26T14:48:29Z) - GANQ: GPU-Adaptive Non-Uniform Quantization for Large Language Models [2.1388885579612804]
GANQ is a layer-wise post-training non-uniform quantization framework optimized for hardware-efficient lookup table-based mpGEMM.<n>Extensive experiments demonstrate GANQ's ability to reduce the perplexity gap from the FP16 baseline compared to state-of-the-art methods for both 3-bit and 4-bit quantization.
arXiv Detail & Related papers (2025-01-22T15:29:09Z) - FAMOUS: Flexible Accelerator for the Attention Mechanism of Transformer on UltraScale+ FPGAs [0.0]
Transformer neural networks (TNNs) are being applied across a widening range of application domains, including natural language processing (NLP), machine translation, and computer vision (CV)<n>This paper proposes textitFAMOUS, a flexible hardware accelerator for dense multi-head attention computation of TNNs on field-programmable gate arrays (FPGAs)<n>It is optimized for high utilization of processing elements and on-chip memories to improve parallelism and reduce latency.
arXiv Detail & Related papers (2024-09-21T05:25:46Z) - MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models [58.3342517278868]
This paper describes the design of Mixed-precision AutoRegressive LINear kernels.
It shows that batchsizes up to 16-32 can be supported with close to maximum ($4times$) quantization speedup.
MarLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling and pipelining.
arXiv Detail & Related papers (2024-08-21T16:10:41Z) - Fast Matrix Multiplications for Lookup Table-Quantized LLMs [58.11584672945781]
FLUTE is a flexible lookup table engine for LUT-quantized LLMs.<n>At batch sizes 32 and quantization group size of 128, the FLUTE kernel can be 2-4x faster than existing GEMM kernels.
arXiv Detail & Related papers (2024-07-15T17:55:42Z) - QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language
Models [57.04178959678024]
We show that the majority of inference computations for large generative models can be performed with both weights and activations being cast to 4 bits.
We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4-bit.
We provide GPU kernels matching the QUIK format with highly-efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.4x.
arXiv Detail & Related papers (2023-10-13T17:15:05Z) - Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM [6.85331857224501]
Large Language Models (LLMs) pose significant hardware challenges related to memory requirements and computational ability.
There are two mainstream quantization schemes for LLMs: coarse-grained ($textite.g.,$ channel-wise) quantization and fine-grained ($textite.g.,$ group-wise) quantization.
We introduce Dual Grained Quantization (DGQ), a novel A8W4 quantization for LLM that maintains superior performance while ensuring fast inference speed.
arXiv Detail & Related papers (2023-10-07T14:50:28Z) - OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models [57.27101446992148]
Large language models (LLMs) have revolutionized natural language processing tasks.
Recent post-training quantization (PTQ) methods are effective in reducing memory footprint and improving the computational efficiency of LLM.
We introduce an Omnidirectionally calibrated Quantization technique for LLMs, which achieves good performance in diverse quantization settings.
arXiv Detail & Related papers (2023-08-25T02:28:35Z) - SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
Main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.