FlexLLM: Composable HLS Library for Flexible Hybrid LLM Accelerator Design
- URL: http://arxiv.org/abs/2601.15710v1
- Date: Thu, 22 Jan 2026 07:31:51 GMT
- Title: FlexLLM: Composable HLS Library for Flexible Hybrid LLM Accelerator Design
- Authors: Jiahao Zhang, Zifan He, Nicholas Fraser, Michaela Blott, Yizhou Sun, Jason Cong
- Abstract summary: We present FlexLLM, a composable High-Level Synthesis library for rapid development of domain-specific LLM accelerators. We build a complete inference system for the Llama-3.2 1B model in under two months with only 1K lines of code. On the AMD U280 FPGA at 16nm, the accelerator achieves 1.29$\times$ end-to-end speedup, 1.64$\times$ higher decode throughput, and 3.14$\times$ better energy efficiency than an NVIDIA A100 GPU running BF16 inference.
- Score: 40.39807270881305
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present FlexLLM, a composable High-Level Synthesis (HLS) library for rapid development of domain-specific LLM accelerators. FlexLLM exposes key architectural degrees of freedom for stage-customized inference, enabling hybrid designs that tailor temporal reuse and spatial dataflow differently for prefill and decode, and provides a comprehensive quantization suite to support accurate low-bit deployment. Using FlexLLM, we build a complete inference system for the Llama-3.2 1B model in under two months with only 1K lines of code. The system includes: (1) a stage-customized accelerator with hardware-efficient quantization (12.68 WikiText-2 PPL) surpassing SpinQuant baseline, and (2) a Hierarchical Memory Transformer (HMT) plug-in for efficient long-context processing. On the AMD U280 FPGA at 16nm, the accelerator achieves 1.29$\times$ end-to-end speedup, 1.64$\times$ higher decode throughput, and 3.14$\times$ better energy efficiency than an NVIDIA A100 GPU (7nm) running BF16 inference; projected results on the V80 FPGA at 7nm reach 4.71$\times$, 6.55$\times$, and 4.13$\times$, respectively. In long-context scenarios, integrating the HMT plug-in reduces prefill latency by 23.23$\times$ and extends the context window by 64$\times$, delivering 1.10$\times$/4.86$\times$ lower end-to-end latency and 5.21$\times$/6.27$\times$ higher energy efficiency on the U280/V80 compared to the A100 baseline. FlexLLM thus bridges algorithmic innovation in LLM inference and high-performance accelerators with minimal manual effort.
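The abstract does not include code, so the following is only a minimal HLS-style C++ sketch of the stage-customization idea it describes: a single composable quantized matrix-vector kernel whose template parameter selects how many output rows are processed spatially in parallel, so a prefill instance can unroll wide while a decode instance favors a narrow, temporally reused datapath. Every name here (`qmatvec`, the INT4 packing, the per-row scale) is an assumption for illustration, not FlexLLM's actual interface.

```cpp
// Hypothetical sketch (not the FlexLLM API): one composable kernel template,
// instantiated differently for prefill (wide spatial unroll) and decode
// (narrow, temporally reused datapath).
#include <cstdint>
#include <cstdio>
#include <vector>

// Weights are assumed packed as two signed 4-bit values per byte, row-major,
// with one float scale per output row (a common hardware-friendly scheme).
static inline int unpack_lo(uint8_t b) { return (int)(int8_t)(b << 4) >> 4; }
static inline int unpack_hi(uint8_t b) { return (int)(int8_t)b >> 4; }

// LANES is the spatial unroll factor: how many output rows are computed in
// parallel. In an HLS flow this would drive unroll / array-partition pragmas;
// in plain C++ it only changes the loop tiling, which keeps the sketch runnable.
template <int LANES>
void qmatvec(const uint8_t* w_packed, const float* scale,
             const int8_t* x, float* y, int rows, int cols) {
    for (int r0 = 0; r0 < rows; r0 += LANES) {
        for (int l = 0; l < LANES && r0 + l < rows; ++l) {  // spatial lanes
            const int r = r0 + l;
            const uint8_t* wr = w_packed + (size_t)r * (cols / 2);
            int32_t acc = 0;
            for (int c = 0; c < cols; c += 2) {              // temporal loop
                acc += unpack_lo(wr[c / 2]) * x[c];
                acc += unpack_hi(wr[c / 2]) * x[c + 1];
            }
            y[r] = scale[r] * (float)acc;
        }
    }
}

int main() {
    const int rows = 8, cols = 16;
    std::vector<uint8_t> w(rows * cols / 2, 0x21);  // each byte packs the pair (1, 2)
    std::vector<float> scale(rows, 0.5f);
    std::vector<int8_t> x(cols, 1);
    std::vector<float> y(rows, 0.0f);

    // Decode-style instance: a single lane, deep reuse of one activation vector.
    qmatvec<1>(w.data(), scale.data(), x.data(), y.data(), rows, cols);
    // Prefill-style instance: wider spatial unroll over output rows.
    qmatvec<8>(w.data(), scale.data(), x.data(), y.data(), rows, cols);

    std::printf("y[0] = %f\n", y[0]);  // (1+2) per byte * 8 bytes * 0.5 = 12
    return 0;
}
```

In a real HLS flow the `LANES` parameter would additionally select pragmas and buffer partitioning; the host-compilable version above only illustrates how one composable kernel can be specialized per stage.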
Related papers
- SlideSparse: Fast and Flexible (2N-2):2N Structured Sparsity [86.71343842875878]
NVIDIA's 2:4 Sparse Cores deliver 2$\times$ throughput but demand strict 50% pruning. Milder $(2N-2):2N$ patterns preserve accuracy yet receive no hardware support. We present SlideSparse, the first system to unlock Sparse Core acceleration.
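For concreteness (this is an illustration only, not SlideSparse's code): a 2:4 pattern keeps 2 nonzeros in every group of 4 weights, while a $(2N-2):2N$ pattern keeps $2N-2$ nonzeros in every group of $2N$, i.e. just two zeros per group (6:8 for $N=4$), which is the milder constraint the summary refers to. The small C++ check below encodes that group constraint.

```cpp
// Illustration of the (2N-2):2N structured-sparsity constraint (not SlideSparse code):
// every contiguous group of 2N weights must contain at least 2 zeros,
// i.e. at most 2N-2 nonzeros. N = 2 recovers NVIDIA's 2:4 pattern.
#include <cstdio>
#include <vector>

bool satisfies_pattern(const std::vector<float>& w, int N) {
    const int group = 2 * N;
    for (size_t i = 0; i + group <= w.size(); i += group) {
        int nonzeros = 0;
        for (int j = 0; j < group; ++j)
            if (w[i + j] != 0.0f) ++nonzeros;
        if (nonzeros > group - 2) return false;  // more than 2N-2 nonzeros
    }
    return true;
}

int main() {
    // 6:8 pattern (N = 4): two zeros in each group of eight weights.
    std::vector<float> ok  = {1, 2, 0, 3, 4, 0, 5, 6,   7, 0, 8, 9, 0, 1, 2, 3};
    std::vector<float> bad = {1, 2, 3, 4, 5, 6, 7, 0,   1, 2, 3, 4, 5, 6, 7, 8};
    std::printf("ok:  %d\n", satisfies_pattern(ok, 4));   // 1
    std::printf("bad: %d\n", satisfies_pattern(bad, 4));  // 0 (only one zero in the first group)
    return 0;
}
```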
arXiv Detail & Related papers (2026-03-05T14:49:16Z) - FPGA Co-Design for Efficient N:M Sparse and Quantized Model Inference [0.8749675983608171]
Large language models (LLMs) have demonstrated remarkable performance across a wide range of language processing tasks. This work introduces an automation framework that leverages weight pruning and low-bit quantization. We present a hardware-software co-design method that generates accelerators on the Field-Programmable Gate Array (FPGA) platform.
arXiv Detail & Related papers (2025-12-31T08:27:40Z) - dInfer: An Efficient Inference Framework for Diffusion Language Models [54.80918957287927]
Diffusion-based large language models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs. We present dInfer, an efficient framework for dLLM inference.
arXiv Detail & Related papers (2025-10-09T16:19:42Z) - APT-LLM: Exploiting Arbitrary-Precision Tensor Core Computing for LLM Acceleration [5.075697428779204]
Large language models (LLMs) have revolutionized AI applications, yet their enormous computational demands severely limit deployment and real-time performance. This is primarily due to the limited arbitrary-precision support of GPU Tensor Cores, inefficient memory management, and inflexible kernel optimizations. We propose a comprehensive acceleration scheme for arbitrary-precision LLMs, namely APT-LLM.
arXiv Detail & Related papers (2025-08-26T14:48:29Z) - FlexQ: Efficient Post-training INT6 Quantization for LLM Serving via Algorithm-System Co-Design [13.062940916273973]
Large Language Models (LLMs) demonstrate exceptional performance but entail significant memory and computational costs. Existing INT4/INT8 quantization methods reduce these costs but often degrade accuracy or lack optimal efficiency. We propose FlexQ, a novel framework combining algorithmic innovation with system-level evaluations.
arXiv Detail & Related papers (2025-08-06T12:47:05Z) - TeLLMe: An Energy-Efficient Ternary LLM Accelerator for Prefilling and Decoding on Edge FPGAs [5.889337608109388]
TeLLMe is the first ternary LLM accelerator for low-power FPGAs. It supports both prefill and autoregressive decoding using 1.58-bit weights and 8-bit activations. Under a 7W power budget, TeLLMe delivers up to 9 tokens/s throughput over 1,024-token contexts.
arXiv Detail & Related papers (2025-04-22T21:00:58Z) - MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models [58.3342517278868]
This paper describes the design of Mixed-precision AutoRegressive LINear kernels.
It shows that batch sizes of up to 16-32 can be supported with close to the maximum ($4\times$) quantization speedup.
MARLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling, and pipelining.
arXiv Detail & Related papers (2024-08-21T16:10:41Z) - Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference [23.633481089469836]
Auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance. We propose a novel parallel prompt decoding method that requires only $0.0002$% trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours. Our approach demonstrates up to 2.49$\times$ speedup and maintains a minimal memory overhead of just $0.0004$%.
arXiv Detail & Related papers (2024-05-28T22:19:30Z) - FlexLLM: Token-Level Co-Serving of LLM Inference and Finetuning with SLO Guarantees [19.58773369944074]
Finetuning large language models (LLMs) is essential for task adaptation, yet today's serving stacks isolate inference and finetuning on separate GPU clusters. We introduce FlexLLM, the first system to co-serve LLM inference and PEFT-based finetuning on shared GPUs by fusing them at the token level. At runtime, a novel token-level finetuning mechanism paired with a hybrid token scheduler dynamically interleaves inference and training tokens within each co-serving iteration.
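The summary only names the mechanism, so the sketch below is a hypothetical rendering of the interleaving policy it suggests: each co-serving iteration admits latency-critical inference tokens first and backfills the remaining batch slots with finetuning tokens. It is not FlexLLM's actual scheduler, and the queue/capacity abstractions are assumptions.

```cpp
// Hypothetical token-level co-serving policy (not FlexLLM's actual scheduler):
// fill each iteration's batch with pending inference tokens first, then
// backfill the remaining slots with finetuning tokens.
#include <cstdio>
#include <deque>

struct Iteration { int inference_tokens; int finetune_tokens; };

Iteration schedule(std::deque<int>& infer_queue, std::deque<int>& tune_queue,
                   int batch_capacity) {
    Iteration it{0, 0};
    // 1) Inference tokens are latency-critical: admit them first.
    while (!infer_queue.empty() && it.inference_tokens < batch_capacity) {
        ++it.inference_tokens;
        infer_queue.pop_front();
    }
    // 2) Backfill leftover slots with finetuning tokens to keep utilization high.
    const int free_slots = batch_capacity - it.inference_tokens;
    while (!tune_queue.empty() && it.finetune_tokens < free_slots) {
        ++it.finetune_tokens;
        tune_queue.pop_front();
    }
    return it;
}

int main() {
    std::deque<int> infer(5, 0), tune(100, 0);  // 5 pending decode tokens, many finetuning tokens
    Iteration it = schedule(infer, tune, 16);   // batch of 16 token slots per iteration
    std::printf("inference=%d finetune=%d\n", it.inference_tokens, it.finetune_tokens);  // 5 and 11
    return 0;
}
```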
arXiv Detail & Related papers (2024-02-29T01:33:08Z) - HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator.
We demonstrate that on a one-billion-parameter model, HiRE applied to both the softmax and feedforward layers achieves almost matching pretraining and downstream accuracy, and speeds up inference by $1.47\times$ on a single TPUv5e device.
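Based only on the component description above, a toy version of the two-stage idea looks like this: a cheap low-precision proxy score nominates an oversampled candidate set with high recall, and exact scoring is restricted to those candidates. The 8-bit proxy and the oversampling factor are assumptions for illustration; this is not HiRE's implementation.

```cpp
// Toy two-stage approximate top-k (not HiRE's implementation):
// stage 1 ranks items by a cheap low-precision proxy score and keeps an
// oversampled candidate set; stage 2 computes exact scores only for candidates.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

std::vector<int> approx_topk(const std::vector<float>& exact_scores, int k,
                             int oversample /* candidates = oversample * k */) {
    const int n = (int)exact_scores.size();

    // Stage 1: cheap proxy score — here, values crushed to 8 bits (assumed compression).
    std::vector<int> idx(n);
    for (int i = 0; i < n; ++i) idx[i] = i;
    auto proxy = [&](int i) { return (int8_t)(exact_scores[i] * 4.0f); };
    const int c = std::min(n, oversample * k);
    std::partial_sort(idx.begin(), idx.begin() + c, idx.end(),
                      [&](int a, int b) { return proxy(a) > proxy(b); });

    // Stage 2: exact scores, restricted to the c surviving candidates.
    std::partial_sort(idx.begin(), idx.begin() + k, idx.begin() + c,
                      [&](int a, int b) { return exact_scores[a] > exact_scores[b]; });
    idx.resize(k);
    return idx;
}

int main() {
    std::vector<float> scores = {0.1f, 2.3f, 1.9f, 0.4f, 2.2f, 0.0f, 1.1f, 2.25f};
    for (int i : approx_topk(scores, 2, 2)) std::printf("%d ", i);  // prints: 1 7
    std::printf("\n");
    return 0;
}
```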
arXiv Detail & Related papers (2024-02-14T18:04:36Z) - FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU [89.2451963569343]
FlexGen is a generation engine for running large language model (LLM) inference on a single commodity GPU.
When running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems.
On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours.
arXiv Detail & Related papers (2023-03-13T05:19:28Z)