FlexLLM: Composable HLS Library for Flexible Hybrid LLM Accelerator Design
- URL: http://arxiv.org/abs/2601.15710v1
- Date: Thu, 22 Jan 2026 07:31:51 GMT
- Title: FlexLLM: Composable HLS Library for Flexible Hybrid LLM Accelerator Design
- Authors: Jiahao Zhang, Zifan He, Nicholas Fraser, Michaela Blott, Yizhou Sun, Jason Cong
- Abstract summary: We present FlexLLM, a composable High-Level Synthesis library for rapid development of domain-specific LLM accelerators. We build a complete inference system for the Llama-3.2 1B model in under two months with only 1K lines of code. On the AMD U280 FPGA at 16nm, the accelerator achieves 1.29$\times$ end-to-end speedup, 1.64$\times$ higher decode throughput, and 3.14$\times$ better energy efficiency than an NVIDIA A100 GPU running BF16 inference.
- Score: 40.39807270881305
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present FlexLLM, a composable High-Level Synthesis (HLS) library for rapid development of domain-specific LLM accelerators. FlexLLM exposes key architectural degrees of freedom for stage-customized inference, enabling hybrid designs that tailor temporal reuse and spatial dataflow differently for prefill and decode, and provides a comprehensive quantization suite to support accurate low-bit deployment. Using FlexLLM, we build a complete inference system for the Llama-3.2 1B model in under two months with only 1K lines of code. The system includes: (1) a stage-customized accelerator with hardware-efficient quantization (12.68 WikiText-2 PPL) surpassing SpinQuant baseline, and (2) a Hierarchical Memory Transformer (HMT) plug-in for efficient long-context processing. On the AMD U280 FPGA at 16nm, the accelerator achieves 1.29$\times$ end-to-end speedup, 1.64$\times$ higher decode throughput, and 3.14$\times$ better energy efficiency than an NVIDIA A100 GPU (7nm) running BF16 inference; projected results on the V80 FPGA at 7nm reach 4.71$\times$, 6.55$\times$, and 4.13$\times$, respectively. In long-context scenarios, integrating the HMT plug-in reduces prefill latency by 23.23$\times$ and extends the context window by 64$\times$, delivering 1.10$\times$/4.86$\times$ lower end-to-end latency and 5.21$\times$/6.27$\times$ higher energy efficiency on the U280/V80 compared to the A100 baseline. FlexLLM thus bridges algorithmic innovation in LLM inference and high-performance accelerators with minimal manual effort.
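The abstract does not include code, so the following is only a minimal HLS-style C++ sketch of the stage-customization idea it describes: a single composable quantized matrix-vector kernel whose template parameter selects how many output rows are processed spatially in parallel, so a prefill instance can unroll wide while a decode instance favors a narrow, temporally reused datapath. Every name here (`qmatvec`, the INT4 packing, the per-row scale) is an assumption for illustration, not FlexLLM's actual interface.

```cpp
// Hypothetical sketch (not the FlexLLM API): one composable kernel template,
// instantiated differently for prefill (wide spatial unroll) and decode
// (narrow, temporally reused datapath).
#include <cstdint>
#include <cstdio>
#include <vector>

// Weights are assumed packed as two signed 4-bit values per byte, row-major,
// with one float scale per output row (a common hardware-friendly scheme).
static inline int unpack_lo(uint8_t b) { return (int)(int8_t)(b << 4) >> 4; }
static inline int unpack_hi(uint8_t b) { return (int)(int8_t)b >> 4; }

// LANES is the spatial unroll factor: how many output rows are computed in
// parallel. In an HLS flow this would drive unroll / array-partition pragmas;
// in plain C++ it only changes the loop tiling, which keeps the sketch runnable.
template <int LANES>
void qmatvec(const uint8_t* w_packed, const float* scale,
             const int8_t* x, float* y, int rows, int cols) {
    for (int r0 = 0; r0 < rows; r0 += LANES) {
        for (int l = 0; l < LANES && r0 + l < rows; ++l) {  // spatial lanes
            const int r = r0 + l;
            const uint8_t* wr = w_packed + (size_t)r * (cols / 2);
            int32_t acc = 0;
            for (int c = 0; c < cols; c += 2) {              // temporal loop
                acc += unpack_lo(wr[c / 2]) * x[c];
                acc += unpack_hi(wr[c / 2]) * x[c + 1];
            }
            y[r] = scale[r] * (float)acc;
        }
    }
}

int main() {
    const int rows = 8, cols = 16;
    std::vector<uint8_t> w(rows * cols / 2, 0x21);  // each byte packs the pair (1, 2)
    std::vector<float> scale(rows, 0.5f);
    std::vector<int8_t> x(cols, 1);
    std::vector<float> y(rows, 0.0f);

    // Decode-style instance: a single lane, deep reuse of one activation vector.
    qmatvec<1>(w.data(), scale.data(), x.data(), y.data(), rows, cols);
    // Prefill-style instance: wider spatial unroll over output rows.
    qmatvec<8>(w.data(), scale.data(), x.data(), y.data(), rows, cols);

    std::printf("y[0] = %f\n", y[0]);  // (1+2) per byte * 8 bytes * 0.5 = 12
    return 0;
}
```

In a real HLS flow the `LANES` parameter would additionally select pragmas and buffer partitioning; the host-compilable version above only illustrates how one composable kernel can be specialized per stage.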
Related papers
- SlideSparse: Fast and Flexible (2N-2):2N Structured Sparsity [86.71343842875878]
NVIDIA's 2:4 Sparse Cores deliver 2$\times$ throughput but demand strict 50% pruning. Milder $(2N-2):2N$ patterns preserve accuracy yet receive no hardware support. We present SlideSparse, the first system to unlock Sparse Core acceleration.
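For concreteness (this is an illustration only, not SlideSparse's code): a 2:4 pattern keeps 2 nonzeros in every group of 4 weights, while a $(2N-2):2N$ pattern keeps $2N-2$ nonzeros in every group of $2N$, i.e. just two zeros per group (6:8 for $N=4$), which is the milder constraint the summary refers to. The small C++ check below encodes that group constraint.

```cpp
// Illustration of the (2N-2):2N structured-sparsity constraint (not SlideSparse code):
// every contiguous group of 2N weights must contain at least 2 zeros,
// i.e. at most 2N-2 nonzeros. N = 2 recovers NVIDIA's 2:4 pattern.
#include <cstdio>
#include <vector>

bool satisfies_pattern(const std::vector<float>& w, int N) {
    const int group = 2 * N;
    for (size_t i = 0; i + group <= w.size(); i += group) {
        int nonzeros = 0;
        for (int j = 0; j < group; ++j)
            if (w[i + j] != 0.0f) ++nonzeros;
        if (nonzeros > group - 2) return false;  // more than 2N-2 nonzeros
    }
    return true;
}

int main() {
    // 6:8 pattern (N = 4): two zeros in each group of eight weights.
    std::vector<float> ok  = {1, 2, 0, 3, 4, 0, 5, 6,   7, 0, 8, 9, 0, 1, 2, 3};
    std::vector<float> bad = {1, 2, 3, 4, 5, 6, 7, 0,   1, 2, 3, 4, 5, 6, 7, 8};
    std::printf("ok:  %d\n", satisfies_pattern(ok, 4));   // 1
    std::printf("bad: %d\n", satisfies_pattern(bad, 4));  // 0 (only one zero in the first group)
    return 0;
}
```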
arXiv Detail & Related papers (2026-03-05T14:49:16Z) - FPGA Co-Design for Efficient N:M Sparse and Quantized Model Inference [0.8749675983608171]
Large language models (LLMs) have demonstrated remarkable performance across a wide range of language processing tasks. This work introduces an automation framework that leverages weight pruning and low-bit quantization. We present a hardware-software co-design method that generates accelerators on the Field-Programmable Gate Array (FPGA) platform.
arXiv Detail & Related papers (2025-12-31T08:27:40Z) - dInfer: An Efficient Inference Framework for Diffusion Language Models [54.80918957287927]
Diffusion-based large language models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs. We present dInfer, an efficient framework for dLLM inference.
arXiv Detail & Related papers (2025-10-09T16:19:42Z) - APT-LLM: Exploiting Arbitrary-Precision Tensor Core Computing for LLM Acceleration [5.075697428779204]
Large language models (LLMs) have revolutionized AI applications, yet their enormous computational demands severely limit deployment and real-time performance. This is primarily due to the limited arbitrary-precision support of GPU Tensor Cores, inefficient memory management, and inflexible kernel optimizations. We propose a comprehensive acceleration scheme for arbitrary-precision LLMs, namely APT-LLM.
arXiv Detail & Related papers (2025-08-26T14:48:29Z) - FlexQ: Efficient Post-training INT6 Quantization for LLM Serving via Algorithm-System Co-Design [13.062940916273973]
Large Language Models (LLMs) demonstrate exceptional performance but entail significant memory and computational costs. Existing INT4/INT8 quantization methods reduce these costs but often degrade accuracy or lack optimal efficiency. We propose FlexQ, a novel framework combining algorithmic innovation with system-level evaluations.
arXiv Detail & Related papers (2025-08-06T12:47:05Z) - TeLLMe: An Energy-Efficient Ternary LLM Accelerator for Prefilling and Decoding on Edge FPGAs [5.889337608109388]
TeLLMe is the first ternary LLM accelerator for low-power FPGAs. It supports both prefill and autoregressive decoding using 1.58-bit weights and 8-bit activations. Under a 7W power budget, TeLLMe delivers up to 9 tokens/s throughput over 1,024-token contexts.
arXiv Detail & Related papers (2025-04-22T21:00:58Z) - MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models [58.3342517278868]
This paper describes the design of Mixed-precision AutoRegressive LINear kernels.
It shows that batch sizes of up to 16-32 can be supported with close to the maximum ($4\times$) quantization speedup.
MARLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling, and pipelining.
arXiv Detail & Related papers (2024-08-21T16:10:41Z) - Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference [23.633481089469836]
Auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance. We propose a novel parallel prompt decoding method that requires only $0.0002$% trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours. Our approach demonstrates up to 2.49$\times$ speedup and maintains a minimal memory overhead of just $0.0004$%.
arXiv Detail & Related papers (2024-05-28T22:19:30Z) - FlexLLM: Token-Level Co-Serving of LLM Inference and Finetuning with SLO Guarantees [19.58773369944074]
Finetuning large language models (LLMs) is essential for task adaptation, yet today's serving stacks isolate inference and finetuning on separate GPU clusters. We introduce FlexLLM, the first system to co-serve LLM inference and PEFT-based finetuning on shared GPUs by fusing them at the token level. At runtime, a novel token-level finetuning mechanism paired with a hybrid token scheduler dynamically interleaves inference and training tokens within each co-serving iteration.
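The summary only names the mechanism, so the sketch below is a hypothetical rendering of the interleaving policy it suggests: each co-serving iteration admits latency-critical inference tokens first and backfills the remaining batch slots with finetuning tokens. It is not FlexLLM's actual scheduler, and the queue/capacity abstractions are assumptions.

```cpp
// Hypothetical token-level co-serving policy (not FlexLLM's actual scheduler):
// fill each iteration's batch with pending inference tokens first, then
// backfill the remaining slots with finetuning tokens.
#include <cstdio>
#include <deque>

struct Iteration { int inference_tokens; int finetune_tokens; };

Iteration schedule(std::deque<int>& infer_queue, std::deque<int>& tune_queue,
                   int batch_capacity) {
    Iteration it{0, 0};
    // 1) Inference tokens are latency-critical: admit them first.
    while (!infer_queue.empty() && it.inference_tokens < batch_capacity) {
        ++it.inference_tokens;
        infer_queue.pop_front();
    }
    // 2) Backfill leftover slots with finetuning tokens to keep utilization high.
    const int free_slots = batch_capacity - it.inference_tokens;
    while (!tune_queue.empty() && it.finetune_tokens < free_slots) {
        ++it.finetune_tokens;
        tune_queue.pop_front();
    }
    return it;
}

int main() {
    std::deque<int> infer(5, 0), tune(100, 0);  // 5 pending decode tokens, many finetuning tokens
    Iteration it = schedule(infer, tune, 16);   // batch of 16 token slots per iteration
    std::printf("inference=%d finetune=%d\n", it.inference_tokens, it.finetune_tokens);  // 5 and 11
    return 0;
}
```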
arXiv Detail & Related papers (2024-02-29T01:33:08Z) - HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator.
We demonstrate that on a one-billion-parameter model, HiRE applied to both the softmax and feedforward layers achieves almost matching pretraining and downstream accuracy, and speeds up inference by $1.47\times$ on a single TPUv5e device.
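Based only on the component description above, a toy version of the two-stage idea looks like this: a cheap low-precision proxy score nominates an oversampled candidate set with high recall, and exact scoring is restricted to those candidates. The 8-bit proxy and the oversampling factor are assumptions for illustration; this is not HiRE's implementation.

```cpp
// Toy two-stage approximate top-k (not HiRE's implementation):
// stage 1 ranks items by a cheap low-precision proxy score and keeps an
// oversampled candidate set; stage 2 computes exact scores only for candidates.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

std::vector<int> approx_topk(const std::vector<float>& exact_scores, int k,
                             int oversample /* candidates = oversample * k */) {
    const int n = (int)exact_scores.size();

    // Stage 1: cheap proxy score — here, values crushed to 8 bits (assumed compression).
    std::vector<int> idx(n);
    for (int i = 0; i < n; ++i) idx[i] = i;
    auto proxy = [&](int i) { return (int8_t)(exact_scores[i] * 4.0f); };
    const int c = std::min(n, oversample * k);
    std::partial_sort(idx.begin(), idx.begin() + c, idx.end(),
                      [&](int a, int b) { return proxy(a) > proxy(b); });

    // Stage 2: exact scores, restricted to the c surviving candidates.
    std::partial_sort(idx.begin(), idx.begin() + k, idx.begin() + c,
                      [&](int a, int b) { return exact_scores[a] > exact_scores[b]; });
    idx.resize(k);
    return idx;
}

int main() {
    std::vector<float> scores = {0.1f, 2.3f, 1.9f, 0.4f, 2.2f, 0.0f, 1.1f, 2.25f};
    for (int i : approx_topk(scores, 2, 2)) std::printf("%d ", i);  // prints: 1 7
    std::printf("\n");
    return 0;
}
```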
arXiv Detail & Related papers (2024-02-14T18:04:36Z) - FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU [89.2451963569343]
FlexGen is a generation engine for running large language model (LLM) inference on a single commodity GPU.
When running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems.
On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours.
arXiv Detail & Related papers (2023-03-13T05:19:28Z)