F-BFQ: Flexible Block Floating-Point Quantization Accelerator for LLMs
- URL: http://arxiv.org/abs/2510.13401v1
- Date: Wed, 15 Oct 2025 10:56:37 GMT
- Title: F-BFQ: Flexible Block Floating-Point Quantization Accelerator for LLMs
- Authors: Jude Haris, José Cano
- Abstract summary: Large Language Models (LLMs) have become increasingly prominent for daily tasks. LLMs can be run on resource-constrained edge devices. LLMs are typically quantized with mixed BFP quantization across the model layers.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large Language Models (LLMs) have become increasingly prominent for daily tasks, from improving sound-to-text translation to generating additional frames for the latest video games. With the help of LLM inference frameworks, such as llama.cpp, which support optimizations such as KV-caching and quantization, it is now easier than ever to deploy LLMs on edge devices. Quantization is fundamental to enable LLMs on resource-constrained edge devices, and llama.cpp utilizes block floating point (BFP) quantization to drastically reduce the bit width of weights and input tensors, the memory footprint, and the computational power required to run LLMs. LLMs are typically quantized with mixed BFP quantization across the model layers to reduce the loss of model accuracy due to quantization. Therefore, to efficiently accelerate across the layers of BFP-quantized LLMs, specialized accelerators need to support different BFP variants without reconfiguration. To address this issue, we propose a Flexible Block Floating-Point Quantization (F-BFQ) accelerator, which can dynamically switch between two BFP quantization variants and perform matrix multiplication (MatMul) operations. Our initial F-BFQ accelerator design, deployed on the AMD Kria board, reduces inference time by 1.4x on average over the Arm NEON-based CPU execution across three BFP quantized LLMs while achieving 5.2 tokens per second (~3.9 words per second).
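The abstract does not spell out the block formats involved, so the following is a minimal NumPy sketch of the general BFP idea: each block of values shares one exponent, and each value keeps only a small signed mantissa. The mantissa width is a parameter, which loosely mimics an F-BFQ-style accelerator switching between two BFP variants. Block size, the 4-/8-bit variant pair, and all function names are illustrative assumptions, not the paper's definitions (llama.cpp's actual formats, e.g. Q4_0, store a per-block floating-point scale rather than a pure shared exponent).

```python
import numpy as np

def bfp_quantize(x, block_size=32, mantissa_bits=4):
    """Illustrative BFP: each block of `block_size` values shares one
    exponent; each value keeps a signed `mantissa_bits`-bit mantissa."""
    pad = (-len(x)) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)
    # Shared exponent: large enough for the biggest magnitude in the block.
    max_abs = np.abs(blocks).max(axis=1, keepdims=True)
    exp = np.ceil(np.log2(np.maximum(max_abs, 1e-38))).astype(np.int32)
    # Scale values so they fit in a signed mantissa_bits integer, then round.
    scale = 2.0 ** (exp - (mantissa_bits - 1))
    qmax = 2 ** (mantissa_bits - 1) - 1
    mant = np.clip(np.round(blocks / scale), -qmax - 1, qmax).astype(np.int8)
    return mant, exp

def bfp_dequantize(mant, exp, mantissa_bits=4):
    return (mant * 2.0 ** (exp - (mantissa_bits - 1))).ravel()

# Two hypothetical BFP variants a flexible accelerator might switch between.
x = np.random.randn(64).astype(np.float32)
for m in (4, 8):
    mant, exp = bfp_quantize(x, mantissa_bits=m)
    err = np.abs(bfp_dequantize(mant, exp, m)[:len(x)] - x).max()
    print(f"BFP with {m}-bit mantissas: max abs error {err:.4f}")
```

Running the sketch shows the expected trade-off: the wider-mantissa variant reconstructs the tensor with noticeably smaller error, which is why mixed BFP variants across layers help preserve model accuracy.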
Related papers
- P3-LLM: An Integrated NPU-PIM Accelerator for LLM Inference Using Hybrid Numerical Formats [10.43214279354138]
We introduce P3-LLM, a novel integrated accelerator for LLM inference using hybrid numerical formats. P3-LLM achieves state-of-the-art accuracy in terms of both KV-cache quantization and weight-activation quantization.
arXiv Detail & Related papers (2025-11-10T08:29:34Z) - AMS-QUANT: Adaptive Mantissa Sharing for Floating-point Quantization [7.413057271242686]
Quantization, particularly floating-point quantization, is known to be capable of speeding up large language model (LLM) inference. We propose AMS-Quant, which explores floating-point quantization from integer bit-widths to non-integer bit-widths. We show that AMS-Quant can quantize the model to FP-5.33-e2m3 and FP4.25-e2m2, and significantly speed up decoding over FP16 inference.
arXiv Detail & Related papers (2025-10-16T15:37:23Z) - Optimizing Large Language Model Training Using FP4 Quantization [73.55459961002371]
Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce costs. This work introduces the first FP4 training framework for large language models (LLMs).
arXiv Detail & Related papers (2025-01-28T18:04:50Z) - Progressive Mixed-Precision Decoding for Efficient LLM Inference [49.05448842542558]
We introduce Progressive Mixed-Precision Decoding (PMPD) to address the memory-boundedness of decoding. PMPD achieves a 1.4-12.2x speedup in matrix-vector multiplications over fp16 models. Our approach delivers a throughput gain of 3.8-8.0x over fp16 models and up to 1.54x over uniform quantization approaches.
arXiv Detail & Related papers (2024-10-17T11:46:33Z) - OPAL: Outlier-Preserved Microscaling Quantization Accelerator for Generative Large Language Models [0.562479170374811]
We present a hardware-software co-design method that results in an energy-efficient LLM accelerator, named OPAL, for generation tasks.
OPAL uses a log2-based approximation of the softmax operation that requires only shifts and subtractions to maximize power efficiency; a base-2 sketch of this reformulation appears after this list.
As a result, we are able to improve energy efficiency by 1.6-2.2x and reduce the area by 2.4-3.1x with negligible accuracy loss.
arXiv Detail & Related papers (2024-09-06T02:33:20Z) - Designing Efficient LLM Accelerators for Edge Devices [1.4128048241287314]
Large Language Models (LLMs) can be deployed on resource-constrained edge devices to reduce reliance on network connections and provide more privacy.
To address this issue, designing new and efficient edge accelerators for LLM inference is crucial.
We propose SECDA-LLM, which utilizes the SECDA methodology to streamline the process of designing, integrating, and deploying efficient FPGA-based LLM accelerators.
arXiv Detail & Related papers (2024-08-01T11:06:05Z) - Fast Matrix Multiplications for Lookup Table-Quantized LLMs [58.11584672945781]
FLUTE is a flexible lookup table engine for LUT-quantized LLMs. At a batch size of 32 and a quantization group size of 128, the FLUTE kernel can be 2-4x faster than existing GEMM kernels.
arXiv Detail & Related papers (2024-07-15T17:55:42Z) - SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [63.118592279833656]
Post-training quantization (PTQ) is an effective technique for compressing large language models (LLMs). We propose SliM-LLM, a salience-driven mixed-precision quantization framework that allocates bit-widths group-wise. Experiments show that SliM-LLM achieves superior performance across various LLMs at low bit-widths.
arXiv Detail & Related papers (2024-05-23T16:21:48Z) - DB-LLM: Accurate Dual-Binarization for Efficient LLMs [83.70686728471547]
Large language models (LLMs) have significantly advanced the field of natural language processing.
Existing ultra-low-bit quantization always causes severe accuracy drops.
We propose a novel Dual-Binarization method for LLMs, namely DB-LLM.
arXiv Detail & Related papers (2024-02-19T09:04:30Z) - BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
It achieves for the first time high-accuracy inference (e.g., 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z)
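The OPAL entry above mentions a log2-based softmax that needs only shifts and subtractions; since the paper's exact scheme is not reproduced here, the sketch below only illustrates the underlying base-2 reformulation in floating point. Replacing e^x with 2^t turns the normalizing division into a subtraction in the exponent, which integer hardware can then realize with shifts; the function name and this particular decomposition are assumptions for illustration.

```python
import numpy as np

def log2_softmax_sketch(x):
    """Base-2 softmax: softmax(x) = 2^(t_i - log2(sum_j 2^(t_j)))
    with t = x * log2(e). In hardware, 2^t for (near-)integer t is a
    shift, and the exponent subtraction replaces the division; this
    float version just demonstrates the math is equivalent."""
    t = x * np.log2(np.e)                      # switch base e -> base 2
    t = t - t.max()                            # standard max-subtract for stability
    denom_log2 = np.log2(np.sum(np.exp2(t)))   # log2 of the normalizer
    return np.exp2(t - denom_log2)             # division becomes a subtraction

x = np.array([1.0, 2.0, 3.0])
print(log2_softmax_sketch(x))        # ~[0.090, 0.245, 0.665]
print(np.exp(x) / np.exp(x).sum())   # reference softmax, identical values
```

In float arithmetic the two printed vectors match exactly; the energy savings OPAL reports come from approximating the 2^t and log2 steps with cheap shift/subtract logic instead of exact exponentials.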