FlexQ: Efficient Post-training INT6 Quantization for LLM Serving via Algorithm-System Co-Design
- URL: http://arxiv.org/abs/2508.04405v1
- Date: Wed, 06 Aug 2025 12:47:05 GMT
- Title: FlexQ: Efficient Post-training INT6 Quantization for LLM Serving via Algorithm-System Co-Design
- Authors: Hao Zhang, Aining Jia, Weifeng Bu, Yushu Cai, Kai Sheng, Hao Chen, Xin He,
- Abstract summary: Large Language Models (LLMs) demonstrate exceptional performance but entail significant memory and computational costs. Existing INT4/INT8 quantization reduces these costs but often degrades accuracy or falls short of optimal efficiency. We propose FlexQ, a novel framework combining algorithmic innovation with system-level optimizations.
- Score: 18.37843481770631
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) demonstrate exceptional performance but entail significant memory and computational costs, restricting their practical deployment. While existing INT4/INT8 quantization methods reduce these costs, they often degrade accuracy or lack optimal efficiency. INT6 quantization offers a superior trade-off between model accuracy and inference efficiency, but lacks hardware support in modern GPUs, forcing emulation via higher-precision arithmetic units that limit acceleration. In this paper, we propose FlexQ, a novel post-training INT6 quantization framework combining algorithmic innovation with system-level optimizations. FlexQ employs uniform 6-bit weight quantization across all layers, with adaptive retention of 8-bit activations in layers identified through layer-wise sensitivity analysis. To maximize hardware efficiency, we develop a specialized high-performance GPU kernel supporting matrix multiplication for W6A6 and W6A8 representations via Binary Tensor Core (BTC) equivalents, effectively bypassing the lack of native INT6 tensor cores. Evaluations on LLaMA models show FlexQ maintains near-FP16 accuracy, with perplexity increases of no more than 0.05. The proposed kernel achieves an average 1.39$\times$ speedup over ABQ-LLM on LLaMA-2-70B linear layers. End-to-end, FlexQ delivers 1.33$\times$ inference acceleration and 1.21$\times$ memory savings over SmoothQuant. Code is released at https://github.com/FlyFoxPlayer/FlexQ.
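The quantization recipe in the abstract can be illustrated with a short fake-quantization sketch. This is not the authors' code: the symmetric per-channel quantizer, the MSE-based sensitivity score, and the threshold below are illustrative stand-ins for the layer-wise sensitivity analysis the paper describes.

```python
import torch

def quantize_sym(x: torch.Tensor, n_bits: int, per_channel: bool = False) -> torch.Tensor:
    """Symmetric uniform fake-quantization to n_bits (returns the dequantized tensor)."""
    qmax = 2 ** (n_bits - 1) - 1
    if per_channel:  # one scale per output channel (weight row)
        scale = x.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    else:            # one scale for the whole tensor (activations)
        scale = x.abs().max().clamp(min=1e-8) / qmax
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

def w6_sensitivity(w: torch.Tensor, a: torch.Tensor) -> float:
    """Output MSE of a W6A6 linear layer against the FP reference; layers scoring
    high are the ones where 8-bit activations (W6A8) would be retained."""
    ref = a @ w.t()
    err = quantize_sym(a, 6) @ quantize_sym(w, 6, per_channel=True).t() - ref
    return err.pow(2).mean().item()

# Per-layer decision (the threshold is illustrative, not taken from the paper).
w, a = torch.randn(4096, 4096) * 0.02, torch.randn(8, 4096)
act_bits = 8 if w6_sensitivity(w, a) > 1e-4 else 6
y = quantize_sym(a, act_bits) @ quantize_sym(w, 6, per_channel=True).t()
```

On the kernel side, the Binary Tensor Core route rests on the identity that an integer matmul decomposes into a weighted sum of 1-bit matmuls over bit planes. The sketch below only checks that identity in plain PyTorch, with `matmul` standing in for the hardware's 1-bit primitive; it says nothing about the actual CUDA implementation.

```python
def bitplane_matmul(a_codes: torch.Tensor, w_codes: torch.Tensor,
                    a_bits: int = 8, w_bits: int = 6) -> torch.Tensor:
    """W6A8 matmul as a sum of 1-bit matmuls, assuming unsigned (zero-point) codes."""
    acc = torch.zeros(a_codes.shape[0], w_codes.shape[0], dtype=torch.float64)
    for i in range(a_bits):
        a_plane = ((a_codes >> i) & 1).to(torch.float64)      # bit plane i of activations
        for j in range(w_bits):
            w_plane = ((w_codes >> j) & 1).to(torch.float64)  # bit plane j of weights
            acc += (a_plane @ w_plane.t()) * float(1 << (i + j))
    return acc.to(torch.int64)

a_codes = torch.randint(0, 256, (4, 64))   # unsigned 8-bit activation codes
w_codes = torch.randint(0, 64, (16, 64))   # unsigned 6-bit weight codes
ref = (a_codes.unsqueeze(1) * w_codes.unsqueeze(0)).sum(dim=-1)  # exact integer reference
assert torch.equal(bitplane_matmul(a_codes, w_codes), ref)
```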
Related papers
- BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity [66.94629945519125]
We introduce a novel MoE architecture, BlockFFN, as well as its efficient training and deployment techniques. Specifically, we use a router integrating ReLU activation and RMSNorm for differentiable and flexible routing. Next, to promote both token-level sparsity (TLS) and chunk-level sparsity (CLS), CLS-aware training objectives are designed, making BlockFFN more acceleration-friendly.
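Purely as an illustration of the "ReLU + RMSNorm" router mentioned in this summary (the exact arrangement in BlockFFN may differ; the module name, dimensions, and ordering of the two operations are assumptions), a minimal sketch could look like this:

```python
import torch
import torch.nn as nn

class ReLURMSNormRouter(nn.Module):
    """Router producing non-negative, naturally sparse expert weights:
    project to expert scores, RMS-normalize them, then apply ReLU so
    unselected experts get exactly zero weight. Details are illustrative."""
    def __init__(self, d_model: int, n_experts: int, eps: float = 1e-6):
        super().__init__()
        self.proj = nn.Linear(d_model, n_experts, bias=False)
        self.gain = nn.Parameter(torch.ones(n_experts))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = self.proj(x)                                             # raw expert scores
        rms = s.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return torch.relu(s * rms * self.gain)                       # sparse routing weights

router = ReLURMSNormRouter(d_model=1024, n_experts=32)
weights = router(torch.randn(2, 16, 1024))   # (batch, seq, n_experts); zeros mark unrouted experts
```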
arXiv Detail & Related papers (2025-07-11T17:28:56Z)
- FireQ: Fast INT4-FP8 Kernel and RoPE-aware Quantization for LLM Inference Acceleration [1.6127639408026697]
FireQ is a co-designed PTQ framework and an INT4-FP8 matrix multiplication kernel. FireQ quantizes linear-layer weights and key-values to INT4, and activations and queries to FP8. Three-stage pipelining for the prefill phase reduces time-to-first-token.
arXiv Detail & Related papers (2025-05-27T07:58:35Z)
- Towards Accurate and Efficient Sub-8-Bit Integer Training [24.853958178296587]
Quantization enables low-bitwidth formats in neural network training.
Recent methods have developed new data formats and additional pre-processing operations on quantizers.
It remains quite challenging to achieve high accuracy and efficiency simultaneously.
arXiv Detail & Related papers (2024-11-17T03:32:36Z)
- ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models [9.444063879246242]
We introduce a novel arbitrary-bit quantization algorithm and inference framework, ABQ-LLM. It achieves superior performance across various quantization settings and enables efficient arbitrary-precision quantized inference on the GPU.
arXiv Detail & Related papers (2024-08-16T06:39:08Z)
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving [52.31791050376249]
Quantization can accelerate large language model (LLM) inference. We introduce QoQ, a W4A8KV4 quantization algorithm with 4-bit weights, 8-bit activations, and a 4-bit KV cache. QServe improves the maximum achievable serving throughput of Llama-3-8B by 1.2x on A100 and 1.4x on L40S, and of Qwen1.5-72B by 2.4x on A100 and 3.5x on L40S.
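As a rough illustration of the KV4 part of such a scheme (not QoQ's actual procedure; the group size and asymmetric formulation are assumptions), 4-bit KV-cache quantization can be sketched as:

```python
import torch

def quantize_kv4(kv: torch.Tensor, group_size: int = 128):
    """Asymmetric 4-bit quantization of a KV-cache tensor in groups along the last
    dimension. Returns uint8 codes (one 4-bit value per byte, unpacked) plus the
    per-group scale and zero-point needed for dequantization."""
    flat = kv.reshape(-1, group_size)
    lo = flat.min(dim=1, keepdim=True).values
    hi = flat.max(dim=1, keepdim=True).values
    scale = (hi - lo).clamp(min=1e-8) / 15.0                # 4-bit unsigned range: 0..15
    zero = torch.round(-lo / scale)
    codes = torch.clamp(torch.round(flat / scale) + zero, 0, 15).to(torch.uint8)
    return codes.reshape(kv.shape), scale, zero

k = torch.randn(1, 32, 128, 128)                            # (batch, heads, seq, head_dim)
k_codes, k_scale, k_zero = quantize_kv4(k)
k_deq = (k_codes.reshape(-1, 128).to(torch.float32) - k_zero) * k_scale   # dequantize
```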
arXiv Detail & Related papers (2024-05-07T17:59:30Z)
- Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z)
- FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design [30.594788583458893]
Six-bit quantization (FP6) can effectively reduce the size of large language models (LLMs).
Existing systems do not provide Tensor Core support for FP6 quantization.
We propose TC-FPx, the first full-stack GPU kernel design scheme with unified Tensor Core support of floating-point weights for various quantization bit-widths.
arXiv Detail & Related papers (2024-01-25T11:46:38Z)
- Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM [6.85331857224501]
Large Language Models (LLMs) pose significant hardware challenges related to memory requirements and computational ability.
There are two mainstream quantization schemes for LLMs: coarse-grained (e.g., channel-wise) quantization and fine-grained (e.g., group-wise) quantization.
We introduce Dual Grained Quantization (DGQ), a novel A8W4 quantization for LLM that maintains superior performance while ensuring fast inference speed.
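To make the contrast between the two granularities concrete, here is a sketch of plain channel-wise versus group-wise 4-bit weight quantization (not DGQ's dual-grained scale decomposition itself; the group size is an illustrative choice):

```python
import torch

def quant_channelwise(w: torch.Tensor, n_bits: int = 4):
    """Coarse-grained: one scale per output channel (one per weight row)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    return torch.clamp(torch.round(w / scale), -qmax - 1, qmax), scale

def quant_groupwise(w: torch.Tensor, n_bits: int = 4, group_size: int = 128):
    """Fine-grained: one scale per group of `group_size` weights within each row,
    tracking local ranges more tightly at the cost of storing more scales."""
    qmax = 2 ** (n_bits - 1) - 1
    rows, cols = w.shape
    groups = w.reshape(rows, cols // group_size, group_size)
    scale = groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    codes = torch.clamp(torch.round(groups / scale), -qmax - 1, qmax)
    return codes.reshape(rows, cols), scale

w = torch.randn(4096, 4096) * 0.02
codes_c, scales_c = quant_channelwise(w)   # 4096 scales
codes_g, scales_g = quant_groupwise(w)     # 4096 * 32 scales
```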
arXiv Detail & Related papers (2023-10-07T14:50:28Z)
- OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models [57.27101446992148]
Large language models (LLMs) have revolutionized natural language processing tasks.
Recent post-training quantization (PTQ) methods are effective in reducing memory footprint and improving the computational efficiency of LLMs.
We introduce an Omnidirectionally calibrated Quantization technique for LLMs, which achieves good performance in diverse quantization settings.
arXiv Detail & Related papers (2023-08-25T02:28:35Z)
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup-table-based approach for executing ultra-low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
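A toy version of the lookup-table idea (plain Python rather than SIMD intrinsics; the 2-bit codebook and shapes are made up for illustration): for each input position, all possible products of the activation with the few representable weight values are precomputed once, so the inner product becomes table lookups plus additions.

```python
import torch

def lut_matvec_2bit(w_codes: torch.Tensor, a: torch.Tensor, levels: torch.Tensor) -> torch.Tensor:
    """Matrix-vector product with 2-bit weights via lookup tables.
    w_codes: (out, in) integer codes in {0,1,2,3}; levels: the 4 weight values;
    a: (in,) activations. tables[j, c] = a[j] * levels[c] is built once, then the
    dot product is a gather over the codes followed by a sum."""
    tables = torch.outer(a, levels)                  # (in, 4) precomputed partial products
    gathered = torch.gather(tables, 1, w_codes.t())  # (in, out) one lookup per weight
    return gathered.sum(dim=0)                       # (out,)

levels = torch.tensor([-1.5, -0.5, 0.5, 1.5])
w_codes = torch.randint(0, 4, (64, 256))             # 2-bit weight codes
a = torch.randn(256)
y = lut_matvec_2bit(w_codes, a, levels)
assert torch.allclose(y, levels[w_codes] @ a)        # matches the direct dense matvec
```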
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
- FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU [89.2451963569343]
FlexGen is a generation engine for running large language model (LLM) inference on a single commodity GPU.
When running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems.
On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours.
arXiv Detail & Related papers (2023-03-13T05:19:28Z)