Quant-dLLM: Post-Training Extreme Low-Bit Quantization for Diffusion Large Language Models
- URL: http://arxiv.org/abs/2510.03274v1
- Date: Sat, 27 Sep 2025 13:50:42 GMT
- Title: Quant-dLLM: Post-Training Extreme Low-Bit Quantization for Diffusion Large Language Models
- Authors: Tianao Zhang, Zhiteng Li, Xianglong Yan, Haotong Qin, Yong Guo, Yulun Zhang,
- Abstract summary: Diffusion large language models (dLLMs) offer bidirectional context and flexible masked-denoising generation. We propose Quant-dLLM, an ultra-low-bit PTQ framework tailored to dLLMs. Quant-dLLM consistently achieves higher accuracy than state-of-the-art (SOTA) AR-transfer PTQ methods on dLLMs.
- Score: 47.41616630151171
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion large language models (dLLMs), which offer bidirectional context and flexible masked-denoising generation, are emerging as a compelling alternative to autoregressive (AR) LLMs. However, like AR LLMs, their model sizes continue to grow, motivating weight compression for deployment. Although post-training quantization (PTQ) is effective for AR LLMs, directly transferring it to dLLMs at 2-bit leads to unsatisfactory performance. To tackle these challenges, we propose Quant-dLLM, an ultra-low-bit PTQ framework tailored to dLLMs. Since masked-denoising activations in dLLMs differ from the fully visible signals assumed by standard PTQ methods, we introduce Masked Calibration Simulation (MCS) to align calibration with the timestep-dependent masking, which yields more reliable calibrations. Moreover, we propose a Data-aware Any-order Quantizer (DAQ) that learns ultra-low-bit weight representations via an optimization algorithm. It performs iterative approximation guided by our simulated calibration data. In addition, under a strict 2-bit budget, we introduce Adaptive Blockwise Mixed Precision (ABMP), a sensitivity-based precision allocation scheme that adaptively assigns bit width across channel groups. When restricted to 2-bit precision, Quant-dLLM consistently achieves higher accuracy than state-of-the-art (SOTA) AR-transfer PTQ methods on dLLMs. The code and models will be available at: https://github.com/ZTA2785/Quant-dLLM.
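The Masked Calibration Simulation (MCS) idea described above can be sketched as follows: calibration tokens are masked with a timestep-dependent probability so that calibration activations resemble the partially masked inputs a dLLM actually sees during denoising. The function name `simulate_masked_calibration`, the uniform timestep sampling, and the per-token masking rate are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def simulate_masked_calibration(token_ids, mask_id, rng=None):
    """Sketch of Masked Calibration Simulation (MCS): replace a
    timestep-dependent fraction of calibration tokens with the mask
    token so activations match those seen during masked denoising.
    (Hypothetical helper; details differ from the paper.)"""
    rng = np.random.default_rng(0) if rng is None else rng
    t = rng.uniform(0.0, 1.0)                    # sampled diffusion timestep in (0, 1)
    keep = rng.uniform(size=len(token_ids)) >= t  # mask each token with probability t
    return np.where(keep, token_ids, mask_id), t

tokens = np.arange(10)
masked, t = simulate_masked_calibration(tokens, mask_id=-1)
```

A calibration set built this way would mix several sampled timesteps, so the quantizer sees the full range of masking ratios rather than only fully visible sequences.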
Related papers
- SDQ-LLM: Sigma-Delta Quantization for 1-bit LLMs of any size [5.229694155440675]
Large language models (LLMs) face significant computational and memory challenges. We introduce SDQ-LLM: Sigma-Delta Quantization for 1-bit LLMs of any size. A distinctive feature of SDQ-LLM is the continuous adjustability of the Over-Sampling Ratio (OSR).
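The sigma-delta principle behind this approach can be sketched as a first-order error-feedback loop: each weight is binarized, and the quantization error is carried forward so low-frequency content survives. The helper `sigma_delta_binarize`, the mean-absolute scale, and the simple `np.repeat` over-sampling are illustrative assumptions, not SDQ-LLM's actual algorithm.

```python
import numpy as np

def sigma_delta_binarize(w, osr=1):
    """Sketch of first-order sigma-delta quantization of a weight row:
    accumulated quantization error is fed back before each binarization.
    osr repeats each weight (over-sampling) before quantizing."""
    x = np.repeat(np.asarray(w, dtype=float), osr)  # over-sample by OSR
    scale = float(np.mean(np.abs(x))) or 1.0        # per-row scale (fallback 1.0)
    err = 0.0
    out = np.empty_like(x)
    for i, v in enumerate(x):
        u = v + err                    # add accumulated error
        q = scale if u >= 0 else -scale
        err = u - q                    # feed back the quantization error
        out[i] = q
    return out

out = sigma_delta_binarize(np.array([0.5, -0.2, 0.1, 0.3]))
```

Raising `osr` trades extra stored bits for a finer reconstruction, which is the knob the abstract refers to as the continuously adjustable OSR.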
arXiv Detail & Related papers (2025-09-27T14:49:58Z) - PT$^2$-LLM: Post-Training Ternarization for Large Language Models [52.4629647715623]
Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment. We propose PT$^2$-LLM, a post-training ternarization framework tailored for LLMs. At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline.
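An asymmetric ternary quantizer of the kind named above can be sketched as mapping each weight to {negative scale, 0, positive scale} with separate scales for the two sides. The function `asymmetric_ternarize`, the 0.7 threshold fraction, and the mean-based scale fitting are illustrative assumptions; the paper's two-stage refinement is not reproduced here.

```python
import numpy as np

def asymmetric_ternarize(w, delta_frac=0.7):
    """Sketch of an asymmetric ternary quantizer: weights below a
    magnitude threshold go to zero; positive and negative survivors
    get separate scales (the mean of each side, a least-squares fit
    for a fixed support)."""
    w = np.asarray(w, dtype=float)
    delta = delta_frac * np.mean(np.abs(w))     # magnitude threshold
    pos, neg = w > delta, w < -delta
    alpha_p = w[pos].mean() if pos.any() else 0.0
    alpha_n = w[neg].mean() if neg.any() else 0.0
    q = np.zeros_like(w)
    q[pos], q[neg] = alpha_p, alpha_n
    return q

q = asymmetric_ternarize(np.array([1.0, -1.0, 0.05, 0.9]))
```

Allowing `alpha_p != -alpha_n` is what makes the quantizer asymmetric: skewed weight distributions are fit better than with a single shared scale.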
arXiv Detail & Related papers (2025-09-27T03:01:48Z) - SBVR: Summation of BitVector Representation for Efficient LLM Quantization [3.7018544730078413]
Limiting the number of representable points in the data is key to efficient quantization compression. Existing Post-Training Quantization (PTQ) solutions adopt two major approaches to this problem: Round-To-Nearest (RTN)-based methods and codebook-based methods. We propose SBVR (Summation of BitVector Representation), which enables Gaussian-like code representation in a hardware-friendly manner for fast inference.
arXiv Detail & Related papers (2025-09-17T13:51:27Z) - DLLMQuant: Quantizing Diffusion-based Large Language Models [15.318057331535982]
Diffusion-based large language models (dLLMs) have shown promise for non-autoregressive text generation. Post-training quantization (PTQ) suffers from severe accuracy degradation and reduced performance when applied to dLLMs. We propose DLLMQuant, a PTQ framework tailored for dLLMs, which incorporates three novel techniques.
arXiv Detail & Related papers (2025-08-14T09:30:17Z) - Accelerating Diffusion LLMs via Adaptive Parallel Decoding [60.407727995313074]
We introduce adaptive parallel decoding (APD), a novel method that dynamically adjusts the number of tokens sampled in parallel. APD provides markedly higher throughput with minimal quality degradation on downstream benchmarks.
arXiv Detail & Related papers (2025-05-31T06:10:10Z) - SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [63.118592279833656]
Post-training quantization (PTQ) is an effective technique for compressing large language models (LLMs). We propose SliM-LLM, a salience-driven mixed-precision quantization framework that allocates bit-widths at the group level. Experiments show that SliM-LLM achieves superior performance across various LLMs at low bit-widths.
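Salience-driven group-wise bit allocation, as used here and echoed by Quant-dLLM's ABMP scheme, can be sketched as ranking weight groups by a salience score and giving the top half more bits while keeping the average bit budget fixed. The helper `allocate_group_bits` and the half/half split are illustrative assumptions, not either paper's allocation rule.

```python
import numpy as np

def allocate_group_bits(salience, hi=3, lo=1):
    """Sketch of salience-driven mixed precision: the most salient
    half of weight groups gets hi bits, the rest lo bits, so the
    average bit-width stays at (hi + lo) / 2."""
    salience = np.asarray(salience)
    order = np.argsort(salience)[::-1]          # most salient groups first
    bits = np.full(len(salience), lo)
    bits[order[: len(salience) // 2]] = hi
    return bits

bits = allocate_group_bits(np.array([0.1, 0.9, 0.5, 0.2]))
```

In practice the salience score might come from Hessian diagonals or activation statistics; any monotone score slots into the same ranking step.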
arXiv Detail & Related papers (2024-05-23T16:21:48Z) - BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
It achieves for the first time high-accuracy inference (e.g. 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z) - Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval.
We provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed.
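The additive-quantization decode step underlying this approach can be sketched as reconstructing each weight group as the sum of one vector drawn from each of M codebooks. The function `aq_decode` and the array shapes are illustrative assumptions; AQLM's optimized GPU/CPU kernels are far more involved.

```python
import numpy as np

def aq_decode(codes, codebooks):
    """Sketch of additive-quantization decoding: each weight group is
    reconstructed as the sum of one vector from each of M codebooks.
    codes: (n_groups, M) integer indices; codebooks: (M, K, d) arrays."""
    return sum(codebooks[m][codes[:, m]] for m in range(codebooks.shape[0]))

# Toy example: M=2 codebooks, K=2 entries each, group dimension d=2.
cb = np.array([[[1., 0.], [0., 1.]],
               [[1., 1.], [2., 2.]]])
codes = np.array([[0, 1],
                  [1, 0]])
W = aq_decode(codes, cb)
```

Storage per group is M * log2(K) bits of indices plus the shared codebooks, which is how sums of small codebooks reach extreme compression ratios.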
arXiv Detail & Related papers (2024-01-11T18:54:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.