Fugu-MT paper translation (abstract): DLLMQuant: Quantizing Diffusion-based Large Language Models

Paper overview: DLLMQuant: Quantizing Diffusion-based Large Language Models

arxiv url: http://arxiv.org/abs/2508.14090v1
Date: Thu, 14 Aug 2025 09:30:17 GMT
Status: translation complete
Last updated in system: 2025-08-21 16:52:41.177861
Title: DLLMQuant: Quantizing Diffusion-based Large Language Models
Title (reference translation): DLLMQuant: Quantization of Diffusion-based Large Language Models
Authors: Chen Xu, Dawei Yang,
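Abstract summary: Diffusion-based large language models (DLLMs) have shown promise for non-autoregressive text generation. Post-training quantization (PTQ) suffers severe accuracy degradation and reduced performance when applied directly to DLLMs. The paper proposes DLLMQuant, a PTQ framework that incorporates three novel techniques.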
Reference score (independently computed attention metric): 7.970411645859868
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Diffusion-based large language models (DLLMs) have shown promise for non-autoregressive text generation, but their deployment is constrained by large model sizes and heavy computational costs. Post-training quantization (PTQ), a widely used method for compressing and accelerating Large Language Models (LLMs), suffers from severe accuracy degradation and reduced generalization performance when directly applied to DLLMs (e.g., AWQ suffers a 16% accuracy drop on LLADA under W4A4). This paper explores how DLLMs' key mechanisms (dynamic masking, iterative generation, and bidirectional attention) clash with quantization. We identify three core issues: 1) Iterative generation and dynamic masking ratios lead to distinct token distributions across decoding steps, which are not adequately captured by existing PTQ calibration methods; 2) Quantization errors accumulate and are progressively amplified during iteration in DLLMs, causing quantized models to perform worse as decoding steps progress; 3) Unmasked tokens stabilize while masked tokens remain probabilistic, making the overall feature distribution incompatible with existing PTQ methods. To address these issues, we propose DLLMQuant, a PTQ framework tailored for DLLMs, which incorporates three novel techniques: 1) Temporal-Mask Adaptive Sampling (TMAS), a calibration method that accounts for both time and mask factors, with the capacity to capture distributions across timesteps; 2) Interaction-Aware Activation Quantization (IA-AQ), which utilizes bidirectional attention's interaction signals to dynamically allocate quantization resources; 3) Certainty-Guided Quantization (CGQ), which integrates mask status and token scores as key weighting criteria into error compensation, making weight quantization more suitable for DLLMs. Experiments show that DLLMQuant achieves significant performance gains while enhancing efficiency.
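The abstract's "W4A4" setting (4-bit weights, 4-bit activations) and its point about errors compounding over decoding steps can be illustrated with a toy sketch. This is not the paper's method or AWQ: it assumes plain symmetric per-tensor round-to-nearest quantization, and the recurrence `x = tanh(W x)` is only a stand-in for an iterative decoder. All function names here are hypothetical.

```python
import math
import random


def fake_quantize(values, bits=4):
    """Symmetric per-tensor round-to-nearest quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for signed 4-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid division by zero
    out = []
    for v in values:
        q = round(v / scale)
        q = max(-qmax - 1, min(qmax, q))            # clamp to the integer grid [-8, 7]
        out.append(q * scale)
    return out


def matvec(weight, x):
    """Dense matrix-vector product on nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def w4a4_matvec(weight, x, bits=4):
    """Quantize both the weight matrix and the input activation, then multiply."""
    n_cols = len(weight[0])
    flat = fake_quantize([w for row in weight for w in row], bits)
    q_weight = [flat[i * n_cols:(i + 1) * n_cols] for i in range(len(weight))]
    return matvec(q_weight, fake_quantize(x, bits))


def error_over_steps(steps=8, dim=16, seed=0):
    """Track the gap between full-precision and W4A4 trajectories of a toy recurrence."""
    rng = random.Random(seed)
    weight = [[rng.gauss(0, 0.5) for _ in range(dim)] for _ in range(dim)]
    x_fp = [rng.gauss(0, 1) for _ in range(dim)]
    x_q = list(x_fp)
    errors = []
    for _ in range(steps):
        x_fp = [math.tanh(v) for v in matvec(weight, x_fp)]
        x_q = [math.tanh(v) for v in w4a4_matvec(weight, x_q)]
        # Each step feeds the previous (already perturbed) state back in.
        errors.append(max(abs(a - b) for a, b in zip(x_fp, x_q)))
    return errors


if __name__ == "__main__":
    for step, err in enumerate(error_over_steps()):
        print(f"step {step}: max abs deviation = {err:.4f}")
```

Because the quantized trajectory reuses its own perturbed state at every step, the per-step deviation reflects both fresh rounding error and error carried over from earlier steps, which is the mechanism the abstract's second issue refers to. Whether the deviation grows in a given run depends on the toy weights.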
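Abstract (reference translation): Diffusion-based large language models (DLLMs) show promise for non-autoregressive text generation, but their deployment is constrained by large model sizes and computational costs. PTQ, which is widely used to compress and accelerate large language models (LLMs), suffers from reduced accuracy and reduced generalization performance when applied directly to DLLMs (for example, AWQ suffers a 16% accuracy drop on LLADA under W4A4). This paper examines how the key mechanisms of DLLMs (dynamic masking, iterative generation, and bidirectional attention) clash with quantization. We identify three issues. 1) Iterative generation and dynamic masking ratios produce token distributions that differ across decoding steps and are not adequately captured by existing PTQ calibration methods. 2) Quantization errors accumulate and are amplified over the iterations of a DLLM, so the quantized model degrades further as decoding progresses. 3) Unmasked tokens stabilize while masked tokens remain probabilistic, which makes the overall feature distribution incompatible with existing PTQ methods. To address these issues, we propose DLLMQuant, a PTQ framework tailored to DLLMs. 1) Temporal-Mask Adaptive Sampling (TMAS) is a calibration method that accounts for both time and mask factors. 2) Interaction-Aware Activation Quantization (IA-AQ) uses the interaction signals of bidirectional attention to allocate quantization resources dynamically. 3) Certainty-Guided Quantization (CGQ) incorporates mask status and token scores as key weighting criteria in error compensation, making weight quantization better suited to DLLMs. DLLMQuant achieves significant performance gains while improving efficiency.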
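The first issue above, that calibration data should cover several decoding steps and mask ratios, can be sketched as follows. The abstract does not say how TMAS selects samples, so this is a generic stratified scheme rather than TMAS itself. The linear mask schedule, the `[MASK]` placeholder, and every function name are assumptions made for illustration.

```python
import random

MASK = "[MASK]"  # placeholder mask token (assumption, not taken from the paper)


def mask_ratio_at(step, total_steps):
    """Assumed linear schedule: everything masked at step 0, less masked at later steps."""
    return 1.0 - step / total_steps


def pick_steps(total_steps, steps_to_sample):
    """Choose roughly evenly spaced decoding steps, without duplicates."""
    if steps_to_sample <= 1:
        return [0]
    raw = [round(i * (total_steps - 1) / (steps_to_sample - 1))
           for i in range(steps_to_sample)]
    return sorted(set(raw))


def build_calibration_set(sequences, total_steps, steps_to_sample, seed=0):
    """Create one masked copy of each sequence per sampled decoding step."""
    rng = random.Random(seed)
    samples = []
    for seq in sequences:
        for step in pick_steps(total_steps, steps_to_sample):
            ratio = mask_ratio_at(step, total_steps)
            n_mask = round(ratio * len(seq))
            masked = set(rng.sample(range(len(seq)), n_mask))
            tokens = [MASK if i in masked else tok for i, tok in enumerate(seq)]
            samples.append({"step": step, "mask_ratio": ratio, "tokens": tokens})
    return samples


if __name__ == "__main__":
    data = build_calibration_set([["the", "cat", "sat", "down"]],
                                 total_steps=4, steps_to_sample=2)
    for sample in data:
        print(sample)
```

A calibration set built this way exposes the quantizer to both heavily masked early states and mostly unmasked late states, instead of a single snapshot. A real pipeline would feed these inputs through the model and record activation statistics per layer.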


Related papers