Fugu-MT paper translation (summary): PermuQuant: Lowering Per-Group Quantization Error by Reordering Channels for Diffusion Models

Paper summary: PermuQuant: Lowering Per-Group Quantization Error by Reordering Channels for Diffusion Models

arxiv url: http://arxiv.org/abs/2605.09503v1
Date: Sun, 10 May 2026 12:26:50 GMT
Status: Translation complete
Last updated in system: 2026-05-12 23:28:50.283082
Title: PermuQuant: Lowering Per-Group Quantization Error by Reordering Channels for Diffusion Models
Authors: Yongsen Cheng, Kai Liu, Kaiwen Tao, Junxian Li, Zhixin Wang, Zhikai Chen, Renjing Pei, Yulun Zhang
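Abstract summary: Post-training quantization (PTQ) offers a practical solution by compressing pretrained models without expensive retraining. Existing PTQ methods suffer from severe quality degradation under extremely low-bit settings. We propose PermuQuant, a simple yet effective PTQ framework for low-bit diffusion models.
Reference score (attention score computed by the site): 31.647243569492446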
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large-scale visual generative models have achieved remarkable performance. However, their high computational and memory costs make deployment challenging in resource-constrained scenarios, such as interactive applications and personal single-GPU usage. Post-training quantization (PTQ) offers a practical solution by compressing pretrained models without expensive retraining. However, existing PTQ methods still suffer from severe quality degradation under extremely low-bit settings. In this paper, we identify channel ordering as an important but underexplored factor in per-group quantization. In this setting, each contiguous group shares one quantization scale. When channels with very different statistics are placed in the same group, the scale can be dominated by outliers and cause large quantization errors. Based on this observation, we propose PermuQuant, a simple and effective PTQ framework for low-bit diffusion models. PermuQuant sorts channels by a joint second-moment criterion before per-group quantization, placing channels with similar activation and weight statistics into the same group. It further uses a calibration-based acceptance rule to apply reordering only when the selected permutation reduces quantization error on calibration data. The selected permutations are absorbed into adjacent modules or applied to weights offline, avoiding explicit runtime permutation operations. Extensive experiments on multiple large diffusion models show that PermuQuant consistently reduces quantization error and outperforms existing PTQ baselines. On FLUX.1-dev with an RTX 5090, PermuQuant achieves up to a 1.8× single-step speedup and reduces the DiT memory footprint by 3.5× under W4A4 NVFP4 quantization. Code will be available at https://github.com/yscheng04/PermuQuant.
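The mechanism described in the abstract can be illustrated on a single linear layer y = x W^T with a small NumPy sketch. This is not the authors' implementation: the product form of the "joint second-moment" score, the symmetric 4-bit integer absmax quantizer (standing in for NVFP4), the group size of 16, the toy data, and all function names are assumptions made for illustration. Both x and W are quantized in contiguous groups along the input-channel axis, so one permutation reorders both consistently and leaves the full-precision output unchanged.

```python
import numpy as np

def fake_quant_per_group(t, group_size=16, bits=4):
    """Symmetric absmax fake quantization: one scale per contiguous group on the last axis."""
    *lead, n = t.shape
    assert n % group_size == 0, "channel count must be a multiple of group_size"
    g = t.reshape(*lead, n // group_size, group_size)
    qmax = 2 ** (bits - 1) - 1                       # 7 for signed 4-bit integers
    scale = np.abs(g).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)         # guard all-zero groups
    q = np.clip(np.round(g / scale), -qmax, qmax)
    return (q * scale).reshape(t.shape)

def joint_second_moment_order(x, w):
    """ASSUMED score per input channel j: E[x_j^2] * mean_i(W_ij^2), sorted ascending."""
    act_m2 = np.mean(x ** 2, axis=0)                 # x: (tokens, in)
    w_m2 = np.mean(w ** 2, axis=0)                   # w: (out, in)
    return np.argsort(act_m2 * w_m2)

def quant_output_mse(x, w, perm=None, group_size=16):
    """MSE between the full-precision and the quantized output of y = x @ w.T."""
    y_ref = x @ w.T
    if perm is not None:                             # same permutation on x columns and w columns
        x, w = x[:, perm], w[:, perm]
    y_q = fake_quant_per_group(x, group_size) @ fake_quant_per_group(w, group_size).T
    return float(np.mean((y_ref - y_q) ** 2))

def select_permutation(x_calib, w, group_size=16):
    """Acceptance rule: keep the reordering only if it lowers error on calibration data."""
    perm = joint_second_moment_order(x_calib, w)
    if quant_output_mse(x_calib, w, perm, group_size) < quant_output_mse(x_calib, w, None, group_size):
        return perm
    return None                                      # fall back to the original channel order

# Toy calibration data: one large-magnitude channel in every contiguous group of 16.
rng = np.random.default_rng(0)
x = rng.standard_normal((256, 128))
x[:, ::16] *= 20.0
w = rng.standard_normal((64, 128))

err_orig = quant_output_mse(x, w)
perm = select_permutation(x, w)
err_perm = quant_output_mse(x, w, perm) if perm is not None else err_orig
print(f"original order MSE: {err_orig:.2f}, reordered MSE: {err_perm:.2f}")

# Offline absorption: if x = h @ w_prev.T, permuting the output rows of w_prev
# directly produces the reordered activations, so no runtime permutation is needed.
order = perm if perm is not None else np.arange(x.shape[1])
h = rng.standard_normal((256, 32))
w_prev = rng.standard_normal((128, 32))
assert np.allclose((h @ w_prev.T)[:, order], h @ w_prev[order].T)
```

On this toy data every original group contains one outlier channel that dominates its scale, so most ordinary channels round to zero. Sorting by the score gathers the outliers into the last group, the remaining groups get much finer scales, and the calibration check should accept the reordering.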
