Fugu-MT paper translation (abstract): HyperQuant: A Rate-Distortion-Optimal Quantization Pipeline for Large Language and Diffusion Models

Paper abstract: HyperQuant: A Rate-Distortion-Optimal Quantization Pipeline for Large Language and Diffusion Models

arxiv url: http://arxiv.org/abs/2606.23406v1
Date: Mon, 22 Jun 2026 14:30:47 GMT
Status: translation complete
Last updated in system: 2026-06-24 19:11:46.714829
Title: HyperQuant: A Rate-Distortion-Optimal Quantization Pipeline for Large Language and Diffusion Models
Title (reference translation): HyperQuant: A Rate-Distortion-Optimal Quantization Pipeline for Large Language and Diffusion Models
Authors: Yuval Domb, Hadar Sackstein, Tomer Solberg
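Abstract summary: HyperQuant is a unified post-training quantization pipeline for the weights and the KV cache of large language and diffusion transformers. It outperforms the recent HIGGS scheme at every operating point from 3 to 5 bits per scalar (bps) on weights, and beats both TurboQuant and OCTOPUS on KV quantization down to 1.7 bps. It compresses the linear weights ~3.9x and the KV cache ~3.79x at near-lossless quality.
Reference score (independently computed attention metric): 0.0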
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present HyperQuant (Hadamard, optimallY Packing, Entropy Rice-coding), a unified post-training quantization pipeline for the weights and the KV cache of large language and diffusion transformers. Across a suite of self-contained experiments (Table 1), HyperQuant outperforms the recent HIGGS scheme at every operating point from 3 to 5 bits per scalar (bps) on weights, and beats both TurboQuant and OCTOPUS on KV quantization down to 1.7 bps. Beyond the LLM setting, HyperQuant quantizes the 19B-parameter LTX-2 DiT video model with no observable per-frame artifacts. End-to-end on an H100 at 4 bps, HyperQuant compresses the linear weights ~3.9x and the KV cache ~3.79x at near-lossless quality. HyperQuant combines four known ideas into a single construction: (i) a per-tile Randomized Hadamard Transform that makes the per-coordinate distribution of weights and activations approximately Gaussian; (ii) quantization to a low-dimensional optimal lattice (E8, D4, A2, or Z); (iii) lossless bit-stripping and near-entropy-optimal variable-length Rice coding of the lattice indices; and (iv) bias-correction methods for the KV cache that keep the reconstruction unbiased under inner products, preserving attention semantics. We further integrate the pipeline with 8-bit and 4-bit Tensor-Core MMA paths (fp8-e4m3, int8, nvfp4, mxfp4), and find that int8 beats fp8 on the post-RHT lattice output. Project page: https://moonmath.ai/hyperquant/
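Steps (i) and (ii) of the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the tile length, the seeded sign diagonal, the `step` scale, and the helper names (`random_signs`, `quantize_tile`, `dequantize_tile`) are assumptions made for illustration, and D4 is chosen only because its nearest-point rule (the Conway-Sloane decoder for D_n) is short. The abstract does not say how HyperQuant scales tiles or which lattice it uses at which rate.

```python
import math
import random


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform of a power-of-two-length list."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    y = list(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]


def random_signs(n, seed):
    """Hypothetical helper: a reproducible +/-1 diagonal for one tile."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]


def rht(tile, signs):
    """Randomized Hadamard Transform: flip signs, then apply the orthonormal FWHT."""
    return fwht([s * v for s, v in zip(signs, tile)])


def inverse_rht(coeffs, signs):
    """The orthonormal FWHT is its own inverse, so apply it and undo the signs."""
    return [s * v for s, v in zip(signs, fwht(coeffs))]


def nearest_dn(x):
    """Nearest point of D_n (integer vectors whose coordinates sum to an even number)."""
    rounded = [math.floor(v + 0.5) for v in x]
    if sum(rounded) % 2 == 0:
        return rounded
    # Odd parity: re-round the coordinate with the largest rounding error
    # in the opposite direction, which restores an even sum at minimal cost.
    errs = [v - r for v, r in zip(x, rounded)]
    k = max(range(len(x)), key=lambda i: abs(errs[i]))
    rounded[k] += 1 if errs[k] > 0 else -1
    return rounded


def quantize_tile(tile, signs, step):
    """Quantize the RHT coefficients of a tile to a scaled D4 lattice, four at a time."""
    if len(tile) % 4:
        raise ValueError("tile length must be a multiple of 4 for D4 blocks")
    coeffs = rht(tile, signs)
    indices = []
    for i in range(0, len(coeffs), 4):
        indices.extend(nearest_dn([c / step for c in coeffs[i:i + 4]]))
    return indices


def dequantize_tile(indices, signs, step):
    """Map lattice indices back to the original domain (assumed reconstruction path)."""
    return inverse_rht([q * step for q in indices], signs)
```

A single large outlier in a tile is spread evenly across all coordinates by the transform, which is the intuition behind using it before a fixed-grid quantizer; the Gaussianity claim itself is the paper's and is not demonstrated by this sketch.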
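Abstract (reference translation): This paper presents HyperQuant (Hadamard, optimallY Packing, Entropy Rice-coding), a unified post-training quantization pipeline for the weights and the KV cache of large language and diffusion transformers. In a suite of self-contained experiments (Table 1), HyperQuant outperforms the recent HIGGS scheme at every operating point from 3 to 5 bits per scalar (bps) on weights, and beats both TurboQuant and OCTOPUS on KV quantization down to 1.7 bps. Beyond the LLM setting, HyperQuant quantizes the 19B-parameter LTX-2 DiT video model with no observable per-frame artifacts. End-to-end on an H100 at 4 bps, HyperQuant compresses the linear weights ~3.9x and the KV cache ~3.79x at near-lossless quality. HyperQuant combines four known ideas into a single construction: (i) a per-tile Randomized Hadamard Transform that makes the per-coordinate distribution of weights and activations approximately Gaussian; (ii) quantization to a low-dimensional optimal lattice (E8, D4, A2, or Z); (iii) lossless bit-stripping and near-entropy-optimal variable-length Rice coding of the lattice indices; and (iv) bias-correction methods for the KV cache that keep the reconstruction unbiased under inner products, preserving attention semantics. The pipeline is further integrated with 8-bit and 4-bit Tensor-Core MMA paths (fp8-e4m3, int8, nvfp4, mxfp4), where int8 is found to beat fp8 on the post-RHT lattice output. Project page: https://moonmath.ai/hyperquant/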
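Step (iii), Rice coding of the lattice indices, can be sketched as a textbook Golomb-Rice coder over signed integers. This is an illustration under assumptions, not the paper's codec: the zigzag mapping of signed indices, the fixed parameter `k`, and the bit-string representation are choices made here for readability, and the abstract does not describe how HyperQuant selects `k` or performs bit-stripping.

```python
def zigzag(n):
    """Map a signed integer to a non-negative one: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ..."""
    return 2 * n if n >= 0 else -2 * n - 1


def unzigzag(u):
    """Inverse of zigzag."""
    return u >> 1 if u % 2 == 0 else -((u + 1) >> 1)


def rice_encode(values, k):
    """Encode signed integers as a bit string: unary quotient, then a k-bit remainder."""
    if k < 0:
        raise ValueError("k must be non-negative")
    parts = []
    for n in values:
        u = zigzag(n)
        q, r = u >> k, u & ((1 << k) - 1)
        parts.append("1" * q + "0")          # quotient in unary, terminated by a 0
        if k:
            parts.append(format(r, f"0{k}b"))  # remainder in exactly k bits
    return "".join(parts)


def rice_decode(bits, k, count):
    """Decode `count` signed integers from a bit string produced by rice_encode."""
    out = []
    pos = 0
    for _ in range(count):
        q = 0
        while bits[pos] == "1":
            q += 1
            pos += 1
        pos += 1  # skip the terminating 0 of the unary part
        r = int(bits[pos:pos + k], 2) if k else 0
        pos += k
        out.append(unzigzag((q << k) | r))
    return out
```

Small-magnitude indices, which dominate when the quantizer input is roughly Gaussian, get the shortest codewords. With `k = 1`, for example, the index 0 costs two bits while the index 2 costs four.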


Related papers