Fugu-MT Paper Translation (Abstract): SQ-format: A Unified Sparse-Quantized Hardware-friendly Data Format for LLMs

Paper abstract: SQ-format: A Unified Sparse-Quantized Hardware-friendly Data Format for LLMs

arxiv url: http://arxiv.org/abs/2512.05409v1
Date: Fri, 05 Dec 2025 03:58:04 GMT
Status: Translation complete
Last updated in system: 2025-12-13 22:40:56.890516
Title: SQ-format: A Unified Sparse-Quantized Hardware-friendly Data Format for LLMs
Title (reference translation): SQ-format: A Unified Sparse-Quantized Hardware-Friendly Data Format for LLMs
Authors: Ruixuan Huang, Hao Zeng, Hantao Huang, Jinyuan Shi, Minghui Yu, Ian En-Hsu Yen, Shuai Wang
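Abstract summary: Post-training quantization (PTQ) plays a crucial role in the democratization of large language models (LLMs). Existing low-bit quantization and sparsification techniques struggle to balance accuracy and efficiency because hardware support is limited. This paper proposes the Sparse-Quantized Format (SQ-format), a unified data format for quantization and sparsification.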
Reference score (independently computed attention score): 8.787017031267482
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Post-training quantization (PTQ) plays a crucial role in the democratization of large language models (LLMs). However, existing low-bit quantization and sparsification techniques struggle to balance accuracy and efficiency because hardware support is limited. For example, W4A8 can only achieve the same peak TOPS as W8A8, whereas the GPU-supported sparse data format (2:4 semi-structured sparsity) is seldom adopted because of its accuracy loss. To bridge this gap, we propose the Sparse-Quantized Format (SQ-format), a unified data format for quantization and sparsification that can potentially be supported easily by new hardware and existing GPUs. SQ-format exploits the fact that sparse matrices can be accelerated in high precision, and that low-precision matrix multiplication can be accelerated as well. SQ-format is therefore proposed to achieve a Pareto improvement between performance and throughput. The format is particularly suitable for activations with outlier inequality status and makes their static compression possible. We show state-of-the-art PTQ performance with SQ-format, propose the hardware required to support it, and offer design exploration and insights for next-generation AI accelerators.
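Abstract (reference translation): Post-training quantization (PTQ) plays a crucial role in the democratization of large language models (LLMs). However, because hardware support is limited, existing low-bit quantization and sparsification techniques struggle to balance accuracy and efficiency. For example, W4A8 can only reach the same peak TOPS as W8A8, while the sparse data format supported by GPUs (2:4 semi-structured sparsity) is rarely adopted because of its accuracy loss. To bridge this gap, this paper proposes the Sparse-Quantized Format (SQ-format), a unified data format for quantization and sparsification that can be supported easily by new hardware and existing GPUs. SQ-format exploits the fact that sparse matrices can be accelerated in high precision, and that low-precision matrix multiplication can be accelerated as well. SQ-format is therefore proposed to achieve a Pareto improvement between performance and throughput. The format is particularly suitable for activations with outlier inequality status and makes their static compression possible. The authors show state-of-the-art PTQ performance with SQ-format, propose the hardware required to support it, and offer design exploration and insights for next-generation AI accelerators.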
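
The abstract does not spell out the layout of SQ-format, so the following is only a rough sketch of the general idea it alludes to: keep a few large-magnitude values in a sparse high-precision part and store the rest as dense low-bit integers. The function names, the `keep` and `bits` parameters, and the top-k outlier rule are all assumptions made for illustration, not the paper's method.

```python
def split_sparse_quantized(row, keep=1, bits=4):
    """Split a row into a sparse high-precision part and a dense low-bit part.

    Hypothetical illustration only: the `keep` largest-magnitude entries stay
    in full precision; the rest are symmetrically quantized to `bits` bits.
    """
    # Indices of the `keep` largest-magnitude values (treated as outliers).
    order = sorted(range(len(row)), key=lambda i: abs(row[i]), reverse=True)
    outlier_idx = set(order[:keep])
    sparse = {i: row[i] for i in outlier_idx}

    # Zero out the outliers, then quantize what is left with one shared scale.
    rest = [0.0 if i in outlier_idx else v for i, v in enumerate(row)]
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in rest)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [round(v / scale) for v in rest]
    return sparse, codes, scale


def dot_sparse_quantized(sparse, codes, scale, x):
    """Dot product as a dense low-precision term plus a sparse correction."""
    dense = scale * sum(c * xi for c, xi in zip(codes, x))
    outliers = sum(v * x[i] for i, v in sparse.items())
    return dense + outliers


row = [0.1, -0.2, 8.0, 0.3]   # one large outlier among small values
x = [1.0, 1.0, 1.0, 1.0]
sparse, codes, scale = split_sparse_quantized(row, keep=1, bits=4)
approx = dot_sparse_quantized(sparse, codes, scale, x)
exact = sum(r * xi for r, xi in zip(row, x))
print(sparse, codes, abs(approx - exact))
```

In this toy row, pulling the single outlier out of the quantized part lets the shared scale fit the small values, so the approximation error is lower than quantizing the whole row with `keep=0`.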


Related papers