Fugu-MT Paper Translation (Abstract): SBVR: Summation of BitVector Representation for Efficient LLM Quantization

Paper overview: SBVR: Summation of BitVector Representation for Efficient LLM Quantization

arxiv url: http://arxiv.org/abs/2509.18172v1
Date: Wed, 17 Sep 2025 13:51:27 GMT
Status: Translation complete
Last updated in system: 2025-09-24 20:41:27.452191
Title: SBVR: Summation of BitVector Representation for Efficient LLM Quantization
Authors: Wonjun Bang, Jongseok Park, Hongseung Yu, Kyungmin Bin, Kyunghan Lee
Abstract summary: Quantization achieves compression by limiting the number of representable points in the data. Existing Post-Training Quantization (PTQ) solutions adopt two major approaches: Round-To-Nearest (RTN)-based methods and codebook-based methods. This paper proposes SBVR (Summation of BitVector Representation).
Reference score (independently computed attention score): 3.7018544730078413
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: With the advent of large language models (LLMs), numerous Post-Training Quantization (PTQ) strategies have been proposed to alleviate deployment barriers created by their enormous parameter counts. Quantization achieves compression by limiting the number of representable points in the data. Therefore, the key to achieving efficient quantization is selecting the optimal combination of representation points, or codes, for the given data. Existing PTQ solutions adopt two major approaches to this problem: Round-To-Nearest (RTN)-based methods and codebook-based methods. RTN-based methods map LLM weights onto uniformly distributed integer grids, failing to account for the Gaussian-like distribution of LLM weights. Codebook-based methods mitigate this issue by constructing distribution-aware codebooks; however, they suffer from random and strided memory access patterns, resulting in degraded inference speed that is exacerbated by the limited size of GPU L1 cache. To overcome these limitations, we propose a novel LLM quantization method, SBVR (Summation of BitVector Representation), that enables Gaussian-like code representation in a hardware-friendly manner for fast inference. SBVR maps weight values to non-uniform representation points whose distribution follows the actual distribution of LLM weights, enabling more accurate compression. Additionally, we design a custom CUDA kernel that allows matrix-vector multiplication directly in the SBVR format without decompression, thereby enabling high-performance execution of SBVR-compressed models. Our evaluations of SBVR on various models demonstrate state-of-the-art perplexity and accuracy benchmark performance while delivering a 2.21x-3.04x end-to-end token-generation speedup over naive FP16 models in the 4-bit quantization regime.
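The contrast the abstract draws between uniform RTN grids and distribution-aware codebooks can be illustrated with a small sketch. It is not the SBVR method and not the authors' code: the Gaussian sample, the min-max grid, the 1-D Lloyd (k-means) refinement, and all function names below are assumptions chosen only to show why non-uniform representation points can fit Gaussian-like weights with lower error at the same bit width.

```python
import random


def rtn_quantize(weights, bits=4):
    """Round each weight to the nearest point of a uniform min-max grid."""
    lo, hi = min(weights), max(weights)
    levels = 2 ** bits
    step = (hi - lo) / (levels - 1)
    grid = [lo + i * step for i in range(levels)]
    quantized = [lo + round((w - lo) / step) * step for w in weights]
    return quantized, grid


def nearest_index(codebook, w):
    """Index of the codebook entry closest to w."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - w))


def codebook_quantize(weights, init_codebook, iters=20):
    """Refine a codebook with 1-D Lloyd iterations (hypothetical stand-in
    for a distribution-aware codebook method)."""
    codebook = list(init_codebook)
    for _ in range(iters):
        sums = [0.0] * len(codebook)
        counts = [0] * len(codebook)
        for w in weights:
            i = nearest_index(codebook, w)
            sums[i] += w
            counts[i] += 1
        # Move each code to the mean of its assigned weights;
        # codes with no assigned weights stay where they are.
        codebook = [
            sums[i] / counts[i] if counts[i] else codebook[i]
            for i in range(len(codebook))
        ]
    quantized = [codebook[nearest_index(codebook, w)] for w in weights]
    return quantized, codebook


def mse(original, quantized):
    """Mean squared quantization error."""
    return sum((a - b) ** 2 for a, b in zip(original, quantized)) / len(original)


random.seed(0)
# Assumption: synthetic Gaussian samples stand in for LLM weights.
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]

q_rtn, grid = rtn_quantize(weights, bits=4)
q_cb, codebook = codebook_quantize(weights, grid)

mse_rtn = mse(weights, q_rtn)
mse_cb = mse(weights, q_cb)
print(f"uniform RTN grid MSE:    {mse_rtn:.5f}")
print(f"refined codebook MSE:    {mse_cb:.5f}")
```

Starting the refinement from the uniform grid means each Lloyd iteration can only keep or lower the error, so the refined codebook concentrates its 16 points where the weights are dense. The sketch says nothing about inference speed; the memory-access cost of codebook lookups on a GPU, which the abstract identifies as the drawback of this family of methods, is not modeled here.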
