Fugu-MT paper translation (abstract): Binary Quantization For LLMs Through Dynamic Grouping

Paper overview: Binary Quantization For LLMs Through Dynamic Grouping

arxiv url: http://arxiv.org/abs/2509.03054v2
Date: Mon, 15 Sep 2025 05:32:08 GMT
Status: Translation complete
Last updated in system: 2025-09-16 15:23:16.402833
Title: Binary Quantization For LLMs Through Dynamic Grouping
Authors: Xinzhe Zheng, Zhen-Qun Yang, Haoran Xie, S. Joe Qin, Arlene Chen, Fangzhen Lin
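Abstract summary: Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of Natural Language Processing (NLP) tasks. Binary quantization, which compresses model weights from 16-bit Brain Float to 1-bit representations in {-1, 1}, significantly reduces storage and inference costs. This paper proposes a novel optimization objective tailored for binary quantization, along with three algorithms designed to realize it effectively.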
Reference score (independently computed attention score): 13.578307208515819
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of Natural Language Processing (NLP) tasks, but require substantial memory and computational resources. Binary quantization, which compresses model weights from 16-bit Brain Float to 1-bit representations in {-1, 1}, offers significant reductions in storage and inference costs. However, such aggressive quantization often leads to notable performance degradation compared to more conservative 4-bit quantization methods. In this research, we propose a novel optimization objective tailored for binary quantization, along with three algorithms designed to realize it effectively. Our method enhances blocked quantization by dynamically identifying optimal unstructured sub-matrices through adaptive grouping strategies. Experimental results demonstrate that our approach achieves an average bit length of just 1.007 bits, while maintaining high model quality. Specifically, our quantized LLaMA 3.2 3B model attains a perplexity of 8.23, remarkably close to the original 7.81, and far better than the previous SOTA, BiLLM, which reaches a perplexity of 123.90. Furthermore, our method is competitive with SOTA 4-bit approaches such as GPTQ in both performance and efficiency. The compression process is highly efficient, requiring only 14 seconds to quantize the full LLaMA 3.2 3B weights on a single CPU core, with the entire process completing in under 100 minutes and exhibiting embarrassingly parallel properties. Code: https://github.com/johnnyzheng0636/WGM_bi_quan
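
To make the idea of grouped binary quantization concrete, the following is a minimal sketch and not the paper's algorithm: the abstract does not describe the optimization objective or the three algorithms, so nothing here should be read as the authors' method. It assumes the common formulation in which each group of weights is approximated as alpha * sign(w), with one scale alpha per group. For signs fixed to sign(w), alpha = mean(|w|) minimizes the squared error. The function names `binarize_group`, `group_error`, and `average_bits`, and the 16-bit scale in the bit accounting, are hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple


def binarize_group(weights: Sequence[float]) -> Tuple[float, List[int]]:
    """Approximate one group of weights as alpha * s, with s in {-1, +1}.

    Assumption (generic scheme, not taken from the paper): the signs are
    sign(w), and alpha = mean(|w|), which minimizes the squared error
    sum((w - alpha * s) ** 2) for those signs.
    """
    if not weights:
        raise ValueError("group must contain at least one weight")
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs


def group_error(weights: Sequence[float]) -> float:
    """Squared reconstruction error of binarizing `weights` as one group."""
    alpha, signs = binarize_group(weights)
    return sum((w - alpha * s) ** 2 for w, s in zip(weights, signs))


def average_bits(group_size: int, scale_bits: int = 16) -> float:
    """Average bits per weight: 1 sign bit plus the amortized scale.

    Hypothetical accounting: one `scale_bits`-bit scale per group and no
    other metadata (for example, no group-membership indices).
    """
    return 1.0 + scale_bits / group_size


if __name__ == "__main__":
    weights = [0.1, -0.1, 2.0, -2.0]
    # One shared scale must compromise between small and large magnitudes.
    print("single group error:", group_error(weights))
    # Grouping weights of similar magnitude gives each group a better scale.
    print("two groups error:", group_error(weights[:2]) + group_error(weights[2:]))
    print("bits per weight, groups of 128:", average_bits(128))
```

In this toy example, splitting the weights into two groups of similar magnitude removes the error that a single shared scale incurs, which is the intuition behind choosing groups adaptively rather than using fixed blocks. Under the assumed accounting, larger groups amortize the per-group scale and push the average closer to 1 bit per weight.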
