Fugu-MT Paper Translation (Abstract): CARVQ: Corrective Adaptor with Group Residual Vector Quantization for LLM Embedding Compression

Paper overview: CARVQ: Corrective Adaptor with Group Residual Vector Quantization for LLM Embedding Compression

arxiv url: http://arxiv.org/abs/2510.12721v1
Date: Tue, 14 Oct 2025 17:00:13 GMT
Status: Translation complete
Last updated in system: 2025-10-15 21:19:15.001875
Title: CARVQ: Corrective Adaptor with Group Residual Vector Quantization for LLM Embedding Compression
Authors: Dayin Gou, Sanghyun Byun, Nilesh Malpeddi, Gabrielle De Micheli, Prathamesh Vaste, Jacob Song, Woo Seong Chung
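Abstract summary: Large Language Models (LLMs) rely on a large number of parameters for token embedding, leading to substantial storage requirements and memory footprints. This paper introduces CARVQ, a novel post-training Corrective Adaptor combined with group Residual Vector Quantization. CARVQ mimics the original model embedding to compress it to approximately 1.6 bits, without requiring specialized hardware to support lower-bit storage.
Reference score (independently computed attention score): 0.4104352271917982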
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) typically rely on a large number of parameters for token embedding, leading to substantial storage requirements and memory footprints. In particular, LLMs deployed on edge devices are memory-bound, and reducing the memory footprint by compressing the embedding layer not only frees up the memory bandwidth but also speeds up inference. To address this, we introduce CARVQ, a post-training novel Corrective Adaptor combined with group Residual Vector Quantization. CARVQ relies on the composition of both linear and non-linear maps and mimics the original model embedding to compress to approximately 1.6 bits without requiring specialized hardware to support lower-bit storage. We test our method on pre-trained LLMs such as LLaMA-3.2-1B, LLaMA-3.2-3B, LLaMA-3.2-3B-Instruct, LLaMA-3.1-8B, Qwen2.5-7B, Qwen2.5-Math-7B and Phi-4, evaluating on common generative, discriminative, math and reasoning tasks. We show that in most cases, CARVQ can achieve lower average bitwidth-per-parameter while maintaining reasonable perplexity and accuracy compared to scalar quantization. Our contributions include a novel compression technique that is compatible with state-of-the-art transformer quantization methods and can be seamlessly integrated into any hardware supporting 4-bit memory to reduce the model's memory footprint in memory-constrained devices. This work demonstrates a crucial step toward the efficient deployment of LLMs on edge devices.
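Illustrative sketch (not from the paper): the abstract does not describe how the group Residual Vector Quantization or the Corrective Adaptor are built, so the following is only a generic residual vector quantization example under stated assumptions. It assumes that "group" means splitting each embedding row into fixed-size sub-vectors that share codebooks, uses plain k-means to fit each stage, omits the Corrective Adaptor entirely, and counts only index bits (codebook storage is ignored). All function names and parameter values are hypothetical.

```python
import math
import numpy as np


def kmeans(x, k, iters, rng):
    """Plain Lloyd's k-means. Returns (centroids, nearest-centroid indices)."""
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dist.argmin(1)
        for j in range(k):
            members = x[assign == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(0)
    dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroids, dist.argmin(1)


def group_rvq(emb, group_size=4, codebook_size=16, stages=2, iters=10, seed=0):
    """Quantize an embedding table with multi-stage residual VQ.

    Assumption: each row is split into sub-vectors of length `group_size`,
    and every stage quantizes what the previous stages failed to capture.
    """
    rng = np.random.default_rng(seed)
    vocab, dim = emb.shape
    assert dim % group_size == 0, "dim must be divisible by group_size"
    residual = emb.astype(float).reshape(-1, group_size).copy()
    codebooks, codes = [], []
    for _ in range(stages):
        codebook, idx = kmeans(residual, codebook_size, iters, rng)
        codebooks.append(codebook)
        codes.append(idx)
        residual = residual - codebook[idx]  # pass the error to the next stage
    return codebooks, codes


def reconstruct(codebooks, codes, shape):
    """Sum the selected codewords of every stage and restore the table shape."""
    total = sum(cb[idx] for cb, idx in zip(codebooks, codes))
    return total.reshape(shape)


def bits_per_param(group_size, codebook_size, stages):
    """Index bits stored per original parameter (codebooks not counted)."""
    return stages * math.log2(codebook_size) / group_size
```

With these illustrative settings, each sub-vector of 4 parameters is stored as two 4-bit indices, so `bits_per_param(4, 16, 2)` evaluates to 2.0 index bits per parameter. The 4-bit index width is chosen here only to echo the abstract's mention of hardware supporting 4-bit memory; the paper's actual configuration may differ.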

Related papers