Fugu-MT Paper Translation (Abstract): RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On Device LLM Inference

Paper overview: RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On Device LLM Inference

arxiv url: http://arxiv.org/abs/2603.17891v1
Date: Wed, 18 Mar 2026 16:16:28 GMT
Status: Translation complete
Last updated in system: 2026-03-19 18:32:57.817882
Title: RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On Device LLM Inference
Title (reference translation): RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On Device LLM Inference
Authors: Arpit Singh Gautam, Saurabh Jha
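Abstract summary: RAMP (Reinforcement Adaptive Mixed Precision) learns per-layer bit-width assignments to minimize perplexity under a global bit budget. On Llama 2 7B, RAMP achieves 5.54 perplexity at 3.68 GB (3.65 effective bits), outperforming uniform 4-bit AWQ (5.60 at 3.90 GB) and GPTQ by 6% in size and 1% to 3% in quality.
Reference score (independently computed attention score): 1.1100764382749708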
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Post-training quantization is essential for deploying large language models (LLMs) on resource-constrained hardware, yet state-of-the-art methods enforce uniform bit-widths across layers, yielding suboptimal accuracy-efficiency trade-offs. We present RAMP (Reinforcement Adaptive Mixed Precision), an off-policy Soft Actor-Critic framework that learns per-layer bit-width assignments to minimize perplexity under a global bit budget. The policy conditions on an 11-dimensional embedding of activation statistics, weight properties, and structural descriptors, enabling zero-shot transfer across model families and scales. To enable stable sub-4-bit quantization, we introduce Scale Folding, a preconditioning technique that migrates activation outliers into weights via per-channel scaling and normalization-layer compensation. A quality-prioritized reward with asymmetric penalties and budget cliffs drives rapid convergence. On Llama 2 7B, RAMP achieves 5.54 perplexity at 3.68 GB (3.65 effective bits), outperforming uniform 4-bit AWQ (5.60 at 3.90 GB) and GPTQ by 6% in size and 1% to 3% in quality. Critically, a policy trained only on Llama 2 7B generalizes zero-shot to Llama 2 13B and Mistral 7B, often surpassing target-specific training, supporting the hypothesis that quantization sensitivity is primarily architectural. The HALO pipeline exports allocations to GGUF format for kernel-free inference on CPUs, GPUs, and edge devices, retaining 99.5% of FP16 commonsense reasoning performance.
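The abstract describes Scale Folding only at a high level (per-channel scaling plus normalization-layer compensation). The sketch below illustrates the general identity such a technique can rely on: dividing a normalization gain by a per-channel scale and multiplying the matching weight rows by the same scale leaves the layer output unchanged. The function names (`rms_norm`, `fold_scales`), the use of RMSNorm, and the SmoothQuant-style choice of scale are assumptions for illustration, not the paper's actual formulation.

```python
import numpy as np

def rms_norm(h, gain, eps=1e-6):
    """RMSNorm: divide each row by its root mean square, then apply a per-channel gain."""
    rms = np.sqrt(np.mean(h * h, axis=-1, keepdims=True) + eps)
    return h / rms * gain

def fold_scales(gain, weight, scale):
    """Fold a positive per-input-channel scale into the norm gain and the linear weight.

    gain:   (d_in,)        gain of the normalization layer feeding the linear layer
    weight: (d_in, d_out)  linear weight, applied as x @ weight
    scale:  (d_in,)        per-channel scale (how it is chosen is an assumption here)
    """
    return gain / scale, weight * scale[:, None]

rng = np.random.default_rng(0)
d_in, d_out = 8, 4
h = rng.normal(size=(16, d_in))
h[:, 3] *= 50.0                      # inject one outlier activation channel
gain = np.ones(d_in)
weight = rng.normal(size=(d_in, d_out))

# Hypothetical scale choice: balance activation and weight magnitudes per channel.
x = rms_norm(h, gain)
act_max = np.abs(x).max(axis=0)
w_max = np.abs(weight).max(axis=1)
scale = np.sqrt(act_max / w_max)

new_gain, new_weight = fold_scales(gain, weight, scale)
y_ref = rms_norm(h, gain) @ weight
y_folded = rms_norm(h, new_gain) @ new_weight

print("max abs difference:", np.abs(y_ref - y_folded).max())
```

Because the transformation is exact in full precision, any benefit would come only after quantization, when the rescaled weights and activations are easier to represent at low bit-widths.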
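Abstract (reference translation): Post-training quantization is essential for deploying large language models (LLMs) on resource-constrained hardware. We propose RAMP, an off-policy Soft Actor-Critic framework that learns per-layer bit-width assignments to minimize perplexity under a global bit budget. The policy conditions on an 11-dimensional embedding of activation statistics, weight properties, and structural descriptors, which enables zero-shot transfer across model families and scales. To achieve stable sub-4-bit quantization, we introduce Scale Folding, a preconditioning technique that migrates activation outliers into the weights through per-channel scaling and normalization-layer compensation. A quality-prioritized reward with asymmetric penalties and budget cliffs drives rapid convergence. On Llama 2 7B, RAMP achieves 5.54 perplexity at 3.68 GB (3.65 effective bits), outperforming uniform 4-bit AWQ (5.60 at 3.90 GB) and GPTQ by 6% in size and 1% to 3% in quality. Critically, a policy trained only on Llama 2 7B generalizes zero-shot to Llama 2 13B and Mistral 7B, often surpassing target-specific training, which supports the hypothesis that quantization sensitivity is primarily architectural. The HALO pipeline exports allocations to GGUF format for kernel-free inference on CPUs, GPUs, and edge devices, retaining 99.5% of FP16 commonsense reasoning performance.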
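As a rough check on the reported numbers, the following sketch computes a parameter-weighted average bit-width for a mixed-precision allocation and the relative gains implied by the AWQ comparison. The helper names and the toy layer sizes are hypothetical, and the simple weighted average is an assumption: the paper's "effective bits" may also count quantization metadata such as scales.

```python
def effective_bits(layer_params, layer_bits):
    """Parameter-weighted average bit-width across layers (assumed definition)."""
    total = sum(layer_params)
    return sum(p * b for p, b in zip(layer_params, layer_bits)) / total

def within_budget(layer_params, layer_bits, budget_bits):
    """True if the allocation's average bit-width does not exceed the global budget."""
    return effective_bits(layer_params, layer_bits) <= budget_bits

# Hypothetical toy model: four layers with different sizes and bit-widths.
params = [100, 100, 200, 400]
bits = [4, 4, 4, 3]
avg = effective_bits(params, bits)
print("average bits:", avg, "fits 3.65-bit budget:", within_budget(params, bits, 3.65))

# Figures quoted in the abstract for Llama 2 7B (RAMP vs. uniform 4-bit AWQ).
size_saving = 1 - 3.68 / 3.90   # relative reduction in model size
ppl_gain = 1 - 5.54 / 5.60      # relative reduction in perplexity
print(f"size saving: {size_saving:.1%}, perplexity gain: {ppl_gain:.1%}")
```

The AWQ comparison works out to roughly 5.6% smaller and 1.1% lower perplexity, consistent with the rounded "6% in size" and the lower end of "1% to 3% in quality". The GPTQ figures behind the upper end are not given in the abstract.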
