Fugu-MT Paper Translation (Abstract): Not All Bits Are Equal: Scale-Dependent Memory Optimization Strategies for Reasoning Models

Paper overview: Not All Bits Are Equal: Scale-Dependent Memory Optimization Strategies for Reasoning Models

arxiv url: http://arxiv.org/abs/2510.10964v1
Date: Mon, 13 Oct 2025 03:14:28 GMT
Status: Translation complete
Last updated in system: 2025-10-14 18:06:30.176021
Title: Not All Bits Are Equal: Scale-Dependent Memory Optimization Strategies for Reasoning Models
Title (reference translation): Not all bits are equal: scale-dependent memory optimization strategies for reasoning models
Authors: Junhyuck Kim, Ethan Ewer, Taehong Moon, Jongho Park, Dimitris Papailiopoulos
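Abstract summary: 4-bit quantization has emerged as a memory-optimal choice for non-reasoning models and zero-shot tasks across scales. The paper shows that this universal prescription fails for reasoning models, where the KV cache rather than model size can dominate memory. Models with an effective size below 8-bit 4B parameters achieve better accuracy by allocating memory to more weights rather than to longer generation.
Reference score (independently computed attention score): 10.604862875916103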
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While 4-bit quantization has emerged as a memory-optimal choice for non-reasoning models and zero-shot tasks across scales, we show that this universal prescription fails for reasoning models, where the KV cache rather than model size can dominate memory. Through systematic experiments across 1,700 inference scenarios on AIME25 and GPQA-Diamond, we find a scale-dependent trade-off: models with an effective size below 8-bit 4B parameters achieve better accuracy by allocating memory to more weights rather than longer generation, while larger models achieve better accuracy by allocating memory to longer generations. This scale threshold also determines when parallel scaling becomes memory-efficient and whether KV cache eviction outperforms KV quantization. Our findings show that memory optimization for LLMs cannot be scale-agnostic, while providing principled guidelines: for small reasoning models, prioritize model capacity over test-time compute, while for larger ones, maximize test-time compute. Our results suggest that optimizing reasoning models for deployment requires fundamentally different strategies from those established for non-reasoning models.
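Abstract (reference translation): Although 4-bit quantization has emerged as a memory-optimal choice for non-reasoning models and zero-shot tasks across scales, we show that this universal prescription fails for reasoning models, where the KV cache rather than model size can dominate memory. Through systematic experiments across 1,700 inference scenarios on AIME25 and GPQA-Diamond, we find a scale-dependent trade-off: models with an effective size below 8-bit 4B parameters achieve better accuracy by allocating memory to more weights rather than to longer generation, whereas larger models achieve better accuracy by allocating memory to longer generations. This scale threshold also determines when parallel scaling becomes memory-efficient and whether KV cache eviction outperforms KV quantization. Our findings show that memory optimization for LLMs cannot be scale-agnostic, and they provide principled guidelines: for small reasoning models, prioritize model capacity over test-time compute; for larger ones, maximize test-time compute. Our results suggest that optimizing reasoning models for deployment requires fundamentally different strategies from those established for non-reasoning models.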
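
The memory trade-off described in the abstract can be illustrated with back-of-the-envelope arithmetic. The sketch below is not taken from the paper: it assumes that "effective size" means parameter count times bits per weight, uses the standard KV cache size estimate (one key and one value vector per token, per layer, per KV head), and plugs in a hypothetical 4B-parameter model configuration (36 layers, 8 KV heads, head dimension 128) chosen only for illustration.

```python
def weight_bytes(num_params: float, bits_per_weight: float) -> float:
    """Memory taken by the model weights, in bytes."""
    return num_params * bits_per_weight / 8


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   num_tokens: int, bits_per_value: float = 16,
                   batch_size: int = 1) -> float:
    """Memory taken by the KV cache, in bytes.

    The factor 2 accounts for storing both a key and a value vector
    per token, per layer, per KV head.
    """
    values = 2 * num_layers * num_kv_heads * head_dim * num_tokens * batch_size
    return values * bits_per_value / 8


# Threshold named in the abstract, read here as 4B parameters at 8 bits
# (this reading of "effective size" is an assumption).
threshold = weight_bytes(4e9, 8)

# Hypothetical model configuration, for illustration only.
cfg = dict(num_layers=36, num_kv_heads=8, head_dim=128)

weights_4bit = weight_bytes(4e9, 4)                # 4B parameters at 4 bits
kv_long = kv_cache_bytes(**cfg, num_tokens=32768)  # 16-bit KV, 32k tokens

print(f"threshold (8-bit 4B): {threshold / 1e9:.2f} GB")
print(f"4-bit weights:        {weights_4bit / 1e9:.2f} GB")
print(f"KV cache, 32k tokens: {kv_long / 1e9:.2f} GB")
```

Under these assumed numbers, a single 32k-token generation needs more memory for its KV cache than the 4-bit weights occupy, which is the regime where the abstract says the KV cache, rather than model size, can dominate memory.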

