Fugu-MT Paper Translation (Abstract): Predictive Scheduling for Efficient Inference-Time Reasoning in Large Language Models

Paper summary: Predictive Scheduling for Efficient Inference-Time Reasoning in Large Language Models

arxiv url: http://arxiv.org/abs/2602.01237v1
Date: Sun, 01 Feb 2026 13:58:23 GMT
Status: Translation complete
Last updated in system: 2026-02-03 19:28:33.672616
Title: Predictive Scheduling for Efficient Inference-Time Reasoning in Large Language Models
Title (reference translation): Predictive Scheduling for Efficient Inference-Time Reasoning in Large Language Models
Authors: Katrina Brown, Aneesh Muppidi, Rana Shahout,
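Abstract summary: Large language models (LLMs) achieve state-of-the-art accuracy on complex reasoning tasks. However, using a fixed token budget per query leads to over-computation on easy inputs and under-computation on hard ones. The paper introduces Predictive Scheduling, a plug-and-play framework that pre-runs lightweight predictors to estimate each query's optimal reasoning length or difficulty before any full generation.
Reference score (independently computed attention score): 6.002670452103349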
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) achieve state-of-the-art accuracy on complex reasoning tasks by generating multiple chain-of-thought (CoT) traces, but using a fixed token budget per query leads to over-computation on easy inputs and under-computation on hard ones. We introduce Predictive Scheduling, a plug-and-play framework that pre-runs lightweight predictors, an MLP on intermediate transformer hidden states or a LoRA-fine-tuned classifier on raw question text, to estimate each query's optimal reasoning length or difficulty before any full generation. Our greedy batch allocator dynamically distributes a fixed total token budget across queries to maximize expected accuracy. On the GSM8K arithmetic benchmark, predictive scheduling yields up to 7.9 percentage points of absolute accuracy gain over uniform budgeting at identical token cost, closing over 50% of the gap to an oracle with perfect foresight. A systematic layer-wise study reveals that middle layers (12-17) of the transformer carry the richest signals for size estimation. These results demonstrate that pre-run budget prediction enables fine-grained control of the compute-accuracy trade-off, offering a concrete path toward latency-sensitive, cost-efficient LLM deployments.
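Abstract (reference translation): Large language models (LLMs) reach state-of-the-art accuracy on complex reasoning tasks by generating multiple chain-of-thought (CoT) traces, but a fixed token budget per query causes over-computation on easy inputs and under-computation on hard ones. Predictive Scheduling is a plug-and-play framework that pre-runs a lightweight predictor, either an MLP on intermediate transformer hidden states or a LoRA-fine-tuned classifier on the raw question text, to estimate each query's optimal reasoning length or difficulty before any full generation. A greedy batch allocator dynamically distributes a fixed total token budget across queries to maximize expected accuracy. On the GSM8K arithmetic benchmark, predictive scheduling improves accuracy by up to 7.9 percentage points over uniform budgeting at the same token cost, closing more than 50% of the gap to an oracle with perfect foresight. A systematic layer-wise study finds that the middle transformer layers (12-17) carry the richest signals for size estimation. These results show that pre-run budget prediction enables fine-grained control of the compute-accuracy trade-off and offers a concrete path toward latency-sensitive, cost-efficient LLM deployments.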
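
The abstract names a greedy batch allocator but does not spell out its procedure. The following is a minimal sketch of one plausible reading, not the authors' implementation: tokens are handed out in fixed-size steps, each step going to the query whose predicted accuracy would rise the most. The names `greedy_allocate`, `expected_accuracy`, and `saturating_curve`, the step size, and the linear-then-flat accuracy curve are all assumptions made for illustration.

```python
import heapq
from typing import Callable, List


def greedy_allocate(
    num_queries: int,
    total_budget: int,
    step: int,
    expected_accuracy: Callable[[int, int], float],
) -> List[int]:
    """Split total_budget tokens across queries in increments of `step`.

    expected_accuracy(i, b) is a hypothetical predictor output: the estimated
    probability that query i is answered correctly with b reasoning tokens.
    """
    budgets = [0] * num_queries

    def gain(i: int) -> float:
        # Predicted accuracy improvement from giving query i one more step.
        b = budgets[i]
        return expected_accuracy(i, b + step) - expected_accuracy(i, b)

    # heapq is a min-heap, so gains are negated; ties fall back to the index.
    heap = [(-gain(i), i) for i in range(num_queries)]
    heapq.heapify(heap)

    remaining = total_budget
    while remaining >= step and heap:
        neg_gain, i = heapq.heappop(heap)
        if -neg_gain <= 0.0:
            break  # no query is predicted to benefit from more tokens
        budgets[i] += step
        remaining -= step
        # Only query i changed, so only its gain needs recomputing.
        heapq.heappush(heap, (-gain(i), i))
    return budgets


def saturating_curve(predicted_lengths: List[int]) -> Callable[[int, int], float]:
    """Assumed toy curve: accuracy grows linearly up to the predicted length."""
    def curve(i: int, b: int) -> float:
        return min(1.0, b / predicted_lengths[i])
    return curve


# Three queries with assumed predicted reasoning lengths, 600 tokens in total.
lengths = [100, 400, 250]
print(greedy_allocate(3, 600, 50, saturating_curve(lengths)))  # [100, 250, 250]
```

Under these assumptions, a uniform split of 200 tokens per query would overspend on the first query and cut the third one short, while the greedy split fully covers the two shorter queries and gives the remainder to the longest one. Stepwise greedy allocation of this kind is only guaranteed to maximize the summed expected accuracy when each query's curve has diminishing returns; with other curve shapes it is a heuristic.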

Related papers