Fugu-MT Paper Translation (Abstract): GlimpRouter: Efficient Collaborative Inference by Glimpsing One Token of Thoughts

Paper overview: GlimpRouter: Efficient Collaborative Inference by Glimpsing One Token of Thoughts

arxiv url: http://arxiv.org/abs/2601.05110v1
Date: Thu, 08 Jan 2026 16:58:07 GMT
Status: Translation complete
System update date: 2026-01-09 17:01:53.290969
Title: GlimpRouter: Efficient Collaborative Inference by Glimpsing One Token of Thoughts
Title (reference translation): GlimpRouter: Efficient collaborative inference by glimpsing a single token of thought
Authors: Wenhao Zeng, Xuteng Zhang, Yuling Shi, Chao Hu, Yuting Chen, Beijun Shen, Xiaodong Gu
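Abstract summary: Collaborative inference offers a promising solution by selectively allocating work between lightweight and large models. The paper proposes a new perspective on step-wise collaboration: the difficulty of a reasoning step can be inferred from its first token. GlimpRouter uses a lightweight model to generate only the first token of each reasoning step and routes the step to a larger model only when the initial token entropy exceeds a threshold.
Reference score (independently computed attention score): 10.808072653940263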
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Reasoning Models (LRMs) achieve remarkable performance by explicitly generating multi-step chains of thought, but this capability incurs substantial inference latency and computational cost. Collaborative inference offers a promising solution by selectively allocating work between lightweight and large models, yet a fundamental challenge remains: determining when a reasoning step requires the capacity of a large model or the efficiency of a small model. Existing routing strategies either rely on local token probabilities or post-hoc verification, introducing significant inference overhead. In this work, we propose a novel perspective on step-wise collaboration: the difficulty of a reasoning step can be inferred from its very first token. Inspired by the "Aha Moment" phenomenon in LRMs, we show that the entropy of the initial token serves as a strong predictor of step difficulty. Building on this insight, we introduce GlimpRouter, a training-free step-wise collaboration framework. GlimpRouter employs a lightweight model to generate only the first token of each reasoning step and routes the step to a larger model only when the initial token entropy exceeds a threshold. Experiments on multiple benchmarks demonstrate that our approach significantly reduces inference latency while preserving accuracy. For instance, GlimpRouter attains a substantial 10.7% improvement in accuracy while reducing inference latency by 25.9% compared to a standalone large model on AIME25. These results suggest a simple yet effective mechanism for reasoning: allocating computation based on a glimpse of thought rather than full-step evaluation.
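Abstract (reference translation): Large reasoning models (LRMs) achieve remarkable performance by explicitly generating multi-step chains of thought, but this capability brings considerable inference latency and computational cost. Collaborative inference offers a promising solution by selectively allocating work between lightweight and large models, but a basic challenge remains: deciding whether a reasoning step needs the capacity of a large model or the efficiency of a small one. Existing routing strategies depend either on local token probabilities or on post-hoc verification, and both incur significant inference overhead. This paper proposes a new perspective on step-wise collaboration: the difficulty of a reasoning step can be inferred from its first token. Inspired by the "Aha Moment" phenomenon in LRMs, the authors show that the entropy of the initial token is a strong predictor of step difficulty. Building on this insight, they introduce GlimpRouter, a training-free step-wise collaboration framework. GlimpRouter uses a lightweight model to generate only the first token of each reasoning step and routes the step to a larger model only when the initial token entropy exceeds a threshold. Experiments on multiple benchmarks show that the approach substantially reduces inference latency while preserving accuracy. For example, on AIME25, GlimpRouter improves accuracy by 10.7% while reducing inference latency by 25.9% compared with a standalone large model. These results point to a simple and effective mechanism for reasoning: allocating computation based on a glimpse of thought rather than on full-step evaluation.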
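The routing rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of natural-log (nat) entropy, and the threshold value are assumptions made for illustration, and the abstract does not say how reasoning steps are delimited or how the threshold is chosen.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw logits to probabilities (max-subtracted for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy, in nats, of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def route_step(first_token_logits: Sequence[float], threshold: float) -> str:
    """Hypothetical router: decide which model should generate a step.

    `first_token_logits` stands for the small model's logits for the first
    token of the step. The step goes to the large model only when the
    entropy of that distribution exceeds `threshold` (an assumed value).
    """
    entropy = token_entropy(softmax(first_token_logits))
    return "large" if entropy > threshold else "small"


# A flat distribution (uncertain first token) is routed to the large model;
# a sharply peaked one (confident first token) stays on the small model.
print(route_step([0.0, 0.0, 0.0, 0.0], threshold=1.0))
print(route_step([10.0, 0.0, 0.0, 0.0], threshold=1.0))
```

In a real system the logits would come from the lightweight model's forward pass at the start of each step, so the routing decision costs one token of small-model generation.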

Related papers