Fugu-MT Paper Translation (Abstract): Adaptive Test-Time Reasoning via Reward-Guided Dual-Phase Search

Paper overview: Adaptive Test-Time Reasoning via Reward-Guided Dual-Phase Search

arxiv url: http://arxiv.org/abs/2509.25420v1
Date: Mon, 29 Sep 2025 19:27:23 GMT
Status: Translation complete
Last updated in system: 2025-10-01 17:09:04.287206
Title: Adaptive Test-Time Reasoning via Reward-Guided Dual-Phase Search
Title (reference translation): Adaptive test-time reasoning with Reward-Guided Dual-Phase Search
Authors: Yingqian Cui, Zhenwei Dai, Pengfei He, Bing He, Hui Liu, Xianfeng Tang, Jingying Zeng, Suhang Wang, Yue Xing, Jiliang Tang, Benoit Dumoulin
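Abstract summary: This paper proposes a dual-phase test-time scaling framework that separates reasoning into planning and execution. Specifically, it decomposes reasoning trajectories and builds a reward model for each phase, enabling the search to explore and prune plans and executions separately. Experiments on both mathematical reasoning and code generation benchmarks show that the method consistently improves accuracy while reducing redundant computation.
Reference score (attention score computed independently by this site): 62.1546099504045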
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have achieved significant advances in reasoning tasks. A key approach is tree-based search with verifiers, which expands candidate reasoning paths and uses reward models to guide pruning and selection. Although effective in improving accuracy, these methods are not optimal in terms of efficiency: they apply a simple decomposition to the reasoning process but ignore the planning-execution nature of tasks such as math reasoning or code generation. This results in inefficient exploration of the reasoning process. To address this, we propose a dual-phase test-time scaling framework that explicitly separates reasoning into planning and execution, and performs search over the two phases individually. Specifically, we decompose reasoning trajectories and develop reward models for each phase, enabling the search to explore and prune plans and executions separately. We further introduce a dynamic budget allocation mechanism that adaptively redistributes sampling effort based on reward feedback, allowing early stopping on confident steps and reallocation of computation to more challenging parts of the reasoning process. Experiments on both mathematical reasoning and code generation benchmarks demonstrate that our approach consistently improves accuracy while reducing redundant computation.
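Abstract (reference translation): Large language models (LLMs) have made major progress on reasoning tasks. A key approach is tree-based search with a verifier, which expands candidate reasoning paths and uses reward models to guide pruning and selection. While effective at improving accuracy, these methods are not optimal in terms of efficiency: they apply a simple decomposition to the reasoning process but ignore the planning-execution nature of tasks such as mathematical reasoning and code generation. This leads to inefficient exploration of the reasoning process. This work therefore proposes a dual-phase test-time scaling framework that explicitly separates reasoning into planning and execution and searches over the two phases individually. Specifically, it decomposes reasoning trajectories and builds a reward model for each phase, enabling the search to explore and prune plans and executions separately. It further introduces a dynamic budget allocation mechanism that adaptively redistributes sampling effort based on reward feedback, allowing early stopping on high-confidence steps and reallocation of computation to the more difficult parts of the reasoning process. Experiments on both mathematical reasoning and code generation benchmarks show that the method consistently improves accuracy while reducing redundant computation.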
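
The abstract names three ingredients: separate search over plans and executions, a reward model per phase, and a budget that stops early on confident steps and moves the saved samples elsewhere. The Python sketch below illustrates that general idea for a single reasoning step. It is not the authors' implementation: the function names, the 50/50 initial budget split, the fixed confidence threshold, and the rule that unused planning samples roll over to execution are all assumptions made for illustration, and the proposal and scoring callables are hypothetical stand-ins for an LLM sampler and trained reward models.

```python
import random
from typing import Callable, Dict, Tuple


def sample_until_confident(
    propose: Callable[[], str],
    score: Callable[[str], float],
    max_samples: int,
    threshold: float,
) -> Tuple[str, float, int]:
    """Draw up to max_samples candidates and keep the highest-scoring one.

    Stops early as soon as the best score reaches the threshold.
    Returns (best_candidate, best_score, samples_used).
    """
    best, best_score, used = "", float("-inf"), 0
    for _ in range(max_samples):
        candidate = propose()
        used += 1
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
        if best_score >= threshold:
            break  # confident enough: stop spending samples on this phase
    return best, best_score, used


def dual_phase_step(
    propose_plan: Callable[[], str],
    score_plan: Callable[[str], float],
    propose_exec: Callable[[str], str],
    score_exec: Callable[[str, str], float],
    budget: int,
    threshold: float = 0.9,
) -> Dict[str, object]:
    """One reasoning step: search over plans, then over executions of the best plan.

    Assumption (not from the paper): planning may use at most half the budget,
    and whatever it does not use is reallocated to the execution phase.
    """
    if budget < 2:
        raise ValueError("budget must allow at least one sample per phase")

    # Phase 1: planning, scored by a (hypothetical) plan reward model.
    plan_cap = budget // 2
    plan, plan_score, plan_used = sample_until_confident(
        propose_plan, score_plan, plan_cap, threshold
    )

    # Phase 2: execution, scored by a (hypothetical) execution reward model.
    exec_cap = budget - plan_used  # includes samples saved by early stopping
    execution, exec_score, exec_used = sample_until_confident(
        lambda: propose_exec(plan),
        lambda candidate: score_exec(plan, candidate),
        exec_cap,
        threshold,
    )

    return {
        "plan": plan,
        "plan_score": plan_score,
        "plan_samples": plan_used,
        "execution": execution,
        "exec_score": exec_score,
        "exec_samples": exec_used,
    }


if __name__ == "__main__":
    # Random stubs in place of a real sampler and reward models.
    rng = random.Random(0)
    result = dual_phase_step(
        propose_plan=lambda: f"plan-{rng.randint(0, 99)}",
        score_plan=lambda plan: rng.random(),
        propose_exec=lambda plan: f"{plan}/exec-{rng.randint(0, 99)}",
        score_exec=lambda plan, execution: rng.random(),
        budget=8,
    )
    print(result)
```

In this sketch an easy plan that clears the threshold on the first sample leaves almost the whole budget for execution, which is one simple way to read "reallocation of computation to more challenging parts"; the paper's actual allocation rule may differ.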

Related papers