Fugu-MT paper translation (abstract): Dynamics-Predictive Sampling for Active RL Finetuning of Large Reasoning Models

Paper overview: Dynamics-Predictive Sampling for Active RL Finetuning of Large Reasoning Models

arxiv url: http://arxiv.org/abs/2603.10887v1
Date: Wed, 11 Mar 2026 15:31:14 GMT
Status: translation complete
Last updated in system: 2026-03-12 16:22:33.03304
Title: Dynamics-Predictive Sampling for Active RL Finetuning of Large Reasoning Models
Title (reference translation): Dynamics-predictive sampling for active RL finetuning of large reasoning models
Authors: Yixiu Mao, Yun Qu, Qi Wang, Heming Zou, Xiangyang Ji
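Abstract summary: Reinforcement learning (RL) finetuning has become a key technique for enhancing the reasoning abilities of large language models (LLMs). Recent advances highlight the importance of online prompt selection methods, which concentrate training on partially solved or moderately challenging examples. This work proposes Dynamics-Predictive Sampling (DPS), which predicts and selects informative prompts by inferring their learning dynamics prior to costly rollouts.
Reference score (independently computed attention score): 49.04912820721943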
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement learning (RL) finetuning has become a key technique for enhancing the reasoning abilities of large language models (LLMs). However, its effectiveness critically depends on the selection of training data. Recent advances underscore the importance of online prompt selection methods, which typically concentrate training on partially solved or moderately challenging examples under the current policy, thereby yielding more effective model updates. While significantly accelerating RL finetuning in terms of training steps, they also incur substantial computational overhead by requiring extensive LLM rollouts over large candidate batches to identify informative samples, an expense that can outweigh the finetuning process itself. To address this challenge, this work proposes Dynamics-Predictive Sampling (DPS), which online predicts and selects informative prompts by inferring their learning dynamics prior to costly rollouts. Specifically, we introduce a new perspective by modeling each prompt's solving progress during RL finetuning as a dynamical system, where the extent of solving is represented as the state and the transition is characterized by a hidden Markov model. Using historical rollout reward signals, we perform online Bayesian inference to estimate evolving state distributions, and the inference outcome provides a predictive prior for efficient prompt selection without rollout-intensive filtering. Empirical results across diverse reasoning tasks, including mathematics, planning, and visual geometry, demonstrate that DPS substantially reduces redundant rollouts, accelerates the training process, and achieves superior reasoning performance.
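Abstract (reference translation): Reinforcement learning (RL) finetuning has become a key technique for enhancing the reasoning abilities of large language models (LLMs). Its effectiveness, however, depends heavily on the selection of training data. Recent advances emphasize the importance of online prompt selection methods, which typically concentrate training on examples that are partially solved or moderately difficult under the current policy and thereby yield more effective model updates. Although these methods significantly accelerate RL finetuning in terms of training steps, they require LLM rollouts over large candidate batches to identify informative samples, a cost that can exceed that of the finetuning process itself. To address this challenge, this work proposes Dynamics-Predictive Sampling (DPS). Specifically, it introduces a new perspective by modeling each prompt's solving progress during RL finetuning as a dynamical system, in which the extent of solving is the state and the transition is characterized by a hidden Markov model. Using historical rollout reward signals, it performs online Bayesian inference, and the inference outcome provides a predictive prior for efficient prompt selection without rollout-intensive filtering. Empirical results across diverse reasoning tasks, including mathematics, planning, and visual geometry, demonstrate that DPS substantially reduces redundant rollouts, accelerates training, and achieves superior reasoning performance.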
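The abstract describes the mechanism only at a high level: a hidden Markov model over each prompt's extent of solving, online Bayesian inference from past rollout rewards, and a predictive prior used to choose prompts before rolling them out. The sketch below is one possible reading of that idea, not the authors' implementation. The three-state discretisation, the transition and emission matrices, and all function names (`categorize`, `predict`, `update`, `select_prompts`) are assumptions made for illustration; none of them appear in the abstract.

```python
from typing import Dict, List, Optional, Sequence

# Assumed discretisation of a prompt's "extent of solving" (not from the paper):
# 0 = unsolved, 1 = partially solved, 2 = fully solved.
N_STATES = 3

# Assumed transition matrix: TRANSITION[i][j] = P(next state j | current state i).
TRANSITION = [
    [0.85, 0.14, 0.01],
    [0.05, 0.80, 0.15],
    [0.01, 0.09, 0.90],
]

# Assumed emission matrix: EMISSION[i][o] = P(observed category o | hidden state i).
EMISSION = [
    [0.90, 0.09, 0.01],
    [0.10, 0.80, 0.10],
    [0.01, 0.09, 0.90],
]


def categorize(rewards: Sequence[float]) -> int:
    """Map one prompt's rollout rewards to an observed category (0, 1 or 2)."""
    solved = sum(1 for r in rewards if r > 0)
    if solved == 0:
        return 0
    if solved == len(rewards):
        return 2
    return 1


def update(prior: List[float], obs: Optional[int]) -> List[float]:
    """Bayes update of the state belief; with no rollout, the prior is kept."""
    if obs is None:
        return list(prior)
    unnorm = [prior[s] * EMISSION[s][obs] for s in range(N_STATES)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def predict(belief: List[float]) -> List[float]:
    """Push the belief through the transition model to get next-step prior."""
    return [
        sum(belief[i] * TRANSITION[i][j] for i in range(N_STATES))
        for j in range(N_STATES)
    ]


def select_prompts(priors: List[List[float]], k: int) -> List[int]:
    """Pick the k prompts most likely to be partially solved at the next step."""
    ranked = sorted(range(len(priors)), key=lambda i: priors[i][1], reverse=True)
    return ranked[:k]


# Toy usage: four prompts with uniform initial beliefs; prompt 3 has no history.
beliefs = [[1.0 / N_STATES] * N_STATES for _ in range(4)]
history: Dict[int, List[float]] = {
    0: [0, 0, 0, 0],  # every rollout failed
    1: [1, 0, 1, 0],  # mixed outcomes
    2: [1, 1, 1, 1],  # every rollout succeeded
}
posteriors = [
    update(b, categorize(history[i]) if i in history else None)
    for i, b in enumerate(beliefs)
]
priors = [predict(p) for p in posteriors]
chosen = select_prompts(priors, k=2)
print(chosen)
```

With these assumed matrices, the toy run selects prompts 1 and 3: the prompt with mixed outcomes is most likely to remain partially solved, and the prompt with no rollout history ranks next because its belief is still uniform. No LLM rollouts are needed for the selection step itself, which is the cost saving the abstract emphasizes.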


Related papers