Fugu-MT Paper Translation (Abstract): Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR

Paper summary: Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR

arxiv url: http://arxiv.org/abs/2605.15726v1
Date: Fri, 15 May 2026 08:22:59 GMT
Status: Translation complete
System update date: 2026-05-18 21:22:26.219299
Title: Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR
Authors: Chanuk Lee, Sangwoo Park, Minki Kang, Sung Ju Hwang
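Abstract summary: Reinforcement learning with verifiable rewards (RLVR) has emerged as a scalable paradigm for improving the reasoning capabilities of large language models. We propose NudgeRL, a framework for structured and diversity-driven exploration in RLVR. Our approach introduces Strategy Nudging, which conditions each rollout on lightweight, strategy-level contexts.
Reference score (independently computed attention score): 53.27792011950384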
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as a scalable paradigm for improving the reasoning capabilities of large language models. However, its effectiveness is fundamentally limited by exploration: the policy can only improve on trajectories it has already sampled. While increasing the number of rollouts alleviates this issue, such brute-force scaling is computationally expensive, and existing approaches that modify the optimization objective provide limited control over what is explored. In this work, we propose NudgeRL, a framework for structured and diversity-driven exploration in RLVR. Our approach introduces Strategy Nudging, which conditions each rollout on lightweight, strategy-level contexts to induce diverse reasoning trajectories without relying on expensive oracle supervision. To effectively learn from such structured exploration, we further propose a unified objective, which decomposes the reward signal into inter- and intra-context components and incorporates a distillation objective to transfer discovered behaviors back to the base policy. Empirically, NudgeRL outperforms standard GRPO with up to 8 times larger rollout budgets, while outperforming oracle-guided RL baseline on average across five challenging math benchmarks. These results demonstrate that structured, context-driven exploration can serve as an efficient and scalable alternative to both brute-force rollout scaling and feasibility-oriented methods based on privileged information. Our code is available at https://github.com/tally0818/NudgeRL.
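The abstract says the reward signal is decomposed into inter- and intra-context components but does not give the formula. As a rough illustration only, the sketch below uses one natural reading: each rollout's mean-centered reward is split into a between-context term (how its strategy context compares with the overall average) and a within-context term (how the rollout compares with others under the same context). The function name `decompose_advantages`, the context labels, and the decomposition itself are assumptions made for illustration; they are not taken from the paper or from the NudgeRL repository.

```python
from collections import defaultdict
from statistics import mean


def decompose_advantages(rewards, contexts):
    """Split each centered reward into inter- and intra-context parts.

    Hypothetical sketch, not the paper's objective.
    rewards:  one scalar verifiable reward per rollout
    contexts: one strategy-context label per rollout
    Returns (inter, intra) with inter[i] + intra[i] == rewards[i] - mean(rewards).
    """
    overall = mean(rewards)

    # Group rewards by the strategy context each rollout was conditioned on.
    by_context = defaultdict(list)
    for reward, context in zip(rewards, contexts):
        by_context[context].append(reward)
    context_mean = {c: mean(rs) for c, rs in by_context.items()}

    # Inter-context: how good a strategy is relative to all rollouts.
    inter = [context_mean[c] - overall for c in contexts]
    # Intra-context: how good a rollout is relative to its own strategy.
    intra = [r - context_mean[c] for r, c in zip(rewards, contexts)]
    return inter, intra


# Six rollouts for one prompt, two per (made-up) strategy context.
rewards = [1.0, 0.0, 0.0, 0.0, 1.0, 1.0]
contexts = ["algebraic", "algebraic", "geometric", "geometric",
            "casework", "casework"]
inter, intra = decompose_advantages(rewards, contexts)
```

Under this reading, the inter-context term credits strategies that succeed more often than average, while the intra-context term distinguishes rollouts within a single strategy. How the paper weights or normalizes the two terms, and how the distillation objective enters, is not stated in the abstract.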

Related papers