Fugu-MT Paper Translation (Abstract): What Can You Do When You Have Zero Rewards During RL?

Paper overview: What Can You Do When You Have Zero Rewards During RL?

arxiv url: http://arxiv.org/abs/2510.03971v1
Date: Sat, 04 Oct 2025 23:10:38 GMT
Status: Translation complete
Last updated in system: 2025-10-07 16:52:59.365519
Title: What Can You Do When You Have Zero Rewards During RL?
Authors: Jatin Prakash, Anirudh Buvanesh
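Abstract summary: Reinforcement learning (RL) with outcome-based rewards has proven effective for improving large language models (LLMs) on complex reasoning tasks. This paper studies the zero-reward scenario, in which the base model never samples a correct solution, through the graph search task introduced in Bachmann et al. (2024), and evaluates recent methods that incorporate desirable components. A simple data-centric intervention of adding easier samples to the training set enables the model to eventually solve the original hard task despite starting from zero reward.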
Reference score (attention score computed by Fugu-MT): 3.0795668932789515
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning (RL) with outcome-based rewards has proven effective for improving large language models (LLMs) on complex reasoning tasks. However, its success often depends on the base model occasionally sampling correct solutions. When no correct solutions are sampled, training encounters a zero-reward barrier where learning stalls due to zero gradients. We study this scenario through the graph search task introduced in Bachmann et al. (2024) and evaluate recent methods that incorporate desirable components such as dense rewards, diversity incentives, and improved credit assignment. Our experiments show that none of these approaches overcome the zero-reward barrier if the base model never produces a correct answer. In contrast, we find that a simple data-centric intervention of adding easier samples to the training set enables the model to eventually solve the original hard task despite starting from zero reward. Importantly, this succeeds without modifying the RL algorithm itself. Because official implementations of several baselines were unavailable, we developed our own, which allowed us to conduct a detailed analysis of their failure modes. We release these implementations to support further research at: https://github.com/rl4reasoning/rl-baselines
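The zero-reward barrier described in the abstract can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: it assumes a plain REINFORCE estimator on a four-action softmax policy, and the names `reinforce_gradient`, `hard_reward`, `easy_reward`, and `base_logits` are hypothetical. It shows only that a batch in which no sample is rewarded gives a zero gradient estimate, while a batch that also contains easier, sometimes-solved samples does not.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_gradient(logits, reward_fn, num_samples, rng):
    """Monte Carlo REINFORCE estimate of the gradient of expected reward
    with respect to the logits of a softmax policy."""
    probs = softmax(logits)
    actions = range(len(logits))
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        action = rng.choices(actions, weights=probs)[0]
        reward = reward_fn(action)
        for k in actions:
            # d log pi(action) / d logit_k = 1[k == action] - pi(k)
            score = (1.0 if k == action else 0.0) - probs[k]
            grad[k] += reward * score / num_samples
    return grad


# Hypothetical toy setup: action 3 is the only correct answer on the hard
# task, and the base policy gives it negligible probability.
base_logits = [0.0, 0.0, 0.0, -50.0]


def hard_reward(action):
    # Outcome-based reward: 1 only for the correct (never sampled) action.
    return 1.0 if action == 3 else 0.0


def easy_reward(action):
    # Assumed easier task that the base policy already solves sometimes.
    return 1.0 if action == 0 else 0.0


rng = random.Random(0)
hard_grad = reinforce_gradient(base_logits, hard_reward, 64, rng)
easy_grad = reinforce_gradient(base_logits, easy_reward, 64, rng)
# Mixed batch: half hard samples, half easier samples.
mixed_grad = [(h + e) / 2 for h, e in zip(hard_grad, easy_grad)]

print("hard only:", hard_grad)
print("hard + easy:", mixed_grad)
```

In this toy, every sample on the hard task receives zero reward, so the hard-only estimate carries no learning signal, whereas the mixed batch produces a nonzero update. Whether such updates eventually transfer to the hard task is the paper's empirical finding and is not modeled here.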

