Fugu-MT Paper Translation (Summary): DyJR: Preserving Diversity in Reinforcement Learning with Verifiable Rewards via Dynamic Jensen-Shannon Replay

Paper summary: DyJR: Preserving Diversity in Reinforcement Learning with Verifiable Rewards via Dynamic Jensen-Shannon Replay

arxiv url: http://arxiv.org/abs/2603.16157v1
Date: Tue, 17 Mar 2026 06:20:56 GMT
Status: Translation complete
Last updated in system: 2026-03-18 17:42:07.126503
Title: DyJR: Preserving Diversity in Reinforcement Learning with Verifiable Rewards via Dynamic Jensen-Shannon Replay
Title (reference translation): DyJR: Preserving Diversity in Reinforcement Learning with Verifiable Rewards via Dynamic Jensen-Shannon Replay
Authors: Long Li, Zhijian Zhou, Tianyi Wang, Weidi Xu, Zuming Huang, Wei Chu, Zhe Wang, Shirui Pan, Chao Qu, Yuan Qi
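Abstract summary: Existing experience replay methods address the sample inefficiency of on-policy algorithms such as GRPO by reusing accurate samples for direct policy updates. We argue that historical data should prioritize sustaining diversity rather than simply reinforcing accuracy. We propose DyJR, a simple yet effective regularization framework.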
Reference score (independently computed attention score): 57.80564154223355
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While Reinforcement Learning (RL) enhances Large Language Model reasoning, on-policy algorithms like GRPO are sample-inefficient as they discard past rollouts. Existing experience replay methods address this by reusing accurate samples for direct policy updates, but this often incurs high computational costs and causes mode collapse via overfitting. We argue that historical data should prioritize sustaining diversity rather than simply reinforcing accuracy. To this end, we propose Dynamic Jensen-Shannon Replay (DyJR), a simple yet effective regularization framework using a dynamic reference distribution from recent trajectories. DyJR introduces two innovations: (1) A Time-Sensitive Dynamic Buffer that uses FIFO and adaptive sizing to retain only temporally proximal samples, synchronizing with model evolution; and (2) Jensen-Shannon Divergence Regularization, which replaces direct gradient updates with a distributional constraint to prevent diversity collapse. Experiments on mathematical reasoning and Text-to-SQL benchmarks demonstrate that DyJR significantly outperforms GRPO as well as baselines such as RLEP and Ex-GRPO, while maintaining training efficiency comparable to the original GRPO. Furthermore, from the perspective of Rank-$k$ token probability evolution, we show that DyJR enhances diversity and mitigates over-reliance on Rank-1 tokens, elucidating how specific sub-modules of DyJR influence the training dynamics.
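As a reminder of the quantity that the regularizer is named after, the following is a minimal sketch of the standard Jensen-Shannon divergence between two discrete distributions. It illustrates the textbook definition only; the abstract does not specify how DyJR forms its reference distribution or weights the term in the loss, so the function names and the example probability vectors here are assumptions made for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same support.

    Terms with p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical next-token distributions (not taken from the paper):
# a sharply peaked current policy versus a more diverse reference.
current_policy = [0.97, 0.02, 0.01]
reference = [0.60, 0.30, 0.10]
penalty = js_divergence(current_policy, reference)
```

Unlike the KL divergence, this quantity is symmetric in its arguments and bounded above by log 2 (in nats), which keeps the penalty finite even when the two distributions share little support.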
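Abstract (reference translation): Reinforcement Learning (RL) enhances the reasoning of large language models, but on-policy algorithms such as GRPO are sample-inefficient because they discard past rollouts. Existing experience replay methods address this by reusing accurate samples for direct policy updates, but this often incurs high computational cost and causes mode collapse through overfitting. We argue that historical data should prioritize sustaining diversity rather than simply reinforcing accuracy. To this end, we propose Dynamic Jensen-Shannon Replay (DyJR), a simple yet effective regularization framework that uses a dynamic reference distribution built from recent trajectories. DyJR introduces two innovations: (1) a Time-Sensitive Dynamic Buffer that uses FIFO and adaptive sizing to retain only temporally proximal samples, keeping it synchronized with model evolution; and (2) Jensen-Shannon Divergence Regularization, which replaces direct gradient updates with a distributional constraint to prevent diversity collapse. Experiments on mathematical reasoning and Text-to-SQL benchmarks show that DyJR significantly outperforms GRPO as well as baselines such as RLEP and Ex-GRPO, while maintaining training efficiency comparable to the original GRPO. Furthermore, from the perspective of Rank-$k$ token probability evolution, we show that DyJR enhances diversity and mitigates over-reliance on Rank-1 tokens, clarifying how specific sub-modules of DyJR influence the training dynamics.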
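The Time-Sensitive Dynamic Buffer in the abstract is described only as FIFO with adaptive sizing. A minimal sketch of that general idea, assuming a bounded queue whose capacity can be changed during training, might look like the following. The class name `FifoReplayBuffer`, its methods, and the `resize` hook are hypothetical; the rule DyJR actually uses to choose the buffer size is not given in the abstract and is not modeled here.

```python
from collections import deque


class FifoReplayBuffer:
    """Bounded first-in, first-out store that keeps only the newest items."""

    def __init__(self, max_size):
        # A deque with maxlen drops the oldest item when a new one is appended.
        self._items = deque(maxlen=max_size)

    def add(self, trajectory):
        """Append a trajectory, evicting the oldest one if the buffer is full."""
        self._items.append(trajectory)

    def resize(self, new_max_size):
        """Change capacity (hypothetical hook); on shrink, keep newest entries."""
        self._items = deque(self._items, maxlen=new_max_size)

    def contents(self):
        """Return stored trajectories from oldest to newest."""
        return list(self._items)

    def __len__(self):
        return len(self._items)


# Usage: integers stand in for rollouts collected at successive steps.
buffer = FifoReplayBuffer(max_size=3)
for step in range(5):
    buffer.add(step)
```

Because eviction is purely by age, the stored samples always come from the most recent policy versions, which is the property the abstract refers to as synchronizing with model evolution.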


Related papers