Fugu-MT Paper Translation (Abstract): Reasoning or Memorization? Direction-Aware Diversity Exploration in LLM Reinforcement Learning

Paper abstract: Reasoning or Memorization? Direction-Aware Diversity Exploration in LLM Reinforcement Learning

arxiv url: http://arxiv.org/abs/2606.10346v1
Date: Tue, 09 Jun 2026 02:55:12 GMT
Status: Translation complete
Last updated in system: 2026-06-10 15:40:58.27639
Title: Reasoning or Memorization? Direction-Aware Diversity Exploration in LLM Reinforcement Learning
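Title (reference translation): Reasoning and Memorization: Direction-Aware Diversity Exploration in LLM Reinforcement Learning
Authors: Jiangnan Xia, Yucheng Shi, Yu Yang, Kishan Panaganti, Zhenwen Liang, Ninghao Liu
Abstract summary: Reinforcement learning has become a key paradigm for eliciting reasoning abilities in large language models. The paper proposes DiRL, a direction-aware reinforcement learning framework that explores along the policy's internal reasoning-memorization direction.
Reference score (independently computed attention score): 40.73985999918812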
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning has become a key paradigm for eliciting reasoning abilities in large language models, where exploration is crucial for discovering effective solution trajectories. Existing exploration methods typically encourage diversity in semantic or gradient spaces, without distinguishing what drives this diversity. A trajectory may appear novel because it follows a new reasoning process, or because it varies memorized patterns and shortcuts. Rewarding both cases equally may steer exploration toward memorization rather than genuine reasoning improvement. In this paper, we propose DiRL, a Direction-Aware Reinforcement Learning framework that anchors exploration to an internal reasoning-memorization direction of the policy. Specifically, DiRL extracts this direction from model representations, constructs direction-weighted gradient features to characterize rollout updates, and shapes rewards to amplify reasoning-aligned exploration while suppressing memorization-aligned variations. DiRL integrates seamlessly into standard Group Relative Policy Optimization (GRPO). Extensive experiments on mathematical and general reasoning benchmarks demonstrate the effectiveness of DiRL, showing significant improvements over various existing exploration methods.
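Abstract (reference translation): Reinforcement learning has become an important paradigm for eliciting reasoning abilities in large language models, and exploration is essential for discovering effective solution trajectories. Existing exploration methods generally promote diversity in semantic or gradient space without distinguishing what drives that diversity. A trajectory may look novel because it follows a new reasoning process, or because it varies memorized patterns and shortcuts. Rewarding both cases equally may push exploration toward memorization rather than genuine reasoning improvement. In this paper, we propose DiRL, a direction-aware reinforcement learning framework that anchors exploration to an internal reasoning-memorization direction of the policy. Specifically, DiRL extracts this direction from model representations, constructs direction-weighted gradient features that characterize rollout updates, and shapes rewards to amplify reasoning-aligned exploration while suppressing memorization-aligned variation. DiRL integrates seamlessly into standard Group Relative Policy Optimization (GRPO). Extensive experiments on mathematical and general reasoning benchmarks demonstrate the effectiveness of DiRL, with significant improvements over a range of existing exploration methods.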
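
The abstract names three steps (extract a direction, characterize each rollout relative to it, shape rewards before GRPO) but gives no formulas. The sketch below is only an illustration of that pipeline under loud assumptions, not the authors' method: the direction is taken as a normalized difference of mean representations, each rollout's feature vector is supplied directly rather than derived from gradients, alignment is a plain cosine, and the names `extract_direction`, `alignment`, `shaped_rewards`, and the coefficient `beta` are all hypothetical. Only the group-relative advantage (reward minus group mean, divided by group standard deviation) follows the standard GRPO formulation.

```python
import math
from statistics import mean, pstdev


def unit(v):
    """Return v scaled to unit length (unchanged if it is the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [mean(col) for col in zip(*rows)]


def extract_direction(reasoning_reps, memorization_reps):
    # ASSUMPTION: difference of means stands in for the paper's
    # (unspecified here) direction-extraction procedure.
    mu_r = mean_vector(reasoning_reps)
    mu_m = mean_vector(memorization_reps)
    return unit([a - b for a, b in zip(mu_r, mu_m)])


def alignment(feature, direction):
    # Cosine similarity in [-1, 1]; positive means "reasoning-aligned"
    # under the assumed sign convention of extract_direction.
    return sum(a * b for a, b in zip(unit(feature), direction))


def shaped_rewards(task_rewards, features, direction, beta=0.1):
    # ASSUMPTION: additive shaping with a hypothetical coefficient beta.
    return [r + beta * alignment(f, direction)
            for r, f in zip(task_rewards, features)]


def group_relative_advantages(rewards, eps=1e-8):
    # GRPO-style normalization within one group of rollouts.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With two rollouts that earn the same task reward, the one whose feature vector points along the assumed reasoning direction ends up with the positive advantage, which is the qualitative behavior the abstract describes.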

Related papers