Fugu-MT Paper Translation (Abstract): Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning

Paper overview: Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning

arxiv url: http://arxiv.org/abs/2604.16918v1
Date: Sat, 18 Apr 2026 08:51:29 GMT
Status: Translation complete
Last updated in system: 2026-04-21 21:52:52.236361
Title: Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning
Authors: Weiyu Ma, Yongcheng Zeng, Yan Song, Xinyu Cui, Jian Zhao, Xuhui Liu, Mohamed Elhoseiny
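Abstract summary: Reinforcement learning (RL) has achieved impressive success in post-training large language models (LLMs) and vision-language models (VLMs). However, the dominant on-policy methods discard all collected trajectories after a single gradient update, resulting in poor sample efficiency, and directly applying Prioritized Experience Replay (PER) fails because stored priorities go stale. This paper proposes Freshness-Aware PER, which addresses this priority staleness problem by augmenting any PER-based priority with a multiplicative exponential age decay.
Reference score (independently computed attention score): 43.63475878891097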
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement Learning (RL) has achieved impressive success in post-training Large Language Models (LLMs) and Vision-Language Models (VLMs), with on-policy algorithms such as PPO, GRPO, and REINFORCE++ serving as the dominant paradigm. However, these methods discard all collected trajectories after a single gradient update, resulting in poor sample efficiency, particularly wasteful for agentic tasks where multi-turn environment interactions are expensive. While Experience Replay drives sample efficiency in classic RL by allowing agents to reuse past trajectories and prioritize informative ones, directly applying Prioritized Experience Replay (PER) to LLMs fails. The rapid policy evolution of billion-parameter models renders stored priorities stale, causing old high-priority trajectories to dominate sampling long after they have become uninformative. We propose Freshness-Aware PER, which addresses this priority staleness problem by augmenting any PER-based priority with a multiplicative exponential age decay grounded in effective sample size analysis. To the best of our knowledge, Freshness-Aware PER is the first work to successfully apply PER to LLM/VLM reinforcement learning. We evaluate on eight multi-step agentic, reasoning, and math competition tasks with 0.5B, 3B, and 7B models. Freshness-Aware PER significantly outperforms on-policy baselines, achieving +46% on NQ Search, +367% on Sokoban, and +133% on VLM FrozenLake, while standard PER without age decay consistently degrades performance. Our code is publicly available at https://github.com/Vision-CAIR/Freshness-Aware-PER.
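The abstract describes the core mechanism only at a high level: a PER-style priority is multiplied by an exponential decay in the trajectory's age. The following is a minimal sketch of that idea, not the authors' implementation. The `Trajectory` fields, the `decay_rate` value, the choice to measure age in policy update steps, and sampling proportionally to the decayed priority are all assumptions made for illustration; the paper's exact formula and its effective-sample-size analysis are not reproduced here.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Trajectory:
    """A stored rollout. All field names are hypothetical."""
    name: str
    base_priority: float  # any PER-style priority (assumed non-negative)
    created_step: int     # policy update step at which it was collected


def freshness_priority(traj, current_step, decay_rate):
    """Base priority multiplied by an exponential decay in age.

    Age is assumed to be counted in policy update steps.
    """
    age = current_step - traj.created_step
    return traj.base_priority * math.exp(-decay_rate * age)


def sample_batch(buffer, current_step, batch_size, decay_rate=0.5, rng=random):
    """Sample with probability proportional to the decayed priority."""
    weights = [freshness_priority(t, current_step, decay_rate) for t in buffer]
    return rng.choices(buffer, weights=weights, k=batch_size)


buffer = [
    Trajectory("old_high", base_priority=5.0, created_step=0),
    Trajectory("new_low", base_priority=1.0, created_step=9),
]

# At step 10 the old trajectory has age 10 and the new one has age 1.
p_old = freshness_priority(buffer[0], current_step=10, decay_rate=0.5)
p_new = freshness_priority(buffer[1], current_step=10, decay_rate=0.5)
batch = sample_batch(buffer, current_step=10, batch_size=4)
```

In this toy buffer the older trajectory starts with five times the base priority, yet after ten steps of decay its sampling weight falls below that of the recent one. Without the decay factor, the stale high-priority trajectory would keep dominating sampling, which is the failure mode the abstract attributes to standard PER.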


Related papers