Fugu-MT Paper Translation (Summary): D-SPEAR: Dual-Stream Prioritized Experience Adaptive Replay for Stable Reinforcement Learning in Robotic Manipulation

Paper summary: D-SPEAR: Dual-Stream Prioritized Experience Adaptive Replay for Stable Reinforcement Learning in Robotic Manipulation

arxiv url: http://arxiv.org/abs/2603.27346v1
Date: Sat, 28 Mar 2026 17:34:28 GMT
Status: Translation complete
Last updated in system: 2026-03-31 23:18:44.914977
Title: D-SPEAR: Dual-Stream Prioritized Experience Adaptive Replay for Stable Reinforcement Learning in Robotic Manipulation
Title (reference translation): D-SPEAR: Dual-Stream Prioritized Experience Adaptive Replay for Stable Reinforcement Learning in Robotic Manipulation
Authors: Yu Zhang, Karl Mason,
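Abstract summary: D-SPEAR is a replay framework that decouples actor and critic sampling while maintaining a shared replay buffer. We evaluate D-SPEAR on challenging robotic manipulation tasks from the robosuite benchmark, including Block-Lifting and Door-Opening.
Reference score (independently computed attention score): 4.39988340059705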
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Robotic manipulation remains challenging for reinforcement learning due to contact-rich dynamics, long horizons, and training instability. Although off-policy actor-critic algorithms such as SAC and TD3 perform well in simulation, they often suffer from policy oscillations and performance collapse in realistic settings, partly due to experience replay strategies that ignore the differing data requirements of the actor and the critic. We propose D-SPEAR: Dual-Stream Prioritized Experience Adaptive Replay, a replay framework that decouples actor and critic sampling while maintaining a shared replay buffer. The critic leverages prioritized replay for efficient value learning, whereas the actor is updated using low-error transitions to stabilize policy optimization. An adaptive anchor mechanism balances uniform and prioritized sampling based on the coefficient of variation of TD errors, and a Huber-based critic objective further improves robustness under heterogeneous reward scales. We evaluate D-SPEAR on challenging robotic manipulation tasks from the robosuite benchmark, including Block-Lifting and Door-Opening. Results demonstrate that D-SPEAR consistently outperforms strong off-policy baselines, including SAC, TD3, and DDPG, in both final performance and training stability, with ablation studies confirming the complementary roles of the actor-side and critic-side replay streams.
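Abstract (reference translation): Robotic manipulation remains difficult for reinforcement learning because of contact-rich dynamics, long horizons, and training instability. Off-policy actor-critic algorithms such as SAC and TD3 work well in simulation, but in realistic settings they often suffer from policy oscillation and performance collapse, in part because their experience replay strategies ignore the differing data requirements of the actor and the critic. This paper proposes D-SPEAR: Dual-Stream Prioritized Experience Adaptive Replay. The critic uses prioritized replay for efficient value learning, while the actor uses low-error transitions to stabilize policy optimization. An adaptive anchor mechanism balances uniform and prioritized sampling based on the coefficient of variation of the TD errors, and a Huber-based critic objective further improves robustness under heterogeneous reward scales. We evaluate D-SPEAR on challenging robotic manipulation tasks from the robosuite benchmark, including Block-Lifting and Door-Opening. The results show that D-SPEAR consistently outperforms strong off-policy baselines, including SAC, TD3, and DDPG, in both final performance and training stability, and ablation studies confirm the complementary roles of the actor-side and critic-side replay streams.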
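
The dual-stream idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract gives no formulas, so every function name, the priority exponent `alpha`, the `keep_frac` cutoff for "low-error" transitions, and the mapping `cv / (1 + cv)` from coefficient of variation to mixing weight are assumptions chosen only to make the mechanism concrete.

```python
import numpy as np

def huber(td_error, delta=1.0):
    """Huber loss: quadratic for small errors, linear for large ones."""
    a = np.abs(td_error)
    return np.where(a <= delta, 0.5 * a**2, delta * (a - 0.5 * delta))

def mixing_weight(td_abs, eps=1e-8):
    """Assumed mapping: coefficient of variation of |TD| squashed into [0, 1)."""
    cv = td_abs.std() / (td_abs.mean() + eps)
    return cv / (1.0 + cv)

def critic_probs(td_abs, alpha=0.6, eps=1e-6):
    """Critic stream: blend of uniform and |TD|-prioritized sampling."""
    td_abs = np.asarray(td_abs, dtype=float)
    prio = (td_abs + eps) ** alpha
    prio /= prio.sum()
    uniform = np.full(len(td_abs), 1.0 / len(td_abs))
    lam = mixing_weight(td_abs)  # weight on the prioritized part
    return lam * prio + (1.0 - lam) * uniform

def actor_probs(td_abs, keep_frac=0.5):
    """Actor stream: uniform over the lowest-|TD| fraction of the buffer."""
    td_abs = np.asarray(td_abs, dtype=float)
    k = max(1, int(keep_frac * len(td_abs)))
    low_error_idx = np.argsort(td_abs)[:k]
    p = np.zeros(len(td_abs))
    p[low_error_idx] = 1.0 / k
    return p

# Both streams index into the same (shared) buffer of transitions.
rng = np.random.default_rng(0)
td_abs = np.abs(rng.normal(size=1000))  # stand-in for stored |TD| errors
critic_batch = rng.choice(len(td_abs), size=64, p=critic_probs(td_abs))
actor_batch = rng.choice(len(td_abs), size=64, p=actor_probs(td_abs))
```

In this sketch, a buffer whose TD errors are nearly identical gives a mixing weight near zero, so the critic falls back to uniform sampling; widely spread errors push the critic toward prioritized sampling. The actor never draws the highest-error transitions, which is one plausible reading of "updated using low-error transitions".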

Related papers