Fugu-MT Paper Translation (Summary): Robo-Dopamine: General Process Reward Modeling for High-Precision Robotic Manipulation

Paper summary: Robo-Dopamine: General Process Reward Modeling for High-Precision Robotic Manipulation

arxiv url: http://arxiv.org/abs/2512.23703v1
Date: Mon, 29 Dec 2025 18:57:44 GMT
Status: Translation complete
System update date: 2026-03-23 08:17:40.561587
Title: Robo-Dopamine: General Process Reward Modeling for High-Precision Robotic Manipulation
Authors: Huajie Tan, Sixiang Chen, Yijie Xu, Zixiao Wang, Yuheng Ji, Cheng Chi, Yaoxu Lyu, Zhongxia Zhao, Xiansheng Chen, Peterson Co, Shaoxuan Xie, Guocai Yao, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang
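Abstract summary: We introduce Dopamine-Reward, a new reward modeling method for learning a process reward model from multi-view inputs. At its core is the General Reward Model (GRM), trained on a dataset of more than 3,400 hours. Building on Dopamine-Reward, we propose Dopamine-RL, a robust policy learning framework.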
Reference score (attention score computed independently by this site): 42.7004446545722
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The primary obstacle to applying reinforcement learning (RL) to real-world robotics is the design of effective reward functions. While recent learning-based Process Reward Models (PRMs) are a promising direction, they are often hindered by two fundamental limitations: their reward models lack step-aware understanding and rely on single-view perception, leading to unreliable assessments of fine-grained manipulation progress; and their reward shaping procedures are theoretically unsound, often inducing a semantic trap that misguides policy optimization. To address these, we introduce Dopamine-Reward, a novel reward modeling method for learning a general-purpose, step-aware process reward model from multi-view inputs. At its core is our General Reward Model (GRM), trained on a vast 3,400+ hour dataset, which leverages Step-wise Reward Discretization for structural understanding and Multi-Perspective Reward Fusion to overcome perceptual limitations. Building upon Dopamine-Reward, we propose Dopamine-RL, a robust policy learning framework that employs a theoretically sound Policy-Invariant Reward Shaping method, which enables the agent to leverage dense rewards for efficient self-improvement without altering the optimal policy, thereby fundamentally avoiding the semantic trap. Extensive experiments across diverse simulated and real-world tasks validate our approach. GRM achieves state-of-the-art accuracy in reward assessment, and Dopamine-RL built on GRM significantly improves policy learning efficiency. For instance, after GRM is adapted to a new task in a one-shot manner from a single expert trajectory, the resulting reward model enables Dopamine-RL to improve the policy from near-zero to 95% success with only 150 online rollouts (approximately 1 hour of real robot interaction), while retaining strong generalization across tasks. Project website: https://robo-dopamine.github.io
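
The abstract does not give the formula behind its Policy-Invariant Reward Shaping. For orientation only, the sketch below shows classical potential-based reward shaping, a well-known way to add dense rewards without changing the optimal policy. Whether Dopamine-RL uses this exact form is an assumption. The names `shape_rewards`, `discounted_return`, and `potentials` are hypothetical, and the potential values stand in for per-state progress estimates such as a process reward model might produce.

```python
from typing import List, Sequence


def shape_rewards(env_rewards: Sequence[float],
                  potentials: Sequence[float],
                  gamma: float = 0.99) -> List[float]:
    """Add the potential-based term gamma * phi(s') - phi(s) to each step reward.

    potentials[t] is a hypothetical progress estimate phi(s_t); it has one
    more entry than env_rewards because it also covers the final state.
    """
    if len(potentials) != len(env_rewards) + 1:
        raise ValueError("need one potential per state, including the final state")
    return [r + gamma * potentials[t + 1] - potentials[t]
            for t, r in enumerate(env_rewards)]


def discounted_return(rewards: Sequence[float], gamma: float = 0.99) -> float:
    """Discounted sum of a reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


if __name__ == "__main__":
    sparse = [0.0, 0.0, 1.0]            # success signal only at the last step
    progress = [0.0, 0.3, 0.7, 1.0]     # assumed progress estimates per state
    dense = shape_rewards(sparse, progress, gamma=0.9)
    # The shaping terms telescope: the return changes only by
    # gamma**T * phi(s_T) - phi(s_0), whatever happens in between.
    print(discounted_return(dense, 0.9) - discounted_return(sparse, 0.9))
```

Because the added terms telescope, two trajectories that share start and end states receive the same total shaping bonus, so the ranking of policies by return is unchanged while each step still carries a dense learning signal.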
