Fugu-MT Paper Translation (Abstract): RoboReward: General-Purpose Vision-Language Reward Models for Robotics

Paper overview: RoboReward: General-Purpose Vision-Language Reward Models for Robotics

arxiv url: http://arxiv.org/abs/2601.00675v2
Date: Thu, 08 Jan 2026 08:49:12 GMT
Status: Translation complete
Last updated in system: 2026-03-23 08:17:40.607751
Title: RoboReward: General-Purpose Vision-Language Reward Models for Robotics
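Title (reference translation): RoboReward: General-Purpose Vision-Language Reward Models for Robots
Authors: Tony Lee, Andrew Wagenmaker, Karl Pertsch, Percy Liang, Sergey Levine, Chelsea Finn
Abstract summary: Vision-language models (VLMs) have shown promise as automatic reward models, but their effectiveness on real robot tasks is poorly understood. The paper aims to close this gap by introducing RoboReward, a robotics reward dataset and benchmark built on large-scale real-robot corpora.
Reference score (independently computed attention score): 124.34685604054312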
License: http://creativecommons.org/licenses/by/4.0/
Abstract: A well-designed reward is critical for effective reinforcement learning-based policy improvement. In real-world robotics, obtaining such rewards typically requires either labor-intensive human labeling or brittle, handcrafted objectives. Vision-language models (VLMs) have shown promise as automatic reward models, yet their effectiveness on real robot tasks is poorly understood. In this work, we aim to close this gap by introducing (1) RoboReward, a robotics reward dataset and benchmark built on large-scale real-robot corpora from Open X-Embodiment (OXE) and RoboArena, and (2) vision-language reward models trained on this dataset (RoboReward 4B/8B). Because OXE is success-heavy and lacks failure examples, we propose a negative examples data augmentation pipeline that generates calibrated negatives and near-misses via counterfactual relabeling of successful episodes and temporal clipping to create partial-progress outcomes from the same videos. Using this framework, we build a large training and evaluation dataset spanning diverse tasks and embodiments to test whether state-of-the-art VLMs can reliably provide rewards for robot learning. Our evaluation of open and proprietary VLMs finds that no model excels across tasks, highlighting substantial room for improvement. We then train general-purpose 4B- and 8B-parameter models that outperform much larger VLMs in assigning rewards for short-horizon robotic tasks. Finally, we deploy the 8B model in real-robot reinforcement learning and find that it improves policy learning over Gemini Robotics-ER 1.5 while narrowing the gap to RL training with human-provided rewards. We release the full dataset, trained reward models, and evaluation suite on our website to advance the development of general-purpose reward models in robotics: https://crfm.stanford.edu/helm/robo-reward-bench (project website).
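Abstract (reference translation): A well-designed reward is essential for effective reinforcement learning-based policy improvement. In real-world robotics, obtaining such rewards requires either labor-intensive human labeling or brittle, handcrafted objectives. Vision-language models (VLMs) show promise as automatic reward models, but their effectiveness on real robot tasks is poorly understood. This work aims to close that gap by introducing (1) RoboReward, a robotics reward dataset and benchmark built on large-scale real-robot corpora from Open X-Embodiment (OXE) and RoboArena, and (2) vision-language reward models trained on this dataset (RoboReward 4B/8B). Because OXE is success-heavy and lacks failure examples, the authors propose a negative-example data augmentation pipeline that generates calibrated negatives and near-misses through counterfactual relabeling of successful episodes and temporal clipping, which produces partial-progress outcomes from the same videos. Using this framework, they build a large training and evaluation dataset spanning diverse tasks and embodiments to test whether state-of-the-art VLMs can reliably provide rewards for robot learning. Their evaluation of open and proprietary VLMs finds that no model excels across tasks, leaving substantial room for improvement. They then train general-purpose 4B- and 8B-parameter models that outperform much larger VLMs at assigning rewards for short-horizon robotic tasks. Finally, they deploy the 8B model in real-robot reinforcement learning and find that it improves policy learning over Gemini Robotics-ER 1.5 while narrowing the gap to RL training with human-provided rewards. The dataset, trained reward models, and evaluation suite are released on the project website to advance the development of general-purpose reward models in robotics.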
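
The abstract names two augmentation steps, counterfactual relabeling and temporal clipping, without giving implementation details. The following is a minimal Python sketch of the general idea only, not the authors' pipeline. The `Episode` type, the `[0, 1]` reward scale, the linear-progress assumption, and both function names are hypothetical and introduced purely for illustration.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Episode:
    frames: List[str]   # hypothetical: frame identifiers standing in for video frames
    instruction: str    # natural-language task description
    reward: float       # hypothetical label in [0, 1]; 1.0 means full success


def temporal_clip(ep: Episode, keep_fraction: float) -> Episode:
    """Keep only the start of a successful video to mimic a partial-progress outcome."""
    if not ep.frames:
        raise ValueError("episode has no frames")
    if not 0.0 < keep_fraction < 1.0:
        raise ValueError("keep_fraction must be strictly between 0 and 1")
    n_keep = max(1, int(len(ep.frames) * keep_fraction))
    # Assumption: progress grows linearly with time, so the label scales with clip length.
    clipped_reward = ep.reward * n_keep / len(ep.frames)
    return replace(ep, frames=ep.frames[:n_keep], reward=clipped_reward)


def counterfactual_relabel(ep: Episode, other_instruction: str) -> Episode:
    """Pair the same video with an instruction it does not satisfy, yielding a negative."""
    if other_instruction == ep.instruction:
        raise ValueError("counterfactual instruction must differ from the original")
    # Assumption: the caller guarantees the video does not accomplish other_instruction.
    return replace(ep, instruction=other_instruction, reward=0.0)


# Usage: derive a near-miss and a negative from one successful episode.
success = Episode(
    frames=[f"frame_{i}" for i in range(10)],
    instruction="put the cup on the plate",
    reward=1.0,
)
near_miss = temporal_clip(success, 0.5)
negative = counterfactual_relabel(success, "put the cup in the drawer")
```

Both helpers return new episodes and leave the original untouched, so one successful video can yield several training examples with different labels. How the paper calibrates its labels and selects counterfactual instructions is not stated in the abstract and is not modeled here.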
