Fugu-MT paper translation (abstract): A Vision-Language-Action-Critic Model for Robotic Real-World Reinforcement Learning

Paper summary: A Vision-Language-Action-Critic Model for Robotic Real-World Reinforcement Learning

arxiv url: http://arxiv.org/abs/2509.15937v1
Date: Fri, 19 Sep 2025 12:44:29 GMT
Status: translation complete
Last updated in system: 2025-09-22 18:18:11.165395
Title: A Vision-Language-Action-Critic Model for Robotic Real-World Reinforcement Learning
Title (reference translation): A Vision-Language-Action-Critic Model for Robotic Real-World Reinforcement Learning
Authors: Shaopeng Zhai, Qi Zhang, Tianyi Zhang, Fuxian Huang, Haoran Zhang, Ming Zhou, Shengzhe Zhang, Litao Liu, Sixu Lin, Jiangmiao Pang,
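Abstract summary: This paper introduces VLAC, a general process reward model built on InternVL. It outputs dense progress delta and done signals, eliminating task-specific reward engineering. VLAC is trained on vision-language datasets to strengthen perception, dialogue, and reasoning capabilities.
Reference score (independently computed attention measure): 26.546473157595482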
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Robotic real-world reinforcement learning (RL) with vision-language-action (VLA) models is bottlenecked by sparse, handcrafted rewards and inefficient exploration. We introduce VLAC, a general process reward model built upon InternVL and trained on large-scale heterogeneous datasets. Given pairwise observations and a language goal, it outputs a dense progress delta and a done signal, eliminating task-specific reward engineering, and supports one-shot in-context transfer to unseen tasks and environments. VLAC is trained on vision-language datasets to strengthen perception, dialogic and reasoning capabilities, together with robot and human trajectory data that ground action generation and progress estimation, and is additionally strengthened to reject irrelevant prompts and to detect regression or stagnation by constructing large numbers of negative and semantically mismatched samples. With prompt control, a single VLAC model alternately generates reward and action tokens, unifying critic and policy. Deploying VLAC inside an asynchronous real-world RL loop, we layer a graded human-in-the-loop protocol (offline demonstration replay, return and explore, human-guided explore) that accelerates exploration and stabilizes early learning. Across four distinct real-world manipulation tasks, VLAC lifts success rates from about 30% to about 90% within 200 real-world interaction episodes; incorporating human-in-the-loop interventions yields a further 50% improvement in sample efficiency and achieves up to 100% final success.
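Abstract (reference translation): Robotic real-world reinforcement learning (RL) with vision-language-action (VLA) models is bottlenecked by sparse, handcrafted rewards and inefficient exploration. This paper introduces VLAC, a general process reward model built on InternVL and trained on large-scale heterogeneous datasets. Given pairwise observations and a language goal, it generates dense progress delta and done signals, eliminates task-specific reward engineering, and supports one-shot in-context transfer to unseen tasks and environments. VLAC is trained on vision-language datasets to strengthen perception, dialogue, and reasoning capabilities, together with robot and human trajectory data that ground action generation and progress estimation, and is further strengthened to reject irrelevant prompts and to detect regression or stagnation by constructing many negative and semantically mismatched samples. With prompt control, a single VLAC model alternately generates reward and action tokens, unifying critic and policy. Deployed inside an asynchronous real-world RL loop, it is layered with a graded human-in-the-loop protocol (offline demonstration replay, return and explore, human-guided explore) that accelerates exploration and stabilizes early learning. Across four distinct real-world manipulation tasks, VLAC lifts success rates from about 30% to about 90% within 200 real-world interaction episodes.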

Related papers