Fugu-MT Paper Translation (Summary): Robots Need More than VLA and World Models

Paper summary: Robots Need More than VLA and World Models

arxiv url: http://arxiv.org/abs/2606.06556v1
Date: Thu, 04 Jun 2026 10:43:14 GMT
Status: Translation complete
Last updated in system: 2026-06-08 14:33:29.369802
Title: Robots Need More than VLA and World Models
Title (reference translation): Robots Need More Than VLAs and World Models
Authors: Elis Karcini, Faisal Mehrban, Quang Nguyen, Mac Schwager, Arash Ajoudani, Cesar Cadena, Jan Peters, Marco Hutter, Haitham Bou-Ammar
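Abstract summary: Generalist robot intelligence is often treated as a policy-scaling problem. This paper argues that this framing is incomplete. It identifies four components missing from the next generation of robotics.
Reference score (independently computed attention score): 38.16463528269755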
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Generalist robot intelligence is often framed as a policy-scaling problem: collect more robot demonstrations, train larger Vision-Language-Action (VLA) models, and expect broader generalisation. In this position paper, we argue that this framing is incomplete. The central bottleneck is not only policy learning, but the absence of mechanisms that convert the world's abundant unstructured behavioural data into grounded robot supervision. Human motion, internet video, simulation rollouts, and interactive demonstrations contain rich information about tasks, goals, contacts, failures, and physical constraints, yet most of this information is not directly usable by robot policies because it lacks embodiment-specific action labels, task semantics, and reward structure. We identify four missing components for the next generation of robotics: data interfaces for autolabelling unstructured behaviour, embodiment interfaces for retargeting human motion to robot actions, world-model interfaces for physics-grounded 3D reasoning, and reward interfaces for inferring task progress and success from video and language. We survey recent progress in robot foundation models, cross-embodiment datasets, learning from video, world models, and reward modelling, and propose a research agenda for building robotics systems that can learn not only from robot demonstrations, but from the broader physical world.
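Abstract (reference translation): Generalist robot intelligence is often treated as a policy-scaling problem: gather more robot demonstrations, train larger Vision-Language-Action (VLA) models, and expect broader generalisation. This position paper argues that this framing is incomplete. The central bottleneck is not only policy learning but also the lack of mechanisms for converting the world's abundant unstructured behavioural data into grounded robot supervision. Human motion, internet video, simulation rollouts, and interactive demonstrations contain rich information about tasks, goals, contacts, failures, and physical constraints, but most of it cannot be used directly by robot policies because it lacks embodiment-specific action labels, task semantics, and reward structure. The paper identifies four components missing from the next generation of robotics: data interfaces for autolabelling unstructured behaviour, embodiment interfaces for retargeting human motion to robot actions, world-model interfaces for physics-grounded 3D reasoning, and reward interfaces for inferring task progress and success from video and language. It surveys recent progress in robot foundation models, cross-embodiment datasets, learning from video, world models, and reward modelling, and proposes a research agenda for building robotic systems that can learn not only from robot demonstrations but also from the broader physical world.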

Related papers