Fugu-MT Paper Translation (Abstract): LUCID: Learning Embodiment-Agnostic Intent Models from Unstructured Human Videos for Scalable Dexterous Robot Skill Acquisition

Paper overview: LUCID: Learning Embodiment-Agnostic Intent Models from Unstructured Human Videos for Scalable Dexterous Robot Skill Acquisition

arxiv url: http://arxiv.org/abs/2606.11628v1
Date: Wed, 10 Jun 2026 03:49:01 GMT
Status: Translation complete
Last updated in system: 2026-06-11 16:42:38.276416
Title: LUCID: Learning Embodiment-Agnostic Intent Models from Unstructured Human Videos for Scalable Dexterous Robot Skill Acquisition
Title (reference translation): LUCID: Learning embodiment-agnostic intent models from unstructured human videos for scalable dexterous robot skill acquisition
Authors: Harsh Gupta, Guanya Shi, Wenzhen Yuan
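Abstract summary: LUCID is a framework that learns task intent from unstructured human videos. It learns robot control in massively parallel simulation. LUCID is evaluated on five real-world manipulation tasks.
Reference score (independently computed attention score): 11.86733592383987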
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The most widely-adopted robot learning pipelines today learn skills from robot demonstrations or structured human data, which are expensive to collect and tied to specific embodiments. In contrast, unstructured human videos provide a scalable alternative. They contain diverse manipulation demonstrations across objects, scenes, and strategies, but are not directly connected to robot action. We propose LUCID, a two-stage framework that learns task intent from unstructured human videos drawn from internet-scale datasets and learns robot control in massively-parallel simulation. The intent model predicts short-horizon intent (what should happen next in the scene) from the current observation in closed loop. An embodiment-specific sensorimotor policy converts this intent into robot actions. The intent interface is shared across controllers, so the same intent model can be applied to different embodiments, from our primary dexterous hand to a parallel-jaw gripper. We evaluate LUCID on five real-world manipulation tasks: stirring, wiping, and binning supervised by only internet video, with zero-shot transfer to novel scenes and object instances; and push-T and cable routing supervised by 1 hr each of self-collected smartphone video. Project page: https://lucid-robot.github.io/.
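Abstract (reference translation): The most widely adopted robot learning pipelines today learn skills from robot demonstrations or structured human data, which are expensive to collect and tied to specific embodiments. In contrast, unstructured human videos provide a scalable alternative. They contain diverse manipulation demonstrations across objects, scenes, and strategies, but are not directly connected to robot actions. LUCID is a two-stage framework that learns task intent from unstructured human videos drawn from internet-scale datasets and learns robot control in massively parallel simulation. The intent model predicts short-horizon intent (what should happen next in the scene) from the current observation in closed loop. An embodiment-specific sensorimotor policy converts this intent into robot actions. Because the intent interface is shared across controllers, the same intent model can be applied to different embodiments, from the primary dexterous hand to a parallel-jaw gripper. LUCID is evaluated on five real-world manipulation tasks: stirring, wiping, and binning are supervised by internet video only, with zero-shot transfer to novel scenes and object instances; push-T and cable routing are each supervised by 1 hr of self-collected smartphone video. Project page: https://lucid-robot.github.io/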
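The abstract describes a shared intent interface: one intent model re-predicts short-horizon intent from the current observation at every step, and an embodiment-specific policy turns that intent into actions. The following is only a toy sketch of that control flow, not the authors' implementation. Every name in it (`toy_intent_model`, `ToyHandPolicy`, `ToyGripperPolicy`, `toy_scene_step`, `run_closed_loop`), the list-of-floats observation, the joint counts, and the scene dynamics are assumptions made up for illustration.

```python
from typing import Callable, List, Protocol


class SensorimotorPolicy(Protocol):
    """Hypothetical interface: any embodiment maps (observation, intent) to an action."""

    def act(self, obs: List[float], intent: List[float]) -> List[float]: ...


def toy_intent_model(obs: List[float]) -> List[float]:
    """Toy stand-in for the intent model: predict the scene one short step
    ahead by moving each feature halfway toward an assumed goal of 1.0."""
    return [o + 0.5 * (1.0 - o) for o in obs]


class ToyHandPolicy:
    """Toy multi-joint controller (16 joints is an arbitrary assumption)."""

    num_joints = 16

    def act(self, obs: List[float], intent: List[float]) -> List[float]:
        delta = sum(i - o for i, o in zip(intent, obs)) / len(obs)
        return [delta] * self.num_joints


class ToyGripperPolicy:
    """Toy single-degree-of-freedom controller."""

    num_joints = 1

    def act(self, obs: List[float], intent: List[float]) -> List[float]:
        delta = sum(i - o for i, o in zip(intent, obs)) / len(obs)
        return [delta] * self.num_joints


def toy_scene_step(obs: List[float], action: List[float]) -> List[float]:
    """Toy dynamics (assumption): the scene shifts by the mean action command."""
    shift = sum(action) / len(action)
    return [o + shift for o in obs]


def run_closed_loop(
    intent_model: Callable[[List[float]], List[float]],
    policy: SensorimotorPolicy,
    obs: List[float],
    steps: int,
) -> List[float]:
    """Re-predict intent from the current observation at every step, then act."""
    for _ in range(steps):
        intent = intent_model(obs)        # shared across embodiments
        action = policy.act(obs, intent)  # embodiment-specific
        obs = toy_scene_step(obs, action)
    return obs


if __name__ == "__main__":
    start = [0.0, 0.0]
    for policy in (ToyHandPolicy(), ToyGripperPolicy()):
        final = run_closed_loop(toy_intent_model, policy, start, steps=3)
        print(type(policy).__name__, final)
```

In this sketch the two controllers emit actions of different sizes, yet both consume the same intent vector, so swapping embodiments leaves the intent model untouched. That decoupling is the only property the example is meant to show.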

Related papers list