Fugu-MT Paper Translation (Abstract): AR-VRM: Imitating Human Motions for Visual Robot Manipulation with Analogical Reasoning

Paper overview: AR-VRM: Imitating Human Motions for Visual Robot Manipulation with Analogical Reasoning

arxiv url: http://arxiv.org/abs/2508.07626v1
Date: Mon, 11 Aug 2025 05:09:58 GMT
Status: Translation complete
System update date: 2025-08-12 21:23:28.950717
Title: AR-VRM: Imitating Human Motions for Visual Robot Manipulation with Analogical Reasoning
Title (reference translation): AR-VRM: Imitating Human Motions for Visual Robot Manipulation with Analogical Reasoning
Authors: Dejie Yang, Zijing Zhao, Yang Liu
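Abstract summary: Visual Robot Manipulation (VRM) aims to enable a robot to follow natural language instructions based on robot states and visual observations. Existing approaches employ vision-language pretraining with large-scale data. We propose to learn from large-scale human action video datasets in an explicit way.
Reference score (independently computed attention metric): 5.371855090716962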
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Visual Robot Manipulation (VRM) aims to enable a robot to follow natural language instructions based on robot states and visual observations, and therefore requires costly multi-modal data. To compensate for the deficiency of robot data, existing approaches have employed vision-language pretraining with large-scale data. However, they either utilize web data that differs from robotic tasks, or train the model in an implicit way (e.g., predicting future frames at the pixel level), thus showing limited generalization ability under insufficient robot data. In this paper, we propose to learn from large-scale human action video datasets in an explicit way (i.e., imitating human actions from hand keypoints), introducing Visual Robot Manipulation with Analogical Reasoning (AR-VRM). To acquire action knowledge explicitly from human action videos, we propose a keypoint Vision-Language Model (VLM) pretraining scheme, enabling the VLM to learn human action knowledge and directly predict human hand keypoints. During fine-tuning on robot data, to help the robotic arm imitate the action patterns of human motions, we first retrieve human action videos that perform similar manipulation tasks and have similar historical observations, and then learn the Analogical Reasoning (AR) map between human hand keypoints and robot components. By focusing on action keypoints instead of irrelevant visual cues, our method achieves leading performance on the CALVIN benchmark and in real-world experiments. In few-shot scenarios, our AR-VRM outperforms previous methods by large margins, underscoring the effectiveness of explicitly imitating human actions under data scarcity.
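Abstract (reference translation): Visual Robot Manipulation (VRM) aims to enable a robot to follow natural language instructions based on robot states and visual observations, and therefore requires costly multi-modal data. To compensate for the shortage of robot data, existing approaches employ vision-language pretraining with large-scale data. However, they either use web data that differs from robotic tasks or train the model implicitly (e.g., by predicting future frames at the pixel level). In this paper, we propose to learn from large-scale human action video datasets (i.e., by imitating human actions from hand keypoints) and introduce Visual Robot Manipulation with Analogical Reasoning (AR-VRM). To acquire action knowledge explicitly from human action videos, we propose a keypoint Vision-Language Model (VLM) pretraining scheme that learns human action knowledge and directly predicts hand keypoints. During fine-tuning on robot data, to help the robotic arm imitate the action patterns of human motions, we first retrieve human action videos that perform similar manipulation tasks and have similar historical observations, and then learn the Analogical Reasoning (AR) map between human hand keypoints and robot components. By focusing on action keypoints rather than irrelevant visual cues, our method achieves leading performance on the CALVIN benchmark (and in real-world experiments). In few-shot scenarios, our AR-VRM outperforms previous methods by large margins, underscoring the effectiveness of explicitly imitating human actions under data scarcity.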
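
The retrieval step in the abstract (finding human action videos with similar manipulation tasks and similar historical observations) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how similarity is computed, so the embedding vectors, the weighted cosine score, and the names `cosine`, `retrieve_similar_videos`, `video_bank`, and `task_weight` are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_similar_videos(task_emb, obs_emb, video_bank, k=2, task_weight=0.5):
    """Return the ids of the k bank entries most similar to the query.

    video_bank is a list of (video_id, task_embedding, observation_embedding).
    Scoring by a weighted sum of two cosine similarities is an assumed rule,
    not something stated in the abstract.
    """
    scored = []
    for video_id, v_task, v_obs in video_bank:
        score = (task_weight * cosine(task_emb, v_task)
                 + (1.0 - task_weight) * cosine(obs_emb, v_obs))
        scored.append((score, video_id))
    # Highest combined similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [video_id for _, video_id in scored[:k]]


# Toy bank with hypothetical 2-D embeddings for three human action videos.
bank = [
    ("open_drawer", [1.0, 0.0], [0.9, 0.1]),
    ("push_block", [0.0, 1.0], [0.1, 0.9]),
    ("pull_drawer", [0.9, 0.1], [0.8, 0.2]),
]

# Query: a drawer-like task with a drawer-like observation history.
top = retrieve_similar_videos([1.0, 0.0], [1.0, 0.0], bank, k=2)
print(top)
```

In this toy setup the two drawer videos rank above the block-pushing one. In a real system the embeddings would come from learned encoders, and the retrieved videos' hand keypoints would then feed the Analogical Reasoning map described in the abstract.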

Related papers