Fugu-MT Paper Translation (Abstract): Human-to-Robot Interaction: Learning from Video Demonstration for Robot Imitation

Paper summary: Human-to-Robot Interaction: Learning from Video Demonstration for Robot Imitation

arxiv url: http://arxiv.org/abs/2602.19184v1
Date: Sun, 22 Feb 2026 13:26:27 GMT
Status: Translation complete
Last updated in system: 2026-03-23 08:17:41.632353
Title: Human-to-Robot Interaction: Learning from Video Demonstration for Robot Imitation
Title (reference translation): Human-Robot Interaction: Learning from Video Demonstrations for Robot Imitation
Authors: Thanh Nguyen Canh, Thanh-Tuan Tran, Haolan Zhang, Ziyan Gao, Nak Young Chong, Xiem HoangVan
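Abstract summary: A human-to-robot imitation learning pipeline enables robots to acquire manipulation skills directly from unstructured video demonstrations. The key innovation is a modular framework that decouples the learning process into two distinct stages. For robot manipulation, the average success rate across all actions is 87.5%, with 100% success on reaching tasks and up to 90% on complex pick-and-place operations.
Reference score (independently calculated attention score): 5.967530183571141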
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Learning from Demonstration (LfD) offers a promising paradigm for robot skill acquisition. Recent approaches attempt to extract manipulation commands directly from video demonstrations, yet face two critical challenges: (1) general video captioning models prioritize global scene features over task-relevant objects, producing descriptions unsuitable for precise robotic execution, and (2) end-to-end architectures coupling visual understanding with policy learning require extensive paired datasets and struggle to generalize across objects and scenarios. To address these limitations, we propose a novel "Human-to-Robot" imitation learning pipeline that enables robots to acquire manipulation skills directly from unstructured video demonstrations, inspired by the human ability to learn by watching and imitating. Our key innovation is a modular framework that decouples the learning process into two distinct stages: (1) Video Understanding, which combines Temporal Shift Modules (TSM) with Vision-Language Models (VLMs) to extract actions and identify interacted objects, and (2) Robot Imitation, which employs TD3-based deep reinforcement learning to execute the demonstrated manipulations. We validated our approach in PyBullet simulation environments with a UR5e manipulator and in a real-world experiment with a UF850 manipulator across four fundamental actions: reach, pick, move, and put. For video understanding, our method achieves 89.97% action classification accuracy and BLEU-4 scores of 0.351 on standard objects and 0.265 on novel objects, representing improvements of 76.4% and 128.4% over the best baseline, respectively. For robot manipulation, our framework achieves an average success rate of 87.5% across all actions, with 100% success on reaching tasks and up to 90% on complex pick-and-place operations. The project website is available at https://thanhnguyencanh.github.io/LfD4hri.
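Abstract (reference translation): Learning from Demonstration (LfD) offers a promising paradigm for robot skill acquisition. Recent approaches attempt to extract manipulation commands directly from video demonstrations, but they face two important challenges: (1) general video captioning models prioritize global scene features over task-relevant objects and so generate descriptions unsuited to precise robotic execution, and (2) end-to-end architectures that couple visual understanding with policy learning require extensive paired datasets and struggle to generalize across objects and scenarios. To address these limitations, we propose a new "Human-to-Robot" imitation learning pipeline that enables robots to acquire manipulation skills directly from unstructured video demonstrations. The key innovation is a modular framework that decouples the learning process into two distinct stages: (1) Video Understanding, which combines Temporal Shift Modules (TSM) with Vision-Language Models (VLMs) to extract actions and identify the interacted objects, and (2) Robot Imitation, which uses TD3-based deep reinforcement learning. We validated the approach in PyBullet simulation environments with a UR5e manipulator and in a real-world experiment with a UF850 manipulator, across four fundamental actions: reach, pick, move, and put. For video understanding, the method achieves 89.97% action classification accuracy and BLEU-4 scores of 0.351 on standard objects and 0.265 on novel objects, improvements of 76.4% and 128.4%, respectively, over the best baseline. For robot manipulation, the average success rate across all actions is 87.5%, with 100% success on reaching tasks and up to 90% on complex pick-and-place operations. The project website is available at https://thanhnguyencanh.github.io/LfD4hri.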
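
The abstract names Temporal Shift Modules (TSM) as one component of the video understanding stage but does not describe how they are used. As a rough illustration only, the sketch below shows the temporal shift operation as it is commonly described: a fraction of feature channels is moved one step along the time axis so that each frame's features mix with those of its neighbours. The function name `temporal_shift`, the `fold_div` parameter, and the plain-list data layout (one feature vector per frame, spatial dimensions omitted) are assumptions made for this example and are not taken from the paper.

```python
def temporal_shift(frames, fold_div=8):
    """Shift a fraction of channels along the time axis, zero-padding the ends.

    frames: list of T per-frame feature vectors, each a list of C floats.
    With fold = C // fold_div:
      - channels [0, fold) take their value from the next frame,
      - channels [fold, 2 * fold) take their value from the previous frame,
      - all remaining channels are left unchanged.
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    fold = num_channels // fold_div

    # Start from zeros so positions with no source frame stay zero-padded.
    shifted = [[0.0] * num_channels for _ in range(num_frames)]

    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t + 1   # borrow from the future frame
            elif c < 2 * fold:
                src = t - 1   # borrow from the past frame
            else:
                src = t       # channel is not shifted
            if 0 <= src < num_frames:
                shifted[t][c] = frames[src][c]
    return shifted


# Toy input: 3 frames, 8 channels, value 10 * t + c at frame t, channel c.
toy_frames = [[float(10 * t + c) for c in range(8)] for t in range(3)]
toy_shifted = temporal_shift(toy_frames, fold_div=8)
```

In this toy input only channel 0 and channel 1 move (one channel per direction, since 8 // 8 = 1), so the middle frame ends up holding one value from the frame after it and one from the frame before it. The operation itself adds no learned parameters.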
