Fugu-MT Paper Translation (Abstract): 3PoinTr: 3D Point Tracks for Robot Manipulation Pretraining from Casual Videos

Paper summary: 3PoinTr: 3D Point Tracks for Robot Manipulation Pretraining from Casual Videos

arxiv url: http://arxiv.org/abs/2603.08485v1
Date: Mon, 09 Mar 2026 15:20:26 GMT
Status: Translation complete
Last updated in system: 2026-03-23 08:17:42.107989
Title: 3PoinTr: 3D Point Tracks for Robot Manipulation Pretraining from Casual Videos
Title (reference translation): 3PoinTr: 3D Point Tracks for Robot Manipulation from Casual Videos
Authors: Adam Hung, Bardienus Pieter Duisterhof, Jeffrey Ichnowski
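Abstract summary: 3PoinTr is a method for pretraining robot policies from casual, unconstrained human videos. 3PoinTr uses a transformer architecture to predict 3D point tracks as an intermediate embodiment-agnostic representation. 3PoinTr produces more accurate and higher-quality point tracks owing to its lightweight yet expressive architecture.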
Reference score (independently calculated attention score): 8.359830715928242
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Data-efficient training of robust robot policies is the key to unlocking automation in a wide array of novel tasks. Current systems require large volumes of demonstrations to achieve robustness, which is impractical in many applications. Learning policies directly from human videos is a promising alternative that removes teleoperation costs, but it shifts the challenge toward overcoming the embodiment gap (differences in kinematics and strategies between robots and humans), often requiring restrictive and carefully choreographed human motions. We propose 3PoinTr, a method for pretraining robot policies from casual and unconstrained human videos, enabling learning from motions natural for humans. 3PoinTr uses a transformer architecture to predict 3D point tracks as an intermediate embodiment-agnostic representation. 3D point tracks encode goal specifications, scene geometry, and spatiotemporal relationships. We use a Perceiver IO architecture to extract a compact representation for sample-efficient behavior cloning, even when point tracks violate downstream embodiment-specific constraints. We conduct thorough evaluation on simulated and real-world tasks, and find that 3PoinTr achieves robust spatial generalization on diverse categories of manipulation tasks with only 20 action-labeled robot demonstrations. 3PoinTr outperforms the baselines, including behavior cloning methods, as well as prior methods for pretraining from human videos. We also provide evaluations of 3PoinTr's 3D point track predictions compared to an existing point track prediction baseline. We find that 3PoinTr produces more accurate and higher quality point tracks due to a lightweight yet expressive architecture built on a single transformer, in addition to a training formulation that preserves supervision of partially occluded points. Project page: https://adamhung60.github.io/3PoinTr/.
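Abstract (reference translation): Data-efficient training of robust robot policies is the key to unlocking automation in a wide range of novel tasks. Current systems require large volumes of demonstrations to achieve robustness, which is impractical in many applications. Learning policies directly from human videos is a promising alternative that removes teleoperation costs, but the challenge then lies in overcoming the embodiment gap (differences in kinematics and strategies between robots and humans). We propose 3PoinTr, which enables learning from motions natural for humans. 3PoinTr uses a transformer architecture to predict 3D point tracks as an intermediate embodiment-agnostic representation. 3D point tracks encode goal specifications, scene geometry, and spatiotemporal relationships. We use a Perceiver IO architecture to extract a compact representation for sample-efficient behavior cloning, even when point tracks violate downstream embodiment-specific constraints. We conduct a thorough evaluation on simulated and real-world tasks and find that 3PoinTr achieves spatial generalization across diverse categories of manipulation tasks with only 20 action-labeled robot demonstrations. 3PoinTr outperforms the baselines, including behavior cloning methods as well as prior methods for pretraining from human videos. We also evaluate 3PoinTr's 3D point track predictions against an existing point track prediction baseline. 3PoinTr produces more accurate and higher-quality point tracks owing to a lightweight yet expressive architecture built on a single transformer. Project page: https://adamhung60.github.io/3PoinTr/.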
