Fugu-MT Paper Translation (Abstract): ConLA: Contrastive Latent Action Learning from Human Videos for Robotic Manipulation

Paper overview: ConLA: Contrastive Latent Action Learning from Human Videos for Robotic Manipulation

arxiv url: http://arxiv.org/abs/2602.00557v1
Date: Sat, 31 Jan 2026 06:40:57 GMT
Status: Translation complete
Last updated in system: 2026-03-30 05:17:44.701415
Title: ConLA: Contrastive Latent Action Learning from Human Videos for Robotic Manipulation
Authors: Weisheng Dai, Kai Lan, Jianyi Zhou, Bo Zhao, Xiu Su, Junwen Tong, Weili Guan, Shuo Yang
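Abstract summary: We propose ConLA, an unsupervised pretraining framework for learning robotic policies from human videos. By pretraining solely on human videos, it surpasses, for the first time, the performance obtained with real robot trajectory pretraining.
Reference score (attention score computed independently by this site): 27.54751123419347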
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language-Action (VLA) models achieve preliminary generalization through pretraining on large-scale robot teleoperation datasets. However, acquiring datasets that comprehensively cover diverse tasks and environments is extremely costly and difficult to scale. In contrast, human demonstration videos offer a rich and scalable source of diverse scenes and manipulation behaviors, yet their lack of explicit action supervision hinders direct utilization. Prior work leverages VQ-VAE-based frameworks to learn latent actions from human videos in an unsupervised manner. Nevertheless, since the training objective primarily focuses on reconstructing visual appearances rather than capturing inter-frame dynamics, the learned representations tend to rely on spurious visual cues, leading to shortcut learning and entangled latent representations that hinder transferability. To address this, we propose ConLA, an unsupervised pretraining framework for learning robotic policies from human videos. ConLA introduces a contrastive disentanglement mechanism that leverages action category priors and temporal cues to isolate motion dynamics from visual content, effectively mitigating shortcut learning. Extensive experiments show that ConLA achieves strong performance across diverse benchmarks. Notably, by pretraining solely on human videos, our method surpasses, for the first time, the performance obtained with real robot trajectory pretraining, highlighting its ability to extract pure and semantically consistent latent action representations for scalable robot learning.
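The abstract does not give the form of ConLA's contrastive objective, so the following is only a generic illustration of the idea of pulling together latent actions that share an action category and pushing apart those that do not. It is a minimal NumPy sketch of a standard supervised contrastive (InfoNCE-style) loss, not the authors' implementation. The function name `supervised_contrastive_loss`, the `temperature` value, and the toy latents are all assumptions made for illustration.

```python
import numpy as np


def supervised_contrastive_loss(latents, labels, temperature=0.1):
    """Generic supervised contrastive loss (hypothetical helper, not from the paper).

    latents: (n, d) array of latent action vectors, n >= 2.
    labels:  (n,) array of action-category ids.
    Returns the mean loss over anchors that have at least one positive.
    """
    latents = np.asarray(latents, dtype=float)
    labels = np.asarray(labels)

    # Cosine similarity between all pairs, scaled by the temperature.
    z = latents / np.linalg.norm(latents, axis=1, keepdims=True)
    sim = (z @ z.T) / temperature

    # Exclude self-pairs from both the numerator and the denominator.
    self_mask = np.eye(len(labels), dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)

    # Numerically stable log-softmax over the other samples in the batch.
    sim_max = sim.max(axis=1, keepdims=True)
    shifted = sim - sim_max
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Positives are other samples with the same action category.
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_count = pos_mask.sum(axis=1)
    valid = pos_count > 0

    pos_log_prob = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[valid] / pos_count[valid]).mean())


if __name__ == "__main__":
    # Toy latents: two tight clusters in 2D (made-up data).
    toy_latents = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
    aligned = supervised_contrastive_loss(toy_latents, np.array([0, 0, 1, 1]))
    misaligned = supervised_contrastive_loss(toy_latents, np.array([0, 1, 0, 1]))
    print("labels match clusters:", aligned)
    print("labels cut across clusters:", misaligned)
```

In this toy setup the loss is lower when the category labels agree with the geometric clusters, which is the behaviour a category-driven contrastive term is meant to encourage. How ConLA combines such a term with temporal cues and the latent action model is described in the paper itself, not here.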


Related papers