Fugu-MT paper translation (abstract): CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos

Paper overview: CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos

arxiv url: http://arxiv.org/abs/2601.04061v1
Date: Wed, 07 Jan 2026 16:26:33 GMT
Status: translation complete
Last updated in system: 2026-01-09 02:15:23.687788
Title: CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos
Authors: Chubin Zhang, Jianan Wang, Zifeng Gao, Yue Su, Tianru Dai, Cai Zhou, Jiwen Lu, Yansong Tang
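Abstract summary: This paper proposes Contrastive Latent Action Pretraining (CLAP), a framework that aligns the visual latent space from videos with a proprioceptive latent space from robot trajectories. CLAP maps video transitions onto a quantized, physically executable codebook. The paper also introduces a dual-formulation VLA framework offering both CLAP-NTP, an autoregressive model that excels at instruction following and object generalization, and CLAP-RF, a Rectified Flow-based policy designed for high-frequency, precise manipulation.
Reference score (independently computed attention score): 73.51386721543135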
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Generalist Vision-Language-Action models are currently hindered by the scarcity of robotic data compared to the abundance of human video demonstrations. Existing Latent Action Models attempt to leverage video data but often suffer from visual entanglement, capturing noise rather than manipulation skills. To address this, we propose Contrastive Latent Action Pretraining (CLAP), a framework that aligns the visual latent space from videos with a proprioceptive latent space from robot trajectories. By employing contrastive learning, CLAP maps video transitions onto a quantized, physically executable codebook. Building on this representation, we introduce a dual-formulation VLA framework offering both CLAP-NTP, an autoregressive model excelling at instruction following and object generalization, and CLAP-RF, a Rectified Flow-based policy designed for high-frequency, precise manipulation. Furthermore, we propose a Knowledge Matching (KM) regularization strategy to mitigate catastrophic forgetting during fine-tuning. Extensive experiments demonstrate that CLAP significantly outperforms strong baselines, enabling the effective transfer of skills from human videos to robotic execution. Project page: https://lin-shan.com/CLAP/.
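As a rough illustration of the two ideas the abstract names (contrastive alignment of a visual latent space with a proprioceptive one, and mapping a latent onto a quantized codebook), here is a minimal sketch. It assumes a symmetric InfoNCE objective and nearest-neighbor quantization, neither of which the abstract confirms as the authors' actual formulation. The function names, temperature, and toy vectors are all hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(visual, proprio, temperature=0.1):
    """Symmetric InfoNCE loss; row i of each input is assumed to be a matched pair."""
    vis = [normalize(v) for v in visual]
    pro = [normalize(p) for p in proprio]
    n = len(vis)
    # logits[i][j] compares video transition i with robot trajectory j.
    logits = [
        [sum(a * b for a, b in zip(vis[i], pro[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def diagonal_cross_entropy(rows):
        # Mean of -log softmax(row)[i], treating the diagonal entry as the positive.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))


def quantize(z, codebook):
    """Return the index of the nearest codebook entry (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(z, c)) for c in codebook]
    return dists.index(min(dists))


# Toy data (hypothetical): two video-transition embeddings and two candidate
# sets of robot-trajectory embeddings, one correctly paired and one swapped.
visual = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

loss_aligned = info_nce(visual, aligned)
loss_swapped = info_nce(visual, swapped)
code_index = quantize([-0.8, 0.2], codebook)
```

In this toy setup the loss is lower when matched pairs sit on the diagonal than when they are swapped, which is the property a contrastive alignment objective relies on. The quantizer simply picks the closest codebook entry; how CLAP makes its codebook physically executable is not described in the abstract.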

Related papers