Fugu-MT Paper Translation (Abstract): Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos

Paper Summary: Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos

arxiv url: http://arxiv.org/abs/2512.13080v1
Date: Mon, 15 Dec 2025 08:31:47 GMT
Status: Translation complete
Last updated in system: 2026-03-23 08:17:40.344558
Title: Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos
Title (reference translation): Spatial-Aware VLA Pretraining through Visual Alignment from Videos
Authors: Yicheng Feng, Wanpeng Zhang, Ye Wang, Hao Luo, Haoqi Yuan, Sipeng Zheng, Zongqing Lu
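Abstract summary: Vision-Language-Action (VLA) models integrate visual perception with language-guided policy learning. Most existing approaches rely on 2D visual inputs to perform actions in 3D physical environments. This paper proposes a Spatial-Aware VLA Pretraining paradigm. The paradigm is instantiated with VIPA-VLA, a dual-encoder architecture that incorporates a 3D visual encoder to augment semantic visual representations with 3D-aware features.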
Reference score (independently computed attention score): 39.05067965462225
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language-Action (VLA) models provide a promising paradigm for robot learning by integrating visual perception with language-guided policy learning. However, most existing approaches rely on 2D visual inputs to perform actions in 3D physical environments, creating a significant gap between perception and action grounding. To bridge this gap, we propose a Spatial-Aware VLA Pretraining paradigm that performs explicit alignment between visual space and physical space during pretraining, enabling models to acquire 3D spatial understanding before robot policy learning. Starting from pretrained vision-language models, we leverage large-scale human demonstration videos to extract 3D visual and 3D action annotations, forming a new source of supervision that aligns 2D visual observations with 3D spatial reasoning. We instantiate this paradigm with VIPA-VLA, a dual-encoder architecture that incorporates a 3D visual encoder to augment semantic visual representations with 3D-aware features. When adapted to downstream robot tasks, VIPA-VLA achieves significantly improved grounding between 2D vision and 3D action, resulting in more robust and generalizable robotic policies.
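Abstract (reference translation): Vision-Language-Action (VLA) models offer a promising paradigm for robot learning by integrating visual perception with language-guided policy learning. However, most existing approaches rely on 2D visual inputs to perform actions in 3D physical environments, which creates a large gap between perception and action grounding. To bridge this gap, the paper proposes a Spatial-Aware VLA Pretraining paradigm that lets models acquire 3D spatial understanding before robot policy learning. Starting from pretrained vision-language models, 3D visual and 3D action annotations are extracted to form a new source of supervision that aligns 2D visual observations with 3D spatial reasoning. The paradigm is instantiated with VIPA-VLA, a dual-encoder architecture that incorporates a 3D visual encoder to augment semantic visual representations with 3D-aware features. When adapted to downstream robot tasks, VIPA-VLA substantially improves the grounding between 2D vision and 3D action, yielding more robust and generalizable robot policies.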

Related Papers