Fugu-MT Paper Translation (Abstract): An Anatomy of Vision-Language-Action Models: From Modules to Milestones and Challenges

Paper abstract: An Anatomy of Vision-Language-Action Models: From Modules to Milestones and Challenges

arxiv url: http://arxiv.org/abs/2512.11362v3
Date: Fri, 19 Dec 2025 09:38:11 GMT
Status: Translation complete
System update date: 2026-03-23 08:17:40.298725
Title: An Anatomy of Vision-Language-Action Models: From Modules to Milestones and Challenges
Title (reference translation): An Anatomy of Vision-Language-Action Models: From Modules to Milestones and Challenges
Authors: Chao Xu, Suyu Zhang, Yang Liu, Baigui Sun, Weihong Chen, Bo Xu, Qi Liu, Juncheng Wang, Shujun Wang, Shan Luo, Jan Peters, Athanasios V. Vasilakos, Stefanos Zafeiriou, Jiankang Deng
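Abstract summary: Vision-Language-Action (VLA) models are driving a revolution in robotics, enabling machines to understand instructions and interact with the physical world. This survey offers a clear and structured guide to the VLA landscape.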
Reference score (independently computed attention score): 87.35344276973537
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language-Action (VLA) models are driving a revolution in robotics, enabling machines to understand instructions and interact with the physical world. This field is exploding with new models and datasets, making it both exciting and challenging to keep pace with. This survey offers a clear and structured guide to the VLA landscape. We design it to follow the natural learning path of a researcher: we start with the basic Modules of any VLA model, trace the history through key Milestones, and then dive deep into the core Challenges that define the recent research frontier. Our main contribution is a detailed breakdown of the five biggest challenges: (1) Representation, (2) Execution, (3) Generalization, (4) Safety, and (5) Dataset and Evaluation. This structure mirrors the developmental roadmap of a generalist agent: establishing the fundamental perception-action loop, scaling capabilities across diverse embodiments and environments, and finally ensuring trustworthy deployment, all supported by the essential data infrastructure. For each challenge, we review existing approaches and highlight future opportunities. We position this paper as both a foundational guide for newcomers and a strategic roadmap for experienced researchers, with the dual aim of accelerating learning and inspiring new ideas in embodied intelligence. A live version of this survey, with continuous updates, is maintained on our project page (https://suyuz1.github.io/VLA-Survey-Anatomy/).
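Abstract (reference translation): Vision-Language-Action (VLA) models are driving a revolution in robotics, enabling machines to understand instructions and interact with the physical world. The field is exploding with new models and datasets, which makes it both exciting and hard to keep pace with. This survey offers a clear and structured guide to the VLA landscape. We start with the basic modules of a VLA model, trace the history through key milestones, and then dig deep into the core challenges that define the recent research frontier. The main contribution is a detailed overview of five major challenges: (1) representation, (2) execution, (3) generalization, (4) safety, and (5) datasets and evaluation. This structure mirrors the development roadmap of a generalist agent: establishing the basic perception-action loop, scaling capabilities across diverse embodiments and environments, and finally ensuring trustworthy deployment supported by the essential data infrastructure. For each, we review existing approaches and highlight future opportunities. We position this paper as both a foundational guide for newcomers and a strategic roadmap for experienced researchers, aiming both to accelerate learning and to inspire new ideas in embodied intelligence. A live version of this survey, continuously updated, is maintained on our project page (https://suyuz1.github.io/VLA-Survey-Anatomy/).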

Related papers