Fugu-MT Paper Translation (Abstract): Rethinking the Practicality of Vision-language-action Model: A Comprehensive Benchmark and An Improved Baseline

Paper overview: Rethinking the Practicality of Vision-language-action Model: A Comprehensive Benchmark and An Improved Baseline

arxiv url: http://arxiv.org/abs/2602.22663v1
Date: Thu, 26 Feb 2026 06:27:37 GMT
Status: Translation complete
Last updated in system: 2026-03-23 08:17:41.72668
Title: Rethinking the Practicality of Vision-language-action Model: A Comprehensive Benchmark and An Improved Baseline
Title (reference translation): Rethinking the practicality of vision-language-action models: a comprehensive benchmark and an improved baseline
Authors: Wenxuan Song, Jiayi Chen, Xiaoquan Sun, Huashuo Lei, Yikai Qin, Wei Zhao, Pengxiang Ding, Han Zhao, Tongxin Wang, Pengxu Hou, Zhide Zhong, Haodong Yan, Donglin Wang, Jun Ma, Haoang Li
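Abstract summary: Vision-Language-Action (VLA) models have emerged as generalist robotic agents. Existing VLAs are hindered by excessive parameter scales, prohibitive pre-training requirements, and limited applicability to diverse embodiments. This paper proposes CEBench, a benchmark that takes domain randomization into account.
Reference score (independently computed attention measure): 38.41143967396976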
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language-Action (VLA) models have emerged as generalist robotic agents. However, existing VLAs are hindered by excessive parameter scales, prohibitive pre-training requirements, and limited applicability to diverse embodiments. To improve the practicality of VLAs, we propose a comprehensive benchmark and an improved baseline. First, we propose CEBench, a new benchmark spanning diverse embodiments in both simulation and the real world with consideration of domain randomization. We collect 14.4k simulated trajectories and 1.6k real-world expert-curated trajectories to support training on CEBench. Second, using CEBench as our testbed, we study three critical aspects of VLAs' practicality and offer several key findings. Informed by these findings, we introduce LLaVA-VLA, a lightweight yet powerful VLA designed for practical deployment on consumer-grade GPUs. Architecturally, it integrates a compact VLM backbone with multi-view perception, proprioceptive tokenization, and action chunking. To eliminate reliance on costly pre-training, LLaVA-VLA adopts a two-stage training paradigm including post-training and fine-tuning. Furthermore, LLaVA-VLA extends the action space to unify navigation and manipulation. Experiments across embodiments demonstrate the generalization and versatility of LLaVA-VLA, while real-world mobile manipulation experiments establish it as the first end-to-end VLA model for mobile manipulation. We will open-source all datasets, code, and checkpoints upon acceptance to foster reproducibility and future research.
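Abstract (reference translation): Vision-Language-Action (VLA) models have emerged as generalist robotic agents. However, existing VLAs are hindered by excessive parameter scales, prohibitive pre-training requirements, and limited applicability to diverse embodiments. To improve the practicality of VLAs, we propose a comprehensive benchmark and an improved baseline. First, we propose CEBench, a benchmark that takes domain randomization into account. To support training on CEBench, we collected 14.4k simulated trajectories and 1.6k real-world expert-curated trajectories. Second, using CEBench as a testbed, we study three important aspects of VLA practicality and obtain several key findings. LLaVA-VLA is a lightweight yet powerful VLA designed for practical deployment on consumer-grade GPUs. Architecturally, it integrates a compact VLM backbone with multi-view perception, proprioceptive tokenization, and action chunking. To avoid reliance on costly pre-training, LLaVA-VLA adopts a two-stage training paradigm consisting of post-training and fine-tuning. In addition, LLaVA-VLA extends the action space to unify navigation and manipulation. Real-world mobile manipulation experiments establish it as the first end-to-end VLA model for mobile manipulation. We will open-source all datasets, code, and checkpoints to foster reproducibility and future research.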

Related papers