Pure Vision Language Action (VLA) Models: A Comprehensive Survey
- URL: http://arxiv.org/abs/2509.19012v2
- Date: Thu, 25 Sep 2025 13:59:28 GMT
- Title: Pure Vision Language Action (VLA) Models: A Comprehensive Survey
- Authors: Dapeng Zhang, Jing Sun, Chenghui Hu, Xiaoyan Wu, Zhenlong Yuan, Rui Zhou, Fei Shen, Qingguo Zhou
- Abstract summary: The emergence of Vision Language Action (VLA) models marks a paradigm shift from traditional policy-based control to generalized robotics. This survey delves into advanced VLA methods, aiming to provide a clear taxonomy and a systematic, comprehensive review of existing research.
- Score: 16.014856048038272
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The emergence of Vision Language Action (VLA) models marks a paradigm shift from traditional policy-based control to generalized robotics, reframing Vision Language Models (VLMs) from passive sequence generators into active agents for manipulation and decision-making in complex, dynamic environments. This survey delves into advanced VLA methods, aiming to provide a clear taxonomy and a systematic, comprehensive review of existing research. It presents a comprehensive analysis of VLA applications across different scenarios and classifies VLA approaches into several paradigms: autoregression-based, diffusion-based, reinforcement-based, hybrid, and specialized methods, examining their motivations, core strategies, and implementations in detail. In addition, foundational datasets, benchmarks, and simulation platforms are introduced. Building on the current VLA landscape, the review further proposes perspectives on key challenges and future directions to advance research in VLA models and generalizable robotics. By synthesizing insights from over three hundred recent studies, this survey maps the contours of this rapidly evolving field and highlights the opportunities and challenges that will shape the development of scalable, general-purpose VLA methods.
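The autoregression-based paradigm in the survey's taxonomy typically discretizes continuous robot actions into tokens and decodes them with a language-model-style backbone, one token per action dimension. The sketch below is only a minimal illustration of that idea under assumed settings (bin count, a 7-dimensional action, and a generic `policy` callable standing in for a VLM backbone); it is not the interface of any specific model covered by the survey.

```python
import numpy as np

# Minimal sketch of the autoregression-based VLA paradigm: continuous actions
# are discretized into bins, and action tokens are decoded one at a time after
# the (image, instruction) token prefix. All names and sizes are assumptions.

NUM_BINS = 256               # uniform bins per action dimension (assumed)
ACTION_DIM = 7               # e.g. 6-DoF end-effector delta + gripper (assumed)
ACTION_LOW, ACTION_HIGH = -1.0, 1.0

def discretize(action: np.ndarray) -> list[int]:
    """Map each continuous action dimension to a bin index (token)."""
    clipped = np.clip(action, ACTION_LOW, ACTION_HIGH)
    scaled = (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)
    return (scaled * (NUM_BINS - 1)).round().astype(int).tolist()

def undiscretize(tokens: list[int]) -> np.ndarray:
    """Map bin indices back to continuous action values."""
    scaled = np.asarray(tokens, dtype=float) / (NUM_BINS - 1)
    return scaled * (ACTION_HIGH - ACTION_LOW) + ACTION_LOW

def decode_action(policy, image_tokens: list[int], text_tokens: list[int]) -> np.ndarray:
    """Greedily decode one action, token by token (hypothetical interface)."""
    prefix = list(image_tokens) + list(text_tokens)
    action_tokens: list[int] = []
    for _ in range(ACTION_DIM):
        logits = policy(prefix + action_tokens)       # next-token scores
        action_tokens.append(int(np.argmax(logits)))  # greedy choice
    return undiscretize(action_tokens)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dummy_policy = lambda prefix: rng.normal(size=NUM_BINS)  # random stand-in backbone
    print("decoded action:", decode_action(dummy_policy, [1, 2, 3], [4, 5]))
```

Diffusion-based methods, by contrast, keep the action continuous and iteratively denoise it rather than emitting discrete tokens.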
Related papers
- Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning [124.48672228625821]
We introduce Vlaser - a Vision-Language-Action Model with synergistic embodied reasoning capability. Vlaser achieves state-of-the-art performance across a range of embodied reasoning benchmarks. Our approach achieves state-of-the-art results on the WidowX benchmark and competitive performance on the Google Robot benchmark.
arXiv Detail & Related papers (2025-10-13T05:51:22Z)
- Survey of Vision-Language-Action Models for Embodied Manipulation [12.586030711502858]
Embodied intelligence systems enhance agent capabilities through continuous environment interactions. Vision-Language-Action models, inspired by advancements in large foundation models, serve as universal robotic control frameworks. This survey comprehensively reviews VLA models for embodied manipulation.
arXiv Detail & Related papers (2025-08-21T03:30:04Z)
- Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey [45.10095869091538]
Vision-Language-Action (VLA) models, built upon Large Vision-Language Models (VLMs) pretrained on vast image-text datasets, have emerged as a transformative paradigm. This survey provides the first systematic, taxonomy-oriented review of large VLM-based VLA models for robotic manipulation.
arXiv Detail & Related papers (2025-08-18T16:45:48Z)
- Vision Language Action Models in Robotic Manipulation: A Systematic Review [1.1767330101986737]
Vision Language Action (VLA) models represent a transformative shift in robotics. This review presents a comprehensive and forward-looking synthesis of the VLA paradigm. We analyze 102 VLA models, 26 foundational datasets, and 12 simulation platforms.
arXiv Detail & Related papers (2025-07-14T18:00:34Z)
- Parallels Between VLA Model Post-Training and Human Motor Learning: Progress, Challenges, and Trends [11.678954304546988]
Vision-language-action (VLA) models extend vision-language models (VLMs) with action generation. This paper reviews post-training strategies for VLA models through the lens of human motor learning.
arXiv Detail & Related papers (2025-06-26T03:06:57Z)
- Unified Vision-Language-Action Model [86.68814779303429]
We present UniVLA, a unified and native multimodal VLA model that autoregressively models vision, language, and action signals as discrete token sequences. Our approach sets new state-of-the-art results across several widely used simulation benchmarks, including CALVIN, LIBERO, and SimplerEnv-Bridge. We further demonstrate its broad applicability on real-world ALOHA manipulation and autonomous driving.
arXiv Detail & Related papers (2025-06-24T17:59:57Z)
- Generalizing vision-language models to novel domains: A comprehensive survey [55.97518817219619]
Vision-language pretraining has emerged as a transformative technique that integrates the strengths of both visual and textual modalities. This survey aims to comprehensively summarize the generalization settings, methodologies, benchmarking, and results in the VLM literature.
arXiv Detail & Related papers (2025-06-23T10:56:37Z)
- Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision [49.073964142139495]
We systematically review the applications and advancements of multimodal fusion methods and vision-language models. For semantic scene understanding tasks, we categorize fusion approaches into encoder-decoder frameworks, attention-based architectures, and graph neural networks. We identify key challenges in current research, including cross-modal alignment, efficient fusion, real-time deployment, and domain adaptation.
arXiv Detail & Related papers (2025-04-03T10:53:07Z)
- CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [89.44024245194315]
We introduce CoT-VLA, a state-of-the-art 7B VLA that incorporates explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs) and can understand and generate visual and action tokens (an illustrative sketch of this two-stage decoding appears after this list). Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks.
arXiv Detail & Related papers (2025-03-27T22:23:04Z)
- VLATest: Testing and Evaluating Vision-Language-Action Models for Robotic Manipulation [7.8735930411335895]
We present VLATest, a fuzzing framework designed to generate robotic manipulation scenes for testing VLA models. Based on VLATest, we conducted an empirical study to assess the performance of seven representative VLA models.
arXiv Detail & Related papers (2024-09-19T16:33:00Z)
- A Survey on Vision-Language-Action Models for Embodied AI [90.99896086619854]
Embodied AI is widely recognized as a key element of artificial general intelligence. A new category of multimodal models has emerged to address language-conditioned robotic tasks in embodied AI. We present the first survey on vision-language-action models for embodied AI.
arXiv Detail & Related papers (2024-05-23T01:43:54Z)
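As a counterpart to the autoregressive decoding sketch above, the following is a rough, hypothetical illustration of the two-stage idea behind visual chain-of-thought methods such as CoT-VLA: the model first autoregressively predicts tokens for a subgoal image and then decodes action tokens conditioned on that imagined observation. The token budgets and the `policy` callable are assumptions for illustration, not the paper's actual interface.

```python
import numpy as np

# Hypothetical sketch of visual chain-of-thought decoding (CoT-VLA style):
# stage 1 predicts subgoal-image tokens, stage 2 decodes action tokens
# conditioned on that visual "reasoning" prefix. Sizes and the policy
# interface are assumptions for illustration only.

SUBGOAL_TOKENS = 16   # predicted subgoal-image tokens (assumed)
ACTION_TOKENS = 7     # one token per action dimension (assumed)
VOCAB_SIZE = 256      # shared token vocabulary (assumed)

def decode_with_visual_cot(policy, prefix_tokens: list[int]) -> tuple[list[int], list[int]]:
    """Two-stage greedy decoding: subgoal-image tokens, then action tokens."""
    sequence = list(prefix_tokens)

    # Stage 1: visual chain of thought -- imagine the subgoal observation.
    subgoal: list[int] = []
    for _ in range(SUBGOAL_TOKENS):
        subgoal.append(int(np.argmax(policy(sequence + subgoal))))
    sequence += subgoal

    # Stage 2: decode action tokens conditioned on the imagined subgoal.
    actions: list[int] = []
    for _ in range(ACTION_TOKENS):
        actions.append(int(np.argmax(policy(sequence + actions))))
    return subgoal, actions

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    dummy_policy = lambda prefix: rng.normal(size=VOCAB_SIZE)  # random stand-in backbone
    subgoal, actions = decode_with_visual_cot(dummy_policy, prefix_tokens=[1, 2, 3])
    print(len(subgoal), "subgoal tokens,", len(actions), "action tokens")
```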