Vision Language Action Models in Robotic Manipulation: A Systematic Review
- URL: http://arxiv.org/abs/2507.10672v1
- Date: Mon, 14 Jul 2025 18:00:34 GMT
- Title: Vision Language Action Models in Robotic Manipulation: A Systematic Review
- Authors: Muhayy Ud Din, Waseem Akram, Lyes Saad Saoud, Jan Rosell, Irfan Hussain
- Abstract summary: Vision Language Action (VLA) models represent a transformative shift in robotics. This review presents a comprehensive and forward-looking synthesis of the VLA paradigm. We analyze 102 VLA models, 26 foundational datasets, and 12 simulation platforms.
- Score: 1.1767330101986737
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Language Action (VLA) models represent a transformative shift in robotics, aiming to unify visual perception, natural language understanding, and embodied control within a single learning framework. This review presents a comprehensive and forward-looking synthesis of the VLA paradigm, with a particular emphasis on robotic manipulation and instruction-driven autonomy. We analyze 102 VLA models, 26 foundational datasets, and 12 simulation platforms that collectively shape the development and evaluation of VLA models. The models are categorized into key architectural paradigms, each reflecting a distinct strategy for integrating vision, language, and control in robotic systems. Foundational datasets are evaluated using a novel criterion based on task complexity, variety of modalities, and dataset scale, enabling a comparative analysis of their suitability for generalist policy learning. We introduce a two-dimensional characterization framework that organizes these datasets by semantic richness and multimodal alignment, revealing underexplored regions in the current data landscape. Simulation environments are evaluated for their effectiveness in generating large-scale data, their ability to facilitate transfer from simulation to real-world settings, and the variety of tasks they support. Drawing on both academic and industrial contributions, we identify ongoing challenges and outline strategic directions such as scalable pretraining protocols, modular architectural design, and robust multimodal alignment strategies. This review serves as both a technical reference and a conceptual roadmap for advancing embodied intelligence and robotic control, providing insights that span from dataset generation to real-world deployment of generalist robotic agents.
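To make the dataset characterization concrete, the sketch below shows one way the survey's two evaluation ideas could be expressed in code: a per-dataset profile along the criterion's axes (task complexity, modality variety, scale) and a placement in the two-dimensional semantic-richness vs. multimodal-alignment plane. The field names, 0-1 score ranges, thresholds, and the example entry are illustrative assumptions, not values taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class DatasetProfile:
    """Illustrative record for characterizing a VLA dataset.

    Fields mirror the survey's evaluation axes (task complexity, modality
    variety, scale) and its 2-D characterization framework (semantic
    richness vs. multimodal alignment). The 0-1 ranges and example values
    below are assumptions for illustration only.
    """
    name: str
    task_complexity: float       # 0 = single pick-and-place, 1 = long-horizon, multi-step
    modality_variety: float      # 0 = RGB only, 1 = RGB-D + proprioception + language + more
    scale: int                   # number of demonstration episodes
    semantic_richness: float     # density/diversity of language annotations
    multimodal_alignment: float  # how tightly language is grounded in vision and action

def quadrant(profile: DatasetProfile) -> str:
    """Place a dataset in the 2-D (semantic richness x multimodal alignment) plane."""
    rich = profile.semantic_richness >= 0.5
    aligned = profile.multimodal_alignment >= 0.5
    if rich and aligned:
        return "rich & well-aligned (best suited for generalist policy learning)"
    if rich:
        return "semantically rich but weakly aligned"
    if aligned:
        return "aligned but semantically sparse"
    return "underexplored region (sparse and weakly aligned)"

# Hypothetical example entry; the numbers are placeholders, not survey results.
example = DatasetProfile("hypothetical-dataset", 0.7, 0.6, 120_000, 0.8, 0.4)
print(quadrant(example))
```

In practice the thresholds separating the quadrants would be calibrated against the 26 datasets surveyed rather than fixed at 0.5; the point of the sketch is only to show how the two-dimensional framework turns into a simple, comparable profile per dataset.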
Related papers
- VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models [9.376810354990079]
We introduce VLA-OS, a unified VLA architecture series capable of various task planning paradigms. We design a comprehensive suite of experiments across diverse object categories (rigid and deformable), visual modalities (2D and 3D), environments (simulation and real-world), and end-effectors (grippers and dexterous hands).
arXiv Detail & Related papers (2025-06-21T03:07:48Z) - World Models for Cognitive Agents: Transforming Edge Intelligence in Future Networks [55.90051810762702]
We present a comprehensive overview of world models, highlighting their architecture, training paradigms, and applications across prediction, generation, planning, and causal reasoning. We propose Wireless Dreamer, a novel world model-based reinforcement learning framework tailored for wireless edge intelligence optimization.
arXiv Detail & Related papers (2025-05-31T06:43:00Z) - Vision-Language Modeling Meets Remote Sensing: Models, Datasets and Perspectives [36.297745473653166]
Vision-language modeling (VLM) aims to bridge the information gap between images and natural language. Under the new paradigm of first pre-training on massive image-text pairs and then fine-tuning on task-specific data, VLMs in the remote sensing domain have made significant progress.
arXiv Detail & Related papers (2025-05-20T13:47:40Z) - Vision-Language-Action Models: Concepts, Progress, Applications and Challenges [4.180065442680541]
Vision-Language-Action models aim to unify perception, natural language understanding, and embodied action within a single computational framework. This foundational review presents a comprehensive synthesis of recent advancements in Vision-Language-Action models. Key progress areas include architectural innovations, parameter-efficient training strategies, and real-time inference acceleration.
arXiv Detail & Related papers (2025-05-07T19:46:43Z) - A Survey of World Models for Autonomous Driving [63.33363128964687]
Recent breakthroughs in autonomous driving have been propelled by advances in robust world modeling. World models offer high-fidelity representations of the driving environment that integrate multi-sensor data, semantic cues, and temporal dynamics. This paper systematically reviews recent advances in world models for autonomous driving.
arXiv Detail & Related papers (2025-01-20T04:00:02Z) - Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks [20.93006455952299]
Vision-language-action (VLA) models represent a promising direction for developing general-purpose robotic systems. We present a comprehensive evaluation framework and benchmark suite for assessing VLA models.
arXiv Detail & Related papers (2024-11-04T18:01:34Z) - Flex: End-to-End Text-Instructed Visual Navigation from Foundation Model Features [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies. Our findings are synthesized in Flex (Fly lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors. We demonstrate the effectiveness of this approach on a quadrotor fly-to-target task, where agents trained via behavior cloning successfully generalize to real-world scenes. (A minimal illustrative sketch of this frozen-feature-extractor plus behavior-cloning pattern is given after this list.)
arXiv Detail & Related papers (2024-10-16T19:59:31Z) - BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation [57.40024206484446]
We introduce the BEHAVIOR Vision Suite (BVS), a set of tools and assets to generate fully customized synthetic data for systematic evaluation of computer vision models.
BVS supports a large number of adjustable parameters at the scene level.
We showcase three example application scenarios.
arXiv Detail & Related papers (2024-05-15T17:57:56Z) - Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation masks generated by internet-scale foundation models. Our approach can effectively and robustly perceive object pose and enables sample-efficient generalization learning. Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z) - Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learning [54.67880602409801]
In this paper, we study the problem of pre-training world models with abundant in-the-wild videos for efficient learning of visual control tasks.
We introduce Contextualized World Models (ContextWM) that explicitly separate context and dynamics modeling.
Our experiments show that in-the-wild video pre-training equipped with ContextWM can significantly improve the sample efficiency of model-based reinforcement learning.
arXiv Detail & Related papers (2023-05-29T14:29:12Z)
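As referenced in the Flex entry above, the following minimal sketch illustrates the general pattern of placing a frozen, patch-wise visual feature extractor in front of a small control head trained by behavior cloning. The encoder stub, dimensions, PyTorch usage, and random placeholder data are assumptions for illustration; they do not reproduce the Flex implementation.

```python
import torch
import torch.nn as nn

class FrozenPatchEncoder(nn.Module):
    """Stand-in for a pre-trained VLM vision tower used as a frozen,
    patch-wise feature extractor. A real system would load pretrained
    weights; this stub only mimics the interface:
    image -> (num_patches, feature_dim) tokens."""
    def __init__(self, feature_dim: int = 512):
        super().__init__()
        self.proj = nn.Conv2d(3, feature_dim, kernel_size=32, stride=32)  # 224x224 -> 7x7 patches
        for p in self.parameters():
            p.requires_grad = False  # frozen: no gradients flow into the extractor

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.proj(images)                # (B, D, 7, 7)
        return feats.flatten(2).transpose(1, 2)  # (B, 49, D) patch tokens

class PolicyHead(nn.Module):
    """Small trainable head mapping pooled patch features to continuous actions."""
    def __init__(self, feature_dim: int = 512, action_dim: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feature_dim, 256), nn.ReLU(), nn.Linear(256, action_dim)
        )

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        pooled = patch_tokens.mean(dim=1)  # simple mean-pool over patch tokens
        return self.mlp(pooled)

# One behavior-cloning step on hypothetical expert data (random placeholders).
encoder, policy = FrozenPatchEncoder(), PolicyHead()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)  # only the head is trained
images = torch.randn(8, 3, 224, 224)
expert_actions = torch.randn(8, 4)

with torch.no_grad():
    tokens = encoder(images)  # frozen feature extraction
loss = nn.functional.mse_loss(policy(tokens), expert_actions)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because the extractor is frozen, only the lightweight head is optimized, which is what makes this kind of setup attractive when demonstration data is scarce.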