Vision-Language-Action Models: Concepts, Progress, Applications and Challenges
- URL: http://arxiv.org/abs/2505.04769v1
- Date: Wed, 07 May 2025 19:46:43 GMT
- Title: Vision-Language-Action Models: Concepts, Progress, Applications and Challenges
- Authors: Ranjan Sapkota, Yang Cao, Konstantinos I. Roumeliotis, Manoj Karkee
- Abstract summary: Vision-Language-Action models aim to unify perception, natural language understanding, and embodied action within a single computational framework. This foundational review presents a comprehensive synthesis of recent advancements in Vision-Language-Action models. Key progress areas include architectural innovations, parameter-efficient training strategies, and real-time inference accelerations.
- Score: 4.180065442680541
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language-Action (VLA) models mark a transformative advancement in artificial intelligence, aiming to unify perception, natural language understanding, and embodied action within a single computational framework. This foundational review presents a comprehensive synthesis of recent advancements in Vision-Language-Action models, systematically organized across five thematic pillars that structure the landscape of this rapidly evolving field. We begin by establishing the conceptual foundations of VLA systems, tracing their evolution from cross-modal learning architectures to generalist agents that tightly integrate vision-language models (VLMs), action planners, and hierarchical controllers. Our methodology adopts a rigorous literature review framework, covering over 80 VLA models published in the past three years. Key progress areas include architectural innovations, parameter-efficient training strategies, and real-time inference accelerations. We explore diverse application domains such as humanoid robotics, autonomous vehicles, medical and industrial robotics, precision agriculture, and augmented reality navigation. The review further addresses major challenges across real-time control, multimodal action representation, system scalability, generalization to unseen tasks, and ethical deployment risks. Drawing from the state-of-the-art, we propose targeted solutions including agentic AI adaptation, cross-embodiment generalization, and unified neuro-symbolic planning. In our forward-looking discussion, we outline a future roadmap where VLA models, VLMs, and agentic AI converge to power socially aligned, adaptive, and general-purpose embodied agents. This work serves as a foundational reference for advancing intelligent, real-world robotics and artificial general intelligence.
- Keywords: Vision-language-action, Agentic AI, AI Agents, Vision-language Models
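To make the abstract's central idea concrete (unifying perception, language understanding, and action in one framework), below is a minimal, illustrative PyTorch sketch of a generic VLA-style policy: a toy vision encoder and a toy text encoder feed a fused representation into an action head. The class name ToyVLAPolicy, the toy encoders, and all dimensions are hypothetical placeholders, not the architecture of any model surveyed in the paper.

```python
# Minimal, illustrative sketch of a generic vision-language-action policy.
# Not the architecture of any surveyed model: the encoders are toy stand-ins
# and every dimension is a placeholder chosen for brevity.
import torch
import torch.nn as nn


class ToyVLAPolicy(nn.Module):
    def __init__(self, embed_dim: int = 256, vocab_size: int = 30522, action_dim: int = 7):
        super().__init__()
        # Toy "perception" and "language" encoders; a real VLA system would
        # use a large pretrained vision-language backbone here.
        self.vision_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(embed_dim))
        self.text_encoder = nn.Embedding(vocab_size, embed_dim)
        # Fusion over the concatenated image token and instruction tokens.
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Action head: maps the fused representation to, e.g., a 7-DoF command.
        self.action_head = nn.Linear(embed_dim, action_dim)

    def forward(self, image: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        img_tok = self.vision_encoder(image).unsqueeze(1)   # (B, 1, D)
        txt_tok = self.text_encoder(token_ids)              # (B, T, D)
        fused = self.fusion(torch.cat([img_tok, txt_tok], dim=1))
        return self.action_head(fused[:, 0])                # (B, action_dim)


# One camera frame plus one tokenized instruction in, one action vector out.
policy = ToyVLAPolicy()
action = policy(torch.randn(1, 3, 64, 64), torch.randint(0, 30522, (1, 12)))
print(action.shape)  # torch.Size([1, 7])
```

Real VLA systems typically swap the toy encoders for large pretrained VLM backbones and often predict discretized action tokens or short action chunks rather than a single continuous vector.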
Related papers
- Towards Embodied Agentic AI: Review and Classification of LLM- and VLM-Driven Robot Autonomy and Interaction [0.4786416643636131]
Foundation models, including large language models (LLMs) and vision-language models (VLMs), have recently enabled novel approaches to robot autonomy and human-robot interfaces. In parallel, vision-language-action models (VLAs) or large behavior models (BLMs) are increasing the dexterity and capabilities of robotic systems.
arXiv Detail & Related papers (2025-08-07T11:48:03Z) - Vision Language Action Models in Robotic Manipulation: A Systematic Review [1.1767330101986737]
Vision Language Action (VLA) models represent a transformative shift in robotics. This review presents a comprehensive and forward-looking synthesis of the VLA paradigm. We analyze 102 VLA models, 26 foundational datasets, and 12 simulation platforms.
arXiv Detail & Related papers (2025-07-14T18:00:34Z) - DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge [56.3802428957899]
We propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling. DreamVLA introduces dynamic-region-guided world knowledge prediction, integrated with spatial and semantic cues, which provides compact yet comprehensive representations for action planning. Experiments in both real-world and simulation environments demonstrate that DreamVLA achieves a 76.7% success rate on real robot tasks.
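For readers unfamiliar with the term, the sketch below illustrates the generic inverse-dynamics pattern this summary refers to (forecast a future representation, then infer the action that links present to future). It is a hypothetical toy, not DreamVLA's dynamic-region-guided architecture; all names and dimensions are assumptions.

```python
# Illustrative only -- not DreamVLA's actual design. Forecast a compact
# future representation, then infer the action connecting present to future.
import torch
import torch.nn as nn


class ToyInverseDynamicsPolicy(nn.Module):
    def __init__(self, obs_dim: int = 512, action_dim: int = 7):
        super().__init__()
        self.world_predictor = nn.Linear(obs_dim, obs_dim)   # s_t -> predicted s_{t+1}
        self.inverse_dynamics = nn.Sequential(                # (s_t, s_{t+1}) -> a_t
            nn.Linear(2 * obs_dim, 256), nn.ReLU(), nn.Linear(256, action_dim)
        )

    def forward(self, obs_embedding: torch.Tensor) -> torch.Tensor:
        future = self.world_predictor(obs_embedding)          # world-knowledge forecast
        return self.inverse_dynamics(torch.cat([obs_embedding, future], dim=-1))


model = ToyInverseDynamicsPolicy()
print(model(torch.randn(1, 512)).shape)  # torch.Size([1, 7])
```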
arXiv Detail & Related papers (2025-07-06T16:14:29Z) - World Models for Cognitive Agents: Transforming Edge Intelligence in Future Networks [55.90051810762702]
We present a comprehensive overview of world models, highlighting their architecture, training paradigms, and applications across prediction, generation, planning, and causal reasoning. We propose Wireless Dreamer, a novel world model-based reinforcement learning framework tailored for wireless edge intelligence optimization.
arXiv Detail & Related papers (2025-05-31T06:43:00Z) - Embodied AI with Foundation Models for Mobile Service Robots: A Systematic Review [4.540236408836132]
We present the first systematic review of the integration of foundation models in mobile service robotics. We explore the role of such models in enabling real-time sensor fusion, language-conditioned control, and adaptive task execution. We also discuss real-world applications in the domestic assistance, healthcare, and service automation sectors.
arXiv Detail & Related papers (2025-05-26T20:08:09Z) - A Survey on Vision-Language-Action Models for Embodied AI [71.16123093739932]
Embodied AI is widely recognized as a key element of artificial general intelligence. A new category of multimodal models has emerged to address language-conditioned robotic tasks in embodied AI. We present the first survey on vision-language-action models for embodied AI.
arXiv Detail & Related papers (2024-05-23T01:43:54Z) - Position Paper: Agent AI Towards a Holistic Intelligence [53.35971598180146]
We emphasize developing Agent AI -- an embodied system that integrates large foundation models into agent actions.
In this paper, we propose a novel large action model, the Agent Foundation Model, to achieve embodied intelligent behavior.
arXiv Detail & Related papers (2024-02-28T16:09:56Z) - An Interactive Agent Foundation Model [49.77861810045509]
We propose an Interactive Agent Foundation Model that uses a novel multi-task agent training paradigm for training AI agents.
Our training paradigm unifies diverse pre-training strategies, including visual masked auto-encoders, language modeling, and next-action prediction.
We demonstrate the performance of our framework across three separate domains -- Robotics, Gaming AI, and Healthcare.
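For illustration, one plausible way to combine the three pre-training objectives listed above into a single weighted loss is sketched below; the function name, loss weights, and tensor shapes are assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a unified multi-task pre-training loss combining
# masked visual reconstruction, language modeling, and next-action prediction.
import torch
import torch.nn.functional as F


def multitask_pretraining_loss(pred_patches, target_patches,       # masked image patches
                               lm_logits, target_tokens,           # (B, T, V) and (B, T)
                               pred_action, target_action,         # (B, action_dim)
                               w_mae=1.0, w_lm=1.0, w_act=1.0):
    loss_mae = F.mse_loss(pred_patches, target_patches)            # masked auto-encoding
    loss_lm = F.cross_entropy(lm_logits.flatten(0, 1),             # language modeling
                              target_tokens.flatten())
    loss_act = F.mse_loss(pred_action, target_action)              # next-action prediction
    return w_mae * loss_mae + w_lm * loss_lm + w_act * loss_act
```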
arXiv Detail & Related papers (2024-02-08T18:58:02Z) - A call for embodied AI [1.7544885995294304]
We propose Embodied AI as the next fundamental step in the pursuit of Artificial General Intelligence.
By broadening the scope of Embodied AI, we introduce a theoretical framework based on cognitive architectures.
This framework is aligned with Friston's active inference principle, offering a comprehensive approach to EAI development.
arXiv Detail & Related papers (2024-02-06T09:11:20Z) - A Survey on Robotics with Foundation Models: toward Embodied AI [30.999414445286757]
Recent advances in computer vision, natural language processing, and multi-modality learning have shown that foundation models have superhuman capabilities for specific tasks.
This survey aims to provide a comprehensive and up-to-date overview of foundation models in robotics, focusing on autonomous manipulation and encompassing high-level planning and low-level control.
arXiv Detail & Related papers (2024-02-04T07:55:01Z) - WenLan 2.0: Make AI Imagine via a Multimodal Foundation Model [74.4875156387271]
We develop a novel foundation model pre-trained on large-scale multimodal (visual and textual) data.
We show that state-of-the-art results can be obtained on a wide range of downstream tasks.
arXiv Detail & Related papers (2021-10-27T12:25:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.