Related papers: DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

URL: http://arxiv.org/abs/2507.04447v3
Date: Tue, 26 Aug 2025 08:23:50 GMT
Title: DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
Authors: Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, Xinqiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, Fan Lu, He Wang, Zhizheng Zhang, Li Yi, Wenjun Zeng, Xin Jin,
Abstract summary: We propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling.<n>DreamVLA introduces a dynamic-region-guided world knowledge prediction, integrated with the spatial and semantic cues, which provide compact yet comprehensive representations for action planning.<n>Experiments on both real-world and simulation environments demonstrate that DreamVLA achieves 76.7% success rate on real robot tasks.
Score: 41.030494146004806
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve generalization and reasoning in robot manipulation. However, existing methods are limited to challenging image-based forecasting, which suffers from redundant information and lacks comprehensive and critical world knowledge, including dynamic, spatial and semantic information. To address these limitations, we propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling, thereby establishing a perception-prediction-action loop for manipulation tasks. Specifically, DreamVLA introduces a dynamic-region-guided world knowledge prediction, integrated with the spatial and semantic cues, which provide compact yet comprehensive representations for action planning. This design aligns with how humans interact with the world by first forming abstract multimodal reasoning chains before acting. To mitigate interference among the dynamic, spatial and semantic information during training, we adopt a block-wise structured attention mechanism that masks their mutual attention, preventing information leakage and keeping each representation clean and disentangled. Moreover, to model the conditional distribution over future actions, we employ a diffusion-based transformer that disentangles action representations from shared latent features. Extensive experiments on both real-world and simulation environments demonstrate that DreamVLA achieves 76.7% success rate on real robot tasks and 4.44 average length on the CALVIN ABC-D benchmarks.

Related papers

Chain of World: World Model Thinking in Latent Motion [24.24061036481793]
Vision-Language-Action (VLA) models often overlook the predictive and temporal-causal structure underlying visual dynamics.<n>We introduce CoWVLA, a new "Chain of World" paradigm that unifies world-model temporal reasoning with a disentangled latent motion representation.<n>CoWVLA outperforms existing world-model and latent-action approaches and achieves moderate computational efficiency.
arXiv Detail & Related papers (2026-03-03T17:52:06Z)
Causal World Modeling for Robot Control [56.31803788587547]
Video world models provide the ability to imagine the near future by understanding the causality between actions and visual dynamics.<n>We introduce LingBot-VA, an autoregressive diffusion framework that learns frame prediction and policy execution simultaneously.<n>We evaluate our model on both simulation benchmarks and real-world scenarios, where it shows significant promise in long-horizon manipulation, data efficiency in post-training, and strong generalizability to novel configurations.
arXiv Detail & Related papers (2026-01-29T17:07:43Z)
Look, Zoom, Understand: The Robotic Eyeball for Embodied Perception [8.542874528320004]
Existing vision models and fixed RGB-D camera systems fail to reconcile wide-area coverage with fine-grained detail acquisition.<n>We propose EyeVLA, a robotic eyeball for active visual perception that can take proactive actions based on instructions.
arXiv Detail & Related papers (2025-11-19T09:42:08Z)
Executable Analytic Concepts as the Missing Link Between VLM Insight and Precise Manipulation [70.8381970762877]
Vision-Language Models (VLMs) have demonstrated remarkable capabilities in semantic reasoning and task planning.<n>We introduce GRACE, a novel framework that grounds VLM-based reasoning through executable analytic concepts.<n>G GRACE provides a unified and interpretable interface between high-level instruction understanding and low-level robot control.
arXiv Detail & Related papers (2025-10-09T09:08:33Z)
dVLA: Diffusion Vision-Language-Action Model with Multimodal Chain-of-Thought [66.78110237549087]
Vision-Language-Action (VLA) models are emerging as a next-generation paradigm for robotics.<n>We introduce dVLA, a diffusion-based VLA that unifies visual perception, language reasoning, and robotic control in a single system.
arXiv Detail & Related papers (2025-09-30T02:36:11Z)
Learning Primitive Embodied World Models: Towards Scalable Robotic Learning [50.32986780156215]
We propose a novel paradigm for world modeling--Primitive Embodied World Models (PEWM)<n>By restricting video generation to fixed short horizons, our approach enables fine-grained alignment between linguistic concepts and visual representations of robotic actions.<n>Our framework bridges the gap between fine-grained physical interaction and high-level reasoning, paving the way toward scalable, interpretable, and general-purpose embodied intelligence.
arXiv Detail & Related papers (2025-08-28T14:31:48Z)
ReconVLA: Reconstructive Vision-Language-Action Model as Effective Robot Perceiver [35.25196177784228]
We propose ReconVLA, a reconstructive VLA model with an implicit grounding paradigm.<n>Conditioned on the model's visual outputs, a diffusion transformer reconstructs the gaze region of the image.<n>This process prompts the VLA model to learn fine-grained representations and accurately allocate visual attention.
arXiv Detail & Related papers (2025-08-14T04:20:19Z)
WorldVLA: Towards Autoregressive Action World Model [43.74612972649639]
We present WorldVLA, an autoregressive action world model that unifies action and image understanding and generation.<n>WorldVLA intergrates Vision-Language-Action (VLA) model and world model in one single framework.
arXiv Detail & Related papers (2025-06-26T17:55:40Z)
LaDi-WM: A Latent Diffusion-based World Model for Predictive Manipulation [51.834607121538724]
We propose LaDi-WM, a world model that predicts the latent space of future states using diffusion modeling.<n>We show that LaDi-WM significantly enhances policy performance by 27.9% on the LIBERO-LONG benchmark and 20% on the real-world scenario.
arXiv Detail & Related papers (2025-05-13T04:42:14Z)
HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model [54.64088247291416]
A fundamental objective of manipulation policy design is to endow robots to comprehend human instructions, reason about scene cues, and execute generalized actions in dynamic environments.<n>Recent autoregressive vision-language-action (VLA) methods inherit common-sense reasoning capabilities from vision-language models (VLMs) for next action-token prediction.<n>We introduce HybridVLA, a unified framework that absorbs the continuous nature of diffusion-based actions and the contextual reasoning of autoregression.
arXiv Detail & Related papers (2025-03-13T17:59:52Z)
Humanoid-VLA: Towards Universal Humanoid Control with Visual Integration [28.825612240280822]
We propose a novel framework that integrates language understanding, egocentric scene perception, and motion control, enabling universal humanoid control.<n>Humanoid-VLA begins with language-motion pre-alignment using non-egocentric human motion datasets paired with textual descriptions.<n>We then incorporate egocentric visual context through a parameter efficient video-conditioned fine-tuning, enabling context-aware motion generation.
arXiv Detail & Related papers (2025-02-20T18:17:11Z)
SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model [45.03115608632622]
spatial understanding is the keypoint in robot manipulation.<n>We propose SpatialVLA to explore effective spatial representations for the robot foundation model.<n>We show the proposed Adaptive Action Grids offer a new and effective way to fine-tune the pre-trained SpatialVLA model for new simulation and real-world setups.
arXiv Detail & Related papers (2025-01-27T07:34:33Z)
OccLLaMA: An Occupancy-Language-Action Generative World Model for Autonomous Driving [12.004183122121042]
OccLLaMA is an occupancy-language-action generative world model. We build a unified multi-modal vocabulary for vision, language and action. OccLLaMA achieves competitive performance across multiple tasks.
arXiv Detail & Related papers (2024-09-05T06:30:01Z)
3D-VLA: A 3D Vision-Language-Action Generative World Model [68.0388311799959]
Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the broader realm of the 3D physical world. We propose 3D-VLA by introducing a new family of embodied foundation models that seamlessly link 3D perception, reasoning, and action. Our experiments on held-in datasets demonstrate that 3D-VLA significantly improves the reasoning, multimodal generation, and planning capabilities in embodied environments.
arXiv Detail & Related papers (2024-03-14T17:58:41Z)
DREAMWALKER: Mental Planning for Continuous Vision-Language Navigation [107.5934592892763]
We propose DREAMWALKER -- a world model based VLN-CE agent. The world model is built to summarize the visual, topological, and dynamic properties of the complicated continuous environment. It can simulate and evaluate possible plans entirely in such internal abstract world, before executing costly actions.
arXiv Detail & Related papers (2023-08-14T23:45:01Z)
TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild [77.59069361196404]
TRiPOD is a novel method for predicting body dynamics based on graph attentional networks. To incorporate a real-world challenge, we learn an indicator representing whether an estimated body joint is visible/invisible at each frame. Our evaluation shows that TRiPOD outperforms all prior work and state-of-the-art specifically designed for each of the trajectory and pose forecasting tasks.
arXiv Detail & Related papers (2021-04-08T20:01:00Z)

This list is automatically generated from the titles and abstracts of the papers in this site.