Pri4R: Learning World Dynamics for Vision-Language-Action Models with Privileged 4D Representation
- URL: http://arxiv.org/abs/2603.01549v1
- Date: Mon, 02 Mar 2026 07:23:53 GMT
- Title: Pri4R: Learning World Dynamics for Vision-Language-Action Models with Privileged 4D Representation
- Authors: Jisoo Kim, Jungbin Cho, Sanghyeok Chu, Ananya Bal, Jinhyung Kim, Gunhee Lee, Sihaeng Lee, Seung Hwan Kim, Bohyung Han, Hyunmin Lee, Laszlo A. Jeni, Seungryong Kim
- Abstract summary: We introduce Pri4R, a simple approach that endows VLA models with an implicit understanding of world dynamics. Pri4R augments VLA models with a lightweight point track head that predicts 3D point tracks. We show that Pri4R significantly improves performance on challenging manipulation tasks.
- Score: 58.21084913574353
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humans learn not only how their bodies move, but also how the surrounding world responds to their actions. In contrast, while recent Vision-Language-Action (VLA) models exhibit impressive semantic understanding, they often fail to capture the spatiotemporal dynamics governing physical interaction. In this paper, we introduce Pri4R, a simple yet effective approach that endows VLA models with an implicit understanding of world dynamics by leveraging privileged 4D information during training. Specifically, Pri4R augments VLAs with a lightweight point track head that predicts 3D point tracks. By injecting VLA features into this head to jointly predict future 3D trajectories, the model learns to incorporate evolving scene geometry within its shared representation space, enabling more physically aware context for precise control. Due to its architectural simplicity, Pri4R is compatible with dominant VLA design patterns with minimal changes. During inference, we run the model using the original VLA architecture unchanged; Pri4R adds no extra inputs, outputs, or computational overhead. Across simulation and real-world evaluations, Pri4R significantly improves performance on challenging manipulation tasks, including a +10% gain on LIBERO-Long and a +40% gain on RoboCasa. We further show that 3D point track prediction is an effective supervision target for learning action-world dynamics, and validate our design choices through extensive ablations.
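The abstract describes the core mechanism: VLA features are injected into a lightweight auxiliary head that predicts future 3D point tracks during training, and the head is dropped at inference. Below is a minimal sketch of how such an auxiliary head and joint objective could be wired up in PyTorch; the module name `PointTrackHead`, the layer sizes, the track horizon, and the loss weight are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class PointTrackHead(nn.Module):
    """Lightweight auxiliary head: predicts future 3D point tracks from VLA features.

    Illustrative sketch only; dimensions and structure are assumptions.
    """
    def __init__(self, feat_dim: int = 1024, horizon: int = 16, num_points: int = 64):
        super().__init__()
        self.horizon = horizon
        self.num_points = num_points
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 512),
            nn.GELU(),
            nn.Linear(512, horizon * num_points * 3),  # (T, N, xyz)
        )

    def forward(self, vla_features: torch.Tensor) -> torch.Tensor:
        # vla_features: (B, feat_dim) features injected from the VLA backbone
        tracks = self.mlp(vla_features)
        return tracks.view(-1, self.horizon, self.num_points, 3)


def joint_training_loss(action_loss, vla_features, gt_tracks, head, lam: float = 0.1):
    """Joint objective: the usual action loss plus an auxiliary 3D track loss.

    The privileged 4D supervision (gt_tracks) is used only at training time;
    at inference the head is simply not called, so the VLA runs unchanged.
    """
    pred_tracks = head(vla_features)
    track_loss = nn.functional.mse_loss(pred_tracks, gt_tracks)
    return action_loss + lam * track_loss
```

Because the head only reads features the VLA already produces, removing it at inference leaves the original architecture, inputs, and outputs untouched, which matches the abstract's claim of zero inference overhead.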
Related papers
- Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models [79.18306680174011]
DSR Suite bridges the gap across dataset, benchmark, and model. We propose an automated pipeline that generates multiple-choice question-answer pairs from in-the-wild videos for DSR. The pipeline extracts rich geometric and motion information, including camera poses, local point clouds, object masks, orientations, and 3D trajectories.
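As a rough illustration of what one generated sample from such a pipeline might contain, the sketch below defines a record holding the fields the summary names; all field names and types are assumptions, since the summary does not specify the dataset schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DSRSample:
    """One auto-generated multiple-choice QA pair for dynamic spatial reasoning.

    Field names are illustrative; the summary only lists the kinds of
    geometric and motion information the pipeline extracts.
    """
    video_id: str
    question: str
    choices: List[str]      # multiple-choice options
    answer_index: int       # index of the correct option
    camera_poses: list      # per-frame camera extrinsics (e.g., 4x4 matrices)
    point_clouds: list      # per-frame local point clouds
    object_masks: list      # per-frame segmentation masks
    orientations: list      # per-object orientation estimates
    trajectories_3d: list   # per-object 3D trajectories over time
```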
arXiv Detail & Related papers (2025-12-23T17:56:36Z) - Robotic VLA Benefits from Joint Learning with Motion Image Diffusion [114.60268819583017]
Vision-Language-Action (VLA) models have achieved remarkable progress in robotic manipulation by mapping multimodal observations and instructions directly to actions. We propose joint learning with motion image diffusion, a novel strategy that enhances VLA models with motion reasoning capabilities. Experiments in both simulation and real-world environments demonstrate that joint learning with motion image diffusion improves the success rate of pi-series VLAs to 97.5%.
arXiv Detail & Related papers (2025-12-19T19:07:53Z) - VLA-4D: Embedding 4D Awareness into Vision-Language-Action Models for SpatioTemporally Coherent Robotic Manipulation [54.81449795163812]
We develop a general VLA model with 4D awareness for temporally coherent robotic manipulation. We extract visual features, embed 1D time into 3D positions for 4D embeddings, and fuse them into a unified visual representation via a cross-attention mechanism. Within this framework, the designed visual and action components jointly make robotic manipulation spatially smooth and temporally coherent.
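The summary above describes lifting 3D positions to 4D by appending a time coordinate and fusing the result with visual features via cross-attention. A minimal sketch of that idea follows; the module name, dimensions, and fusion details are assumptions, not the VLA-4D implementation.

```python
import torch
import torch.nn as nn

class FourDFusion(nn.Module):
    """Sketch: embed (x, y, z, t) positions and fuse them with visual tokens
    via cross-attention. Names and sizes are illustrative assumptions."""
    def __init__(self, feat_dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.pos_embed = nn.Linear(4, feat_dim)  # (x, y, z, t) -> feature space
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

    def forward(self, visual_feats: torch.Tensor, xyz: torch.Tensor, t: torch.Tensor):
        # visual_feats: (B, N_tokens, feat_dim) image features
        # xyz: (B, N_points, 3) 3D positions; t: (B, N_points, 1) timestamps
        pos4d = torch.cat([xyz, t], dim=-1)        # embed 1D time into 3D positions
        pos_tokens = self.pos_embed(pos4d)         # (B, N_points, feat_dim)
        fused, _ = self.cross_attn(query=visual_feats, key=pos_tokens, value=pos_tokens)
        return visual_feats + fused                # unified visual representation
```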
arXiv Detail & Related papers (2025-11-21T12:26:30Z) - ReconVLA: Reconstructive Vision-Language-Action Model as Effective Robot Perceiver [35.25196177784228]
We propose ReconVLA, a reconstructive VLA model with an implicit grounding paradigm. Conditioned on the model's visual outputs, a diffusion transformer reconstructs the gaze region of the image. This process prompts the VLA model to learn fine-grained representations and accurately allocate visual attention.
arXiv Detail & Related papers (2025-08-14T04:20:19Z) - CO-RFT: Efficient Fine-Tuning of Vision-Language-Action Models through Chunked Offline Reinforcement Learning [7.780242426487376]
We propose Chunked RL, a novel reinforcement learning framework for Vision-Language-Action (VLA) models. Within this framework, we extend temporal difference (TD) learning to incorporate action chunking, a prominent characteristic of VLA models. We then propose CO-RFT, an algorithm aimed at fine-tuning VLA models using a limited set of demonstrations.
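To make the "chunked" extension of TD learning concrete, here is a hedged sketch of a k-step bootstrapped target computed over one predicted action chunk; CO-RFT's actual objective may differ, and the function name and shapes are assumptions.

```python
import torch

def chunked_td_target(rewards: torch.Tensor, next_value: torch.Tensor,
                      gamma: float = 0.99) -> torch.Tensor:
    """Sketch of a TD target over an action chunk of length k.

    rewards:    (B, k) rewards collected while executing one action chunk
    next_value: (B,)   value estimate at the state after the whole chunk
    Returns a k-step bootstrapped target, treating the chunk as one decision.
    """
    B, k = rewards.shape
    discounts = gamma ** torch.arange(k, dtype=rewards.dtype, device=rewards.device)
    return (rewards * discounts).sum(dim=1) + (gamma ** k) * next_value
```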
arXiv Detail & Related papers (2025-08-04T09:11:48Z) - DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge [41.030494146004806]
We propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling. DreamVLA introduces dynamic-region-guided world knowledge prediction, integrated with spatial and semantic cues, which provides compact yet comprehensive representations for action planning. Experiments in both real-world and simulation environments demonstrate that DreamVLA achieves a 76.7% success rate on real robot tasks.
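The summary frames DreamVLA as forecasting future world knowledge and then recovering the action via inverse dynamics. The drastically simplified sketch below shows only that forecast-then-invert flow; the module names, sizes, and use of plain MLPs are assumptions and stand-ins for the paper's actual components.

```python
import torch
import torch.nn as nn

class InverseDynamicsSketch(nn.Module):
    """Sketch: forecast a compact future world representation, then recover the
    action from (current, predicted future). Illustrative assumptions only."""
    def __init__(self, feat_dim: int = 512, action_dim: int = 7):
        super().__init__()
        self.forecaster = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.GELU(), nn.Linear(feat_dim, feat_dim)
        )
        self.inverse_dyn = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.GELU(), nn.Linear(256, action_dim)
        )

    def forward(self, current_feat: torch.Tensor) -> torch.Tensor:
        future_feat = self.forecaster(current_feat)  # predicted world knowledge
        return self.inverse_dyn(torch.cat([current_feat, future_feat], dim=-1))
```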
arXiv Detail & Related papers (2025-07-06T16:14:29Z) - Evo-0: Vision-Language-Action Model with Implicit Spatial Understanding [11.222744122842023]
We introduce a plug-and-play module that implicitly incorporates 3D geometry features into Vision-Language-Action (VLA) models. Our method significantly improves the performance of state-of-the-art VLA models across diverse scenarios.
arXiv Detail & Related papers (2025-07-01T04:05:47Z) - Easi3R: Estimating Disentangled Motion from DUSt3R Without Training [69.51086319339662]
We introduce Easi3R, a simple yet efficient training-free method for 4D reconstruction. Our approach applies attention adaptation during inference, eliminating the need for from-scratch pre-training or network fine-tuning. Our experiments on real-world dynamic videos demonstrate that our lightweight attention adaptation significantly outperforms previous state-of-the-art methods.
arXiv Detail & Related papers (2025-03-31T17:59:58Z) - 3D-VLA: A 3D Vision-Language-Action Generative World Model [68.0388311799959]
Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the broader realm of the 3D physical world.
We propose 3D-VLA by introducing a new family of embodied foundation models that seamlessly link 3D perception, reasoning, and action.
Our experiments on held-in datasets demonstrate that 3D-VLA significantly improves the reasoning, multimodal generation, and planning capabilities in embodied environments.
arXiv Detail & Related papers (2024-03-14T17:58:41Z)