VLA-4D: Embedding 4D Awareness into Vision-Language-Action Models for SpatioTemporally Coherent Robotic Manipulation
- URL: http://arxiv.org/abs/2511.17199v1
- Date: Fri, 21 Nov 2025 12:26:30 GMT
- Title: VLA-4D: Embedding 4D Awareness into Vision-Language-Action Models for SpatioTemporally Coherent Robotic Manipulation
- Authors: Hanyu Zhou, Chuanhao Ma, Gim Hee Lee
- Abstract summary: We develop a general VLA model with 4D awareness for spatiotemporally coherent robotic manipulation. We extract visual features, embed 1D time into 3D positions for 4D embeddings, and fuse them into a unified visual representation via a cross-attention mechanism. Within this framework, the designed visual and action representations jointly make robotic manipulation spatially smooth and temporally coherent.
- Score: 54.81449795163812
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language-action (VLA) models show potential for general robotic tasks, but they still struggle with spatiotemporally coherent manipulation, which requires fine-grained representations. Typically, existing methods embed 3D positions into visual representations to enhance the spatial precision of actions. However, these methods struggle to achieve temporally coherent control over action execution. In this work, we propose VLA-4D, a general VLA model with 4D awareness for spatiotemporally coherent robotic manipulation. Our model is guided by two key designs: 1) 4D-aware visual representation. We extract visual features, embed 1D time into 3D positions for 4D embeddings, and fuse them into a unified visual representation via a cross-attention mechanism. 2) Spatiotemporal action representation. We extend conventional spatial action representations with temporal information to enable spatiotemporal planning, and align the multimodal representations within the LLM for spatiotemporal action prediction. Within this unified framework, the designed visual and action representations jointly make robotic manipulation spatially smooth and temporally coherent. In addition, we extend the VLA dataset with temporal action annotations for fine-tuning our model. Extensive experiments verify the superiority of our method across different robotic manipulation tasks.
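To make the described 4D-aware visual representation concrete, below is a minimal PyTorch-style sketch of the general idea: per-token 3D positions are concatenated with a 1D timestamp to form 4D coordinates, projected into an embedding, and fused with visual features via cross-attention. All module and parameter names (e.g., FourDEmbedding, FourDAwareFusion, embed_dim) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a 4D-aware visual representation (not the authors' code).
# 3D positions + 1D time -> 4D coordinate embeddings, fused with visual
# features through cross-attention.
import torch
import torch.nn as nn


class FourDEmbedding(nn.Module):
    """Maps (x, y, z, t) coordinates to a d-dimensional embedding."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4, embed_dim), nn.GELU(), nn.Linear(embed_dim, embed_dim)
        )

    def forward(self, xyz: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # xyz: (B, N, 3) per-token 3D positions; t: (B, N, 1) timestamps.
        return self.mlp(torch.cat([xyz, t], dim=-1))  # (B, N, D)


class FourDAwareFusion(nn.Module):
    """Fuses visual tokens with 4D embeddings via cross-attention."""

    def __init__(self, embed_dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.pos_embed = FourDEmbedding(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, visual: torch.Tensor, xyz: torch.Tensor, t: torch.Tensor):
        # visual: (B, N, D) visual tokens; queries attend to the 4D embeddings.
        pos = self.pos_embed(xyz, t)  # (B, N, D)
        fused, _ = self.attn(query=visual, key=pos, value=pos)
        return self.norm(visual + fused)  # unified 4D-aware visual tokens


if __name__ == "__main__":
    B, N, D = 2, 196, 256
    fusion = FourDAwareFusion(embed_dim=D)
    out = fusion(torch.randn(B, N, D), torch.randn(B, N, 3), torch.rand(B, N, 1))
    print(out.shape)  # torch.Size([2, 196, 256])
```

In the full model, these fused tokens would further be aligned with language and action tokens inside the LLM for spatiotemporal action prediction; this sketch covers only the visual-side fusion under the assumptions stated above.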
Related papers
- Pri4R: Learning World Dynamics for Vision-Language-Action Models with Privileged 4D Representation [58.21084913574353]
We introduce Pri4R, a simple approach that endows VLA models with an implicit understanding of world dynamics. Pri4R augments VLA models with a lightweight point-track head that predicts 3D point tracks. We show that Pri4R significantly improves performance on challenging manipulation tasks.
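As a rough illustration of what such a lightweight point-track head might look like (an assumption about the general pattern, not Pri4R's actual architecture), the sketch below attaches a small MLP head to visual features and regresses the 3D position of each query point over a few future timesteps; names such as PointTrackHead and num_future_steps are hypothetical.

```python
# Hypothetical lightweight point-track head (illustrative only; not Pri4R's code).
# Given a pooled visual feature and per-point query embeddings, it regresses the
# 3D position of each query point at several future timesteps.
import torch
import torch.nn as nn


class PointTrackHead(nn.Module):
    def __init__(self, feat_dim: int = 256, num_future_steps: int = 8):
        super().__init__()
        self.num_future_steps = num_future_steps
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim * 2, 512),
            nn.GELU(),
            nn.Linear(512, num_future_steps * 3),  # (x, y, z) per future step
        )

    def forward(self, scene_feat: torch.Tensor, point_queries: torch.Tensor):
        # scene_feat: (B, D) pooled visual feature; point_queries: (B, P, D).
        B, P, _ = point_queries.shape
        scene = scene_feat.unsqueeze(1).expand(-1, P, -1)        # (B, P, D)
        tracks = self.mlp(torch.cat([scene, point_queries], dim=-1))
        return tracks.view(B, P, self.num_future_steps, 3)       # 3D point tracks


if __name__ == "__main__":
    head = PointTrackHead()
    out = head(torch.randn(2, 256), torch.randn(2, 64, 256))
    print(out.shape)  # torch.Size([2, 64, 8, 3])
```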
arXiv Detail & Related papers (2026-03-02T07:23:53Z) - StemVLA: An Open-Source Vision-Language-Action Model with Future 3D Spatial Geometry Knowledge and 4D Historical Representation [6.0744834626758495]
StemVLA is a novel framework that explicitly incorporates both future-oriented 3D spatial knowledge and historical 4D representations into action prediction. We show that StemVLA significantly improves long-horizon task success and achieves state-of-the-art performance on the CALVIN ABC-D benchmark [46], with an average sequence length of XXX.
arXiv Detail & Related papers (2026-02-27T06:43:37Z) - MVISTA-4D: View-Consistent 4D World Model with Test-Time Action Inference for Robotic Manipulation [27.70398018267795]
This work proposes a novel embodied 4D world model that enables geometrically consistent, arbitrary-view RGBD generation. Experiments on three datasets demonstrate strong performance on both 4D scene generation and downstream manipulation.
arXiv Detail & Related papers (2026-02-10T15:19:17Z) - GAF: Gaussian Action Field as a 4D Representation for Dynamic World Modeling in Robotic Manipulation [47.471097712217386]
Existing approaches typically follow either a Vision-to-Action (V-A) paradigm, predicting actions directly from visual inputs, or a Vision-to-3D-to-Action (V-3D-A) paradigm, leveraging intermediate 3D representations. In this paper, we adopt a V-4D-A framework that enables direct action reasoning from motion-aware 4D representations via a Gaussian Action Field (GAF). Experiments demonstrate significant improvements, achieving +11.5385 dB PSNR, +0.3864 SSIM, and -0.5574 LPIPS gains in reconstruction quality, while boosting the average success rate by +7.3%.
arXiv Detail & Related papers (2025-06-17T02:55:20Z) - LLaVA-4D: Embedding SpatioTemporal Prompt into LMMs for 4D Scene Understanding [55.81291976637705]
We propose a general LMM framework with a spatiotemporal prompt embedded into visual representations for 4D scene understanding. The prompt is generated by encoding 3D position and 1D time into a dynamic-aware 4D coordinate embedding. Experiments have been conducted to demonstrate the effectiveness of our method across different tasks in 4D scene understanding.
arXiv Detail & Related papers (2025-05-18T06:18:57Z) - OccSora: 4D Occupancy Generation Models as World Simulators for Autonomous Driving [62.54220021308464]
We propose a diffusion-based 4D occupancy generation model, OccSora, to simulate the development of the 3D world for autonomous driving.
OccSora can generate 16-second videos with authentic 3D layout and temporal consistency, demonstrating its ability to understand the spatial and temporal distributions of driving scenes.
arXiv Detail & Related papers (2024-05-30T17:59:42Z) - 3D-VLA: A 3D Vision-Language-Action Generative World Model [68.0388311799959]
Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the broader realm of the 3D physical world.
We propose 3D-VLA by introducing a new family of embodied foundation models that seamlessly link 3D perception, reasoning, and action.
Our experiments on held-in datasets demonstrate that 3D-VLA significantly improves the reasoning, multimodal generation, and planning capabilities in embodied environments.
arXiv Detail & Related papers (2024-03-14T17:58:41Z) - 3D Neural Scene Representations for Visuomotor Control [78.79583457239836]
We learn models for dynamic 3D scenes purely from 2D visual observations.
A dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks.
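As a loose illustration of this general recipe of planning with a dynamics model learned over a latent scene representation (an assumption about the common pattern, not this paper's specific method), the sketch below rolls out a learned latent dynamics model over sampled action sequences and keeps the one with the lowest predicted cost; LatentDynamics and the random-shooting planner are hypothetical names.

```python
# Hypothetical latent-dynamics planner (illustrative; not the paper's code).
# A learned dynamics model predicts the next latent state from (state, action);
# a simple random-shooting planner selects the action sequence whose predicted
# rollout ends closest to a goal latent.
import torch
import torch.nn as nn


class LatentDynamics(nn.Module):
    def __init__(self, state_dim: int = 128, action_dim: int = 7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(), nn.Linear(256, state_dim)
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return state + self.net(torch.cat([state, action], dim=-1))  # residual update


@torch.no_grad()
def plan_actions(dynamics, state, goal, horizon=10, num_samples=256, action_dim=7):
    """Random-shooting planner: sample action sequences, roll out, keep the best."""
    actions = torch.randn(num_samples, horizon, action_dim)
    s = state.expand(num_samples, -1).clone()
    for t in range(horizon):
        s = dynamics(s, actions[:, t])
    cost = ((s - goal) ** 2).sum(dim=-1)   # distance of final latent to goal latent
    return actions[cost.argmin()]          # best (horizon, action_dim) sequence


if __name__ == "__main__":
    dyn = LatentDynamics()
    best = plan_actions(dyn, torch.randn(1, 128), torch.randn(1, 128))
    print(best.shape)  # torch.Size([10, 7])
```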
arXiv Detail & Related papers (2021-07-08T17:49:37Z) - Learning 3D Dynamic Scene Representations for Robot Manipulation [21.6131570689398]
3D scene representation for robot manipulation should capture three key object properties: permanency, completeness, and continuity.
We introduce 3D Dynamic Scene Representation (DSR), a 3D scene representation that simultaneously discovers, tracks, and reconstructs objects, and predicts their dynamics.
We propose DSR-Net, which learns to aggregate visual observations over multiple interactions to gradually build and refine DSR.
arXiv Detail & Related papers (2020-11-03T19:23:06Z)