TesserAct: Learning 4D Embodied World Models
- URL: http://arxiv.org/abs/2504.20995v1
- Date: Tue, 29 Apr 2025 17:59:30 GMT
- Title: TesserAct: Learning 4D Embodied World Models
- Authors: Haoyu Zhen, Qiao Sun, Hongxin Zhang, Junyan Li, Siyuan Zhou, Yilun Du, Chuang Gan
- Abstract summary: We learn a 4D world model by training on RGB-DN (RGB, Depth, and Normal) videos. This not only surpasses traditional 2D models by incorporating detailed shape, configuration, and temporal changes into their predictions, but also allows us to effectively learn accurate inverse dynamic models for an embodied agent.
- Score: 66.8519958275311
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents an effective approach for learning novel 4D embodied world models, which predict the dynamic evolution of 3D scenes over time in response to an embodied agent's actions, providing both spatial and temporal consistency. We propose to learn a 4D world model by training on RGB-DN (RGB, Depth, and Normal) videos. This not only surpasses traditional 2D models by incorporating detailed shape, configuration, and temporal changes into their predictions, but also allows us to effectively learn accurate inverse dynamic models for an embodied agent. Specifically, we first extend existing robotic manipulation video datasets with depth and normal information leveraging off-the-shelf models. Next, we fine-tune a video generation model on this annotated dataset, which jointly predicts RGB-DN (RGB, Depth, and Normal) for each frame. We then present an algorithm to directly convert generated RGB, Depth, and Normal videos into a high-quality 4D scene of the world. Our method ensures temporal and spatial coherence in 4D scene predictions from embodied scenarios, enables novel view synthesis for embodied environments, and facilitates policy learning that significantly outperforms those derived from prior video-based world models.
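The final step described above (converting generated RGB, Depth, and Normal frames into a 4D scene) has standard pinhole back-projection at its core. Below is a minimal sketch of that step only, not the paper's full optimization-based algorithm; the intrinsics (fx, fy, cx, cy) and the helper name are assumptions for illustration.

```python
# Minimal sketch (assumed, not TesserAct's exact algorithm): back-project one
# predicted RGB-D frame into a colored point cloud with a pinhole camera model.
import numpy as np

def backproject_rgbd(rgb, depth, fx, fy, cx, cy):
    """rgb: (H, W, 3), depth: (H, W). Returns (N, 6) points as (x, y, z, r, g, b)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    x = (u - cx) * depth / fx                        # unproject along x
    y = (v - cy) * depth / fy                        # unproject along y
    pts = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    cols = rgb.reshape(-1, 3)
    keep = pts[:, 2] > 0                             # drop invalid (zero) depths
    return np.concatenate([pts[keep], cols[keep]], axis=1)
```

Running this per generated frame yields a time-indexed sequence of colored point clouds, i.e. a rudimentary 4D (3D + time) scene; the predicted normals could additionally be attached per point, while the paper's dedicated algorithm further enforces spatial and temporal coherence across frames.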
Related papers
- MVISTA-4D: View-Consistent 4D World Model with Test-Time Action Inference for Robotic Manipulation [27.70398018267795]
This work proposes a novel embodied 4D world model that enables geometrically consistent, arbitrary-view RGBD generation. Experiments on three datasets demonstrate strong performance on both 4D scene generation and downstream manipulation.
arXiv Detail & Related papers (2026-02-10T15:19:17Z)
- VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control [83.92729346325163]
VerseCrafter is a 4D-aware video world model that enables explicit and coherent control over both camera and object dynamics. Our approach is centered on a novel 4D Geometric Control representation, which encodes the world state through a static background point cloud. These 4D controls are rendered into conditioning signals for a pretrained video diffusion model, enabling the generation of high-fidelity, view-consistent videos.
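As a rough illustration of how a static background point cloud could be turned into per-frame conditioning signals for a video diffusion model (the details here are assumptions, not VerseCrafter's actual pipeline), one can project the points through each frame's camera and rasterize their depths into an image-sized buffer:

```python
# Loose sketch with assumed camera conventions: rasterize a point cloud's
# depths into an (h, w) z-buffer that could serve as a conditioning map.
import numpy as np

def render_depth_condition(points, K, R, t, h, w):
    """points: (N, 3) world coords; K: (3, 3) intrinsics; R, t: world-to-camera pose."""
    cam = points @ R.T + t                      # world -> camera frame
    cam = cam[cam[:, 2] > 1e-6]                 # keep points in front of the camera
    uvz = cam @ K.T                             # perspective projection
    u = (uvz[:, 0] / uvz[:, 2]).astype(int)
    v = (uvz[:, 1] / uvz[:, 2]).astype(int)
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    depth = np.full((h, w), np.inf)             # inf where no point projects
    np.minimum.at(depth, (v[ok], u[ok]), cam[ok, 2])  # z-buffer: nearest point wins
    return depth
```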
arXiv Detail & Related papers (2026-01-08T17:28:52Z)
- 3D and 4D World Modeling: A Survey [104.20852751473392]
World modeling has become a cornerstone in AI research, enabling agents to understand, represent, and predict the dynamic environments they inhabit. We introduce a structured taxonomy spanning video-based (VideoGen), occupancy-based (OccGen), and LiDAR-based (LiDARGen) approaches. We discuss practical applications, identify open challenges, and highlight promising research directions.
arXiv Detail & Related papers (2025-09-04T17:59:58Z)
- 4DNeX: Feed-Forward 4D Generative Modeling Made Easy [51.79072580042173]
We present 4DNeX, the first feed-forward framework for generating 4D (i.e., dynamic 3D) scene representations from a single image. In contrast to existing methods that rely on computationally intensive optimization or require multi-frame video inputs, 4DNeX enables efficient, end-to-end image-to-4D generation.
arXiv Detail & Related papers (2025-08-18T17:59:55Z)
- Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction [72.54905331756076]
We introduce Geo4D, a method to repurpose video diffusion models for monocular 3D reconstruction of dynamic scenes.
By leveraging the strong dynamic prior captured by such video models, Geo4D can be trained using only synthetic data.
arXiv Detail & Related papers (2025-04-10T17:59:55Z)
- Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency [49.875459658889355]
Free4D is a tuning-free framework for 4D scene generation from a single image. Our key insight is to distill pre-trained foundation models for consistent 4D scene representation. The resulting 4D representation enables real-time, controllable rendering.
arXiv Detail & Related papers (2025-03-26T17:59:44Z)
- 4D Gaussian Splatting: Modeling Dynamic Scenes with Native 4D Primitives [116.2042238179433]
In this paper, we frame dynamic scenes as unconstrained 4D volume learning problems. We represent a target dynamic scene using a collection of 4D Gaussian primitives with explicit geometry and appearance features. This approach can capture relevant information in space and time by fitting the underlying photorealistic spatio-temporal volume. Notably, our 4DGS model is the first solution that supports real-time rendering of high-resolution, novel views for complex dynamic scenes.
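For intuition only, a 4D Gaussian primitive as described above might be containerized as follows; the field names and shapes are assumptions rather than the paper's actual parameterization.

```python
# Illustrative sketch: a Gaussian with explicit geometry and appearance
# defined jointly over space and time. Not the paper's exact data layout.
from dataclasses import dataclass
import numpy as np

@dataclass
class Gaussian4D:
    mean: np.ndarray      # (4,) center in (x, y, z, t)
    cov: np.ndarray       # (4, 4) spatio-temporal covariance
    color: np.ndarray     # (3,) RGB (real systems often use SH coefficients)
    opacity: float        # scalar opacity

    def density(self, q: np.ndarray) -> float:
        """Unnormalized, opacity-weighted Gaussian density at a 4D query point q."""
        d = q - self.mean
        return float(self.opacity * np.exp(-0.5 * d @ np.linalg.inv(self.cov) @ d))
```

Rendering a frame at time t then roughly amounts to conditioning each 4D Gaussian on t to obtain a 3D Gaussian and splatting the result, which is what enables the real-time dynamic novel-view rendering claimed above.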
arXiv Detail & Related papers (2024-12-30T05:30:26Z)
- Neural 4D Evolution under Large Topological Changes from 2D Images [5.678824325812255]
In this work, we address the challenges in extending 3D neural evolution to 4D under large topological changes.
We introduce (i) a new architecture to discretize and encode the deformation and learn the SDF, and (ii) a technique to impose temporal consistency.
To facilitate learning directly from 2D images, we propose a learning framework that can disentangle the geometry and appearance from RGB images.
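The temporal-consistency idea in (ii) can be illustrated with a generic loss of the kind often used for evolving SDFs; this is a hedged sketch with hypothetical `sdf` and `deform` networks, not the paper's actual formulation.

```python
# Hedged illustration: a point advected by the predicted deformation should keep
# (approximately) the same signed distance at the next time step.
import torch

def temporal_consistency_loss(sdf, deform, points, t, dt):
    """points: (N, 3) samples; t, dt: scalars; sdf/deform: callables. Returns a scalar."""
    moved = points + deform(points, t)                 # displace points from t to t + dt
    return torch.mean((sdf(moved, t + dt) - sdf(points, t)) ** 2)
```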
arXiv Detail & Related papers (2024-11-22T15:47:42Z)
- 4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models [53.89348957053395]
We introduce a novel pipeline designed for text-to-4D scene generation.
Our method begins by generating a reference video using the video generation model.
We then learn the canonical 3D representation of the video using a freeze-time video.
arXiv Detail & Related papers (2024-06-11T17:19:26Z)
- T3VIP: Transformation-based 3D Video Prediction [49.178585201673364]
We propose a 3D video prediction (T3VIP) approach that explicitly models the 3D motion by decomposing a scene into its object parts.
Our model is fully unsupervised, captures the nature of the real world, and the observational cues in image and point cloud domains constitute its learning signals.
To the best of our knowledge, our model is the first generative model that provides an RGB-D video prediction of the future for a static camera.
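The core decomposition idea (not T3VIP's actual network) can be sketched as predicting the next point cloud by splitting the scene into K parts with soft masks and moving each part by its own rigid transform; the observational cues in the image and point-cloud domains mentioned above would then supervise such predictions.

```python
# Rough sketch: masks, rotations, and translations would come from a learned
# model; here they are plain inputs so the geometry of the idea is visible.
import numpy as np

def transform_scene(points, masks, rotations, translations):
    """points: (N, 3); masks: (K, N) soft part assignments summing to 1 over K;
    rotations: (K, 3, 3); translations: (K, 3). Returns the predicted (N, 3) cloud."""
    moved = np.einsum('kij,nj->kni', rotations, points) + translations[:, None, :]
    return np.einsum('kn,kni->ni', masks, moved)  # mask-weighted blend per point
```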
arXiv Detail & Related papers (2022-09-19T15:01:09Z)
- Learning Dynamic View Synthesis With Few RGBD Cameras [60.36357774688289]
We propose to utilize RGBD cameras to synthesize free-viewpoint videos of dynamic indoor scenes.
We generate point clouds from RGBD frames and then render them into free-viewpoint videos via neural rendering.
We introduce a simple Regional Depth-Inpainting module that adaptively inpaints missing depth values to render complete novel views.
arXiv Detail & Related papers (2022-04-22T03:17:35Z)