VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control
- URL: http://arxiv.org/abs/2601.05138v1
- Date: Thu, 08 Jan 2026 17:28:52 GMT
- Title: VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control
- Authors: Sixiao Zheng, Minghao Yin, Wenbo Hu, Xiaoyu Li, Ying Shan, Yanwei Fu
- Abstract summary: VerseCrafter is a 4D-aware video world model that enables explicit and coherent control over both camera and object dynamics. Our approach is centered on a novel 4D Geometric Control representation, which encodes the world state through a static background point cloud. These 4D controls are rendered into conditioning signals for a pretrained video diffusion model, enabling the generation of high-fidelity, view-consistent videos.
- Score: 83.92729346325163
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video world models aim to simulate dynamic, real-world environments, yet existing methods struggle to provide unified and precise control over camera and multi-object motion, because videos inherently represent dynamics only in the projected 2D image plane. To bridge this gap, we introduce VerseCrafter, a 4D-aware video world model that enables explicit and coherent control over both camera and object dynamics within a unified 4D geometric world state. Our approach is centered on a novel 4D Geometric Control representation, which encodes the world state through a static background point cloud and per-object 3D Gaussian trajectories. This representation captures not only an object's path but also its probabilistic 3D occupancy over time, offering a flexible, category-agnostic alternative to rigid bounding boxes or parametric models. These 4D controls are rendered into conditioning signals for a pretrained video diffusion model, enabling the generation of high-fidelity, view-consistent videos that precisely adhere to the specified dynamics. A remaining challenge is the scarcity of large-scale training data with explicit 4D annotations; we address this by developing an automatic data engine that extracts the required 4D controls from in-the-wild videos, allowing us to train our model on a massive and diverse dataset.
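The 4D Geometric Control representation described in the abstract pairs a static background point cloud with per-object 3D Gaussian trajectories, which are then rendered into conditioning signals for a video diffusion model. Below is a minimal sketch of how such a world state could be organized and projected into per-frame conditioning maps; the class names, tensor shapes, and the naive point splatting are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a 4D Geometric Control state: a static background
# point cloud plus per-object 3D Gaussian trajectories, rendered into a
# per-frame conditioning map for a video diffusion model. Names and shapes
# are illustrative only.
from dataclasses import dataclass
import numpy as np

@dataclass
class GaussianTrajectory:
    means: np.ndarray    # (T, 3) object center per frame
    covs: np.ndarray     # (T, 3, 3) probabilistic 3D occupancy per frame (unused in this sketch)

@dataclass
class GeometricControl4D:
    background: np.ndarray            # (N, 3) static background point cloud
    objects: list[GaussianTrajectory] # one trajectory per controlled object

def render_condition(state: GeometricControl4D, K: np.ndarray,
                     world_to_cam: np.ndarray, t: int, hw=(256, 256)) -> np.ndarray:
    """Project background points and frame-t object centers into the image
    plane to form a sparse 2D conditioning map (naive occupancy splatting)."""
    H, W = hw
    cond = np.zeros((H, W), dtype=np.float32)
    pts = [state.background] + [o.means[t:t + 1] for o in state.objects]
    for p in np.concatenate(pts, axis=0):
        cam = world_to_cam[:3, :3] @ p + world_to_cam[:3, 3]
        if cam[2] <= 1e-6:                      # skip points behind the camera
            continue
        u, v = (K @ (cam / cam[2]))[:2]
        if 0 <= int(v) < H and 0 <= int(u) < W:
            cond[int(v), int(u)] = 1.0          # a fuller renderer would splat each Gaussian's footprint
    return cond
```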
Related papers
- SEE4D: Pose-Free 4D Generation via Auto-Regressive Video Inpainting [83.5106058182799]
We introduce SEE4D, a pose-free, trajectory-to-camera framework for 4D world modeling from casual videos.
A view-conditional video inpainting model is trained to learn a robust geometry prior by denoising realistically synthesized images.
We validate SEE4D on cross-view video generation and sparse reconstruction benchmarks.
arXiv Detail & Related papers (2025-10-30T17:59:39Z)
- Geometry-aware 4D Video Generation for Robot Manipulation [28.709339959536106]
We propose a 4D video generation model that enforces multi-view 3D consistency of videos by supervising the model with cross-view pointmap alignment during training.
This geometric supervision enables the model to learn a shared 3D representation of the scene, allowing it to predict future video sequences from novel viewpoints.
Compared to existing baselines, our method produces more visually stable and spatially aligned predictions across multiple simulated and real-world robotic datasets.
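Cross-view pointmap alignment supervises pointmaps predicted from different camera views so that they agree once lifted into a shared 3D frame. The snippet below is one plausible instantiation of such an alignment term, using a symmetric chamfer distance between world-frame point sets; the function names, shapes, and the choice of chamfer distance are assumptions, not the paper's exact loss.

```python
# Illustrative cross-view pointmap alignment objective (an assumed formulation):
# pointmaps from two views are lifted into a shared world frame and penalized
# for disagreement with a symmetric chamfer term.
import torch

def lift_to_world(pointmap: torch.Tensor, cam_to_world: torch.Tensor) -> torch.Tensor:
    """pointmap: (H, W, 3) camera-frame points; cam_to_world: (4, 4)."""
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return pointmap.reshape(-1, 3) @ R.T + t          # (H*W, 3) world-frame points

def chamfer(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Symmetric chamfer distance between point sets of shape (N, 3) and (M, 3)."""
    d = torch.cdist(a, b)                             # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def cross_view_alignment_loss(pm_a, pm_b, T_a, T_b):
    return chamfer(lift_to_world(pm_a, T_a), lift_to_world(pm_b, T_b))
```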
arXiv Detail & Related papers (2025-07-01T18:01:41Z)
- TesserAct: Learning 4D Embodied World Models [66.8519958275311]
We learn a 4D world model by training on RGB-DN (RGB, Depth, and Normal) videos.
This not only surpasses traditional 2D models by incorporating detailed shape, configuration, and temporal changes into its predictions, but also allows us to effectively learn accurate inverse dynamics models for an embodied agent.
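Training on RGB-DN videos means each frame carries color, depth, and surface normals. The sketch below shows one simple way of packing such frames into a single channels-first training tensor; the channel layout and normalization choices are illustrative assumptions rather than TesserAct's actual pipeline.

```python
# Assumed packing of RGB-DN video frames into a single training tensor:
# 3 color + 1 depth + 3 normal channels per frame. Layout is illustrative only.
import numpy as np

def pack_rgbdn(rgb: np.ndarray,     # (T, H, W, 3), values in [0, 1]
               depth: np.ndarray,   # (T, H, W), metric depth
               normal: np.ndarray   # (T, H, W, 3), unit normals in [-1, 1]
               ) -> np.ndarray:     # (T, 7, H, W) channels-first video tensor
    d = depth[..., None] / (depth.max() + 1e-8)       # crude per-clip depth normalization
    n = (normal + 1.0) / 2.0                          # map normals into [0, 1]
    frames = np.concatenate([rgb, d, n], axis=-1)     # (T, H, W, 7)
    return frames.transpose(0, 3, 1, 2).astype(np.float32)
```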
arXiv Detail & Related papers (2025-04-29T17:59:30Z)
- Video4DGen: Enhancing Video and 4D Generation through Mutual Optimization [31.956858341885436]
Video4DGen is a novel framework that excels in generating 4D representations from single or multiple generated videos.
Video4DGen offers a powerful tool for applications in virtual reality, animation, and beyond.
arXiv Detail & Related papers (2025-04-05T12:13:05Z)
- Easi3R: Estimating Disentangled Motion from DUSt3R Without Training [69.51086319339662]
We introduce Easi3R, a simple yet efficient training-free method for 4D reconstruction.
Our approach applies attention adaptation during inference, eliminating the need for from-scratch pre-training or network fine-tuning.
Our experiments on real-world dynamic videos demonstrate that our lightweight attention adaptation significantly outperforms previous state-of-the-art methods.
arXiv Detail & Related papers (2025-03-31T17:59:58Z)
- Towards Physical Understanding in Video Generation: A 3D Point Regularization Approach [54.559847511280545]
We present a novel video generation framework that integrates 3-dimensional geometry and dynamic awareness.
To achieve this, we augment 2D videos with 3D point trajectories and align them in pixel space.
The resulting 3D-aware video dataset, PointVid, is then used to fine-tune a latent diffusion model.
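Augmenting 2D videos with 3D point trajectories aligned in pixel space amounts to projecting each tracked 3D point into every frame. The snippet below sketches that projection step under a simple pinhole camera model; the function name, tensor shapes, and camera assumptions are illustrative and do not reproduce the PointVid data pipeline.

```python
# Assumed pinhole projection of per-frame 3D point trajectories into pixel
# coordinates, yielding pixel-aligned tracks (plus depth) that can be paired
# with the corresponding 2D video frames. Illustrative only.
import numpy as np

def project_trajectories(points_cam: np.ndarray,  # (T, N, 3) camera-frame points
                         K: np.ndarray            # (3, 3) camera intrinsics
                         ) -> tuple[np.ndarray, np.ndarray]:
    z = points_cam[..., 2:3].clip(min=1e-6)          # keep points in front of the camera
    uv_hom = (points_cam / z) @ K.T                  # (T, N, 3) homogeneous pixel coords
    return uv_hom[..., :2], z[..., 0]                # (T, N, 2) pixels, (T, N) depth
```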
arXiv Detail & Related papers (2025-02-05T21:49:06Z)
- Human4DiT: 360-degree Human Video Generation with 4D Diffusion Transformer [38.85054820740242]
We present a novel approach for generating high-quality, coherent human videos from a single image.
Our framework combines the strengths of diffusion transformers for capturing global correlations and CNNs for accurate condition injection.
We demonstrate our method's ability to synthesize 360-degree realistic, coherent human motion videos.
arXiv Detail & Related papers (2024-05-27T17:53:29Z)
- Efficient4D: Fast Dynamic 3D Object Generation from a Single-view Video [42.10482273572879]
We propose an efficient video-to-4D object generation framework called Efficient4D.
It generates high-quality spacetime-consistent images under different camera views, and then uses them as labeled data.
Experiments on both synthetic and real videos show that Efficient4D offers a remarkable 10-fold increase in speed.
arXiv Detail & Related papers (2024-01-16T18:58:36Z)
- AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
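One common reading of "robust normalization" is scaling by outlier-resistant statistics such as the per-channel median and interquartile range. The sketch below assumes that interpretation for a volumetric latent and provides the matching de-normalization; it is an illustrative guess, not the operator used in the paper.

```python
# Assumed median/IQR ("robust") per-channel normalization of a volumetric
# latent, with the matching de-normalization. Illustrative interpretation only.
import torch

def robust_normalize(latent: torch.Tensor):          # (C, D, H, W)
    flat = latent.flatten(1)                          # (C, D*H*W)
    med = flat.median(dim=1).values                   # per-channel median
    q1 = flat.quantile(0.25, dim=1)
    q3 = flat.quantile(0.75, dim=1)
    iqr = (q3 - q1).clamp(min=1e-6)                   # avoid division by zero
    stats = (med.view(-1, 1, 1, 1), iqr.view(-1, 1, 1, 1))
    return (latent - stats[0]) / stats[1], stats

def robust_denormalize(normed: torch.Tensor, stats) -> torch.Tensor:
    med, iqr = stats
    return normed * iqr + med
```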
arXiv Detail & Related papers (2023-07-07T17:59:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.