DeepVerse: 4D Autoregressive Video Generation as a World Model
- URL: http://arxiv.org/abs/2506.01103v1
- Date: Sun, 01 Jun 2025 17:58:36 GMT
- Title: DeepVerse: 4D Autoregressive Video Generation as a World Model
- Authors: Junyi Chen, Haoyi Zhu, Xianglong He, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Zhoujie Fu, Jiangmiao Pang, Tong He
- Abstract summary: We introduce DeepVerse, a novel 4D interactive world model explicitly incorporating geometric predictions from previous timesteps into current predictions conditioned on actions. Experiments demonstrate that by incorporating explicit geometric constraints, DeepVerse captures richer spatio-temporal relationships and underlying physical dynamics. This capability significantly reduces drift and enhances temporal consistency, enabling the model to reliably generate extended future sequences.
- Score: 16.877309608945566
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: World models serve as essential building blocks toward Artificial General Intelligence (AGI), enabling intelligent agents to predict future states and plan actions by simulating complex physical interactions. However, existing interactive models primarily predict visual observations, thereby neglecting crucial hidden states like geometric structures and spatial coherence. This leads to rapid error accumulation and temporal inconsistency. To address these limitations, we introduce DeepVerse, a novel 4D interactive world model explicitly incorporating geometric predictions from previous timesteps into current predictions conditioned on actions. Experiments demonstrate that by incorporating explicit geometric constraints, DeepVerse captures richer spatio-temporal relationships and underlying physical dynamics. This capability significantly reduces drift and enhances temporal consistency, enabling the model to reliably generate extended future sequences and achieve substantial improvements in prediction accuracy, visual realism, and scene rationality. Furthermore, our method provides an effective solution for geometry-aware memory retrieval, effectively preserving long-term spatial consistency. We validate the effectiveness of DeepVerse across diverse scenarios, establishing its capacity for high-fidelity, long-horizon predictions grounded in geometry-aware dynamics.
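The abstract describes an autoregressive rollout in which each prediction is conditioned not only on the previous visual observation but also on the previously predicted geometry and the current action. The toy sketch below illustrates that interface; all names and the transition function are illustrative assumptions, not DeepVerse's actual API.

```python
# Hypothetical sketch of a geometry-conditioned autoregressive rollout,
# as described in the abstract. Names and dynamics are illustrative only.
from dataclasses import dataclass
from typing import List


@dataclass
class WorldState:
    frame: List[float]  # stand-in for an RGB observation
    depth: List[float]  # stand-in for the predicted geometry


def predict_step(state: WorldState, action: float) -> WorldState:
    """Toy transition: the next observation depends on both the previous
    frame and the previous depth, which is what distinguishes a 4D world
    model from a purely visual one."""
    next_frame = [f + 0.5 * d + action for f, d in zip(state.frame, state.depth)]
    next_depth = [d + 0.1 * action for d in state.depth]
    return WorldState(next_frame, next_depth)


def rollout(initial: WorldState, actions: List[float]) -> List[WorldState]:
    """Autoregressive loop: each step consumes the full previous state
    (frame + geometry), not just the last frame."""
    states = [initial]
    for a in actions:
        states.append(predict_step(states[-1], a))
    return states


states = rollout(WorldState([0.0, 0.0], [1.0, 1.0]), [1.0, 1.0, 1.0])
print(len(states))  # initial state plus one prediction per action
```

Feeding the predicted geometry back into the next step is the structural point: it gives the model an explicit hidden state beyond pixels, which the paper argues is what curbs drift over long horizons.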
Related papers
- StarPose: 3D Human Pose Estimation via Spatial-Temporal Autoregressive Diffusion [29.682018018059043]
StarPose is an autoregressive diffusion framework for 3D human pose estimation. It incorporates historical 3D pose predictions and spatial-temporal physical guidance. It achieves superior accuracy and temporal consistency in 3D human pose estimation.
arXiv Detail & Related papers (2025-08-04T04:50:05Z) - Next-Generation Conflict Forecasting: Unleashing Predictive Patterns through Spatiotemporal Learning [0.0]
This study presents a neural network architecture for forecasting three distinct types of violence up to 36 months in advance. The model jointly performs probabilistic classification and regression tasks, producing both estimates and expected magnitudes of future events. It is a promising tool for warning systems, humanitarian response planning, and evidence-based peacebuilding initiatives.
arXiv Detail & Related papers (2025-06-08T20:42:29Z) - StateSpaceDiffuser: Bringing Long Context to Diffusion World Models [53.05314852577144]
We introduce StateSpaceDiffuser, where a diffusion model is enabled to perform long-context tasks by integrating features from a state-space model. This design restores long-term memory while preserving the high-fidelity synthesis of diffusion models. Experiments show that StateSpaceDiffuser significantly outperforms a strong diffusion-only baseline.
arXiv Detail & Related papers (2025-05-28T11:27:54Z) - Learning 3D Persistent Embodied World Models [84.40585374179037]
We introduce a new persistent embodied world model with an explicit memory of previously generated content. During generation, our video diffusion model predicts RGB-D video of the agent's future observations. This generation is then aggregated into a persistent 3D map of the environment.
arXiv Detail & Related papers (2025-05-05T17:59:17Z) - Geometry-aware Active Learning of Spatiotemporal Dynamic Systems [4.251030047034566]
This paper proposes a geometry-aware active learning framework for modeling dynamic systems. We develop an adaptive active learning strategy to strategically identify spatial locations for data collection and further maximize the prediction accuracy.
arXiv Detail & Related papers (2025-04-26T19:56:38Z) - Conservation-informed Graph Learning for Spatiotemporal Dynamics Prediction [84.26340606752763]
In this paper, we introduce the conservation-informed GNN (CiGNN), an end-to-end explainable learning framework. The network is designed to conform to a general symmetry conservation law, where conservative and non-conservative information passes over a multiscale space via a latent temporal marching strategy. Results demonstrate that CiGNN exhibits remarkable accuracy and generalizability over baselines, and is readily applicable to learning for prediction of various spatiotemporal dynamics.
arXiv Detail & Related papers (2024-12-30T13:55:59Z) - GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image [94.56927147492738]
We introduce GeoWizard, a new generative foundation model designed for estimating geometric attributes from single images.
We show that leveraging diffusion priors can markedly improve generalization, detail preservation, and efficiency in resource usage.
We propose a simple yet effective strategy to segregate the complex data distribution of various scenes into distinct sub-distributions.
arXiv Detail & Related papers (2024-03-18T17:50:41Z) - Autoregressive Uncertainty Modeling for 3D Bounding Box Prediction [63.3021778885906]
3D bounding boxes are a widespread intermediate representation in many computer vision applications.
We propose methods for leveraging our autoregressive model to make high confidence predictions and meaningful uncertainty measures.
We release a simulated dataset, COB-3D, which highlights new types of ambiguity that arise in real-world robotics applications.
arXiv Detail & Related papers (2022-10-13T23:57:40Z) - A Spatio-temporal Transformer for 3D Human Motion Prediction [39.31212055504893]
We propose a Transformer-based architecture for the task of generative modelling of 3D human motion.
We empirically show that this effectively learns the underlying motion dynamics and reduces the error accumulation over time observed in autoregressive models.
arXiv Detail & Related papers (2020-04-18T19:49:28Z) - A Spatial-Temporal Attentive Network with Spatial Continuity for Trajectory Prediction [74.00750936752418]
We propose a novel model named spatial-temporal attentive network with spatial continuity (STAN-SC).
First, a spatial-temporal attention mechanism is presented to explore the most useful and important information.
Second, we construct a joint feature sequence from the sequence and instantaneous state information so that the generated trajectories maintain spatial continuity.
arXiv Detail & Related papers (2020-03-13T04:35:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.