Physical Informed Driving World Model
- URL: http://arxiv.org/abs/2412.08410v2
- Date: Fri, 13 Dec 2024 02:05:34 GMT
- Title: Physical Informed Driving World Model
- Authors: Zhuoran Yang, Xi Guo, Chenjing Ding, Chiyu Wang, Wei Wu
- Abstract summary: DrivePhysica is an innovative model designed to generate realistic driving videos that adhere to essential physical principles.
We achieve state-of-the-art performance in driving video generation quality (3.96 FID and 38.06 FVD on the nuScenes dataset) and downstream perception tasks.
- Score: 47.04423342994622
- License:
- Abstract: Autonomous driving requires robust perception models trained on high-quality, large-scale multi-view driving videos for tasks like 3D object detection, segmentation, and trajectory prediction. While world models provide a cost-effective solution for generating realistic driving videos, challenges remain in ensuring these videos adhere to fundamental physical principles, such as relative and absolute motion, spatial relationships like occlusion and spatial consistency, and temporal consistency. To address these, we propose DrivePhysica, an innovative model designed to generate realistic multi-view driving videos that accurately adhere to essential physical principles through three key advancements: (1) a Coordinate System Aligner module that integrates relative and absolute motion features to enhance motion interpretation, (2) an Instance Flow Guidance module that ensures precise temporal consistency via efficient 3D flow extraction, and (3) a Box Coordinate Guidance module that improves spatial relationship understanding and accurately resolves occlusion hierarchies. Grounded in physical principles, we achieve state-of-the-art performance in driving video generation quality (3.96 FID and 38.06 FVD on the nuScenes dataset) and downstream perception tasks. Our project homepage: https://metadrivescape.github.io/papers_project/DrivePhysica/page.html
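The abstract names three conditioning modules but gives no implementation details. Below is a minimal, illustrative PyTorch sketch of how two of the conditions named above (relative/absolute motion alignment and 3D box coordinates) might be encoded and injected into a video diffusion backbone via cross-attention. All class names, tensor shapes, and layer choices here are assumptions made for illustration, not the authors' code.

```python
# Illustrative sketch only: the module names follow the abstract, but every tensor
# shape, layer choice, and fusion strategy below is an assumption, not DrivePhysica's
# actual implementation.
import torch
import torch.nn as nn


class CoordinateSystemAligner(nn.Module):
    """Fuses absolute (world-frame) and relative (ego-frame) motion features."""
    def __init__(self, dim=256):
        super().__init__()
        self.abs_proj = nn.Linear(6, dim)   # e.g. ego pose / velocity (assumed 6-D)
        self.rel_proj = nn.Linear(6, dim)   # e.g. per-agent motion relative to the ego vehicle
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, abs_motion, rel_motion):          # (B, N, 6) each
        a = self.abs_proj(abs_motion)
        r = self.rel_proj(rel_motion)
        return self.fuse(torch.cat([a, r], dim=-1))      # (B, N, dim)


class BoxCoordinateGuidance(nn.Module):
    """Encodes 3D box corners so the backbone can reason about occlusion order."""
    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(8 * 3, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, boxes):                            # (B, N, 8, 3) corner coordinates
        return self.mlp(boxes.flatten(-2))               # (B, N, dim)


class ConditionFusion(nn.Module):
    """Cross-attends video tokens to the concatenated condition tokens."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_tokens, cond_tokens):        # (B, L, dim), (B, M, dim)
        out, _ = self.attn(video_tokens, cond_tokens, cond_tokens)
        return video_tokens + out


if __name__ == "__main__":
    B, N, L, dim = 2, 12, 64, 256
    aligner, box_enc, fusion = CoordinateSystemAligner(dim), BoxCoordinateGuidance(dim), ConditionFusion(dim)
    cond = torch.cat([aligner(torch.randn(B, N, 6), torch.randn(B, N, 6)),
                      box_enc(torch.randn(B, N, 8, 3))], dim=1)     # (B, 2N, dim)
    print(fusion(torch.randn(B, L, dim), cond).shape)               # torch.Size([2, 64, 256])
```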
Related papers
- InterDyn: Controllable Interactive Dynamics with Video Diffusion Models [50.38647583839384]
We propose InterDyn, a framework that generates videos of interactive dynamics given an initial frame and a control signal encoding the motion of a driving object or actor.
Our key insight is that large video foundation models can act as both neural renderers and implicit physics simulators by learning interactive dynamics from large-scale video data.
arXiv Detail & Related papers (2024-12-16T13:57:02Z)
- Stag-1: Towards Realistic 4D Driving Simulation with Video Generation Model [83.31688383891871]
We propose a Spatial-Temporal simulAtion for drivinG (Stag-1) model to reconstruct real-world scenes.
Stag-1 constructs continuous 4D point cloud scenes using surround-view data from autonomous vehicles.
It decouples spatial-temporal relationships and produces coherent driving videos.
arXiv Detail & Related papers (2024-12-06T18:59:56Z)
- Seeing Beyond Views: Multi-View Driving Scene Video Generation with Holistic Attention [61.3281618482513]
We present CogDriving, a novel network designed for synthesizing high-quality multi-view driving videos.
CogDriving leverages a Diffusion Transformer architecture with holistic-4D attention modules, enabling simultaneous associations across the spatial, temporal, and viewpoint dimensions.
CogDriving demonstrates strong performance on the nuScenes validation set, achieving an FVD score of 37.8, highlighting its ability to generate realistic driving videos.
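For intuition, here is a minimal sketch of what joint attention across view, time, height, and width could look like; shapes and layer sizes are assumptions for illustration, not the CogDriving implementation.

```python
# Minimal sketch of the idea behind holistic-4D attention: a single self-attention
# over tokens from all views, frames, and spatial positions. Shapes are assumptions.
import torch
import torch.nn as nn


class Holistic4DAttention(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, V, T, H, W, C) -- multi-view video tokens
        B, V, T, H, W, C = x.shape
        tokens = x.reshape(B, V * T * H * W, C)          # flatten all four axes into one sequence
        n = self.norm(tokens)
        out, _ = self.attn(n, n, n)
        return (tokens + out).reshape(B, V, T, H, W, C)  # residual, then restore the 4D layout


if __name__ == "__main__":
    x = torch.randn(1, 6, 4, 8, 8, 128)                  # 6 cameras, 4 frames, 8x8 latent grid
    print(Holistic4DAttention()(x).shape)                # torch.Size([1, 6, 4, 8, 8, 128])
```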
arXiv Detail & Related papers (2024-12-04T18:02:49Z)
- MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion [118.74385965694694]
We present Motion DUSt3R (MonST3R), a novel geometry-first approach that directly estimates per-timestep geometry from dynamic scenes.
By simply estimating a pointmap for each timestep, we can effectively adapt DUSt3R's representation, previously only used for static scenes, to dynamic scenes.
We show that by posing the problem as a fine-tuning task, identifying several suitable datasets, and strategically training the model on this limited data, we can surprisingly enable the model to handle dynamics.
arXiv Detail & Related papers (2024-10-04T18:00:07Z)
- Ctrl-V: Higher Fidelity Video Generation with Bounding-Box Controlled Object Motion [8.068194154084967]
This paper tackles the challenge of exerting precise control over object motion for realistic video synthesis.
To accomplish this, we control object movements using bounding boxes and extend this control to the renderings of 2D or 3D boxes in pixel space.
Our method, Ctrl-V, leverages modified and fine-tuned Stable Video Diffusion (SVD) models to solve both trajectory and video generation.
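A hedged sketch of the general recipe of rendering bounding boxes into pixel space as a conditioning signal for a video diffusion model; the drawing routine, box format, and channel layout below are assumptions for illustration, not Ctrl-V's code.

```python
# Sketch of box-to-pixel conditioning: draw per-frame 2D box outlines into a tensor
# that could be concatenated with (or encoded alongside) the denoiser's latents.
import torch


def render_box_frames(boxes, num_frames, height, width):
    """boxes: list per frame of (x1, y1, x2, y2, class_id) tuples in pixel units.
    Returns a (num_frames, 1, height, width) tensor with box outlines drawn in."""
    frames = torch.zeros(num_frames, 1, height, width)
    for t, frame_boxes in enumerate(boxes):
        for (x1, y1, x2, y2, cls) in frame_boxes:
            val = float(cls + 1)                 # encode class identity as intensity
            frames[t, 0, y1:y2, x1] = val        # left edge
            frames[t, 0, y1:y2, x2 - 1] = val    # right edge
            frames[t, 0, y1, x1:x2] = val        # top edge
            frames[t, 0, y2 - 1, x1:x2] = val    # bottom edge
    return frames


if __name__ == "__main__":
    # One box of class 0 moving right over 3 frames.
    boxes = [[(10 + 5 * t, 20, 40 + 5 * t, 50, 0)] for t in range(3)]
    cond = render_box_frames(boxes, num_frames=3, height=64, width=64)
    print(cond.shape)   # torch.Size([3, 1, 64, 64])
```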
arXiv Detail & Related papers (2024-06-09T03:44:35Z)
- Vision-based Discovery of Nonlinear Dynamics for 3D Moving Target [11.102585080028945]
We propose a vision-based approach to automatically uncover governing equations of nonlinear dynamics for 3D moving targets via raw videos recorded by a set of cameras.
This framework is capable of effectively handling the challenges associated with measurement data, e.g., noise in the video, imprecise tracking of the target that causes missing data, etc.
arXiv Detail & Related papers (2024-04-27T11:13:55Z)
- DO3D: Self-supervised Learning of Decomposed Object-aware 3D Motion and Depth from Monocular Videos [76.01906393673897]
We propose a self-supervised method to jointly learn 3D motion and depth from monocular videos.
Our system contains a depth estimation module to predict depth, and a new decomposed object-wise 3D motion (DO3D) estimation module to predict ego-motion and 3D object motion.
Our model delivers superior performance in all evaluated settings.
arXiv Detail & Related papers (2024-03-09T12:22:46Z)
- Realistic Full-Body Tracking from Sparse Observations via Joint-Level Modeling [13.284947022380404]
We propose a two-stage framework that can obtain accurate and smooth full-body motions with three tracking signals of head and hands only.
Our framework explicitly models the joint-level features in the first stage and utilizes them as temporal tokens for alternating spatial and temporal transformer blocks to capture joint-level correlations in the second stage.
With extensive experiments on the AMASS motion dataset and real-captured data, we show that our proposed method achieves more accurate and smoother motion than existing approaches.
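A rough sketch of alternating spatial and temporal attention over joint-level tokens, in the spirit of the two-stage design described above; the joint count, feature dimension, and block layout are assumptions, not the paper's implementation.

```python
# One alternating block: attend across joints within each frame, then across time
# for each joint. Shapes are illustrative assumptions.
import torch
import torch.nn as nn


class AlternatingSTBlock(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        # x: (B, T, J, C) -- per-frame, per-joint feature tokens
        B, T, J, C = x.shape
        s = self.norm1(x).reshape(B * T, J, C)                        # joints within a frame
        x = x + self.spatial(s, s, s)[0].reshape(B, T, J, C)
        t = self.norm2(x).permute(0, 2, 1, 3).reshape(B * J, T, C)    # time steps per joint
        x = x + self.temporal(t, t, t)[0].reshape(B, J, T, C).permute(0, 2, 1, 3)
        return x


if __name__ == "__main__":
    x = torch.randn(2, 30, 22, 64)        # 30 frames, 22 body joints
    print(AlternatingSTBlock()(x).shape)  # torch.Size([2, 30, 22, 64])
```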
arXiv Detail & Related papers (2023-08-17T08:27:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.