InfinityDrive: Breaking Time Limits in Driving World Models
- URL: http://arxiv.org/abs/2412.01522v2
- Date: Wed, 04 Dec 2024 02:09:07 GMT
- Title: InfinityDrive: Breaking Time Limits in Driving World Models
- Authors: Xi Guo, Chenjing Ding, Haoxuan Dou, Xin Zhang, Weixuan Tang, Wei Wu
- Abstract summary: We introduce InfinityDrive, the first driving world model with exceptional generalization capabilities.
It delivers state-of-the-art performance in high fidelity, consistency, and diversity with minute-scale video generation.
Tests on multiple datasets validate InfinityDrive's ability to generate complex and varied scenarios.
- Score: 12.041484892881057
- Abstract: Autonomous driving systems struggle with complex scenarios due to limited access to the diverse, extensive, and out-of-distribution driving data that are critical for safe navigation. World models offer a promising solution to this challenge; however, current driving world models are constrained by short time windows and limited scenario diversity. To bridge this gap, we introduce InfinityDrive, the first driving world model with exceptional generalization capabilities, delivering state-of-the-art performance in high fidelity, consistency, and diversity with minute-scale video generation. InfinityDrive introduces an efficient spatio-temporal co-modeling module paired with an extended temporal training strategy, enabling high-resolution (576$\times$1024) video generation with consistent spatial and temporal coherence. By incorporating memory injection and retention mechanisms alongside an adaptive memory curve loss to minimize cumulative errors, InfinityDrive achieves consistent video generation lasting over 1500 frames (more than 2 minutes). Comprehensive experiments on multiple datasets validate InfinityDrive's ability to generate complex and varied scenarios, highlighting its potential as a next-generation driving world model built for the evolving demands of autonomous driving. Our project homepage: https://metadrivescape.github.io/papers_project/InfinityDrive/page.html
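The abstract names an adaptive memory curve loss but gives no formula; one plausible reading is a per-frame reconstruction loss reweighted along the time axis so that later frames, where autoregressive error accumulates, count more. The function below is a minimal sketch under that assumption; the exponential curve and decay parameter are hypothetical, not the paper's actual loss.

```python
import torch

def memory_curve_loss(pred, target, decay=0.05):
    """Hypothetical weighted reconstruction loss: later frames get larger
    weights to counter cumulative error in long autoregressive rollouts.
    The paper's 'adaptive memory curve loss' is not specified in the
    abstract, so an exponential ramp is assumed here.

    pred, target: (batch, time, channels, height, width) video tensors.
    """
    t = pred.shape[1]
    # Per-frame MSE averaged over channels and pixels: shape (batch, time).
    per_frame = ((pred - target) ** 2).mean(dim=(2, 3, 4))
    # Exponentially increasing weights over time, normalized to sum to 1.
    weights = torch.exp(decay * torch.arange(t, dtype=pred.dtype, device=pred.device))
    weights = weights / weights.sum()
    return (per_frame * weights).sum(dim=1).mean()
```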
Related papers
- DrivingWorld: Constructing World Model for Autonomous Driving via Video GPT [33.943125216555316]
We present DrivingWorld, a GPT-style world model for autonomous driving.
We propose a next-state prediction strategy to model temporal coherence between consecutive frames.
We also propose a novel masking strategy and reweighting strategy for token prediction to mitigate long-term drifting issues.
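The masking and reweighting strategies are only named in the summary; the sketch below shows one generic way such ideas are realized for GPT-style token prediction: randomly dropping positions from the loss and up-weighting later positions to penalize long-horizon drift. The names, schedule, and weights are assumptions, not DrivingWorld's actual method.

```python
import torch
import torch.nn.functional as F

def reweighted_token_loss(logits, targets, mask_prob=0.15, drift_gamma=0.01):
    """Illustrative (not the paper's) masked and reweighted token loss.

    logits: (batch, seq, vocab) next-token predictions from a GPT-style model.
    targets: (batch, seq) ground-truth token ids.
    A random mask drops some positions from the loss, and later positions
    are up-weighted, one plausible way to penalize long-horizon drift.
    """
    b, s, v = logits.shape
    ce = F.cross_entropy(logits.reshape(-1, v), targets.reshape(-1),
                         reduction="none").reshape(b, s)
    keep = (torch.rand(b, s, device=logits.device) > mask_prob).float()
    pos_weight = 1.0 + drift_gamma * torch.arange(s, device=logits.device)
    weighted = ce * keep * pos_weight
    return weighted.sum() / keep.sum().clamp(min=1.0)
```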
arXiv Detail & Related papers (2024-12-27T07:44:07Z)
- Seeing Beyond Views: Multi-View Driving Scene Video Generation with Holistic Attention [61.3281618482513]
We present CogDriving, a novel network designed for synthesizing high-quality multi-view driving videos.
CogDriving leverages a Diffusion Transformer architecture with holistic-4D attention modules, enabling simultaneous associations across the spatial, temporal, and cross-view dimensions.
CogDriving demonstrates strong performance on the nuScenes validation set, achieving an FVD score of 37.8, highlighting its ability to generate realistic driving videos.
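FVD (Fréchet Video Distance), the metric behind the 37.8 figure, is the Fréchet distance between Gaussians fitted to features of real and generated videos, conventionally extracted with a pretrained I3D network. A minimal sketch of the distance itself, assuming the features have already been extracted:

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between two sets of video features (e.g. from an
    I3D network), the statistic behind FVD scores like the 37.8 above.

    feats_real, feats_gen: (num_videos, feature_dim) arrays.
    """
    mu_r, mu_g = feats_real.mean(0), feats_gen.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```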
arXiv Detail & Related papers (2024-12-04T18:02:49Z)
- DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving [38.867860153968394]
Diffusion model has emerged as a powerful generative technique for robotic policy learning.
We propose a novel truncated diffusion policy that incorporates prior multi-mode anchors and truncates the diffusion schedule.
The proposed model, DiffusionDrive, demonstrates a 10$\times$ reduction in denoising steps compared to the vanilla diffusion policy.
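As a rough sketch of the truncated-schedule idea: instead of denoising from pure Gaussian noise, sampling starts from prior multi-mode anchors perturbed with limited noise, so only a few steps remain. The denoiser interface, schedule, and shapes below are assumptions for illustration, not DiffusionDrive's exact formulation.

```python
import torch

@torch.no_grad()
def truncated_denoise(denoiser, anchors, num_steps=2, start_sigma=0.5):
    """Sketch of truncated diffusion sampling: start from multi-mode
    trajectory anchors plus limited noise instead of pure Gaussian noise,
    so only a few denoising steps are needed.

    denoiser(x, sigma) -> predicted clean trajectories; `denoiser` and the
    linear schedule below are assumptions.
    anchors: (num_modes, horizon, 2) prior trajectory anchors.
    """
    sigmas = torch.linspace(start_sigma, 0.0, num_steps + 1)
    x = anchors + sigmas[0] * torch.randn_like(anchors)
    for i in range(num_steps):
        x0_hat = denoiser(x, sigmas[i])
        if sigmas[i + 1] > 0:
            # Re-noise the clean estimate down to the next noise level.
            x = x0_hat + sigmas[i + 1] * torch.randn_like(x0_hat)
        else:
            x = x0_hat
    return x
```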
arXiv Detail & Related papers (2024-11-22T18:59:47Z)
- MagicDriveDiT: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control [68.74166535159311]
We introduce MagicDriveDiT, a novel approach based on the DiT architecture.
By incorporating spatial-temporal conditional encoding, MagicDriveDiT achieves precise control over spatial-temporal latents.
Experiments show its superior performance in generating realistic street scene videos with higher resolution and more frames.
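The summary does not detail the spatial-temporal conditional encoding; a minimal sketch in the adaLN/FiLM style common to DiT models is shown below, where a per-frame control embedding produces a scale and shift for the video latents. Module and argument names are hypothetical.

```python
import torch
import torch.nn as nn

class SpatioTemporalCondition(nn.Module):
    """Minimal sketch of conditioning spatio-temporal latents, loosely in
    the adaLN style used by DiT models; layer sizes and names are assumptions.
    """
    def __init__(self, latent_dim, cond_dim):
        super().__init__()
        self.norm = nn.LayerNorm(latent_dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * latent_dim)

    def forward(self, x, cond):
        # x: (batch, time, tokens, latent_dim) video latents.
        # cond: (batch, time, cond_dim) per-frame control embedding
        # (e.g. encoded camera pose, boxes, or road map).
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(2)) + shift.unsqueeze(2)
```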
arXiv Detail & Related papers (2024-11-21T03:13:30Z)
- DrivingDojo Dataset: Advancing Interactive and Knowledge-Enriched Driving World Model [65.43473733967038]
We introduce DrivingDojo, the first dataset tailor-made for training interactive world models with complex driving dynamics.
Our dataset features video clips with a complete set of driving maneuvers, diverse multi-agent interplay, and rich open-world driving knowledge.
arXiv Detail & Related papers (2024-10-14T17:19:23Z)
- DriveScape: Towards High-Resolution Controllable Multi-View Driving Video Generation [10.296670127024045]
DriveScape is an end-to-end framework for multi-view, 3D condition-guided video generation.
Our Bi-Directional Modulated Transformer (BiMot) ensures precise alignment of 3D structural information.
DriveScape excels in video generation performance, achieving state-of-the-art results on the nuScenes dataset with an FID score of 8.34 and an FVD score of 76.39.
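Details of BiMot are not given in the summary; one generic reading is a pair of cross-attention passes in which video tokens attend to 3D condition tokens and vice versa. The block below is speculative and only illustrates that bidirectional pattern, not DriveScape's actual module.

```python
import torch
import torch.nn as nn

class BiDirectionalModulation(nn.Module):
    """Speculative sketch of bi-directional cross-attention between video
    tokens and 3D condition tokens (layout boxes, road vectors)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.vid_from_cond = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cond_from_vid = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_tokens, cond_tokens):
        # Each stream attends to the other, so 3D structure modulates the
        # video and the video refines the condition representation.
        v, _ = self.vid_from_cond(video_tokens, cond_tokens, cond_tokens)
        c, _ = self.cond_from_vid(cond_tokens, video_tokens, video_tokens)
        return video_tokens + v, cond_tokens + c
```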
arXiv Detail & Related papers (2024-09-09T09:43:17Z)
- GenAD: Generalized Predictive Model for Autonomous Driving [75.39517472462089]
We introduce the first large-scale video prediction model in the autonomous driving discipline.
Our model, dubbed GenAD, handles the challenging dynamics in driving scenes with novel temporal reasoning blocks.
It can be adapted into an action-conditioned prediction model or a motion planner, holding great potential for real-world driving applications.
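The temporal reasoning blocks are only named here; a plain temporal self-attention block over per-frame features, as sketched below, is one common form such a component takes. Shapes and names are assumptions.

```python
import torch
import torch.nn as nn

class TemporalReasoningBlock(nn.Module):
    """Generic temporal self-attention block as one plausible form of the
    'temporal reasoning blocks' mentioned above; details are assumptions.
    """
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, time, dim) per-frame features; attention runs along
        # the time axis so each frame can reason over the whole clip.
        h = self.norm(x)
        out, _ = self.attn(h, h, h)
        return x + out
```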
arXiv Detail & Related papers (2024-03-14T17:58:33Z)
- Drive Anywhere: Generalizable End-to-end Autonomous Driving with Multi-modal Foundation Models [114.69732301904419]
We present an approach for end-to-end open-set (any environment/scene) autonomous driving that provides driving decisions from representations queryable by image and text.
Our approach demonstrates unparalleled results in diverse tests while achieving significantly greater robustness in out-of-distribution situations.
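One way to read "representations queryable by image and text" is a CLIP-style shared embedding space in which a query embedding scores scene features by cosine similarity; the helper below sketches that interpretation, with all names illustrative.

```python
import torch

def query_features(patch_features, query_embedding):
    """Sketch of querying driving-scene representations with an image or
    text embedding (e.g. from a CLIP-like model); names are illustrative.

    patch_features: (num_patches, dim) scene features assumed to live in
    the same embedding space as the query; query_embedding: (dim,).
    """
    patch = torch.nn.functional.normalize(patch_features, dim=-1)
    query = torch.nn.functional.normalize(query_embedding, dim=-1)
    scores = patch @ query          # cosine similarity per patch
    return scores.softmax(dim=0)    # attention map over the scene
```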
arXiv Detail & Related papers (2023-10-26T17:56:35Z)
- Kick Back & Relax: Learning to Reconstruct the World by Watching SlowTV [68.31957280416347]
Self-supervised monocular depth estimation (SS-MDE) has the potential to scale to vast quantities of data.
We propose a large-scale SlowTV dataset curated from YouTube, containing an order of magnitude more data than existing automotive datasets.
We train an SS-MDE model that provides zero-shot generalization to a large collection of indoor/outdoor datasets.
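SS-MDE pipelines like this are typically trained with a photometric objective: predicted depth (plus pose) warps a neighboring frame onto the target, and an SSIM+L1 mix scores the reconstruction. The sketch below shows only that standard core loss, omitting the warping, intrinsics, and motion handling the paper also addresses.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    # 3x3 mean-pooled local statistics; x, y: (batch, channels, H, W) in [0, 1].
    mu_x = F.avg_pool2d(x, 3, 1, 1)
    mu_y = F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return (num / den).clamp(0, 1)

def photometric_loss(target, reconstructed, alpha=0.85):
    """Core SS-MDE objective: SSIM + L1 between a frame and its
    reconstruction warped from a neighboring frame using predicted depth."""
    l1 = (target - reconstructed).abs()
    dssim = (1 - ssim(target, reconstructed)) / 2
    return (alpha * dssim + (1 - alpha) * l1).mean()
```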
arXiv Detail & Related papers (2023-07-20T09:13:32Z)