DOME: Taming Diffusion Model into High-Fidelity Controllable Occupancy World Model
- URL: http://arxiv.org/abs/2410.10429v1
- Date: Mon, 14 Oct 2024 12:24:32 GMT
- Title: DOME: Taming Diffusion Model into High-Fidelity Controllable Occupancy World Model
- Authors: Songen Gu, Wei Yin, Bu Jin, Xiaoyang Guo, Junming Wang, Haodong Li, Qian Zhang, Xiaoxiao Long,
- Abstract summary: DOME is a diffusion-based world model that predicts future occupancy frames based on past occupancy observations.
The ability of this world model to capture the evolution of the environment is crucial for planning in autonomous driving.
- Score: 14.996395953240699
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We propose DOME, a diffusion-based world model that predicts future occupancy frames based on past occupancy observations. The ability of this world model to capture the evolution of the environment is crucial for planning in autonomous driving. Compared to 2D video-based world models, the occupancy world model utilizes a native 3D representation, which features easily obtainable annotations and is modality-agnostic. This flexibility has the potential to facilitate the development of more advanced world models. Existing occupancy world models either suffer from detail loss due to discrete tokenization or rely on simplistic diffusion architectures, leading to inefficiencies and difficulties in predicting future occupancy with controllability. Our DOME exhibits two key features:(1) High-Fidelity and Long-Duration Generation. We adopt a spatial-temporal diffusion transformer to predict future occupancy frames based on historical context. This architecture efficiently captures spatial-temporal information, enabling high-fidelity details and the ability to generate predictions over long durations. (2)Fine-grained Controllability. We address the challenge of controllability in predictions by introducing a trajectory resampling method, which significantly enhances the model's ability to generate controlled predictions. Extensive experiments on the widely used nuScenes dataset demonstrate that our method surpasses existing baselines in both qualitative and quantitative evaluations, establishing a new state-of-the-art performance on nuScenes. Specifically, our approach surpasses the baseline by 10.5% in mIoU and 21.2% in IoU for occupancy reconstruction and by 36.0% in mIoU and 24.6% in IoU for 4D occupancy forecasting.
Related papers
- D$^2$-World: An Efficient World Model through Decoupled Dynamic Flow [47.361822281431586]
This report summarizes the second-place solution for the Predictive World Model Challenge held at the CVPR-2024 Workshop on Foundation Models for Autonomous Systems.
We introduce D$2$-World, a novel World model that effectively forecasts future point clouds through Decoupled Dynamic flow.
Our approach achieves state-of-the-art performance on the OpenScene Predictive World Model benchmark, securing second place, and trains more than 300% faster than the baseline model.
arXiv Detail & Related papers (2024-11-26T01:42:49Z) - WHALE: Towards Generalizable and Scalable World Models for Embodied Decision-making [40.53824201182517]
This paper introduces WHALE, a framework for learning generalizable world models.
We present Whale-ST, a scalable spatial-temporal transformer-based world model with enhanced generalizability.
We also propose Whale-X, a 414M parameter world model trained on 970K trajectories from Open X-Embodiment datasets.
arXiv Detail & Related papers (2024-11-08T15:01:27Z) - OPUS: Occupancy Prediction Using a Sparse Set [64.60854562502523]
We present a framework to simultaneously predict occupied locations and classes using a set of learnable queries.
OPUS incorporates a suite of non-trivial strategies to enhance model performance.
Our lightest model achieves superior RayIoU on the Occ3D-nuScenes dataset at near 2x FPS, while our heaviest model surpasses previous best results by 6.1 RayIoU.
arXiv Detail & Related papers (2024-09-14T07:44:22Z) - AdaOcc: Adaptive Forward View Transformation and Flow Modeling for 3D Occupancy and Flow Prediction [56.72301849123049]
We present our solution for the Vision-Centric 3D Occupancy and Flow Prediction track in the nuScenes Open-Occ dataset challenge at CVPR 2024.
Our innovative approach involves a dual-stage framework that enhances 3D occupancy and flow predictions by incorporating adaptive forward view transformation and flow modeling.
Our method combines regression with classification to address scale variations in different scenes, and leverages predicted flow to warp current voxel features to future frames, guided by future frame ground truth.
arXiv Detail & Related papers (2024-07-01T16:32:15Z) - Regions are Who Walk Them: a Large Pre-trained Spatiotemporal Model
Based on Human Mobility for Ubiquitous Urban Sensing [24.48869607589127]
We propose a largetemporal model based on trajectories (RAW) to tap into the rich information within human mobility data.
Our proposed method, relying solely on human mobility data without additional features, exhibits certain level of relevance in user profiling and region analysis.
arXiv Detail & Related papers (2023-11-17T11:55:11Z) - Foundation Models for Generalist Geospatial Artificial Intelligence [3.7002058945990415]
This paper introduces a first-of-a-kind framework for the efficient pre-training and fine-tuning of foundational models on extensive data.
We have utilized this framework to create Prithvi, a transformer-based foundational model pre-trained on more than 1TB of multispectral satellite imagery.
arXiv Detail & Related papers (2023-10-28T10:19:55Z) - Generative Modeling with Phase Stochastic Bridges [49.4474628881673]
Diffusion models (DMs) represent state-of-the-art generative models for continuous inputs.
We introduce a novel generative modeling framework grounded in textbfphase space dynamics
Our framework demonstrates the capability to generate realistic data points at an early stage of dynamics propagation.
arXiv Detail & Related papers (2023-10-11T18:38:28Z) - Predictive World Models from Real-World Partial Observations [66.80340484148931]
We present a framework for learning a probabilistic predictive world model for real-world road environments.
While prior methods require complete states as ground truth for learning, we present a novel sequential training method to allow HVAEs to learn to predict complete states from partially observed states only.
arXiv Detail & Related papers (2023-01-12T02:07:26Z) - A generic diffusion-based approach for 3D human pose prediction in the
wild [68.00961210467479]
3D human pose forecasting, i.e., predicting a sequence of future human 3D poses given a sequence of past observed ones, is a challenging-temporal task.
We provide a unified formulation in which incomplete elements (no matter in the prediction or observation) are treated as noise and propose a conditional diffusion model that denoises them and forecasts plausible poses.
We investigate our findings on four standard datasets and obtain significant improvements over the state-of-the-art.
arXiv Detail & Related papers (2022-10-11T17:59:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.