Related papers: DOME: Taming Diffusion Model into High-Fidelity Controllable Occupancy World Model

DOME: Taming Diffusion Model into High-Fidelity Controllable Occupancy World Model

URL: http://arxiv.org/abs/2410.10429v1
Date: Mon, 14 Oct 2024 12:24:32 GMT
Title: DOME: Taming Diffusion Model into High-Fidelity Controllable Occupancy World Model
Authors: Songen Gu, Wei Yin, Bu Jin, Xiaoyang Guo, Junming Wang, Haodong Li, Qian Zhang, Xiaoxiao Long,
Abstract summary: DOME is a diffusion-based world model that predicts future occupancy frames based on past occupancy observations. The ability of this world model to capture the evolution of the environment is crucial for planning in autonomous driving.
Score: 14.996395953240699
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: We propose DOME, a diffusion-based world model that predicts future occupancy frames based on past occupancy observations. The ability of this world model to capture the evolution of the environment is crucial for planning in autonomous driving. Compared to 2D video-based world models, the occupancy world model utilizes a native 3D representation, which features easily obtainable annotations and is modality-agnostic. This flexibility has the potential to facilitate the development of more advanced world models. Existing occupancy world models either suffer from detail loss due to discrete tokenization or rely on simplistic diffusion architectures, leading to inefficiencies and difficulties in predicting future occupancy with controllability. Our DOME exhibits two key features:(1) High-Fidelity and Long-Duration Generation. We adopt a spatial-temporal diffusion transformer to predict future occupancy frames based on historical context. This architecture efficiently captures spatial-temporal information, enabling high-fidelity details and the ability to generate predictions over long durations. (2)Fine-grained Controllability. We address the challenge of controllability in predictions by introducing a trajectory resampling method, which significantly enhances the model's ability to generate controlled predictions. Extensive experiments on the widely used nuScenes dataset demonstrate that our method surpasses existing baselines in both qualitative and quantitative evaluations, establishing a new state-of-the-art performance on nuScenes. Specifically, our approach surpasses the baseline by 10.5% in mIoU and 21.2% in IoU for occupancy reconstruction and by 36.0% in mIoU and 24.6% in IoU for 4D occupancy forecasting.

Related papers

Towards foundational LiDAR world models with efficient latent flow matching [9.86884512471034]
Existing LiDAR world models are narrowly trained; each model excels only in the domain for which it was built.<n>We conduct the first systematic domain transfer study across three demanding scenarios.<n>Given different amounts of fine-tuning data, our experiments show that a single pre-trained model can achieve up to 11% absolute improvement.
arXiv Detail & Related papers (2025-06-30T00:16:55Z)
Diffusion-Based Generative Models for 3D Occupancy Prediction in Autonomous Driving [27.94544631535978]
generative models learn the underlying data distribution and incorporate 3D scene priors.<n>Our experiments show that diffusion-based generative models outperform state-of-the-art discriminative approaches.
arXiv Detail & Related papers (2025-05-29T05:34:22Z)
LaDi-WM: A Latent Diffusion-based World Model for Predictive Manipulation [51.834607121538724]
We propose LaDi-WM, a world model that predicts the latent space of future states using diffusion modeling.<n>We show that LaDi-WM significantly enhances policy performance by 27.9% on the LIBERO-LONG benchmark and 20% on the real-world scenario.
arXiv Detail & Related papers (2025-05-13T04:42:14Z)
A Survey of World Models for Autonomous Driving [63.33363128964687]
Recent breakthroughs in autonomous driving have been propelled by advances in robust world modeling. This paper systematically reviews recent advances in world models for autonomous driving.
arXiv Detail & Related papers (2025-01-20T04:00:02Z)
An Efficient Occupancy World Model via Decoupled Dynamic Flow and Image-assisted Training [50.71892161377806]
DFIT-OccWorld is an efficient 3D occupancy world model that leverages decoupled dynamic flow and image-assisted training strategy. Our model forecasts future dynamic voxels by warping existing observations using voxel flow, whereas static voxels are easily obtained through pose transformation.
arXiv Detail & Related papers (2024-12-18T12:10:33Z)
D$^2$-World: An Efficient World Model through Decoupled Dynamic Flow [47.361822281431586]
This report summarizes the second-place solution for the Predictive World Model Challenge held at the CVPR-2024 Workshop on Foundation Models for Autonomous Systems. We introduce D$2$-World, a novel World model that effectively forecasts future point clouds through Decoupled Dynamic flow. Our approach achieves state-of-the-art performance on the OpenScene Predictive World Model benchmark, securing second place, and trains more than 300% faster than the baseline model.
arXiv Detail & Related papers (2024-11-26T01:42:49Z)
WHALE: Towards Generalizable and Scalable World Models for Embodied Decision-making [40.53824201182517]
This paper introduces WHALE, a framework for learning generalizable world models. We present Whale-ST, a scalable spatial-temporal transformer-based world model with enhanced generalizability. We also propose Whale-X, a 414M parameter world model trained on 970K trajectories from Open X-Embodiment datasets.
arXiv Detail & Related papers (2024-11-08T15:01:27Z)
OPUS: Occupancy Prediction Using a Sparse Set [64.60854562502523]
We present a framework to simultaneously predict occupied locations and classes using a set of learnable queries. OPUS incorporates a suite of non-trivial strategies to enhance model performance. Our lightest model achieves superior RayIoU on the Occ3D-nuScenes dataset at near 2x FPS, while our heaviest model surpasses previous best results by 6.1 RayIoU.
arXiv Detail & Related papers (2024-09-14T07:44:22Z)
AdaOcc: Adaptive Forward View Transformation and Flow Modeling for 3D Occupancy and Flow Prediction [56.72301849123049]
We present our solution for the Vision-Centric 3D Occupancy and Flow Prediction track in the nuScenes Open-Occ dataset challenge at CVPR 2024. Our innovative approach involves a dual-stage framework that enhances 3D occupancy and flow predictions by incorporating adaptive forward view transformation and flow modeling. Our method combines regression with classification to address scale variations in different scenes, and leverages predicted flow to warp current voxel features to future frames, guided by future frame ground truth.
arXiv Detail & Related papers (2024-07-01T16:32:15Z)
Regions are Who Walk Them: a Large Pre-trained Spatiotemporal Model Based on Human Mobility for Ubiquitous Urban Sensing [24.48869607589127]
We propose a largetemporal model based on trajectories (RAW) to tap into the rich information within human mobility data. Our proposed method, relying solely on human mobility data without additional features, exhibits certain level of relevance in user profiling and region analysis.
arXiv Detail & Related papers (2023-11-17T11:55:11Z)
Foundation Models for Generalist Geospatial Artificial Intelligence [3.7002058945990415]
This paper introduces a first-of-a-kind framework for the efficient pre-training and fine-tuning of foundational models on extensive data. We have utilized this framework to create Prithvi, a transformer-based foundational model pre-trained on more than 1TB of multispectral satellite imagery.
arXiv Detail & Related papers (2023-10-28T10:19:55Z)
Generative Modeling with Phase Stochastic Bridges [49.4474628881673]
Diffusion models (DMs) represent state-of-the-art generative models for continuous inputs. We introduce a novel generative modeling framework grounded in textbfphase space dynamics Our framework demonstrates the capability to generate realistic data points at an early stage of dynamics propagation.
arXiv Detail & Related papers (2023-10-11T18:38:28Z)
Predictive World Models from Real-World Partial Observations [66.80340484148931]
We present a framework for learning a probabilistic predictive world model for real-world road environments. While prior methods require complete states as ground truth for learning, we present a novel sequential training method to allow HVAEs to learn to predict complete states from partially observed states only.
arXiv Detail & Related papers (2023-01-12T02:07:26Z)
A generic diffusion-based approach for 3D human pose prediction in the wild [68.00961210467479]
3D human pose forecasting, i.e., predicting a sequence of future human 3D poses given a sequence of past observed ones, is a challenging-temporal task. We provide a unified formulation in which incomplete elements (no matter in the prediction or observation) are treated as noise and propose a conditional diffusion model that denoises them and forecasts plausible poses. We investigate our findings on four standard datasets and obtain significant improvements over the state-of-the-art.
arXiv Detail & Related papers (2022-10-11T17:59:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.