HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation
- URL: http://arxiv.org/abs/2501.14729v1
- Date: Fri, 24 Jan 2025 18:59:51 GMT
- Title: HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation
- Authors: Xin Zhou, Dingkang Liang, Sifan Tu, Xiwu Chen, Yikang Ding, Dingyuan Zhang, Feiyang Tan, Hengshuang Zhao, Xiang Bai
- Abstract summary: We present a unified Driving World Model named HERMES.
We seamlessly integrate 3D scene understanding and future scene evolution (generation) through a unified framework in driving scenarios.
HERMES achieves state-of-the-art performance, reducing generation error by 32.4% and improving understanding metrics such as CIDEr by 8.0%.
- Score: 59.675030933810106
- Abstract: Driving World Models (DWMs) have become essential for autonomous driving by enabling future scene prediction. However, existing DWMs are limited to scene generation and fail to incorporate scene understanding, which involves interpreting and reasoning about the driving environment. In this paper, we present a unified Driving World Model named HERMES. We seamlessly integrate 3D scene understanding and future scene evolution (generation) through a unified framework in driving scenarios. Specifically, HERMES leverages a Bird's-Eye View (BEV) representation to consolidate multi-view spatial information while preserving geometric relationships and interactions. We also introduce world queries, which incorporate world knowledge into BEV features via causal attention in the Large Language Model (LLM), enabling contextual enrichment for understanding and generation tasks. We conduct comprehensive studies on nuScenes and OmniDrive-nuScenes datasets to validate the effectiveness of our method. HERMES achieves state-of-the-art performance, reducing generation error by 32.4% and improving understanding metrics such as CIDEr by 8.0%. The model and code will be publicly released at https://github.com/LMD0311/HERMES.
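A minimal sketch of the mechanism described in the abstract: learnable world queries are appended to flattened BEV tokens, processed with causal attention inside an LLM-style block, and then routed to an understanding (text) head and a generation (future-BEV) head. This is an illustrative assumption of how the pieces fit together, not the released HERMES code; the module names, dimensions, and the single stand-in transformer layer are hypothetical.

```python
# Hypothetical sketch of the abstract's idea: world queries + BEV tokens with
# causal attention, feeding an understanding head and a generation head.
import torch
import torch.nn as nn


class WorldQueryBlock(nn.Module):
    def __init__(self, dim=256, num_queries=32, num_heads=8, vocab_size=32000):
        super().__init__()
        self.world_queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        # Stand-in for a single LLM decoder layer.
        self.llm_block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True, norm_first=True
        )
        self.text_head = nn.Linear(dim, vocab_size)   # scene-understanding branch
        self.future_bev_head = nn.Linear(dim, dim)    # scene-generation branch

    def forward(self, bev_tokens):
        # bev_tokens: (B, N_bev, dim) flattened Bird's-Eye-View features
        B, N, D = bev_tokens.shape
        queries = self.world_queries.unsqueeze(0).expand(B, -1, -1)
        tokens = torch.cat([bev_tokens, queries], dim=1)       # (B, N + Q, D)
        # Causal mask: each token attends only to earlier positions,
        # mimicking autoregressive attention inside the LLM.
        L = tokens.size(1)
        causal_mask = torch.triu(torch.ones(L, L), diagonal=1).bool()
        enriched = self.llm_block(tokens, src_mask=causal_mask)
        text_logits = self.text_head(enriched[:, N:])           # from world queries
        future_bev = self.future_bev_head(enriched[:, :N])      # evolved BEV tokens
        return text_logits, future_bev


if __name__ == "__main__":
    model = WorldQueryBlock()
    bev = torch.randn(2, 200, 256)              # e.g. a 10x20 BEV grid, flattened
    text_logits, future_bev = model(bev)
    print(text_logits.shape, future_bev.shape)  # (2, 32, 32000) (2, 200, 256)
```

In practice the LLM would be a full pretrained decoder operating on text and BEV tokens together; the sketch only shows the token flow between the two tasks.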
Related papers
- The Role of World Models in Shaping Autonomous Driving: A Comprehensive Survey [50.62538723793247]
Driving World Model (DWM) focuses on predicting scene evolution during the driving process.
DWM methods enable autonomous driving systems to better perceive, understand, and interact with dynamic driving environments.
arXiv Detail & Related papers (2025-02-14T18:43:15Z)
- Transfer Your Perspective: Controllable 3D Generation from Any Viewpoint in a Driving Scene [56.73568220959019]
Collaborative autonomous driving (CAV) seems like a promising direction, but collecting data for development is non-trivial.
We introduce a novel surrogate to the rescue, which is to generate realistic perception from different viewpoints in a driving scene.
We present the very first solution, using a combination of simulated collaborative data and real ego-car data.
arXiv Detail & Related papers (2025-02-10T17:07:53Z)
- DrivingDojo Dataset: Advancing Interactive and Knowledge-Enriched Driving World Model [65.43473733967038]
We introduce DrivingDojo, the first dataset tailor-made for training interactive world models with complex driving dynamics.
Our dataset features video clips with a complete set of driving maneuvers, diverse multi-agent interplay, and rich open-world driving knowledge.
arXiv Detail & Related papers (2024-10-14T17:19:23Z)
- Driving in the Occupancy World: Vision-Centric 4D Occupancy Forecasting and Planning via World Models for Autonomous Driving [15.100104512786107]
Drive-OccWorld adapts a vision-centric 4D forecasting world model to end-to-end planning for autonomous driving.
We propose injecting flexible action conditions, such as velocity, steering angle, trajectory, and commands, into the world model to enable controllable generation.
Our method can generate plausible and controllable 4D occupancy, paving the way for advancements in driving world generation and end-to-end planning.
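A rough sketch of the action-conditioning idea described above, assuming a simple latent-state world model. This is not the Drive-OccWorld implementation; the GRU transition, the condition encoders, and all sizes are illustrative assumptions.

```python
# Illustrative sketch: inject action conditions (velocity, steering angle, and
# a discrete driving command) into a world model's latent state so that future
# predictions become controllable.
import torch
import torch.nn as nn


class ActionConditionedWorldModel(nn.Module):
    def __init__(self, state_dim=128, num_commands=4):
        super().__init__()
        # Continuous conditions (velocity, steering angle) -> embedding.
        self.cont_proj = nn.Linear(2, state_dim)
        # Discrete high-level command (e.g. turn left/right, go straight, stop).
        self.cmd_embed = nn.Embedding(num_commands, state_dim)
        # Simple transition model predicting the next latent scene state.
        self.transition = nn.GRUCell(state_dim, state_dim)

    def forward(self, state, velocity, steering, command):
        # state: (B, state_dim) latent scene state; velocity/steering: (B,)
        cond = self.cont_proj(torch.stack([velocity, steering], dim=-1))
        cond = cond + self.cmd_embed(command)    # fuse continuous and discrete conditions
        return self.transition(cond, state)      # next latent state


if __name__ == "__main__":
    wm = ActionConditionedWorldModel()
    state = torch.randn(2, 128)
    next_state = wm(state,
                    velocity=torch.tensor([5.0, 3.2]),
                    steering=torch.tensor([0.1, -0.3]),
                    command=torch.tensor([0, 2]))
    print(next_state.shape)  # (2, 128)
```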
arXiv Detail & Related papers (2024-08-26T11:53:09Z)
- DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving [67.46481099962088]
Current vision-centric pre-training typically relies on either 2D or 3D pre-text tasks, overlooking the temporal characteristics of autonomous driving as a 4D scene understanding task.
We introduce DriveWorld, which is capable of pre-training from multi-camera driving videos in a spatiotemporal fashion.
DriveWorld delivers promising results on various autonomous driving tasks.
arXiv Detail & Related papers (2024-05-07T15:14:20Z)
- Embodied Understanding of Driving Scenarios [44.21311841582762]
Embodied scene understanding serves as the cornerstone for autonomous agents to perceive, interpret, and respond to open driving scenarios.
Here, we introduce the Embodied Language Model (ELM), a comprehensive framework tailored for agents' understanding of driving scenes with large spatial and temporal spans.
ELM incorporates space-aware pre-training to endow the agent with robust spatial localization capabilities.
arXiv Detail & Related papers (2024-03-07T15:39:18Z)
- OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving [67.49461023261536]
We propose a new framework for learning a world model, OccWorld, in the 3D occupancy space.
We simultaneously predict the movement of the ego car and the evolution of the surrounding scenes.
OccWorld produces competitive planning results without using instance and map supervision.
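The joint prediction described above can be illustrated with a toy model that decodes one latent code into both a future occupancy grid and an ego-motion estimate. This is a hedged sketch, not the OccWorld architecture; the grid size, class count, and MLP encoder are assumptions.

```python
# Toy illustration of jointly forecasting future 3D occupancy and ego motion
# from an encoding of the current semantic occupancy grid.
import torch
import torch.nn as nn


class JointOccEgoPredictor(nn.Module):
    def __init__(self, num_classes=17, grid=(16, 16, 4), dim=128):
        super().__init__()
        self.grid = grid
        cells = grid[0] * grid[1] * grid[2]
        self.encoder = nn.Sequential(
            nn.Linear(cells * num_classes, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        self.occ_head = nn.Linear(dim, cells * num_classes)  # future occupancy logits
        self.ego_head = nn.Linear(dim, 2)                    # (dx, dy) ego displacement

    def forward(self, occ_onehot):
        # occ_onehot: (B, X, Y, Z, num_classes) current semantic occupancy
        B = occ_onehot.size(0)
        z = self.encoder(occ_onehot.flatten(1))
        future_occ = self.occ_head(z).view(B, *self.grid, -1)
        ego_motion = self.ego_head(z)
        return future_occ, ego_motion


if __name__ == "__main__":
    model = JointOccEgoPredictor()
    occ = torch.randn(2, 16, 16, 4, 17)
    future_occ, ego = model(occ)
    print(future_occ.shape, ego.shape)  # (2, 16, 16, 4, 17) (2, 2)
```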
arXiv Detail & Related papers (2023-11-27T17:59:41Z)
- Neural World Models for Computer Vision [2.741266294612776]
We present a framework to train a world model and a policy, parameterised by deep neural networks.
We leverage important computer vision concepts such as geometry, semantics, and motion to scale world models to complex urban driving scenes.
Our model can jointly predict static scene, dynamic scene, and ego-behaviour in an urban driving environment.
arXiv Detail & Related papers (2023-06-15T14:58:21Z)
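As a generic illustration of the world-model-plus-policy pattern this last entry describes (a recurrent latent state decoded into static-scene, dynamic-scene, and ego-behaviour outputs), a minimal sketch follows. It is not the paper's architecture; every shape and module choice here is a placeholder assumption.

```python
# Generic world-model + policy sketch: a recurrent latent state is decoded
# into static-scene, dynamic-scene, and ego-behaviour (action) outputs.
import torch
import torch.nn as nn


class WorldModelWithPolicy(nn.Module):
    def __init__(self, obs_dim=512, latent_dim=256, action_dim=2, scene_dim=64):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, latent_dim)
        self.dynamics = nn.GRUCell(latent_dim + action_dim, latent_dim)
        self.static_head = nn.Linear(latent_dim, scene_dim)   # e.g. road layout features
        self.dynamic_head = nn.Linear(latent_dim, scene_dim)  # e.g. moving-agent features
        self.policy = nn.Linear(latent_dim, action_dim)       # ego-behaviour

    def forward(self, obs, prev_latent, prev_action):
        z = self.encoder(obs)
        latent = self.dynamics(torch.cat([z, prev_action], dim=-1), prev_latent)
        return {
            "static": self.static_head(latent),
            "dynamic": self.dynamic_head(latent),
            "action": self.policy(latent),
            "latent": latent,
        }


if __name__ == "__main__":
    wm = WorldModelWithPolicy()
    out = wm(torch.randn(2, 512), torch.zeros(2, 256), torch.zeros(2, 2))
    print(out["action"].shape)  # (2, 2)
```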