HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation
- URL: http://arxiv.org/abs/2501.14729v2
- Date: Wed, 12 Mar 2025 17:58:02 GMT
- Title: HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation
- Authors: Xin Zhou, Dingkang Liang, Sifan Tu, Xiwu Chen, Yikang Ding, Dingyuan Zhang, Feiyang Tan, Hengshuang Zhao, Xiang Bai
- Abstract summary: We present a unified Driving World Model named HERMES. We seamlessly integrate 3D scene understanding and future scene evolution (generation) through a unified framework in driving scenarios. HERMES achieves state-of-the-art performance, reducing generation error by 32.4% and improving understanding metrics such as CIDEr by 8.0%.
- Score: 59.675030933810106
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Driving World Models (DWMs) have become essential for autonomous driving by enabling future scene prediction. However, existing DWMs are limited to scene generation and fail to incorporate scene understanding, which involves interpreting and reasoning about the driving environment. In this paper, we present a unified Driving World Model named HERMES. We seamlessly integrate 3D scene understanding and future scene evolution (generation) through a unified framework in driving scenarios. Specifically, HERMES leverages a Bird's-Eye View (BEV) representation to consolidate multi-view spatial information while preserving geometric relationships and interactions. We also introduce world queries, which incorporate world knowledge into BEV features via causal attention in the Large Language Model, enabling contextual enrichment for understanding and generation tasks. We conduct comprehensive studies on nuScenes and OmniDrive-nuScenes datasets to validate the effectiveness of our method. HERMES achieves state-of-the-art performance, reducing generation error by 32.4% and improving understanding metrics such as CIDEr by 8.0%. The model and code will be publicly released at https://github.com/LMD0311/HERMES.
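To make the abstract's architecture concrete, below is a minimal, illustrative PyTorch sketch of the world-query idea: learnable queries gather scene context from BEV features and are prepended to the LLM input so causal attention can serve both the understanding (text) and generation (future BEV) tasks. All module names, shapes, and the fusion scheme are assumptions for exposition, not the authors' implementation; see the linked repository for the real model.

```python
# Illustrative sketch of the world-query idea from the HERMES abstract.
# Module names, dimensions, and the fusion scheme are assumptions.
import torch
import torch.nn as nn


class WorldQueryFusion(nn.Module):
    """Learnable world queries absorb BEV scene context and are prepended
    to the LLM sequence, enriching understanding and generation."""

    def __init__(self, num_queries=64, d_model=768, n_heads=8):
        super().__init__()
        self.world_queries = nn.Parameter(torch.randn(num_queries, d_model))
        self.bev_cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.future_bev_head = nn.Linear(d_model, d_model)  # generation branch (assumed)

    def forward(self, bev_feats, text_embeds):
        # bev_feats:   (B, H*W, d_model) flattened BEV tokens
        # text_embeds: (B, T, d_model)   embedded language tokens
        B = bev_feats.size(0)
        q = self.world_queries.unsqueeze(0).expand(B, -1, -1)
        # World queries gather multi-view scene context from the BEV features.
        q, _ = self.bev_cross_attn(q, bev_feats, bev_feats)
        # Prepend the queries so a causal LLM can attend to them from every
        # later (text) position, enriching the understanding task.
        llm_input = torch.cat([q, text_embeds], dim=1)
        # A separate head could decode the query states into future BEV
        # features for the generation task.
        future_bev = self.future_bev_head(q)
        return llm_input, future_bev


if __name__ == "__main__":
    fusion = WorldQueryFusion()
    bev = torch.randn(2, 400, 768)   # toy flattened BEV grid
    txt = torch.randn(2, 32, 768)    # toy text embeddings
    llm_in, fut = fusion(bev, txt)
    print(llm_in.shape, fut.shape)   # (2, 96, 768) (2, 64, 768)
```

The key design point suggested by the abstract is that a single set of queries mediates between the geometric BEV representation and the language model, so world knowledge flows into both tasks through one causal-attention pathway.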
Related papers
- Predicting the Road Ahead: A Knowledge Graph based Foundation Model for Scene Understanding in Autonomous Driving [16.94669292450282]
This paper proposes FM4SU, a novel methodology for training a symbolic foundation model (FM) for scene understanding in autonomous driving.
It leverages knowledge graphs (KGs) to capture sensory observations along with domain knowledge such as road topology, traffic rules, and complex interactions between traffic participants.
The results demonstrate that fine-tuned models achieve significantly higher accuracy in all tasks.
arXiv Detail & Related papers (2025-03-24T14:38:25Z) - The Role of World Models in Shaping Autonomous Driving: A Comprehensive Survey [50.62538723793247]
Driving World Model (DWM) focuses on predicting scene evolution during the driving process.
DWM methods enable autonomous driving systems to better perceive, understand, and interact with dynamic driving environments.
arXiv Detail & Related papers (2025-02-14T18:43:15Z) - Transfer Your Perspective: Controllable 3D Generation from Any Viewpoint in a Driving Scene [56.73568220959019]
Collaborative autonomous driving (CAV) seems like a promising direction, but collecting data for development is non-trivial.
We introduce a novel surrogate: generating realistic perception from different viewpoints in a driving scene.
We present the very first solution, using a combination of simulated collaborative data and real ego-car data.
arXiv Detail & Related papers (2025-02-10T17:07:53Z) - DrivingDojo Dataset: Advancing Interactive and Knowledge-Enriched Driving World Model [65.43473733967038]
We introduce DrivingDojo, the first dataset tailor-made for training interactive world models with complex driving dynamics.
Our dataset features video clips with a complete set of driving maneuvers, diverse multi-agent interplay, and rich open-world driving knowledge.
arXiv Detail & Related papers (2024-10-14T17:19:23Z) - Driving in the Occupancy World: Vision-Centric 4D Occupancy Forecasting and Planning via World Models for Autonomous Driving [15.100104512786107]
Drive-OccWorld adapts a vision-centric 4D forecasting world model to end-to-end planning for autonomous driving. We propose injecting flexible action conditions, such as velocity, steering angle, trajectory, and commands, into the world model to enable controllable generation. Our method can generate plausible and controllable 4D occupancy, paving the way for advancements in driving world generation and end-to-end planning.
arXiv Detail & Related papers (2024-08-26T11:53:09Z) - Enhancing End-to-End Autonomous Driving with Latent World Model [78.22157677787239]
We propose a novel self-supervised learning approach using the LAtent World model (LAW) for end-to-end driving.
LAW predicts future scene features based on current features and ego trajectories.
This self-supervised task can be seamlessly integrated into perception-free and perception-based frameworks (a minimal sketch of this objective appears after the related papers list).
arXiv Detail & Related papers (2024-06-12T17:59:21Z) - DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving [67.46481099962088]
Current vision-centric pre-training typically relies on either 2D or 3D pre-text tasks, overlooking the temporal characteristics of autonomous driving as a 4D scene understanding task.
We introduce DriveWorld, which is capable of pre-training from multi-camera driving videos in a spatio-temporal fashion.
DriveWorld delivers promising results on various autonomous driving tasks.
arXiv Detail & Related papers (2024-05-07T15:14:20Z) - Embodied Understanding of Driving Scenarios [44.21311841582762]
Embodied scene understanding serves as the cornerstone for autonomous agents to perceive, interpret, and respond to open driving scenarios.
Here, we introduce the Embodied Language Model (ELM), a comprehensive framework tailored for agents' understanding of driving scenes with large spatial and temporal spans.
ELM incorporates space-aware pre-training to endow the agent with robust spatial localization capabilities.
arXiv Detail & Related papers (2024-03-07T15:39:18Z) - OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving [67.49461023261536]
We propose a new framework, OccWorld, for learning a world model in the 3D occupancy space.
We simultaneously predict the movement of the ego car and the evolution of the surrounding scenes.
OccWorld produces competitive planning results without using instance and map supervision.
arXiv Detail & Related papers (2023-11-27T17:59:41Z) - DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving [76.24483706445298]
We introduce DriveDreamer, a world model entirely derived from real-world driving scenarios.
In the initial phase, DriveDreamer acquires a deep understanding of structured traffic constraints, while the subsequent stage equips it with the ability to anticipate future states.
DriveDreamer enables the generation of realistic and reasonable driving policies, opening avenues for interaction and practical applications.
arXiv Detail & Related papers (2023-09-18T13:58:42Z) - Neural World Models for Computer Vision [2.741266294612776]
We present a framework to train a world model and a policy, parameterised by deep neural networks.
We leverage important computer vision concepts such as geometry, semantics, and motion to scale world models to complex urban driving scenes.
Our model can jointly predict static scene, dynamic scene, and ego-behaviour in an urban driving environment.
arXiv Detail & Related papers (2023-06-15T14:58:21Z)
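The LAW entry above describes a self-supervised latent world model objective: predicting the next frame's scene features from the current features and the ego trajectory. The following is a minimal sketch of that idea; the shapes, the MLP predictor, the trajectory encoding, and the MSE target are assumptions for illustration, not the paper's exact design.

```python
# Minimal sketch of a LAW-style self-supervised latent world model:
# predict next-frame scene features from current features and ego motion.
# Shapes, the MLP predictor, and the MSE objective are assumptions.
import torch
import torch.nn as nn


class LatentWorldModel(nn.Module):
    def __init__(self, feat_dim=256, traj_dim=6, hidden=512):
        super().__init__()
        self.traj_encoder = nn.Linear(traj_dim, feat_dim)
        self.predictor = nn.Sequential(
            nn.Linear(feat_dim * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, cur_feats, ego_traj):
        # cur_feats: (B, N, feat_dim) current scene tokens
        # ego_traj:  (B, traj_dim)    planned or executed ego motion
        traj = self.traj_encoder(ego_traj).unsqueeze(1).expand_as(cur_feats)
        return self.predictor(torch.cat([cur_feats, traj], dim=-1))


def self_supervised_loss(model, cur_feats, ego_traj, next_feats):
    # No manual labels: the target is simply the features extracted
    # from the next frame by the same scene encoder.
    pred = model(cur_feats, ego_traj)
    return nn.functional.mse_loss(pred, next_feats)
```

Because the supervision signal comes from the model's own future features, such an objective can sit on top of either perception-free or perception-based driving stacks, as the entry notes.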
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.