BEVWorld: A Multimodal World Simulator for Autonomous Driving via Scene-Level BEV Latents
- URL: http://arxiv.org/abs/2407.05679v3
- Date: Wed, 30 Apr 2025 13:43:51 GMT
- Title: BEVWorld: A Multimodal World Simulator for Autonomous Driving via Scene-Level BEV Latents
- Authors: Yumeng Zhang, Shi Gong, Kaixin Xiong, Xiaoqing Ye, Xiaofan Li, Xiao Tan, Fan Wang, Jizhou Huang, Hua Wu, Haifeng Wang
- Abstract summary: We propose BEVWorld, a framework that transforms multimodal sensor inputs into a unified and compact Bird's Eye View latent space for holistic environment modeling. The proposed world model consists of two main components: a multi-modal tokenizer and a latent BEV sequence diffusion model.
- Score: 56.33989853438012
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: World models have attracted increasing attention in autonomous driving for their ability to forecast potential future scenarios. In this paper, we propose BEVWorld, a novel framework that transforms multimodal sensor inputs into a unified and compact Bird's Eye View (BEV) latent space for holistic environment modeling. The proposed world model consists of two main components: a multi-modal tokenizer and a latent BEV sequence diffusion model. The multi-modal tokenizer first encodes heterogeneous sensory data, and its decoder reconstructs the latent BEV tokens into LiDAR and surround-view image observations via ray-casting rendering in a self-supervised manner. This enables joint modeling and bidirectional encoding-decoding of panoramic imagery and point cloud data within a shared spatial representation. On top of this, the latent BEV sequence diffusion model performs temporally consistent forecasting of future scenes, conditioned on high-level action tokens, enabling scene-level reasoning over time. Extensive experiments demonstrate the effectiveness of BEVWorld on autonomous driving benchmarks, showcasing its capability in realistic future scene generation and its benefits for downstream tasks such as perception and motion prediction.
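The tokenizer half of this pipeline can be pictured with a short sketch. The PyTorch code below is a minimal illustration only: the module names, channel sizes, and plain convolutional reconstruction heads are all assumptions, and the simple decoders stand in for (rather than reproduce) the paper's self-supervised ray-casting rendering.

```python
# Minimal sketch of a BEVWorld-style multi-modal tokenizer. All shapes and
# module choices are illustrative assumptions; the paper reconstructs LiDAR
# and surround-view images from the latent via ray-casting rendering, which
# is replaced here by plain convolutional heads.
import torch
import torch.nn as nn

class MultiModalTokenizer(nn.Module):
    def __init__(self, bev_channels: int = 64):
        super().__init__()
        # Per-modality encoders: both map raw features onto the same BEV grid.
        self.lidar_encoder = nn.Sequential(  # takes a rasterized LiDAR BEV grid
            nn.Conv2d(16, bev_channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(bev_channels, bev_channels, 3, padding=1),
        )
        self.image_encoder = nn.Sequential(  # takes camera features lifted to BEV
            nn.Conv2d(32, bev_channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(bev_channels, bev_channels, 3, padding=1),
        )
        self.fuse = nn.Conv2d(2 * bev_channels, bev_channels, 1)
        # Placeholder reconstruction heads (not the paper's ray-casting decoder).
        self.lidar_decoder = nn.Conv2d(bev_channels, 16, 3, padding=1)
        self.image_decoder = nn.Conv2d(bev_channels, 32, 3, padding=1)

    def encode(self, lidar_bev, image_bev):
        z = torch.cat([self.lidar_encoder(lidar_bev),
                       self.image_encoder(image_bev)], dim=1)
        return self.fuse(z)  # unified, compact BEV latent

    def decode(self, bev_latent):
        return self.lidar_decoder(bev_latent), self.image_decoder(bev_latent)

tok = MultiModalTokenizer()
lidar = torch.randn(1, 16, 32, 32)   # toy rasterized LiDAR BEV
image = torch.randn(1, 32, 32, 32)   # toy lifted camera BEV features
latent = tok.encode(lidar, image)
lidar_rec, image_rec = tok.decode(latent)
loss = (nn.functional.mse_loss(lidar_rec, lidar)
        + nn.functional.mse_loss(image_rec, image))
```

The key point the sketch captures is that both modalities are encoded onto the same BEV grid, so a single latent serves as the shared spatial representation for reconstruction and, later, forecasting.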
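The second component, the latent BEV sequence diffusion model, learns to denoise future BEV latents conditioned on current latents and high-level action tokens. A toy DDPM-style training step follows; the network shape, conditioning scheme, and noise schedule are all assumed for illustration and are not the paper's exact design.

```python
# Toy action-conditioned latent diffusion step over future BEV latents.
# Network, conditioning, and schedule are illustrative assumptions.
import torch
import torch.nn as nn

class LatentBEVDenoiser(nn.Module):
    def __init__(self, c: int = 64, action_dim: int = 8):
        super().__init__()
        self.action_proj = nn.Linear(action_dim, c)   # high-level action tokens
        self.time_proj = nn.Linear(1, c)              # diffusion timestep embedding
        self.net = nn.Sequential(
            nn.Conv2d(2 * c, c, 3, padding=1), nn.ReLU(),
            nn.Conv2d(c, c, 3, padding=1),
        )

    def forward(self, noisy_future, past_latent, action, t):
        cond = (self.action_proj(action) + self.time_proj(t))[..., None, None]
        x = torch.cat([noisy_future + cond, past_latent], dim=1)
        return self.net(x)  # predicts the noise added to the future latent

# One training step: add noise to a future BEV latent, learn to remove it.
denoiser = LatentBEVDenoiser()
past = torch.randn(1, 64, 32, 32)      # current BEV latent from the tokenizer
future = torch.randn(1, 64, 32, 32)    # ground-truth future BEV latent
action = torch.randn(1, 8)             # e.g. a planned ego-motion token
t = torch.rand(1, 1)                   # diffusion time in [0, 1]
alpha = (1.0 - t).view(1, 1, 1, 1)     # toy noise schedule
noise = torch.randn_like(future)
noisy = alpha.sqrt() * future + (1 - alpha).sqrt() * noise
loss = nn.functional.mse_loss(denoiser(noisy, past, action, t), noise)
```

At inference time the same denoiser would be applied iteratively from pure noise to roll out temporally consistent future BEV latents, which the tokenizer's decoder can then render back into LiDAR and image observations.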
Related papers
- End-to-End Driving with Online Trajectory Evaluation via BEV World Model [52.10633338584164]
We propose an end-to-end driving framework WoTE, which leverages a BEV World model to predict future BEV states for Trajectory Evaluation.
We validate our framework on the NAVSIM benchmark and the closed-loop Bench2Drive benchmark based on the CARLA simulator, achieving state-of-the-art performance.
arXiv Detail & Related papers (2025-04-02T17:47:23Z) - Unified Human Localization and Trajectory Prediction with Monocular Vision [64.19384064365431]
MonoTransmotion is a Transformer-based framework that uses only a monocular camera to jointly solve localization and prediction tasks.
We show that by jointly training both tasks with our unified framework, our method is more robust in real-world scenarios made of noisy inputs.
arXiv Detail & Related papers (2025-03-05T14:18:39Z) - SimBEV: A Synthetic Multi-Task Multi-Sensor Driving Data Generation Tool and Dataset [101.51012770913627]
Bird's-eye view (BEV) perception has garnered significant attention in autonomous driving in recent years.
SimBEV is a randomized synthetic data generation tool that is extensively configurable and scalable.
SimBEV is used to create the SimBEV dataset, a large collection of annotated perception data from diverse driving scenarios.
arXiv Detail & Related papers (2025-02-04T00:00:06Z) - BEVPose: Unveiling Scene Semantics through Pose-Guided Multi-Modal BEV Alignment [8.098296280937518]
We present BEVPose, a framework that integrates BEV representations from camera and lidar data, using sensor pose as a guiding supervisory signal.
By leveraging pose information, we align and fuse multi-modal sensory inputs, facilitating the learning of latent BEV embeddings that capture both geometric and semantic aspects of the environment.
arXiv Detail & Related papers (2024-10-28T12:40:27Z) - CASPFormer: Trajectory Prediction from BEV Images with Deformable Attention [4.9349065371630045]
We propose Context Aware Scene Prediction Transformer (CASPFormer), which can perform multi-modal motion prediction from spatialized Bird's-Eye-View (BEV) images.
Our system can be integrated with any upstream perception module that is capable of generating BEV images.
We evaluate our model on the nuScenes dataset and show that it reaches state-of-the-art across multiple metrics.
arXiv Detail & Related papers (2024-09-26T12:37:22Z) - From Bird's-Eye to Street View: Crafting Diverse and Condition-Aligned Images with Latent Diffusion Model [16.716345249091408]
We explore Bird's-Eye View generation, converting a BEV map into its corresponding multi-view street images.
Our approach comprises two main components: the Neural View Transformation and the Street Image Generation.
arXiv Detail & Related papers (2024-09-02T07:47:16Z) - GenAD: Generalized Predictive Model for Autonomous Driving [75.39517472462089]
We introduce the first large-scale video prediction model in the autonomous driving discipline.
Our model, dubbed GenAD, handles the challenging dynamics in driving scenes with novel temporal reasoning blocks.
It can be adapted into an action-conditioned prediction model or a motion planner, holding great potential for real-world driving applications.
arXiv Detail & Related papers (2024-03-14T17:58:33Z) - Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving [56.381918362410175]
Drive-WM is the first driving world model compatible with existing end-to-end planning models.
Our model generates high-fidelity multiview videos in driving scenes.
arXiv Detail & Related papers (2023-11-29T18:59:47Z) - Copilot4D: Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion [36.321494200830244]
Copilot4D is a novel world modeling approach that first tokenizes sensor observations with a VQ-VAE, then predicts the future via discrete diffusion (a minimal sketch of this recipe appears after this list).
Our results demonstrate that discrete diffusion on tokenized agent experience can unlock the power of GPT-like unsupervised learning for robotics.
arXiv Detail & Related papers (2023-11-02T06:21:56Z) - Drive Anywhere: Generalizable End-to-end Autonomous Driving with Multi-modal Foundation Models [114.69732301904419]
We present an approach to end-to-end, open-set (any environment/scene) autonomous driving that is capable of providing driving decisions from representations queryable by image and text.
Our approach demonstrates unparalleled results in diverse tests while achieving significantly greater robustness in out-of-distribution situations.
arXiv Detail & Related papers (2023-10-26T17:56:35Z) - ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models [69.50316788263433]
We propose ProbVLM, a probabilistic adapter that estimates probability distributions for the embeddings of pre-trained vision-language models.
We quantify the calibration of embedding uncertainties in retrieval tasks and show that ProbVLM outperforms other methods.
We present a novel technique for visualizing the embedding distributions using a large-scale pre-trained latent diffusion model.
arXiv Detail & Related papers (2023-07-01T18:16:06Z) - PowerBEV: A Powerful Yet Lightweight Framework for Instance Prediction in Bird's-Eye View [14.113805629254191]
Bird's-eye view (BEV) representations are commonplace in perception for autonomous driving.
Existing approaches for BEV instance prediction rely on a multi-task auto-regressive setup coupled with post-processing to predict future instances.
We propose an efficient novel end-to-end framework named POWERBEV, which differs in several design choices aimed at reducing the inherent redundancy in previous methods.
arXiv Detail & Related papers (2023-06-19T08:11:05Z) - TBP-Former: Learning Temporal Bird's-Eye-View Pyramid for Joint Perception and Prediction in Vision-Centric Autonomous Driving [45.785865869298576]
Vision-centric joint perception and prediction has become an emerging trend in autonomous driving research.
It predicts the future states of the participants in the surrounding environment from raw RGB images.
It is still a critical challenge to synchronize features obtained at multiple camera views and timestamps.
arXiv Detail & Related papers (2023-03-17T14:20:28Z) - DiffBEV: Conditional Diffusion Model for Bird's Eye View Perception [14.968177102647783]
We propose an end-to-end framework, named DiffBEV, to exploit the potential of diffusion model to generate a more comprehensive BEV representation.
In practice, we design three types of conditions to guide the training of the diffusion model which denoises the coarse samples and refines the semantic feature.
We show that DiffBEV achieves a 25.9% mIoU on the nuScenes dataset, which is 6.2% higher than the best-performing existing approach.
arXiv Detail & Related papers (2023-03-15T02:42:48Z) - Street-View Image Generation from a Bird's-Eye View Layout [95.36869800896335]
Bird's-Eye View (BEV) Perception has received increasing attention in recent years.
Data-driven simulation for autonomous driving has been a focal point of recent research.
We propose BEVGen, a conditional generative model that synthesizes realistic and spatially consistent surrounding images.
arXiv Detail & Related papers (2023-01-11T18:39:34Z) - BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving [92.05963633802979]
We present BEVerse, a unified framework for 3D perception and prediction based on multi-camera systems.
We show that the multi-task BEVerse outperforms single-task methods on 3D object detection, semantic map construction, and motion prediction.
arXiv Detail & Related papers (2022-05-19T17:55:35Z) - Imagining The Road Ahead: Multi-Agent Trajectory Prediction via Differentiable Simulation [17.953880589741438]
We develop a deep generative model built on a fully differentiable simulator for trajectory prediction.
We achieve state-of-the-art results on the INTERACTION dataset, using standard neural architectures and a standard variational training objective.
We name our model ITRA, for "Imagining the Road Ahead".
arXiv Detail & Related papers (2021-04-22T17:48:08Z)
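As referenced in the Copilot4D entry above, its recipe is to tokenize sensor observations with a VQ-VAE and then predict the future with discrete diffusion. Below is a minimal sketch of that pattern, using nearest-codebook quantization plus MaskGIT-style masked-token prediction; the sizes, the tiny transformer, and the single corruption step are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of tokenize-then-discrete-diffusion: nearest-codebook
# vector quantization, then a masked-token predictor trained to recover
# corrupted tokens (a GPT-like unsupervised objective). All details assumed.
import torch
import torch.nn as nn

codebook = nn.Embedding(512, 64)                   # VQ-VAE codebook (toy size)
feats = torch.randn(1, 100, 64)                    # encoded BEV features (toy)

# Tokenize: assign each feature to its nearest codebook entry.
dists = torch.cdist(feats, codebook.weight.unsqueeze(0))
tokens = dists.argmin(dim=-1)                      # (1, 100) discrete tokens

# Discrete diffusion step: mask a random subset of tokens and train a
# predictor to recover the originals at the masked positions.
MASK = 512                                         # extra id for masked slots
predictor = nn.Sequential(
    nn.Embedding(513, 64),
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(64, 4, batch_first=True), 2),
    nn.Linear(64, 512),
)
mask = torch.rand(tokens.shape) < 0.5
corrupted = tokens.masked_fill(mask, MASK)
logits = predictor(corrupted)
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
```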