Other Vehicle Trajectories Are Also Needed: A Driving World Model Unifies Ego-Other Vehicle Trajectories in Video Latent Space
- URL: http://arxiv.org/abs/2503.09215v2
- Date: Mon, 17 Mar 2025 08:07:46 GMT
- Title: Other Vehicle Trajectories Are Also Needed: A Driving World Model Unifies Ego-Other Vehicle Trajectories in Video Latent Space
- Authors: Jian Zhu, Zhengyu Jia, Tian Gao, Jiaxin Deng, Shidi Li, Fu Liu, Peng Jia, Xianpeng Lang, Xiaolong Sun
- Abstract summary: A driving World Model named EOT-WM is proposed in this paper, unifying Ego-Other vehicle Trajectories in videos. The model can also predict unseen driving scenes with self-produced trajectories.
- Score: 17.782501276072537
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advanced end-to-end autonomous driving systems predict other vehicles' motions and plan the ego vehicle's trajectory. World models that can foresee the outcome of a trajectory have been used to evaluate end-to-end autonomous driving systems. However, existing world models predominantly emphasize the trajectory of the ego vehicle and leave other vehicles uncontrollable. This limitation hinders their ability to realistically simulate the interaction between the ego vehicle and the driving scenario. In addition, it remains a challenge to match multiple trajectories with the corresponding vehicles in the video to control the video generation. To address the above issues, a driving World Model named EOT-WM is proposed in this paper, unifying Ego-Other vehicle Trajectories in videos. Specifically, we first project ego and other vehicle trajectories from the BEV space into image coordinates to match each trajectory with its corresponding vehicle in the video. Then, trajectory videos are encoded by a Spatial-Temporal Variational Auto Encoder to align with driving video latents spatially and temporally in a unified visual space. A trajectory-injected diffusion Transformer is further designed to denoise the noisy video latents for video generation under the guidance of ego-other vehicle trajectories. In addition, we propose a metric based on control latent similarity to evaluate the controllability of trajectories. Extensive experiments are conducted on the nuScenes dataset, and the proposed model outperforms the state-of-the-art method by 30% in FID and 55% in FVD. The model can also predict unseen driving scenes with self-produced trajectories.
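The abstract names two concrete operations: projecting BEV trajectories into image coordinates so each trajectory lands on its vehicle, and scoring controllability by comparing control latents. Below is a minimal numpy sketch of both, assuming a standard pinhole camera model and plain cosine similarity; the paper's exact projection and metric formulations may differ.

```python
import numpy as np

def project_bev_to_image(traj_bev, K, T_cam_from_ego):
    """Project N x 3 ego/BEV-frame trajectory points into pixel coordinates.

    traj_bev: (N, 3) points in the ego frame (metres).
    K: (3, 3) camera intrinsic matrix.
    T_cam_from_ego: (4, 4) extrinsic transform, ego frame -> camera frame.
    Returns (N, 2) pixel coordinates; points behind the camera become NaN.
    """
    pts_h = np.hstack([traj_bev, np.ones((len(traj_bev), 1))])  # homogeneous
    cam = (T_cam_from_ego @ pts_h.T).T[:, :3]                   # camera frame
    uv = (K @ cam.T).T                                          # perspective projection
    depth = uv[:, 2:3]
    return np.where(depth > 0, uv[:, :2] / depth, np.nan)

def control_latent_similarity(z_gen, z_ref):
    """Cosine similarity between generated and reference control latents."""
    z_gen, z_ref = z_gen.ravel(), z_ref.ravel()
    denom = np.linalg.norm(z_gen) * np.linalg.norm(z_ref) + 1e-8
    return float(z_gen @ z_ref / denom)
```

A point 10 m straight ahead of a camera with principal point (64, 64) projects to exactly (64, 64), which is a quick sanity check for the intrinsics/extrinsics plumbing.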
Related papers
- VideoGAN-based Trajectory Proposal for Automated Vehicles [1.693200946453174]
We investigate whether a generative adversarial network (GAN) trained on videos of bird's-eye view (BEV) traffic scenarios can generate statistically accurate trajectories.
To this end, we propose a pipeline that uses low-resolution BEV occupancy grid videos as training data for a video generative model.
We obtain our best results within 100 GPU hours of training, with inference times under 20 ms.
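The occupancy-grid-video training data described above can be produced by rasterizing per-frame vehicle positions into a coarse BEV grid. A hypothetical numpy sketch, not the paper's actual pipeline; grid size and metric extent are illustrative assumptions:

```python
import numpy as np

def rasterize_trajectories(trajs, grid_size=64, extent=50.0):
    """Rasterize per-frame vehicle positions into a BEV occupancy grid video.

    trajs: (num_vehicles, num_frames, 2) x/y positions in metres,
           with the ego vehicle at the grid centre.
    extent: half-width of the area covered, i.e. x, y in [-extent, extent).
    Returns a (num_frames, grid_size, grid_size) uint8 occupancy video.
    """
    n_veh, n_frames, _ = trajs.shape
    video = np.zeros((n_frames, grid_size, grid_size), dtype=np.uint8)
    # map metres [-extent, extent) to integer cell indices [0, grid_size)
    cells = ((trajs + extent) / (2 * extent) * grid_size).astype(int)
    for t in range(n_frames):
        for v in range(n_veh):
            i, j = cells[v, t]
            if 0 <= i < grid_size and 0 <= j < grid_size:
                video[t, j, i] = 1  # mark the cell as occupied
    return video
```

Real pipelines would also rasterize vehicle footprints rather than single points, but point occupancy is enough to show the coordinate mapping.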
arXiv Detail & Related papers (2025-06-19T10:57:44Z)
- GeoDrive: 3D Geometry-Informed Driving World Model with Precise Action Control [50.67481583744243]
We introduce GeoDrive, which explicitly integrates robust 3D geometry conditions into driving world models.
We propose a dynamic editing module during training to enhance the renderings by editing the positions of the vehicles.
Our method significantly outperforms existing models in both action accuracy and 3D spatial awareness.
arXiv Detail & Related papers (2025-05-28T14:46:51Z)
- Challenger: Affordable Adversarial Driving Video Generation [36.949064774296076]
Challenger is a framework that produces physically plausible yet photorealistic adversarial driving videos.
As tested on the nuScenes dataset, Challenger generates a diverse range of aggressive driving scenarios.
arXiv Detail & Related papers (2025-05-21T17:59:55Z)
- Fully Unified Motion Planning for End-to-End Autonomous Driving [14.45403574889677]
Current end-to-end autonomous driving methods learn only from expert planning data collected from a single ego vehicle.
In any driving scenario, multiple high-quality trajectories from other vehicles coexist with a specific ego vehicle's trajectory.
We propose FUMP, a novel two-stage trajectory generation framework.
arXiv Detail & Related papers (2025-04-17T05:52:35Z)
- The Role of World Models in Shaping Autonomous Driving: A Comprehensive Survey [50.62538723793247]
Driving World Model (DWM) focuses on predicting scene evolution during the driving process.
DWM methods enable autonomous driving systems to better perceive, understand, and interact with dynamic driving environments.
arXiv Detail & Related papers (2025-02-14T18:43:15Z)
- Transfer Your Perspective: Controllable 3D Generation from Any Viewpoint in a Driving Scene [56.73568220959019]
Collaborative autonomous driving (CAV) seems like a promising direction, but collecting data for development is non-trivial.
We introduce a novel surrogate: generating realistic perception from different viewpoints in a driving scene.
We present the very first solution, using a combination of simulated collaborative data and real ego-car data.
arXiv Detail & Related papers (2025-02-10T17:07:53Z)
- Stag-1: Towards Realistic 4D Driving Simulation with Video Generation Model [83.31688383891871]
We propose a Spatial-Temporal simulAtion for drivinG (Stag-1) model to reconstruct real-world scenes.
Stag-1 constructs continuous 4D point cloud scenes using surround-view data from autonomous vehicles.
It decouples spatial-temporal relationships and produces coherent driving videos.
arXiv Detail & Related papers (2024-12-06T18:59:56Z)
- Seeing Beyond Views: Multi-View Driving Scene Video Generation with Holistic Attention [61.3281618482513]
We present CogDriving, a novel network designed for synthesizing high-quality multi-view driving videos.
CogDriving leverages a Diffusion Transformer architecture with holistic-4D attention modules, enabling simultaneous associations across the dimensions.
CogDriving demonstrates strong performance on the nuScenes validation set, achieving an FVD score of 37.8, highlighting its ability to generate realistic driving videos.
arXiv Detail & Related papers (2024-12-04T18:02:49Z)
- Driving Scene Synthesis on Free-form Trajectories with Generative Prior [39.24591650300784]
We propose a novel free-form driving view synthesis approach, dubbed DriveX.
Our resulting model can produce high-fidelity virtual driving environments outside the recorded trajectory.
Beyond real driving scenes, DriveX can also be utilized to simulate virtual driving worlds from AI-generated videos.
arXiv Detail & Related papers (2024-12-02T17:07:53Z)
- Exploring the Interplay Between Video Generation and World Models in Autonomous Driving: A Survey [61.39993881402787]
World models and video generation are pivotal technologies in the domain of autonomous driving.
This paper investigates the relationship between these two technologies.
By analyzing the interplay between video generation and world models, this survey identifies critical challenges and future research directions.
arXiv Detail & Related papers (2024-11-05T08:58:35Z)
- DiFSD: Ego-Centric Fully Sparse Paradigm with Uncertainty Denoising and Iterative Refinement for Efficient End-to-End Self-Driving [55.53171248839489]
We propose an ego-centric fully sparse paradigm, named DiFSD, for end-to-end self-driving.
Specifically, DiFSD mainly consists of sparse perception, hierarchical interaction and iterative motion planner.
Experiments conducted on nuScenes and Bench2Drive datasets demonstrate the superior planning performance and great efficiency of DiFSD.
arXiv Detail & Related papers (2024-09-15T15:55:24Z)
- GenAD: Generative End-to-End Autonomous Driving [13.332272121018285]
GenAD is a generative framework that casts autonomous driving into a generative modeling problem.
We propose an instance-centric scene tokenizer that first transforms the surrounding scenes into map-aware instance tokens.
We then employ a variational autoencoder to learn the future trajectory distribution in a structural latent space for trajectory prior modeling.
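The variational autoencoder described above learns a latent prior over future trajectories. A toy numpy sketch of one forward pass, with hypothetical weight matrices and a flattened trajectory vector, illustrating the reparameterization trick and the KL prior term; GenAD's actual architecture is not specified here:

```python
import numpy as np

rng = np.random.default_rng(0)

def vae_trajectory_prior(traj, W_enc, W_mu, W_logvar, W_dec):
    """One forward pass of a toy trajectory VAE: encode a flattened future
    trajectory into a latent Gaussian, sample via the reparameterization
    trick, and decode a trajectory back out."""
    h = np.tanh(traj @ W_enc)             # encoder hidden state
    mu, logvar = h @ W_mu, h @ W_logvar   # latent Gaussian parameters
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)  # sample
    recon = z @ W_dec                     # decoded trajectory
    return recon, mu, logvar

def kl_to_standard_normal(mu, logvar):
    """KL divergence of N(mu, diag(sigma^2)) from N(0, I), the VAE prior term."""
    return float(0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar))
```

At training time the loss would combine a reconstruction term on `recon` with `kl_to_standard_normal(mu, logvar)`; at inference, sampling `z` from N(0, I) and decoding yields trajectory hypotheses from the learned prior.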
arXiv Detail & Related papers (2024-02-18T08:21:05Z)
- BEVSeg2TP: Surround View Camera Bird's-Eye-View Based Joint Vehicle Segmentation and Ego Vehicle Trajectory Prediction [4.328789276903559]
Trajectory prediction is a key task for vehicle autonomy.
There is a growing interest in learning-based trajectory prediction.
We show that there is the potential to improve the performance of perception.
arXiv Detail & Related papers (2023-12-20T15:02:37Z)
- Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving [56.381918362410175]
Drive-WM is the first driving world model compatible with existing end-to-end planning models.
Our model generates high-fidelity multiview videos in driving scenes.
arXiv Detail & Related papers (2023-11-29T18:59:47Z)
- An End-to-End Vehicle Trajectory Prediction Framework [3.7311680121118345]
An accurate prediction of a future trajectory does not just rely on the previous trajectory, but also a simulation of the complex interactions between other vehicles nearby.
Most state-of-the-art networks built to tackle the problem assume readily available past trajectory points.
We propose a novel end-to-end architecture that takes raw video inputs and outputs future trajectory predictions.
arXiv Detail & Related papers (2023-04-19T15:42:03Z)
- Generative AI-empowered Simulation for Autonomous Driving in Vehicular Mixed Reality Metaverses [130.15554653948897]
In the vehicular mixed reality (MR) Metaverse, the distance between physical and virtual entities can be overcome.
Large-scale traffic and driving simulation via realistic data collection and fusion from the physical world is difficult and costly.
We propose an autonomous driving architecture, where generative AI is leveraged to synthesize unlimited conditioned traffic and driving data in simulations.
arXiv Detail & Related papers (2023-02-16T16:54:10Z)
- Street-View Image Generation from a Bird's-Eye View Layout [95.36869800896335]
Bird's-Eye View (BEV) Perception has received increasing attention in recent years.
Data-driven simulation for autonomous driving has been a focal point of recent research.
We propose BEVGen, a conditional generative model that synthesizes realistic and spatially consistent surrounding images.
arXiv Detail & Related papers (2023-01-11T18:39:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.