Generating Multimodal Driving Scenes via Next-Scene Prediction
- URL: http://arxiv.org/abs/2503.14945v2
- Date: Wed, 26 Mar 2025 13:45:56 GMT
- Title: Generating Multimodal Driving Scenes via Next-Scene Prediction
- Authors: Yanhao Wu, Haoyang Zhang, Tianwei Lin, Lichao Huang, Shujie Luo, Rui Wu, Congpei Qiu, Wei Ke, Tong Zhang
- Abstract summary: Generative models in Autonomous Driving (AD) enable diverse scene creation, yet existing methods fall short by only capturing a limited range of modalities. We introduce a multimodal generation framework that incorporates four major data modalities, including a novel addition of map modality. Our framework effectively generates complex, realistic driving scenes over extended sequences, ensuring multimodal consistency and offering fine-grained control over scene elements.
- Score: 24.84840824118813
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative models in Autonomous Driving (AD) enable diverse scene creation, yet existing methods fall short by only capturing a limited range of modalities, restricting the capability of generating controllable scenes for comprehensive evaluation of AD systems. In this paper, we introduce a multimodal generation framework that incorporates four major data modalities, including a novel addition of map modality. With tokenized modalities, our scene sequence generation framework autoregressively predicts each scene while managing computational demands through a two-stage approach. The Temporal AutoRegressive (TAR) component captures inter-frame dynamics for each modality, while the Ordered AutoRegressive (OAR) component aligns modalities within each scene by sequentially predicting tokens in a fixed order. To maintain coherence between the map and ego-action modalities, we introduce the Action-aware Map Alignment (AMA) module, which applies a transformation to the map based on the ego-action. Our framework effectively generates complex, realistic driving scenes over extended sequences, ensuring multimodal consistency and offering fine-grained control over scene elements. Project page: https://yanhaowu.github.io/UMGen/
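The abstract's two-stage design can be pictured with a short sketch. The code below is a minimal, hypothetical mock-up of how TAR, OAR, and AMA might fit together at generation time: the module definitions, token counts, fixed modality order, and the additive AMA transform are illustrative assumptions, not the paper's implementation (see the project page for the actual method).
```python
# Illustrative sketch of next-scene prediction with TAR -> AMA -> OAR.
# All names, shapes, and the modality order are assumptions for illustration.
import torch
import torch.nn as nn

MODALITIES = ["map", "ego_action", "agent", "image"]   # assumed fixed OAR order
VOCAB, D, N_TOK = 1024, 256, 16                        # assumed vocab / feature / token sizes

class TAR(nn.Module):
    """Temporal AutoRegressive stage: per-modality inter-frame dynamics."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)
        self.net = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, history):                 # history: (B, T * N_TOK, D)
        return self.net(history)[:, -N_TOK:]    # temporal context for the next frame

class OAR(nn.Module):
    """Ordered AutoRegressive stage: predict scene tokens in a fixed order."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerDecoderLayer(D, nhead=8, batch_first=True)
        self.net = nn.TransformerDecoder(layer, num_layers=2)
        self.embed = nn.Embedding(VOCAB, D)
        self.head = nn.Linear(D, VOCAB)

    def forward(self, scene_tokens, context):   # scene_tokens: (B, S) token ids
        h = self.net(self.embed(scene_tokens), context)
        return self.head(h[:, -1])               # logits for the next token

def ama_align(map_context, ego_action):
    """Action-aware Map Alignment stand-in: condition map features on the ego action.
    A simple additive shift is used here purely for illustration."""
    return map_context + ego_action.unsqueeze(1)

# Generating one scene: TAR summarises each modality's history, AMA keeps the
# map context consistent with the ego action, then OAR emits tokens modality
# by modality in the fixed order.
B, T = 1, 4
tar = {m: TAR() for m in MODALITIES}
oar = OAR()
history = {m: torch.randn(B, T * N_TOK, D) for m in MODALITIES}  # placeholder tokenized history

with torch.no_grad():
    context = {m: tar[m](history[m]) for m in MODALITIES}
    ego_action = torch.randn(B, D)                       # placeholder ego-action feature
    context["map"] = ama_align(context["map"], ego_action)
    memory = torch.cat([context[m] for m in MODALITIES], dim=1)

    scene = torch.zeros(B, 1, dtype=torch.long)          # start token
    for m in MODALITIES:                                  # fixed modality order
        for _ in range(N_TOK):
            next_tok = oar(scene, memory).argmax(-1, keepdim=True)
            scene = torch.cat([scene, next_tok], dim=1)
```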
Related papers
- GPD-1: Generative Pre-training for Driving [77.06803277735132]
We propose a unified Generative Pre-training for Driving (GPD-1) model to accomplish all these tasks.
We represent each scene with ego, agent, and map tokens and formulate autonomous driving as a unified token generation problem.
Our GPD-1 successfully generalizes to various tasks without finetuning, including scene generation, traffic simulation, closed-loop simulation, map prediction, and motion planning.
arXiv Detail & Related papers (2024-12-11T18:59:51Z) - DeepInteraction++: Multi-Modality Interaction for Autonomous Driving [80.8837864849534]
We introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout. DeepInteraction++ is a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder. Experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.
arXiv Detail & Related papers (2024-08-09T14:04:21Z) - SceneMotion: From Agent-Centric Embeddings to Scene-Wide Forecasts [13.202036465220766]
Self-driving vehicles rely on multimodal motion forecasts to interact with their environment and plan safe maneuvers. We introduce SceneMotion, an attention-based model for forecasting scene-wide motion modes of multiple traffic agents. This module learns a scene-wide latent space from multiple agent-centric embeddings, enabling joint forecasting and interaction modeling.
arXiv Detail & Related papers (2024-08-02T18:49:14Z) - DualAD: Disentangling the Dynamic and Static World for End-to-End Driving [11.379456277711379]
State-of-the-art approaches for autonomous driving integrate multiple sub-tasks of the overall driving task into a single pipeline.
We propose dedicated representations to disentangle dynamic agents and static scene elements.
Our method titled DualAD outperforms independently trained single-task networks.
arXiv Detail & Related papers (2024-06-10T13:46:07Z) - JointMotion: Joint Self-Supervision for Joint Motion Prediction [10.44846560021422]
JointMotion is a self-supervised pre-training method for joint motion prediction in self-driving vehicles.
Our method reduces the joint final displacement error of Wayformer, HPTR, and Scene Transformer models by 3%, 8%, and 12%, respectively.
arXiv Detail & Related papers (2024-03-08T17:54:38Z) - Drive Anywhere: Generalizable End-to-end Autonomous Driving with Multi-modal Foundation Models [114.69732301904419]
We present an approach to apply end-to-end open-set (any environment/scene) autonomous driving that is capable of providing driving decisions from representations queryable by image and text.
Our approach demonstrates unparalleled results in diverse tests while achieving significantly greater robustness in out-of-distribution situations.
arXiv Detail & Related papers (2023-10-26T17:56:35Z) - MA-ViT: Modality-Agnostic Vision Transformers for Face Anti-Spoofing [3.3031006227198003]
We present the Modality-Agnostic Vision Transformer (MA-ViT), which aims to improve anti-spoofing performance on attacks of arbitrary modalities with the help of multi-modal data.
Specifically, MA-ViT adopts early fusion to aggregate all available training modality data and enables flexible testing of samples from any given modality.
Experiments demonstrate that a single model trained with MA-ViT can not only flexibly evaluate samples of different modalities, but also outperforms existing single-modal frameworks by a large margin.
arXiv Detail & Related papers (2023-04-15T13:03:44Z) - Self-Supervised Representation Learning from Temporal Ordering of Automated Driving Sequences [49.91741677556553]
We propose TempO, a temporal ordering pretext task for pre-training region-level feature representations for perception tasks.
We embed each frame as an unordered set of proposal feature vectors, a representation that is natural for object detection or tracking systems.
Extensive evaluations on the BDD100K, nuImages, and MOT17 datasets show that our TempO pre-training approach outperforms single-frame self-supervised learning methods.
arXiv Detail & Related papers (2023-02-17T18:18:27Z) - Learning to Align Sequential Actions in the Wild [123.62879270881807]
We propose an approach to align sequential actions in the wild that involve diverse temporal variations.
Our model accounts for both monotonic and non-monotonic sequences.
We demonstrate that our approach consistently outperforms the state-of-the-art in self-supervised sequential action representation learning.
arXiv Detail & Related papers (2021-11-17T18:55:36Z) - Instance-Aware Predictive Navigation in Multi-Agent Environments [93.15055834395304]
We propose an Instance-Aware Predictive Control (IPC) approach, which forecasts interactions between agents as well as future scene structures.
We adopt a novel multi-instance event prediction module to estimate the possible interactions among agents in the ego-centric view.
We design a sequential action sampling strategy to better leverage predicted states at both the scene level and the instance level.
arXiv Detail & Related papers (2021-01-14T22:21:25Z) - Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers [89.00926092864368]
We present a semantics-controlled multi-modal shuffled Transformer reasoning framework for the audio-visual scene aware dialog task.
We also present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing semantic graph representations for every frame.
Our results demonstrate state-of-the-art performances on all evaluation metrics.
arXiv Detail & Related papers (2020-07-08T02:00:22Z)