TransDiffuser: End-to-end Trajectory Generation with Decorrelated Multi-modal Representation for Autonomous Driving
- URL: http://arxiv.org/abs/2505.09315v1
- Date: Wed, 14 May 2025 12:10:41 GMT
- Title: TransDiffuser: End-to-end Trajectory Generation with Decorrelated Multi-modal Representation for Autonomous Driving
- Authors: Xuefeng Jiang, Yuan Ma, Pengxiang Li, Leimeng Xu, Xin Wen, Kun Zhan, Zhongpu Xia, Peng Jia, XianPeng Lang, Sheng Sun,
- Abstract summary: We propose TransDiffuser, an encoder-decoder based generative trajectory planning model for end-to-end autonomous driving. TransDiffuser achieves PDMS of 94.85 on the NAVSIM benchmark, surpassing previous state-of-the-art methods without any anchor-based prior trajectories.
- Score: 16.338107803841257
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, diffusion models have shown their potential across diverse domains, from vision generation to language modeling. Transferring their capabilities to modern autonomous driving systems has also emerged as a promising direction. In this work, we propose TransDiffuser, an encoder-decoder based generative trajectory planning model for end-to-end autonomous driving. The encoded scene information serves as the multi-modal conditional input of the denoising decoder. To tackle the mode collapse dilemma in generating high-quality diverse trajectories, we introduce a simple yet effective multi-modal representation decorrelation optimization mechanism during the training process. TransDiffuser achieves a PDMS of 94.85 on the NAVSIM benchmark, surpassing previous state-of-the-art methods without any anchor-based prior trajectories.
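The abstract names a representation decorrelation mechanism but does not give its formula. As a rough illustration of the general idea only, the sketch below penalizes correlation between feature dimensions across a batch. The function name `decorrelation_penalty`, the `eps` parameter, and the mean-squared off-diagonal form are assumptions made for this example, not the formulation used by TransDiffuser.

```python
import math


def decorrelation_penalty(features, eps=1e-8):
    """Mean squared correlation between distinct feature dimensions.

    `features` is a batch of N representation vectors of length D
    (a list of lists). Hypothetical helper for illustration; the exact
    penalty used in the paper is not stated in the abstract.
    """
    n = len(features)
    d = len(features[0])
    if n < 2 or d < 2:
        return 0.0

    # Standardize every dimension across the batch (zero mean, unit variance).
    columns = []
    for j in range(d):
        col = [row[j] for row in features]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / n
        std = math.sqrt(var + eps)
        columns.append([(x - mean) / std for x in col])

    # Accumulate squared correlations over all pairs of distinct dimensions.
    total = 0.0
    for i in range(d):
        for j in range(i + 1, d):
            corr = sum(a * b for a, b in zip(columns[i], columns[j])) / n
            total += corr ** 2
    return total / (d * (d - 1) / 2)


# Two perfectly correlated dimensions: the penalty is close to 1.
correlated = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
# Two uncorrelated dimensions: the penalty is 0.
uncorrelated = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]

print(decorrelation_penalty(correlated))
print(decorrelation_penalty(uncorrelated))
```

In a training loop, a term of this kind would typically be added to the main objective with a small weight, so that the encoder is nudged toward representations whose dimensions carry less redundant information.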
Related papers
- Streaming Real-Time Trajectory Prediction Using Endpoint-Aware Modeling [54.94692733670454]
Future trajectories of neighboring traffic agents have a significant influence on the path planning and decision-making of autonomous vehicles.
We propose a lightweight yet highly accurate streaming-based trajectory forecasting approach.
Our approach significantly reduces inference latency, making it well-suited for real-world deployment.
arXiv Detail & Related papers (2026-03-02T13:44:23Z)
- Future Optical Flow Prediction Improves Robot Control & Video Generation [100.87884718953099]
We introduce FOFPred, a novel optical flow forecasting model featuring a unified Vision-Language Model (VLM) and Diffusion architecture.
Our model is trained on web-scale human activity data, a highly scalable but unstructured source.
Evaluations across robotic manipulation and video generation under language-driven settings establish the cross-domain versatility of FOFPred.
arXiv Detail & Related papers (2026-01-15T18:49:48Z)
- Finetuning Generative Trajectory Model with Reinforcement Learning from Human Feedback [33.09982089166203]
We introduce TrajHF, a human feedback-driven finetuning framework for generative trajectory models.
TrajHF refines multi-modal trajectory generation beyond conventional imitation learning.
It achieves a PDMS of 93.95 on the NavSim benchmark, significantly exceeding other methods.
arXiv Detail & Related papers (2025-03-13T14:56:17Z)
- STGDPM: Vessel Trajectory Prediction with Spatio-Temporal Graph Diffusion Probabilistic Model [0.0]
Vessel trajectory prediction is a critical component for ensuring maritime traffic safety and avoiding collisions.
Due to the inherent uncertainty in vessel behavior, trajectory prediction systems must adopt a multimodal approach to accurately model potential future motion states.
We propose modeling interactions as dynamic graphs, replacing traditional aggregation-based techniques that rely on vessel states.
arXiv Detail & Related papers (2025-03-11T05:50:27Z)
- Advancing Semantic Future Prediction through Multimodal Visual Sequence Transformers [11.075247758198762]
This paper introduces FUTURIST, a method for multimodal future semantic prediction that uses a unified and efficient visual sequence transformer architecture.
We propose a VAE-free hierarchical tokenization process, which reduces computational complexity, streamlines the training pipeline, and enables end-to-end training with high-resolution, multimodal inputs.
We validate FUTURIST on the Cityscapes dataset, demonstrating state-of-the-art performance in future semantic segmentation for both short- and mid-term forecasting.
arXiv Detail & Related papers (2025-01-14T18:34:14Z)
- DiFSD: Ego-Centric Fully Sparse Paradigm with Uncertainty Denoising and Iterative Refinement for Efficient End-to-End Self-Driving [55.53171248839489]
We propose an ego-centric fully sparse paradigm, named DiFSD, for end-to-end self-driving.
Specifically, DiFSD mainly consists of sparse perception, hierarchical interaction and iterative motion planner.
Experiments conducted on nuScenes and Bench2Drive datasets demonstrate the superior planning performance and great efficiency of DiFSD.
arXiv Detail & Related papers (2024-09-15T15:55:24Z)
- WcDT: World-centric Diffusion Transformer for Traffic Scene Generation [14.236973526112674]
We introduce a novel approach for autonomous driving trajectory generation by harnessing the complementary strengths of diffusion probabilistic models and transformers.
Our proposed framework, termed the "World-Centric Diffusion Transformer" (WcDT), optimizes the entire trajectory generation process.
Our results show that the proposed approach exhibits superior performance in generating both realistic and diverse trajectories.
arXiv Detail & Related papers (2024-04-02T16:28:41Z)
- Tractable Joint Prediction and Planning over Discrete Behavior Modes for Urban Driving [15.671811785579118]
We show that we can parameterize autoregressive closed-loop models without retraining.
We propose fully reactive closed-loop planning over discrete latent modes.
Our approach also outperforms the previous state-of-the-art in CARLA on challenging dense traffic scenarios.
arXiv Detail & Related papers (2024-03-12T01:00:52Z)
- Controllable Diverse Sampling for Diffusion Based Motion Behavior Forecasting [11.106812447960186]
We introduce a novel trajectory generator named Controllable Diffusion Trajectory (CDT).
CDT integrates information and social interactions into a Transformer-based conditional denoising diffusion model to guide the prediction of future trajectories.
To ensure multimodality, we incorporate behavioral tokens to direct the trajectory's modes, such as going straight, turning right or left.
arXiv Detail & Related papers (2024-02-06T13:16:54Z)
- Trajeglish: Traffic Modeling as Next-Token Prediction [67.28197954427638]
A longstanding challenge for self-driving development is simulating dynamic driving scenarios seeded from recorded driving logs.
We apply tools from discrete sequence modeling to model how vehicles, pedestrians and cyclists interact in driving scenarios.
Our model tops the Sim Agents Benchmark, surpassing prior work along the realism meta metric by 3.3% and along the interaction metric by 9.9%.
arXiv Detail & Related papers (2023-12-07T18:53:27Z)
- Drive Anywhere: Generalizable End-to-end Autonomous Driving with Multi-modal Foundation Models [114.69732301904419]
We present an approach to apply end-to-end open-set (any environment/scene) autonomous driving that is capable of providing driving decisions from representations queryable by image and text.
Our approach demonstrates unparalleled results in diverse tests while achieving significantly greater robustness in out-of-distribution situations.
arXiv Detail & Related papers (2023-10-26T17:56:35Z)
- Layout Sequence Prediction From Noisy Mobile Modality [53.49649231056857]
Trajectory prediction plays a vital role in understanding pedestrian movement for applications such as autonomous driving and robotics.
Current trajectory prediction models depend on long, complete, and accurately observed sequences from visual modalities.
We propose LTrajDiff, a novel approach that treats objects obstructed or out of sight as equally important as those with fully visible trajectories.
arXiv Detail & Related papers (2023-10-09T20:32:49Z)
- MotionLM: Multi-Agent Motion Forecasting as Language Modeling [15.317827804763699]
We present MotionLM, a language model for multi-agent motion prediction.
Our approach bypasses post-hoc interactions where individual agent trajectory generation is conducted prior to interactive scoring.
The model's sequential factorization enables temporally causal conditional rollouts.
arXiv Detail & Related papers (2023-09-28T15:46:25Z)
- Unified Discrete Diffusion for Simultaneous Vision-Language Generation [78.21352271140472]
We present a unified multimodal generation model that can conduct both the "modality translation" and "multi-modality generation" tasks.
Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix.
Our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks.
arXiv Detail & Related papers (2022-11-27T14:46:01Z)
- Stochastic Trajectory Prediction via Motion Indeterminacy Diffusion [88.45326906116165]
We present a new framework to formulate the trajectory prediction task as a reverse process of motion indeterminacy diffusion (MID).
We encode the history behavior information and the social interactions as a state embedding and devise a Transformer-based diffusion model to capture the temporal dependencies of trajectories.
Experiments on the human trajectory prediction benchmarks including the Stanford Drone and ETH/UCY datasets demonstrate the superiority of our method.
arXiv Detail & Related papers (2022-03-25T16:59:08Z)
- Domain Generalization for Vision-based Driving Trajectory Generation [9.490923738117772]
We propose a domain generalization method for vision-based driving trajectory generation for autonomous vehicles in urban environments.
We leverage an adversarial learning approach to train a trajectory generator as the decoder.
We compare our proposed method with the state-of-the-art trajectory generation method and some recent domain generalization methods on both datasets and simulation.
arXiv Detail & Related papers (2021-09-22T07:49:07Z)
- A Driving Behavior Recognition Model with Bi-LSTM and Multi-Scale CNN [59.57221522897815]
We propose a neural network model based on trajectory information for driving behavior recognition.
We evaluate the proposed model on the public BLVD dataset, achieving satisfactory performance.
arXiv Detail & Related papers (2021-03-01T06:47:29Z)
- Haar Wavelet based Block Autoregressive Flows for Trajectories [129.37479472754083]
Prediction of trajectories such as that of pedestrians is crucial to the performance of autonomous agents.
We introduce a novel Haar wavelet based block autoregressive model leveraging split couplings.
We illustrate the advantages of our approach for generating diverse and accurate trajectories on two real-world datasets.
arXiv Detail & Related papers (2020-09-21T13:57:10Z)
- SMART: Simultaneous Multi-Agent Recurrent Trajectory Prediction [72.37440317774556]
We propose advances that address two key challenges in future trajectory prediction: multimodality in both training data and predictions, and constant-time inference regardless of the number of agents.
arXiv Detail & Related papers (2020-07-26T08:17:10Z)
- Diverse and Admissible Trajectory Forecasting through Multimodal Context Understanding [46.52703817997932]
Multi-agent trajectory forecasting in autonomous driving requires an agent to accurately anticipate the behaviors of the surrounding vehicles and pedestrians.
We propose a model that synthesizes multiple input signals from the multimodal world.
We show a significant performance improvement over previous state-of-the-art methods.
arXiv Detail & Related papers (2020-03-06T13:59:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.