Related papers: DONUT: A Decoder-Only Model for Trajectory Prediction

DONUT: A Decoder-Only Model for Trajectory Prediction

URL: http://arxiv.org/abs/2506.06854v2
Date: Fri, 01 Aug 2025 14:07:37 GMT
Title: DONUT: A Decoder-Only Model for Trajectory Prediction
Authors: Markus Knoche, Daan de Geus, Bastian Leibe,
Abstract summary: We propose DONUT, a Decoder-Only Network for Unrolling Trajectories.<n>We encode historical trajectories and predict future trajectories with a single autoregressive model.<n>We achieve new state-of-the-art results on the Argoverse 2 single-agent motion forecasting benchmark.
Score: 12.89335607622991
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Predicting the motion of other agents in a scene is highly relevant for autonomous driving, as it allows a self-driving car to anticipate. Inspired by the success of decoder-only models for language modeling, we propose DONUT, a Decoder-Only Network for Unrolling Trajectories. Unlike existing encoder-decoder forecasting models, we encode historical trajectories and predict future trajectories with a single autoregressive model. This allows the model to make iterative predictions in a consistent manner, and ensures that the model is always provided with up-to-date information, thereby enhancing performance. Furthermore, inspired by multi-token prediction for language modeling, we introduce an 'overprediction' strategy that gives the model the auxiliary task of predicting trajectories at longer temporal horizons. This allows the model to better anticipate the future and further improves performance. Through experiments, we demonstrate that our decoder-only approach outperforms the encoder-decoder baseline, and achieves new state-of-the-art results on the Argoverse 2 single-agent motion forecasting benchmark.

Related papers

Autoregressive Video Generation without Vector Quantization [90.87907377618747]
We reformulate the video generation problem as a non-quantized autoregressive modeling of temporal frame-by-frame prediction.<n>With the proposed approach, we train a novel video autoregressive model without vector quantization, termed NOVA.<n>Our results demonstrate that NOVA surpasses prior autoregressive video models in data efficiency, inference speed, visual fidelity, and video fluency, even with a much smaller model capacity.
arXiv Detail & Related papers (2024-12-18T18:59:53Z)
Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past. We leverage the large-scale pretraining of image diffusion models which can handle multi-modality. We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
Certified Human Trajectory Prediction [66.1736456453465]
We propose a certification approach tailored for trajectory prediction that provides guaranteed robustness.<n>To mitigate the inherent performance drop through certification, we propose a diffusion-based trajectory denoiser and integrate it into our method.<n>We demonstrate the accuracy and robustness of the certified predictors and highlight their advantages over the non-certified ones.
arXiv Detail & Related papers (2024-03-20T17:41:35Z)
Humanoid Locomotion as Next Token Prediction [84.21335675130021]
Our model is a causal transformer trained via autoregressive prediction of sensorimotor trajectories. We show that our model enables a full-sized humanoid to walk in San Francisco zero-shot. Our model can transfer to the real world even when trained on only 27 hours of walking data, and can generalize commands not seen during training like walking backward.
arXiv Detail & Related papers (2024-02-29T18:57:37Z)
Trajeglish: Traffic Modeling as Next-Token Prediction [67.28197954427638]
A longstanding challenge for self-driving development is simulating dynamic driving scenarios seeded from recorded driving logs. We apply tools from discrete sequence modeling to model how vehicles, pedestrians and cyclists interact in driving scenarios. Our model tops the Sim Agents Benchmark, surpassing prior work along the realism meta metric by 3.3% and along the interaction metric by 9.9%.
arXiv Detail & Related papers (2023-12-07T18:53:27Z)
Smooth-Trajectron++: Augmenting the Trajectron++ behaviour prediction model with smooth attention [0.0]
This work investigates the state-of-the-art trajectory forecasting model Trajectron++ which we enhance by incorporating a smoothing term in its attention module. This attention mechanism mimics human attention inspired by cognitive science research indicating limits to attention switching. We evaluate the performance of the resulting Smooth-Trajectron++ model and compare it to the original model on various benchmarks.
arXiv Detail & Related papers (2023-05-31T09:19:55Z)
Evaluation of Differentially Constrained Motion Models for Graph-Based Trajectory Prediction [1.1947990549568765]
This research investigates the performance of various motion models in combination with numerical solvers for the prediction task. The study shows that simpler models, such as low-order integrator models, are preferred over more complex, e.g., kinematic models, to achieve accurate predictions.
arXiv Detail & Related papers (2023-04-11T10:15:20Z)
TransFollower: Long-Sequence Car-Following Trajectory Prediction through Transformer [44.93030718234555]
We develop a long-sequence car-following trajectory prediction model based on the attention-based Transformer model. We train and test our model with 112,597 real-world car-following events extracted from the Shanghai Naturalistic Driving Study (SH-NDS)
arXiv Detail & Related papers (2022-02-04T07:59:22Z)
You Mostly Walk Alone: Analyzing Feature Attribution in Trajectory Prediction [52.442129609979794]
Recent deep learning approaches for trajectory prediction show promising performance. It remains unclear which features such black-box models actually learn to use for making predictions. This paper proposes a procedure that quantifies the contributions of different cues to model performance.
arXiv Detail & Related papers (2021-10-11T14:24:15Z)
Greedy Hierarchical Variational Autoencoders for Large-Scale Video Prediction [79.23730812282093]
We introduce Greedy Hierarchical Variational Autoencoders (GHVAEs), a method that learns high-fidelity video predictions by greedily training each level of a hierarchical autoencoder. GHVAEs provide 17-55% gains in prediction performance on four video datasets, a 35-40% higher success rate on real robot tasks, and can improve performance monotonically by simply adding more modules.
arXiv Detail & Related papers (2021-03-06T18:58:56Z)
Learning to Predict Vehicle Trajectories with Model-based Planning [43.27767693429292]
We introduce a novel framework called PRIME, which stands for Prediction with Model-based Planning. Unlike recent prediction works that utilize neural networks to model scene context, PRIME is designed to generate accurate and feasibility-guaranteed future trajectory predictions. Our PRIME outperforms state-of-the-art methods in prediction accuracy, feasibility, and robustness under imperfect tracking.
arXiv Detail & Related papers (2021-03-06T04:49:24Z)
Learning Accurate Long-term Dynamics for Model-based Reinforcement Learning [7.194382512848327]
We propose a new parametrization to supervised learning on state-action data to stably predict at longer horizons. Our results in simulated and experimental robotic tasks show that our trajectory-based models yield significantly more accurate long term predictions.
arXiv Detail & Related papers (2020-12-16T18:47:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.