Real-Time Motion Prediction via Heterogeneous Polyline Transformer with
Relative Pose Encoding
- URL: http://arxiv.org/abs/2310.12970v1
- Date: Thu, 19 Oct 2023 17:59:01 GMT
- Title: Real-Time Motion Prediction via Heterogeneous Polyline Transformer with
Relative Pose Encoding
- Authors: Zhejun Zhang, Alexander Liniger, Christos Sakaridis, Fisher Yu, Luc
Van Gool
- Abstract summary: Existing agent-centric methods have demonstrated outstanding performance on public benchmarks.
We introduce the K-nearest neighbor attention with relative pose encoding (KNARPE), a novel attention mechanism allowing the pairwise-relative representation to be used by Transformers.
By sharing contexts among agents and reusing the unchanged contexts, our approach is as efficient as scene-centric methods, while performing on par with state-of-the-art agent-centric methods.
- Score: 121.08841110022607
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The real-world deployment of an autonomous driving system requires its
components to run on-board and in real-time, including the motion prediction
module that predicts the future trajectories of surrounding traffic
participants. Existing agent-centric methods have demonstrated outstanding
performance on public benchmarks. However, they suffer from high computational
overhead and poor scalability as the number of agents to be predicted
increases. To address this problem, we introduce the K-nearest neighbor
attention with relative pose encoding (KNARPE), a novel attention mechanism
allowing the pairwise-relative representation to be used by Transformers. Then,
based on KNARPE we present the Heterogeneous Polyline Transformer with Relative
pose encoding (HPTR), a hierarchical framework enabling asynchronous token
update during the online inference. By sharing contexts among agents and
reusing the unchanged contexts, our approach is as efficient as scene-centric
methods, while performing on par with state-of-the-art agent-centric methods.
Experiments on Waymo and Argoverse-2 datasets show that HPTR achieves superior
performance among end-to-end methods that do not apply expensive
post-processing or model ensembling. The code is available at
https://github.com/zhejz/HPTR.
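The core idea of KNARPE, as described in the abstract, is to restrict each token's attention to its K nearest neighbors and to inject the pairwise-relative pose between tokens into the attention computation, so that no global scene-centric frame is needed. The following is a minimal sketch of that idea; the function names, dimensions, and the linear pose encoding are illustrative assumptions, not the authors' implementation (see the linked repository for the real code).

```python
# Hypothetical sketch of K-nearest-neighbor attention with relative pose
# encoding, inspired by the KNARPE description. All names and shapes here
# are assumptions for illustration, not the HPTR implementation.
import numpy as np

def relative_pose(query_pose, other_poses):
    """Express other_poses (N, 3: x, y, yaw) in the local frame of
    query_pose (3,), returning (N, 4): local dx, dy, cos/sin of dyaw."""
    dx = other_poses[:, 0] - query_pose[0]
    dy = other_poses[:, 1] - query_pose[1]
    c, s = np.cos(-query_pose[2]), np.sin(-query_pose[2])
    # Rotate the positional offset into the query's local frame.
    local = np.stack([c * dx - s * dy, s * dx + c * dy], axis=1)
    dyaw = other_poses[:, 2] - query_pose[2]
    return np.concatenate(
        [local, np.cos(dyaw)[:, None], np.sin(dyaw)[:, None]], axis=1)

def knn_attention(query_pose, query_feat, poses, feats, k, W_pe):
    """Attend from one query token to its k nearest tokens, augmenting
    each key/value with an encoding of the pairwise-relative pose."""
    dists = np.linalg.norm(poses[:, :2] - query_pose[:2], axis=1)
    idx = np.argsort(dists)[:k]                  # K nearest neighbors
    rel = relative_pose(query_pose, poses[idx])  # (k, 4) relative poses
    pe = rel @ W_pe                              # linear pose encoding, (k, d)
    keys = feats[idx] + pe                       # relative PE added to keys
    vals = feats[idx] + pe                       # ... and to values
    logits = keys @ query_feat / np.sqrt(query_feat.size)
    attn = np.exp(logits - logits.max())         # numerically stable softmax
    attn /= attn.sum()
    return attn @ vals                           # (d,) updated query token
```

Because every quantity entering the attention is relative to the query's own pose, the output is invariant to rigid transforms of the global frame, which is what lets contexts be shared among agents and unchanged tokens be reused during online inference.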
Related papers
- PointMT: Efficient Point Cloud Analysis with Hybrid MLP-Transformer Architecture [46.266960248570086]
This study tackles the quadratic complexity of the self-attention mechanism by introducing a linear-complexity local attention mechanism for effective feature aggregation.
We also introduce a parameter-free channel temperature adaptation mechanism that adaptively adjusts the attention weight distribution in each channel.
We show that PointMT achieves performance comparable to state-of-the-art methods while maintaining an optimal balance between accuracy and efficiency.
arXiv Detail & Related papers (2024-08-10T10:16:03Z)
- SocialFormer: Social Interaction Modeling with Edge-enhanced Heterogeneous Graph Transformers for Trajectory Prediction [3.733790302392792]
SocialFormer is an agent interaction-aware trajectory prediction method.
We present a temporal encoder based on gated recurrent units (GRU) to model the temporal social behavior of agent movements.
We evaluate SocialFormer for the trajectory prediction task on the popular nuScenes benchmark and achieve state-of-the-art performance.
arXiv Detail & Related papers (2024-05-06T19:47:23Z)
- AMP: Autoregressive Motion Prediction Revisited with Next Token Prediction for Autonomous Driving [59.94343412438211]
We introduce GPT-style next token prediction into motion prediction.
Different from language data, which is composed of homogeneous units (words), the elements in a driving scene can have complex spatial-temporal and semantic relations.
We propose to adopt three factorized attention modules with different neighbors for information aggregation and different position encoding styles to capture their relations.
arXiv Detail & Related papers (2024-03-20T06:22:37Z)
- SceneDM: Scene-level Multi-agent Trajectory Generation with Consistent Diffusion Models [10.057312592344507]
We propose a novel framework based on diffusion models, called SceneDM, to generate joint and consistent future motions of all the agents in a scene.
SceneDM achieves state-of-the-art results on the Sim Agents Benchmark.
arXiv Detail & Related papers (2023-11-27T11:39:27Z)
- A Hierarchical Hybrid Learning Framework for Multi-agent Trajectory Prediction [4.181632607997678]
We propose a hierarchical hybrid framework of deep learning (DL) and reinforcement learning (RL) for multi-agent trajectory prediction.
In the DL stage, the traffic scene is divided into multiple intermediate-scale heterogeneous graphs, based on which Transformer-style GNNs are adopted to encode heterogeneous interactions.
In the RL stage, we divide the traffic scene into local sub-scenes utilizing the key future points predicted in the DL stage.
arXiv Detail & Related papers (2023-03-22T02:47:42Z)
- LatentFormer: Multi-Agent Transformer-Based Interaction Modeling and Trajectory Prediction [12.84508682310717]
We propose LatentFormer, a transformer-based model for predicting future vehicle trajectories.
We evaluate the proposed method on the nuScenes benchmark dataset and show that our approach achieves state-of-the-art performance and improves upon trajectory metrics by up to 40%.
arXiv Detail & Related papers (2022-03-03T17:44:58Z)
- Masked Transformer for Neighbourhood-aware Click-Through Rate Prediction [74.52904110197004]
We propose Neighbor-Interaction based CTR prediction, which puts this task into a Heterogeneous Information Network (HIN) setting.
In order to enhance the representation of the local neighbourhood, we consider four types of topological interaction among the nodes.
We conduct comprehensive experiments on two real world datasets and the experimental results show that our proposed method outperforms state-of-the-art CTR models significantly.
arXiv Detail & Related papers (2022-01-25T12:44:23Z)
- Video Frame Interpolation Transformer [86.20646863821908]
We propose a Transformer-based video interpolation framework that allows content-aware aggregation weights and considers long-range dependencies with the self-attention operations.
To avoid the high computational cost of global self-attention, we introduce the concept of local attention into video.
In addition, we develop a multi-scale frame scheme to fully realize the potential of Transformers.
arXiv Detail & Related papers (2021-11-27T05:35:10Z)
- Decoder Fusion RNN: Context and Interaction Aware Decoders for Trajectory Prediction [53.473846742702854]
We propose a recurrent, attention-based approach for motion forecasting.
Decoder Fusion RNN (DF-RNN) is composed of a recurrent behavior encoder, an inter-agent multi-headed attention module, and a context-aware decoder.
We demonstrate the efficacy of our method by testing it on the Argoverse motion forecasting dataset and show its state-of-the-art performance on the public benchmark.
arXiv Detail & Related papers (2021-08-12T15:53:37Z)
- End-to-end Contextual Perception and Prediction with Interaction Transformer [79.14001602890417]
We tackle the problem of detecting objects in 3D and forecasting their future motion in the context of self-driving.
To capture their spatial-temporal dependencies, we propose a recurrent neural network with a novel Transformer architecture.
Our model can be trained end-to-end, and runs in real-time.
arXiv Detail & Related papers (2020-08-13T14:30:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.