ViP3D: End-to-end Visual Trajectory Prediction via 3D Agent Queries
- URL: http://arxiv.org/abs/2208.01582v3
- Date: Mon, 19 Jun 2023 11:50:41 GMT
- Title: ViP3D: End-to-end Visual Trajectory Prediction via 3D Agent Queries
- Authors: Junru Gu, Chenxu Hu, Tianyuan Zhang, Xuanyao Chen, Yilun Wang, Yue
Wang, Hang Zhao
- Abstract summary: We propose ViP3D, a query-based visual trajectory prediction pipeline.
It exploits rich information from raw videos to directly predict future trajectories of agents in a scene.
ViP3D employs sparse agent queries to detect, track, and predict throughout the pipeline.
- Score: 17.117542692443617
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Perception and prediction are two separate modules in the existing autonomous
driving systems. They interact with each other via hand-picked features such as
agent bounding boxes and trajectories. Due to this separation, prediction, as a
downstream module, only receives limited information from the perception
module. To make matters worse, errors from the perception modules can propagate
and accumulate, adversely affecting the prediction results. In this work, we
propose ViP3D, a query-based visual trajectory prediction pipeline that
exploits rich information from raw videos to directly predict future
trajectories of agents in a scene. ViP3D employs sparse agent queries to
detect, track, and predict throughout the pipeline, making it the first fully
differentiable vision-based trajectory prediction approach. Instead of using
historical feature maps and trajectories, useful information from previous
timestamps is encoded in agent queries, which makes ViP3D a concise streaming
prediction method. Furthermore, extensive experimental results on the nuScenes
dataset show the strong vision-based prediction performance of ViP3D over
traditional pipelines and previous end-to-end models.
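The core idea of the pipeline, carrying a small set of learnable agent queries through detection, tracking, and prediction, can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation; the module names, dimensions, and the simple cross-attention plus recurrent update are assumptions chosen for illustration only.

```python
import torch
import torch.nn as nn

class AgentQueryPipeline(nn.Module):
    """Minimal sketch of a query-based detect/track/predict loop (not the official ViP3D code)."""
    def __init__(self, num_queries=32, dim=256, horizon=12):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))   # sparse agent queries
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.temporal_update = nn.GRUCell(dim, dim)                  # carries queries across timestamps
        self.traj_head = nn.Linear(dim, horizon * 2)                 # (x, y) waypoint per future step
        self.horizon = horizon

    def forward(self, image_feats_per_frame):
        # image_feats_per_frame: list of [B, N_tokens, dim] visual features, one per timestamp
        B = image_feats_per_frame[0].shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)              # [B, Q, dim]
        for feats in image_feats_per_frame:
            # queries attend to the current frame's features (detection / association)
            attended, _ = self.cross_attn(q, feats, feats)
            # recurrent update keeps history inside the queries instead of in feature maps
            q = self.temporal_update(
                attended.reshape(-1, attended.shape[-1]),
                q.reshape(-1, q.shape[-1]),
            ).view(B, -1, attended.shape[-1])
        # decode a future trajectory for every agent query
        return self.traj_head(q).view(B, -1, self.horizon, 2)

feats = [torch.randn(1, 900, 256) for _ in range(4)]                 # 4 past frames of dummy features
trajs = AgentQueryPipeline()(feats)
print(trajs.shape)                                                   # torch.Size([1, 32, 12, 2])
```

Because the queries are updated frame by frame, the sketch consumes one new frame at a time, which is what makes a query-based design a natural fit for streaming prediction.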
Related papers
- VisionTrap: Vision-Augmented Trajectory Prediction Guided by Textual Descriptions [10.748597086208145]
In this work, we propose a novel method that also incorporates visual input from surround-view cameras.
Our method achieves a latency of 53 ms, making it feasible for real-time processing.
Our experiments show that both the visual inputs and the textual descriptions contribute to improvements in trajectory prediction performance.
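A hedged sketch of how per-agent camera features and a text-description embedding might be fused before trajectory decoding follows; the concatenation-based fusion and all names and sizes here are illustrative assumptions, not VisionTrap's actual architecture.

```python
import torch
import torch.nn as nn

class TextGuidedTrajectoryHead(nn.Module):
    """Illustrative fusion of per-agent visual features with a text embedding (assumed design)."""
    def __init__(self, vis_dim=256, txt_dim=512, horizon=12):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(vis_dim + txt_dim, 256), nn.ReLU(),
            nn.Linear(256, horizon * 2),
        )
        self.horizon = horizon

    def forward(self, agent_vis_feat, text_feat):
        # agent_vis_feat: [A, vis_dim] surround-view features pooled per agent
        # text_feat: [A, txt_dim] embedding of a textual description per agent
        fused = torch.cat([agent_vis_feat, text_feat], dim=-1)
        return self.fuse(fused).view(-1, self.horizon, 2)

head = TextGuidedTrajectoryHead()
print(head(torch.randn(5, 256), torch.randn(5, 512)).shape)  # torch.Size([5, 12, 2])
```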
arXiv Detail & Related papers (2024-07-17T06:39:52Z)
- Pre-training on Synthetic Driving Data for Trajectory Prediction [61.520225216107306]
We propose a pipeline-level solution to mitigate the issue of data scarcity in trajectory forecasting.
We adopt HD map augmentation and trajectory synthesis for generating driving data, and then we learn representations by pre-training on them.
We conduct extensive experiments to demonstrate the effectiveness of our data expansion and pre-training strategies.
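The pre-train-then-fine-tune recipe described above could be organized roughly as below; the two-stage loop, optimizer settings, loss, and data loaders are placeholders for illustration, not the paper's actual setup.

```python
import torch

def run_epoch(model, loader, optimizer, loss_fn):
    """One optimization pass; identical for synthetic pre-training and real-data fine-tuning."""
    for scene, future_gt in loader:
        pred = model(scene)
        loss = loss_fn(pred, future_gt)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def pretrain_then_finetune(model, synthetic_loader, real_loader, epochs=(10, 5)):
    # Stage 1: representation learning on augmented-map / synthesized-trajectory data.
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for _ in range(epochs[0]):
        run_epoch(model, synthetic_loader, opt, torch.nn.functional.smooth_l1_loss)
    # Stage 2: fine-tune on the scarcer real driving data with a smaller learning rate.
    opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
    for _ in range(epochs[1]):
        run_epoch(model, real_loader, opt, torch.nn.functional.smooth_l1_loss)
```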
arXiv Detail & Related papers (2023-09-18T19:49:22Z)
- XVTP3D: Cross-view Trajectory Prediction Using Shared 3D Queries for Autonomous Driving [7.616422495497465]
Trajectory prediction with uncertainty is a critical and challenging task for autonomous driving.
We present a cross-view trajectory prediction method using shared 3D queries (XVTP3D).
The results of experiments on two publicly available datasets show that XVTP3D achieved state-of-the-art performance with consistent cross-view predictions.
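One plausible reading of "shared 3D queries" is a single set of 3D reference points that is projected into every view so that per-view predictions stay consistent; the sketch below follows that assumption for illustration and is not the XVTP3D code.

```python
import torch

def project_shared_queries(queries_3d, view_matrices):
    """Project one shared set of 3D query points into several views (illustrative only).

    queries_3d: [Q, 3] 3D reference points shared by all views.
    view_matrices: [V, 3, 4] camera projection matrices, one per view.
    Returns [V, Q, 2] image coordinates, so every view reasons about the same agents.
    """
    Q = queries_3d.shape[0]
    homog = torch.cat([queries_3d, torch.ones(Q, 1)], dim=-1)          # [Q, 4] homogeneous points
    cam = torch.einsum('vij,qj->vqi', view_matrices, homog)            # [V, Q, 3]
    return cam[..., :2] / cam[..., 2:3].clamp(min=1e-6)                # perspective divide

pts = project_shared_queries(torch.rand(16, 3) * 10, torch.randn(6, 3, 4))
print(pts.shape)  # torch.Size([6, 16, 2])
```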
arXiv Detail & Related papers (2023-08-17T03:35:13Z)
- ST-P3: End-to-end Vision-based Autonomous Driving via Spatial-Temporal Feature Learning [132.20119288212376]
We propose a spatial-temporal feature learning scheme that produces a set of more representative features for perception, prediction, and planning tasks simultaneously.
To the best of our knowledge, we are the first to systematically investigate each part of an interpretable end-to-end vision-based autonomous driving system.
arXiv Detail & Related papers (2022-07-15T16:57:43Z)
- You Mostly Walk Alone: Analyzing Feature Attribution in Trajectory Prediction [52.442129609979794]
Recent deep learning approaches for trajectory prediction show promising performance.
It remains unclear which features such black-box models actually learn to use for making predictions.
This paper proposes a procedure that quantifies the contributions of different cues to model performance.
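The kind of procedure described, quantifying how much each input cue contributes to prediction quality, can be sketched as an ablation loop over cues. The cue names, the dict-style batch, and the ADE metric below are illustrative assumptions, not a reimplementation of the paper's attribution method.

```python
import torch

def ade(pred, gt):
    """Average displacement error between predicted and ground-truth trajectories."""
    return (pred - gt).norm(dim=-1).mean().item()

def cue_contributions(model, batch, cues=('past_motion', 'social', 'map')):
    """Illustrative ablation: drop one cue at a time and measure the change in error.

    `batch` is assumed to be a dict with one tensor per cue plus 'future_gt';
    "dropping" a cue here simply zeroes it out.  This is a simplified stand-in for
    the paper's attribution procedure, not the procedure itself.
    """
    base_err = ade(model(batch), batch['future_gt'])
    scores = {}
    for cue in cues:
        ablated = dict(batch)
        ablated[cue] = torch.zeros_like(batch[cue])
        scores[cue] = ade(model(ablated), batch['future_gt']) - base_err
    return scores   # larger increase in error => the model relies more on that cue
```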
arXiv Detail & Related papers (2021-10-11T14:24:15Z)
- Semantic Prediction: Which One Should Come First, Recognition or Prediction? [21.466783934830925]
One of the primary downstream tasks is interpreting the scene's semantic composition and using it for decision-making.
Given a pre-trained video prediction model and a pre-trained semantic extraction model, there are two main configurations for obtaining future semantics: predict future frames and then extract their semantics, or extract semantics from the observed frames and then predict how they evolve.
We investigate these configurations using the Local Frequency Domain Transformer Network (LFDTN) as the video prediction model and U-Net as the semantic extraction model on synthetic and real datasets.
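The two configurations being compared can be written down compactly; the function arguments below are placeholders standing in for the pre-trained LFDTN video predictor and U-Net semantic extractor.

```python
def predict_then_recognize(frames, video_predictor, semantic_extractor):
    """Configuration A: forecast future frames first, then extract their semantics."""
    future_frames = video_predictor(frames)
    return semantic_extractor(future_frames)

def recognize_then_predict(frames, video_predictor, semantic_extractor):
    """Configuration B: extract semantics from observed frames first, then forecast the semantic maps."""
    semantic_maps = semantic_extractor(frames)
    return video_predictor(semantic_maps)
```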
arXiv Detail & Related papers (2021-10-06T15:01:05Z)
- SLPC: a VRNN-based approach for stochastic lidar prediction and completion in autonomous driving [63.87272273293804]
We propose a new LiDAR prediction framework based on generative models, namely Variational Recurrent Neural Networks (VRNNs).
Our algorithm is able to address the limitations of previous video prediction frameworks when dealing with sparse data by spatially inpainting the depth maps in the upcoming frames.
We present a sparse version of VRNNs and an effective self-supervised training method that does not require any labels.
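A rough sketch of such a label-free training signal: the next observed (sparse) depth map supervises the prediction, with a validity mask so that empty LiDAR pixels do not contribute to the loss. The masking scheme and tensor shapes are assumptions, not the SLPC implementation.

```python
import torch

def masked_prediction_loss(pred_depth, next_depth):
    """Self-supervised loss for sparse range images (illustrative).

    pred_depth, next_depth: [B, 1, H, W]; pixels with no LiDAR return are 0 in next_depth.
    Only valid (non-empty) pixels supervise the prediction, so no labels are needed.
    """
    valid = (next_depth > 0).float()
    l1 = (pred_depth - next_depth).abs() * valid
    return l1.sum() / valid.sum().clamp(min=1.0)

loss = masked_prediction_loss(torch.rand(2, 1, 64, 1024), torch.rand(2, 1, 64, 1024))
print(loss.item())
```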
arXiv Detail & Related papers (2021-02-19T11:56:44Z)
- Motion Segmentation using Frequency Domain Transformer Networks [29.998917158604694]
We propose a novel end-to-end learnable architecture that predicts the next frame by modeling foreground and background separately.
Our approach can outperform some widely used video prediction methods like Video Ladder Network and Predictive Gated Pyramids on synthetic data.
arXiv Detail & Related papers (2020-04-18T15:05:11Z)
- Inverting the Pose Forecasting Pipeline with SPF2: Sequential Pointcloud Forecasting for Sequential Pose Forecasting [106.3504366501894]
Self-driving vehicles and robotic manipulation systems often forecast future object poses by first detecting and tracking objects.
This detect-then-forecast pipeline is expensive to scale, as pose forecasting algorithms typically require labeled sequences of object poses.
We propose to first forecast 3D sensor data and then detect/track objects on the predicted point cloud sequences to obtain future poses.
This makes it less expensive to scale pose forecasting, as the sensor data forecasting task requires no labels.
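The two pipeline orderings contrasted above can be expressed as two small functions; the detector, tracker, and forecaster arguments are placeholders for whichever components are actually used.

```python
def detect_then_forecast(point_clouds, detector, tracker, pose_forecaster):
    """Standard pipeline: the pose forecaster needs labeled sequences of object poses to train."""
    tracks = tracker(detector(point_clouds))
    return pose_forecaster(tracks)

def forecast_then_detect(point_clouds, sensor_forecaster, detector, tracker):
    """Inverted (SPF2-style) pipeline: the sensor forecaster trains on raw, unlabeled point clouds."""
    future_point_clouds = sensor_forecaster(point_clouds)
    return tracker(detector(future_point_clouds))
```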
arXiv Detail & Related papers (2020-03-18T17:54:28Z)
- TTPP: Temporal Transformer with Progressive Prediction for Efficient Action Anticipation [46.28067541184604]
Video action anticipation aims to predict future action categories from observed frames.
Current state-of-the-art approaches mainly resort to recurrent neural networks to encode history information into hidden states.
This paper proposes a simple yet efficient Temporal Transformer with Progressive Prediction framework.
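The contrast with RNN-based anticipation, encoding the observed frames with attention rather than hidden states, can be illustrated with a small PyTorch module. This is a generic transformer-over-history sketch under assumed dimensions and class counts, not the TTPP architecture itself.

```python
import torch
import torch.nn as nn

class TransformerAnticipator(nn.Module):
    """Encode observed frame features with self-attention and classify the upcoming action."""
    def __init__(self, dim=512, num_classes=48, layers=2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, frame_feats):
        # frame_feats: [B, T_observed, dim] features of the observed frames
        encoded = self.encoder(frame_feats)
        return self.classifier(encoded[:, -1])   # predict the future action category

logits = TransformerAnticipator()(torch.randn(4, 16, 512))
print(logits.shape)  # torch.Size([4, 48])
```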
arXiv Detail & Related papers (2020-03-07T07:59:42Z)
- Deep Learning for Content-based Personalized Viewport Prediction of 360-Degree VR Videos [72.08072170033054]
In this paper, a deep learning network is introduced to leverage position data as well as video frame content to predict future head movement.
To optimize the data input to this neural network, the effects of sample rate, reduced data, and long-period prediction length are also explored for this model.
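A minimal sketch of the fusion described, concatenating past head-position data with frame-content features before regressing the future viewport; the feature sizes and the GRU-plus-linear design are assumptions rather than the paper's network.

```python
import torch
import torch.nn as nn

class ViewportPredictor(nn.Module):
    """Predict future head orientation from past positions plus frame-content features (illustrative)."""
    def __init__(self, content_dim=128, hidden=64, pred_len=10):
        super().__init__()
        self.rnn = nn.GRU(input_size=2 + content_dim, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, pred_len * 2)   # future (yaw, pitch) per step
        self.pred_len = pred_len

    def forward(self, past_positions, content_feats):
        # past_positions: [B, T, 2] (yaw, pitch); content_feats: [B, T, content_dim] per-frame features
        x = torch.cat([past_positions, content_feats], dim=-1)
        _, h = self.rnn(x)
        return self.head(h[-1]).view(-1, self.pred_len, 2)

pred = ViewportPredictor()(torch.randn(3, 30, 2), torch.randn(3, 30, 128))
print(pred.shape)  # torch.Size([3, 10, 2])
```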
arXiv Detail & Related papers (2020-03-01T07:31:50Z)