Learning Future Object Prediction with a Spatiotemporal Detection
Transformer
- URL: http://arxiv.org/abs/2204.10321v1
- Date: Thu, 21 Apr 2022 17:58:36 GMT
- Title: Learning Future Object Prediction with a Spatiotemporal Detection
Transformer
- Authors: Adam Tonderski, Joakim Johnander, Christoffer Petersson, and Kalle Åström
- Abstract summary: We train a detection transformer to directly output future objects.
We extend existing transformers in two ways to capture scene dynamics.
Our final approach learns to capture the dynamics and make predictions on par with an oracle for 100 ms prediction horizons.
- Score: 1.1543275835002982
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We explore future object prediction -- a challenging problem where all
objects visible in a future video frame are to be predicted. We propose to
tackle this problem end-to-end by training a detection transformer to directly
output future objects. In order to make accurate predictions about the future,
it is necessary to capture the dynamics in the scene, both of other objects and
of the ego-camera. We extend existing detection transformers in two ways to
capture the scene dynamics. First, we experiment with three different
mechanisms that enable the model to spatiotemporally process multiple frames.
Second, we feed ego-motion information to the model via cross-attention. We
show that both of these cues substantially improve future object prediction
performance. Our final approach learns to capture the dynamics and make
predictions on par with an oracle for 100 ms prediction horizons, and
outperform baselines for longer prediction horizons.
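As a rough illustration of the two cues described above, the sketch below shows one way a detection-transformer decoder layer could attend jointly to flattened multi-frame image features and to an ego-motion signal via cross-attention. This is a minimal PyTorch sketch under our own assumptions, not the authors' implementation; the module and parameter names (e.g. EgoCrossAttnDecoderLayer, ego_dim) are hypothetical.

```python
# Minimal sketch (not the paper's code) of a decoder layer in which object
# queries cross-attend to multi-frame image features and to an embedded
# ego-motion token. Names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class EgoCrossAttnDecoderLayer(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 8, ego_dim: int = 6):
        super().__init__()
        # Self-attention over the object queries.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Cross-attention from queries to features of multiple past frames.
        self.img_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Cross-attention from queries to a single ego-motion token.
        self.ego_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Embed the raw ego-motion vector (e.g. speed, yaw rate) to d_model.
        self.ego_embed = nn.Sequential(
            nn.Linear(ego_dim, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, queries, img_feats, ego_motion):
        # queries:    (B, num_queries, d_model) learned object queries
        # img_feats:  (B, T * H * W, d_model)   flattened features from T frames
        # ego_motion: (B, ego_dim)              raw ego-motion signal
        q = self.norms[0](queries + self.self_attn(queries, queries, queries)[0])
        q = self.norms[1](q + self.img_attn(q, img_feats, img_feats)[0])
        ego = self.ego_embed(ego_motion).unsqueeze(1)  # (B, 1, d_model)
        q = self.norms[2](q + self.ego_attn(q, ego, ego)[0])
        return self.norms[3](q + self.ffn(q))


if __name__ == "__main__":
    layer = EgoCrossAttnDecoderLayer()
    queries = torch.randn(2, 100, 256)             # 100 object queries per sample
    img_feats = torch.randn(2, 4 * 15 * 25, 256)   # features from 4 past frames
    ego = torch.randn(2, 6)                        # ego-motion vector
    print(layer(queries, img_feats, ego).shape)    # torch.Size([2, 100, 256])
```

Here the ego-motion vector is embedded into a single token before being attended to; the abstract notes that the spatiotemporal processing of frames itself can be realized by several different mechanisms, so the frame-fusion step shown above is only one possibility.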
Related papers
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models, which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z) - TAMFormer: Multi-Modal Transformer with Learned Attention Mask for Early
Intent Prediction [3.158346511479111]
We focus on pedestrians' early intention prediction in which, from a current observation of an urban scene, the model predicts the future activity of pedestrians that approach the street.
Our method is based on a multi-modal transformer that encodes past observations and produces multiple predictions at different anticipation times.
arXiv Detail & Related papers (2022-10-26T13:47:23Z) - T3VIP: Transformation-based 3D Video Prediction [49.178585201673364]
We propose a 3D video prediction (T3VIP) approach that explicitly models the 3D motion by decomposing a scene into its object parts.
Our model is fully unsupervised and captures the nature of the real world; observational cues in the image and point cloud domains constitute its learning signals.
To the best of our knowledge, our model is the first generative model that provides an RGB-D video prediction of the future for a static camera.
arXiv Detail & Related papers (2022-09-19T15:01:09Z) - StretchBEV: Stretching Future Instance Prediction Spatially and
Temporally [0.0]
In self-driving cars, predicting the future locations and motions of all agents around the vehicle is a crucial requirement for planning.
Recently, a new joint formulation of perception and prediction has emerged that fuses rich sensory information from multiple cameras into a compact bird's-eye-view representation from which predictions are made.
However, the quality of future predictions degrades at longer time horizons because multiple futures remain plausible.
In this work, we address this inherent uncertainty in future predictions with a temporal model.
arXiv Detail & Related papers (2022-03-25T13:28:44Z) - Panoptic Segmentation Forecasting [71.75275164959953]
Our goal is to forecast the near future given a set of recent observations.
We argue that this ability to forecast, i.e., to anticipate, is integral to the success of autonomous agents.
We develop a two-component model: one component learns the dynamics of the background stuff by anticipating odometry, while the other anticipates the dynamics of detected things.
arXiv Detail & Related papers (2021-04-08T17:59:16Z) - Future Frame Prediction for Robot-assisted Surgery [57.18185972461453]
We propose a ternary prior guided variational autoencoder (TPG-VAE) model for future frame prediction in robotic surgical video sequences.
Besides the content distribution, our model learns a motion distribution, which is novel in handling the small movements of surgical tools.
arXiv Detail & Related papers (2021-03-18T15:12:06Z) - Learning to Anticipate Egocentric Actions by Imagination [60.21323541219304]
We study the egocentric action anticipation task, which predicts a future action in egocentric videos seconds before it is performed.
Our method significantly outperforms previous methods on both the seen test set and the unseen test set of the EPIC Kitchens Action Anticipation Challenge.
arXiv Detail & Related papers (2021-01-13T08:04:10Z) - Future Frame Prediction of a Video Sequence [5.660207256468971]
The ability to predict, anticipate and reason about future events is the essence of intelligence.
arXiv Detail & Related papers (2020-08-31T15:31:02Z) - End-to-end Contextual Perception and Prediction with Interaction
Transformer [79.14001602890417]
We tackle the problem of detecting objects in 3D and forecasting their future motion in the context of self-driving.
To capture their spatial-temporal dependencies, we propose a recurrent neural network with a novel Transformer architecture.
Our model can be trained end-to-end, and runs in real-time.
arXiv Detail & Related papers (2020-08-13T14:30:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.