TAMFormer: Multi-Modal Transformer with Learned Attention Mask for Early Intent Prediction
- URL: http://arxiv.org/abs/2210.14714v1
- Date: Wed, 26 Oct 2022 13:47:23 GMT
- Title: TAMFormer: Multi-Modal Transformer with Learned Attention Mask for Early Intent Prediction
- Authors: Nada Osman and Guglielmo Camporese and Lamberto Ballan
- Abstract summary: We focus on pedestrians' early intention prediction in which, from a current observation of an urban scene, the model predicts the future activity of pedestrians that approach the street.
Our method is based on a multi-modal transformer that encodes past observations and produces multiple predictions at different anticipation times.
- Score: 3.158346511479111
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Human intention prediction is a growing area of research where an activity in
a video has to be anticipated by a vision-based system. To this end, the model
creates a representation of the past, and subsequently, it produces future
hypotheses about upcoming scenarios. In this work, we focus on pedestrians'
early intention prediction in which, from a current observation of an urban
scene, the model predicts the future activity of pedestrians that approach the
street. Our method is based on a multi-modal transformer that encodes past
observations and produces multiple predictions at different anticipation times.
Moreover, we propose to learn the attention masks of our transformer-based
model (Temporal Adaptive Mask Transformer) so that present and past temporal
dependencies are weighted differently. We evaluate our method on several
public benchmarks for early intention prediction, improving prediction
performance at different anticipation times compared to previous work.
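To make the mechanism concrete, below is a minimal PyTorch-style sketch of self-attention with a learned temporal mask, in the spirit of the Temporal Adaptive Mask described above. The class name, the additive-logit mask parameterization, and all dimensions are illustrative assumptions, not the paper's released code.

```python
# Hypothetical sketch: self-attention with a learnable temporal mask.
# Learned logits are added to the attention scores so the model can
# down-weight (large negative logit) or keep (logit near zero) each
# past time step; future steps stay hard-masked to preserve causality.
import torch
import torch.nn as nn

class TemporalAdaptiveMaskAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int, max_len: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learnable additive mask over (query step, key step) pairs.
        self.mask_logits = nn.Parameter(torch.zeros(max_len, max_len))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        t = x.size(1)
        future = torch.triu(
            torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1
        )
        mask = self.mask_logits[:t, :t].masked_fill(future, float("-inf"))
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out

# Toy usage: a batch of 2 sequences of 8 fused multi-modal embeddings.
layer = TemporalAdaptiveMaskAttention(dim=64, num_heads=4, max_len=32)
past = torch.randn(2, 8, 64)  # (batch, time, feature)
print(layer(past).shape)      # torch.Size([2, 8, 64])
```

At initialization the mask attends uniformly over present and past; training then learns which temporal dependencies to emphasize or suppress.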
Related papers
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- Humanoid Locomotion as Next Token Prediction [84.21335675130021]
Our model is a causal transformer trained via autoregressive prediction of sensorimotor trajectories.
We show that our model enables a full-sized humanoid to walk in San Francisco zero-shot.
Our model transfers to the real world even when trained on only 27 hours of walking data, and can generalize to commands not seen during training, such as walking backward.
arXiv Detail & Related papers (2024-02-29T18:57:37Z)
- Predictive Churn with the Set of Good Models [64.05949860750235]
We study the effect of conflicting predictions over the set of near-optimal machine learning models.
We present theoretical results on the expected churn between models within the Rashomon set.
We show how our approach can be used to better anticipate, reduce, and avoid churn in consumer-facing applications.
arXiv Detail & Related papers (2024-02-12T16:15:25Z)
- Sinkhorn-Flow: Predicting Probability Mass Flow in Dynamical Systems Using Optimal Transport [89.61692654941106]
We propose a new approach to predicting such mass flow over time using optimal transport.
We apply our approach to the task of predicting how communities will evolve over time in social network settings.
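(A generic Sinkhorn iteration for this optimal-transport formulation is sketched after this list.)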
arXiv Detail & Related papers (2023-03-14T07:25:44Z)
- Graph-based Spatial Transformer with Memory Replay for Multi-future Pedestrian Trajectory Prediction [13.466380808630188]
We propose a model to forecast multiple paths based on a historical trajectory.
Our method exploits spatial information and corrects temporally inconsistent trajectories.
Our experiments show that the proposed model achieves state-of-the-art performance on multi-future prediction and competitive results for single-future prediction.
arXiv Detail & Related papers (2022-06-12T10:25:12Z)
- Learning Future Object Prediction with a Spatiotemporal Detection Transformer [1.1543275835002982]
We train a detection transformer to directly output future objects.
We extend existing transformers in two ways to capture scene dynamics.
Our final approach learns to capture the dynamics and make predictions on par with an oracle for 100 ms prediction horizons.
arXiv Detail & Related papers (2022-04-21T17:58:36Z)
- StretchBEV: Stretching Future Instance Prediction Spatially and Temporally [0.0]
In self-driving cars, predicting the future locations and motions of all agents around the vehicle is a crucial requirement for planning.
Recently, a new joint formulation of perception and prediction has emerged by fusing rich sensory information perceived from multiple cameras into a compact bird's-eye view representation to perform prediction.
However, the quality of future predictions degrades at longer time horizons because multiple plausible futures exist.
In this work, we address this inherent uncertainty in future predictions with a temporal model.
arXiv Detail & Related papers (2022-03-25T13:28:44Z)
- FIERY: Future Instance Prediction in Bird's-Eye View from Surround Monocular Cameras [33.08698074581615]
We present FIERY: a probabilistic future prediction model in bird's-eye view from monocular cameras.
Our approach combines the perception, sensor fusion and prediction components of a traditional autonomous driving stack.
We show that our model outperforms previous prediction baselines on the NuScenes and Lyft datasets.
arXiv Detail & Related papers (2021-04-21T12:21:40Z)
- Panoptic Segmentation Forecasting [71.75275164959953]
Our goal is to forecast the near future given a set of recent observations.
We argue that this ability to forecast, i.e., to anticipate, is integral to the success of autonomous agents.
We develop a two-component model: one component learns the dynamics of the background stuff by anticipating odometry, the other one anticipates the dynamics of detected things.
arXiv Detail & Related papers (2021-04-08T17:59:16Z)
- LookOut: Diverse Multi-Future Prediction and Planning for Self-Driving [139.33800431159446]
LookOut is an approach to jointly perceive the environment and predict a diverse set of futures from sensor data.
We show that our model demonstrates significantly more diverse and sample-efficient motion forecasting in a large-scale self-driving dataset.
arXiv Detail & Related papers (2021-01-16T23:19:22Z)
- Multimodal semantic forecasting based on conditional generation of future features [0.0]
This paper considers semantic forecasting in road-driving scenes.
Most existing approaches address this problem as deterministic regression of future features or future predictions given observed frames.
We propose to formulate multimodal forecasting as sampling of a multimodal generative model conditioned on the observed frames.
arXiv Detail & Related papers (2020-10-18T18:59:52Z)
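As referenced in the Sinkhorn-Flow entry above, here is a minimal NumPy sketch of the generic entropic Sinkhorn iteration that underlies optimal-transport mass-flow prediction. It is a textbook solver under assumed inputs (source and target histograms plus a cost matrix), not that paper's model.

```python
# Generic entropic optimal transport via Sinkhorn iterations (illustrative).
import numpy as np

def sinkhorn(a, b, cost, eps=0.1, iters=200):
    """Transport plan moving histogram a onto histogram b."""
    K = np.exp(-cost / eps)          # Gibbs kernel from the cost matrix
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)            # rescale to match the target marginal b
        u = a / (K @ v)              # rescale to match the source marginal a
    return u[:, None] * K * v[None, :]

# Toy usage: mass flowing among 3 communities across one time step.
a = np.array([0.5, 0.3, 0.2])        # community sizes at time t
b = np.array([0.2, 0.3, 0.5])        # community sizes at time t+1
cost = 1.0 - np.eye(3)               # staying is free, moving costs 1
plan = sinkhorn(a, b, cost)
print(plan.round(3))                 # rows sum to a, columns sum to b
```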