Joint Hand Motion and Interaction Hotspots Prediction from Egocentric
Videos
- URL: http://arxiv.org/abs/2204.01696v1
- Date: Mon, 4 Apr 2022 17:59:03 GMT
- Title: Joint Hand Motion and Interaction Hotspots Prediction from Egocentric
Videos
- Authors: Shaowei Liu, Subarna Tripathi, Somdeb Majumdar, Xiaolong Wang
- Abstract summary: We forecast future hand-object interactions given an egocentric video.
Instead of predicting action labels or pixels, we directly predict the hand motion trajectory and the future contact points on the next active object.
Our model performs hand and object interaction reasoning via the self-attention mechanism in Transformers.
- Score: 13.669927361546872
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose to forecast future hand-object interactions given an egocentric
video. Instead of predicting action labels or pixels, we directly predict the
hand motion trajectory and the future contact points on the next active object
(i.e., interaction hotspots). This relatively low-dimensional representation
provides a concrete description of future interactions. To tackle this task, we
first provide an automatic way to collect trajectory and hotspots labels on
large-scale data. We then use this data to train an Object-Centric Transformer
(OCT) model for prediction. Our model performs hand and object interaction
reasoning via the self-attention mechanism in Transformers. OCT also provides a
probabilistic framework to sample the future trajectory and hotspots to handle
uncertainty in prediction. We perform experiments on the Epic-Kitchens-55,
Epic-Kitchens-100, and EGTEA Gaze+ datasets, and show that OCT outperforms
state-of-the-art approaches by a large margin. The project page is
available at https://stevenlsw.github.io/hoi-forecast .
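For illustration, here is a minimal sketch of the mechanism described in the abstract: self-attention over hand and object tokens, followed by a stochastic head that samples several future hand trajectories and per-object contact points. All shapes, layer sizes, and the Gaussian sampling head are assumptions made for this example, not the authors' implementation (see the project page for that).

```python
# Minimal sketch (not the authors' code): self-attention over hand/object tokens,
# then stochastic decoding of future hand trajectories via reparameterization.
import torch
import torch.nn as nn

class TinyOCT(nn.Module):
    def __init__(self, dim=128, horizon=4):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Heads emit a Gaussian over future hand waypoints (x, y per step),
        # so several plausible futures can be sampled.
        self.traj_mu = nn.Linear(dim, horizon * 2)
        self.traj_logvar = nn.Linear(dim, horizon * 2)
        self.hotspot = nn.Linear(dim, 2)  # one (x, y) contact point per object token

    def forward(self, hand_tokens, obj_tokens, n_samples=3):
        # hand_tokens: (B, T, dim) past hand features; obj_tokens: (B, N, dim) object features
        tokens = torch.cat([hand_tokens, obj_tokens], dim=1)
        fused = self.encoder(tokens)                          # joint hand-object reasoning
        hand_feat = fused[:, : hand_tokens.shape[1]].mean(dim=1)
        mu, logvar = self.traj_mu(hand_feat), self.traj_logvar(hand_feat)
        std = (0.5 * logvar).exp()
        trajs = [mu + std * torch.randn_like(std) for _ in range(n_samples)]
        hotspots = self.hotspot(fused[:, hand_tokens.shape[1]:])
        return trajs, hotspots

model = TinyOCT()
trajs, hotspots = model(torch.randn(1, 8, 128), torch.randn(1, 5, 128))
```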
Related papers
- AFF-ttention! Affordances and Attention models for Short-Term Object Interaction Anticipation [14.734158936250918]
Short-Term Object Interaction Anticipation (STA) is fundamental for wearable assistants or human-robot interaction to understand user goals.
We improve the performance of STA predictions with two contributions.
First, we propose STAformer, a novel attention-based architecture integrating frame-guided temporal pooling, dual image-video attention, and multiscale feature fusion.
Second, we predict interaction hotspots from the observation of hands and object trajectories, increasing confidence in STA predictions localized around the hotspot.
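As a hedged illustration of the second contribution, the sketch below rescales object-interaction confidence scores by the predicted hotspot probability at each detection's center; the weighting rule and the alpha parameter are made up for this example and may differ from the paper's formulation.

```python
# Illustrative sketch only: rescale interaction scores by the predicted hotspot
# probability at each detection's center (the paper's exact rule may differ).
import numpy as np

def reweight_by_hotspot(boxes, scores, hotspot_map, alpha=0.5):
    """boxes: (N, 4) as (x1, y1, x2, y2); hotspot_map: (H, W) values in [0, 1]."""
    h, w = hotspot_map.shape
    out = scores.copy()
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        cx = int(np.clip((x1 + x2) / 2, 0, w - 1))
        cy = int(np.clip((y1 + y2) / 2, 0, h - 1))
        out[i] = scores[i] * (1 - alpha + alpha * hotspot_map[cy, cx])
    return out

scores = reweight_by_hotspot(
    np.array([[10, 10, 50, 50], [100, 60, 140, 120]], dtype=float),
    np.array([0.8, 0.6]),
    np.random.rand(128, 128),
)
```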
arXiv Detail & Related papers (2024-06-03T10:57:18Z)
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- JRDB-Traj: A Dataset and Benchmark for Trajectory Forecasting in Crowds [79.00975648564483]
Trajectory forecasting models, employed in fields such as robotics, autonomous vehicles, and navigation, face challenges in real-world scenarios.
The JRDB-Traj dataset provides comprehensive data, including the locations of all agents, scene images, and point clouds, all from the robot's perspective.
The objective is to predict the future positions of agents relative to the robot using raw sensory input data.
arXiv Detail & Related papers (2023-11-05T18:59:31Z)
- Leveraging Next-Active Objects for Context-Aware Anticipation in Egocentric Videos [31.620555223890626]
We study the problem of Short-Term Object Interaction Anticipation (STA).
We propose NAOGAT, a multi-modal end-to-end transformer network that uses next-active objects to guide the prediction of context-aware future actions.
Our model outperforms existing methods on two separate datasets.
arXiv Detail & Related papers (2023-08-16T12:07:02Z)
- Anticipating Next Active Objects for Egocentric Videos [29.473527958651317]
This paper addresses the problem of anticipating the next-active-object location in the future, for a given egocentric video clip.
We propose a transformer-based self-attention framework to identify and locate the next-active-object in an egocentric clip.
arXiv Detail & Related papers (2023-02-13T13:44:52Z)
- Interaction Region Visual Transformer for Egocentric Action Anticipation [18.873728614415946]
We propose a novel way to represent human-object interactions for egocentric action anticipation.
We model interactions between hands and objects using Spatial Cross-Attention.
We then infuse contextual information using Trajectory Cross-Attention to obtain environment-refined interaction tokens.
Using these tokens, we construct an interaction-centric video representation for action anticipation.
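A minimal sketch of the Spatial Cross-Attention step named above, with hand tokens as queries and object tokens as keys/values; the token shapes and layer sizes are illustrative assumptions, and the paper's full pipeline (Trajectory Cross-Attention and the interaction-centric video representation) is not reproduced here.

```python
# Sketch of hand-object cross-attention (hands as queries, objects as keys/values);
# shapes and layer sizes are illustrative, not taken from the paper.
import torch
import torch.nn as nn

dim = 256
spatial_xattn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)

hand_tokens = torch.randn(1, 2, dim)    # e.g. left/right hand
object_tokens = torch.randn(1, 6, dim)  # detected objects in the frame

# Each hand token attends to all object tokens, producing interaction tokens.
interaction_tokens, attn = spatial_xattn(hand_tokens, object_tokens, object_tokens)
print(interaction_tokens.shape)  # (1, 2, 256)
```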
arXiv Detail & Related papers (2022-11-25T15:00:51Z)
- You Mostly Walk Alone: Analyzing Feature Attribution in Trajectory Prediction [52.442129609979794]
Recent deep learning approaches for trajectory prediction show promising performance.
It remains unclear which features such black-box models actually learn to use for making predictions.
This paper proposes a procedure that quantifies the contributions of different cues to model performance.
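As a rough illustration of the general idea, the sketch below scores a cue's contribution by ablating it and measuring the increase in average displacement error; the model, the cue dictionary, and the zero-ablation choice are hypothetical, and the paper's attribution procedure is more principled than this simple baseline.

```python
# Generic sketch (not the paper's procedure): estimate a cue's contribution by
# ablating it and measuring the change in average displacement error (ADE).
import torch

def ade(pred, gt):
    """Average displacement error between (T, 2) trajectories."""
    return torch.linalg.norm(pred - gt, dim=-1).mean().item()

def cue_contribution(model, inputs, gt, cue):
    full_err = ade(model(inputs), gt)
    ablated = dict(inputs)
    ablated[cue] = torch.zeros_like(inputs[cue])  # drop the cue
    return ade(model(ablated), gt) - full_err     # error increase = contribution

# Usage with a hypothetical model that consumes a dict of cues:
# contribution = cue_contribution(model, {"past": past, "social": nbrs, "map": lanes}, gt, "social")
```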
arXiv Detail & Related papers (2021-10-11T14:24:15Z)
- Large Scale Interactive Motion Forecasting for Autonomous Driving: The Waymo Open Motion Dataset [84.3946567650148]
With over 100,000 scenes, each 20 seconds long at 10 Hz, our new dataset contains more than 570 hours of unique data over 1750 km of roadways.
We use a high-accuracy 3D auto-labeling system to generate high quality 3D bounding boxes for each road agent.
We introduce a new set of metrics that provides a comprehensive evaluation of both single agent and joint agent interaction motion forecasting models.
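For context, here is a sketch of minimum average displacement error (minADE) over K hypotheses, a standard marginal metric in this setting; the dataset's actual metric suite, including joint-agent variants, is defined in the paper, and the shapes below are assumptions.

```python
# Sketch of minADE over K candidate futures, a standard marginal forecasting metric;
# the paper's full metric suite (including joint-agent metrics) is not covered here.
import torch

def min_ade(preds, gt):
    """preds: (K, T, 2) candidate futures; gt: (T, 2) ground-truth trajectory."""
    errs = torch.linalg.norm(preds - gt.unsqueeze(0), dim=-1).mean(dim=-1)  # (K,)
    return errs.min().item()

print(min_ade(torch.randn(6, 80, 2), torch.randn(80, 2)))  # e.g. 80 future steps at 10 Hz
```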
arXiv Detail & Related papers (2021-04-20T17:19:05Z)
- End-to-end Contextual Perception and Prediction with Interaction Transformer [79.14001602890417]
We tackle the problem of detecting objects in 3D and forecasting their future motion in the context of self-driving.
To capture their spatial-temporal dependencies, we propose a recurrent neural network with a novel Transformer architecture.
Our model can be trained end-to-end, and runs in real-time.
arXiv Detail & Related papers (2020-08-13T14:30:12Z)
- Spatiotemporal Relationship Reasoning for Pedestrian Intent Prediction [57.56466850377598]
Reasoning over visual data is a desirable capability for robotics and vision-based applications.
In this paper, we present a graph-based framework to uncover relationships between different objects in the scene for reasoning about pedestrian intent.
Pedestrian intent, defined as the future action of crossing or not crossing the street, is a crucial piece of information for autonomous vehicles.
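A toy sketch of graph-based reasoning over scene objects for the crossing / not-crossing decision; the node features, adjacency, and single message-passing step are assumptions made for illustration and are much simpler than the paper's spatiotemporal graph model.

```python
# Toy sketch: one round of message passing over scene objects, then a binary
# crossing / not-crossing prediction for the pedestrian node (illustrative only).
import torch
import torch.nn as nn

class IntentGraphNet(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.message = nn.Linear(2 * dim, dim)
        self.classify = nn.Linear(dim, 2)  # crossing vs. not-crossing

    def forward(self, node_feats, adj, pedestrian_idx=0):
        # node_feats: (N, dim) object features; adj: (N, N) 0/1 relation mask
        n = node_feats.shape[0]
        pairs = torch.cat(
            [node_feats.unsqueeze(1).expand(n, n, -1),
             node_feats.unsqueeze(0).expand(n, n, -1)], dim=-1)
        msgs = torch.relu(self.message(pairs)) * adj.unsqueeze(-1)
        updated = node_feats + msgs.sum(dim=1)  # aggregate messages from neighbors
        return self.classify(updated[pedestrian_idx])

net = IntentGraphNet()
logits = net(torch.randn(5, 64), torch.ones(5, 5))
```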
arXiv Detail & Related papers (2020-02-20T18:50:44Z)