Forecasting Action through Contact Representations from First Person
Video
- URL: http://arxiv.org/abs/2102.00649v1
- Date: Mon, 1 Feb 2021 05:52:57 GMT
- Title: Forecasting Action through Contact Representations from First Person
Video
- Authors: Eadom Dessalene, Chinmaya Devaraj, Michael Maynord, Cornelia
Fermuller, and Yiannis Aloimonos
- Abstract summary: We introduce representations and models centered on contact, which we then use in action prediction and anticipation.
Using these annotations we train a module producing novel low-level representations of anticipated near future action.
On top of the Anticipation Module we apply Ego-OMG, a framework for action anticipation and prediction.
- Score: 7.10140895422075
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human actions involving hand manipulations are structured according to the
making and breaking of hand-object contact, and human visual understanding of
action is reliant on anticipation of contact as is demonstrated by pioneering
work in cognitive science. Taking inspiration from this, we introduce
representations and models centered on contact, which we then use in action
prediction and anticipation. We annotate a subset of the EPIC Kitchens dataset
to include time-to-contact between hands and objects, as well as segmentations
of hands and objects. Using these annotations we train the Anticipation Module,
a module producing Contact Anticipation Maps and Next Active Object
Segmentations - novel low-level representations providing temporal and spatial
characteristics of anticipated near future action. On top of the Anticipation
Module we apply Egocentric Object Manipulation Graphs (Ego-OMG), a framework
for action anticipation and prediction. Ego-OMG models longer term temporal
semantic relations through the use of a graph modeling transitions between
contact delineated action states. Use of the Anticipation Module within Ego-OMG
produces state-of-the-art results, achieving 1st and 2nd place on the unseen
and seen test sets, respectively, of the EPIC Kitchens Action Anticipation
Challenge, and achieving state-of-the-art results on the tasks of action
anticipation and action prediction over EPIC Kitchens. We perform ablation
studies over characteristics of the Anticipation Module to evaluate their
utility.
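To make the two low-level representations concrete, the following is a minimal sketch, assuming a per-pixel time-to-contact encoding for the Contact Anticipation Map, a thresholded binary mask for the Next Active Object Segmentation, and a toy transition-count graph for the contact-delineated action states. All shapes, thresholds, and state names here are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the authors' code) of the Anticipation Module outputs
# and the Ego-OMG-style state graph built on top of them.
import numpy as np

H, W = 256, 456          # frame resolution (assumed)
MAX_HORIZON = 2.0        # anticipation horizon in seconds (assumed)

# Contact Anticipation Map: each pixel stores the anticipated time until
# hand-object contact, clipped to the horizon; background stays at the horizon.
contact_anticipation_map = np.full((H, W), MAX_HORIZON, dtype=np.float32)
contact_anticipation_map[100:140, 200:260] = 0.4   # e.g. a cup about to be grasped

# Next Active Object segmentation: binary mask over the object predicted to be
# contacted next; here derived from the map with an assumed threshold.
next_active_object_mask = (contact_anticipation_map < 0.5).astype(np.uint8)

# Ego-OMG-style structure: a graph whose nodes are contact-delineated action
# states and whose edges count observed transitions, capturing the longer-term
# temporal semantics the abstract refers to.
transitions = [("hands free", "hold cup"),
               ("hold cup", "hold cup + stir"),
               ("hold cup + stir", "hold cup")]
graph = {}
for src, dst in transitions:
    graph.setdefault(src, {})
    graph[src][dst] = graph[src].get(dst, 0) + 1

# Toy prediction: most frequently observed successor of the current state.
current = "hold cup"
print(max(graph[current], key=graph[current].get))   # -> "hold cup + stir"
```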
Related papers
- PEAR: Phrase-Based Hand-Object Interaction Anticipation [20.53329698350243]
First-person hand-object interaction anticipation aims to predict the interaction process based on current scenes and prompts.
Existing research typically anticipates only interaction intention while neglecting manipulation.
We propose a novel model, PEAR, which jointly anticipates interaction intention and manipulation.
arXiv Detail & Related papers (2024-07-31T10:28:49Z)
- PALM: Predicting Actions through Language Models [74.10147822693791]
We introduce PALM, an approach that tackles the task of long-term action anticipation.
Our method incorporates an action recognition model to track previous action sequences and a vision-language model to articulate relevant environmental details (a hedged sketch of this kind of pipeline appears after the related papers list).
Our experimental results demonstrate that PALM surpasses the state-of-the-art methods in the task of long-term action anticipation.
arXiv Detail & Related papers (2023-11-29T02:17:27Z)
- GRIP: Generating Interaction Poses Using Spatial Cues and Latent Consistency [57.9920824261925]
Hands are dexterous and highly versatile manipulators that are central to how humans interact with objects and their environment.
Modeling realistic hand-object interactions is critical for applications in computer graphics, computer vision, and mixed reality.
GRIP is a learning-based method that takes as input the 3D motion of the body and the object, and synthesizes realistic motion for both hands before, during, and after object interaction.
arXiv Detail & Related papers (2023-08-22T17:59:51Z)
- Leveraging Next-Active Objects for Context-Aware Anticipation in Egocentric Videos [31.620555223890626]
We study the problem of short-term object interaction anticipation (STA).
We propose NAOGAT, a multi-modal end-to-end transformer network, to guide the model to predict context-aware future actions.
Our model outperforms existing methods on two separate datasets.
arXiv Detail & Related papers (2023-08-16T12:07:02Z)
- Enhancing Next Active Object-based Egocentric Action Anticipation with Guided Attention [45.60789439017625]
Short-term action anticipation (STA) in first-person videos is a challenging task.
We propose a novel approach that applies a guided attention mechanism between objects.
Our method, GANO, is a multi-modal, end-to-end, single transformer-based network.
arXiv Detail & Related papers (2023-05-22T11:56:10Z)
- Graphing the Future: Activity and Next Active Object Prediction using Graph-based Activity Representations [0.0]
We present a novel approach for the visual prediction of human-object interactions in videos.
We aim at predicting (a) the class of the ongoing human-object interaction and (b) the class of the next active object(s) (NAOs).
High prediction accuracy was obtained for both action prediction and NAO forecasting.
arXiv Detail & Related papers (2022-09-12T12:32:24Z)
- Dynamic Modeling of Hand-Object Interactions via Tactile Sensing [133.52375730875696]
In this work, we employ a high-resolution tactile glove to perform four different interactive activities on a diversified set of objects.
We build our model on a cross-modal learning framework and generate the labels using a visual processing pipeline to supervise the tactile model.
This work takes a step toward dynamics modeling of hand-object interactions from dense tactile sensing.
arXiv Detail & Related papers (2021-09-09T16:04:14Z)
- TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild [77.59069361196404]
TRiPOD is a novel method for predicting body dynamics based on graph attentional networks.
To incorporate a real-world challenge, we learn an indicator representing whether an estimated body joint is visible/invisible at each frame.
Our evaluation shows that TRiPOD outperforms all prior work and state-of-the-art specifically designed for each of the trajectory and pose forecasting tasks.
arXiv Detail & Related papers (2021-04-08T20:01:00Z)
- Unified Graph Structured Models for Video Understanding [93.72081456202672]
We propose a message passing graph neural network that explicitly models relational and temporal structure.
We show how our method is able to more effectively model relationships between relevant entities in the scene.
arXiv Detail & Related papers (2021-03-29T14:37:35Z)
- Pose And Joint-Aware Action Recognition [87.4780883700755]
We present a new model for joint-based action recognition, which first extracts motion features from each joint separately through a shared motion encoder.
Our joint selector module re-weights the joint information to select the most discriminative joints for the task.
We show large improvements over the current state-of-the-art joint-based approaches on the JHMDB, HMDB, Charades, and AVA action recognition datasets.
arXiv Detail & Related papers (2020-10-16T04:43:34Z)
- Egocentric Object Manipulation Graphs [8.759425622561334]
Ego-OMG is a novel representation for activity modeling and anticipation of near-future actions.
It integrates semantic temporal structure, short-term dynamics, and representations for appearance.
Code will be released upon acceptance of Ego-OMG.
arXiv Detail & Related papers (2020-06-05T02:03:25Z)
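The PALM entry above describes a compositional pipeline: an action recognition model summarizes the actions already observed, a vision-language model describes the current scene, and a language-model stage combines the two to name likely future actions. Below is a minimal, hedged sketch of that kind of pipeline, with stub functions standing in for the real models; all function names, prompts, and outputs are illustrative assumptions, not PALM's implementation.

```python
# Hedged sketch of a recognition + captioning + language-model anticipation
# pipeline. The stubs return fixed values purely for illustration.
from typing import List

def recognize_past_actions(frames: List[str]) -> List[str]:
    # Stand-in for an action recognition model run over the observed frames.
    return ["take cup", "pour coffee"]

def describe_scene(frame: str) -> str:
    # Stand-in for a vision-language model captioning the latest frame.
    return "a kitchen counter with a cup, a kettle, and a carton of milk"

def anticipate(past: List[str], scene: str, horizon: int = 3) -> List[str]:
    # Stand-in for the language-model step: a real system would send a prompt
    # like the one below to an LLM and parse its answer into future actions.
    prompt = (f"Observed actions: {', '.join(past)}. Scene: {scene}. "
              f"Predict the next {horizon} actions.")
    _ = prompt  # shown for illustration only; this stub returns fixed actions
    return ["add milk", "stir coffee", "wash spoon"]

frames = ["frame_0001.jpg", "frame_0002.jpg"]
future = anticipate(recognize_past_actions(frames), describe_scene(frames[-1]))
print(future)
```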