Interaction Region Visual Transformer for Egocentric Action Anticipation
- URL: http://arxiv.org/abs/2211.14154v7
- Date: Thu, 11 Jan 2024 15:11:41 GMT
- Title: Interaction Region Visual Transformer for Egocentric Action Anticipation
- Authors: Debaditya Roy, Ramanathan Rajendiran and Basura Fernando
- Abstract summary: We propose a novel way to represent human-object interactions for egocentric action anticipation.
We model interactions between hands and objects using Spatial Cross-Attention.
We then infuse contextual information using Trajectory Cross-Attention to obtain environment-refined interaction tokens.
Using these tokens, we construct an interaction-centric video representation for action anticipation.
- Score: 18.873728614415946
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Human-object interaction is one of the most important visual cues, and we propose a novel way to represent human-object interactions for egocentric action anticipation. We propose a novel transformer variant that models interactions by computing the change in the appearance of objects and human hands caused by the execution of actions, and uses those changes to refine the video representation. Specifically, we model interactions between hands and objects using Spatial Cross-Attention (SCA) and further infuse contextual information using Trajectory Cross-Attention to obtain environment-refined interaction tokens. Using these tokens, we construct an interaction-centric video representation for action anticipation. We term our model InAViT, which achieves state-of-the-art action anticipation performance on the large-scale egocentric datasets EPIC-KITCHENS-100 (EK100) and EGTEA Gaze+. InAViT outperforms other visual transformer-based methods, including those based on object-centric video representations. On the EK100 evaluation server, InAViT is the top-performing method on the public leaderboard (at the time of submission), where it outperforms the second-best model by 3.3% on mean top-5 recall.
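The PyTorch sketch below illustrates one way the two stages described in the abstract could be wired together: hand and object region tokens cross-attend to each other (SCA), and the resulting interaction tokens are then refined against context tokens from the video backbone (TCA). This is not the authors' released implementation; the module names, tensor shapes, and hyperparameters (e.g., 768-dim tokens, 12 heads) are assumptions made for illustration.

```python
# Minimal sketch (not the authors' code) of SCA followed by TCA over region tokens.
import torch
import torch.nn as nn


class CrossAttentionBlock(nn.Module):
    """Queries attend to keys/values from another token set (pre-norm, residual)."""

    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, queries: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        q = self.norm_q(queries)
        kv = self.norm_kv(context)
        out, _ = self.attn(q, kv, kv)
        return queries + out  # residual keeps the original region identity


class InteractionRefinement(nn.Module):
    """SCA: hand tokens query object tokens (and vice versa) to model the interaction.
    TCA: the interaction tokens then query spatio-temporal context tokens so the
    environment refines them before they enter the video representation."""

    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.sca_hand = CrossAttentionBlock(dim, num_heads)
        self.sca_obj = CrossAttentionBlock(dim, num_heads)
        self.tca = CrossAttentionBlock(dim, num_heads)

    def forward(self, hand_tok, obj_tok, ctx_tok):
        # hand_tok: (B, H, D) hand-region tokens, obj_tok: (B, O, D) object-region
        # tokens, ctx_tok: (B, N, D) patch tokens from the video backbone.
        hands = self.sca_hand(hand_tok, obj_tok)          # hands attend to objects
        objects = self.sca_obj(obj_tok, hand_tok)         # objects attend to hands
        interaction = torch.cat([hands, objects], dim=1)  # interaction tokens
        return self.tca(interaction, ctx_tok)             # environment-refined tokens


if __name__ == "__main__":
    B, D = 2, 768
    model = InteractionRefinement(dim=D)
    refined = model(torch.randn(B, 2, D), torch.randn(B, 4, D), torch.randn(B, 196, D))
    print(refined.shape)  # torch.Size([2, 6, 768])
```

In the paper, the refined interaction tokens are used to build an interaction-centric video representation for anticipation; how they are merged back into the backbone is a design choice not shown in this sketch.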
Related papers
- G-HOP: Generative Hand-Object Prior for Interaction Reconstruction and Grasp Synthesis [57.07638884476174]
G-HOP is a denoising diffusion based generative prior for hand-object interactions.
We represent the human hand via a skeletal distance field to obtain a representation aligned with the signed distance field for the object.
We show that this hand-object prior can then serve as generic guidance to facilitate other tasks such as reconstruction from an interaction clip and human grasp synthesis.
arXiv Detail & Related papers (2024-04-18T17:59:28Z) - Disentangled Interaction Representation for One-Stage Human-Object
Interaction Detection [70.96299509159981]
Human-Object Interaction (HOI) detection is a core task for human-centric image understanding.
Recent one-stage methods adopt a transformer decoder to collect image-wide cues that are useful for interaction prediction.
Traditional two-stage methods benefit significantly from their ability to compose interaction features in a disentangled and explainable manner.
arXiv Detail & Related papers (2023-12-04T08:02:59Z) - Towards a Unified Transformer-based Framework for Scene Graph Generation
and Human-object Interaction Detection [116.21529970404653]
We introduce SG2HOI+, a unified one-step model based on the Transformer architecture.
Our approach employs two interactive hierarchical Transformers to seamlessly unify the tasks of SGG and HOI detection.
Our approach achieves competitive performance when compared to state-of-the-art HOI methods.
arXiv Detail & Related papers (2023-11-03T07:25:57Z) - ROAM: Robust and Object-Aware Motion Generation Using Neural Pose
Descriptors [73.26004792375556]
This paper shows that robustness and generalisation to novel scene objects in 3D object-aware character synthesis can be achieved by training a motion model with as few as one reference object.
We leverage an implicit feature representation trained on object-only datasets, which encodes an SE(3)-equivariant descriptor field around the object.
We demonstrate substantial improvements in 3D virtual character motion and interaction quality and robustness to scenarios with unseen objects.
arXiv Detail & Related papers (2023-08-24T17:59:51Z) - Human-Object Interaction Prediction in Videos through Gaze Following [9.61701724661823]
We design a framework to detect current HOIs and anticipate future HOIs in videos.
We propose to leverage human information since people often fixate on an object before interacting with it.
Our model is trained and validated on the VidHOI dataset, which contains videos capturing daily life.
arXiv Detail & Related papers (2023-06-06T11:36:14Z) - Holistic Interaction Transformer Network for Action Detection [15.667833703317124]
"HIT" network is a comprehensive bi-modal framework that comprises an RGB stream and a pose stream.
Our method significantly outperforms previous approaches on the J-HMDB, UCF101-24, and MultiSports datasets.
arXiv Detail & Related papers (2022-10-23T10:19:37Z) - Joint Hand Motion and Interaction Hotspots Prediction from Egocentric
Videos [13.669927361546872]
We forecast future hand-object interactions given an egocentric video.
Instead of predicting action labels or pixels, we directly predict the hand motion trajectory and the future contact points on the next active object.
Our model performs hand and object interaction reasoning via the self-attention mechanism in Transformers.
arXiv Detail & Related papers (2022-04-04T17:59:03Z) - Estimating 3D Motion and Forces of Human-Object Interactions from
Internet Videos [49.52070710518688]
We introduce a method to reconstruct the 3D motion of a person interacting with an object from a single RGB video.
Our method estimates the 3D poses of the person together with the object pose, the contact positions and the contact forces on the human body.
arXiv Detail & Related papers (2021-11-02T13:40:18Z) - Object-Region Video Transformers [100.23380634952083]
We present Object-Region Video Transformers (ORViT), an object-centric approach that extends video transformer layers with object representations.
Our ORViT block consists of two object-level streams: appearance and dynamics.
We show a strong improvement in performance across all tasks and datasets considered, demonstrating the value of a model that incorporates object representations into a transformer architecture.
arXiv Detail & Related papers (2021-10-13T17:51:46Z)