Distillation of Human-Object Interaction Contexts for Action Recognition
- URL: http://arxiv.org/abs/2112.09448v1
- Date: Fri, 17 Dec 2021 11:39:44 GMT
- Title: Distillation of Human-Object Interaction Contexts for Action Recognition
- Authors: Muna Almushyti and Frederick W. Li
- Abstract summary: We learn human-object relationships by exploiting the interaction of their local and global contexts.
We propose the Global-Local Interaction Distillation Network (GLIDN), learning human and object interactions through space and time.
GLIDN encodes humans and objects into graph nodes and learns local and global relations via graph attention network.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modeling spatial-temporal relations is imperative for recognizing human
actions, especially when a human is interacting with objects, while multiple
objects appear around the human differently over time. Most existing action
recognition models focus on learning overall visual cues of a scene but
disregard informative fine-grained features, which can be captured by learning
human-object relationships and interactions. In this paper, we learn
human-object relationships by exploiting the interaction of their local and
global contexts. We hence propose the Global-Local Interaction Distillation
Network (GLIDN), learning human and object interactions through space and time
via knowledge distillation for fine-grained scene understanding. GLIDN encodes
humans and objects into graph nodes and learns local and global relations via
graph attention network. The local context graphs learn the relation between
humans and objects at a frame level by capturing their co-occurrence at a
specific time step. The global relation graph is constructed based on the
video-level of human and object interactions, identifying their long-term
relations throughout a video sequence. More importantly, we investigate how
knowledge from these graphs can be distilled to their counterparts for
improving human-object interaction (HOI) recognition. We evaluate our model by
conducting comprehensive experiments on two datasets, Charades and CAD-120,
achieving better results than the baselines and counterpart approaches.
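As a rough illustration of the architecture described above, the following PyTorch sketch builds frame-level (local) and video-level (global) graph attention over human/object node features and distills one graph into the other. All module names, the single-head attention, and the distillation direction are assumptions made for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttention(nn.Module):
    """Single-head attention over a fully connected set of graph nodes."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, nodes):  # nodes: (N, dim)
        scores = self.q(nodes) @ self.k(nodes).t() / nodes.size(-1) ** 0.5
        return scores.softmax(dim=-1) @ self.v(nodes)  # updated node features

dim, T, N = 256, 8, 5                          # 8 frames, 5 human/object nodes each
local_gat, global_gat = GraphAttention(dim), GraphAttention(dim)

frames = torch.randn(T, N, dim)                # detected human/object features
local_out = torch.stack([local_gat(f) for f in frames])  # frame-level relations
global_out = global_gat(frames.reshape(T * N, dim)).reshape(T, N, dim)  # video-level

# Distillation between the graphs: here the local nodes are pulled toward the
# (detached) global ones; the paper investigates which direction helps.
distill_loss = F.kl_div(
    F.log_softmax(local_out, dim=-1),
    F.softmax(global_out, dim=-1).detach(),
    reduction="batchmean",
)
```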
Related papers
- Visual-Geometric Collaborative Guidance for Affordance Learning [63.038406948791454]
We propose a visual-geometric collaborative guided affordance learning network that incorporates visual and geometric cues.
Our method outperforms representative models in terms of objective metrics and visual quality.
arXiv Detail & Related papers (2024-10-15T07:35:51Z) - Full-Body Articulated Human-Object Interaction [61.01135739641217]
CHAIRS is a large-scale motion-captured f-AHOI dataset consisting of 16.2 hours of versatile interactions.
CHAIRS provides 3D meshes of both humans and articulated objects during the entire interactive process.
By learning the geometrical relationships in HOI, we devise the first model that leverages human pose estimation.
arXiv Detail & Related papers (2022-12-20T19:50:54Z) - Spatio-Temporal Interaction Graph Parsing Networks for Human-Object
Interaction Recognition [55.7731053128204]
In a given video-based Human-Object Interaction scene, modeling the spatio-temporal relationship between humans and objects is an important cue for understanding the contextual information presented in the video.
With effective spatio-temporal relationship modeling, it is possible not only to uncover contextual information in each frame but also to directly capture inter-time dependencies.
Making full use of appearance features, spatial locations and semantic information is also key to improving video-based Human-Object Interaction recognition performance.
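As a hedged illustration of how those three cues could be fused into graph node features, the sketch below simply concatenates and projects them; all dimensions and names are made up for illustration and are not taken from the paper.

```python
import torch
import torch.nn as nn

appearance = torch.randn(5, 1024)  # RoI-pooled visual features per detected box
spatial = torch.randn(5, 4)        # normalized box coordinates (x, y, w, h)
semantic = torch.randn(5, 300)     # word embeddings of the detected class labels

fuse = nn.Sequential(nn.Linear(1024 + 4 + 300, 512), nn.ReLU())
nodes = fuse(torch.cat([appearance, spatial, semantic], dim=-1))  # (5, 512)
# `nodes` would then feed a spatio-temporal relation model over frames.
```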
arXiv Detail & Related papers (2021-08-19T11:57:27Z) - Exploiting Scene Graphs for Human-Object Interaction Detection [81.49184987430333]
Human-Object Interaction (HOI) detection is a fundamental visual task aiming at localizing and recognizing interactions between humans and objects.
We propose a novel method to exploit this information, through the scene graph, for the Human-Object Interaction (SG2HOI) detection task.
Our method, SG2HOI, incorporates the SG information in two ways: (1) we embed a scene graph into a global context clue, serving as the scene-specific environmental context; and (2) we build a relation-aware message-passing module to gather relationships from objects' neighborhood and transfer them into interactions.
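The snippet below sketches the two uses of scene-graph (SG) information listed in the summary, pooled global context and relation-aware message passing, in minimal PyTorch form; every shape and module here is an assumption for illustration, not the SG2HOI code.

```python
import torch
import torch.nn as nn

dim = 512
sg_edges = torch.randn(12, dim)            # embedded scene-graph relation features
global_context = sg_edges.mean(dim=0)      # (1) pooled scene-specific context clue

pair = torch.randn(dim)                    # one candidate human-object pair feature
neighbors = torch.randn(6, dim)            # relations from the objects' neighborhood

message = nn.Linear(dim, dim)
gathered = message(neighbors).mean(dim=0)  # (2) relation-aware message passing

interaction_feat = pair + gathered + global_context  # fused HOI representation
```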
arXiv Detail & Related papers (2021-08-19T09:40:50Z) - ConsNet: Learning Consistency Graph for Zero-Shot Human-Object
Interaction Detection [101.56529337489417]
We consider the problem of Human-Object Interaction (HOI) Detection, which aims to locate and recognize HOI instances in the form of <human, action, object> triplets in images.
We argue that multi-level consistencies among objects, actions and interactions are strong cues for generating semantic representations of rare or previously unseen HOIs.
Our model takes visual features of candidate human-object pairs and word embeddings of HOI labels as inputs, maps them into visual-semantic joint embedding space and obtains detection results by measuring their similarities.
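A minimal sketch of the visual-semantic joint embedding step the summary describes, assuming PyTorch and made-up projection sizes; this is not the ConsNet code, only the similarity-scoring idea.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

visual_proj = nn.Linear(1024, 512)  # projects human-object pair features
label_proj = nn.Linear(300, 512)    # projects word embeddings of HOI labels

pair_feat = torch.randn(1, 1024)    # visual features of one candidate pair
label_embs = torch.randn(600, 300)  # word embeddings for 600 HOI labels

v = F.normalize(visual_proj(pair_feat), dim=-1)
l = F.normalize(label_proj(label_embs), dim=-1)
scores = v @ l.t()                  # cosine similarity to every HOI label
pred = scores.argmax(dim=-1)        # most similar label, seen or unseen
```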
arXiv Detail & Related papers (2020-08-14T09:11:18Z) - Learning Human-Object Interaction Detection using Interaction Points [140.0200950601552]
We propose a novel fully-convolutional approach that directly detects the interactions between human-object pairs.
Our network predicts interaction points, which directly localize and classify the interaction.
Experiments are performed on two popular benchmarks: V-COCO and HICO-DET.
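As a toy illustration of a fully-convolutional interaction-point head, the sketch below predicts one heatmap per action class; the channel counts and action-class count are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

num_actions = 26                             # illustrative action-class count
head = nn.Conv2d(256, num_actions, kernel_size=1)

feature_map = torch.randn(1, 256, 64, 64)    # backbone features for one image
heatmaps = head(feature_map).sigmoid()       # one interaction-point map per action

# Peaks in each map localize and classify interactions directly; they are then
# paired with detected human and object boxes to form the final predictions.
conf, idx = heatmaps.flatten(2).max(dim=-1)  # strongest point per action map
```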
arXiv Detail & Related papers (2020-03-31T08:42:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the information presented and is not responsible for any consequences of its use.