Spatio-Temporal Interaction Graph Parsing Networks for Human-Object
Interaction Recognition
- URL: http://arxiv.org/abs/2108.08633v1
- Date: Thu, 19 Aug 2021 11:57:27 GMT
- Title: Spatio-Temporal Interaction Graph Parsing Networks for Human-Object
Interaction Recognition
- Authors: Ning Wang, Guangming Zhu, Liang Zhang, Peiyi Shen, Hongsheng Li, Cong
Hua
- Abstract summary: In a given video-based Human-Object Interaction scene, modeling the spatio-temporal relationship between humans and objects is an important cue for understanding the contextual information presented in the video.
With effective spatio-temporal relationship modeling, it is possible not only to uncover contextual information in each frame but also to directly capture inter-time dependencies.
Full use of appearance features, spatial locations and semantic information is also key to improving video-based Human-Object Interaction recognition performance.
- Score: 55.7731053128204
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For a given video-based Human-Object Interaction scene, modeling the
spatio-temporal relationship between humans and objects is an important cue
for understanding the contextual information presented in the video. With
effective spatio-temporal relationship modeling, it is possible not only to
uncover contextual information in each frame but also to directly capture
inter-time dependencies. Capturing the position changes of humans and objects
over the spatio-temporal dimension is especially critical when their
appearance features show no significant changes over time. Full use of
appearance features, spatial locations and semantic information is also key to
improving video-based Human-Object Interaction recognition performance. In
this paper, Spatio-Temporal Interaction Graph Parsing Networks (STIGPN) are
constructed, which encode a video as a graph composed of human and object
nodes. These nodes are connected by two types of relations: (i) spatial
relations modeling the interactions between the human and the interacted
objects within each frame, and (ii) inter-time relations capturing the
long-range dependencies between the human and the interacted objects across
frames. With this graph, STIGPN learns spatio-temporal features directly from
whole video-based Human-Object Interaction scenes. Multi-modal features and a
multi-stream fusion strategy are used to enhance the reasoning capability of
STIGPN. Two Human-Object Interaction video datasets, CAD-120 and
Something-Else, are used to evaluate the proposed architectures, and the
state-of-the-art performance demonstrates the superiority of STIGPN.
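As a concrete illustration, the graph the abstract describes can be sketched in plain NumPy: one human node plus the object nodes per frame, spatial human-object edges within each frame, and inter-time edges linking each entity to itself across frames. This is a hypothetical sketch of the graph construction only, not the authors' implementation; the node ordering and the choice to fully connect entities across frames are assumptions.

```python
import numpy as np

def build_st_graph(num_frames, num_objects):
    """Build a binary adjacency matrix for a STIGPN-style spatio-temporal
    graph: per frame, node 0 is the human and nodes 1..num_objects are
    objects. Spatial edges connect the human to each object within a frame;
    inter-time edges connect every entity to its counterpart in all other
    frames to expose long-range dependencies."""
    nodes_per_frame = 1 + num_objects
    n = num_frames * nodes_per_frame
    adj = np.zeros((n, n), dtype=int)
    for t in range(num_frames):
        base = t * nodes_per_frame
        human = base  # human node index in frame t
        for o in range(1, nodes_per_frame):
            # spatial relation: human <-> object within frame t
            adj[human, base + o] = adj[base + o, human] = 1
    for entity in range(nodes_per_frame):
        for t1 in range(num_frames):
            for t2 in range(t1 + 1, num_frames):
                i = t1 * nodes_per_frame + entity
                j = t2 * nodes_per_frame + entity
                # inter-time relation: same entity across frames
                adj[i, j] = adj[j, i] = 1
    return adj

# 3 frames with 1 human + 2 objects each -> a 9-node graph
adj = build_st_graph(num_frames=3, num_objects=2)
```

A message-passing or attention layer over this adjacency would then produce the spatio-temporal node features that the multi-stream fusion consumes.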
Related papers
- Visual-Geometric Collaborative Guidance for Affordance Learning [63.038406948791454]
We propose a visual-geometric collaborative guided affordance learning network that incorporates visual and geometric cues.
Our method outperforms the representative models regarding objective metrics and visual quality.
arXiv Detail & Related papers (2024-10-15T07:35:51Z)
- From Category to Scenery: An End-to-End Framework for Multi-Person Human-Object Interaction Recognition in Videos [9.159660801125812]
Video-based Human-Object Interaction (HOI) recognition explores the intricate dynamics between humans and objects.
In this work, we propose a novel end-to-end category to scenery framework, CATS.
We construct a scenery interactive graph with these enhanced geometric-visual features as nodes to learn the relationships among human and object categories.
arXiv Detail & Related papers (2024-07-01T02:42:55Z)
- Spatial-Temporal Knowledge-Embedded Transformer for Video Scene Graph Generation [64.85974098314344]
Video scene graph generation (VidSGG) aims to identify objects in visual scenes and infer their relationships for a given video.
Inherently, object pairs and their relationships enjoy spatial co-occurrence correlations within each image and temporal consistency/transition correlations across different images.
We propose a spatial-temporal knowledge-embedded transformer (STKET) that incorporates the prior spatial-temporal knowledge into the multi-head cross-attention mechanism.
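A minimal single-head sketch of that idea, assuming the prior spatial-temporal knowledge is available as an additive bias on the attention logits; the function name and shapes are illustrative, not STKET's actual interface:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def knowledge_biased_attention(query, key, value, prior_bias):
    """Single-head cross-attention with a prior bias added to the logits
    before the softmax. `prior_bias` is a (num_queries, num_keys) matrix
    of, e.g., learned spatial/temporal co-occurrence statistics, so pairs
    the prior favors receive more attention mass."""
    d = query.shape[-1]
    logits = query @ key.T / np.sqrt(d) + prior_bias
    return softmax(logits, axis=-1) @ value
```

With a zero bias this reduces to ordinary scaled dot-product attention; a strongly negative bias on a key effectively masks it out.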
arXiv Detail & Related papers (2023-09-23T02:40:28Z)
- A Skeleton-aware Graph Convolutional Network for Human-Object Interaction Detection [14.900704382194013]
We propose a skeleton-aware graph convolutional network for human-object interaction detection, named SGCN4HOI.
Our network exploits the spatial connections between human keypoints and object keypoints to capture their fine-grained structural interactions via graph convolutions.
It fuses such geometric features with visual features and spatial configuration features obtained from human-object pairs.
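A toy sketch of the two ingredients this summary mentions: one symmetric-normalized graph convolution over keypoint nodes, and a simple concatenation fusion of geometric and visual features. The function names and the concatenation choice are assumptions for illustration, not SGCN4HOI's actual design.

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    """One graph-convolution step over keypoint nodes: add self-loops,
    symmetrically normalize the adjacency (D^-1/2 A D^-1/2), then apply
    a linear projection followed by ReLU."""
    a = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    a_norm = a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ feats @ weight, 0.0)

def fuse_features(geometric, visual):
    """Concatenate per-node geometric and visual features (one simple
    late-fusion option; the paper may combine them differently)."""
    return np.concatenate([geometric, visual], axis=-1)
```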
arXiv Detail & Related papers (2022-07-11T15:20:18Z)
- Spatial Parsing and Dynamic Temporal Pooling networks for Human-Object Interaction detection [30.896749712316222]
This paper introduces the Spatial Parsing and Dynamic Temporal Pooling (SPDTP) network, which takes the entire video as a spatio-temporal graph with human and object nodes as input.
We achieve state-of-the-art performance on the CAD-120 and Something-Else datasets.
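The "dynamic temporal pooling" part can be caricatured as score-weighted pooling over frame features, so informative frames dominate the clip-level representation. Everything here, including where the per-frame scores come from, is an assumption for illustration:

```python
import numpy as np

def dynamic_temporal_pooling(frame_feats, frame_scores):
    """Pool a (num_frames, dim) feature sequence into a single (dim,)
    clip feature using softmax-normalized per-frame scores. High-scoring
    frames contribute more; uniform scores reduce to mean pooling."""
    w = np.exp(frame_scores - frame_scores.max())
    w = w / w.sum()
    return (frame_feats * w[:, None]).sum(axis=0)
```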
arXiv Detail & Related papers (2022-06-07T07:26:06Z)
- Distillation of Human-Object Interaction Contexts for Action Recognition [0.0]
We learn human-object relationships by exploiting the interaction of their local and global contexts.
We propose the Global-Local Interaction Distillation Network (GLIDN), learning human and object interactions through space and time.
GLIDN encodes humans and objects as graph nodes and learns local and global relations via a graph attention network.
arXiv Detail & Related papers (2021-12-17T11:39:44Z)
- Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and the physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)
- Learning Asynchronous and Sparse Human-Object Interaction in Videos [56.73059840294019]
Asynchronous-Sparse Interaction Graph Networks (ASSIGN) is able to automatically detect the structure of interaction events associated with entities in a video scene.
ASSIGN is tested on human-object interaction recognition and shows superior performance in segmenting and labeling of human sub-activities and object affordances from raw videos.
arXiv Detail & Related papers (2021-03-03T23:43:55Z)
- LIGHTEN: Learning Interactions with Graph and Hierarchical TEmporal Networks for HOI in videos [13.25502885135043]
Analyzing the interactions between humans and objects from a video includes identification of relationships between humans and the objects present in the video.
We present a hierarchical approach, LIGHTEN, that learns visual features to effectively capture interaction cues at multiple granularities in a video.
We achieve state-of-the-art results in human-object interaction detection and anticipation tasks on CAD-120 (88.9% and 92.6%, respectively) and competitive results on image-based HOI detection in V-COCO.
arXiv Detail & Related papers (2020-12-17T05:44:07Z)
- Learning Human-Object Interaction Detection using Interaction Points [140.0200950601552]
We propose a novel fully-convolutional approach that directly detects the interactions between human-object pairs.
Our network predicts interaction points, which directly localize and classify the interaction.
Experiments are performed on two popular benchmarks: V-COCO and HICO-DET.
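Interaction points are often defined as the midpoint between the human and object box centers; the sketch below assumes that definition, with boxes given as (x1, y1, x2, y2), purely for illustration:

```python
def box_center(box):
    """Center of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def interaction_point(human_box, object_box):
    """Midpoint between the human and object box centers: a common
    choice for the keypoint at which the interaction class is
    predicted in point-based HOI detectors."""
    hx, hy = box_center(human_box)
    ox, oy = box_center(object_box)
    return ((hx + ox) / 2.0, (hy + oy) / 2.0)
```

At inference, such detectors match predicted interaction points back to detected human and object boxes to form scored (human, verb, object) triplets.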
arXiv Detail & Related papers (2020-03-31T08:42:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.