A Graph-based Interactive Reasoning for Human-Object Interaction
Detection
- URL: http://arxiv.org/abs/2007.06925v1
- Date: Tue, 14 Jul 2020 09:29:03 GMT
- Title: A Graph-based Interactive Reasoning for Human-Object Interaction
Detection
- Authors: Dongming Yang and Yuexian Zou
- Abstract summary: We present a novel graph-based interactive reasoning model called Interactive Graph (abbr. in-Graph) to infer HOIs.
We construct a new framework to assemble in-Graph models for detecting HOIs, namely in-GraphNet.
Our framework is end-to-end trainable and free from costly annotations like human pose.
- Score: 71.50535113279551
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human-Object Interaction (HOI) detection aims to learn how humans
interact with surrounding objects by inferring triplets of <human, verb, object>.
However, recent HOI detection methods mostly rely on additional annotations
(e.g., human pose) and neglect powerful interactive reasoning beyond
convolutions. In this paper, we present a novel graph-based interactive
reasoning model called Interactive Graph (abbr. in-Graph) to infer HOIs, in
which interactive semantics implied among visual targets are efficiently
exploited. The proposed model consists of a projection function that maps
related targets from convolution space to a graph-based semantic space, a
message-passing process that propagates semantics among all nodes, and an
update function that transforms the reasoned nodes back to convolution space.
Furthermore, we
construct a new framework to assemble in-Graph models for detecting HOIs,
namely in-GraphNet. Beyond inferring HOIs from the features of individual
instances,
the framework dynamically parses pairwise interactive semantics among visual
targets by integrating two-level in-Graphs, i.e., scene-wide and instance-wide
in-Graphs. Our framework is end-to-end trainable and free from costly
annotations like human pose. Extensive experiments show that our proposed
framework outperforms existing HOI detection methods on both V-COCO and
HICO-DET benchmarks, improving on the baseline by about 9.4% and 15% relative,
respectively,
validating its efficacy in detecting HOIs.
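The abstract outlines the in-Graph as three steps: a projection from convolution space into a graph-based semantic space, message passing among the resulting nodes, and an update mapping the reasoned nodes back to convolution space. The snippet below is a minimal sketch of that pattern, assuming a PyTorch-style implementation; the module name InGraphSketch, the linear layers, and the similarity-based adjacency are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of the project -> message passing -> update pattern described
# in the abstract. Shapes, layer choices and the similarity-based adjacency are
# assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class InGraphSketch(nn.Module):
    def __init__(self, feat_dim: int, node_dim: int):
        super().__init__()
        # Projection function: convolution space -> graph-based semantic space.
        self.project = nn.Linear(feat_dim, node_dim)
        # Message transform applied to semantics aggregated from all nodes.
        self.message = nn.Linear(node_dim, node_dim)
        # Update function: graph-based semantic space -> back to convolution space.
        self.update = nn.Linear(node_dim, feat_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_targets, feat_dim) pooled features of related visual targets,
        # e.g. the scene, a human box and an object box.
        nodes = self.project(x)                        # (N, node_dim)
        # Data-dependent adjacency from pairwise node similarity (assumed form).
        adj = F.softmax(nodes @ nodes.t(), dim=-1)     # (N, N)
        # Propagate semantics among all nodes.
        nodes = nodes + F.relu(self.message(adj @ nodes))
        # Map the reasoned nodes back and fuse them with the input features.
        return x + self.update(nodes)


# Example: refine pooled scene, human and object features jointly.
feats = torch.randn(3, 1024)
refined = InGraphSketch(feat_dim=1024, node_dim=512)(feats)
print(refined.shape)  # torch.Size([3, 1024])
```

Under this reading, the in-GraphNet would apply such a unit at two levels, e.g. a scene-wide in-Graph over scene and instance features and an instance-wide in-Graph over human-object features, before predicting the verb; the exact assembly is not shown here and remains an assumption.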
Related papers
- Exploring Interactive Semantic Alignment for Efficient HOI Detection with Vision-language Model [3.3772986620114387]
We introduce ISA-HOI, which extensively leverages knowledge from CLIP, aligning interactive semantics between visual and textual features.
Our method achieves competitive results on the HICO-DET and V-COCO benchmarks with much fewer training epochs, and outperforms the state-of-the-art under zero-shot settings.
arXiv Detail & Related papers (2024-04-19T07:24:32Z)
- Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z)
- Spatial Parsing and Dynamic Temporal Pooling networks for Human-Object Interaction detection [30.896749712316222]
This paper introduces the Spatial Parsing and Dynamic Temporal Pooling (SPDTP) network, which takes the entire video as a temporal graph with human and object nodes as input.
We achieve state-of-the-art performance on the CAD-120 and Something-Else datasets.
arXiv Detail & Related papers (2022-06-07T07:26:06Z)
- Exploiting Scene Graphs for Human-Object Interaction Detection [81.49184987430333]
Human-Object Interaction (HOI) detection is a fundamental visual task aiming at localizing and recognizing interactions between humans and objects.
We propose a novel method, SG2HOI, that exploits this information, through the scene graph, for the Human-Object Interaction detection task.
Our method, SG2HOI, incorporates the SG information in two ways: (1) we embed a scene graph into a global context clue, serving as the scene-specific environmental context; and (2) we build a relation-aware message-passing module to gather relationships from objects' neighborhood and transfer them into interactions.
arXiv Detail & Related papers (2021-08-19T09:40:50Z)
- Glance and Gaze: Inferring Action-aware Points for One-Stage Human-Object Interaction Detection [81.32280287658486]
We propose a novel one-stage method, namely the Glance and Gaze Network (GGNet).
GGNet adaptively models a set of action-aware points (ActPoints) via glance and gaze steps.
We design an action-aware approach that effectively matches each detected interaction with its associated human-object pair.
arXiv Detail & Related papers (2021-04-12T08:01:04Z)
- Zero-Shot Human-Object Interaction Recognition via Affordance Graphs [3.867143522757309]
We propose a new approach for Zero-Shot Human-Object Interaction Recognition.
Our approach makes use of knowledge external to the image content in the form of a graph.
We evaluate our approach on several datasets and show that it outperforms the current state of the art.
arXiv Detail & Related papers (2020-09-02T13:14:44Z)
- DRG: Dual Relation Graph for Human-Object Interaction Detection [65.50707710054141]
We tackle the challenging problem of human-object interaction (HOI) detection.
Existing methods either recognize the interaction of each human-object pair in isolation or perform joint inference based on complex appearance-based features.
In this paper, we leverage an abstract spatial-semantic representation to describe each human-object pair and aggregate the contextual information of the scene via a dual relation graph.
arXiv Detail & Related papers (2020-08-26T17:59:40Z)
- ConsNet: Learning Consistency Graph for Zero-Shot Human-Object Interaction Detection [101.56529337489417]
We consider the problem of Human-Object Interaction (HOI) Detection, which aims to locate and recognize HOI instances in the form of <human, action, object> in images.
We argue that multi-level consistencies among objects, actions and interactions are strong cues for generating semantic representations of rare or previously unseen HOIs.
Our model takes visual features of candidate human-object pairs and word embeddings of HOI labels as inputs, maps them into a visual-semantic joint embedding space, and obtains detection results by measuring their similarities; a minimal sketch of this joint-embedding scoring appears after this list.
arXiv Detail & Related papers (2020-08-14T09:11:18Z)
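The ConsNet entry above describes mapping visual features of candidate human-object pairs and word embeddings of HOI labels into a joint embedding space and scoring them by similarity. The sketch below illustrates that general pattern only; the projection heads, cosine scoring and all dimensions are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of visual-semantic joint-embedding scoring for HOI labels.
# Projection heads, cosine scoring and the dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointEmbeddingScorer(nn.Module):
    def __init__(self, visual_dim: int, text_dim: int, joint_dim: int):
        super().__init__()
        self.visual_head = nn.Linear(visual_dim, joint_dim)  # pair features -> joint space
        self.text_head = nn.Linear(text_dim, joint_dim)      # HOI label word embeddings -> joint space

    def forward(self, pair_feats: torch.Tensor, label_embs: torch.Tensor) -> torch.Tensor:
        # pair_feats: (num_pairs, visual_dim); label_embs: (num_labels, text_dim)
        v = F.normalize(self.visual_head(pair_feats), dim=-1)
        t = F.normalize(self.text_head(label_embs), dim=-1)
        # Cosine similarity between every candidate pair and every HOI label;
        # unseen labels can be scored the same way, which is what enables zero-shot use.
        return v @ t.t()  # (num_pairs, num_labels)


# Example with made-up sizes: 5 candidate pairs scored against 600 HOI labels.
scores = JointEmbeddingScorer(visual_dim=2048, text_dim=300, joint_dim=512)(
    torch.randn(5, 2048), torch.randn(600, 300))
print(scores.shape)  # torch.Size([5, 600])
```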
This list is automatically generated from the titles and abstracts of the papers on this site.