Glance and Gaze: Inferring Action-aware Points for One-Stage
Human-Object Interaction Detection
- URL: http://arxiv.org/abs/2104.05269v1
- Date: Mon, 12 Apr 2021 08:01:04 GMT
- Title: Glance and Gaze: Inferring Action-aware Points for One-Stage
Human-Object Interaction Detection
- Authors: Xubin Zhong, Xian Qu, Changxing Ding and Dacheng Tao
- Abstract summary: We propose a novel one-stage method, namely the Glance and Gaze Network (GGNet).
GGNet adaptively models a set of action-aware points (ActPoints) via glance and gaze steps.
We design an action-aware approach that effectively matches each detected interaction with its associated human-object pair.
- Score: 81.32280287658486
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern human-object interaction (HOI) detection approaches can be divided
into one-stage methods and two-stage ones. One-stage models are more efficient
due to their straightforward architectures, but the two-stage models are still
advantageous in accuracy. Existing one-stage models usually begin by detecting
predefined interaction areas or points, and then attend to these areas only for
interaction prediction; therefore, they lack reasoning steps that dynamically
search for discriminative cues. In this paper, we propose a novel one-stage
method, namely Glance and Gaze Network (GGNet), which adaptively models a set
of action-aware points (ActPoints) via glance and gaze steps. The glance step
quickly determines whether each pixel in the feature maps is an interaction
point. The gaze step leverages feature maps produced by the glance step to
adaptively infer ActPoints around each pixel in a progressive manner. Features
of the refined ActPoints are aggregated for interaction prediction. Moreover,
we design an action-aware approach that effectively matches each detected
interaction with its associated human-object pair, along with a novel hard
negative attentive loss to improve the optimization of GGNet. All the above
operations are conducted simultaneously and efficiently for all pixels in the
feature maps. Finally, GGNet outperforms state-of-the-art methods by
significant margins on both the V-COCO and HICO-DET benchmarks. Code of GGNet is
available at https://github.com/SherlockHolmes221/GGNet.
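The abstract describes the mechanism only at a high level. Below is a minimal PyTorch sketch of how a glance step (per-pixel interaction-point scoring) and a gaze step (offset-based ActPoint sampling and aggregation) could be wired together. The layer shapes, the number of ActPoints, the single gaze iteration (the paper refines ActPoints progressively), and the use of grid_sample for offset-based sampling are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of the glance-and-gaze idea; details are assumptions,
# not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlanceGazeSketch(nn.Module):
    def __init__(self, channels=256, num_points=9, num_actions=117):
        super().__init__()
        # Glance: score every pixel as a candidate interaction point.
        self.glance = nn.Conv2d(channels, 1, kernel_size=1)
        # Gaze: predict (dx, dy) offsets to ActPoints around each pixel.
        self.offsets = nn.Conv2d(channels, 2 * num_points, 3, padding=1)
        self.classifier = nn.Conv2d(channels * num_points, num_actions, 1)
        self.num_points = num_points

    def forward(self, feats):
        b, c, h, w = feats.shape
        glance_map = torch.sigmoid(self.glance(feats))       # (B, 1, H, W)
        offs = self.offsets(feats)                           # (B, 2K, H, W)

        # Base sampling grid in normalized [-1, 1] coordinates, (x, y) order.
        ys = torch.linspace(-1, 1, h, device=feats.device)
        xs = torch.linspace(-1, 1, w, device=feats.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        base = torch.stack((gx, gy), dim=-1)                 # (H, W, 2)

        sampled = []
        for k in range(self.num_points):
            # Displace every pixel by its k-th predicted offset; offsets are
            # treated as normalized coordinates for simplicity.
            d = offs[:, 2 * k:2 * k + 2].permute(0, 2, 3, 1)  # (B, H, W, 2)
            grid = base.unsqueeze(0) + d
            sampled.append(F.grid_sample(feats, grid, align_corners=True))
        # Aggregate ActPoint features, then classify actions at every pixel.
        actions = self.classifier(torch.cat(sampled, dim=1))  # (B, A, H, W)
        return glance_map, actions
```

At inference, pixels with high glance scores would be kept as interaction points, with their action scores read off from the aggregated ActPoint features at the same locations.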
Related papers
- Disentangled Interaction Representation for One-Stage Human-Object
Interaction Detection [70.96299509159981]
Human-Object Interaction (HOI) detection is a core task for human-centric image understanding.
Recent one-stage methods adopt a transformer decoder to collect image-wide cues that are useful for interaction prediction.
Traditional two-stage methods benefit significantly from their ability to compose interaction features in a disentangled and explainable manner.
arXiv Detail & Related papers (2023-12-04T08:02:59Z)
- Effective Actor-centric Human-object Interaction Detection [20.564689533862524]
We propose a novel actor-centric framework to detect Human-Object Interaction in images.
Our method achieves the state-of-the-art on the challenging V-COCO and HICO-DET benchmarks.
arXiv Detail & Related papers (2022-02-24T10:24:44Z)
- ConsNet: Learning Consistency Graph for Zero-Shot Human-Object Interaction Detection [101.56529337489417]
We consider the problem of Human-Object Interaction (HOI) Detection, which aims to locate and recognize HOI instances in the form of <human, action, object> triplets in images.
We argue that multi-level consistencies among objects, actions and interactions are strong cues for generating semantic representations of rare or previously unseen HOIs.
Our model takes visual features of candidate human-object pairs and word embeddings of HOI labels as inputs, maps them into visual-semantic joint embedding space and obtains detection results by measuring their similarities.
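Read literally, this scoring is a bilinear matching in a learned joint space. A minimal sketch, assuming two small projection MLPs and cosine similarity as the scoring function; the dimensions and layer choices are illustrative, not ConsNet's exact design:

```python
# Sketch of a visual-semantic joint embedding: pair features and HOI label
# word embeddings are projected into a shared space and scored by cosine
# similarity. Dimensions and projections are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbeddingSketch(nn.Module):
    def __init__(self, visual_dim=1024, word_dim=300, joint_dim=512):
        super().__init__()
        self.visual_proj = nn.Sequential(
            nn.Linear(visual_dim, joint_dim), nn.ReLU(),
            nn.Linear(joint_dim, joint_dim))
        self.label_proj = nn.Sequential(
            nn.Linear(word_dim, joint_dim), nn.ReLU(),
            nn.Linear(joint_dim, joint_dim))

    def forward(self, pair_feats, label_embeddings):
        # pair_feats: (N, visual_dim) features of candidate human-object pairs
        # label_embeddings: (L, word_dim) word embeddings of HOI labels
        v = F.normalize(self.visual_proj(pair_feats), dim=-1)
        t = F.normalize(self.label_proj(label_embeddings), dim=-1)
        return v @ t.t()  # (N, L) cosine similarities; higher = better match
```

Because labels enter only through word embeddings, unseen HOI labels can be scored at test time without retraining, which is what enables the zero-shot setting.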
arXiv Detail & Related papers (2020-08-14T09:11:18Z)
- A Graph-based Interactive Reasoning for Human-Object Interaction Detection [71.50535113279551]
We present a novel graph-based interactive reasoning model called Interactive Graph (abbr. in-Graph) to infer HOIs.
We construct a new framework to assemble in-Graph models for detecting HOIs, namely in-GraphNet.
Our framework is end-to-end trainable and free from costly annotations like human pose.
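The summary names graph-based reasoning without spelling out the update rule. Purely as an illustration of the general technique, here is one attention-weighted message-passing round over detected human/object nodes; the edge scoring and update are our assumptions, not in-Graph's design:

```python
# One round of attention-weighted message passing over human/object node
# features; an illustration of graph-based HOI reasoning in general, not
# the in-Graph model itself.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MessagePassingSketch(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.edge = nn.Linear(2 * dim, 1)   # scores each node pair
        self.update = nn.Linear(dim, dim)

    def forward(self, nodes):
        # nodes: (N, dim) features of detected humans and objects
        n, d = nodes.shape
        pairs = torch.cat(
            (nodes.unsqueeze(1).expand(n, n, d),
             nodes.unsqueeze(0).expand(n, n, d)), dim=-1)
        attn = F.softmax(self.edge(pairs).squeeze(-1), dim=-1)  # (N, N)
        # Each node aggregates messages from all others, then is updated.
        return nodes + torch.relu(self.update(attn @ nodes))
```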
arXiv Detail & Related papers (2020-07-14T09:29:03Z)
- Learning Human-Object Interaction Detection using Interaction Points [140.0200950601552]
We propose a novel fully-convolutional approach that directly detects the interactions between human-object pairs.
Our network predicts interaction points, which directly localize and classify the interaction.
Experiments are performed on two popular benchmarks: V-COCO and HICO-DET.
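For intuition, interaction points are typically decoded from per-action heatmaps by keeping local maxima. A minimal sketch assuming CenterNet-style max-pooling NMS; the decoding details are our assumption, not necessarily this paper's:

```python
# Decoding per-action interaction-point heatmaps via max-pooling NMS;
# the CenterNet-style peak extraction is an assumption for illustration.
import torch
import torch.nn.functional as F

def decode_interaction_points(heatmaps, k=50):
    """heatmaps: (B, A, H, W) sigmoid scores, one channel per action."""
    # Keep only local maxima: a pixel survives if it equals its 3x3 max.
    pooled = F.max_pool2d(heatmaps, kernel_size=3, stride=1, padding=1)
    peaks = heatmaps * (pooled == heatmaps).float()

    b, a, h, w = peaks.shape
    scores, idx = peaks.view(b, -1).topk(k)   # top-k over actions * pixels
    action = idx // (h * w)
    ys, xs = (idx % (h * w)) // w, (idx % (h * w)) % w
    return scores, action, ys, xs             # per-image point detections
```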
arXiv Detail & Related papers (2020-03-31T08:42:06Z)
- GID-Net: Detecting Human-Object Interaction with Global and Instance Dependency [67.95192190179975]
We introduce a two-stage trainable reasoning mechanism, referred to as GID block.
GID-Net is a human-object interaction detection framework consisting of a human branch, an object branch and an interaction branch.
We have compared our proposed GID-Net with existing state-of-the-art methods on two public benchmarks, including V-COCO and HICO-DET.
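As a rough illustration of such a three-branch head, the sketch below scores the same action set from human, object, and interaction features and fuses the branches by summing logits; the late-fusion choice and dimensions are assumptions, not GID-Net's exact design.

```python
# Three-branch HOI action head with late fusion; an illustrative sketch,
# not GID-Net's actual architecture.
import torch
import torch.nn as nn

class ThreeBranchHead(nn.Module):
    def __init__(self, feat_dim=1024, num_actions=117):
        super().__init__()
        self.human = nn.Linear(feat_dim, num_actions)
        self.obj = nn.Linear(feat_dim, num_actions)
        self.inter = nn.Linear(feat_dim, num_actions)

    def forward(self, h_feat, o_feat, pair_feat):
        # Each branch scores actions from its own evidence; the fused
        # logit is their sum (late fusion).
        return self.human(h_feat) + self.obj(o_feat) + self.inter(pair_feat)
```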
arXiv Detail & Related papers (2020-03-11T11:58:43Z)