HODN: Disentangling Human-Object Feature for HOI Detection
- URL: http://arxiv.org/abs/2308.10158v2
- Date: Thu, 7 Dec 2023 08:04:48 GMT
- Title: HODN: Disentangling Human-Object Feature for HOI Detection
- Authors: Shuman Fang, Zhiwen Lin, Ke Yan, Jie Li, Xianming Lin, Rongrong Ji
- Abstract summary: We propose a Human and Object Disentangling Network (HODN) to model the Human-Object Interaction (HOI) relationships explicitly.
Since human features contribute more to interaction prediction, we propose a Human-Guide Linking method that keeps the interaction decoder focused on human-centric regions.
Our proposed method achieves competitive performance on both the V-COCO and the HICO-Det datasets.
- Score: 51.48164941412871
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of Human-Object Interaction (HOI) detection is to detect humans and
their interactions with surrounding objects, a setting in which transformer-based
methods currently dominate. However, these methods ignore the relationships among
humans, objects, and interactions: 1) human features contribute more than object
features to interaction prediction; 2) interaction information disturbs object
detection but helps human detection. In
this paper, we propose a Human and Object Disentangling Network (HODN) to model
the HOI relationships explicitly, where humans and objects are first detected
by two disentangling decoders independently and then processed by an
interaction decoder. Since human features contribute more to interaction
prediction, we propose a Human-Guide Linking method that uses human features as
positional embeddings so that the interaction decoder focuses on human-centric
regions. To handle the opposite influences of interactions on
humans and objects, we propose a Stop-Gradient Mechanism to stop interaction
gradients from optimizing the object detection but to allow them to optimize
the human detection. Our proposed method achieves competitive performance on
both the V-COCO and the HICO-Det datasets. It can be combined with existing
methods easily for state-of-the-art results.
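The abstract describes the two mechanisms only at a high level. Below is a minimal, self-contained PyTorch sketch (not the authors' implementation) of how Human-Guide Linking and the Stop-Gradient Mechanism could be wired into a DETR-style pipeline; the decoder stand-ins, tensor shapes, and argument names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyDecoder(nn.Module):
    """Stand-in for a transformer decoder branch (human, object, or interaction)."""
    def __init__(self, d_model=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

    def forward(self, query, memory, query_pos=None):
        # Positional embeddings are added to the queries, as in DETR-style decoders.
        q = query if query_pos is None else query + query_pos
        out, _ = self.attn(q, memory, memory)
        return out

def hoi_forward(memory, queries, human_dec, object_dec, interaction_dec):
    # 1) Disentangled detection: humans and objects are decoded independently.
    human_feats = human_dec(queries, memory)
    object_feats = object_dec(queries, memory)

    # 2) Human-Guide Linking: human features serve as positional embeddings so
    #    the interaction decoder attends to human-centric regions.
    # 3) Stop-Gradient Mechanism: object features are detached, so interaction
    #    gradients optimize the human branch but not the object branch.
    interaction_feats = interaction_dec(
        query=object_feats.detach(),  # block interaction gradients to object detection
        memory=memory,
        query_pos=human_feats,        # gradients still flow into human detection
    )
    return human_feats, object_feats, interaction_feats

# Toy usage with random features (batch=1, 100 queries, 49 image tokens, d=256).
mem = torch.randn(1, 49, 256)
qry = torch.zeros(1, 100, 256)
h, o, i = hoi_forward(mem, qry, TinyDecoder(), TinyDecoder(), TinyDecoder())
```

The key design choice in this sketch is where `detach()` is applied: only the object queries are cut off from interaction gradients, reflecting the paper's observation that interaction signals help human detection but disturb object detection.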
Related papers
- Visual-Geometric Collaborative Guidance for Affordance Learning [63.038406948791454]
We propose a visual-geometric collaborative guided affordance learning network that incorporates visual and geometric cues.
Our method outperforms representative models in terms of objective metrics and visual quality.
arXiv Detail & Related papers (2024-10-15T07:35:51Z)
- Disentangled Interaction Representation for One-Stage Human-Object Interaction Detection [70.96299509159981]
Human-Object Interaction (HOI) detection is a core task for human-centric image understanding.
Recent one-stage methods adopt a transformer decoder to collect image-wide cues that are useful for interaction prediction.
Traditional two-stage methods benefit significantly from their ability to compose interaction features in a disentangled and explainable manner.
arXiv Detail & Related papers (2023-12-04T08:02:59Z)
- Learn to Predict How Humans Manipulate Large-sized Objects from Interactive Motions [82.90906153293585]
We propose a graph neural network, HO-GCN, to fuse motion data and dynamic descriptors for the prediction task.
We show that the proposed network, which consumes dynamic descriptors, achieves state-of-the-art prediction results and generalizes better to unseen objects.
arXiv Detail & Related papers (2022-06-25T09:55:39Z)
- Detecting Human-to-Human-or-Object (H2O) Interactions with DIABOLO [29.0200561485714]
We propose a new interaction dataset to deal with both types of human interactions: Human-to-Human-or-Object (H2O).
In addition, we introduce a novel taxonomy of verbs, intended to be closer to a description of human body attitude in relation to the surrounding targets of interaction.
We propose DIABOLO, an efficient subject-centric single-shot method to detect all interactions in one forward pass.
arXiv Detail & Related papers (2022-01-07T11:00:11Z)
- GTNet: Guided Transformer Network for Detecting Human-Object Interactions [10.809778265707916]
The human-object interaction (HOI) detection task refers to localizing humans, localizing objects, and predicting the interactions between each human-object pair.
For detecting HOI, it is important to utilize relative spatial configurations and object semantics to find salient spatial regions of images.
This issue is addressed by GTNet, a novel self-attention-based guided transformer network.
arXiv Detail & Related papers (2021-08-02T02:06:33Z)
- Human Object Interaction Detection using Two-Direction Spatial Enhancement and Exclusive Object Prior [28.99655101929647]
Human-Object Interaction (HOI) detection aims to detect visual relations between human and objects in images.
A non-interactive human-object pair can easily be mis-grouped and misclassified as an action.
We propose a spatial enhancement approach to enforce fine-level spatial constraints in two directions.
arXiv Detail & Related papers (2021-05-07T07:18:27Z)
- HOTR: End-to-End Human-Object Interaction Detection with Transformers [26.664864824357164]
We present a novel framework, referred to as HOTR, which directly predicts a set of <human, object, interaction> triplets from an image.
Our proposed algorithm achieves the state-of-the-art performance in two HOI detection benchmarks with an inference time under 1 ms after object detection.
arXiv Detail & Related papers (2021-04-28T10:10:29Z)
- DRG: Dual Relation Graph for Human-Object Interaction Detection [65.50707710054141]
We tackle the challenging problem of human-object interaction (HOI) detection.
Existing methods either recognize the interaction of each human-object pair in isolation or perform joint inference based on complex appearance-based features.
In this paper, we leverage an abstract spatial-semantic representation to describe each human-object pair and aggregate the contextual information of the scene via a dual relation graph.
arXiv Detail & Related papers (2020-08-26T17:59:40Z)
- Learning Human-Object Interaction Detection using Interaction Points [140.0200950601552]
We propose a novel fully-convolutional approach that directly detects the interactions between human-object pairs.
Our network predicts interaction points, which directly localize and classify the interaction.
Experiments are performed on two popular benchmarks: V-COCO and HICO-DET.
arXiv Detail & Related papers (2020-03-31T08:42:06Z)