Human-Object Interaction Detection via Disentangled Transformer
- URL: http://arxiv.org/abs/2204.09290v1
- Date: Wed, 20 Apr 2022 08:15:04 GMT
- Title: Human-Object Interaction Detection via Disentangled Transformer
- Authors: Desen Zhou, Zhichao Liu, Jian Wang, Leshan Wang, Tao Hu, Errui Ding,
Jingdong Wang
- Abstract summary: We present Disentangled Transformer, where both encoder and decoder are disentangled to facilitate learning of two sub-tasks.
Our method outperforms prior work on two public HOI benchmarks by a sizeable margin.
- Score: 63.46358684341105
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human-Object Interaction Detection tackles the problem of joint localization
and classification of human object interactions. Existing HOI transformers
either adopt a single decoder for triplet prediction, or utilize two parallel
decoders to detect individual objects and interactions separately, and compose
triplets by a matching process. In contrast, we decouple the triplet prediction
into human-object pair detection and interaction classification. Our main
motivation is that detecting the human-object instances and classifying
interactions accurately needs to learn representations that focus on different
regions. To this end, we present Disentangled Transformer, where both encoder
and decoder are disentangled to facilitate learning of two sub-tasks. To
associate the predictions of disentangled decoders, we first generate a unified
representation for HOI triplets with a base decoder, and then utilize it as
input feature of each disentangled decoder. Extensive experiments show that our
method outperforms prior work on two public HOI benchmarks by a sizeable
margin. Code will be available.
Related papers
- Disentangled Interaction Representation for One-Stage Human-Object
Interaction Detection [70.96299509159981]
Human-Object Interaction (HOI) detection is a core task for human-centric image understanding.
Recent one-stage methods adopt a transformer decoder to collect image-wide cues that are useful for interaction prediction.
Traditional two-stage methods benefit significantly from their ability to compose interaction features in a disentangled and explainable manner.
arXiv Detail & Related papers (2023-12-04T08:02:59Z) - Neural-Logic Human-Object Interaction Detection [67.4993347702353]
We present L OGIC HOI, a new HOI detector that leverages neural-logic reasoning and Transformer to infer feasible interactions between entities.
Specifically, we modify the self-attention mechanism in vanilla Transformer, enabling it to reason over the human, action, object> triplet and constitute novel interactions.
We formulate these two properties in first-order logic and ground them into continuous space to constrain the learning process of our approach, leading to improved performance and zero-shot generalization capabilities.
arXiv Detail & Related papers (2023-11-16T11:47:53Z) - HODN: Disentangling Human-Object Feature for HOI Detection [51.48164941412871]
We propose a Human and Object Disentangling Network (HODN) to model the Human-Object Interaction (HOI) relationships explicitly.
Considering that human features are more contributive to interaction, we propose a Human-Guide Linking method to make sure the interaction decoder focuses on the human-centric regions.
Our proposed method achieves competitive performance on both the V-COCO and the HICO-Det Linking datasets.
arXiv Detail & Related papers (2023-08-20T04:12:50Z) - Relational Context Learning for Human-Object Interaction Detection [34.319471023763384]
We propose the multiplex relation network (MUREN) that performs rich context exchange between three decoder branches.
The proposed method learns comprehensive relational contexts for discovering HOI instances.
arXiv Detail & Related papers (2023-04-11T06:01:10Z) - DOAD: Decoupled One Stage Action Detection Network [77.14883592642782]
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage for action recognition.
We present a decoupled one-stage network dubbed DOAD, to improve the efficiency for-temporal action detection.
arXiv Detail & Related papers (2023-04-01T08:06:43Z) - GEN-VLKT: Simplify Association and Enhance Interaction Understanding for
HOI Detection [17.92210977820113]
We propose Guided-Embedding Network(GEN) to attain a two-branch pipeline without post-matching.
For the association, previous two-branch methods suffer from complex and costly post-matching.
For the interaction understanding, previous methods suffer from long-tailed distribution and zero-shot discovery.
arXiv Detail & Related papers (2022-03-26T01:04:13Z) - HOTR: End-to-End Human-Object Interaction Detection with Transformers [26.664864824357164]
We present a novel framework, referred to by HOTR, which directly predicts a set of human, object, interaction> triplets from an image.
Our proposed algorithm achieves the state-of-the-art performance in two HOI detection benchmarks with an inference time under 1 ms after object detection.
arXiv Detail & Related papers (2021-04-28T10:10:29Z) - Learning Human-Object Interaction Detection using Interaction Points [140.0200950601552]
We propose a novel fully-convolutional approach that directly detects the interactions between human-object pairs.
Our network predicts interaction points, which directly localize and classify the inter-action.
Experiments are performed on two popular benchmarks: V-COCO and HICO-DET.
arXiv Detail & Related papers (2020-03-31T08:42:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.