Visual Compositional Learning for Human-Object Interaction Detection
- URL: http://arxiv.org/abs/2007.12407v2
- Date: Sun, 4 Oct 2020 12:47:58 GMT
- Title: Visual Compositional Learning for Human-Object Interaction Detection
- Authors: Zhi Hou, Xiaojiang Peng, Yu Qiao, Dacheng Tao
- Abstract summary: Human-Object Interaction (HOI) detection aims to localize and infer relationships between humans and objects in an image.
It is challenging because the enormous number of possible combinations of object and verb types forms a long-tail distribution.
We devise a deep Visual Compositional Learning (VCL) framework, a simple yet efficient approach that effectively addresses this problem.
- Score: 111.05263071111807
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human-Object Interaction (HOI) detection aims to localize and infer
relationships between humans and objects in an image. It is challenging because
the enormous number of possible combinations of object and verb types forms a
long-tail distribution. We devise a deep Visual Compositional Learning (VCL)
framework, a simple yet efficient framework that effectively addresses this
problem. VCL first decomposes an HOI representation into object- and
verb-specific features, and then composes new interaction samples in the
feature space by stitching the decomposed features. The integration of
decomposition and composition enables VCL to share object and verb features
among different HOI samples and images and to generate new interaction samples
and new types of HOI, thus largely alleviating the long-tail distribution
problem and benefiting low-shot and zero-shot HOI detection. Extensive
experiments demonstrate that the proposed VCL effectively improves the
generalization of HOI detection on HICO-DET and V-COCO and outperforms recent
state-of-the-art methods on HICO-DET. Code is available at
https://github.com/zhihou7/VCL.
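To make the decompose-then-compose idea concrete, here is a minimal PyTorch-style sketch; the module name, feature dimensions, and the roll-based pairing are illustrative assumptions, not the authors' released code:

    import torch
    import torch.nn as nn

    class VCLComposer(nn.Module):
        """Sketch of VCL-style feature composition (illustrative only).

        Assumes the backbone has already decomposed each HOI sample into
        a verb-specific feature and an object-specific feature.
        """

        def __init__(self, verb_dim=1024, obj_dim=1024, num_hoi=600):
            super().__init__()
            self.classifier = nn.Linear(verb_dim + obj_dim, num_hoi)

        def forward(self, verb_feats, obj_feats):
            # Real pairs: verb and object features from the same sample.
            real = torch.cat([verb_feats, obj_feats], dim=-1)
            # Composed pairs: stitch each verb feature with an object
            # feature from another sample (a simple roll here), creating
            # new, possibly unseen verb-object combinations for training.
            composed = torch.cat([verb_feats, obj_feats.roll(1, dims=0)], dim=-1)
            return self.classifier(real), self.classifier(composed)

Labels for the composed pairs follow the same stitching (the verb label of one sample paired with the object label of another); in practice, infeasible verb-object combinations would be filtered out.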
Related papers
- Exploring Interactive Semantic Alignment for Efficient HOI Detection with Vision-language Model [3.3772986620114387]
We introduce ISA-HOI, which extensively leverages knowledge from CLIP, aligning interactive semantics between visual and textual features.
Our method achieves competitive results on the HICO-DET and V-COCO benchmarks with much fewer training epochs, and outperforms the state-of-the-art under zero-shot settings.
arXiv Detail & Related papers (2024-04-19T07:24:32Z)
- Towards Zero-shot Human-Object Interaction Detection via Vision-Language Integration [14.678931157058363]
We propose a novel framework, termed Knowledge Integration to HOI (KI2HOI), that effectively integrates the knowledge of visual-language model to improve zero-shot HOI detection.
We develop an effective additive self-attention mechanism to generate more comprehensive visual representations.
Our model outperforms previous methods in various zero-shot and fully-supervised settings.
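The abstract does not spell out the form of this additive self-attention; one plausible reading is Bahdanau-style additive scoring applied between visual tokens, sketched below (all names and dimensions are assumptions, not KI2HOI's actual design):

    import torch
    import torch.nn as nn

    class AdditiveSelfAttention(nn.Module):
        """Illustrative additive (Bahdanau-style) self-attention: the
        score for a token pair is v^T tanh(W q + U k) rather than a
        dot product."""

        def __init__(self, dim):
            super().__init__()
            self.w_q = nn.Linear(dim, dim)
            self.w_k = nn.Linear(dim, dim)
            self.score = nn.Linear(dim, 1)

        def forward(self, x):                  # x: (batch, tokens, dim)
            q = self.w_q(x).unsqueeze(2)       # (B, T, 1, D)
            k = self.w_k(x).unsqueeze(1)       # (B, 1, T, D)
            s = self.score(torch.tanh(q + k)).squeeze(-1)  # (B, T, T)
            attn = s.softmax(dim=-1)
            return attn @ x                    # attention-weighted mixture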
arXiv Detail & Related papers (2024-03-12T02:07:23Z)
- Compositional Learning in Transformer-Based Human-Object Interaction Detection [6.630793383852106]
The long-tailed distribution of labeled instances is a primary challenge in HOI detection.
Inspired by the nature of HOI triplets, some existing approaches adopt the idea of compositional learning.
We propose a novel transformer-based framework for compositional HOI learning.
arXiv Detail & Related papers (2023-08-11T06:41:20Z)
- HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models [30.279621764192843]
Human-Object Interaction (HOI) detection aims to localize human-object pairs and recognize their interactions.
Contrastive Language-Image Pre-training (CLIP) has shown great potential in providing interaction priors for HOI detectors.
We propose a novel HOI detection framework that efficiently extracts prior knowledge from CLIP and achieves better generalization.
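A common way such a CLIP prior is mined (a generic sketch, not necessarily HOICLIP's pipeline) is to embed textual HOI descriptions and use them as zero-shot classifier weights for visual features projected into CLIP's embedding space:

    import torch
    import clip  # OpenAI's CLIP package

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, _ = clip.load("ViT-B/32", device=device)

    hoi_labels = ["a person riding a bicycle", "a person holding a cup"]
    tokens = clip.tokenize(hoi_labels).to(device)

    with torch.no_grad():
        text_emb = model.encode_text(tokens).float()
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    # Stand-in for an HOI region feature mapped into CLIP's space;
    # a real detector would produce this from the image.
    visual_feat = torch.randn(1, text_emb.shape[-1], device=device)
    visual_feat = visual_feat / visual_feat.norm(dim=-1, keepdim=True)

    logits = visual_feat @ text_emb.t()  # similarity to each description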
arXiv Detail & Related papers (2023-03-28T07:54:54Z)
- Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z)
- Affordance Transfer Learning for Human-Object Interaction Detection [106.37536031160282]
We introduce an affordance transfer learning approach to jointly detect HOIs with novel objects and recognize affordances.
Specifically, HOI representations can be decoupled into a combination of affordance and object representations.
With the proposed affordance transfer learning, the model is also capable of inferring the affordances of novel objects from known affordance representations.
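A toy sketch of that decoupling, assuming a bank of known affordance (verb) features and a hypothetical verb head (not the paper's actual architecture): pairing the bank with a novel object's feature scores which affordances the object plausibly supports.

    import torch
    import torch.nn as nn

    class AffordanceTransfer(nn.Module):
        """Illustrative affordance transfer: score known affordance
        features against the feature of an object unseen in training."""

        def __init__(self, dim=1024, num_verbs=117):
            super().__init__()
            self.verb_head = nn.Linear(2 * dim, num_verbs)

        def forward(self, affordance_bank, novel_obj_feat):
            # affordance_bank: (A, dim) known affordance features
            # novel_obj_feat:  (dim,) feature of a novel object
            obj = novel_obj_feat.expand(affordance_bank.size(0), -1)
            pairs = torch.cat([affordance_bank, obj], dim=-1)
            return self.verb_head(pairs)  # plausibility of each affordance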
arXiv Detail & Related papers (2021-04-07T02:37:04Z)
- Detecting Human-Object Interaction via Fabricated Compositional Learning [106.37536031160282]
Human-Object Interaction (HOI) detection is a fundamental task for high-level scene understanding.
Humans have an extremely powerful compositional perception ability that lets them recognize rare or unseen HOI samples.
We propose Fabricated Compositional Learning (FCL) to address the problem of open long-tailed HOI detection.
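A minimal sketch of the fabrication idea: generate object features from noise plus an object-identity embedding and stitch them to real verb features (the generator design and all names are illustrative assumptions, not FCL's released code):

    import torch
    import torch.nn as nn

    class ObjectFabricator(nn.Module):
        """Illustrative fabrication of object features for composing
        training samples of rare or unseen HOIs."""

        def __init__(self, dim=1024, num_objects=80, noise_dim=64):
            super().__init__()
            self.noise_dim = noise_dim
            self.obj_embed = nn.Embedding(num_objects, dim)
            self.generator = nn.Sequential(
                nn.Linear(dim + noise_dim, dim),
                nn.ReLU(),
                nn.Linear(dim, dim),
            )

        def forward(self, obj_ids, verb_feats):
            noise = torch.randn(obj_ids.size(0), self.noise_dim,
                                device=verb_feats.device)
            fake_obj = self.generator(
                torch.cat([self.obj_embed(obj_ids), noise], dim=-1))
            # Composed HOI feature: real verb part + fabricated object part.
            return torch.cat([verb_feats, fake_obj], dim=-1)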
arXiv Detail & Related papers (2021-03-15T08:52:56Z)
- DecAug: Augmenting HOI Detection via Decomposition [54.65572599920679]
Current algorithms suffer from insufficient training samples and category imbalance within datasets.
We propose an efficient and effective data augmentation method called DecAug for HOI detection.
Experiments show that our method brings up to 3.3 mAP and 1.6 mAP improvements on the V-COCO and HICO-DET datasets, respectively.
arXiv Detail & Related papers (2020-10-02T13:59:05Z)
- A Graph-based Interactive Reasoning for Human-Object Interaction Detection [71.50535113279551]
We present a novel graph-based interactive reasoning model called Interactive Graph (abbr. in-Graph) to infer HOIs.
We construct a new framework to assemble in-Graph models for detecting HOIs, namely in-GraphNet.
Our framework is end-to-end trainable and free from costly annotations like human pose.
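A generic message-passing step over a human-object graph conveys the flavour of such reasoning; this is a toy sketch, not the in-Graph propagation rule:

    import torch
    import torch.nn as nn

    class InteractiveGraphStep(nn.Module):
        """One round of attention-weighted message passing between
        human and object nodes; assumes adj includes self-loops so
        every node has at least one neighbour."""

        def __init__(self, dim=512):
            super().__init__()
            self.msg = nn.Linear(dim, dim)
            self.update = nn.GRUCell(dim, dim)

        def forward(self, nodes, adj):
            # nodes: (N, dim) node features; adj: (N, N) 0/1 adjacency.
            scores = (nodes @ nodes.t()) / nodes.size(-1) ** 0.5
            scores = scores.masked_fill(adj == 0, float("-inf"))
            attn = scores.softmax(dim=-1)
            messages = attn @ self.msg(nodes)
            return self.update(messages, nodes)  # refined node states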
arXiv Detail & Related papers (2020-07-14T09:29:03Z)