End-to-End Zero-Shot HOI Detection via Vision and Language Knowledge Distillation
- URL: http://arxiv.org/abs/2204.03541v1
- Date: Fri, 1 Apr 2022 07:27:19 GMT
- Title: End-to-End Zero-Shot HOI Detection via Vision and Language Knowledge Distillation
- Authors: Mingrui Wu, Jiaxin Gu, Yunhang Shen, Mingbao Lin, Chao Chen, Xiaoshuai Sun, Rongrong Ji
- Abstract summary: We aim to advance zero-shot HOI detection to detect both seen and unseen HOIs simultaneously.
We propose a novel end-to-end zero-shot HOI detection framework via vision-language knowledge distillation.
Our method outperforms the previous SOTA by 8.92% on unseen mAP and 10.18% on overall mAP under the UA setting.
- Score: 86.41437210485932
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most existing Human-Object Interaction (HOI) detection methods rely heavily on full annotations with predefined HOI categories, which are limited in diversity and costly to scale further. We aim to advance zero-shot HOI detection to detect both seen and unseen HOIs simultaneously. The fundamental challenges are to discover potential human-object pairs and to identify novel HOI categories. To overcome these challenges, we propose a novel end-to-end zero-shot HOI detection (EoID) framework via vision-language knowledge distillation. We first design an Interactive Score module, combined with a Two-stage Bipartite Matching algorithm, to distinguish interactive human-object pairs in an action-agnostic manner. We then transfer the distribution of action probabilities from the pretrained vision-language teacher, together with the seen ground truth, to the HOI model to attain zero-shot HOI classification. Extensive experiments on the HICO-Det dataset demonstrate that our model discovers potential interactive pairs and enables the recognition of unseen HOIs. Our method outperforms the previous SOTA by 8.92% on unseen mAP and 10.18% on overall mAP under the UA setting, and by 6.02% on unseen mAP and 9.1% on overall mAP under the UC setting. Moreover, our method generalizes to large-scale object detection data to further scale up the action set. The source code will be available at: https://github.com/mrwu-mac/EoID.
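To make the distillation step concrete, below is a minimal PyTorch sketch of the idea described in the abstract: the student's action distribution is pulled toward that of a frozen vision-language teacher (e.g., softmax over CLIP-style image-text similarities) across all action classes, while the usual supervised loss applies only to seen classes. All names, shapes, and the loss weighting are illustrative assumptions, not the EoID implementation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_probs, gt_labels, seen_mask,
                      temperature=2.0, alpha=0.5):
    """Hypothetical sketch of vision-language knowledge distillation.

    student_logits: (N, A) raw action logits from the HOI detector.
    teacher_probs:  (N, A) action distribution from a frozen VL teacher
                    (e.g., softmax over CLIP image-text similarities).
    gt_labels:      (N, A) multi-hot float ground truth for seen actions.
    seen_mask:      (A,) bool mask marking seen action classes.
    """
    # Soft targets: match the student to the teacher over ALL actions,
    # which is what lets the student score unseen HOI categories.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        teacher_probs,
        reduction="batchmean",
    ) * temperature ** 2

    # Hard targets: standard supervised loss, restricted to seen classes.
    sup = F.binary_cross_entropy_with_logits(
        student_logits[:, seen_mask], gt_labels[:, seen_mask]
    )
    return alpha * kd + (1.0 - alpha) * sup
```

Because the teacher assigns probability mass to arbitrary action prompts, the student inherits nonzero scores on unseen actions, which is what enables zero-shot classification at inference without a teacher forward pass.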
Related papers
- Exploring the Potential of Large Foundation Models for Open-Vocabulary HOI Detection [9.788417605537965]
We introduce a novel end-to-end open-vocabulary HOI detection framework with conditional multi-level decoding and fine-grained semantic enhancement.
Our method achieves state-of-the-art results in open-vocabulary HOI detection.
arXiv Detail & Related papers (2024-04-09T10:27:22Z) - FreeA: Human-object Interaction Detection using Free Annotation Labels [9.537338958326181]
We propose a novel self-adaptive, language-driven HOI detection method, termed FreeA, which requires no manual labels.
FreeA matches image features of human-object pairs against HOI text templates, and a prior-knowledge-based mask suppresses improbable interactions (sketched below).
Experiments on two benchmark datasets show that FreeA achieves state-of-the-art performance among weakly supervised HOI models.
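A minimal sketch of that template-matching step, assuming CLIP-style pair features and text embeddings; the function name, tensor shapes, and mask construction are hypothetical, not from the FreeA code:

```python
import torch
import torch.nn.functional as F

def hoi_scores_from_templates(pair_feats, template_feats, prior_mask):
    """Hypothetical label-free HOI scoring via text templates.

    pair_feats:     (N, D) image features of human-object pair regions.
    template_feats: (C, D) text features of HOI templates, e.g.
                    "a photo of a person riding a bicycle".
    prior_mask:     (N, C) 0/1 mask suppressing improbable interactions
                    (e.g., verbs incompatible with the detected object).
    """
    pair_feats = F.normalize(pair_feats, dim=-1)
    template_feats = F.normalize(template_feats, dim=-1)
    sims = pair_feats @ template_feats.t()    # cosine similarities (N, C)
    return sims.softmax(dim=-1) * prior_mask  # masked soft pseudo-labels
```

Per the summary above, scores of this form can serve as pseudo-labels, which is how the method avoids manual annotation.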
arXiv Detail & Related papers (2024-03-04T08:38:15Z) - Exploring Self- and Cross-Triplet Correlations for Human-Object
Interaction Detection [38.86053346974547]
We propose to explore Self- and Cross-Triplet Correlations for HOI detection.
Specifically, we regard each triplet proposal as a graph in which the Human and Object are nodes and the Action is the edge connecting them.
We also explore cross-triplet dependencies by jointly considering instance-level, semantic-level, and layout-level relations (a toy sketch of the graph view follows).
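As a toy illustration of the triplet-as-graph view, here is one hypothetical message-passing layer in which the two nodes (human, object) and the edge (action) update each other. This is an assumption-laden sketch of the general idea, not the paper's architecture:

```python
import torch
import torch.nn as nn

class TripletGraphLayer(nn.Module):
    """Toy self-triplet correlation: the human and object nodes exchange
    information through the action edge that connects them."""

    def __init__(self, dim):
        super().__init__()
        self.edge_update = nn.Linear(3 * dim, dim)  # (human, action, object) -> action
        self.node_update = nn.Linear(2 * dim, dim)  # (node, action) -> node

    def forward(self, h, o, a):
        # Update the action edge from both endpoint nodes.
        a = torch.relu(self.edge_update(torch.cat([h, a, o], dim=-1)))
        # Propagate the refined edge back into each node.
        h = torch.relu(self.node_update(torch.cat([h, a], dim=-1)))
        o = torch.relu(self.node_update(torch.cat([o, a], dim=-1)))
        return h, o, a
```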
arXiv Detail & Related papers (2024-01-11T05:38:24Z) - Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z) - Decoupling Object Detection from Human-Object Interaction Recognition [37.133695677465376]
DEFR is a DEtection-FRee method that recognizes Human-Object Interactions (HOI) at the image level, without using object locations or human pose.
We present two findings that boost the performance of the detection-free approach, which significantly outperforms the detection-assisted state of the art.
arXiv Detail & Related papers (2021-12-13T03:01:49Z) - One-Shot Object Affordance Detection in the Wild [76.46484684007706]
Affordance detection refers to identifying the action possibilities that objects in an image afford.
We devise a One-Shot Affordance Detection Network (OSAD-Net) that estimates the human action purpose and then transfers it to help detect the common affordance from all candidate images.
With complex scenes and rich annotations, our PADv2 dataset can be used as a test bed to benchmark affordance detection methods.
arXiv Detail & Related papers (2021-08-08T14:53:10Z) - DecAug: Augmenting HOI Detection via Decomposition [54.65572599920679]
Current algorithms suffer from insufficient training samples and category imbalance within datasets.
We propose an efficient and effective data augmentation method called DecAug for HOI detection.
Experiments show that our method brings improvements of up to 3.3 mAP on V-COCO and 1.6 mAP on HICO-DET.
arXiv Detail & Related papers (2020-10-02T13:59:05Z) - ConsNet: Learning Consistency Graph for Zero-Shot Human-Object
Interaction Detection [101.56529337489417]
We consider the problem of Human-Object Interaction (HOI) detection, which aims to locate and recognize HOI instances in the form of <human, action, object> triplets in images.
We argue that multi-level consistencies among objects, actions and interactions are strong cues for generating semantic representations of rare or previously unseen HOIs.
Our model takes visual features of candidate human-object pairs and word embeddings of HOI labels as inputs, maps them into visual-semantic joint embedding space and obtains detection results by measuring their similarities.
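That last sentence reduces to a similarity measurement in a shared space. A minimal sketch, assuming linear projection heads (all names and dimensions are hypothetical, not from the ConsNet code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbeddingScorer(nn.Module):
    """Toy visual-semantic joint embedding: project pair features and
    HOI label embeddings into a shared space, score by cosine similarity."""

    def __init__(self, visual_dim, word_dim, joint_dim=512):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, joint_dim)
        self.label_proj = nn.Linear(word_dim, joint_dim)

    def forward(self, pair_feats, label_embeds):
        v = F.normalize(self.visual_proj(pair_feats), dim=-1)   # (N, J)
        w = F.normalize(self.label_proj(label_embeds), dim=-1)  # (C, J)
        # (N, C) similarity scores; works for unseen labels too,
        # since any HOI phrase can be embedded as a word vector.
        return v @ w.t()
```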
arXiv Detail & Related papers (2020-08-14T09:11:18Z) - Novel Human-Object Interaction Detection via Adversarial Domain
Generalization [103.55143362926388]
We study the problem of novel human-object interaction (HOI) detection, aiming at improving the generalization ability of the model to unseen scenarios.
The challenge mainly stems from the large compositional space of objects and predicates, which leads to the lack of sufficient training data for all the object-predicate combinations.
We propose a unified framework of adversarial domain generalization to learn object-invariant features for predicate prediction.
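Adversarial feature invariance of this kind is commonly realized with a gradient-reversal layer. The sketch below shows that generic pattern, treating each object category as a "domain" to be confused; it illustrates the technique class, not the paper's exact formulation:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the
    backward pass, so the feature extractor learns to FOOL the adversary."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def adversarial_losses(feats, predicate_head, object_head,
                       predicate_labels, object_labels, lambd=1.0):
    ce = nn.functional.cross_entropy
    # Main task: predict the predicate from the shared features.
    pred_loss = ce(predicate_head(feats), predicate_labels)
    # Adversary: an object classifier on gradient-reversed features
    # pushes the shared features toward object invariance.
    obj_loss = ce(object_head(GradReverse.apply(feats, lambd)), object_labels)
    return pred_loss + obj_loss
```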
arXiv Detail & Related papers (2020-05-22T22:02:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.