Weakly-supervised HOI Detection via Prior-guided Bi-level Representation
Learning
- URL: http://arxiv.org/abs/2303.01313v1
- Date: Thu, 2 Mar 2023 14:41:31 GMT
- Title: Weakly-supervised HOI Detection via Prior-guided Bi-level Representation
Learning
- Authors: Bo Wan, Yongfei Liu, Desen Zhou, Tinne Tuytelaars, Xuming He
- Abstract summary: Human-object interaction (HOI) detection plays a crucial role in human-centric scene understanding and serves as a fundamental building block for many vision tasks.
One generalizable and scalable strategy for HOI detection is to use weak supervision, learning from image-level annotations only.
This is inherently challenging due to ambiguous human-object associations, a large search space for detecting HOIs, and a highly noisy training signal.
We develop a CLIP-guided HOI representation capable of incorporating the prior knowledge at both image level and HOI instance level, and adopt a self-taught mechanism to prune incorrect human-object associations.
- Score: 66.00600682711995
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human-object interaction (HOI) detection plays a crucial role in
human-centric scene understanding and serves as a fundamental building block
for many vision tasks. One generalizable and scalable strategy for HOI
detection is to use weak supervision, learning from image-level annotations
only. This is inherently challenging due to ambiguous human-object
associations, a large search space for detecting HOIs, and a highly noisy
training signal. A promising strategy to address those challenges is to exploit
knowledge from large-scale pretrained models (e.g., CLIP), but a direct
knowledge distillation strategy (Liao et al., 2022) does not perform well in
the weakly-supervised setting. In contrast, we develop a CLIP-guided HOI
representation capable of incorporating the prior knowledge at both image level
and HOI instance level, and adopt a self-taught mechanism to prune incorrect
human-object associations. Experimental results on HICO-DET and V-COCO show
that our method outperforms previous work by a sizable margin, demonstrating
the efficacy of our HOI representation.
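To make the weak-supervision setting in the abstract concrete, here is a minimal sketch of how image-level labels can supervise per-pair predictions. It uses generic multiple-instance-style max pooling, which is an assumption chosen for illustration and not the method of this paper; the function names, the scores, and the labels are all hypothetical.

```python
import math


def image_level_scores(pair_scores):
    """Max-pool per-pair class probabilities into one score per class."""
    num_classes = len(pair_scores[0])
    return [max(pair[c] for pair in pair_scores) for c in range(num_classes)]


def binary_cross_entropy(scores, labels, eps=1e-7):
    """Mean binary cross-entropy between image-level scores and 0/1 labels."""
    total = 0.0
    for s, y in zip(scores, labels):
        s = min(max(s, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(s) + (1 - y) * math.log(1.0 - s))
    return total / len(scores)


# Hypothetical example: 3 candidate human-object pairs, 2 interaction classes.
# Each row holds one pair's predicted probability for every class.
pair_scores = [
    [0.9, 0.1],
    [0.2, 0.3],
    [0.1, 0.2],
]
# Image-level annotation only: class 0 occurs somewhere in the image, class 1 does not.
image_labels = [1, 0]

scores = image_level_scores(pair_scores)
loss = binary_cross_entropy(scores, image_labels)
```

Because the loss only sees the pooled scores, it cannot tell which human-object pair is responsible for a label. That is the association ambiguity the abstract refers to.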
Related papers
- HASSOD: Hierarchical Adaptive Self-Supervised Object Detection [29.776467276826747]
Hierarchical Adaptive Self-Supervised Object Detection (HASSOD) is a novel approach that learns to detect objects and understand their compositions without human supervision.
We employ a hierarchical adaptive clustering strategy to group regions into object masks based on self-supervised visual representations.
HASSOD identifies the hierarchical levels of objects in terms of composition, by analyzing coverage relations between masks and constructing tree structures.
arXiv Detail & Related papers (2024-02-05T18:59:41Z)
- Exploiting CLIP for Zero-shot HOI Detection Requires Knowledge Distillation at Multiple Levels [52.50670006414656]
We employ CLIP, a large-scale pre-trained vision-language model, for knowledge distillation on multiple levels.
To train our model, CLIP is utilized to generate HOI scores for both global images and local union regions.
The model achieves strong performance, which is even comparable with some fully-supervised and weakly-supervised methods.
arXiv Detail & Related papers (2023-09-10T16:27:54Z)
- Compositional Learning in Transformer-Based Human-Object Interaction Detection [6.630793383852106]
Long-tailed distribution of labeled instances is a primary challenge in HOI detection.
Inspired by the nature of HOI triplets, some existing approaches adopt the idea of compositional learning.
We propose a transformer-based framework for compositional HOI learning.
arXiv Detail & Related papers (2023-08-11T06:41:20Z)
- Semi-supervised learning made simple with self-supervised clustering [65.98152950607707]
Self-supervised learning models have been shown to learn rich visual representations without requiring human annotations.
We propose a conceptually simple yet empirically powerful approach to turn clustering-based self-supervised methods into semi-supervised learners.
arXiv Detail & Related papers (2023-06-13T01:09:18Z)
- On Higher Adversarial Susceptibility of Contrastive Self-Supervised Learning [104.00264962878956]
Contrastive self-supervised learning (CSL) has managed to match or surpass the performance of supervised learning in image and video classification.
It is still largely unknown if the nature of the representation induced by the two learning paradigms is similar.
We identify the uniform distribution of data representation over a unit hypersphere in the CSL representation space as the key contributor to this phenomenon.
We devise strategies that are simple, yet effective in improving model robustness with CSL training.
arXiv Detail & Related papers (2022-07-22T03:49:50Z)
- Knowledge Guided Bidirectional Attention Network for Human-Object Interaction Detection [3.0915392100355192]
We argue that the independent use of the bottom-up parsing strategy in HOI is counter-intuitive and could lead to the diffusion of attention.
We introduce a novel knowledge-guided top-down attention into HOI, and propose to model the relation parsing as a "look and search" process.
We implement the process via unifying the bottom-up and top-down attention in a single encoder-decoder based model.
arXiv Detail & Related papers (2022-07-16T16:42:49Z)
- The Overlooked Classifier in Human-Object Interaction Recognition [82.20671129356037]
We encode the semantic correlation among classes into the classification head by initializing the weights with language embeddings of HOIs.
We propose a new loss named LSE-Sign to enhance multi-label learning on a long-tailed dataset.
Our simple yet effective method enables detection-free HOI classification, outperforming state-of-the-art methods that require object detection and human pose by a clear margin.
arXiv Detail & Related papers (2022-03-10T23:35:00Z)
- Heterogeneous Contrastive Learning: Encoding Spatial Information for Compact Visual Representations [183.03278932562438]
This paper presents an effective approach that adds spatial information to the encoding stage to alleviate the learning inconsistency between the contrastive objective and strong data augmentation operations.
We show that our approach achieves higher efficiency in visual representations and thus delivers a key message to inspire future research on self-supervised visual representation learning.
arXiv Detail & Related papers (2020-11-19T16:26:25Z)
- Detecting Human-Object Interaction with Mixed Supervision [0.0]
Human object interaction (HOI) detection is an important task in image understanding and reasoning.
We propose a mixed-supervised HOI detection pipeline, enabled by a specific design of momentum-independent learning.
Our method is evaluated on the challenging HICO-DET dataset.
arXiv Detail & Related papers (2020-11-10T08:42:31Z)