Semantics-Guided Contrastive Network for Zero-Shot Object Detection
- URL: http://arxiv.org/abs/2109.06062v1
- Date: Sat, 4 Sep 2021 03:32:15 GMT
- Title: Semantics-Guided Contrastive Network for Zero-Shot Object Detection
- Authors: Caixia Yan, Xiaojun Chang, Minnan Luo, Huan Liu, Xiaoqin Zhang, and
Qinghua Zheng
- Abstract summary: Zero-shot object detection (ZSD) is a new challenge in computer vision.
We develop ContrastZSD, a framework that brings the contrastive learning mechanism into the realm of zero-shot detection.
Our method outperforms the previous state-of-the-art on both ZSD and generalized ZSD tasks.
- Score: 67.61512036994458
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Zero-shot object detection (ZSD), the task that extends conventional
detection models to detecting objects from unseen categories, has emerged as a
new challenge in computer vision. Most existing approaches tackle the ZSD task
with a strict mapping-transfer strategy, which may lead to suboptimal ZSD
results: 1) the learning process of those models ignores the available unseen
class information, and thus can be easily biased towards the seen categories;
2) the original visual feature space is not well-structured and lacks
discriminative information. To address these issues, we develop a novel
Semantics-Guided Contrastive Network for ZSD, named ContrastZSD, a detection
framework that is the first to bring the contrastive learning mechanism into
the realm of zero-shot detection. In particular, ContrastZSD incorporates two
semantics-guided contrastive learning subnets that contrast region-category
and region-region pairs, respectively. The pairwise contrastive tasks take
advantage of additional supervision signals derived from both ground-truth
labels and a pre-defined class similarity distribution. Under the guidance of
this explicit semantic supervision, the model learns more about unseen
categories, avoiding bias towards seen concepts, while optimizing the
structure of the visual feature space to be more discriminative for better
visual-semantic alignment. Extensive experiments are conducted on two
popular benchmarks for ZSD, i.e., PASCAL VOC and MS COCO. Results show that our
method outperforms the previous state-of-the-art on both ZSD and generalized
ZSD tasks.
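The abstract describes the two contrastive subnets only at a high level. As a rough, minimal sketch of what such semantics-guided contrastive objectives could look like, the PyTorch-style snippet below contrasts region features against class semantic embeddings (region-category) and against each other (region-region), blending the ground-truth label with a pre-defined class similarity distribution as a soft target. The function names, tensor shapes, temperature, and blending weight are illustrative assumptions, not the authors' exact formulation.

```python
# Minimal sketch (not the authors' exact formulation) of the two
# semantics-guided contrastive objectives described in the abstract.
# Shapes, temperatures and the class-similarity matrix are assumptions.
import torch
import torch.nn.functional as F


def region_category_contrastive_loss(region_feats, class_embeds, labels,
                                     class_similarity, alpha=0.5, tau=0.1):
    """Contrast region features against class semantic embeddings.

    region_feats:     (N, d)  projected RoI features
    class_embeds:     (C, d)  projected semantic embeddings (seen + unseen)
    labels:           (N,)    ground-truth seen-class indices
    class_similarity: (C, C)  pre-defined class similarity distribution,
                              each row summing to 1 (soft supervision that
                              spreads probability mass onto unseen classes)
    """
    region_feats = F.normalize(region_feats, dim=-1)
    class_embeds = F.normalize(class_embeds, dim=-1)
    logits = region_feats @ class_embeds.t() / tau            # (N, C)
    log_probs = F.log_softmax(logits, dim=-1)

    # Hard target from the ground-truth label ...
    hard_loss = F.nll_loss(log_probs, labels)
    # ... blended with the soft class-similarity target, so the model also
    # receives a learning signal about unseen categories.
    soft_targets = class_similarity[labels]                   # (N, C)
    soft_loss = -(soft_targets * log_probs).sum(dim=-1).mean()
    return alpha * hard_loss + (1.0 - alpha) * soft_loss


def region_region_contrastive_loss(region_feats, labels, tau=0.1):
    """Supervised contrastive loss over region pairs: regions with the same
    ground-truth label are pulled together and others pushed apart, making
    the visual feature space more discriminative."""
    z = F.normalize(region_feats, dim=-1)
    sim = z @ z.t() / tau                                     # (N, N)
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask

    # Log-probability of each pair, with self-pairs excluded from the denominator.
    log_prob = sim - torch.logsumexp(
        sim.masked_fill(self_mask, float('-inf')), dim=1, keepdim=True)
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    return -(log_prob * pos_mask).sum(dim=1).div(pos_counts).mean()
```

In the full ContrastZSD framework these objectives would be trained jointly with the standard detection losses of the underlying two-stage detector; the sketch deliberately omits that integration.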
Related papers
- Joint Salient Object Detection and Camouflaged Object Detection via
Uncertainty-aware Learning [47.253370009231645]
We introduce an uncertainty-aware learning pipeline to explore the contradictory information of salient object detection (SOD) and camouflaged object detection (COD).
Our solution leads to both state-of-the-art performance and informative uncertainty estimation.
arXiv Detail & Related papers (2023-07-10T15:49:37Z)
- Learning Common Rationale to Improve Self-Supervised Representation for
Fine-Grained Visual Recognition Problems [61.11799513362704]
We propose learning an additional screening mechanism to identify discriminative clues commonly seen across instances and classes.
We show that a common rationale detector can be learned by simply exploiting the GradCAM induced from the SSL objective.
arXiv Detail & Related papers (2023-03-03T02:07:40Z)
- Resolving Semantic Confusions for Improved Zero-Shot Detection [6.72910827751713]
We propose a generative model incorporating a triplet loss that acknowledges the degree of dissimilarity between classes (a rough sketch of this idea appears after this list).
A cyclic-consistency loss is also enforced to ensure that generated visual samples of a class highly correspond to their own semantics.
arXiv Detail & Related papers (2022-12-12T18:11:48Z)
- DUET: Cross-modal Semantic Grounding for Contrastive Zero-shot Learning [37.48292304239107]
We present a transformer-based end-to-end ZSL method named DUET.
We develop a cross-modal semantic grounding network to investigate the model's capability of disentangling semantic attributes from the images.
We find that DUET can often achieve state-of-the-art performance, that its components are effective, and that its predictions are interpretable.
arXiv Detail & Related papers (2022-07-04T11:12:12Z)
- Cross-modal Representation Learning for Zero-shot Action Recognition [67.57406812235767]
We present a cross-modal Transformer-based framework, which jointly encodes video data and text labels for zero-shot action recognition (ZSAR).
Our model employs a conceptually new pipeline by which visual representations are learned in conjunction with visual-semantic associations in an end-to-end manner.
Experiment results show our model considerably improves upon the state of the art in ZSAR, reaching encouraging top-1 accuracy on the UCF101, HMDB51, and ActivityNet benchmark datasets.
arXiv Detail & Related papers (2022-05-03T17:39:27Z)
- Dual Contrastive Learning for General Face Forgery Detection [64.41970626226221]
We propose a novel face forgery detection framework, named Dual Contrastive Learning (DCL), which constructs positive and negative paired data.
To explore the essential discrepancies, Intra-Instance Contrastive Learning (Intra-ICL) is introduced to focus on the local content inconsistencies prevalent in the forged faces.
arXiv Detail & Related papers (2021-12-27T05:44:40Z)
- Semi-supervised Domain Adaptive Structure Learning [72.01544419893628]
Semi-supervised domain adaptation (SSDA) is a challenging problem requiring methods to overcome both 1) overfitting towards poorly annotated data and 2) distribution shift across domains.
We introduce an adaptive structure learning method to regularize the cooperation of SSL and DA.
arXiv Detail & Related papers (2021-12-12T06:11:16Z)
- Attribute-Induced Bias Eliminating for Transductive Zero-Shot Learning [144.94728981314717]
We propose a novel Attribute-Induced Bias Eliminating (AIBE) module for Transductive ZSL.
For the visual bias between two domains, the Mean-Teacher module is first leveraged to bridge the visual representation discrepancy between two domains.
An attentional graph attribute embedding is proposed to reduce the semantic bias between seen and unseen categories.
Finally, for the semantic-visual bias in the unseen domain, an unseen semantic alignment constraint is designed to align visual and semantic space in an unsupervised manner.
arXiv Detail & Related papers (2020-05-31T02:08:01Z)
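For the "Resolving Semantic Confusions for Improved Zero-Shot Detection" entry above, a triplet loss that "acknowledges the degree of dissimilarity between classes" can be pictured as a margin that grows with the semantic distance between the anchor's class and the negative's class. The snippet below is a hypothetical sketch of that idea only; the margin-scaling rule, names, and shapes are assumptions rather than the cited paper's objective.

```python
# Illustrative sketch of a triplet loss whose margin grows with the semantic
# dissimilarity between the anchor class and the negative class. This is an
# assumption about the general idea, not the cited paper's exact objective.
import torch
import torch.nn.functional as F


def dissimilarity_aware_triplet_loss(anchor, positive, negative,
                                     anchor_cls, negative_cls,
                                     class_dissimilarity, base_margin=0.2):
    """anchor / positive / negative: (N, d) feature triplets
    anchor_cls, negative_cls:        (N,)   class indices
    class_dissimilarity:             (C, C) pre-computed semantic dissimilarity,
                                            e.g. 1 - cosine(word vectors)
    """
    d_pos = (anchor - positive).pow(2).sum(dim=-1)
    d_neg = (anchor - negative).pow(2).sum(dim=-1)
    # Larger margin for class pairs that are semantically more dissimilar,
    # so semantically confusable classes are pushed apart more gently.
    margin = base_margin * (1.0 + class_dissimilarity[anchor_cls, negative_cls])
    return F.relu(d_pos - d_neg + margin).mean()
```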