RCA-NOC: Relative Contrastive Alignment for Novel Object Captioning
- URL: http://arxiv.org/abs/2312.06299v1
- Date: Mon, 11 Dec 2023 11:06:32 GMT
- Authors: Jiashuo Fan, Yaoyuan Liang, Leyao Liu, Shaolun Huang, and Lei Zhang
- Abstract summary: We introduce a novel approach to novel object captioning which employs relative contrastive learning to learn visual and semantic alignment.
We evaluate our approach on two datasets and show that our proposed RCA-NOC approach outperforms state-of-the-art methods by a large margin.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we introduce a novel approach to novel object captioning which
employs relative contrastive learning to learn visual and semantic alignment.
Our approach maximizes compatibility between regions and object tags in a
contrastive manner. To set up a proper contrastive learning objective, for each
image, we augment tags by leveraging the relative nature of positive and
negative pairs obtained from foundation models such as CLIP. We then use the
rank of each augmented tag in a list as a relative relevance label to contrast
each top-ranked tag with a set of lower-ranked tags. This learning objective
encourages the top-ranked tags to be more compatible with their image and text
context than lower-ranked tags, thus improving the discriminative ability of
the learned multi-modality representation. We evaluate our approach on two
datasets and show that our proposed RCA-NOC approach outperforms
state-of-the-art methods by a large margin, demonstrating its effectiveness in
improving vision-language representation for novel object captioning.
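The ranking-based objective described above can be sketched in PyTorch. This is a minimal illustrative sketch, not the paper's implementation: the pooling of region features, the temperature value, and the function name `relative_contrastive_loss` are all assumptions. The core idea it demonstrates is the one in the abstract: given tags ranked by relevance (e.g., by CLIP similarity), each top-ranked tag is contrasted against the set of tags ranked below it.

```python
import torch
import torch.nn.functional as F

def relative_contrastive_loss(region_feats, tag_feats, ranks, temperature=0.07):
    """Illustrative relative contrastive objective over ranked tags.

    region_feats: (R, D) image-region embeddings
    tag_feats:    (T, D) augmented-tag embeddings, one per candidate tag
    ranks:        (T,)   relevance ranks from a foundation model (0 = most relevant)
    """
    # Pool regions into one image-context vector and L2-normalize (an assumption;
    # the paper aligns regions and tags, but the pooling choice here is ours).
    context = F.normalize(region_feats.mean(dim=0), dim=-1)   # (D,)
    tags = F.normalize(tag_feats, dim=-1)                     # (T, D)
    sims = tags @ context / temperature                       # (T,) tag-image scores

    losses = []
    order = torch.argsort(ranks)              # tag indices from best to worst rank
    for i, anchor in enumerate(order[:-1]):
        negatives = order[i + 1:]             # every tag ranked below the anchor
        logits = torch.cat([sims[anchor].unsqueeze(0), sims[negatives]])
        # Cross-entropy with target 0: push the anchor's score above its negatives'.
        losses.append(F.cross_entropy(logits.unsqueeze(0),
                                      torch.zeros(1, dtype=torch.long)))
    return torch.stack(losses).mean()
```

Each anchor tag thus acts as a "positive" relative to everything ranked beneath it, which is what makes the supervision *relative* rather than binary.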
Related papers
- A Unified Label-Aware Contrastive Learning Framework for Few-Shot Named Entity Recognition [6.468625143772815]
We propose a unified label-aware token-level contrastive learning framework.
Our approach enriches the context by utilizing label semantics as suffix prompts.
It simultaneously optimizes context-native and context-label contrastive learning objectives.
arXiv Detail & Related papers (2024-04-26T06:19:21Z)
- Diversified in-domain synthesis with efficient fine-tuning for few-shot classification [64.86872227580866]
Few-shot image classification aims to learn an image classifier using only a small set of labeled examples per class.
We propose DISEF, a novel approach which addresses the generalization challenge in few-shot learning using synthetic data.
We validate our method in ten different benchmarks, consistently outperforming baselines and establishing a new state-of-the-art for few-shot classification.
arXiv Detail & Related papers (2023-12-05T17:18:09Z)
- DualCoOp++: Fast and Effective Adaptation to Multi-Label Recognition with Limited Annotations [79.433122872973]
Multi-label image recognition in the low-label regime is a task of great challenge and practical significance.
We leverage the powerful alignment between textual and visual features pretrained with millions of auxiliary image-text pairs.
We introduce an efficient and effective framework called Evidence-guided Dual Context Optimization (DualCoOp++).
arXiv Detail & Related papers (2023-08-03T17:33:20Z)
- Tag2Text: Guiding Vision-Language Model via Image Tagging [32.30893277821682]
This paper presents Tag2Text, a vision language pre-training framework, which introduces image tagging into vision-language models.
In contrast to prior works which utilize object tags either manually labeled or automatically detected with an off-the-shelf detector with limited performance, our approach explicitly learns an image tagger using tags parsed from image-paired text.
arXiv Detail & Related papers (2023-03-10T02:16:35Z)
- CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval [108.48540976175457]
We propose Coupled Diversity-Sensitive Momentum Contrastive Learning (CODER) for improving cross-modal representation.
We introduce dynamic dictionaries for both modalities to enlarge the scale of image-text pairs, and diversity-sensitiveness is achieved by adaptive negative pair weighting.
Experiments conducted on two popular benchmarks, i.e., MSCOCO and Flickr30K, validate that CODER remarkably outperforms the state-of-the-art approaches.
arXiv Detail & Related papers (2022-08-21T08:37:50Z)
- Multi-Label Image Classification with Contrastive Learning [57.47567461616912]
We show that a direct application of contrastive learning can hardly improve performance in multi-label cases.
We propose a novel framework for multi-label classification with contrastive learning in a fully supervised setting.
arXiv Detail & Related papers (2021-07-24T15:00:47Z)
- Consensus-Aware Visual-Semantic Embedding for Image-Text Matching [69.34076386926984]
Image-text matching plays a central role in bridging vision and language.
Most existing approaches only rely on the image-text instance pair to learn their representations.
We propose a Consensus-aware Visual-Semantic Embedding model to incorporate the consensus information.
arXiv Detail & Related papers (2020-07-17T10:22:57Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
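Several of the entries above (CLIP-based tag ranking, CODER, consensus-aware embedding) build on the same symmetric InfoNCE objective for image-text alignment. A minimal sketch follows; the function name and temperature are illustrative, and this is the generic CLIP-style loss rather than any single paper's variant.

```python
import torch
import torch.nn.functional as F

def info_nce(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched image-text pairs.

    image_emb, text_emb: (B, D); row i of each tensor is a positive pair,
    and every other row in the batch serves as an in-batch negative.
    """
    img = F.normalize(image_emb, dim=-1)
    txt = F.normalize(text_emb, dim=-1)
    logits = img @ txt.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(img.size(0))         # diagonal entries are positives
    # Average the image->text and text->image cross-entropy losses.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```

Methods like the relative contrastive alignment in the main paper refine this template by replacing the binary positive/negative split with graded (ranked) relevance.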
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences.