Zero-shot Image Captioning by Anchor-augmented Vision-Language Space
Alignment
- URL: http://arxiv.org/abs/2211.07275v1
- Date: Mon, 14 Nov 2022 11:12:19 GMT
- Title: Zero-shot Image Captioning by Anchor-augmented Vision-Language Space
Alignment
- Authors: Junyang Wang, Yi Zhang, Ming Yan, Ji Zhang, Jitao Sang
- Abstract summary: We observe that directly employing CLIP for zero-shot image captioning relies more on the textual modality in context and largely ignores the visual information.
To address this, we propose Cross-modal Language Models (CLMs) to facilitate unsupervised cross-modal learning.
Experiments on MS COCO and Flickr 30K validate the promising performance of the proposed approach in both captioning quality and computational efficiency.
- Score: 23.072180427273544
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: CLIP (Contrastive Language-Image Pre-Training) has shown remarkable zero-shot
transfer capabilities in cross-modal correlation tasks such as visual
classification and image retrieval. However, its performance in cross-modal
generation tasks like zero-shot image captioning remains unsatisfactory. In this
work, we observe that directly employing CLIP for zero-shot image captioning
relies more on the textual modality in context and largely ignores the visual
information, which we call \emph{contextual language prior}. To address this,
we propose Cross-modal Language Models (CLMs) to facilitate unsupervised
cross-modal learning. We further propose Anchor Augment to guide the generative
model's attention to the fine-grained information in the representation of
CLIP. Experiments on MS COCO and Flickr 30K validate the promising performance
of the proposed approach in both captioning quality and computational efficiency.
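To make the contextual language prior concrete, the following minimal sketch shows one common way to couple CLIP with a language model for zero-shot captioning: the language model proposes next-token candidates and CLIP re-scores the candidate continuations against the image. This is only an illustrative baseline under assumed model checkpoints and a made-up mixing weight `alpha`; it is not the paper's CLM or Anchor Augment method.

```python
# Minimal sketch (not the paper's method): GPT-2 proposes next tokens, CLIP
# re-scores candidate continuations against the image. Checkpoints, the greedy
# loop, and the mixing weight `alpha` are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, GPT2LMHeadModel, GPT2Tokenizer

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()
lm_tok = GPT2Tokenizer.from_pretrained("gpt2")

@torch.no_grad()
def caption(image: Image.Image, prompt: str = "Image of a", steps: int = 10,
            top_k: int = 20, alpha: float = 0.5) -> str:
    ids = lm_tok(prompt, return_tensors="pt").input_ids
    img = clip.get_image_features(**clip_proc(images=image, return_tensors="pt"))
    img = img / img.norm(dim=-1, keepdim=True)
    for _ in range(steps):
        logits = lm(ids).logits[0, -1]                # language-model scores
        probs, cand = logits.softmax(-1).topk(top_k)  # top-k token candidates
        texts = [lm_tok.decode(torch.cat([ids[0], c.view(1)])) for c in cand]
        enc = clip_proc(text=texts, return_tensors="pt", padding=True, truncation=True)
        txt = clip.get_text_features(**enc)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        clip_score = (txt @ img.T).squeeze(-1).softmax(-1)
        # With alpha -> 0 the choice is driven purely by the language context
        # (the "contextual language prior") and the image is ignored.
        best = cand[(alpha * clip_score + (1 - alpha) * probs).argmax()]
        ids = torch.cat([ids, best.view(1, 1)], dim=1)
    return lm_tok.decode(ids[0])
```

Raising `alpha` shifts the decision toward image-text similarity, which is the intuition the paper builds on before introducing CLMs and Anchor Augment.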
Related papers
- Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification [54.96876797812238]
We present a novel CrOss-moDal nEighbor Representation (CODER) based on the distance structure between images and their neighbor texts.
The key to constructing a high-quality CODER lies in creating a vast amount of high-quality and diverse texts to match with images.
Experiment results across various datasets and models confirm CODER's effectiveness.
arXiv Detail & Related papers (2024-04-27T02:04:36Z) - Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via
Text-Only Training [14.340740609933437]
We propose a novel zero-shot image captioning framework with text-only training to reduce the modality gap.
In particular, we introduce a subregion feature aggregation to leverage local region information.
We extend our framework to build a zero-shot VQA pipeline, demonstrating its generality.
arXiv Detail & Related papers (2024-01-04T16:43:46Z) - SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
We introduce SILC, a novel framework for vision language pretraining.
SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation.
We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense prediction tasks like detection and segmentation.
arXiv Detail & Related papers (2023-10-20T08:44:47Z) - Cross-Modal Retrieval Meets Inference: Improving Zero-Shot Classification
with Cross-Modal Retrieval [29.838375158101027]
Contrastive language-image pre-training (CLIP) has demonstrated remarkable zero-shot classification ability.
We propose X-MoRe, a novel inference method comprising two key steps: (1) cross-modal retrieval and (2) modal-confidence-based ensemble.
X-MoRe demonstrates robust performance across a diverse set of tasks without the need for additional training (a rough sketch of this two-step inference appears after this list).
arXiv Detail & Related papers (2023-08-29T13:02:35Z) - CgT-GAN: CLIP-guided Text GAN for Image Captioning [48.276753091051035]
We propose CLIP-guided text GAN (CgT-GAN) to enable the model to "see" real visual modality.
We use adversarial training to teach CgT-GAN to mimic the phrases of an external text corpus.
CgT-GAN outperforms state-of-the-art methods significantly across all metrics.
arXiv Detail & Related papers (2023-08-23T10:25:37Z) - From Association to Generation: Text-only Captioning by Unsupervised
Cross-modal Mapping [20.67415815472257]
We propose a zero-shot method from association to generation for image captioning and video captioning.
Knight achieves state-of-the-art performance among zero-shot methods for image captioning and video captioning.
arXiv Detail & Related papers (2023-04-26T04:06:20Z) - Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z) - No Token Left Behind: Explainability-Aided Image Classification and
Generation [79.4957965474334]
We present a novel explainability-based approach, which adds a loss term to ensure that CLIP focuses on all relevant semantic parts of the input.
Our method yields an improvement in the recognition rate, without additional training or fine-tuning.
arXiv Detail & Related papers (2022-04-11T07:16:39Z) - CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
The proposed framework significantly outperforms the state of the art without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z) - FILIP: Fine-grained Interactive Language-Image Pre-Training [106.19474076935363]
Fine-grained Interactive Language-Image Pre-training achieves finer-level alignment through a cross-modal late interaction mechanism (a simplified sketch of this scoring rule appears after this list).
We construct a new large-scale image-text pair dataset called FILIP300M for pre-training.
Experiments show that FILIP achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-11-09T17:15:38Z)
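The cross-modal late interaction mentioned in the FILIP entry can be read as a token-wise max-similarity score between image patch features and text token features. The sketch below is a simplified, assumed version of that scoring rule (the symmetric averaging and the tensor shapes are illustrative), not FILIP's actual training objective.

```python
# Simplified late-interaction similarity in the spirit of FILIP: each text token
# is matched to its best image patch (and vice versa), then scores are averaged.
# Shapes and the symmetric 0.5/0.5 averaging are illustrative assumptions.
import torch
import torch.nn.functional as F

def late_interaction_score(patch_feats: torch.Tensor,   # [num_patches, dim]
                           token_feats: torch.Tensor    # [num_tokens, dim]
                           ) -> torch.Tensor:
    patch_feats = F.normalize(patch_feats, dim=-1)
    token_feats = F.normalize(token_feats, dim=-1)
    sim = token_feats @ patch_feats.T                    # [num_tokens, num_patches]
    text_to_image = sim.max(dim=1).values.mean()         # best patch per token
    image_to_text = sim.max(dim=0).values.mean()         # best token per patch
    return 0.5 * (text_to_image + image_to_text)

# Example with random features standing in for CLIP-style patch/token encoders.
score = late_interaction_score(torch.randn(49, 512), torch.randn(8, 512))
```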
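Similarly, the two-step inference described in the X-MoRe entry (cross-modal retrieval followed by a modal-confidence-based ensemble) can be approximated as below. The caption bank, the max-probability confidence proxy, and the weighting scheme are assumptions for illustration, not the paper's exact formulation.

```python
# Rough sketch of retrieval-plus-ensemble zero-shot classification: (1) retrieve
# captions closest to the image in CLIP space, (2) classify with both the image
# embedding and the retrieved-text embedding, then combine the two predictions
# weighted by confidence. Corpus, confidence proxy, and weighting are assumed.
import torch
import torch.nn.functional as F

def retrieval_ensemble(image_emb: torch.Tensor,    # [dim], CLIP image embedding
                       corpus_embs: torch.Tensor,  # [num_texts, dim], caption bank
                       class_embs: torch.Tensor,   # [num_classes, dim], prompt embeddings
                       k: int = 5) -> torch.Tensor:
    image_emb = F.normalize(image_emb, dim=-1)
    corpus_embs = F.normalize(corpus_embs, dim=-1)
    class_embs = F.normalize(class_embs, dim=-1)

    # (1) Cross-modal retrieval: average the k captions nearest to the image.
    top = (corpus_embs @ image_emb).topk(k).indices
    text_emb = F.normalize(corpus_embs[top].mean(dim=0), dim=-1)

    # (2) Modal-wise predictions over the class prompt embeddings.
    p_img = (class_embs @ image_emb).softmax(dim=-1)
    p_txt = (class_embs @ text_emb).softmax(dim=-1)

    # Confidence-weighted ensemble (max probability as a confidence proxy).
    c_img, c_txt = p_img.max(), p_txt.max()
    w = c_img / (c_img + c_txt)
    return w * p_img + (1 - w) * p_txt

# Example with random embeddings standing in for CLIP features.
probs = retrieval_ensemble(torch.randn(512), torch.randn(1000, 512), torch.randn(10, 512))
```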