Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via
Text-Only Training
- URL: http://arxiv.org/abs/2401.02347v1
- Date: Thu, 4 Jan 2024 16:43:46 GMT
- Title: Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via
Text-Only Training
- Authors: Longtian Qiu, Shan Ning, Xuming He
- Abstract summary: We propose a novel zero-shot image captioning framework with text-only training to reduce the modality gap.
In particular, we introduce a subregion feature aggregation module to leverage local region information.
We extend our framework to build a zero-shot VQA pipeline, demonstrating its generality.
- Score: 14.340740609933437
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Image captioning aims at generating descriptive and meaningful textual
descriptions of images, enabling a broad range of vision-language applications.
Prior works have demonstrated that harnessing the power of Contrastive
Language-Image Pre-training (CLIP) offers a promising approach to achieving
zero-shot
captioning, eliminating the need for expensive caption annotations. However,
the widely observed modality gap in the latent space of CLIP harms the
performance of zero-shot captioning by breaking the alignment between paired
image-text features. To address this issue, we conduct an analysis of the
CLIP latent space, which leads to two findings. Firstly, we observe that
CLIP's visual features of image subregions can lie closer to the paired
caption than the global image feature, owing to the inherent information
loss in text descriptions. In addition, we show that the modality gap
between paired image and text features can be empirically modeled as a
zero-mean Gaussian distribution. Motivated by these findings, we propose a
novel zero-shot image captioning framework with text-only training to reduce
the modality gap. In particular, we introduce a subregion feature aggregation
module that leverages local region information to produce a compact visual
representation for matching the text representation. Moreover, we
incorporate a noise injection and CLIP reranking strategy to boost captioning
performance. We also extend our framework to build a zero-shot VQA pipeline,
demonstrating its generality. Through extensive experiments on common
captioning and VQA datasets such as MSCOCO, Flickr30k and VQAV2, we show that
our method achieves remarkable performance improvements. Code is available at
https://github.com/Artanic30/MacCap.
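As a rough illustration of the framework described in the abstract, below is a
minimal text-only training step under the zero-mean Gaussian modality-gap
assumption; it is a sketch, not the authors' released MacCap code. The names
`clip_text_encoder`, `caption_decoder`, `captions`, and the noise scale `sigma`
are hypothetical placeholders for a frozen CLIP text tower, a trainable caption
decoder, a text corpus, and a tuned hyperparameter.

```python
# Minimal sketch (assumptions, not the released code): text-only training
# with noise injection. Following the abstract, the gap between paired CLIP
# image and text features is treated as zero-mean Gaussian noise, so noisy
# text features stand in for image features while training the decoder.
import torch

def add_modality_gap_noise(text_feats: torch.Tensor, sigma: float) -> torch.Tensor:
    """Perturb L2-normalized text features with zero-mean Gaussian noise so
    they roughly mimic where paired image features fall in CLIP space."""
    noisy = text_feats + sigma * torch.randn_like(text_feats)
    return noisy / noisy.norm(dim=-1, keepdim=True)

def text_only_train_step(clip_text_encoder, caption_decoder, captions,
                         optimizer, sigma=0.02):
    # clip_text_encoder / caption_decoder / sigma are hypothetical placeholders.
    with torch.no_grad():                            # CLIP stays frozen
        t = clip_text_encoder(captions)              # (B, D) caption features
        t = t / t.norm(dim=-1, keepdim=True)
    pseudo_image = add_modality_gap_noise(t, sigma)  # stand-in for image features
    loss = caption_decoder(pseudo_image, captions)   # e.g. token-level cross-entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss.detach())
```

At inference the decoder would instead be conditioned on CLIP image features;
a companion sketch of subregion feature aggregation and CLIP reranking appears
after the related-papers list below.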
Related papers
- IFCap: Image-like Retrieval and Frequency-based Entity Filtering for
Zero-shot Captioning [3.8265756182141346]
We propose a novel approach called Image-like Retrieval, which aligns text features with visually relevant features to mitigate the modality gap.
Our method further enhances the accuracy of generated captions by designing a Fusion Module that integrates retrieved captions with input features.
arXiv Detail & Related papers (2024-09-26T16:47:32Z)
- Selective Vision-Language Subspace Projection for Few-shot CLIP [55.361337202198925]
We introduce a method called Selective Vision-Language Subspace Projection (SSP).
SSP incorporates local image features and utilizes them as a bridge to enhance the alignment between image-text pairs.
Our approach entails only training-free matrix calculations and can be seamlessly integrated into advanced CLIP-based few-shot learning frameworks.
arXiv Detail & Related papers (2024-07-24T03:45:35Z)
- SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
We introduce SILC, a novel framework for vision language pretraining.
SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation.
We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense prediction tasks like detection and segmentation.
arXiv Detail & Related papers (2023-10-20T08:44:47Z)
- CgT-GAN: CLIP-guided Text GAN for Image Captioning [48.276753091051035]
We propose CLIP-guided text GAN (CgT-GAN) to enable the model to "see" the real visual modality.
We use adversarial training to teach CgT-GAN to mimic the phrases of an external text corpus.
CgT-GAN outperforms state-of-the-art methods significantly across all metrics.
arXiv Detail & Related papers (2023-08-23T10:25:37Z)
- Zero-shot Image Captioning by Anchor-augmented Vision-Language Space Alignment [23.072180427273544]
We argue that directly employing CLIP for zero-shot image captioning relies more on the textual modality in context and largely ignores the visual information.
To address this, we propose Cross-modal Language Models (CLMs) to facilitate unsupervised cross-modal learning.
Experiments on MS COCO and Flickr 30K validate the promising performance of the proposed approach in both captioning quality and computational efficiency.
arXiv Detail & Related papers (2022-11-14T11:12:19Z)
- Paraphrasing Is All You Need for Novel Object Captioning [126.66301869607656]
Novel object captioning (NOC) aims to describe images containing objects without observing their ground truth captions during training.
We present Paraphrasing-to-Captioning (P2C), a two-stage learning framework for NOC, which optimizes the output captions via paraphrasing.
arXiv Detail & Related papers (2022-09-25T22:56:04Z)
- Fine-grained Image Captioning with CLIP Reward [104.71533106301598]
We propose using CLIP, a multimodal encoder trained on web-scale image-text pairs, to calculate multimodal similarity and use it as a reward function (see the similarity-scoring sketch after this list).
We also propose a simple finetuning strategy for the CLIP text encoder to improve grammar, which does not require extra text annotation.
In experiments on text-to-image retrieval and FineCapEval, the proposed CLIP-guided model generates more distinctive captions than the CIDEr-optimized model.
arXiv Detail & Related papers (2022-05-26T02:46:09Z)
- CapOnImage: Context-driven Dense-Captioning on Image [13.604173177437536]
We introduce a new task called captioning on image (CapOnImage), which aims to generate dense captions at different locations of the image based on contextual information.
We propose a multi-modal pre-training model with multi-level pre-training tasks that progressively learn the correspondence between texts and image locations.
Compared with other image captioning model variants, our model achieves the best results in both captioning accuracy and diversity.
arXiv Detail & Related papers (2022-04-27T14:40:31Z)
- RegionCLIP: Region-based Language-Image Pretraining [94.29924084715316]
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification.
We propose a new method called RegionCLIP that significantly extends CLIP to learn region-level visual representations.
Our method significantly outperforms the state of the art by 3.8 AP50 and 2.2 AP for novel categories on the COCO and LVIS datasets.
arXiv Detail & Related papers (2021-12-16T18:39:36Z)
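The abstract above mentions subregion feature aggregation and CLIP reranking at
inference, and the "Fine-grained Image Captioning with CLIP Reward" entry uses
the same image-text similarity as a reward. The sketch below illustrates both
ideas under simple assumptions: `clip_image_encoder` and `clip_text_encoder`
are hypothetical frozen CLIP towers returning (N, D) feature tensors, and
mean-pooling crop features is only a stand-in for the paper's aggregation
module.

```python
# Minimal sketch (assumptions, not any paper's released code): a stand-in
# subregion aggregation plus CLIP-similarity reranking of candidate captions.
import torch

def aggregate_subregion_features(clip_image_encoder,
                                 crops: torch.Tensor) -> torch.Tensor:
    """Encode several subregion crops of an image and mean-pool them into one
    compact visual representation (a simple proxy for the paper's module)."""
    feats = clip_image_encoder(crops)                 # (N_crops, D)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    pooled = feats.mean(dim=0, keepdim=True)          # (1, D)
    return pooled / pooled.norm(dim=-1, keepdim=True)

def rerank_with_clip(clip_text_encoder, image_feat: torch.Tensor,
                     candidates: list) -> str:
    """Score candidate captions by cosine similarity to the image feature and
    keep the best; the same score can also serve as a caption-level reward."""
    with torch.no_grad():
        t = clip_text_encoder(candidates)             # (N, D) caption features
    t = t / t.norm(dim=-1, keepdim=True)
    scores = (t @ image_feat.T).squeeze(-1)           # (N,) cosine similarities
    return candidates[int(scores.argmax())]
```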