CgT-GAN: CLIP-guided Text GAN for Image Captioning
- URL: http://arxiv.org/abs/2308.12045v1
- Date: Wed, 23 Aug 2023 10:25:37 GMT
- Title: CgT-GAN: CLIP-guided Text GAN for Image Captioning
- Authors: Jiarui Yu, Haoran Li, Yanbin Hao, Bin Zhu, Tong Xu and Xiangnan He
- Abstract summary: We propose CLIP-guided text GAN (CgT-GAN) to enable the model to "see" real visual modality.
We use adversarial training to teach CgT-GAN to mimic the phrases of an external text corpus.
CgT-GAN outperforms state-of-the-art methods significantly across all metrics.
- Score: 48.276753091051035
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The large-scale visual-language pre-trained model, Contrastive Language-Image
Pre-training (CLIP), has significantly improved image captioning for scenarios
without human-annotated image-caption pairs. Recent advanced CLIP-based image
captioning methods that require no human annotations follow a text-only
training paradigm, i.e., reconstructing text from the shared embedding space.
Nevertheless, these
approaches are limited by the training/inference gap or huge storage
requirements for text embeddings. Given that it is trivial to obtain images in
the real world, we propose CLIP-guided text GAN (CgT-GAN), which incorporates
images into the training process to enable the model to "see" the real visual
modality. In particular, we use adversarial training to teach CgT-GAN to mimic
the phrases of an external text corpus, and a CLIP-based reward to provide
semantic guidance. The caption generator is jointly rewarded with a naturalness
reward, which measures how closely its captions resemble human language and is
computed by the GAN's discriminator, and a semantic guidance reward computed by
the CLIP-based reward module. In
addition to the cosine similarity as the semantic guidance reward (i.e.,
CLIP-cos), we further introduce a novel semantic guidance reward called
CLIP-agg, which aligns the generated caption with a weighted text embedding by
attentively aggregating the entire corpus. Experimental results on three
subtasks (ZS-IC, In-UIC and Cross-UIC) show that CgT-GAN outperforms
state-of-the-art methods significantly across all metrics. Code is available at
https://github.com/Lihr747/CgtGAN.
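To make the joint reward concrete, the minimal PyTorch sketch below combines a discriminator-based naturalness reward with the CLIP-cos and CLIP-agg semantic rewards described in the abstract. The mixing weight `lam`, the attention temperature, and all tensor shapes are assumptions for illustration; consult the released code at the URL above for the authors' actual implementation.
```python
# Minimal sketch of CgT-GAN's joint reward (abstract-level reconstruction, not the official code).
# Embeddings stand in for CLIP features; the mixing weight `lam` is a hypothetical hyperparameter.
import torch
import torch.nn.functional as F

def clip_cos_reward(img_emb, cap_emb):
    """CLIP-cos: cosine similarity between the image and the generated caption."""
    return F.cosine_similarity(img_emb, cap_emb, dim=-1)

def clip_agg_reward(img_emb, cap_emb, corpus_emb, temperature=0.07):
    """CLIP-agg: align the caption with a weighted aggregation of corpus text embeddings,
    where the weights come from image-to-corpus attention."""
    img = F.normalize(img_emb, dim=-1)             # (B, D)
    corpus = F.normalize(corpus_emb, dim=-1)       # (N, D)
    attn = torch.softmax(img @ corpus.t() / temperature, dim=-1)  # (B, N)
    target = attn @ corpus_emb                     # (B, D) weighted text embedding
    return F.cosine_similarity(target, cap_emb, dim=-1)

def joint_reward(disc_real_prob, img_emb, cap_emb, corpus_emb, lam=0.5, use_agg=True):
    """Combine caption naturalness (discriminator output) with CLIP-based semantic guidance."""
    semantic = clip_agg_reward(img_emb, cap_emb, corpus_emb) if use_agg \
        else clip_cos_reward(img_emb, cap_emb)
    return lam * disc_real_prob + (1.0 - lam) * semantic

# Toy usage with random features standing in for CLIP outputs.
B, N, D = 4, 100, 512
reward = joint_reward(torch.rand(B), torch.randn(B, D), torch.randn(B, D), torch.randn(N, D))
```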
Related papers
- Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR).
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
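As a rough illustration of that idea, the sketch below pre-trains a toy decoder on CLIP text-token embeddings treated as pseudo visual features. The projection, decoder depth, and vocabulary size are hypothetical and not DPTR's actual configuration.
```python
# Illustrative sketch: pre-training an STR decoder on CLIP *text* embeddings used as pseudo
# visual features (the decoder and projection sizes are assumptions, not DPTR's real config).
import torch
import torch.nn as nn

class PseudoVisualDecoder(nn.Module):
    def __init__(self, clip_dim=512, d_model=256, vocab_size=97, max_len=25):
        super().__init__()
        self.proj = nn.Linear(clip_dim, d_model)           # map CLIP token embeddings to decoder space
        self.query = nn.Parameter(torch.randn(max_len, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, pseudo_visual):                      # (B, T, clip_dim) CLIP text-token embeddings
        memory = self.proj(pseudo_visual)
        q = self.query.unsqueeze(0).expand(memory.size(0), -1, -1)
        return self.head(self.decoder(q, memory))          # (B, max_len, vocab_size)

# One pre-training step with random tensors standing in for CLIP text-encoder outputs and labels.
model = PseudoVisualDecoder()
pseudo = torch.randn(8, 77, 512)                           # 77 = CLIP context length
labels = torch.randint(0, 97, (8, 25))
loss = nn.CrossEntropyLoss()(model(pseudo).flatten(0, 1), labels.flatten())
loss.backward()
```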
arXiv Detail & Related papers (2024-08-11T06:36:42Z)
- Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training [14.340740609933437]
We propose a novel zero-shot image captioning framework with text-only training to reduce the modality gap.
In particular, we introduce subregion feature aggregation to leverage local region information.
We extend our framework to build a zero-shot VQA pipeline, demonstrating its generality.
arXiv Detail & Related papers (2024-01-04T16:43:46Z)
- Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image Captioning [13.357749288588039]
Previous works leverage CLIP's cross-modal association ability for image captioning, relying solely on textual information under unsupervised settings.
This paper proposes a novel method to address these issues by incorporating synthetic image-text pairs.
A pre-trained text-to-image model is deployed to obtain images that correspond to textual data, and the pseudo features of generated images are optimized toward the real ones in the CLIP embedding space.
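A minimal sketch of that feature-alignment step, assuming CLIP features have already been extracted for both generated and real images: a small adapter (hypothetical here) is optimized so that synthetic-image features move toward real-image features in the CLIP embedding space.
```python
# Hedged sketch: nudging CLIP features of synthetic (text-to-image) images toward real-image
# features with a small adapter; the adapter architecture and loss are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

adapter = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
opt = torch.optim.Adam(adapter.parameters(), lr=1e-4)

synthetic_feats = torch.randn(32, 512)   # CLIP features of images generated from captions
real_feats = torch.randn(32, 512)        # CLIP features of real images (random stand-ins here)

for _ in range(10):                      # a few illustrative optimization steps
    aligned = adapter(synthetic_feats)
    loss = 1.0 - F.cosine_similarity(aligned, real_feats, dim=-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```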
arXiv Detail & Related papers (2023-12-14T12:39:29Z)
- SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
We introduce SILC, a novel framework for vision language pretraining.
SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation.
We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense prediction tasks like detection and segmentation.
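The EMA-teacher self-distillation idea can be sketched as follows; the toy encoder, the handling of local views, and the cosine objective are simplifications and not SILC's exact recipe.
```python
# Sketch of EMA-teacher self-distillation (local student view vs. global teacher view);
# the encoder and loss are simplified assumptions rather than SILC's actual components.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256))   # toy image encoder
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)

@torch.no_grad()
def ema_update(student, teacher, momentum=0.996):
    """Exponential moving average of student weights into the frozen teacher."""
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1.0 - momentum)

global_view = torch.randn(8, 3, 64, 64)
local_view = global_view + 0.1 * torch.randn_like(global_view)       # stand-in for a local crop

with torch.no_grad():
    target = teacher(global_view)                                     # global teacher features
loss = 1.0 - F.cosine_similarity(student(local_view), target, dim=-1).mean()
loss.backward()
ema_update(student, teacher)
```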
arXiv Detail & Related papers (2023-10-20T08:44:47Z)
- Text-Only Training for Image Captioning using Noise-Injected CLIP [23.384962328773753]
We consider the task of image-captioning using only the CLIP model and additional text data at training time.
Our approach relies on the fact that CLIP is trained to make visual and textual embeddings similar.
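The key trick is to inject noise into CLIP text embeddings during training so that CLIP image embeddings can be substituted at inference. A minimal sketch, with a stand-in decoder head and an illustrative noise scale:
```python
# Sketch of the noise-injection trick: train on CLIP *text* embeddings perturbed with Gaussian
# noise so that CLIP *image* embeddings can be swapped in at inference. The decoder head and
# noise scale below are illustrative stand-ins, not the paper's actual values.
import torch
import torch.nn as nn

decoder = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 1000))  # toy decoder head

def train_step(text_emb, noise_std=0.1):
    noisy = text_emb + noise_std * torch.randn_like(text_emb)   # bridges the modality gap
    return decoder(noisy)

def infer(image_emb):
    return decoder(image_emb)                                    # no noise at inference time

logits = train_step(torch.randn(4, 512))
```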
arXiv Detail & Related papers (2022-11-01T16:36:01Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- Fine-grained Image Captioning with CLIP Reward [104.71533106301598]
We propose using CLIP, a multimodal encoder trained on huge numbers of image-text pairs from the web, to calculate multimodal similarity and use it as a reward function.
We also propose a simple finetuning strategy of the CLIP text encoder to improve grammar that does not require extra text annotation.
In experiments on text-to-image retrieval and FineCapEval, the proposed CLIP-guided model generates more distinctive captions than the CIDEr-optimized model.
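A hedged sketch of reward-driven fine-tuning with a CLIP-style similarity reward, using REINFORCE with a simple baseline; the reward below is a plain cosine similarity stand-in rather than the exact reward defined in the paper.
```python
# Sketch of policy-gradient caption fine-tuning with a CLIP-similarity reward and a baseline
# caption (e.g. a greedy decode); shapes and the reward function are illustrative assumptions.
import torch
import torch.nn.functional as F

def policy_gradient_loss(logprobs, img_emb, cap_emb, baseline_emb):
    """logprobs: (B,) summed token log-probabilities of each sampled caption."""
    reward = F.cosine_similarity(img_emb, cap_emb, dim=-1)          # reward of sampled caption
    baseline = F.cosine_similarity(img_emb, baseline_emb, dim=-1)   # reward of baseline caption
    advantage = (reward - baseline).detach()
    return -(advantage * logprobs).mean()

# Random tensors stand in for model outputs and CLIP embeddings.
loss = policy_gradient_loss(torch.randn(4, requires_grad=True),
                            torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 512))
loss.backward()
```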
arXiv Detail & Related papers (2022-05-26T02:46:09Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms the previous state-of-the-art without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- ClipCap: CLIP Prefix for Image Captioning [6.69087470775851]
We use the CLIP encoding as a prefix to the caption via a simple mapping network, and then fine-tune a language model to generate the image captions.
We demonstrate that our model achieves results comparable to state-of-the-art methods on the challenging Conceptual Captions and nocaps datasets.
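A small sketch of the prefix idea: a mapping network (an MLP here) turns a CLIP image embedding into a sequence of prefix embeddings for a language model. The dimensions and prefix length are illustrative, and the language model itself is omitted; ClipCap feeds this prefix to GPT-2.
```python
# Sketch of a CLIP-to-LM prefix mapping network; sizes are illustrative assumptions.
import torch
import torch.nn as nn

class PrefixMapper(nn.Module):
    def __init__(self, clip_dim=512, lm_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len = prefix_len
        self.lm_dim = lm_dim
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, lm_dim * prefix_len // 2),
            nn.Tanh(),
            nn.Linear(lm_dim * prefix_len // 2, lm_dim * prefix_len),
        )

    def forward(self, clip_emb):                       # (B, clip_dim) CLIP image embedding
        return self.mlp(clip_emb).view(-1, self.prefix_len, self.lm_dim)

prefix = PrefixMapper()(torch.randn(2, 512))           # (2, 10, 768) prefix embeddings
# The prefix would be concatenated with caption token embeddings and fed to the language
# model, which is then fine-tuned (or kept frozen) to generate the caption.
```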
arXiv Detail & Related papers (2021-11-18T14:49:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.