ClipCap: CLIP Prefix for Image Captioning
- URL: http://arxiv.org/abs/2111.09734v1
- Date: Thu, 18 Nov 2021 14:49:15 GMT
- Title: ClipCap: CLIP Prefix for Image Captioning
- Authors: Ron Mokady, Amir Hertz, and Amit H. Bermano
- Abstract summary: We use the CLIP encoding as a prefix to the caption by employing a simple mapping network, and then fine-tune a language model to generate the image captions.
We demonstrate our model achieves comparable results to state-of-the-art methods on the challenging Conceptual Captions and nocaps datasets.
- Score: 6.69087470775851
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image captioning is a fundamental task in vision-language understanding,
where the model predicts an informative textual caption for a given input image.
In this paper, we present a simple approach to address this task. We use the CLIP
encoding as a prefix to the caption by employing a simple mapping network, and
then fine-tune a language model to generate the image captions. The recently
proposed CLIP model contains rich semantic features which were trained with
textual context, making it well suited for vision-language perception. Our key idea is
that together with a pre-trained language model (GPT2), we obtain a wide
understanding of both visual and textual data. Hence, our approach only
requires rather quick training to produce a competent captioning model. Without
additional annotations or pre-training, it efficiently generates meaningful
captions for large-scale and diverse datasets. Surprisingly, our method works
well even when only the mapping network is trained, while both CLIP and the
language model remain frozen, allowing a lighter architecture with fewer
trainable parameters. Through quantitative evaluation, we demonstrate that our model
achieves comparable results to state-of-the-art methods on the challenging
Conceptual Captions and nocaps datasets, while it is simpler, faster, and
lighter. Our code is available at
https://github.com/rmokady/CLIP_prefix_caption.
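The prefix idea in the abstract can be illustrated with a minimal shape-level sketch. It is not the authors' implementation: the abstract only says "a simple mapping network", so the two-layer MLP, the toy dimensions, and every name below (`CLIP_DIM`, `GPT_DIM`, `PREFIX_LEN`, `mapping_network`, `build_decoder_input`) are assumptions made for illustration. Random vectors stand in for the CLIP image embedding and for the language model's caption token embeddings.

```python
import math
import random

# Toy sizes, assumed for illustration only; real models use far larger ones.
CLIP_DIM = 8     # size of the (stand-in) CLIP image embedding
GPT_DIM = 6      # size of the (stand-in) language-model token embedding
PREFIX_LEN = 4   # number of prefix vectors fed to the language model


def make_linear(rng, n_in, n_out):
    """Return (weights, bias) for a dense layer with small random weights."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


def linear(layer, x):
    """Apply a dense layer to the vector x."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def mapping_network(layers, clip_embedding):
    """Hypothetical MLP: one CLIP embedding -> PREFIX_LEN vectors of size GPT_DIM."""
    hidden = [math.tanh(v) for v in linear(layers[0], clip_embedding)]
    flat = linear(layers[1], hidden)
    return [flat[i * GPT_DIM:(i + 1) * GPT_DIM] for i in range(PREFIX_LEN)]


def build_decoder_input(prefix, caption_token_embeddings):
    """Prepend the visual prefix to the caption token embeddings."""
    return prefix + caption_token_embeddings


rng = random.Random(0)
hidden_dim = PREFIX_LEN * GPT_DIM // 2
# In the frozen variant described in the abstract, these layers would be the
# only trainable parameters; CLIP and the language model stay fixed.
layers = [
    make_linear(rng, CLIP_DIM, hidden_dim),
    make_linear(rng, hidden_dim, PREFIX_LEN * GPT_DIM),
]

clip_embedding = [rng.gauss(0.0, 1.0) for _ in range(CLIP_DIM)]
caption_embeddings = [[rng.gauss(0.0, 1.0) for _ in range(GPT_DIM)] for _ in range(3)]

prefix = mapping_network(layers, clip_embedding)
decoder_input = build_decoder_input(prefix, caption_embeddings)
print(len(prefix), len(decoder_input))
```

In a real setup the language model would consume `decoder_input` and be trained to predict the caption tokens that follow the prefix; the sketch only shows how one image embedding becomes a sequence of prefix vectors placed in front of the caption.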
Related papers
- CLIP-SCGI: Synthesized Caption-Guided Inversion for Person Re-Identification [9.996589403019675]
Person re-identification (ReID) has recently benefited from large pretrained vision-language models such as Contrastive Language-Image Pre-Training (CLIP).
We propose one straightforward solution by leveraging existing image captioning models to generate pseudo captions for person images.
We introduce CLIP-SCGI, a framework that leverages synthesized captions to guide the learning of discriminative and robust representations.
arXiv Detail & Related papers (2024-10-12T06:24:33Z)
- DreamLIP: Language-Image Pre-training with Long Captions [42.4063624671045]
We re-caption 30M images with detailed descriptions using a pre-trained Multi-modality Large Language Model (MLLM).
Motivated by this, we propose to dynamically sample sub-captions from the text label to construct multiple positive pairs.
It is noteworthy that, on the tasks of image-text retrieval and semantic segmentation, our model trained with 30M image-text pairs achieves on par or even better performance than CLIP trained with 400M pairs.
arXiv Detail & Related papers (2024-03-25T17:59:42Z)
- User-Aware Prefix-Tuning is a Good Learner for Personalized Image Captioning [35.211749514733846]
Traditional image captioning methods often overlook the preferences and characteristics of users.
Most existing methods emphasize the user context fusion process by memory networks or transformers.
We propose a novel personalized image captioning framework that leverages user context to consider personality factors.
arXiv Detail & Related papers (2023-12-08T02:08:00Z)
- SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
We introduce SILC, a novel framework for vision language pretraining.
SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation.
We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense prediction tasks like detection and segmentation.
arXiv Detail & Related papers (2023-10-20T08:44:47Z)
- CgT-GAN: CLIP-guided Text GAN for Image Captioning [48.276753091051035]
We propose CLIP-guided text GAN (CgT-GAN) to enable the model to "see" real visual modality.
We use adversarial training to teach CgT-GAN to mimic the phrases of an external text corpus.
CgT-GAN outperforms state-of-the-art methods significantly across all metrics.
arXiv Detail & Related papers (2023-08-23T10:25:37Z)
- DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training [73.74291217502928]
We propose a simple framework, named DeCap, for zero-shot captioning.
We introduce a lightweight visual-aware language decoder.
We project the visual embedding into the CLIP text embedding space, while the projected embedding retains the information of the visual input.
arXiv Detail & Related papers (2023-03-06T11:02:47Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
In this paper, we present a novel prompt-based scheme for training the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
- Fine-grained Image Captioning with CLIP Reward [104.71533106301598]
We propose using CLIP, a multimodal encoder trained on a huge number of image-text pairs from the web, to calculate multimodal similarity and use it as a reward function.
We also propose a simple finetuning strategy of the CLIP text encoder to improve grammar that does not require extra text annotation.
In experiments on text-to-image retrieval and FineCapEval, the proposed CLIP-guided model generates more distinctive captions than the CIDEr-optimized model.
arXiv Detail & Related papers (2022-05-26T02:46:09Z)
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models.
Our method is model-agnostic and can be applied to arbitrary dense prediction systems and various pre-trained visual backbones.
arXiv Detail & Related papers (2021-12-02T18:59:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.