Distinctive Image Captioning: Leveraging Ground Truth Captions in CLIP
Guided Reinforcement Learning
- URL: http://arxiv.org/abs/2402.13936v1
- Date: Wed, 21 Feb 2024 17:05:06 GMT
- Title: Distinctive Image Captioning: Leveraging Ground Truth Captions in CLIP
Guided Reinforcement Learning
- Authors: Antoine Chaffin, Ewa Kijak, Vincent Claveau
- Abstract summary: Reinforcement Learning (RL) makes it possible to use the cross-modal retrieval similarity score between the generated caption and the input image as a reward to guide training.
Recent studies show that pre-trained cross-modal retrieval models can be used to provide this reward, completely eliminating the need for reference captions.
We propose a new image captioning training strategy that makes use of GT captions in different ways.
- Score: 9.443456804893207
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Training image captioning models using teacher forcing results in very
generic samples, whereas more distinctive captions can be very useful in
retrieval applications or to produce alternative texts describing images for
accessibility. Reinforcement Learning (RL) makes it possible to use the
cross-modal retrieval similarity score between the generated caption and the
input image as a reward to guide training, leading to more distinctive
captions. Recent studies show
that pre-trained cross-modal retrieval models can be used to provide this
reward, completely eliminating the need for reference captions. However, we
argue in this paper that Ground Truth (GT) captions can still be useful in this
RL framework. We propose a new image captioning model training strategy that
makes use of GT captions in different ways. Firstly, they can be used to train
a simple MLP discriminator that serves as a regularization to prevent reward
hacking and ensures the fluency of generated captions, resulting in a textual
GAN setup extended for multimodal inputs. Secondly, they can serve as
additional trajectories in the RL strategy, resulting in a teacher forcing loss
weighted by the similarity of the GT to the image. This objective acts as an
additional learning signal grounded to the distribution of the GT captions.
Thirdly, they can serve as strong baselines when added to the pool of captions
used to compute the proposed contrastive reward, reducing the variance of the
gradient estimate. Experiments on MS-COCO demonstrate the value of the
proposed training strategy for producing highly distinctive captions while
maintaining high writing quality.
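The three uses of GT captions described in the abstract combine a policy-gradient term with a contrastive reward, a GAN-style fluency regularizer, and a similarity-weighted teacher-forcing term. The sketch below is an illustrative reconstruction, not the authors' released code: the CLIP embeddings (image_emb, cand_emb, gt_emb), the sampled-caption log-probabilities and per-caption teacher-forcing NLL (cand_logprob, gt_nll), the CaptionDiscriminator MLP, and the lambda_* weights are assumed names and values standing in for the paper's actual choices.

```python
# Illustrative sketch of the three GT-caption signals; not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CaptionDiscriminator(nn.Module):
    """Use (1): a simple MLP over concatenated CLIP text/image embeddings,
    trained to tell GT captions from generated ones (a textual GAN extended
    to multimodal inputs); its score regularizes the reward, discouraging
    reward hacking and keeping captions fluent."""

    def __init__(self, dim: int = 512, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, text_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
        x = torch.cat([text_emb, image_emb.expand_as(text_emb)], dim=-1)
        return self.mlp(x).squeeze(-1)  # one real/fake logit per caption


def cosine(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between L2-normalized embedding matrices."""
    return F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T


def generator_loss(image_emb, cand_emb, cand_logprob, gt_emb, gt_nll,
                   discriminator, lambda_disc=0.3, lambda_tf=1.0):
    """image_emb: (1, d) CLIP embedding of the input image.
    cand_emb, cand_logprob: (C, d), (C,) for C sampled candidate captions.
    gt_emb, gt_nll: (G, d), (G,) for the G ground-truth captions."""
    # Use (3): contrastive reward in which the GT captions join the candidate
    # pool, giving a stronger baseline and a lower-variance gradient estimate.
    sims = cosine(image_emb, torch.cat([cand_emb, gt_emb], dim=0)).squeeze(0)
    reward = sims[: cand_emb.size(0)] - sims.mean()

    # Use (1): fold the discriminator score into the reward so that degenerate
    # strings CLIP scores highly but that read poorly are penalized.
    reward = reward + lambda_disc * torch.sigmoid(discriminator(cand_emb, image_emb))
    rl_loss = -(reward.detach() * cand_logprob).mean()  # policy-gradient term

    # Use (2): GT captions as extra trajectories -> teacher-forcing NLL
    # weighted by how well each GT caption matches the image according to CLIP.
    gt_weight = cosine(image_emb, gt_emb).squeeze(0).detach()
    tf_loss = (gt_weight * gt_nll).mean()

    return rl_loss + lambda_tf * tf_loss


def discriminator_step(discriminator, image_emb, gt_emb, cand_emb):
    """GAN update for use (1): GT captions are 'real', sampled ones 'fake'."""
    real = discriminator(gt_emb, image_emb)
    fake = discriminator(cand_emb.detach(), image_emb)
    return (F.binary_cross_entropy_with_logits(real, torch.ones_like(real))
            + F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake)))
```

As in a standard GAN setup, the discriminator_step and generator_loss updates would alternate during training, with the CLIP encoders kept frozen.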
Related papers
- CLIP-SCGI: Synthesized Caption-Guided Inversion for Person Re-Identification [9.996589403019675]
Person re-identification (ReID) has recently benefited from large pretrained vision-language models such as Contrastive Language-Image Pre-Training (CLIP).
We propose one straightforward solution by leveraging existing image captioning models to generate pseudo captions for person images.
We introduce CLIP-SCGI, a framework that leverages synthesized captions to guide the learning of discriminative and robust representations.
arXiv Detail & Related papers (2024-10-12T06:24:33Z)
- Fluent and Accurate Image Captioning with a Self-Trained Reward Model [47.213906345208315]
We propose Self-Cap, a captioning approach that relies on a learnable reward model based on self-generated negatives.
Our discriminator is a fine-tuned contrastive image-text model trained to promote caption correctness.
arXiv Detail & Related papers (2024-08-29T18:00:03Z)
- Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training [14.340740609933437]
We propose a novel zero-shot image captioning framework with text-only training to reduce the modality gap.
In particular, we introduce a subregion feature aggregation to leverage local region information.
We extend our framework to build a zero-shot VQA pipeline, demonstrating its generality.
arXiv Detail & Related papers (2024-01-04T16:43:46Z)
- Sentence-level Prompts Benefit Composed Image Retrieval [69.78119883060006]
Composed image retrieval (CIR) is the task of retrieving specific images by using a query that involves both a reference image and a relative caption.
We propose to leverage pretrained V-L models, e.g., BLIP-2, to generate sentence-level prompts.
Our proposed method performs favorably against the state-of-the-art CIR methods on the Fashion-IQ and CIRR datasets.
arXiv Detail & Related papers (2023-10-09T07:31:44Z)
- CgT-GAN: CLIP-guided Text GAN for Image Captioning [48.276753091051035]
We propose CLIP-guided text GAN (CgT-GAN) to enable the model to "see" real visual modality.
We use adversarial training to teach CgT-GAN to mimic the phrases of an external text corpus.
CgT-GAN outperforms state-of-the-art methods significantly across all metrics.
arXiv Detail & Related papers (2023-08-23T10:25:37Z)
- Paraphrasing Is All You Need for Novel Object Captioning [126.66301869607656]
Novel object captioning (NOC) aims to describe images containing objects without observing their ground truth captions during training.
We present Paraphrasing-to-Captioning (P2C), a two-stage learning framework for NOC, which heuristically optimizes the output captions via paraphrasing.
arXiv Detail & Related papers (2022-09-25T22:56:04Z)
- Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning [88.5382122413913]
We study whether language supervision can result in vision models with more transferable representations than traditional image-only methods.
We find that image-only methods do not match CLIP's transfer performance, even when they are trained with more image data.
Motivated by our findings, we devise simple prescriptions to enable CLIP to better leverage the language information present in existing pre-training datasets.
arXiv Detail & Related papers (2022-07-15T17:50:51Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
We present a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
- Fine-grained Image Captioning with CLIP Reward [104.71533106301598]
We propose using CLIP, a multimodal encoder trained on huge numbers of image-text pairs from the web, to calculate multimodal similarity and use it as a reward function.
We also propose a simple finetuning strategy of the CLIP text encoder to improve grammar that does not require extra text annotation.
In experiments on text-to-image retrieval and FineCapEval, the proposed CLIP-guided model generates more distinctive captions than the CIDEr-optimized model (a minimal sketch of such a CLIP reward follows this list).
arXiv Detail & Related papers (2022-05-26T02:46:09Z)
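The last entry above, Fine-grained Image Captioning with CLIP Reward, and the reference-free setting discussed in the main abstract rest on the same primitive: scoring a candidate caption against the image with a pre-trained CLIP model and treating the similarity as a reward. A minimal sketch is shown below, assuming the Hugging Face transformers checkpoint openai/clip-vit-base-patch32 (the papers may use other CLIP variants); the self-critical baseline subtraction in advantage() is likewise an illustrative choice rather than either paper's exact recipe.

```python
# Minimal reference-free CLIP reward sketch; an illustrative assumption, not
# the code of the cited papers.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def clip_reward(image, captions):
    """Cosine similarity between one PIL image and each candidate caption."""
    inputs = processor(text=captions, images=image,
                       return_tensors="pt", padding=True, truncation=True)
    out = model(**inputs)
    img = F.normalize(out.image_embeds, dim=-1)   # (1, d)
    txt = F.normalize(out.text_embeds, dim=-1)    # (N, d)
    return (img @ txt.T).squeeze(0)               # (N,) rewards, no references needed


def advantage(image, sampled_caption, greedy_caption):
    """Self-critical shaping: reward the sampled caption relative to the
    greedy decode, so only captions that beat it get a positive advantage."""
    r = clip_reward(image, [sampled_caption, greedy_caption])
    return r[0] - r[1]
```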