Guiding Image Captioning Models Toward More Specific Captions
- URL: http://arxiv.org/abs/2307.16686v1
- Date: Mon, 31 Jul 2023 14:00:12 GMT
- Title: Guiding Image Captioning Models Toward More Specific Captions
- Authors: Simon Kornblith, Lala Li, Zirui Wang, Thao Nguyen
- Abstract summary: We show that it is possible to generate more specific captions with minimal changes to the training process.
We implement classifier-free guidance for an autoregressive captioning model by fine-tuning it to estimate both conditional and unconditional distributions over captions.
- Score: 32.36062034676917
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image captioning is conventionally formulated as the task of generating
captions for images that match the distribution of reference image-caption
pairs. However, reference captions in standard captioning datasets are short
and may not uniquely identify the images they describe. These problems are
further exacerbated when models are trained directly on image-alt text pairs
collected from the internet. In this work, we show that it is possible to
generate more specific captions with minimal changes to the training process.
We implement classifier-free guidance for an autoregressive captioning model by
fine-tuning it to estimate both conditional and unconditional distributions
over captions. The guidance scale applied at decoding controls a trade-off
between maximizing $p(\mathrm{caption}|\mathrm{image})$ and
$p(\mathrm{image}|\mathrm{caption})$. Compared to standard greedy decoding,
decoding with a guidance scale of 2 substantially improves reference-free
metrics such as CLIPScore (0.808 vs. 0.775) and caption$\to$image retrieval
performance in the CLIP embedding space (recall@1 44.6% vs. 26.5%), but worsens
standard reference-based captioning metrics (e.g., CIDEr 78.6 vs. 126.1). We
further explore the use of language models to guide the decoding process,
obtaining small improvements over the Pareto frontier of reference-free vs.
reference-based captioning metrics that arises from classifier-free guidance,
and substantially improving the quality of captions generated from a model
trained only on minimally curated web data.
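The decoding rule described in the abstract combines conditional and unconditional next-token logits, with the guidance scale interpolating between the language prior (scale 0) and the conditional model (scale 1), and extrapolating beyond it for scale > 1. A minimal sketch follows; the arrays and scale values are illustrative, and in a real captioner the two logit vectors would come from conditional and unconditional forward passes of the fine-tuned model:

```python
import numpy as np

def cfg_logits(cond_logits, uncond_logits, scale):
    """Classifier-free guidance over next-token logits.

    scale = 0 -> unconditional decoding (language prior only)
    scale = 1 -> standard conditional decoding
    scale > 1 -> extrapolates toward tokens the image makes more
                 likely than the prior, trading reference likelihood
                 for specificity.
    """
    return uncond_logits + scale * (cond_logits - uncond_logits)

# Toy greedy step over a 3-token vocabulary.
cond = np.array([2.0, 2.5, 0.0])    # logits given image + prefix
uncond = np.array([0.0, 2.5, 0.0])  # logits given prefix only
token_scale1 = int(np.argmax(cfg_logits(cond, uncond, 1.0)))  # token 1: prior agrees
token_scale2 = int(np.argmax(cfg_logits(cond, uncond, 2.0)))  # token 0: image-specific token wins
```

At scale 1 the guided logits equal the conditional logits, so decoding is unchanged; at scale 2 the token that the image makes much more likely than the prior is boosted, which is the mechanism behind the CLIPScore gains (and CIDEr losses) reported above.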
Related papers
- Fluent and Accurate Image Captioning with a Self-Trained Reward Model [47.213906345208315]
We propose Self-Cap, a captioning approach that relies on a learnable reward model based on self-generated negatives.
Our discriminator is a fine-tuned contrastive image-text model trained to promote caption correctness.
arXiv Detail & Related papers (2024-08-29T18:00:03Z)
- A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation [9.552642210681489]
We show that by relabeling the corpus with a specialized automatic captioning model and training a text-to-image model on the recaptioned dataset, the model benefits substantially across the board.
We analyze various ways to relabel the corpus and provide evidence that this technique, which we call RECAP, both reduces the train-inference discrepancy and provides the model with more information per example.
arXiv Detail & Related papers (2023-10-25T14:10:08Z)
- Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion [17.99150939602917]
State-of-The-Art (SoTA) image captioning models often rely on the Microsoft COCO (MS-COCO) dataset for training.
We present a novel approach to address previous challenges by showcasing how captions generated from different SoTA models can be effectively fused.
arXiv Detail & Related papers (2023-06-20T15:13:02Z)
- Cross-Domain Image Captioning with Discriminative Finetuning [20.585138136033905]
Fine-tuning an out-of-the-box neural captioner with a self-supervised discriminative communication objective helps to recover a plain, visually descriptive language.
We show that discriminatively finetuned captions are more helpful than either vanilla ClipCap captions or ground-truth captions for human annotators tasked with an image discrimination task.
arXiv Detail & Related papers (2023-04-04T09:33:16Z)
- Noise-aware Learning from Web-crawled Image-Text Data for Image Captioning [6.101765622702223]
The Noise-aware Captioning (NoC) framework learns rich knowledge from the whole of the web-crawled data while being less affected by noise.
This is achieved by the proposed alignment-level-controllable captioner, which is learned using alignment levels of the image-text pairs as a control signal.
An in-depth analysis shows the effectiveness of our framework in handling noise.
arXiv Detail & Related papers (2022-12-27T17:33:40Z)
- Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning [88.5382122413913]
We study whether language supervision can result in vision models with more transferable representations than traditional image-only methods.
We find that image-only methods do not match CLIP's transfer performance, even when they are trained with more image data.
Motivated by our findings, we devise simple prescriptions to enable CLIP to better leverage the language information present in existing pre-training datasets.
arXiv Detail & Related papers (2022-07-15T17:50:51Z)
- Fine-grained Image Captioning with CLIP Reward [104.71533106301598]
We propose using CLIP, a multimodal encoder trained on a huge set of image-text pairs from the web, to calculate multimodal similarity and use it as a reward function.
We also propose a simple finetuning strategy for the CLIP text encoder that improves grammar without requiring extra text annotations.
In experiments on text-to-image retrieval and FineCapEval, the proposed CLIP-guided model generates more distinctive captions than the CIDEr-optimized model.
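A CLIP-based reward of this kind reduces to cosine similarity between image and caption embeddings in the shared space. A minimal sketch, assuming unit-length embeddings would come from a CLIP image and text encoder (the clip-at-zero and w = 2.5 rescaling follow the common CLIPScore formulation, not necessarily this paper's exact reward):

```python
import numpy as np

def clip_reward(image_emb, text_emb, w=2.5):
    # CLIPScore-style reward: rescaled cosine similarity between the
    # image embedding and the caption embedding, clipped at zero so
    # anti-correlated captions receive no reward.
    i = image_emb / np.linalg.norm(image_emb)
    t = text_emb / np.linalg.norm(text_emb)
    return w * max(float(i @ t), 0.0)

# Toy check: a caption embedding aligned with the image scores highest.
img = np.array([1.0, 0.0])
good = clip_reward(img, np.array([0.9, 0.1]))   # near-aligned caption
bad = clip_reward(img, np.array([-1.0, 0.0]))   # anti-aligned caption -> 0.0
```

Optimizing such a reward pushes the captioner toward captions that are distinctive for their image, which is the same quantity the guided-decoding results above improve via the caption-to-image retrieval metric.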
arXiv Detail & Related papers (2022-05-26T02:46:09Z)
- Contrastive Semantic Similarity Learning for Image Captioning Evaluation with Intrinsic Auto-encoder [52.42057181754076]
Motivated by the auto-encoder mechanism and contrastive representation learning advances, we propose a learning-based metric for image captioning.
We develop three progressive model structures to learn the sentence level representations.
Experiment results show that our proposed method can align well with the scores generated from other contemporary metrics.
arXiv Detail & Related papers (2021-06-29T12:27:05Z)
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experiment results show that our proposed method maintains robust performance and assigns more flexible scores to candidate captions when faced with semantically similar expressions or less aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)
- Length-Controllable Image Captioning [67.2079793803317]
Due to their autoregressive nature, the computational complexity of existing captioning models increases linearly with the length of the generated captions.
We propose a simple length-level embedding to endow these models with length controllability.
We further devise a non-autoregressive image captioning approach that can generate captions with length-independent complexity.
arXiv Detail & Related papers (2020-07-19T03:40:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.