Cross-Domain Image Captioning with Discriminative Finetuning
- URL: http://arxiv.org/abs/2304.01662v1
- Date: Tue, 4 Apr 2023 09:33:16 GMT
- Title: Cross-Domain Image Captioning with Discriminative Finetuning
- Authors: Roberto Dessì, Michele Bevilacqua, Eleonora Gualdoni, Nathanael
Carraz Rakotonirina, Francesca Franzon, Marco Baroni
- Abstract summary: Fine-tuning an out-of-the-box neural captioner with a self-supervised discriminative communication objective helps to recover a plain, visually descriptive language.
We show that discriminatively finetuned captions are more helpful than either vanilla ClipCap captions or ground-truth captions for human annotators tasked with an image discrimination task.
- Score: 20.585138136033905
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural captioners are typically trained to mimic human-generated references
without optimizing for any specific communication goal, leading to problems
such as the generation of vague captions. In this paper, we show that
fine-tuning an out-of-the-box neural captioner with a self-supervised
discriminative communication objective helps to recover a plain, visually
descriptive language that is more informative about image contents. Given a
target image, the system must learn to produce a description that enables an
out-of-the-box text-conditioned image retriever to identify such image among a
set of candidates. We experiment with the popular ClipCap captioner, also
replicating the main results with BLIP. In terms of similarity to ground-truth
human descriptions, the captions emerging from discriminative finetuning lag
slightly behind those generated by the non-finetuned model, when the latter is
trained and tested on the same caption dataset. However, when the model is used
without further tuning to generate captions for out-of-domain datasets, our
discriminatively-finetuned captioner generates descriptions that resemble human
references more than those produced by the same captioner without finetuning.
We further show that, on the Conceptual Captions dataset, discriminatively
finetuned captions are more helpful than either vanilla ClipCap captions or
ground-truth captions for human annotators tasked with an image discrimination
task.
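As a rough illustration of the discriminative objective described in the abstract, the sketch below pairs a trainable captioner with a frozen CLIP retriever and applies a REINFORCE-style update: a sampled caption is rewarded when the retriever picks out its own image from among the other images in the batch. The `captioner.sample` interface, the 0/1 retrieval reward, and the batch-mean baseline are illustrative assumptions, not the exact ClipCap/BLIP setup used in the paper.

```python
# Sketch of discriminative finetuning with a frozen text-conditioned retriever.
# Assumption: `captioner` exposes sample(images) -> (captions, token_log_probs);
# this interface is hypothetical and stands in for the ClipCap/BLIP sampling code.
import torch
from transformers import CLIPModel, CLIPProcessor

retriever = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def retrieval_reward(captions, images):
    """1.0 if a caption retrieves its own image from the batch of candidates, else 0.0."""
    inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = retriever(**inputs).logits_per_text  # (num_captions, num_images)
    return (scores.argmax(dim=-1) == torch.arange(len(captions))).float()


def finetune_step(captioner, optimizer, images):
    captions, token_log_probs = captioner.sample(images)  # hypothetical interface
    reward = retrieval_reward(captions, images)
    baseline = reward.mean()  # simple batch-mean baseline for variance reduction
    # REINFORCE: increase the log-probability of captions that beat the baseline.
    loss = -((reward - baseline) * token_log_probs.sum(dim=-1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), reward.mean().item()
```

Whether the retrieval signal is a hard 0/1 hit or a softer score over the candidate set is a design choice this sketch does not settle; it only shows the overall shape of a self-supervised discriminative finetuning loop.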
Related papers
- Fluent and Accurate Image Captioning with a Self-Trained Reward Model [47.213906345208315]
We propose Self-Cap, a captioning approach that relies on a learnable reward model based on self-generated negatives.
Our discriminator is a fine-tuned contrastive image-text model trained to promote caption correctness.
arXiv Detail & Related papers (2024-08-29T18:00:03Z)
- What Makes for Good Image Captions? [50.48589893443939]
Our framework posits that good image captions should balance three key aspects: they should be informationally sufficient, minimally redundant, and readily comprehensible to humans.
We introduce the Pyramid of Captions (PoCa) method, which generates enriched captions by integrating local and global visual information.
arXiv Detail & Related papers (2024-05-01T12:49:57Z)
- Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to labeled images, i.e., images paired with text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
arXiv Detail & Related papers (2024-04-26T15:56:08Z)
- Inserting Faces inside Captions: Image Captioning with Attention Guided Merging [0.0]
We introduce AstroCaptions, a dataset for the image captioning task.
We propose a novel post-processing method to insert identified people's names inside the caption.
arXiv Detail & Related papers (2024-03-20T08:38:25Z)
- Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion [17.99150939602917]
State-of-the-Art (SoTA) image captioning models often rely on the Microsoft COCO (MS-COCO) dataset for training.
We present a novel approach to address these challenges by showing how captions generated by different SoTA models can be effectively fused.
arXiv Detail & Related papers (2023-06-20T15:13:02Z)
- Fine-grained Image Captioning with CLIP Reward [104.71533106301598]
We propose using CLIP, a multimodal encoder trained on a huge number of image-text pairs from the web, to compute multimodal similarity and use it as a reward function (a minimal sketch of this kind of CLIP-similarity reward follows this list).
We also propose a simple CLIP text-encoder finetuning strategy that improves grammar without requiring extra text annotation.
In experiments on text-to-image retrieval and FineCapEval, the proposed CLIP-guided model generates more distinctive captions than the CIDEr-optimized model.
arXiv Detail & Related papers (2022-05-26T02:46:09Z)
- Fine-Grained Image Captioning with Global-Local Discriminative Objective [80.73827423555655]
We propose a novel global-local discriminative objective to facilitate generating fine-grained descriptive captions.
We evaluate the proposed method on the widely used MS-COCO dataset.
arXiv Detail & Related papers (2020-07-21T08:46:02Z)
- Pragmatic Issue-Sensitive Image Captioning [11.998287522410404]
We propose Issue-Sensitive Image Captioning (ISIC).
In ISIC, a captioning system is given a target image and an issue: a partition of a set of images that specifies what information is relevant.
We show how ISIC can complement and enrich the related task of Visual Question Answering.
arXiv Detail & Related papers (2020-04-29T20:00:53Z)
- Egoshots, an ego-vision life-logging dataset and semantic fidelity metric to evaluate diversity in image captioning models [63.11766263832545]
We present a new image captioning dataset, Egoshots, consisting of 978 real-life images with no captions.
In order to evaluate the quality of the generated captions, we propose a new image captioning metric, object-based Semantic Fidelity (SF).
arXiv Detail & Related papers (2020-03-26T04:43:30Z)
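As referenced in the CLIP-reward entry above, the following minimal sketch shows how CLIP image-text similarity can be turned into a per-caption reward. The reward shaping used here (raw cosine similarity, with no grammar term) is an assumption for illustration, not that paper's exact recipe.

```python
# Minimal sketch: CLIP image-text cosine similarity as a per-caption reward,
# in the spirit of "Fine-grained Image Captioning with CLIP Reward" above.
# The reward shaping (plain cosine similarity, no grammar component) is an assumption.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clip_similarity_reward(captions, images):
    """Cosine similarity between each caption and its paired image (diagonal of the score matrix)."""
    inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
        text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
        image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    return (text_emb * image_emb).sum(dim=-1)  # one reward per (caption, image) pair
```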
This list is automatically generated from the titles and abstracts of the papers on this site.