Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion
- URL: http://arxiv.org/abs/2306.11593v1
- Date: Tue, 20 Jun 2023 15:13:02 GMT
- Title: Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion
- Authors: Simone Bianco and Luigi Celona and Marco Donzella and Paolo Napoletano
- Abstract summary: State-of-the-Art (SoTA) image captioning models often rely on the Microsoft COCO (MS-COCO) dataset for training.
We present a novel approach to address previous challenges by showcasing how captions generated from different SoTA models can be effectively fused.
- Score: 17.99150939602917
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: State-of-the-Art (SoTA) image captioning models often rely on the Microsoft
COCO (MS-COCO) dataset for training. This dataset contains annotations provided
by human annotators, who typically produce captions averaging around ten
tokens. However, this constraint presents a challenge in effectively capturing
complex scenes and conveying detailed information. Furthermore, captioning
models tend to exhibit bias towards the "average" caption, which captures
only the more general aspects. What would happen if we were able to
automatically generate longer captions, thereby making them more detailed?
Would these captions, evaluated by humans, be more or less representative of
the image content compared to the original MS-COCO captions? In this paper, we
present a novel approach to address previous challenges by showcasing how
captions generated from different SoTA models can be effectively fused,
resulting in richer captions. Our proposed method leverages existing models
from the literature, eliminating the need for additional training. Instead, it
utilizes an image-text based metric to rank the captions generated by SoTA
models for a given image. Subsequently, the top two captions are fused using a
Large Language Model (LLM). Experimental results demonstrate the effectiveness
of our approach, as the captions generated by our model exhibit higher
consistency with human judgment when evaluated on the MS-COCO test set. By
combining the strengths of various SoTA models, our method enhances the quality
and appeal of image captions, bridging the gap between automated systems and
the rich, informative nature of human-generated descriptions. This advance
opens up new possibilities for generating captions that are more suitable for
the training of both vision-language and captioning models.
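The abstract describes a two-step recipe: score the candidate captions produced by several SoTA models with an image-text metric, keep the top two, and ask an LLM to fuse them into a single richer caption. The snippet below is a minimal sketch of that idea, assuming a CLIP-style image-text similarity as the ranking metric and an illustrative fusion prompt; the exact metric, captioning models, LLM, and prompt used in the paper are not specified in the abstract.

```python
# Minimal sketch of the rank-then-fuse idea from the abstract.
# Assumptions (not stated in the abstract): CLIP image-text similarity
# is used for ranking, and the fusion prompt wording is illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def rank_captions(image: Image.Image, captions: list[str]) -> list[str]:
    """Sort candidate captions by their image-text similarity to the image."""
    inputs = processor(text=captions, images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image[0]  # one score per caption
    order = scores.argsort(descending=True)
    return [captions[int(i)] for i in order]


def build_fusion_prompt(top_two: list[str]) -> str:
    """Illustrative prompt asking an LLM to merge the two best captions."""
    return (
        "Fuse these two image captions into one detailed caption that keeps "
        "all correct visual details and adds nothing new:\n"
        f"1. {top_two[0]}\n2. {top_two[1]}"
    )


# Example usage with captions produced by different captioning models:
# image = Image.open("example.jpg")
# ranked = rank_captions(image, ["a dog on grass", "a brown dog runs on a lawn"])
# prompt = build_fusion_prompt(ranked[:2])  # send `prompt` to any LLM of choice
```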
Related papers
- Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models [63.01630478059315]
Recent advancements in multimodal models highlight the value of rewritten captions for improving performance.
However, how synthetic captions interact with the original web-crawled AltTexts during pre-training is still not well understood.
We propose a novel, controllable, and scalable captioning pipeline designed to generate diverse caption formats tailored to various multimodal models.
arXiv Detail & Related papers (2024-10-03T17:54:52Z) - Inserting Faces inside Captions: Image Captioning with Attention Guided Merging [0.0]
We introduce AstroCaptions, a dataset for the image captioning task.
We propose a novel post-processing method to insert identified people's names inside the caption.
arXiv Detail & Related papers (2024-03-20T08:38:25Z) - CapText: Large Language Model-based Caption Generation From Image Context and Description [0.0]
We propose and evaluate a new approach to generate captions from textual descriptions and context alone.
Our approach outperforms current state-of-the-art image-text alignment models like OSCAR-VinVL on this task in terms of the CIDEr metric.
arXiv Detail & Related papers (2023-06-01T02:40:44Z) - FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions [11.274127953112574]
We propose an automated approach to augmenting existing captions with visual details using "frozen" vision experts.
Our proposed method, FuseCap, fuses the outputs of such vision experts with the original captions using a large language model.
We release this large-scale dataset of enriched image-caption pairs for the community.
arXiv Detail & Related papers (2023-05-28T13:16:03Z) - Paraphrasing Is All You Need for Novel Object Captioning [126.66301869607656]
Novel object captioning (NOC) aims to describe images containing objects without observing their ground truth captions during training.
We present Paraphrasing-to-Captioning (P2C), a two-stage learning framework for NOC, which would heuristically optimize the output captions via paraphrasing.
arXiv Detail & Related papers (2022-09-25T22:56:04Z) - Learning Distinct and Representative Styles for Image Captioning [24.13549951795951]
We propose a Discrete Mode Learning (DML) paradigm for image captioning.
Our innovative idea is to explore the rich modes in the training caption corpus to learn a set of "mode embeddings".
In the experiments, we apply the proposed DML to two widely used image captioning models, Transformer and AoANet.
arXiv Detail & Related papers (2022-09-17T03:25:46Z) - Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
We present in this paper a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z) - Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that the proposed metric maintains robust performance and assigns more flexible scores to candidate captions when encountering semantically similar expressions or less aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z) - Length-Controllable Image Captioning [67.2079793803317]
We propose to use a simple length-level embedding to endow existing captioning models with control over caption length (a minimal sketch of this idea appears after this list).
Due to their autoregressive nature, the computational complexity of existing models increases linearly as the length of the generated captions grows.
We further devise a non-autoregressive image captioning approach that can generate captions in a length-irrelevant complexity.
arXiv Detail & Related papers (2020-07-19T03:40:51Z) - Egoshots, an ego-vision life-logging dataset and semantic fidelity metric to evaluate diversity in image captioning models [63.11766263832545]
We present a new image captioning dataset, Egoshots, consisting of 978 real-life images with no captions.
In order to evaluate the quality of the generated captions, we propose a new image captioning metric, object-based Semantic Fidelity (SF).
arXiv Detail & Related papers (2020-03-26T04:43:30Z)
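As a footnote to the Length-Controllable Image Captioning entry above, the length-level embedding idea can be pictured with a small sketch: a learned embedding for a discrete length bucket is added to the token embeddings that feed the caption decoder. The module below is an assumption-based illustration, not the paper's implementation; the bucket count and dimensions are hypothetical.

```python
# Hypothetical sketch of a length-level embedding for caption decoding.
import torch
import torch.nn as nn


class LengthConditionedEmbedding(nn.Module):
    """Token embeddings plus a learned embedding for a discrete length level.

    The desired caption length is bucketed into `num_levels` levels, and the
    corresponding embedding is added to every token embedding so the decoder
    can condition its generation on the target length.
    """

    def __init__(self, vocab_size: int, dim: int, num_levels: int = 4):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)
        self.len_level = nn.Embedding(num_levels, dim)

    def forward(self, token_ids: torch.Tensor, level: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len); level: (batch,)
        return self.tok(token_ids) + self.len_level(level).unsqueeze(1)


# Example: condition the decoder input on length level 2 (a "long" caption bucket).
emb = LengthConditionedEmbedding(vocab_size=30522, dim=512)
x = emb(torch.randint(0, 30522, (1, 12)), torch.tensor([2]))
print(x.shape)  # torch.Size([1, 12, 512])
```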