CLID: Controlled-Length Image Descriptions with Limited Data
- URL: http://arxiv.org/abs/2211.14835v2
- Date: Sun, 21 Jan 2024 11:13:31 GMT
- Title: CLID: Controlled-Length Image Descriptions with Limited Data
- Authors: Elad Hirsch and Ayellet Tal
- Abstract summary: This paper focuses on controlling the caption length, i.e. a short and concise description or a long and detailed one.
Since existing image captioning datasets contain mostly short captions, generating long captions is challenging.
To address the shortage of long training examples, we propose to enrich the dataset with varying-length self-generated captions.
- Score: 14.857590796528902
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Controllable image captioning models generate human-like image descriptions,
enabling some kind of control over the generated captions. This paper focuses
on controlling the caption length, i.e. a short and concise description or a
long and detailed one. Since existing image captioning datasets contain mostly
short captions, generating long captions is challenging. To address the
shortage of long training examples, we propose to enrich the dataset with
varying-length self-generated captions. These, however, might be of varying
quality and are thus unsuitable for conventional training. We introduce a novel
training strategy that selects the data points to be used at different times
during the training. Our method dramatically improves the length-control
abilities, while exhibiting SoTA performance in terms of caption quality. Our
approach is general and is shown to be applicable also to paragraph generation.
Related papers
- LoTLIP: Improving Language-Image Pre-training for Long Text Understanding [71.04947115945349]
We relabel the data with long captions, however, directly learning with which may lead to performance degradation in understanding short text.
We then help the model catch up to its original level of short text understanding yet greatly enhance its capability of long text understanding.
Our method demonstrates superior performance in long-text-image retrieval tasks.
arXiv Detail & Related papers (2024-10-07T17:52:56Z) - HowToCaption: Prompting LLMs to Transform Video Annotations at Scale [72.69268311756082]
We propose to leverage the capabilities of large language models (LLMs) to obtain high-quality video descriptions aligned with videos at scale.
We introduce a prompting method that is able to take into account a longer text of subtitles, allowing us to capture the contextual information beyond one single sentence.
We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption.
arXiv Detail & Related papers (2023-10-07T19:32:55Z) - Improving Image Captioning Descriptiveness by Ranking and LLM-based
Fusion [17.99150939602917]
State-of-The-Art (SoTA) image captioning models often rely on the Microsoft COCO (MS-COCO) dataset for training.
We present a novel approach to address previous challenges by showcasing how captions generated from different SoTA models can be effectively fused.
arXiv Detail & Related papers (2023-06-20T15:13:02Z) - FuseCap: Leveraging Large Language Models for Enriched Fused Image
Captions [11.274127953112574]
We propose an automated approach to augmenting existing captions with visual details using "frozen" vision experts.
Our proposed method, FuseCap, fuses the outputs of such vision experts with the original captions using a large language model.
We release this large-scale dataset of enriched image-caption pairs for the community.
arXiv Detail & Related papers (2023-05-28T13:16:03Z) - Cross-Domain Image Captioning with Discriminative Finetuning [20.585138136033905]
Fine-tuning an out-of-the-box neural captioner with a self-supervised discriminative communication objective helps to recover a plain, visually descriptive language.
We show that discriminatively finetuned captions are more helpful than either vanilla ClipCap captions or ground-truth captions for human annotators tasked with an image discrimination task.
arXiv Detail & Related papers (2023-04-04T09:33:16Z) - Paraphrasing Is All You Need for Novel Object Captioning [126.66301869607656]
Novel object captioning (NOC) aims to describe images containing objects without observing their ground truth captions during training.
We present Paraphrasing-to-Captioning (P2C), a two-stage learning framework for NOC, which wouldally optimize the output captions via paraphrasing.
arXiv Detail & Related papers (2022-09-25T22:56:04Z) - VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning [128.6138588412508]
This paper presents VIsual VOcabulary pretraining (VIVO) that performs pre-training in the absence of caption annotations.
Our model can not only generate fluent image captions that describe novel objects, but also identify the locations of these objects.
arXiv Detail & Related papers (2020-09-28T23:20:02Z) - Length-Controllable Image Captioning [67.2079793803317]
We propose to use a simple length level embedding to endow them with this ability.
Due to their autoregressive nature, the computational complexity of existing models increases linearly as the length of the generated captions grows.
We further devise a non-autoregressive image captioning approach that can generate captions in a length-irrelevant complexity.
arXiv Detail & Related papers (2020-07-19T03:40:51Z) - Egoshots, an ego-vision life-logging dataset and semantic fidelity
metric to evaluate diversity in image captioning models [63.11766263832545]
We present a new image captioning dataset, Egoshots, consisting of 978 real life images with no captions.
In order to evaluate the quality of the generated captions, we propose a new image captioning metric, object based Semantic Fidelity (SF)
arXiv Detail & Related papers (2020-03-26T04:43:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.