Length-Controllable Image Captioning
- URL: http://arxiv.org/abs/2007.09580v1
- Date: Sun, 19 Jul 2020 03:40:51 GMT
- Title: Length-Controllable Image Captioning
- Authors: Chaorui Deng, Ning Ding, Mingkui Tan, Qi Wu
- Abstract summary: We propose to use a simple length level embedding to endow existing image captioning models with this ability.
Due to their autoregressive nature, the computational complexity of existing models increases linearly as the length of the generated captions grows.
We further devise a non-autoregressive image captioning approach that can generate captions in a length-irrelevant complexity.
- Score: 67.2079793803317
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The last decade has witnessed remarkable progress in the image captioning
task; however, most existing methods cannot control their captions,
e.g., choosing to describe the image either roughly or in detail. In
this paper, we propose to use a simple length level embedding to endow them
with this ability. Moreover, due to their autoregressive nature, the
computational complexity of existing models increases linearly as the length of
the generated captions grows. Thus, we further devise a non-autoregressive
image captioning approach that can generate captions in a length-irrelevant
complexity. We verify the merit of the proposed length level embedding on three
models: two state-of-the-art (SOTA) autoregressive models with different types
of decoder, as well as our proposed non-autoregressive model, to show its
generalization ability. In the experiments, our length-controllable image
captioning models not only achieve SOTA performance on the challenging MS COCO
dataset but also generate length-controllable and diverse image captions.
Specifically, our non-autoregressive model outperforms the autoregressive
baselines in terms of controllability and diversity, and also significantly
improves the decoding efficiency for long captions. Our code and models are
released at https://github.com/bearcatt/LaBERT.
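As a rough sketch of the length level idea (illustrative only, not the released LaBERT code), a learned embedding for the desired length range can simply be added to the token and position embeddings of a BERT-style decoder; the bucket boundaries, vocabulary size, and module names below are assumptions.

```python
import torch
import torch.nn as nn

class LengthLevelEmbedding(nn.Module):
    """Sketch: inject a learned 'length level' signal into a caption decoder.
    Bucket boundaries and sizes are illustrative, not the paper's values."""
    def __init__(self, vocab_size=30522, hidden=768, max_len=25, num_levels=4):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden)
        self.pos_emb = nn.Embedding(max_len, hidden)
        self.level_emb = nn.Embedding(num_levels, hidden)  # one vector per length range
        self.norm = nn.LayerNorm(hidden)

    @staticmethod
    def length_to_level(length, boundaries=(9, 14, 19)):
        # Map a desired caption length to a discrete level id (illustrative buckets).
        return sum(length > b for b in boundaries)

    def forward(self, token_ids, level_id):
        b, t = token_ids.shape
        pos = torch.arange(t, device=token_ids.device).unsqueeze(0).expand(b, t)
        lvl = torch.full((b, t), level_id, device=token_ids.device)
        x = self.token_emb(token_ids) + self.pos_emb(pos) + self.level_emb(lvl)
        return self.norm(x)

# At inference, a non-autoregressive decoder can fill all positions of the
# chosen length range in parallel (e.g. mask-predict style refinement), so the
# cost no longer grows linearly with caption length.
emb = LengthLevelEmbedding()
mask_id = 103                            # hypothetical [MASK] token id
tokens = torch.full((1, 16), mask_id)    # 16 masked slots for a longer caption
level = LengthLevelEmbedding.length_to_level(16)
decoder_inputs = emb(tokens, level)      # feed to a BERT-style decoder (not shown)
```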
Related papers
- Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models [63.01630478059315]
Recent advancements in multimodal models highlight the value of rewritten captions for improving performance.
However, the interaction between synthetic captions and the original web-crawled AltTexts during pre-training is still not well understood.
We propose a novel, controllable, and scalable captioning pipeline designed to generate diverse caption formats tailored to various multimodal models.
arXiv Detail & Related papers (2024-10-03T17:54:52Z) - ImageFolder: Autoregressive Image Generation with Folded Tokens [51.815319504939396]
Increasing token length is a common approach to improve the image reconstruction quality.
There exists a trade-off between reconstruction and generation quality regarding token length.
We propose ImageFolder, a semantic tokenizer that provides spatially aligned image tokens that can be folded during autoregressive modeling.
arXiv Detail & Related papers (2024-10-02T17:06:39Z) - Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion [17.99150939602917]
State-of-The-Art (SoTA) image captioning models often rely on the Microsoft COCO (MS-COCO) dataset for training.
We present a novel approach to address previous challenges by showcasing how captions generated from different SoTA models can be effectively fused.
arXiv Detail & Related papers (2023-06-20T15:13:02Z) - FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions [11.274127953112574]
We propose an automated approach to augmenting existing captions with visual details using "frozen" vision experts.
Our proposed method, FuseCap, fuses the outputs of such vision experts with the original captions using a large language model.
We release this large-scale dataset of enriched image-caption pairs for the community.
arXiv Detail & Related papers (2023-05-28T13:16:03Z) - CLID: Controlled-Length Image Descriptions with Limited Data [14.857590796528902]
This paper focuses on controlling the caption length, i.e. a short and concise description or a long and detailed one.
Since existing image captioning datasets contain mostly short captions, generating long captions is challenging.
To address the shortage of long training examples, we propose to enrich the dataset with varying-length self-generated captions.
arXiv Detail & Related papers (2022-11-27T14:18:40Z) - Exploring Discrete Diffusion Models for Image Captioning [104.69608826164216]
We present a diffusion-based captioning model, dubbed DDCap, to allow more decoding flexibility.
We propose several key techniques including best-first inference, concentrated attention mask, text length prediction, and image-free training.
With 4M vision-language pre-training images and the base-sized model, we reach a CIDEr score of 125.1 on COCO.
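The "text length prediction" step above can be illustrated with a generic sketch (not DDCap's actual implementation): a small head over pooled image features classifies the caption length, so a parallel decoder knows how many token slots to fill. Dimensions and names below are hypothetical.

```python
import torch
import torch.nn as nn

class LengthPredictor(nn.Module):
    """Generic sketch of a text-length prediction head: classify the caption
    length from pooled image features so a non-autoregressive decoder knows
    how many token slots to allocate."""
    def __init__(self, feat_dim=768, max_len=25):
        super().__init__()
        self.head = nn.Linear(feat_dim, max_len + 1)  # lengths 0..max_len

    def forward(self, image_feats):
        pooled = image_feats.mean(dim=1)  # (batch, feat_dim)
        return self.head(pooled)          # logits over candidate lengths

pred = LengthPredictor()
feats = torch.randn(2, 49, 768)           # e.g. a 7x7 grid of visual features
length = pred(feats).argmax(dim=-1)       # predicted caption length per image
```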
arXiv Detail & Related papers (2022-11-21T18:12:53Z) - Scaling Autoregressive Models for Content-Rich Text-to-Image Generation [95.02406834386814]
Parti treats text-to-image generation as a sequence-to-sequence modeling problem.
Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens.
PartiPrompts (P2) is a new holistic benchmark of over 1600 English prompts.
arXiv Detail & Related papers (2022-06-22T01:11:29Z) - Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [72.60554897161948]
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences.
In this work, we repurpose such models to generate a descriptive text given an image at inference time.
The resulting captions are much less restrictive than those obtained by supervised captioning methods.
arXiv Detail & Related papers (2021-11-29T11:01:49Z) - Semi-Autoregressive Transformer for Image Captioning [17.533503295862808]
We introduce a semi-autoregressive model for image captioning (dubbed SATIC).
It keeps the autoregressive property globally but generates words in parallel locally.
Experiments on the MSCOCO image captioning benchmark show that SATIC achieves a better trade-off between speed and quality without bells and whistles.
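A minimal sketch of this semi-autoregressive decoding pattern (illustrative only, not the SATIC code): the caption grows group by group, autoregressive across groups, while the tokens inside each group are predicted in parallel. The decoder interface, token ids, and group sizes below are assumptions.

```python
import torch

def semi_autoregressive_decode(decoder, memory, bos_id, mask_id,
                               group_size=4, num_groups=5):
    """Sketch: autoregressive across groups, parallel within each group.
    `decoder(tokens, memory)` is assumed to return one logit vector per
    input position."""
    batch = memory.size(0)
    prefix = torch.full((batch, 1), bos_id)
    for _ in range(num_groups):
        # Append `group_size` placeholder slots and fill them all in one pass.
        slots = torch.full((batch, group_size), mask_id)
        logits = decoder(torch.cat([prefix, slots], dim=1), memory)
        new_tokens = logits[:, -group_size:, :].argmax(dim=-1)
        prefix = torch.cat([prefix, new_tokens], dim=1)  # next group conditions on these
    return prefix[:, 1:]  # drop the BOS token

# Toy stand-in decoder, just to show the call pattern (a real model would
# attend to `memory`, i.e. the image features).
toy = lambda tokens, memory: torch.randn(tokens.size(0), tokens.size(1), 1000)
caption_ids = semi_autoregressive_decode(toy, torch.randn(2, 49, 768),
                                         bos_id=101, mask_id=103)
print(caption_ids.shape)  # torch.Size([2, 20]) -> 5 groups of 4 tokens
```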
arXiv Detail & Related papers (2021-06-17T12:36:33Z) - Non-Autoregressive Image Captioning with Counterfactuals-Critical Multi-Agent Learning [46.060954649681385]
We propose a Non-Autoregressive Image Captioning (NAIC) model with a novel training paradigm: Counterfactuals-critical Multi-Agent Learning (CMAL).
Our NAIC model achieves performance comparable to state-of-the-art autoregressive models, while bringing a 13.9x decoding speedup.
arXiv Detail & Related papers (2020-05-10T15:09:44Z)