Exploring Discrete Diffusion Models for Image Captioning
- URL: http://arxiv.org/abs/2211.11694v1
- Date: Mon, 21 Nov 2022 18:12:53 GMT
- Title: Exploring Discrete Diffusion Models for Image Captioning
- Authors: Zixin Zhu, Yixuan Wei, Jianfeng Wang, Zhe Gan, Zheng Zhang, Le Wang,
Gang Hua, Lijuan Wang, Zicheng Liu, Han Hu
- Abstract summary: We present a diffusion-based captioning model, dubbed DDCap, to allow more decoding flexibility.
We propose several key techniques including best-first inference, concentrated attention mask, text length prediction, and image-free training.
With 4M vision-language pre-training images and the base-sized model, we reach a CIDEr score of 125.1 on COCO.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The image captioning task is typically realized by an auto-regressive method
that decodes the text tokens one by one. We present a diffusion-based
captioning model, dubbed DDCap, to allow more decoding flexibility.
Unlike image generation, where the output is continuous and redundant with a
fixed length, texts in image captions are categorical and short with varied
lengths. Therefore, naively applying the discrete diffusion model to text
decoding does not work well, as shown in our experiments. To address the
performance gap, we propose several key techniques including best-first
inference, concentrated attention mask, text length prediction, and image-free
training. On COCO without additional caption pre-training, it achieves a CIDEr
score of 117.8, which is +5.0 higher than the auto-regressive baseline with the
same architecture in the controlled setting. It also achieves a +26.8 higher
CIDEr score than the auto-regressive baseline (230.3 vs. 203.5) on a caption
infilling task. With 4M vision-language pre-training images and the base-sized
model, we reach a CIDEr score of 125.1 on COCO, which is competitive with the
best well-developed auto-regressive frameworks. The code is available at
https://github.com/buxiangzhiren/DDCap.
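The decoding procedure described in the abstract can be pictured with a short sketch. Below is a minimal, hypothetical illustration of best-first inference for a discrete diffusion captioner, not the authors' implementation; the model call signature, MASK_ID, and the per-step schedule are assumptions. Starting from an all-[MASK] sequence of the predicted length, each step fills every masked position, commits the most confident predictions, and re-masks the rest.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical id of the [MASK] token

def best_first_decode(model, image_feats, length, steps=10):
    """Sketch of best-first inference, in the spirit of DDCap."""
    tokens = torch.full((1, length), MASK_ID, dtype=torch.long)
    per_step = max(1, length // steps)  # positions to commit each step
    for _ in range(steps):
        masked = tokens == MASK_ID
        if not masked.any():
            break
        logits = model(tokens, image_feats)         # (1, length, vocab)
        conf, pred = F.softmax(logits, -1).max(-1)  # per-position best guess
        conf = conf.masked_fill(~masked, -1.0)      # keep decoded slots fixed
        k = min(per_step, int(masked.sum()))
        idx = conf.topk(k, dim=-1).indices          # most confident positions
        tokens[0, idx[0]] = pred[0, idx[0]]         # commit; rest stay masked
    return tokens
```

In this picture, text length prediction supplies `length` before decoding starts, so the sequence never has to grow or shrink during denoising.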
Related papers
- Bounding and Filling: A Fast and Flexible Framework for Image Captioning [5.810020749348207]
We introduce BoFiCap, a fast and flexible image captioning framework based on bounding and filling techniques.
Operating in a non-autoregressive manner, our framework achieves state-of-the-art results on the task-specific metric CIDEr while providing a 9.22x speedup.
arXiv Detail & Related papers (2023-10-15T16:17:20Z)
- Improving Multimodal Datasets with Image Captioning [65.74736570293622]
We study how generated captions can increase the utility of web-scraped datapoints with nondescript text.
Our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text.
arXiv Detail & Related papers (2023-07-19T17:47:12Z)
- DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training [73.74291217502928]
We propose a simple framework, named DeCap, for zero-shot captioning.
We introduce a lightweight visual-aware language decoder.
We project the visual embedding into the CLIP text embedding space, so that the projected embedding retains the information of the visual input (see the sketch after this entry).
arXiv Detail & Related papers (2023-03-06T11:02:47Z)
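As a rough picture of this projection idea, here is a minimal, hypothetical sketch; the memory-based weighted projection, names, and sizes are assumptions rather than DeCap's exact implementation. An image embedding is mapped into the CLIP text embedding space as a similarity-weighted combination of text embeddings stored during text-only training.

```python
import torch
import torch.nn as nn

class VisualToTextProjection(nn.Module):
    """Hypothetical sketch: project a CLIP image embedding into the
    CLIP text embedding space via a bank of stored text embeddings."""
    def __init__(self, dim=512, memory_size=1000):
        super().__init__()
        # stand-in for text embeddings collected during text-only training
        self.register_buffer("text_memory", torch.randn(memory_size, dim))

    def forward(self, image_emb):              # (B, dim), L2-normalized
        sims = image_emb @ self.text_memory.T  # similarity to each text
        weights = sims.softmax(dim=-1)
        return weights @ self.text_memory      # result lives in text space
```

Because the output is a convex combination of genuine text embeddings, a decoder trained only on text can consume it without ever seeing an image during training.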
- CLIP-Diffusion-LM: Apply Diffusion Model on Image Captioning [0.0]
Inspired by the recent success of the denoising diffusion model on image synthesis tasks, we apply denoising diffusion probabilistic models to text generation in image captioning tasks.
We show that our CLIP-Diffusion-LM is capable of generating image captions using significantly fewer inference steps than autoregressive models.
arXiv Detail & Related papers (2022-10-10T10:55:53Z)
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [53.170767750244366]
Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models.
arXiv Detail & Related papers (2022-05-23T17:42:53Z)
- Semi-Autoregressive Image Captioning [153.9658053662605]
Current state-of-the-art approaches for image captioning typically decode in an autoregressive manner.
Non-autoregressive image captioning with continuous iterative refinement can achieve performance comparable to autoregressive counterparts with considerable acceleration.
We propose a novel two-stage framework, referred to as Semi-Autoregressive Image Captioning (SAIC) to make a better trade-off between performance and speed.
arXiv Detail & Related papers (2021-10-11T15:11:54Z)
- Length-Controllable Image Captioning [67.2079793803317]
Due to their autoregressive nature, the computational complexity of existing models increases linearly as the length of the generated captions grows.
We propose a simple length-level embedding to endow models with length controllability (see the sketch after this entry).
We further devise a non-autoregressive image captioning approach that can generate captions with a complexity independent of caption length.
arXiv Detail & Related papers (2020-07-19T03:40:51Z)
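A minimal, hypothetical sketch of a length-level embedding follows; the bucket boundaries and dimensions are illustrative assumptions, not the paper's configuration. The desired caption length is mapped to a discrete level, and that level's embedding is added to every token embedding so the decoder can condition on the target length.

```python
import torch
import torch.nn as nn

class LengthAwareEmbedding(nn.Module):
    """Hypothetical sketch: shift token embeddings by an embedding of
    the desired caption-length bucket."""
    def __init__(self, vocab_size=30522, dim=512, num_levels=4):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)
        self.level = nn.Embedding(num_levels, dim)
        # illustrative buckets: <=9, 10-14, 15-19, >=20 tokens
        self.register_buffer("bounds", torch.tensor([9, 14, 19]))

    def forward(self, token_ids, target_length):  # (B, T), (B,)
        lvl = torch.bucketize(target_length, self.bounds)
        return self.tok(token_ids) + self.level(lvl).unsqueeze(1)
```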
- Non-Autoregressive Image Captioning with Counterfactuals-Critical Multi-Agent Learning [46.060954649681385]
We propose a Non-Autoregressive Image Captioning model with a novel training paradigm: Counterfactuals-critical Multi-Agent Learning (CMAL).
Our NAIC model achieves performance comparable to state-of-the-art autoregressive models while bringing a 13.9x decoding speedup.
arXiv Detail & Related papers (2020-05-10T15:09:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.