CLIP-Diffusion-LM: Apply Diffusion Model on Image Captioning
- URL: http://arxiv.org/abs/2210.04559v1
- Date: Mon, 10 Oct 2022 10:55:53 GMT
- Title: CLIP-Diffusion-LM: Apply Diffusion Model on Image Captioning
- Authors: Shitong Xu
- Abstract summary: Inspired by the recent success of the denoising diffusion model on image synthesis tasks, we apply denoising diffusion probabilistic models to text generation in image captioning tasks.
We show that our CLIP-Diffusion-LM is capable of generating image captions using significantly fewer inference steps than autoregressive models.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The image captioning task has been extensively researched in previous work.
However, few experiments focus on generating captions with a
non-autoregressive text decoder. Inspired by the recent success of the
denoising diffusion model on image synthesis tasks, we apply denoising
diffusion probabilistic models to text generation in image captioning tasks. We
show that our CLIP-Diffusion-LM is capable of generating image captions using
significantly fewer inference steps than autoregressive models. On the Flickr8k
dataset, the model achieves 0.1876 BLEU-4 score. By training on the combined
Flickr8k and Flickr30k dataset, our model achieves 0.2470 BLEU-4 score. Our
code is available at https://github.com/xu-shitong/diffusion-image-captioning.
Related papers
- Reverse Stable Diffusion: What prompt was used to generate this image? [73.10116197883303]
We study the task of predicting the prompt embedding given an image generated by a generative diffusion model.
We propose a novel learning framework comprising a joint prompt regression and multi-label vocabulary classification objective.
We conduct experiments on the DiffusionDB data set, predicting text prompts from images generated by Stable Diffusion.
arXiv Detail & Related papers (2023-08-02T23:39:29Z)
- Improving Multimodal Datasets with Image Captioning [65.74736570293622]
We study how generated captions can increase the utility of web-scraped datapoints with nondescript text.
Our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text.
arXiv Detail & Related papers (2023-07-19T17:47:12Z)
- Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models [52.93036326078229]
Off-the-shelf billion-scale datasets for image generation are available, but collecting similar video data of the same scale is still challenging.
In this work, we explore finetuning a pretrained image diffusion model with video data as a practical solution for the video synthesis task.
Our model, Preserve Your Own Correlation (PYoCo), attains SOTA zero-shot text-to-video results on the UCF-101 and MSR-VTT benchmarks.
arXiv Detail & Related papers (2023-05-17T17:59:16Z)
- Exploring Discrete Diffusion Models for Image Captioning [104.69608826164216]
We present a diffusion-based captioning model, dubbed DDCap, to allow more decoding flexibility.
We propose several key techniques including best-first inference, concentrated attention mask, text length prediction, and image-free training.
With 4M vision-language pre-training images and the base-sized model, we reach a CIDEr score of 125.1 on COCO.
arXiv Detail & Related papers (2022-11-21T18:12:53Z)
- What the DAAM: Interpreting Stable Diffusion Using Cross Attention [39.97805685586423]
Large-scale diffusion neural networks represent a substantial milestone in text-to-image generation.
They remain poorly understood, lacking explainability and interpretability analyses, largely due to their proprietary, closed-source nature.
We propose DAAM, a novel method based on upscaling and aggregating cross-attention activations in the latent denoising subnetwork.
We show that DAAM performs strongly on caption-generated images, achieving an mIoU of 61.0, and it outperforms supervised models on open-vocabulary segmentation.
arXiv Detail & Related papers (2022-10-10T17:55:41Z)
- On Distillation of Guided Diffusion Models [94.95228078141626]
We propose an approach to distilling classifier-free guided diffusion models into models that are fast to sample from.
For standard diffusion models trained in pixel space, our approach is able to generate images visually comparable to those of the original model.
For diffusion models trained on the latent-space (e.g., Stable Diffusion), our approach is able to generate high-fidelity images using as few as 1 to 4 denoising steps.
arXiv Detail & Related papers (2022-10-06T18:03:56Z)
- Improving Diffusion Model Efficiency Through Patching [0.0]
We find that adding a simple ViT-style patching transformation can considerably reduce a diffusion model's sampling time and memory usage.
We justify our approach both through an analysis of the diffusion model objective and through empirical experiments on LSUN Church, ImageNet 256, and FFHQ 1024.
arXiv Detail & Related papers (2022-07-09T18:21:32Z)
- Diffusion-Based Representation Learning [65.55681678004038]
We augment the denoising score matching framework to enable representation learning without any supervised signal.
In contrast, the introduced diffusion-based representation learning relies on a new formulation of the denoising score matching objective.
Using the same approach, we propose to learn an infinite-dimensional latent code that achieves improvements of state-of-the-art models on semi-supervised image classification.
arXiv Detail & Related papers (2021-05-29T09:26:02Z)
- Non-Autoregressive Image Captioning with Counterfactuals-Critical Multi-Agent Learning [46.060954649681385]
We propose a Non-Autoregressive Image Captioning model with a novel training paradigm: Counterfactuals-critical Multi-Agent Learning (CMAL).
Our NAIC model achieves performance comparable to state-of-the-art autoregressive models while bringing a 13.9x decoding speedup.
arXiv Detail & Related papers (2020-05-10T15:09:44Z)