DiffCap: Exploring Continuous Diffusion on Image Captioning
- URL: http://arxiv.org/abs/2305.12144v1
- Date: Sat, 20 May 2023 09:02:10 GMT
- Title: DiffCap: Exploring Continuous Diffusion on Image Captioning
- Authors: Yufeng He, Zefan Cai, Xu Gan, Baobao Chang
- Abstract summary: We propose DiffCap, a novel method that applies continuous diffusion to image captioning.
Our method transforms discrete tokens in a natural way and applies continuous diffusion to them, successfully fusing extracted image features.
Our experiments on the COCO dataset demonstrate that our method uses a much simpler structure to achieve results comparable to previous non-autoregressive works.
- Score: 16.572887005727555
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current image captioning works usually focus on generating descriptions in an autoregressive manner. However, few works focus on generating descriptions non-autoregressively, which brings more decoding diversity. Inspired by the success of diffusion models in generating natural-looking images, we propose DiffCap, a novel method that applies continuous diffusion to image captioning. Unlike image generation, where the output is fixed-size and continuous, image descriptions vary in length and consist of discrete tokens. Our method transforms discrete tokens in a natural way and applies continuous diffusion to them, successfully fusing extracted image features for diffusion caption generation. Our experiments on the COCO dataset demonstrate that our method uses a much simpler structure to achieve results comparable to previous non-autoregressive works. Apart from quality, an intriguing property of DiffCap is its high diversity during generation, which is missing from many autoregressive models. We believe our method of fusing multimodal features in diffusion language generation will inspire more research on multimodal language generation tasks, given its simplicity and decoding flexibility.
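The recipe the abstract sketches, embedding discrete tokens into a continuous space, noising the embeddings with a standard diffusion schedule, and denoising them with a transformer that also sees image features, can be illustrated roughly as follows. This is a minimal sketch of the general approach, not the authors' code; the architecture sizes, noise schedule, and module names are all placeholder assumptions.

```python
import torch
import torch.nn as nn

class DiffusionCaptioner(nn.Module):
    """Minimal sketch: continuous diffusion over token embeddings,
    conditioned on image features (placeholder architecture)."""

    def __init__(self, vocab_size=10000, dim=512, img_dim=768, steps=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)        # discrete -> continuous
        self.img_proj = nn.Linear(img_dim, dim)           # fuse image features
        self.time_embed = nn.Embedding(steps, dim)        # timestep conditioning
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.denoiser = nn.TransformerEncoder(layer, num_layers=6)
        self.steps = steps
        # linear noise schedule: alpha_bar[t] = prod_{s<=t} (1 - beta_s)
        beta = torch.linspace(1e-4, 0.02, steps)
        self.register_buffer("alpha_bar", torch.cumprod(1 - beta, dim=0))

    def q_sample(self, x0, t):
        """Forward process: noise clean embeddings x0 to step t."""
        a = self.alpha_bar[t].view(-1, 1, 1)
        return a.sqrt() * x0 + (1 - a).sqrt() * torch.randn_like(x0)

    def forward(self, tokens, img_feats):
        """One training step: predict clean embeddings from noisy ones.
        tokens: (B, L) ids; img_feats: (B, K, img_dim) extracted features."""
        x0 = self.embed(tokens)                                       # (B, L, dim)
        t = torch.randint(0, self.steps, (tokens.size(0),), device=tokens.device)
        xt = self.q_sample(x0, t) + self.time_embed(t).unsqueeze(1)
        ctx = self.img_proj(img_feats)                                # (B, K, dim)
        h = self.denoiser(torch.cat([ctx, xt], dim=1))                # fuse modalities
        x0_hat = h[:, ctx.size(1):]                                   # caption positions
        return nn.functional.mse_loss(x0_hat, x0)                     # embedding loss
```

At inference time one would start from Gaussian noise at the caption positions, run the reverse process step by step, and round each denoised embedding to its nearest vocabulary embedding to recover discrete tokens.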
Related papers
- Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling [49.41822427811098]
We present Kaleido, a novel approach that enhances the diversity of samples by incorporating autoregressive latent priors.
Kaleido integrates an autoregressive language model that encodes the original caption and generates latent variables.
We show that Kaleido adheres closely to the guidance provided by the generated latent variables, demonstrating its capability to effectively control and direct the image generation process.
arXiv Detail & Related papers (2024-05-31T17:41:11Z)
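Reading only the summary above, the mechanism is two-stage: an autoregressive model turns the caption into latent codes, and those codes then steer the diffusion sampler. A schematic sketch of such an autoregressive latent prior, where every name, size, and the GRU stand-in is hypothetical rather than Kaleido's actual design:

```python
import torch
import torch.nn as nn

class ARLatentPrior(nn.Module):
    """Sketch: an autoregressive prior reads the caption and emits discrete
    latent codes that would then condition the image diffusion sampler."""

    def __init__(self, vocab=10000, n_codes=512, n_latents=8, dim=256):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.code_emb = nn.Embedding(n_codes, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)   # stand-in for an AR LM
        self.head = nn.Linear(dim, n_codes)
        self.n_latents = n_latents

    @torch.no_grad()
    def sample(self, caption_tokens):
        _, h = self.rnn(self.tok(caption_tokens))       # encode the caption
        x = h.transpose(0, 1)                           # (B, 1, dim) start state
        latents = []
        for _ in range(self.n_latents):                 # decode codes one by one
            out, h = self.rnn(x, h)
            probs = self.head(out[:, -1]).softmax(-1)
            code = torch.multinomial(probs, 1).squeeze(-1)  # sample for diversity
            latents.append(code)
            x = self.code_emb(code).unsqueeze(1)
        return torch.stack(latents, dim=1)              # (B, n_latents) codes
```

Sampling different code sequences for the same caption is what would produce diverse images.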
- Improving Diffusion-Based Image Synthesis with Context Prediction [49.186366441954846]
Existing diffusion models mainly try to reconstruct the input image from a corrupted one with a pixel-wise or feature-wise constraint along spatial axes.
We propose ConPreDiff to improve diffusion-based image synthesis with context prediction.
Our ConPreDiff consistently outperforms previous methods and achieves new state-of-the-art text-to-image generation results on MS-COCO, with a zero-shot FID score of 6.21.
arXiv Detail & Related papers (2024-01-04T01:10:56Z)
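The summary contrasts point-wise reconstruction with predicting spatial context. One hypothetical way to render such an auxiliary objective, where each spatial position also predicts the features of its four neighbors, is sketched below; the paper's actual formulation may differ substantially:

```python
import torch
import torch.nn.functional as F

def context_prediction_loss(feats, pred_head):
    """Sketch: each position predicts its 4-neighborhood features in
    addition to the usual point-wise reconstruction (hypothetical rule).
    feats: (B, C, H, W); pred_head: position-wise map from C to 4*C."""
    B, C, H, W = feats.shape
    preds = pred_head(feats.permute(0, 2, 3, 1)).view(B, H, W, 4, C)
    shifts = [(0, 1), (0, -1), (1, 0), (-1, 0)]        # right, left, down, up
    loss = 0.0
    for i, (dy, dx) in enumerate(shifts):
        target = torch.roll(feats, shifts=(dy, dx), dims=(2, 3))
        loss = loss + F.mse_loss(preds[:, :, :, i].permute(0, 3, 1, 2), target)
    return loss / len(shifts)
```

Here `pred_head` would be a position-wise layer such as `torch.nn.Linear(C, 4 * C)`, added on top of the usual denoising loss.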
- Prefix-diffusion: A Lightweight Diffusion Model for Diverse Image Captioning [36.4086473737433]
We propose a lightweight image captioning network in combination with continuous diffusion, called Prefix-diffusion.
To achieve diversity, we design an efficient method that injects prefix image embeddings into the denoising process of the diffusion model.
In order to reduce trainable parameters, we employ a pre-trained model to extract image features and further design an extra mapping network.
arXiv Detail & Related papers (2023-09-10T08:55:24Z)
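The summary names two ingredients: a frozen pre-trained image encoder with a small mapping network, and prefix image embeddings injected into the denoising process. A compact sketch of that wiring, with all names and dimensions assumed for illustration:

```python
import torch
import torch.nn as nn

class PrefixDenoiser(nn.Module):
    """Sketch: frozen image encoder -> small mapping network -> prefix
    embeddings prepended to the noisy caption sequence (names hypothetical)."""

    def __init__(self, dim=512, img_dim=768, prefix_len=4):
        super().__init__()
        self.mapper = nn.Sequential(                   # the small trained part
            nn.Linear(img_dim, dim * prefix_len), nn.Tanh())
        self.prefix_len = prefix_len
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.denoiser = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, noisy_caption_emb, img_feat):
        # img_feat: (B, img_dim) pooled output of a frozen pre-trained encoder
        B = img_feat.size(0)
        prefix = self.mapper(img_feat).view(B, self.prefix_len, -1)
        h = self.denoiser(torch.cat([prefix, noisy_caption_emb], dim=1))
        return h[:, self.prefix_len:]                  # denoised caption part
```

Keeping the image encoder frozen and training only the mapper plus denoiser is what keeps the trainable parameter count small.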
- MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask [84.84034179136458]
A crucial factor leading to the text-image mismatch issue is the inadequate cross-modality relation learning.
We propose an adaptive mask, which is conditioned on the attention maps and the prompt embeddings, to dynamically adjust the contribution of each text token to the image features.
Our method, termed MaskDiffusion, is training-free and hot-pluggable for popular pre-trained diffusion models.
arXiv Detail & Related papers (2023-09-08T15:53:37Z)
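Since the method is described as training-free, the adaptive mask must be computable from quantities already available at sampling time: the cross-attention maps and the prompt embeddings. The sketch below shows one illustrative way such a mask could rescale per-token attention; the concrete conditioning rule here is an assumption, not the paper's formula:

```python
import torch

def adaptive_token_mask(attn_maps, prompt_emb, temperature=1.0):
    """Sketch: derive a per-token mask from cross-attention maps and prompt
    embeddings, then rescale each token's contribution (illustrative rule).
    attn_maps: (B, heads, pixels, tokens); prompt_emb: (B, tokens, dim)."""
    saliency = attn_maps.mean(dim=(1, 2))              # (B, tokens) avg attention
    scale = prompt_emb.norm(dim=-1)                    # (B, tokens) embedding norm
    mask = torch.softmax(saliency * scale / temperature, dim=-1)
    # multiply each token's attention column by its mask weight
    return attn_maps * mask.unsqueeze(1).unsqueeze(2)
```

Because the mask is recomputed from the live attention maps at every denoising step, it can be dropped into a pre-trained diffusion model without any retraining, matching the "hot-pluggable" claim.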
- AR-Diffusion: Auto-Regressive Diffusion Model for Text Generation [138.98095392584693]
We introduce Auto-Regressive Diffusion (AR-Diffusion) to account for the inherent sequential characteristic of natural language.
AR-Diffusion ensures that the generation of tokens on the right depends on the generated ones on the left, a mechanism achieved through employing a dynamic number of denoising steps.
In a series of experiments on various text generation tasks, AR-Diffusion clearly demonstrated its superiority over existing diffusion language models.
arXiv Detail & Related papers (2023-05-16T15:10:22Z)
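The left-to-right dependency with a dynamic number of denoising steps can be pictured as a per-position timestep schedule: at any point during sampling, positions further left sit at lower noise levels and are therefore effectively generated first. A rough sketch of such a schedule (the paper's actual scheduling rule differs in its details):

```python
import torch

def position_timesteps(stage, seq_len, total_steps=1000, speed=2.0):
    """Sketch: map one global sampling stage to per-position timesteps,
    so left tokens are denoised earlier than right ones (illustrative)."""
    # stage counts down from total_steps - 1 to 0 over the course of sampling
    pos = torch.arange(seq_len, dtype=torch.float32)
    # each position lags behind the global stage proportionally to its index
    t = stage + speed * pos * total_steps / seq_len
    return t.clamp(0, total_steps - 1).long()          # (seq_len,) timesteps

# example: midway through sampling, left positions are already nearly clean
print(position_timesteps(stage=200, seq_len=8))
```

The denoiser is then queried with a different timestep per position, which is what lets right-hand tokens condition on already-denoised left-hand ones.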
- Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences.
To be more specific, both input texts and images are encoded into one unified multi-modal latent space.
Our method is able to generate high-quality images with complex semantics drawn from both the input texts and images.
arXiv Detail & Related papers (2023-03-16T13:50:20Z)
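The summary's key point is that texts and subject images are encoded into one unified conditioning sequence rather than handled by separate pathways. A minimal sketch of that idea, with type embeddings distinguishing the two modalities; all modules and sizes are assumptions:

```python
import torch
import torch.nn as nn

class UnifiedMultiModalEncoder(nn.Module):
    """Sketch: encode text tokens and subject-image features into one shared
    conditioning sequence for a diffusion model (names hypothetical)."""

    def __init__(self, vocab=10000, img_dim=768, dim=512):
        super().__init__()
        self.txt = nn.Embedding(vocab, dim)
        self.img = nn.Linear(img_dim, dim)             # project image features
        self.type_emb = nn.Embedding(2, dim)           # 0 = text, 1 = image
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.fuse = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, text_tokens, img_feats):
        # text_tokens: (B, L) ids; img_feats: (B, K, img_dim)
        t = self.txt(text_tokens) + self.type_emb.weight[0]
        v = self.img(img_feats) + self.type_emb.weight[1]
        return self.fuse(torch.cat([t, v], dim=1))     # unified latent sequence
```

The diffusion model would then cross-attend to this single sequence, so subject identity and text semantics are blended in one latent space.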
- Semantic-Conditional Diffusion Networks for Image Captioning [116.86677915812508]
We propose a new diffusion-model-based paradigm tailored for image captioning, namely Semantic-Conditional Diffusion Networks (SCD-Net).
In SCD-Net, multiple Diffusion Transformer structures are stacked to progressively strengthen the output sentence with better vision-language alignment and linguistic coherence.
Experiments on the COCO dataset demonstrate the promising potential of using diffusion models in the challenging image captioning task.
arXiv Detail & Related papers (2022-12-06T16:08:16Z)
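Stacking multiple Diffusion Transformer structures so that each stage refines the previous stage's sentence can be sketched as a simple cascade; the block below is an illustrative skeleton, not SCD-Net's architecture:

```python
import torch
import torch.nn as nn

class StackedRefiner(nn.Module):
    """Sketch: a cascade of denoising transformers, where each stage refines
    the caption embeddings produced by the stage before it (illustrative)."""

    def __init__(self, dim=512, n_stages=3):
        super().__init__()
        def block():
            layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=2)
        self.stages = nn.ModuleList(block() for _ in range(n_stages))

    def forward(self, noisy_emb, img_ctx):
        # noisy_emb: (B, L, dim) caption embeddings; img_ctx: (B, K, dim)
        out = noisy_emb
        for stage in self.stages:                      # progressively refine
            h = stage(torch.cat([img_ctx, out], dim=1))
            out = h[:, img_ctx.size(1):]               # keep caption positions
        return out
```

Each later stage sees a cleaner sentence, which is the intuition behind "progressively strengthening" alignment and coherence.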
- Length-Controllable Image Captioning [67.2079793803317]
Due to their autoregressive nature, the computational complexity of existing models increases linearly as the length of the generated captions grows.
We propose to use a simple length level embedding to endow captioning models with length controllability.
We further devise a non-autoregressive image captioning approach that can generate captions with a complexity independent of caption length.
arXiv Detail & Related papers (2020-07-19T03:40:51Z)
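A length level embedding is a small addition: bucket the desired caption length into a discrete level and add that level's embedding to the token embeddings, so the decoder is conditioned on the target length. A minimal sketch, with made-up bin edges:

```python
import torch
import torch.nn as nn

def length_level(caption_len, bins=(9, 14, 19, 25)):
    """Sketch: bucket a target caption length into a discrete level
    (the bin edges here are invented for illustration)."""
    for level, edge in enumerate(bins):
        if caption_len <= edge:
            return level
    return len(bins)

class LengthAwareEmbedding(nn.Module):
    """Adds a length-level embedding to token embeddings so the decoder
    can be steered toward a desired caption length."""

    def __init__(self, vocab=10000, dim=512, n_levels=5):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.level = nn.Embedding(n_levels, dim)

    def forward(self, tokens, target_lens):
        # tokens: (B, L) ids; target_lens: iterable of desired lengths
        lvl = torch.tensor([length_level(n) for n in target_lens],
                           device=tokens.device)
        return self.tok(tokens) + self.level(lvl).unsqueeze(1)
```

In a non-autoregressive decoder all positions are predicted in parallel, so conditioning on the level embedding controls length without the linear cost growth of autoregressive decoding.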
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.