Semantic-Conditional Diffusion Networks for Image Captioning
- URL: http://arxiv.org/abs/2212.03099v1
- Date: Tue, 6 Dec 2022 16:08:16 GMT
- Title: Semantic-Conditional Diffusion Networks for Image Captioning
- Authors: Jianjie Luo, Yehao Li, Yingwei Pan, Ting Yao, Jianlin Feng, Hongyang Chao and Tao Mei
- Abstract summary: We propose a new diffusion model based paradigm tailored for image captioning, namely Semantic-Conditional Diffusion Networks (SCD-Net)
In SCD-Net, multiple Diffusion Transformer structures are stacked to progressively strengthen the output sentence with better visual-language alignment and linguistic coherence.
Experiments on the COCO dataset demonstrate the promising potential of using diffusion models in the challenging image captioning task.
- Score: 116.86677915812508
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances on text-to-image generation have witnessed the rise of
diffusion models which act as powerful generative models. Nevertheless, it is
not trivial to exploit such latent variable models to capture the dependency
among discrete words and meanwhile pursue complex visual-language alignment in
image captioning. In this paper, we break the deeply rooted convention of
learning a Transformer-based encoder-decoder, and propose a new
diffusion-model-based paradigm tailored for image captioning, namely Semantic-Conditional
Diffusion Networks (SCD-Net). Technically, for each input image, we first
search for semantically relevant sentences via a cross-modal retrieval model to
convey comprehensive semantic information. The rich semantics are further
regarded as a semantic prior that triggers the learning of the Diffusion Transformer,
which produces the output sentence in a diffusion process. In SCD-Net, multiple
Diffusion Transformer structures are stacked to progressively strengthen the
output sentence with better visual-language alignment and linguistic
coherence in a cascaded manner. Furthermore, to stabilize the diffusion
process, a new self-critical sequence training strategy is designed to guide
the learning of SCD-Net with the knowledge of a standard autoregressive
Transformer model. Extensive experiments on the COCO dataset demonstrate the
promising potential of using diffusion models in the challenging image
captioning task. Source code is available at
\url{https://github.com/YehLi/xmodaler/tree/master/configs/image_caption/scdnet}.
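As a rough illustration of the cascaded, semantic-conditional decoding described in the abstract, the following is a minimal Python sketch. It is not the authors' implementation: all names (retrieve_semantic_words, denoise_step, run_stage, NUM_STAGES) are hypothetical placeholders, and random token filling stands in for a trained Diffusion Transformer conditioned on image features and the retrieved semantic prior.

```python
# Toy sketch of SCD-Net-style cascaded, semantic-conditional diffusion decoding.
# Hypothetical names; a real Diffusion Transformer predicts tokens in parallel
# conditioned on image features rather than sampling at random.
import random

VOCAB = ["a", "dog", "runs", "on", "the", "beach", "<mask>"]
MASK = "<mask>"
SENT_LEN = 6
NUM_STAGES = 2   # number of stacked Diffusion Transformer stages (cascade depth)
NUM_STEPS = 4    # reverse diffusion steps per stage


def retrieve_semantic_words(image_id):
    """Stand-in for cross-modal retrieval: words taken from sentences
    retrieved for the image serve as the semantic prior."""
    retrieved = {"img_001": ["dog", "beach", "runs"]}
    return retrieved.get(image_id, [])


def denoise_step(tokens, semantic_prior, step, total_steps):
    """Toy 'denoiser': progressively fills masked positions, preferring
    words from the semantic prior."""
    fill_ratio = (step + 1) / total_steps
    out = []
    for tok in tokens:
        if tok == MASK and random.random() < fill_ratio:
            use_prior = semantic_prior and random.random() < 0.7
            pool = semantic_prior if use_prior else VOCAB[:-1]
            out.append(random.choice(pool))
        else:
            out.append(tok)
    return out


def run_stage(init_tokens, semantic_prior):
    """One stage: iteratively refine the sentence over NUM_STEPS reverse steps."""
    tokens = list(init_tokens)
    for step in range(NUM_STEPS):
        tokens = denoise_step(tokens, semantic_prior, step, NUM_STEPS)
    return tokens


def scd_net_decode(image_id):
    """Cascaded decoding: each stage starts from the previous stage's output."""
    semantic_prior = retrieve_semantic_words(image_id)
    sentence = [MASK] * SENT_LEN
    for stage in range(NUM_STAGES):
        sentence = run_stage(sentence, semantic_prior)
        print(f"stage {stage + 1}: {' '.join(sentence)}")
    return " ".join(tok for tok in sentence if tok != MASK)


if __name__ == "__main__":
    random.seed(0)
    print("caption:", scd_net_decode("img_001"))
```

The cascade simply feeds each stage's output sentence into the next stage, mirroring the paper's idea of progressively refining alignment and coherence; the self-critical, autoregressive-teacher-guided training described in the abstract is omitted from this toy sketch.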
Related papers
- LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? [10.72249123249003]
We revisit diffusion models, highlighting their capacity for holistic context modeling and parallel decoding.
We introduce a novel architecture, LaDiC, which utilizes a split BERT to create a dedicated latent space for captions.
LaDiC achieves state-of-the-art performance for diffusion-based methods on the MS-COCO dataset with 38.2 BLEU@4 and 126.2 CIDEr.
arXiv Detail & Related papers (2024-04-16T17:47:16Z)
- Improving Diffusion-Based Image Synthesis with Context Prediction [49.186366441954846]
Existing diffusion models mainly try to reconstruct input image from a corrupted one with a pixel-wise or feature-wise constraint along spatial axes.
We propose ConPreDiff to improve diffusion-based image synthesis with context prediction.
Our ConPreDiff consistently outperforms previous methods and achieves new state-of-the-art text-to-image generation results on MS-COCO, with a zero-shot FID score of 6.21.
arXiv Detail & Related papers (2024-01-04T01:10:56Z)
- Prefix-diffusion: A Lightweight Diffusion Model for Diverse Image Captioning [36.4086473737433]
We propose a lightweight image captioning network in combination with continuous diffusion, called Prefix-diffusion.
To achieve diversity, we design an efficient method that injects prefix image embeddings into the denoising process of the diffusion model.
In order to reduce trainable parameters, we employ a pre-trained model to extract image features and further design an extra mapping network.
arXiv Detail & Related papers (2023-09-10T08:55:24Z)
- DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability [75.9781362556431]
We propose DiffDis to unify the cross-modal generative and discriminative pretraining into one single framework under the diffusion process.
We show that DiffDis outperforms single-task models on both the image generation and the image-text discriminative tasks.
arXiv Detail & Related papers (2023-08-18T05:03:48Z)
- Reverse Stable Diffusion: What prompt was used to generate this image? [73.10116197883303]
We study the task of predicting the prompt embedding given an image generated by a generative diffusion model.
We propose a novel learning framework comprising a joint prompt regression and multi-label vocabulary classification objective.
We conduct experiments on the DiffusionDB data set, predicting text prompts from images generated by Stable Diffusion.
arXiv Detail & Related papers (2023-08-02T23:39:29Z)
- Improving Diffusion-based Image Translation using Asymmetric Gradient Guidance [51.188396199083336]
We present an approach that guides the reverse process of diffusion sampling by applying asymmetric gradient guidance.
Our model's adaptability allows it to be implemented with both image- and latent-diffusion models.
Experiments show that our method outperforms various state-of-the-art models in image translation tasks.
arXiv Detail & Related papers (2023-06-07T12:56:56Z)
- SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models [56.88192537044364]
We propose a simple-yet-effective parameter-efficient fine-tuning approach called the Semantic Understanding and Reasoning adapter (SUR-adapter) for pre-trained diffusion models.
Our approach can make text-to-image diffusion models easier to use with better user experience.
arXiv Detail & Related papers (2023-05-09T05:48:38Z)
- Lossy Image Compression with Conditional Diffusion Models [25.158390422252097]
This paper outlines an end-to-end optimized lossy image compression framework using diffusion generative models.
In contrast to VAE-based neural compression, where the (mean) decoder is a deterministic neural network, our decoder is a conditional diffusion model.
Our approach yields stronger reported FID scores than the GAN-based model, while also yielding competitive performance with VAE-based models in several distortion metrics.
arXiv Detail & Related papers (2022-09-14T21:53:27Z)