Improving Diffusion-Based Image Synthesis with Context Prediction
- URL: http://arxiv.org/abs/2401.02015v1
- Date: Thu, 4 Jan 2024 01:10:56 GMT
- Title: Improving Diffusion-Based Image Synthesis with Context Prediction
- Authors: Ling Yang, Jingwei Liu, Shenda Hong, Zhilong Zhang, Zhilin Huang,
Zheming Cai, Wentao Zhang, Bin Cui
- Abstract summary: Existing diffusion models mainly try to reconstruct the input image from a corrupted one with a pixel-wise or feature-wise constraint along spatial axes.
We propose ConPreDiff to improve diffusion-based image synthesis with context prediction.
Our ConPreDiff consistently outperforms previous methods and achieves new SOTA text-to-image generation results on MS-COCO, with a zero-shot FID score of 6.21.
- Score: 49.186366441954846
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models are a new class of generative models, and have dramatically
promoted image generation with unprecedented quality and diversity. Existing
diffusion models mainly try to reconstruct the input image from a corrupted one
with a pixel-wise or feature-wise constraint along spatial axes. However, such
point-based reconstruction may fail to make each predicted pixel/feature fully
preserve its neighborhood context, impairing diffusion-based image synthesis.
As a powerful source of automatic supervisory signal, context has been well
studied for learning representations. Inspired by this, we for the first time
propose ConPreDiff to improve diffusion-based image synthesis with context
prediction. We explicitly reinforce each point to predict its neighborhood
context (i.e., multi-stride features/tokens/pixels) with a context decoder at
the end of the diffusion denoising blocks in the training stage, and remove the decoder
for inference. In this way, each point can better reconstruct itself by
preserving its semantic connections with neighborhood context. This new
paradigm of ConPreDiff can generalize to arbitrary discrete and continuous
diffusion backbones without introducing extra parameters in the sampling procedure.
Extensive experiments are conducted on unconditional image generation,
text-to-image generation and image inpainting tasks. Our ConPreDiff
consistently outperforms previous methods and achieves new SOTA text-to-image
generation results on MS-COCO, with a zero-shot FID score of 6.21.
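The training-time mechanism described above (an auxiliary context decoder attached after the last denoising block, trained to predict each point's multi-stride neighborhood, and discarded at inference) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tiny backbone, the toy corruption, and the simple MSE neighborhood loss are assumptions made for clarity, and the paper's actual objective and architecture may differ.

```python
# Minimal sketch (illustrative, not the released ConPreDiff code) of a denoiser
# with an auxiliary neighborhood-context decoder that is used only during training.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDenoiser(nn.Module):
    """Stand-in for a diffusion denoising backbone (e.g., a U-Net)."""
    def __init__(self, channels=3, width=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, width, 3, padding=1), nn.SiLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.SiLU(),
        )
        self.to_eps = nn.Conv2d(width, channels, 3, padding=1)  # predicts the noise

    def forward(self, x_t):
        h = self.body(x_t)            # features of the last denoising block
        return self.to_eps(h), h

class ContextDecoder(nn.Module):
    """Auxiliary head predicting axis-aligned neighbors at several strides.
    It is dropped at sampling time, so inference cost is unchanged."""
    def __init__(self, width=64, strides=(1, 2, 4)):
        super().__init__()
        self.strides = strides
        self.heads = nn.ModuleList(
            nn.Conv2d(width, width, 1) for _ in range(4 * len(strides))
        )

    def forward(self, h):
        return [head(h) for head in self.heads]

def neighborhood_targets(h, strides):
    """Ground-truth context: the feature map shifted by +/- s along H and W."""
    targets = []
    for s in strides:
        for shift in ((s, 0), (-s, 0), (0, s), (0, -s)):
            targets.append(torch.roll(h, shifts=shift, dims=(2, 3)))
    return targets

def training_step(denoiser, ctx_decoder, x0, lambda_ctx=0.1):
    noise = torch.randn_like(x0)
    x_t = x0 + noise                  # toy corruption; a real model follows the DDPM schedule
    eps_pred, h = denoiser(x_t)
    denoise_loss = F.mse_loss(eps_pred, noise)
    preds = ctx_decoder(h)
    targets = neighborhood_targets(h.detach(), ctx_decoder.strides)
    ctx_loss = sum(F.mse_loss(p, t) for p, t in zip(preds, targets)) / len(preds)
    return denoise_loss + lambda_ctx * ctx_loss

denoiser, ctx_decoder = TinyDenoiser(), ContextDecoder()
loss = training_step(denoiser, ctx_decoder, torch.randn(2, 3, 32, 32))
loss.backward()                       # only `denoiser` is kept for sampling
```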
Related papers
- DiffHarmony: Latent Diffusion Model Meets Image Harmonization [11.500358677234939]
Diffusion models have promoted the rapid development of image-to-image translation tasks.
Training latent diffusion models from scratch is computationally intensive.
In this paper, we adapt a pre-trained latent diffusion model to the image harmonization task to generate harmonious but potentially blurry initial images.
arXiv Detail & Related papers (2024-04-09T09:05:23Z)
- DiffuseMix: Label-Preserving Data Augmentation with Diffusion Models [18.44432223381586]
Recently, a number of image-mixing-based augmentation techniques have been introduced to improve the generalization of deep neural networks.
In these techniques, two or more randomly selected natural images are mixed together to generate an augmented image.
We propose DiffuseMix, a novel data augmentation technique that leverages a diffusion model to reshape training images.
arXiv Detail & Related papers (2024-04-05T05:31:02Z)
- Steered Diffusion: A Generalized Framework for Plug-and-Play Conditional
Image Synthesis [62.07413805483241]
Steered Diffusion is a framework for zero-shot conditional image generation using a diffusion model trained for unconditional generation.
We present experiments using steered diffusion on several tasks, including inpainting, colorization, text-guided semantic editing, and image super-resolution (a rough guidance sketch follows this list).
arXiv Detail & Related papers (2023-09-30T02:03:22Z)
- DiffDis: Empowering Generative Diffusion Model with Cross-Modal
Discrimination Capability [75.9781362556431]
We propose DiffDis to unify the cross-modal generative and discriminative pretraining into one single framework under the diffusion process.
We show that DiffDis outperforms single-task models on both the image generation and the image-text discriminative tasks.
arXiv Detail & Related papers (2023-08-18T05:03:48Z)
- VideoFusion: Decomposed Diffusion Models for High-Quality Video
Generation [88.49030739715701]
This work presents a decomposed diffusion process by resolving the per-frame noise into a base noise that is shared among all frames and a residual noise that varies along the time axis (a rough sketch of this decomposition follows this list).
Experiments on various datasets confirm that our approach, termed VideoFusion, surpasses both GAN-based and diffusion-based alternatives in high-quality video generation.
arXiv Detail & Related papers (2023-03-15T02:16:39Z)
- ADIR: Adaptive Diffusion for Image Reconstruction [46.838084286784195]
We propose a conditional sampling scheme that exploits the prior learned by diffusion models.
We then combine it with a novel approach for adapting pretrained diffusion denoising networks to their input.
We show that our proposed 'adaptive diffusion for image reconstruction' approach achieves a significant improvement on super-resolution, deblurring, and text-based editing tasks.
arXiv Detail & Related papers (2022-12-06T18:39:58Z)
- Semantic-Conditional Diffusion Networks for Image Captioning [116.86677915812508]
We propose a new diffusion-model-based paradigm tailored for image captioning, namely Semantic-Conditional Diffusion Networks (SCD-Net).
In SCD-Net, multiple Diffusion Transformer structures are stacked to progressively strengthen the output sentence with better visual-language alignment and linguistic coherence.
Experiments on COCO dataset demonstrate the promising potential of using diffusion models in the challenging image captioning task.
arXiv Detail & Related papers (2022-12-06T16:08:16Z)
- Lossy Image Compression with Conditional Diffusion Models [25.158390422252097]
This paper outlines an end-to-end optimized lossy image compression framework using diffusion generative models.
In contrast to VAE-based neural compression, where the (mean) decoder is a deterministic neural network, our decoder is a conditional diffusion model.
Our approach yields stronger reported FID scores than the GAN-based model, while also yielding competitive performance with VAE-based models in several distortion metrics.
arXiv Detail & Related papers (2022-09-14T21:53:27Z)
- Semantic Image Synthesis via Diffusion Models [159.4285444680301]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks.
Recent work on semantic image synthesis mainly follows the de facto Generative Adversarial Nets (GANs).
arXiv Detail & Related papers (2022-06-30T18:31:51Z)
- ImageBART: Bidirectional Context with Multinomial Diffusion for
Autoregressive Image Synthesis [15.006676130258372]
Autoregressive models incorporate context in a linear 1D order by attending only to previously synthesized image patches above or to the left.
We propose a coarse-to-fine hierarchy of context by combining the autoregressive formulation with a multinomial diffusion process.
Our approach can take unrestricted, user-provided masks into account to perform local image editing.
arXiv Detail & Related papers (2021-08-19T17:50:07Z)
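For the Steered Diffusion entry above, a rough sketch of the generic plug-and-play idea (steering an unconditional sampler with the gradient of a task loss computed on the current denoised estimate) is given below. The function names, the toy denoiser, and the fixed guidance scale are assumptions made for illustration; this is not the paper's actual algorithm.

```python
# Hedged sketch of gradient-based steering at sampling time (illustrative only).
import torch

def steered_update(x_t, denoiser, task_loss, guidance_scale=1.0):
    """Nudge x_t so the denoised estimate better satisfies the task constraint.
    A real sampler would combine this steering term with its usual DDPM/DDIM update."""
    x_t = x_t.detach().requires_grad_(True)
    x0_est = denoiser(x_t)                                  # unconditional clean-image estimate
    grad = torch.autograd.grad(task_loss(x0_est), x_t)[0]   # gradient of the constraint w.r.t. x_t
    return (x_t - guidance_scale * grad).detach()

# Toy usage: the "denoiser" is a placeholder, and the task asks the top half of the
# image to match a reference (a crude stand-in for inpainting-style conditioning).
reference = torch.zeros(1, 3, 32, 32)
denoiser = lambda x: 0.9 * x
task_loss = lambda x0: ((x0[:, :, :16] - reference[:, :, :16]) ** 2).mean()
x = torch.randn(1, 3, 32, 32)
for _ in range(10):
    x = steered_update(x, denoiser, task_loss)
```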
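Similarly, the noise decomposition mentioned in the VideoFusion entry can be illustrated in a few lines; the mixing weight name and the variance-preserving combination below are assumptions for this sketch, not necessarily the paper's exact parameterization.

```python
# Hedged sketch: every frame shares one base noise plus a per-frame residual noise.
import torch

def decomposed_noise(num_frames, shape, sqrt_lambda=0.8):
    """Variance-preserving mix of a shared base noise and per-frame residual noise."""
    base = torch.randn(1, *shape).expand(num_frames, *shape)   # shared across all frames
    residual = torch.randn(num_frames, *shape)                 # varies along the time axis
    return sqrt_lambda * base + (1 - sqrt_lambda ** 2) ** 0.5 * residual

noise = decomposed_noise(num_frames=16, shape=(3, 64, 64))
print(noise.shape)  # torch.Size([16, 3, 64, 64])
```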